Strange issue

For the second time in 2 weeks, I have an issue with phones dropping off. They show dropped on the dashboard, the endpoint shows “No Service”. If I dial an internal extension from them, it goes to VM. I can dial out. The condition will go on for a few hours of phones randomly dropping and then settle.

FreePBX 13.0.105. Endpoints are Mitel 6867i, FW 4.1.0.128

What version of asterisk are you running?

Asterisk version 11.19.0. Endpoints are using chan_sip.

No known issues with that asterisk version. Can you ping the IP address of the phones when they’re unreachable? Sounds like a networking issue.

Edit: Wups, there MAY have been issues in asterisk 11.19. Have you tried upgrading?

Went to FreePBX build 6.12.65-32 last night.Asterisk 11.21.2. I’ll see if the problem goes away.

Problem did not go away. And I can ping the endpoint from my workstation while the PBX reports it offline.

Check your logs and see why the unit is appearing to be “offline” from the server.

There is going to be something in there about the unit becoming unreachable or critical packets are not getting responded to. I know there are at least a dozen things that can make this happen, but without access to the pertinent part of the logs, it’s going to be guesswork, and that is usually just a waste of everyone’s (including your) time.

Can you point me to the right logs?

grep -iE “reachable|critical” /var/log/asterisk/full*

1 Like

This all I get for that time frame.
[2016-04-19 16:40:28] NOTICE[2347] chan_sip.c: Peer ‘2750’ is now Reachable. (16ms / 2000ms)
[2016-04-19 16:48:01] WARNING[2347] chan_sip.c: Retransmission timeout reached on transmission [email protected] for seqno 1 (Critical Response) – See https://wiki.asterisk.org/wiki/display/AST/SIP+Retransmissions
[2016-04-19 16:48:01] WARNING[2347] chan_sip.c: Hanging up call [email protected] - no reply to our critical packet (see https://wiki.asterisk.org/wiki/display/AST/SIP+Retransmissions).
[2016-04-19 16:48:09] NOTICE[2347] chan_sip.c: Peer ‘2707’ is now Reachable. (17ms / 2000ms)
[2016-04-19 16:49:29] NOTICE[2347] chan_sip.c: Peer ‘2001’ is now UNREACHABLE! Last qualify: 69
[2016-04-19 16:49:39] NOTICE[2347] chan_sip.c: Peer ‘2001’ is now Reachable. (15ms / 2000ms)
[2016-04-19 16:51:28] NOTICE[2347] chan_sip.c: Peer ‘2718’ is now Reachable. (46ms / 2000ms)
[2016-04-19 16:51:28] NOTICE[2347] chan_sip.c: Peer ‘2710’ is now Reachable. (45ms / 2000ms)
[2016-04-19 16:51:28] NOTICE[2347] chan_sip.c: Peer ‘2715’ is now Reachable. (277ms / 2000ms)
[2016-04-19 16:51:28] NOTICE[2347] chan_sip.c: Peer ‘2000’ is now Reachable. (287ms / 2000ms)
[2016-04-19 16:51:34] NOTICE[2347] chan_sip.c: Peer ‘2707’ is now Reachable. (118ms / 2000ms)
[2016-04-19 16:51:34] NOTICE[2347] chan_sip.c: Peer ‘2704’ is now Reachable. (120ms / 2000ms)
[2016-04-19 16:52:27] NOTICE[2347] chan_sip.c: Peer ‘2732’ is now Reachable. (39ms / 2000ms)
[2016-04-19 16:56:40] NOTICE[2347] chan_sip.c: Peer ‘2808’ is now Reachable. (14ms / 2000ms)

Those messages popup pretty often.

That would indicate a network problem.

This one (from the ones you posted) specifically points to a network issue. The fact that you are getting REACHABLE messages means that all of those were UNREACHABLE on the last check.[quote=“andersonhaulage, post:10, topic:34415”]
[2016-04-19 16:48:01] WARNING[2347] chan_sip.c: Hanging up call [email protected] - no reply to our critical packet (see Home - Asterisk Documentation).
[/quote]

This one even includes a URL to read up on why there are problems.

And none of the possible issues seem to apply. I can also ping the endpoints when they enter the unreachable state. It’s also random enough that it will be a bitch to figure out.

You can ping them, but they are not responding to the “are you there” packets Asterisk is sending.

It has GOT to be a network issue, unless it’s a strange firmware problem in your phones.

They’re all SIP Phones. You’re not using PJ-SIP to connect to them. You are using Chan-SIP.

See if you can find some log entries where the phones are becoming unreachable. I believe it’s got to be something systemic - something in your network is causing this problem. Another important thing to remember is that the phone may on re-register right away, so “couple second” network outage could cause the phone to drop and not reconnect for some relatively large period of time. Use that ‘grep’ command but search for UNREACHABLE and see what pops up.

One final stupid question - these are all local phones, right? Nothing on the Internet, no weird hosted, cloud server issues? Everything is connected to the same network switch in your frame room, right?

Not stupid question at all. All local.

[2016-04-19 05:09:50] NOTICE[2347] chan_sip.c: Peer ‘2715’ is now UNREACHABLE! Last qualify: 274
[2016-04-19 07:00:18] NOTICE[2347] chan_sip.c: Peer ‘2743’ is now UNREACHABLE! Last qualify: 46
[2016-04-19 13:02:35] NOTICE[2347] chan_sip.c: Peer ‘2718’ is now UNREACHABLE! Last qualify: 103
[2016-04-19 13:02:35] NOTICE[2347] chan_sip.c: Peer ‘2715’ is now UNREACHABLE! Last qualify: 130
[2016-04-19 16:31:29] NOTICE[2347] chan_sip.c: Peer ‘2704’ is now UNREACHABLE! Last qualify: 138
[2016-04-19 16:49:29] NOTICE[2347] chan_sip.c: Peer ‘2001’ is now UNREACHABLE! Last qualify: 69
[2016-04-20 00:04:31] NOTICE[2347] chan_sip.c: Peer ‘2706’ is now UNREACHABLE! Last qualify: 181
[2016-04-20 00:04:31] NOTICE[2347] chan_sip.c: Peer ‘allstream’ is now UNREACHABLE! Last qualify: 27
[2016-04-20 00:04:31] NOTICE[2347] chan_sip.c: Peer ‘2746’ is now UNREACHABLE! Last qualify: 27
[2016-04-20 00:04:31] NOTICE[2347] chan_sip.c: Peer ‘2744’ is now UNREACHABLE! Last qualify: 19
[2016-04-20 00:05:58] NOTICE[2347] chan_sip.c: Peer ‘2732’ is now UNREACHABLE! Last qualify: 0
[2016-04-20 00:07:59] NOTICE[2347] chan_sip.c: Peer ‘2716’ is now UNREACHABLE! Last qualify: 0
[2016-04-20 00:11:04] NOTICE[2347] chan_sip.c: Peer ‘2740’ is now UNREACHABLE! Last qualify: 0
[2016-04-20 00:11:04] NOTICE[2347] chan_sip.c: Peer ‘4454’ is now UNREACHABLE! Last qualify: 0
[2016-04-20 00:13:38] NOTICE[2347] chan_sip.c: Peer ‘2738’ is now UNREACHABLE! Last qualify: 0
[2016-04-20 00:13:38] NOTICE[2347] chan_sip.c: Peer ‘2740’ is now UNREACHABLE! Last qualify: 0
[2016-04-20 00:16:46] NOTICE[2347] chan_sip.c: Peer ‘2729’ is now UNREACHABLE! Last qualify: 1069
[2016-04-20 00:18:35] NOTICE[2347] chan_sip.c: Peer ‘2723’ is now UNREACHABLE! Last qualify: 43
[2016-04-20 00:28:24] NOTICE[2347] chan_sip.c: Peer ‘2728’ is now UNREACHABLE! Last qualify: 22
[2016-04-20 00:33:11] NOTICE[2347] chan_sip.c: Peer ‘2712’ is now UNREACHABLE! Last qualify: 167
[2016-04-20 00:34:13] NOTICE[2347] chan_sip.c: Peer ‘2750’ is now UNREACHABLE! Last qualify: 29
[2016-04-20 00:36:13] NOTICE[2347] chan_sip.c: Peer ‘4453’ is now UNREACHABLE! Last qualify: 0
[2016-04-20 00:36:13] NOTICE[2347] chan_sip.c: Peer ‘2734’ is now UNREACHABLE! Last qualify: 0
[2016-04-20 00:36:13] NOTICE[2347] chan_sip.c: Peer ‘2705’ is now UNREACHABLE! Last qualify: 16
[2016-04-20 00:38:53] NOTICE[2347] chan_sip.c: Peer ‘2726’ is now UNREACHABLE! Last qualify: 57
[2016-04-20 00:46:50] NOTICE[2347] chan_sip.c: Peer ‘2714’ is now UNREACHABLE! Last qualify: 101
[2016-04-20 00:52:28] NOTICE[2347] chan_sip.c: Peer ‘4453’ is now UNREACHABLE! Last qualify: 41
[2016-04-20 00:55:21] NOTICE[2347] chan_sip.c: Peer ‘2808’ is now UNREACHABLE! Last qualify: 0
[2016-04-20 00:56:30] NOTICE[2347] chan_sip.c: Peer ‘2732’ is now UNREACHABLE! Last qualify: 0
[2016-04-20 01:02:59] NOTICE[2347] chan_sip.c: Peer ‘2714’ is now UNREACHABLE! Last qualify: 17
[2016-04-20 01:11:04] NOTICE[2347] chan_sip.c: Peer ‘4454’ is now UNREACHABLE! Last qualify: 32
[2016-04-20 01:54:20] NOTICE[2347] chan_sip.c: Peer ‘2720’ is now UNREACHABLE! Last qualify: 108

Really weird part is the time. Nothing should be happening at that time. I was also expecting to see UNREACHABLE in the same quantity and times as the Now REACHABLE messages. Not the case. Not even close.

Are you sure you grepped “reachable” with the case insensitive switch (-i)?

These are all of the unreachables - that was what I was hoping to see.

My train of thought is, once we know why the phones are dropping off, we can look into that and figure out what is going on.

Over on one of the PBX_In_A-Flash forums (which is Asterisk and FreePBX based) there was a thread about this particular part of the output above.

[Peers] become Unreachable when they stop responding within the allotted time to the SIP OPTIONS command. Is the trunk really talking to [the destination]? They have terminated all business.

A couple of things to try:

  1. Change QUALIFY=.[whatever]

There are three options - YES, NO, and [time]. If you are currently using “Yes”, change it to “No”. If you are using “No”, change it to 120.

  1. Are you having any DNS issues, as certainly in some versions of Asterisk, a lack of DNS resolution can cause these effects.

  2. If the problem is during times of inactivity, it could be firewall interaction on the server. Check your system’s firewall status and remember that FreePBX 13 now includes a local firewall.

No DNS issue.
Is there a way to change qualify for all extensions at the same time?
How do I make sure the FreePBX firewall is disabled?

I’m still baffled. I get some endpoints that come up and show lagged at over 3000ms. If I ping them i’m sub 1ms. I don’t think it’s a general network issue.

The PBX is running under ESXi, has been V2V from oVirt. It’s on it’s own vSwitch with a dedicated NIC. 2 CPUs, 2GB RAM. 43 endpoints. I do find the web interface a bit sluggish. RAM is running at greater than 80%. Should I give it more RAM?

It’s a subtle point, but remember that “ping” uses ICMP as it’s protocol for testing. The problem you are seeing is in TCP. I’ve seen lots of cases where a device can ping on ICMP and be otherwise dead.

More memory might make a difference, but this problem screams network issue I’m thinking it’s something really strange in your network topology. Have you double checked to make sure the network isn’t feeding back on itself?

I’m really baffled.