Connected Peer Changes

Hi all.

I made this post/log monitor a while ago to monitor my Asterisk logs via email:

Since then, the rest of my team has been wondering if we’re chasing ghosts. We get disconnected peers every day, but people aren’t complaining about poor audio quality or dropped calls or anything, and when we run UDP tests over the network we get perfect results. The peers that get disconnected seem random… there’s no common hardware involved except “all the switches.” And nobody complains about it; sometimes it’s in the middle of the night.

So if I’m seeing these errors, how worried should I be? Should I be pushing my network guys to figure out what’s going wrong, or does this sort of thing just happen to everyone?

[2016-09-23 10:02:47] NOTICE[2258] chan_sip.c: Peer '138' is now UNREACHABLE!  Last qualify: 85
[2016-09-23 10:02:47] NOTICE[2258] chan_sip.c: Peer '121' is now UNREACHABLE!  Last qualify: 35
[2016-09-23 10:02:47] NOTICE[2258] chan_sip.c: Peer '126' is now UNREACHABLE!  Last qualify: 83
[2016-09-23 10:02:47] NOTICE[2258] chan_sip.c: Peer '130' is now UNREACHABLE!  Last qualify: 80
[2016-09-23 10:02:47] NOTICE[2258] chan_sip.c: Peer '117' is now UNREACHABLE!  Last qualify: 80
[2016-09-23 10:02:57] NOTICE[2258] chan_sip.c: Peer '121' is now Reachable. (12ms / 2000ms)
[2016-09-23 10:02:57] NOTICE[2258] chan_sip.c: Peer '117' is now Reachable. (74ms / 2000ms)
[2016-09-23 10:02:57] NOTICE[2258] chan_sip.c: Peer '126' is now Reachable. (75ms / 2000ms)
[2016-09-23 10:02:57] NOTICE[2258] chan_sip.c: Peer '138' is now Reachable. (82ms / 2000ms)
[2016-09-23 10:02:57] NOTICE[2258] chan_sip.c: Peer '130' is now Reachable. (96ms / 2000ms)
[2016-09-27 09:42:25] NOTICE[2258] chan_sip.c: Peer '119' is now Lagged. (3183ms / 2000ms)
[2016-09-27 09:42:36] NOTICE[2258] chan_sip.c: Peer '119' is now Reachable. (90ms / 2000ms)


Let’s focus on extension 119.

At 9:42, the ping time between the phone and the server jumped to over 3 seconds (3183 ms against the 2000 ms limit), and the phone was flagged as Lagged. Eleven seconds later it was fine again (with a 90 ms ping time).

<PEDANTIC MODE> We use the term “ping” as a generic reference to the activity the system uses to maintain the connection between your phones and your server. This activity doesn’t use the ICMP protocol; chan_sip does its qualify checks with SIP OPTIONS requests.</PEDANTIC MODE>

So, for some reason (phone, network, NIC, server, PoE injector, etc.), your system was unable to maintain the quality of connection required for the phone to stay “on” the system.

As I look through your list of entries, I notice roughly the same 10-second delay from disconnected back to connected.
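If you want to check that recovery delay across a whole log rather than by eye, here is a minimal Python sketch. It assumes the NOTICE lines look exactly like the ones posted above; the function name `recovery_delays` and the regex are my own illustration, not anything that ships with Asterisk.

```python
import re
from datetime import datetime

# Matches chan_sip qualify notices in the format shown in the posted log.
# (Assumption: your log lines follow this exact layout.)
LINE_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] NOTICE\[\d+\] chan_sip\.c: "
    r"Peer '(?P<peer>[^']+)' is now (?P<state>UNREACHABLE|Reachable|Lagged)"
)

def recovery_delays(lines):
    """Return (peer, down_state, seconds) for each down -> Reachable transition."""
    down = {}      # peer -> (timestamp, state) of the most recent down event
    results = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # ignore anything that is not a qualify notice
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        peer, state = m.group("peer"), m.group("state")
        if state == "Reachable":
            if peer in down:
                since, why = down.pop(peer)
                results.append((peer, why, int((ts - since).total_seconds())))
        else:
            down[peer] = (ts, state)
    return results

# Four lines copied from the log in the original post.
sample = [
    "[2016-09-23 10:02:47] NOTICE[2258] chan_sip.c: Peer '138' is now UNREACHABLE!  Last qualify: 85",
    "[2016-09-23 10:02:57] NOTICE[2258] chan_sip.c: Peer '138' is now Reachable. (82ms / 2000ms)",
    "[2016-09-27 09:42:25] NOTICE[2258] chan_sip.c: Peer '119' is now Lagged. (3183ms / 2000ms)",
    "[2016-09-27 09:42:36] NOTICE[2258] chan_sip.c: Peer '119' is now Reachable. (90ms / 2000ms)",
]

for peer, why, secs in recovery_delays(sample):
    print(f"Peer {peer}: {why} for {secs}s")
```

Running it over the full log file (one line per list entry) should show whether the delay really clusters around 10 seconds or whether some peers stay down much longer.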

If your network is not causing this, you might need to adjust some of your SIP settings. Unfortunately, I’m not certain which ones. You are using chan_sip for your connections, so it’s not one of the PJSIP “edge” cases we’ve seen from time to time. You may get some relief by looking back through the older posts about phones dropping their registrations to see whether you should, for example, reduce the requalify time (I’m not sure) so that the phones get queried for their status more often.

Since it’s not all the phones, it isn’t tied to specific hardware, and it appears to be random, I’m going to guess it’s a network issue. At least, that’s where I would start. The other thing that I’d look at is the qualify times you’re seeing. If these phones are all on the local network, I’d expect reachability numbers under 20 ms; you are regularly seeing numbers four to five times that.
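To see how often the qualify times sit above that 20 ms mark, something like the following Python sketch could be run against the log. The 20 ms baseline is just the rule of thumb from this post, `slow_qualifies` is a made-up helper name, and I am assuming the "Last qualify" figure is in milliseconds like the bracketed values.

```python
import re

# Pulls the qualify time out of either log format seen in the thread:
#   "... is now UNREACHABLE!  Last qualify: 85"
#   "... is now Reachable. (12ms / 2000ms)"
# (Assumption: "Last qualify" is reported in milliseconds.)
QUALIFY_RE = re.compile(
    r"Peer '(?P<peer>[^']+)' is now \w+[.!]\s+"
    r"(?:Last qualify: (?P<last>\d+)|\((?P<ms>\d+)ms / \d+ms\))"
)

def slow_qualifies(lines, baseline_ms=20):
    """Return (peer, ms) for every notice whose qualify time exceeds baseline_ms."""
    slow = []
    for line in lines:
        m = QUALIFY_RE.search(line)
        if not m:
            continue
        ms = int(m.group("last") or m.group("ms"))
        if ms > baseline_ms:
            slow.append((m.group("peer"), ms))
    return slow

# Lines copied from the log in the original post.
sample = [
    "[2016-09-23 10:02:47] NOTICE[2258] chan_sip.c: Peer '138' is now UNREACHABLE!  Last qualify: 85",
    "[2016-09-23 10:02:47] NOTICE[2258] chan_sip.c: Peer '121' is now UNREACHABLE!  Last qualify: 35",
    "[2016-09-23 10:02:57] NOTICE[2258] chan_sip.c: Peer '121' is now Reachable. (12ms / 2000ms)",
    "[2016-09-27 09:42:25] NOTICE[2258] chan_sip.c: Peer '119' is now Lagged. (3183ms / 2000ms)",
]

for peer, ms in slow_qualifies(sample):
    print(f"Peer {peer}: {ms} ms")
```

If most peers on the same switch show up in that list, that is one more data point to hand to the network team.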

It’s really hard to troubleshoot this remotely without more information, but my first inclination would be to have your network guy take a look and see if you might be having a switch that’s dying.

Thanks so much for the response. Good stuff in there. I’d be happy to provide more details, but I’m not thinking the diagnosis will change. It is more than likely, in my opinion, that we have bad network hardware somewhere.

All the phones are Yealink, varying models but mostly the same (T38Ps and T23Gs). I also have another client with the same sort of troubles (constantly dropping qualify), running on all Polycom phones. Symptoms are different and vary in severity, but so far we’ve solved every problem we’ve come up against by optimizing the network somehow:

Adding QoS priority on phone traffic and separating the phones onto their own VLAN really helped. Replacing some really old outdoor cabling also helped. We’re replacing an entire stack of switches (or at least the stack modules) in the near future based on my recommendations and findings, and we’ve made almost zero changes to the PBX (we only needed to change a thing or two to add QoS tags to RTP traffic on all devices/servers).

I’ve done nothing but fight with my people that this is a network issue and not a phone server issue, and thanks to some backup from the community we’re getting to the bottom of things.
