Phones (Yealink) not registering after PBX reboot

I suspect this to be some NAT issue. Whenever I reboot my PBX, which is in a cloud VPS, the phones on site won’t re-register until I unplug and replug the ADSL router.

When I look in the logs, the only related entry I can find is in the fail2ban log, which I filtered on one of the extensions (presuming all have the same issue):

[2018-09-29 15:17:11] SECURITY[3104] res_security_log.c: SecurityEvent="ChallengeSent",EventTV="2018-09-29T15:17:11.616+0200",Severity="Informational",Service="SIP",EventVersion="1",AccountID="201",SessionID="0x7fb4fc00af70",LocalAddress="IPV4/UDP/MY.PBX.IP/5160",RemoteAddress="IPV4/UDP/MY.ADSL.IP/5160",Challenge="1dd572b4"

Which tells me the PBX receives the registration attempts and sends back a challenge, but then nothing happens. I’d like some hints on how I could get the router to stop messing here.

UPDATE
I’ve been playing around with this. Tried rebooting the PBX and re-registering the phones without rebooting the DSL router. I’ve noticed that when I change the local port of the phones to, for instance, 5160 (or anything other than the default 5060, for that matter), the phones connect. Re-setting them to the default 5060 some time later kept them registered. I think this may be some caching issue in the router?

I’ve seen routers that end up with a ‘poisoned’ NAT association after a registration failure. Then, repeated attempts from the phone prevent the bad association from timing out so it never registers.

An easy thing to try is changing Account → Advanced → SIP Registration Retry Timer for the relevant account(s) from 30 to 1200. This will give the router plenty of time to drop the bad association. The downside is that it can take up to 20 minutes after the system comes back up for the phone to reregister.

Possibly, the trouble only occurs when the router has to translate the source port number, in which case giving each phone a unique setting for Local SIP Port (with none at 5060) should fix the problem.

Router make/model? There may be settings that will prevent this situation.

If you need to reboot the PBX more than a couple of times per year, it may be worth tracking down and fixing the underlying cause.

@Luke1982 Here is what I would suggest to help determine your issue and not have to reboot the PBX to mimic this each time.

  1. SSH to the PBX
  2. fwconsole stop (This will completely shut down Asterisk and make it stop listening)
  3. Leave it stopped for the duration it takes for your PBX to do its reboot. So probably a couple minutes
  4. fwconsole start (this will bring everything back up)
  5. See if phones come back up on their own. Once they go down, their registration “re-try” timer kicks in, so give them a minute or so to come back up.
  6. If they don’t come back up, just reboot 1 phone and see what happens.

During all of this you’ll need some method of logging and/or viewing real-time activity on your router, to see whether connections are timing out and what traffic shows up where.

But the router make/model is an important question to have an answer to. So please do provide that.

Thank you both very much for your efforts in answering.

I’ve actually set the registration retry timer as low as I could set it (I think 30 secs, yes) in an attempt to make the phones re-register as soon as possible. The first thing I’ll try (on a Saturday) is to set this value way up like @Stewart1 suggested, to see if that prevents the ‘poisoned’ NAT association from keeping its poison.

I have tried giving each phone a different local port, which seemed to work at first, but after another reboot the same thing happened. So at first glance it seems that the act of changing the port matters more than which port it ends up at. In this instance, changing the ports back to 5060 solved the issue after the reboot.

@Stewart1 I upgraded the specs of the VPS the PBX runs on. Once every couple of months it ran out of RAM, causing the reboot. So I doubled the RAM and hope this removes the need for reboots, but for my own sanity I’d still like to understand this problem.

@Stewart1 and @BlazeStudios, the router make/model: it’s a FritzBox from AVM. I can’t check the exact model right now since I’m not on site, but it’s a default DSL ISP router, so very limited in options and debugging I’m afraid.

@BlazeStudios As to your point 5, do you mean the event in which they try to re-register after a failed attempt? I’ve yet to study the SIP RFC to understand whether re-registering always happens, or only after a failed registration attempt. In any case I never really understood how phones could be reachable behind NAT for incoming calls in the first place. The PBX sends a signal to the phones, but unless this is a response to a request I figured the router firewall would just reject the call. It does work, I just don’t understand how the PBX bypasses the NAT firewall.

As to your point 6, how would a reboot differ from a soft re-register? I believe you that it makes a difference, I just don’t know how the nuts and bolts work.

As for logging, I do see the PBX sending challenges, I guess in response to a phone trying to register. I think it’s those challenges that never reach the phones behind the NAT. I’ll see if I can get any Wireshark info on the router, but if I’m not wrong Wireshark cannot sniff anything that is not on the NIC of the device the software is running on.

In normal operation, when a REGISTER request is sent to the server, the router records an association that directs the replies to the proper LAN address and port. When there is no further traffic for some time, the association is deleted. The phone can also send ‘keep-alive’ packets to prevent deletion; if you want to try a long Registration Retry interval, I forgot to mention that you should also set Keep Alive Type to Disabled. In addition, if your extension is set to qualify=yes, the probe packets sent by Asterisk also serve to prevent the association from timing out. As long as the association is alive, an incoming INVITE looks like another ‘reply’ to the router and is passed to the phone. (Keep Alive shouldn’t be needed if Server Expires is shorter than the router’s timeout.)
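
To make the timeout behaviour concrete, here is a toy model in Python. It is only a sketch: the `NatTable` class, the 60-second idle timeout and the choice to let inbound packets refresh the entry are assumptions for illustration, not how any particular router (the FritzBox included) is known to behave.

```python
class NatTable:
    """Toy UDP NAT: maps a WAN port to a LAN (ip, port), with an idle timeout."""

    def __init__(self, idle_timeout=60):
        self.idle_timeout = idle_timeout  # seconds; hypothetical value
        self.entries = {}                 # wan_port -> [lan_addr, last_seen]

    def outbound(self, lan_addr, wan_port, now):
        # A packet from the phone creates or refreshes the association.
        self.entries[wan_port] = [lan_addr, now]

    def inbound(self, wan_port, now):
        # A packet from the PBX is forwarded only if a live association exists.
        entry = self.entries.get(wan_port)
        if entry is None or now - entry[1] > self.idle_timeout:
            self.entries.pop(wan_port, None)
            return None                   # dropped by the router
        entry[1] = now                    # assumption: inbound traffic also refreshes
        return entry[0]


nat = NatTable(idle_timeout=60)
phone = ("192.168.178.31", 5060)

nat.outbound(phone, 5060, now=0)      # REGISTER leaves the LAN
print(nat.inbound(5060, now=30))      # reply within the timeout: forwarded to the phone
print(nat.inbound(5060, now=200))     # 170 s of silence: association gone, packet dropped
```

With qualify or keep-alive packets arriving more often than the idle timeout, the entry never expires, which is why an INVITE can get through long after the last REGISTER.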

My PBX also has a slow memory leak that I never tracked down. It has swap space that starts to fill once memory is exhausted. Upon seeing that, I reboot it manually, typically about every 3 to 6 months. I suspect the trouble is something not directly related to Asterisk or FreePBX; fwconsole restart does not help. I have Yealink phones in four places; none lose registration as a result of a PBX reboot. Routers are Mikrotik, TP-Link, an ISP-supplied Cisco EPC3925, and an ISP-supplied Neufbox.

For packet capture on the LAN side of the router, the capture feature built into the Yealink phone is probably adequate, or you can capture with Wireshark on a PC connected to phone and router via a managed switch or an old dumb 10 Mbps hub. Or, run the phone traffic through the PC by using two bridged NICs.

You won’t be able to capture on the WAN side unless such a feature is built into the router, because the connection between the WAN interface and the DSL modem is completely internal. However, it’s IMO very unlikely that your ISP is causing the trouble, so running tcpdump on the PBX should give you the same as what you would see on the router WAN side.

If you catch Asterisk sending a reply to a port other than the source port of the request, try setting RPort on the phone for the account in question.
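
For what it’s worth, this check can be done from the Via header of the reply: with rport in use (RFC 3581), the server writes the source address and port it actually saw into `received` and `rport`, and the reply should go to that same port. A rough Python sketch follows; the helper name `parse_via` is hypothetical, the addresses and ports are made up, and the parser only handles this simple single-Via form.

```python
def parse_via(header_value):
    """Split a SIP Via header value into protocol, sent-by and a dict of parameters."""
    parts = [p.strip() for p in header_value.split(";")]
    protocol, sent_by = parts[0].split()      # e.g. "SIP/2.0/UDP", "host:port"
    params = {}
    for p in parts[1:]:
        name, _, value = p.partition("=")
        params[name] = value
    return protocol, sent_by, params


# Made-up example values, not taken from a real capture.
via = "SIP/2.0/UDP 192.168.1.20:5060;branch=z9hG4bK1234;received=203.0.113.5;rport=5063"
reply_dst_port = 5063   # destination port of the reply as shown by tcpdump

protocol, sent_by, params = parse_via(via)
lan_port = int(sent_by.rsplit(":", 1)[1])   # port the phone wrote into the request
wan_port = int(params["rport"])             # source port the PBX actually saw

print("router rewrote the source port:", lan_port != wan_port)
print("reply went to the request's source port:", reply_dst_port == wan_port)
```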

Again, thank you so much. I now understand much better how the phones are reached. I guess it’s a matter of router config how long the associations are kept around, and that deserves some tweaking. I think we’re on the right track here.

Today, out of nowhere, some of the phones failed to register. I saw in the dashboard widget that this happened yesterday at 23:00 sharp. Re-adjusting the phones’ ports fixed this, but there was no PBX reboot, so this must be some router issue. NAT is, by the way, disabled on the phones.

To be clear, I guess the Registration Retry Interval you talked about applies only after a registration failure? I mean, there is no need to retry when the phone is already registered. Server Expires I should read as the interval after which the phone re-registers at the PBX, I presume. I’ve downloaded the appropriate manuals from Yealink and will take the time to read them first.

All the SIP extensions have been set to:

  • nat=yes (force_rport,comedia)
  • qualify=yes
  • qualify freq = 60 (the default)

What I was thinking about: I have the option to set a port for the extension in the FreePBX GUI. What good will that do? Is that ignored when I enable NAT? I mean, the ports don’t match the internal ports on the phone, and even if they did, the ports I set in the phone are only valid on the LAN and won’t be exposed to the WAN. Or do the packets contain that internal port for routing?

Thanks for the tcpdump tip. I did some messing around and saw some things that concerned me, but I’m not sure how to read them:

20:29:56.455453 IP (tos 0x60, ttl 64, id 56376, offset 0, flags [none], proto UDP (17), length 591)
PBX_SERVER.5160 > SITE_IP.sip: [bad udp cksum 0x694b -> 0x0166!] SIP, length: 563
SIP/2.0 401 Unauthorized
Via: SIP/2.0/UDP 192.168.178.31:5060;branch=z9hG4bK103406207;received=SITE_IP;rport=5060
From: "EXTENSION_NICE_NAME" <sip:208@PBX_IP:5160>;tag=2776113901
To: "EXTENSION_NICE_NAME" <sip:208@PBX_IP:5160>;tag=as7d29cd07
Call-ID: [email protected]
CSeq: 23735 REGISTER
Server: FPBX-14.0.3.19(13.22.0)
Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY, INFO, PUBLISH, MESSAGE
Supported: replaces, timer
WWW-Authenticate: Digest algorithm=MD5, realm="asterisk", nonce="17b88107", stale=true
Content-Length: 0

Where 192.168.178.31 is the correct LAN IP of this phone.

PBX_IP.5160 > SITE_IP.5063: [bad udp cksum 0x6954 -> 0xc641!] UDP, length 572

Is also one I saw, where I noticed the destination port (5063) is a port I set on one of the phones. But this packet is addressed to the router. So can I assume that a port I set on the phone will make the router use this same port as its WAN port?

Theoretically (won’t really do this since I understand the security risk) I could forward ports on the router to the phones and setup the extensions in FreePBX to use these ports?

UPDATE
Reading the manual for the phones I just discovered I can do a trace route to debug. I’m not on site right now but when I am I’ll be sure to do that (preferably when a phone fails to register) to see what I can find.

UPDATE 2


OK, so this answers my earlier question about the tcpdump. The port you specify will always be the source port in the header. Still not sure if this will also be the WAN port on the router, but it kind of explains the tcpdump where the rport was the port no. I set in the phone: it is simply responding to the port that the phone sent in the header.

Most routers preserve the source port number where possible (the packet sent on the WAN port has the source IP address translated to the public IP, but the source port number is unchanged). However, if two or more devices are using the same source port, the router will assign a different port number for all but one of them.
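
As a rough illustration of that port-preserving behaviour, here is a small Python sketch. The function name `allocate_wan_port` and the fallback rule (first free port from 1024 upward) are assumptions; real routers choose the substitute port in their own way.

```python
def allocate_wan_port(lan_addr, table, first_free=1024):
    """Return the WAN port for lan_addr = (ip, port), keeping the port if it is free."""
    for wan_port, owner in table.items():
        if owner == lan_addr:
            return wan_port           # existing association is reused
    wan_port = lan_addr[1]
    if wan_port in table:             # port already used by another LAN device
        wan_port = first_free         # hypothetical fallback rule
        while wan_port in table:
            wan_port += 1
    table[wan_port] = lan_addr
    return wan_port


table = {}
print(allocate_wan_port(("192.168.178.31", 5060), table))  # 5060: preserved
print(allocate_wan_port(("192.168.178.32", 5060), table))  # 1024: translated, 5060 is taken
print(allocate_wan_port(("192.168.178.33", 5063), table))  # 5063: preserved
```

Giving each phone its own Local SIP Port means the router never has to fall into the translated case.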

The 401 Unauthorized response is not necessarily an error, though in this case it probably is. What you need to look at is the Authorization header in the next REGISTER request received from that phone. If the phone ‘heard’ the response, it will use the nonce sent by the PBX, in this case 17b88107. If the response was dropped (or misdirected) by the router, the next REGISTER will be a simple retransmission and will contain the ‘old’ nonce (same as the previous REGISTER).
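
If you save the headers from a capture, the comparison can be scripted. A minimal Python sketch: the challenge string is the one from the capture above, while the Authorization header, its response value and the function names are made up for illustration.

```python
import re


def extract_nonce(header_value):
    """Return the nonce from a WWW-Authenticate or Authorization header value."""
    # \b keeps this from matching the 'nonce' inside a 'cnonce' parameter.
    match = re.search(r'\bnonce="([^"]*)"', header_value)
    return match.group(1) if match else None


def phone_heard_challenge(challenge_header, authorization_header):
    """True if the next REGISTER uses the nonce the PBX just sent."""
    if authorization_header is None:      # REGISTER carried no Authorization at all
        return False
    return extract_nonce(challenge_header) == extract_nonce(authorization_header)


challenge = 'Digest algorithm=MD5, realm="asterisk", nonce="17b88107", stale=true'

# Hypothetical Authorization header from the phone's next REGISTER.
next_register = ('Digest username="208", realm="asterisk", nonce="17b88107", '
                 'uri="sip:PBX_IP:5160", '
                 'response="0123456789abcdef0123456789abcdef", algorithm=MD5')

print(phone_heard_challenge(challenge, next_register))
```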

The meaning of the various registration timers is explained in http://forum.yealink.com/forum/archive/index.php?thread-4246.html .

Yealink phones have a packet capture feature that may prove useful, though if you see repeated REGISTER requests that don’t reflect the nonce sent by the PBX, you can be pretty sure that the phone didn’t see the reply.

If you turn off keep-alive from the phone and set a long SIP Registration Retry Timer, with luck that will let the router forget the bad association and the phone should be able to register again. The qualify packets sent by the PBX are also a form of keep-alive, but when the problem occurs I am hoping that the router is totally dropping them and the bad association will time out. Conceivably, the router will use those packets to keep the association alive even though they aren’t being forwarded, in which case you’ll have to set a long qualify interval or turn qualify off.

Possibly, your router has some firewall-related settings that will help with this issue.

In any case, I strongly believe the ‘poisoned association’ theory is correct. I was thinking about this and realized that, of course, setting a new local port number on the phone and re-registering at the PBX sets up a new WAN port association for this phone on the router, hence deleting the old ‘bad association’ or at least creating a healthy new one.

I checked your suggestion by setting tcpdump to show traffic both ways (I used ‘src’ before, switched to ‘host’). In fact the nonce the phone used was the nonce the PBX sent to the phone, so the 401 reached the phone. I keep seeing this message for this phone but ironically this phone is registered. No idea why it gets a 401; it just works right now.

Thanks for the link, some good info there. I’m going to play around with the settings on the phones, rebooting the PBX several times on a Saturday when the office is closed so I can do some stress-free testing. I think I understand what you mean by tweaking the settings to make the router delete the bad associations. What I hope to learn is what makes the associations ‘bad’. I mean, what exactly happens? Are the packets rejected by the router when they shouldn’t be? Are they misdirected to a wrong LAN IP? I hope I can get the phones to act up on demand so I can debug in real time. Because now everything works, so I’m not sure if what I’m seeing has anything to do with my issue.

Seeing a 401 for a phone that is registered is normal. The phone re-registers at intervals determined by Server Expires. In the absence of trouble, each registration consists of four packets:

  1. Phone sends REGISTER without Authorization header or with a stale one.
  2. PBX sends 401 Unauthorized with a nonce.
  3. Phone sends REGISTER with an Authorization header containing a hash calculated with above nonce.
  4. PBX sends 200 OK.
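
Step 3 can be sketched in Python for the plain MD5 digest without qop (RFC 2617), which is the form the challenge shown earlier in this thread uses. The password, the URI and the function names are made up; this is an illustration of the calculation, not code taken from the phone or from Asterisk.

```python
import hashlib


def md5_hex(text):
    """MD5 of a string, as lowercase hex."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri, nonce):
    """Digest response without qop: MD5(HA1:nonce:HA2)."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")   # credentials
    ha2 = md5_hex(f"{method}:{uri}")                  # request being authorized
    return md5_hex(f"{ha1}:{nonce}:{ha2}")


# Hypothetical password and URI; the nonce is the one from the 401 above.
response = digest_response("208", "asterisk", "not-the-real-secret",
                           "REGISTER", "sip:pbx.example.com:5160", "17b88107")
print(response)
```

Because the nonce is part of the hash, a REGISTER that still carries the old nonce shows that the phone never processed the latest 401.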

The question is: when registration is failing, are the 401s sent by the PBX reaching the phone?

Ah nice, thanks for the heads up.

Well that indeed is the question. I’ll try to force a registration failure and see what happens at that specific moment. Thanks again for your help so far. I learned a lot.
