Some help on debugging SIP

This is not a ‘solve my problem’ post. Actually I have a problem, and I’d like some tips on how to debug it. I have the following setup:

  • a FreePBX 13 setup on CentOS 6.9 running on a VPS. Running ‘extended’ modules like the responsive firewall
  • A site where the phones are, which has a Huawei B525 4G modem/router. The reason behind this is that the site has no cable or fibre internet available and ADSL speed is really poor. The downside here is that the 4G provider has their own NAT ‘in the air’ and there is no static IP address. I ‘solved’ this by setting up a dynamic DNS (DDNS) hostname that updates frequently. The DDNS hostname is in the responsive firewall’s ‘trusted’ zone.
  • 3 Yealink T41P phones, all firmware up to date and a Yealink W52P DECT station with one handset, also up to date. All connected through chan_sip.
  • The inbound route routes all incoming calls into a timegroup. During business hours and days it routes to a queue that rings all phones. Else it just routes to a message that ends in voicemail.

For the first 8 months or so, this setup worked without a single hiccup. Recently, we’ve run into some problems. The phones would not ring on inbound calls. Sometimes it was just a couple of phones; lately it’s been all or nothing. I’ve dived into the full log and saw that quite frequently some or all of the phones would fail to qualify and therefore become ‘UNREACHABLE’. I then started to experiment with the qualify intervals and wait times. I modified the intervals and times for some extensions and not for others, to see whether the extensions I modified would escape the ‘fallouts’. This seemed to work at first, but soon that stopped working as well.

I set up the server to send me a tail of the full log, grepped for the word ‘UNREACHABLE’, so I have a clearer view of what is happening. I now see that the phones all become unreachable at approximately the same time every day. This report does not include the extensions for which I switched off qualify, of course, but in practice they don’t respond either. Rebooting the Huawei 4G router usually solves it.
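One way to check the ‘same time every day’ pattern is to count the UNREACHABLE events per hour of day. The following is only a rough sketch: the log line shape in the comment is what chan_sip usually writes, but it is an assumption here and should be checked against your own full log, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed chan_sip log line shape (verify against your own full log):
# [2018-03-01 09:14:02] NOTICE[2345] chan_sip.c: Peer '101' is now UNREACHABLE!  Last qualify: 0
LINE_RE = re.compile(
    r"^\[(?P<date>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2}\].*"
    r"Peer '(?P<peer>[^']+)' is now UNREACHABLE"
)


def unreachable_by_hour(lines):
    """Count UNREACHABLE events per hour of day, summed over all peers."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.group("hour")] += 1
    return counts
```

To use it, pass an open file object for the full log (on FreePBX typically /var/log/asterisk/full) and print the sorted items. A sharp peak in one hour would support the idea of a scheduled event such as a forced IP change.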

I asked the VPS provider to monitor the internet speed of the VPS around the time the phones usually stop responding, and they found nothing unusual. So now I’m kind of stuck. I suspect the 4G internet connection or the Huawei router to be the culprit, or maybe the DDNS system. My main question is: what would be the best approach to REALLY see in detail what’s going on? Are there any logs I can turn on, or monitoring tools you would recommend? Are there any messages in the full log I should keep an eye out for? Any tips would be greatly appreciated, thanks in advance.

Look for information on Wireshark. That will give you a lot of good low-level information.

From the server, you can use the “sip debug” commands. Log into the console and connect to the Asterisk CLI… Once there, use command completion on the chan_sip and PJSIP commands to turn the debugging level up (for chan_sip, for example, “sip set debug on”).

Between these two, you should get some good info.

As a matter of “important thing to know”: a DDNS change takes a while to be picked up (roughly 10 to 15 minutes), so if the connection drops and comes back with a new address, the server will keep using the old IP of the remote extensions for that period.

IMO your first priority is to find out what’s wrong when the outage occurs. Does the internet connection get lost altogether, e.g. you can’t ping 8.8.8.8 until you reboot? If so: Check that router has latest firmware. Look for any settings in the web interface that may be relevant. If feasible, see whether a SIM from another mobile operator has the same issue. If no luck, you will need another router, or perhaps you could limp along with a script that detects lost connectivity and reboots the router automatically.
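The ‘detect lost connectivity and reboot’ idea could look roughly like the sketch below, run from a machine on the site LAN. The ping flags are the Linux ones. How to actually reboot the router is deliberately left as a callback you supply, since that depends on the device; all function and parameter names here are made up for illustration.

```python
import subprocess
import time


def host_reachable(host="8.8.8.8", timeout_s=2):
    """Return True if a single ICMP echo to host succeeds (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watch(check, reboot, max_failures=3, interval_s=60, sleep=time.sleep, rounds=None):
    """Call reboot() after max_failures consecutive failed checks.

    check and reboot are callables supplied by the caller; rounds=None loops forever.
    """
    failures = 0
    done = 0
    while rounds is None or done < rounds:
        # Reset the counter on success, otherwise count one more failure.
        failures = 0 if check() else failures + 1
        if failures >= max_failures:
            reboot()
            failures = 0
        done += 1
        sleep(interval_s)
```

Requiring several consecutive failures avoids rebooting on a single lost ping, which is common on a mobile link.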

Or, is there simply a public IP address change from which (for reasons unknown) the firewall can’t recover? If so: Determine whether the DDNS name updates correctly. If not, fix that (with a different client or settings, etc.) If it does update, see why subsequent registration attempts don’t get through.
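To determine whether the DDNS name keeps up, you can compare what DNS returns for the hostname with the address the outside world actually sees, from a machine at the phone site. This is a minimal sketch: the ‘what is my IP’ URL is just one example of such a service and is an assumption, as are the function names.

```python
import socket
import urllib.request


def resolve_ddns(hostname):
    """Resolve the DDNS name to the IPv4 address DNS currently returns."""
    return socket.gethostbyname(hostname)


def fetch_public_ip(url="https://api.ipify.org"):
    """Ask an external 'what is my IP' service; the URL is an assumed example."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode().strip()


def ddns_is_current(hostname, resolve=resolve_ddns, public_ip=fetch_public_ip):
    """Return (matches, dns_ip, real_ip) for the given DDNS hostname."""
    dns_ip = resolve(hostname)
    real_ip = public_ip()
    return dns_ip == real_ip, dns_ip, real_ip
```

Logging the result every minute or so, with a timestamp, should show how long the name lags behind after each address change. Note that the local resolver may cache the old answer, so the server can see a different result than the site does.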

Even if there is no other trouble, an IP address change will drop calls currently in progress. The ‘about the same time each day’ is suspicious – I’ve seen ISPs that force an address change after 24 hours. If that’s your problem, I’d expect the outage to be a little later each day, i.e. 24 hours after the reboot. Assuming that you can fix the recovery issue, rebooting at e.g. 02:00 should avoid drops during business hours.

If you’re in a panic to reboot because you don’t want to lose any customer calls, set up the PBX (using Follow Me, setting the Not Reachable destination, etc.) so calls to unreachable extensions get forwarded to the users’ mobile phones. Then, you can see what’s wrong in a relatively leisurely way.

Note that if the public IP changes, the phones are dead until the next registration attempt, even if everything else works as it should. Try setting a short expiry (in your phones, I believe it’s called Server Expires), e.g. 60 seconds (default is one hour).

If all else fails and it’s not cost prohibitive, consider using ADSL just for VoIP and keeping the faster 4G for your other business purposes. Even a slow ADSL line should do 384 kbps upstream, enough for four concurrent calls without compression. Each service would be a backup for the other.
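The ‘four concurrent calls’ figure can be sanity-checked with a little arithmetic. The sketch below assumes G.711 (64 kbps audio, no compression), 20 ms packetization and IPv4/UDP/RTP headers only; link-layer overhead on the ADSL line is not included, so the real margin is smaller.

```python
def g711_call_kbps(packet_ms=20):
    """One-way bandwidth of a G.711 call at the IP layer, in kbps."""
    payload_bytes = 8000 * packet_ms // 1000  # G.711 produces 8000 bytes of audio per second
    header_bytes = 20 + 8 + 12                # IPv4 + UDP + RTP headers
    packets_per_s = 1000 / packet_ms
    return (payload_bytes + header_bytes) * 8 * packets_per_s / 1000


def max_calls(upstream_kbps, per_call_kbps):
    """Number of whole calls that fit into the upstream bandwidth."""
    return int(upstream_kbps // per_call_kbps)
```

With these assumptions one call needs 80 kbps in each direction, so max_calls(384, g711_call_kbps()) gives 4.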

Thank you both for your answers. I’ll be sure to try Wireshark and turning up the SIP debugging levels. Both of you were kind of pointing to the DDNS service, which is also what raised my suspicion. I saw a ‘gap’ between the DDNS IP and the real public IP this morning.

The dynamic IP address change frequency is a T-Mobile thing, can’t change that. What I suspect is that somewhere along the line T-Mobile changed the frequency with which the public IP gets updated. The DDNS service was a freebie that came along with a Netgear Wi-Fi router also installed on site. The DDNS updated whenever the DHCP lease of the Netgear expired. Minimum DHCP lease time is one hour. So basically the DDNS service updated once every hour. This worked fine at first, but I suspect T-Mobile increased their dynamic IP refresh frequency. I suspect the DDNS ‘gap’, as basically both of you suggested.

BTW: the lifetime of the public IP was way shorter than 24 hours. It used to be around 90 minutes, but like I said, I suspect it’s even shorter now.
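As a back-of-envelope check on how bad that ‘gap’ can get: after an address change the DDNS record stays wrong until the next update, and resolvers may then cache the old answer for up to the record’s TTL. The sketch below is only an upper bound under those assumptions; the TTL parameter and the function names are hypothetical, not values from this setup.

```python
def worst_case_stale_minutes(ddns_update_min, dns_ttl_min=0):
    """Longest time the old IP can still be served after an address change."""
    return ddns_update_min + dns_ttl_min


def worst_case_stale_fraction(ip_lifetime_min, ddns_update_min, dns_ttl_min=0):
    """Upper bound on the share of each IP lifetime spent with a stale DDNS record."""
    stale = worst_case_stale_minutes(ddns_update_min, dns_ttl_min)
    return min(1.0, stale / ip_lifetime_min)
```

With an hourly update and a 90-minute address lifetime, the bound is 60 out of every 90 minutes. Once the address lifetime drops below the update interval, the record can be wrong all the time.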

Right now, just because there was a lot of stress about not being reachable, we did exactly as Stewart1 suggested: install a second ADSL connection that, while slow, is good enough for telephony and moreover has a static IP address. So now the PCs all work on the 4G network but the phones work through ADSL.

We installed all of this today, so I’ll keep this thread alive for anyone else experimenting with 4G internet for PBXs. All I can say at this point is: make sure you get a static IP.

Some quick answers:

  • The internet does not drop, keeps working. It’s purely the phones that don’t register.
  • The DDNS update freq was as high as I could set it
  • I think you have a very important point when you say: “if the public IP changes, the phones are dead until the next registration attempt”. I could have tried shortening the Server Expires time on the phones, which would surely improve quality of service (i.e. shorter periods with unregistered phones). Right now we’re moving to the ‘safe place’, which is ADSL / static IP, exactly like your last suggestion.
