FreePBX Kicks All Endpoints

ethanberg · March 20, 2018, 4:03am

Hey everyone,

I have a production system running FreePBX 14 with Asterisk 15.2.2 and about 40 extensions utilizing TLS for PJSIP and SRTP for call audio. All of our production systems have the exact same setup and they work beautifully; no problems whatsoever. However, this one only works when it wants to.

Like I said, it has about 40 extensions with Yealink phones and Bria Mobile softphones for iOS and Android users on-the-go. Normally the system processes calls without any issues whatsoever. BUT once or twice a day, without any warning, all devices lose their registration and the system stops processing all calls. The system’s admin web UI is still accessible and FreePBX can connect to Asterisk. The same goes for the UCP. The statistics graphs on the FreePBX dashboard show all endpoints going offline simultaneously. Obviously, Asterisk is still running, but it seems to say, “I’m outa here” and leaves everyone in the dust. Both the Firewall module and Fail2Ban show that no IPs have been blocked.

The logs are what really throw me for a loop. Right when Asterisk stops processing registrations and calls, the log fills up with

“res_pjsip/pjsip_transport_management.c: Shutting down transport ‘TLS to IP.IP.IP.IP:Port’ since no request was received in 32 seconds”

It also has a few:

WARNING[2228] pjproject: SSL 6 [SSL_ERROR_ZERO_RETURN] (Read) ret: 0 len: 32000

thrown in there. I’ve replaced all the networking equipment and cables, reformatted the system and started from scratch, loaded new certificates, and verified the certificate settings in Asterisk SIP Settings, all to no avail. The only recourse is to run fwconsole restart which brings the system back up for an indeterminate amount of time before it quits again.
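
For anyone chasing the same symptom, a quick tally of the transport shutdown messages per remote address can show whether every endpoint drops at once or only a few do. The following is a minimal sketch, not anything from FreePBX itself: the function name and the sample lines are hypothetical, the addresses are placeholders from the documentation range, and the pattern is based only on the message quoted above.

```python
import re
from collections import Counter

# Pattern modeled on the res_pjsip message quoted in this post.
# Accepts straight or curly quotes around the transport description.
SHUTDOWN_RE = re.compile(
    r"Shutting down transport\s+['‘]TLS to (?P<remote>[^'’]+)['’]"
)


def count_transport_shutdowns(lines):
    """Return a Counter mapping remote 'host:port' to shutdown message count."""
    counts = Counter()
    for line in lines:
        match = SHUTDOWN_RE.search(line)
        if match:
            counts[match.group("remote")] += 1
    return counts


# Hypothetical sample lines; the addresses are placeholders.
sample_log = [
    "res_pjsip/pjsip_transport_management.c: Shutting down transport "
    "'TLS to 192.0.2.10:5061' since no request was received in 32 seconds",
    "WARNING[2228] pjproject: SSL 6 [SSL_ERROR_ZERO_RETURN] (Read) ret: 0 len: 32000",
    "res_pjsip/pjsip_transport_management.c: Shutting down transport "
    "'TLS to 192.0.2.11:40112' since no request was received in 32 seconds",
    "res_pjsip/pjsip_transport_management.c: Shutting down transport "
    "'TLS to 192.0.2.10:5061' since no request was received in 32 seconds",
]

shutdown_counts = count_transport_shutdowns(sample_log)
for remote, count in shutdown_counts.most_common():
    print(f"{remote}: {count}")
```

To run it against a real system, pass an open log file instead of the sample list. On a typical FreePBX install the Asterisk log is usually `/var/log/asterisk/full`, but that path is an assumption and may differ on your machine.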

This system is in use by the Air Force Auxiliary for search and rescue mission coordination. It’s critical that I get this figured out, but I can’t seem to find anybody else having this issue in the forums, and I’ve exhausted my personal knowledge. I’m guessing it’s something simple I’m missing and that I’m simply stressed over this enough to where I’m missing something basic. Any help someone can provide would be hugely appreciated.

Thank you!

PitzKey · March 20, 2018, 10:37am

First of all, thank you for your service!

Does this happen to ALL phones? Is there a router between these Yealink phones and the PBX?

tonyclewis · March 20, 2018, 12:34pm

Not sure I would be using cutting edge Asterisk 15 here. Downgrade back to stable LTS version 13.

ethanberg · March 20, 2018, 12:51pm

Hi Pitzkey,

I do have a router with phones on both sides of it. They both run into the same problem simultaneously.

Thanks for responding so quickly!
Ethan

ethanberg · March 20, 2018, 12:54pm

Hey Tony,

I had them on Asterisk 13 for the first four months we were experiencing this issue. Switching to 15 surprisingly improved the situation. On 13, it would fail several times per day and the entire machine would lock up when it did, severely enough that even the local keyboard and console wouldn’t work. It required a physical power disconnect to reset the machine. With Asterisk 15, everything keeps running; it just doesn’t accept SIP connections, and the number of occurrences is way down (though it’s still unreliable).

Thank you for the recommendation!
Ethan

PitzKey · March 20, 2018, 2:35pm

I don’t have much experience with PJSIP, but it seems to me like a network configuration issue.

Do you have the FreePBX firewall enabled?

Can you try setting up a phone on the same LAN as the PBX and see if the issue occurs?

EDIT: You may also want to read these two links

ethanberg · March 20, 2018, 2:53pm

Thanks, Itzik! We have phones both inside the local LAN (behind the router) and outside the LAN. They both have the same problem simultaneously. While we are using the FreePBX firewall, the issue persists even if we disable the firewall and Fail2Ban.

My first thought was a network configuration issue too, but I’ve combed through that. Plus resetting the network on the machine doesn’t bring it back up; the only thing that brings Asterisk back online is to restart the Asterisk service. Nothing else has any effect. It’s almost like something is happening that’s triggering Asterisk to close all its connections. The SSL Zero Return error is what keeps jumping out at me. I’m wondering if something is awry with the certificate implementation, or the SSL implementation as a whole (since I’ve already tried a new certificate). It seems like SSL stops processing traffic, so the phones lose their connection.

I feel like I’m rambling. Am I making any sense? Hahaha!

Thanks again!
Ethan

PitzKey · March 20, 2018, 3:08pm

Sorry, but troubleshooting SSL issues is out of my scope. I’m not saying this is the issue, but I don’t have a lab machine right now where I can try to reproduce this.

I’m sure there are people here who have experience with this.
(Again, I’m not saying that this is the issue.)
Did you have a chance yet to read the two links?

As always, paid support is available; click the link at the top.

ethanberg · March 20, 2018, 3:30pm

Totally understand! Heck, if this were in my scope I’d hopefully have it fixed already! I’ve bookmarked the two links so I can read them on my lunch break.

Seriously considering it, but since this is all non-profit, I’m trying to avoid it. Gotta do what I gotta do, though.

netphoneusa · March 20, 2018, 9:12pm

Two questions for you:

Why did you choose PJSIP over chan_sip?
Why TLS over UDP?

ethanberg · March 21, 2018, 12:04am

Hi Andrew,

We chose PJSIP because of the need for multiple endpoints to be registered to one extension. The system is rapidly growing and expanding, so creating chan_SIP extensions for each of what will soon be hundreds of devices spread across the state and then programming and maintaining Follow Me settings for each one wouldn’t be practical.
The IT officer for the Wing wanted the SIP traffic and RTP streams to be encrypted since the majority of the phones are external and staff members can’t keep their smartphones connected to a VPN all the time, especially since they’re personal devices and can’t be corporately managed. So we had to open SIP up to the public and secure it as best we could with TLS and SRTP.

Thanks, Ethan

ethanberg · March 21, 2018, 2:40am

At a loss, I’ve just transitioned back to Asterisk 13 to see what happens. Thanks!

ethanberg · March 30, 2018, 1:42am

Hey everyone,

Transitioning back to Asterisk 13 didn’t help, nor did changing back to chan_sip. That being said, we have made some progress in tracing back the issue. Right when the system disconnects, the logs note:

openvpn: write UDP: Network is unreachable (code=101)
avahi-daemon[590]: Withdrawing address record for x.x.x.x on eth0.
avahi-daemon[590]: Leaving mDNS multicast group on interface eth0.IPv4 with address x.x.x.x.
avahi-daemon[590]: Interface eth0.IPv4 no longer relevant for mDNS.

There aren’t any hardware problems; I’ve been able to confirm that on multiple machines on completely separate networks. The only way to recover is a full reboot. The curious thing is that I’ve found this exact issue in bug reports for CentOS, RedHat, Debian and Ubuntu. Most, if not all, have confirmed the bug and listed it as critical. So a lot of people are having this issue on many different distributions, and now it’s plaguing several phone systems that are relied upon to help save lives. I only have one that isn’t exhibiting this behavior and I’m thinking that’s because it hasn’t been updated since July. I’m finding that most of the reports of this bug from other people start around August/September.

I know avahi-daemon and openvpn are just reporting the issue and aren’t the cause, but I don’t know where to go from here. I have to get these machines working reliably.
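
To line up the moment the address disappears with the moment the phones drop, it can help to pull just the network-loss lines out of the system log. Here is a rough sketch of how that could be done. The marker strings are copied from the log lines above; the function name is made up for illustration, and the sample lines are modeled on the ones quoted in this post.

```python
# Substrings taken from the log lines quoted above.
NETWORK_LOSS_MARKERS = (
    "Network is unreachable",
    "Withdrawing address record",
    "no longer relevant for mDNS",
)


def find_network_loss_events(lines):
    """Return (line_number, marker, line) for each line containing a marker."""
    events = []
    for number, line in enumerate(lines, start=1):
        for marker in NETWORK_LOSS_MARKERS:
            if marker in line:
                events.append((number, marker, line.rstrip("\n")))
                break  # report each line once, under its first matching marker
    return events


# Sample input modeled on the lines quoted in this post.
sample_syslog = [
    "openvpn: write UDP: Network is unreachable (code=101)",
    "avahi-daemon[590]: Withdrawing address record for x.x.x.x on eth0.",
    "avahi-daemon[590]: Leaving mDNS multicast group on interface eth0.IPv4 with address x.x.x.x.",
    "avahi-daemon[590]: Interface eth0.IPv4 no longer relevant for mDNS.",
]

loss_events = find_network_loss_events(sample_syslog)
for number, marker, line in loss_events:
    print(f"line {number}: [{marker}] {line}")
```

Feeding it the real system log (often `/var/log/messages` on CentOS-style systems, though that is an assumption about the install) gives a short list of timestamps to compare against the Asterisk log.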

Any ideas? (PLEASE say yes!)
Thanks!
Ethan

da1 · April 13, 2018, 5:20pm

Hello Ethan,
I actually experienced this exact issue recently, though I’m not sure if my cause is the same.
I too was hammered by “res_pjsip/pjsip_transport_management.c: Shutting down transport ‘TLS to ipaddress:5061’ since no request was received in 32 seconds”

It would hit literally every TLS phone out of the blue, repeating the above error for each device, kicking them all off repeatedly, in what I can only describe as a “TLS storm” haha. I never saw the “SSL_ERROR_ZERO” you described, though, so it may be a different cause.

It’s a long story, but the short of it is, the root cause was one troublesome phone out there. After some exhaustive troubleshooting, I could reliably reproduce this “TLS storm” on demand using any phone.

1. Register phone on UDP 5060 across a VPN
2. Swap phone to use TLS 5061 across WAN, keeping the same ext/pass
3. Watch as literally every TLS phone experiences that “Shutting down transport TLS” error.

I’m using FreePBX 14, Ast 14, and SSLV23 for my TLS.

I factory reset the difficult phone, updated its firmware, and now the problem’s gone. I suspect it was having issues and accidentally swapping between UDP and TLS somehow. You mentioned you swapped literally everything on the system side (OS, hardware, cables, etc). Have you ever tried swapping or resetting your phones and endpoints?

The only other thing I could think of would be to redo your SSL cert (delete it, new csr, regen from your host, import) or try a new one entirely since you’re seeing that SSL error, which I did not.

ethanberg · April 23, 2018, 3:49pm

Hi @da1 and @PitzKey

I’m actually glad to say the situation seems to be resolved another way. There seems to be an inherent bug in the DHCP client that almost all distributions of Linux are using. After pulling all my hair out and making sacrifices to the forum gods, I started seeing reports of this exact same issue across Ubuntu, Debian, CentOS/RHEL, Fedora, everything. It turns out that when it comes time to renew the DHCP lease, the client fails, the address gets released, and the system becomes disconnected from the network. This issue first hit Ubuntu in August 2017 and has strangely been progressing through distributions ever since. The workaround is to disable DHCP and manually input a static IP address. Obviously servers should almost always have static addresses, but my personal policy has always been to keep the server on DHCP (depending on the environment) and have the DHCP server give a static IP reservation. It’s just easier to manage all the IPs from one place instead of on each individual machine. But obviously I can no longer do that until this issue gets fixed. Since disabling DHCP on the PBX and giving it a static address, we’ve had no further issues on any of our affected PBXs.
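
For anyone wanting to check their own boxes, here is a minimal sketch of one way to tell whether an interface is still set to DHCP. It assumes a RHEL/CentOS-style `ifcfg` file (typically `/etc/sysconfig/network-scripts/ifcfg-eth0`), which is an assumption on my part since the thread never states the PBX's distribution. The function names are hypothetical and every address in the example is a placeholder.

```python
def parse_ifcfg(text):
    """Parse KEY=value lines (ifcfg style) into a dict, skipping comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Values may be wrapped in single or double quotes.
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def uses_dhcp(settings):
    """Return True when BOOTPROTO asks for a DHCP-assigned address."""
    return settings.get("BOOTPROTO", "").lower() == "dhcp"


# Hypothetical static configuration; all addresses are placeholders.
STATIC_EXAMPLE = """\
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.0.2.20
PREFIX=24
GATEWAY=192.0.2.1
DNS1=192.0.2.1
"""

static_settings = parse_ifcfg(STATIC_EXAMPLE)
print("DHCP in use:", uses_dhcp(static_settings))
```

On Debian or Ubuntu systems the network configuration lives elsewhere and uses a different syntax, so this sketch would not apply there as written.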

Thank you all so much for your help, especially @tonyclewis and his awesome team!

PitzKey · April 23, 2018, 3:53pm

Great job!
Idk why I didn’t think of using wireshark, you would probably see it there… Whatever. At least you have it fixed now.
