FreePBX systems randomly blocking incoming network traffic

This one is going to be tough to write out because the issue is extremely intermittent and at this point I am unable to reliably replicate it. For a few years now we’ve been experiencing a situation where a FreePBX system (we’ve also seen it on PBXact hardware) blocks all incoming network traffic, and the fastest way to get the issue resolved is to have somebody on site power cycle the hardware that’s running FreePBX, at which point everything starts working again.

If we try to access the GUI using the management webpage we get the following error:

Forbidden

You don’t have permission to access / on this server.

If we try to access through SSH we get the following error:

kex_exchange_identification: read: Connection reset by peer

Network traffic is blocked even if we try to access the system from the local network.

I should be clear that all incoming network traffic to the system stops when it’s in this state, not just management traffic; all incoming calls start failing when this happens as well.

This also happens randomly during any part of the day. We’ve woken up to systems needing to be restarted and we’ve seen systems go down in the middle of the day as well.

I’ve let this slide for a few years now given that it happens so rarely and that I am unable to reproduce the problem, but at this point we have quite a few systems out there and one gets into this state at least once a month. To be clear, this is not isolated to a single unit; I’m fairly sure a lot of the units out in the field have experienced it at one point or another, and as mentioned above we’ve also run into it with PBXact hardware. This is extremely problematic because physical access is needed to recover from this state, and sometimes the people on site are not knowledgeable enough to reset the hardware that we need them to reset.

I looked at the /var/log/messages and /var/log/secure logs and they seem to just stop around the time that the systems hit this state. Not sure what other logs I could pull to figure out what’s going on.

Any assistance with this would be greatly appreciated.

I’ve seen systems lock up and stop replying to traffic because the traffic hitting the server pushed it deep into swap. Are there any big network spikes when this occurs?

Under Settings > Advanced Settings > Chan PJSIP Settings > Allow Transports Reload:
I’ve had systems stop working because this was set to Yes. It should be No. When that’s the cause, the log file will typically have some notes in there.

Other helpful logs:

/var/log/asterisk/full*
/var/log/asterisk/freepbx_security.log
/var/log/asterisk/freepbx.log
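
A rough way to scan those for anything odd leading up to an outage (the patterns and paths are just a starting point, adjust to your setup):

# look for Asterisk warnings/errors in the hours before the lockup
grep -iE 'error|warning' /var/log/asterisk/full*
# check whether the FreePBX firewall/intrusion detection logged anything
less /var/log/asterisk/freepbx_security.log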

Just for the record, for anyone looking at this in the future: the setting is under the “Advanced SIP Settings” menu rather than Advanced Settings. In my case it is set to No by default on all the deployed systems.

I don’t think this is a workload issue, as it has happened randomly, and at roughly the same rate, to systems that process thousands of minutes a month and to systems that have a couple of extensions and take/make a handful of calls a year.

Maybe this is a clue, but looking at the Asterisk logs in addition to the other system logs, logging just stops around the time the outage occurs and doesn’t resume until after the power reset.

The weird thing is that the system is clearly still responding to network connection requests; the responses are error messages, but it is responding.
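
Next time one goes down I’ll try to confirm from another machine on the LAN exactly what is still answering; something like the following, where 192.168.1.10 is just a placeholder for the PBX address:

# does Apache actually answer, or is the TCP connection reset?
curl -v http://192.168.1.10/ -o /dev/null
# verbose SSH to see exactly where the handshake dies
ssh -vvv root@192.168.1.10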

I would suggest you install sysstat, as it records system activity in 10-minute ‘snapshots’; burgeoning problems in memory, load, network, and disk I/O are easier to spot.

man sysstat
man sar
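
Once it has been collecting for a while, you can pull the history for a given day back out with sar. Roughly, assuming the stock CentOS location of /var/log/sa/saDD (DD being the day of the month, the 11th in this example):

sar -q -f /var/log/sa/sa11        # load average and run queue
sar -r -f /var/log/sa/sa11        # memory and swap usage
sar -n DEV -f /var/log/sa/sa11    # per-interface network traffic
sar -d -f /var/log/sa/sa11        # disk I/O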

Is there any commonality in routers or other network ‘thingies’? Corrupt network streams could explain most of your symptoms.

The reason it took me so long to seek assistance is that I wasn’t sure whether a common hardware stack we were implementing was causing this, and there was no good way to rule it out (other than moving some clients to different hardware), so we lived with it. At this point we have enough diversity in the stack to be at least somewhat confident that it isn’t the cause.

As mentioned prior, we’ve experienced the problem/symptom on a PBXact Sangoma system that’s implemented in a network that’s using completely different networking hardware from what we typically install for a client.

I guess if there is any commonality I can think of, it’s that it seems to happen to physical systems only. We’ve got a few virtual deployments as well, and I can’t recall any of them experiencing this issue since deployment.

As far as deploying additional packages goes, I may do this if nothing else yields a good clue. I mostly hesitate because we’d need to deploy it to quite a few systems to make it useful, since I have no way of predicting which system will run into this issue and when.

That’s the beauty of sysstat: it runs quietly in the background, rotating its logs every 30 days(ish) by default, so it needs zero maintenance.

If Red Hat based:

https://www.server-world.info/en/note?os=CentOS_7&p=sysstat&f=1
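
On CentOS 7 that works out to roughly:

yum install -y sysstat
systemctl enable sysstat
systemctl start sysstat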

I guess the other commonality is that we did use the FreePBX ISO for these systems so they are all CentOS.

I’ll give this a shot to see if it can shed any more light on the issue.

Thank you.

Actually, it looks like this is installed and running by default on FreePBX ISO installs, so it seems we had it running on all these machines all along.

Now to figure out how to dig into them to get any useful info.

And it looks like sysstat’s data collection also stops around the time the issue occurs, and the stats look pretty boring at the time of the “crash”: no different than any other time during the day, with 96% idle CPU and no real spikes in memory usage either.
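
Given that the 10-minute snapshots show nothing, I may tighten the collection interval so the last sample before a crash lands closer to the event. On CentOS 7 the collector is driven from /etc/cron.d/sysstat; changing the sa1 line to run every minute should do it (paths are from the stock package, so double-check on your install):

# default entry samples once every 10 minutes:
# */10 * * * * root /usr/lib64/sa/sa1 1 1
# sample once a minute instead:
* * * * * root /usr/lib64/sa/sa1 1 1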

Then I would concentrate on /var/log/syslog (/var/log/messages), as they reflect kernel events as closely as possible, and on /var/log/auth for ssh failures.

`/var/log/messages` vs `/var/log/syslog`: `/var/log/messages` is usually the syslog on non-Debian/non-Ubuntu systems such as RHEL or CentOS. It aims at storing valuable, non-debug and non-critical messages, and should be considered the “general system activity” log.
`/var/log/syslog` in turn logs everything except auth-related messages.

Look for OOM (out of memory) kills or other signs in the period immediately after sysstat goes pear-shaped.
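
Something like this will turn up OOM-killer activity, assuming the kernel got a chance to write it out before the box stopped responding:

grep -iE 'out of memory|oom-killer|killed process' /var/log/messages*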

Been checking the messages logs, and the only thing we got before logging stopped was a bunch of:

sshd[8768]: WARNING: 'UsePAM no' is not supported in Red Hat Enterprise Linux and may cause several problems.

I’ve changed that setting in sshd to correct the config, but there’s really nothing indicating that the system is about to crap itself.
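
For the record, the edit was just this in /etc/ssh/sshd_config, followed by a syntax check:

# /etc/ssh/sshd_config
UsePAM yes

# verify the config parses cleanly
sshd -t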

Little reset here: how are you installing FreePBX? On RHEL or otherwise? Let’s have the whole skinny; /var/log/auth.log might be diagnostic.

The Sangoma hardware that it’s happened to comes preinstalled, and on the hardware that we provide we use the FreePBX ISO, so it’s running CentOS.

There is no auth.log; pretty sure that’s /var/log/secure on CentOS, and again it’s pretty boring stuff right before it goes out. Just a bunch of repeating entries:

Oct 11 14:19:19 pbx runuser: pam_unix(runuser:session): session opened for user asterisk by (uid=0)
Oct 11 14:19:21 pbx runuser: pam_unix(runuser:session): session closed for user asterisk

I will bow out now, as I don’t do RH anymore :wink: But PAM (Pluggable Authentication Modules) is kinda instrumental in Linux security at a very basic level, so it’s probably not good to disable it. Did you restart the ssh service after your edit?

Yea, ssh has been restarted after the config change. Appreciate the effort so far.

I have never seen that behavior, but good luck . . .

I have had the same issue, but in my case it was the firewall blocking the IP due to too many login failures (I assume because the IP gets added to the blocked list EVEN THOUGH IT’S IN THE SAFE NETWORK LIST AND THE WHITELIST!).
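
For anyone wanting to check whether that’s what is happening, fail2ban will tell you which jails are active and what is currently banned (jail names vary by install, so substitute whatever the first command lists):

fail2ban-client status
fail2ban-client status <jail-name>
iptables -L -n | grep <your-ip>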

I do not think that’s the case here, mostly because logging completely stops at the same time as well and nothing is logged to the fail2ban log either.

I suspect that the iptables chains are wrongly ordered.

iptables -L -n --line-numbers

can help identify what happens and when
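
If you can’t catch it live, it may also be worth snapshotting the rules periodically so there is something to diff against after the next power cycle; run by hand or from a script, something like:

iptables-save > /root/iptables-$(date +%F-%H%M).rules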


Yeah, in my case I actually get an email with the fail2ban notice, so you’re probably right.