Our PBX has been acting up at exactly 80-minute intervals: all the phones drop offline and iSymphony quits working. The pattern I see is that there is a new entry in /var/log/messages at each instance, and it’s resolving a URL. The phones are offline for anywhere from 10 to 40 seconds and any current calls are dropped.
Nov 14 10:43:43 bkssfree php: /sbin/iptables -A fpbxhosts -s xx.xx.xx.xx/32 -j zone-trusted
Interesting that the only time the phones show unreachable in the /var/log/asterisk/full log where there is not also an entry in the messages log is at 5:22. Only one of the URLs is not on DynDNS; it uses the No-IP service via my home internet connection. That IP is also the only one that does not appear in the messages log…
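To check the 80-minute spacing rather than eyeball it, something like the sketch below could pull the matching lines out of /var/log/messages and print the gap between each. This is only a rough sketch: the function names, the `needle` string, and the `year` placeholder are my own assumptions (classic syslog lines carry no year), and the second matching sample line is made up for illustration.

```python
from datetime import datetime

def event_times(lines, needle="iptables -A fpbxhosts", year=2000):
    """Return timestamps of syslog lines containing `needle`.

    `year` is a placeholder assumption: the log prefix has no year field.
    """
    times = []
    for line in lines:
        if needle not in line:
            continue
        # Prefix looks like "Nov 14 10:43:43 host ..."; take the first 3 fields.
        stamp = " ".join(line.split()[:3])
        times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times

def gaps_in_minutes(times):
    """Minutes between each consecutive pair of timestamps."""
    return [round((b - a).total_seconds() / 60, 1) for a, b in zip(times, times[1:])]

# First line is from the log above; the other two are invented samples.
sample = [
    "Nov 14 10:43:43 bkssfree php: /sbin/iptables -A fpbxhosts -s xx.xx.xx.xx/32 -j zone-trusted",
    "Nov 14 10:50:00 bkssfree kernel: some unrelated line",
    "Nov 14 12:03:43 bkssfree php: /sbin/iptables -A fpbxhosts -s xx.xx.xx.xx/32 -j zone-trusted",
]
print(gaps_in_minutes(event_times(sample)))
```

On a real box you would pass `open("/var/log/messages")` instead of `sample`. If the firewall writes several rules in the same second, those show up as gaps of 0.0 and can be ignored.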
Anyone having similar issues?
Is there a limit to how many trusted networks you can put into the firewall? (ours currently has 56 entries)
System Firewall 13.0.42
This time I was ready for it. I did the following:
- 3 browser tabs open:
  b) external IP
  c) internal IP (VPN’d in to the data center)
- had PuTTY open
- had WinSCP open
- had a call going from my cell phone (was in the conference room just listening to hold music)
- had iSymphony open
Every connection broke, with the (sort of) exception of the internal IP: when I tried to click on something there it had logged me out, though I was able to load the login screen again. By that point the phones were back up.
I think it’s DNS related but clearly it’s taking the whole system down, not just kicking people out that are coming from those URLs.
Upgrading the firewall module has had no effect. Uninstalling and reinstalling the firewall module had no effect. Also tried removing all URLs from the firewall to no avail. In the ticket they declared it fixed… I’m going to just rebuild a couple of these PBXs this evening and hope for the best.
Firewall won’t stop existing traffic. I think that something else is going on, and at a guess it’s DNS related.
My guess is you have something in Asterisk (a trunk or something) that has a hostname in it, and that is failing to respond. Asterisk (in chan_sip, and in pjsip prior to 13.11) will lock up temporarily when a DNS lookup is happening.
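One way to test that guess is to time a blocking lookup for every hostname that appears in a trunk, the firewall, or the fail2ban whitelist, and see whether any of them is slow or failing. A minimal sketch, assuming Python is available on the PBX; `time_lookup` is a made-up helper name, and the `resolver` parameter is there only so the function can be exercised without touching the network.

```python
import socket
import time

def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Return (seconds_taken, error_text_or_None) for one blocking lookup."""
    start = time.monotonic()
    try:
        resolver(hostname, None)
        error = None
    except socket.gaierror as exc:
        # Resolution failed (NXDOMAIN, resolver unreachable, and so on).
        error = str(exc)
    return time.monotonic() - start, error

def report(hostnames):
    """Print one line per hostname with the lookup time and outcome."""
    for host in hostnames:
        seconds, error = time_lookup(host)
        print(f"{host}: {seconds:.2f}s {error or 'ok'}")
```

Calling `report([...])` with your own trunk and whitelist hostnames around the time the phones drop would show whether a lookup is taking seconds rather than milliseconds. This measures the system resolver only; it does not show what Asterisk itself is doing internally.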
it feels like a dns issue, but what has us tearing our hair out is that when it does occur we lose all network connection. we can’t ssh or https into the system. all calls drop. if the asterisk stall can cause the network to stall, then that would explain it. we do have one trunk that uses a url, which i have just disabled. we will see if disabling this trunk changes things.
the other reason we say it feels like a dns issue is that we also have a Trixbox pro installation that experiences something similar, and as you may remember Trixbox pro uses url’s for just about everything, although in this case the trunks themselves are using ip authentication and have actual ip addresses in them.
That’s interesting. When you say you can’t SSH, do you mean the connection establishes, but then SITS THERE? Or do you get a connection timed out? Which exact error are you getting? (The first: DNS, Second: Lower level network issue)
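To make that distinction concrete without relying on what PuTTY happens to display, a small probe run from another machine can report which of the two cases you are in. This is a rough sketch, not a definitive test: `classify_connection` and its return strings are invented for illustration, and the `connect` parameter exists only so the logic can be tried without a network.

```python
import socket

def classify_connection(host, port=22, timeout=5.0, connect=socket.create_connection):
    """Say whether a TCP connect fails outright or connects and then goes quiet."""
    try:
        sock = connect((host, port), timeout=timeout)
    except socket.timeout:
        # No answer to the connect at all: points at a lower-level network problem.
        return "connect timed out"
    except ConnectionRefusedError:
        return "connection refused"
    except OSError as exc:
        return f"other error: {exc}"
    with sock:
        sock.settimeout(timeout)
        try:
            # An SSH server normally sends its version banner first.
            banner = sock.recv(64)
        except socket.timeout:
            return "connected, no banner"
        return "banner received" if banner else "connected, then closed"
```

"connect timed out" matches the second case (lower-level network issue). Any of the "connected" results means TCP itself is working, which is the situation where a slow DNS lookup on the server is the more likely suspect.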
ssh times out, as does http/https, even when using the ip address (no url). we have at least two NICs (one external and one internal) and the pbx becomes unreachable via both NICs during this time. it is as if the network has been shut down on the pbx. it is always brief, perhaps a minute or so. we do have a couple of url’s in both the firewall and the fail2ban white list. i have not looked at how fail2ban handles the url’s. but what i do know is that this is a weird one. and to make matters even fuzzier, we have not had a problem all day so far (knock on wood). the Trixbox i mentioned and this FreePBX instance are running in different pools in the virtual environment and share nothing except that they both use Google as one of their dns servers.
the problem is that there is no error reported. the pbx behaves as if both NICs were down. we can access the console directly but any attempt to access the pbx via a NIC does not connect. if you use winscp, it simply says searching for host; if you use putty it never brings up the login prompt; if you use http or https via a browser the error message depends on the browser - you get something like the url/ip is not responding, or it took too long, etc. both NICs have static ip addresses; the external has the gateway address and the internal uses a route statement.
as i said it is as if something stalls the network interfaces on the pbx but only for a minute or so. i did not have a lot of hair to begin with and i have a lot less now.
i am half tempted to simply rebuild the pbx from the ground up and only bring across the system recordings to see if that fixes the issue. a lot of work, and it’s brute force, but at this point i am about out of things to try. last night i uninstalled the firewall and then reinstalled and reconfigured it.
my only data point is that after a reboot the fwconsole start script fails, i think because of an issue with fail2ban. i know that if, after a reboot, i restart fail2ban from the console (not putty) and then run fwconsole start, everything works. if i don’t restart fail2ban, the start script says “unable to resolve host name” immediately after validating license information for xxxxxxxxxxxxxxx (the deployment id). tonight i am going to see if i can capture the output of fwconsole start in the rc.d/rc.local file and perhaps that will give me a clue. i don’t know if the two things are related, but i do know that restarting fail2ban can sometimes take more than a minute. so my thinking (as wishy washy as it is) is that when either the firewall or fail2ban refreshes things it temporarily blocks all access to the machine via iptables. normally this is not a huge deal unless it takes an extended amount of time to rebuild all the tables. i am not smart enough on iptables yet to know if part of the refresh process could block traffic on open connections. voice calls go silent when the problem occurs. open ssh sessions time out.
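one way to test that theory would be to probe the pbx once a second from another machine, log whether each probe got through, and then line the outage windows up against the firewall and fail2ban entries in /var/log/messages. a minimal sketch of the bookkeeping part, assuming the probe results are already collected as (timestamp, reachable) pairs; `outage_windows` is a made-up name and the sample data below is invented.

```python
from datetime import datetime

def outage_windows(samples):
    """Collapse time-ordered (timestamp, reachable) samples into outage windows.

    Returns a list of (first_failed_probe, last_failed_probe) tuples,
    one per contiguous run of failed probes.
    """
    windows = []
    start = last = None
    for when, reachable in samples:
        if not reachable:
            if start is None:
                start = when          # a new outage begins here
            last = when
        elif start is not None:
            windows.append((start, last))  # outage ended on the previous probe
            start = None
    if start is not None:
        windows.append((start, last))      # still down at the end of the data
    return windows

# invented sample: up, down for two probes, up again
probes = [
    (datetime(2000, 11, 14, 10, 43, 42), True),
    (datetime(2000, 11, 14, 10, 43, 43), False),
    (datetime(2000, 11, 14, 10, 43, 44), False),
    (datetime(2000, 11, 14, 10, 43, 45), True),
]
for first, last in outage_windows(probes):
    print(first.time(), "to", last.time())
```

if every window starts within a second or two of an iptables line in the messages log, that would support the idea that the rule refresh is what cuts traffic. if the windows start first, the log entries are more likely a symptom than the cause.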
here is the routing table in the pbx.
Kernel IP routing table
Destination     Gateway         Genmask          Flags  Metric  Ref  Use  Iface
xx.xxx.xxx.32   *               255.255.255.240  U      0       0    0    eth0
192.168.5.0     *               255.255.255.0    U      0       0    0    eth1
10.100.100.0    *               255.255.255.0    U      0       0    0    tun0
link-local      *               255.255.0.0      U      1002    0    0    eth0
link-local      *               255.255.0.0      U      1003    0    0    eth1
192.168.0.0     192.168.5.1     255.255.0.0      UG     0       0    0    eth1
default         xx.xxx.xxx…33   0.0.0.0          UG     0       0    0    eth0
This is the question I’m trying to get you to answer.
If you read back to what I said originally:
You really need to be specific.
You’re IMPLYING that it connects but just sits there without ever showing a login prompt, rather than coming up with an error from Putty saying ‘connection timed out’ or ‘connection refused’ or something like that.