DNS lookups causing trouble with firewall? Possibly DynDNS related

Our pbx has been acting up at exactly 80-minute intervals: all the phones drop offline and isymphony quits working. The pattern I see is that each time there is a new entry in /var/log/messages, and it’s resolving a URL. The phones are offline for anywhere from 10 to 40 seconds and any current calls are dropped.

Nov 14 10:43:43 bkssfree php: /sbin/iptables -A fpbxhosts -s xx.xx.xx.xx/32 -j zone-trusted
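
A quick way to confirm the 80-minute pattern is to measure the gaps between these entries. A minimal Python sketch, assuming the syslog format shown above; the helper names are hypothetical, the year is an assumption (syslog lines do not carry one), and the second and third sample lines are invented purely for illustration:

```python
from datetime import datetime

def parse_syslog_time(line, year=2016):
    """Parse the 'Mon DD HH:MM:SS' prefix of a syslog line.
    The year is an assumption because syslog does not record it."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def gaps_in_minutes(lines, marker="fpbxhosts"):
    """Minutes between consecutive log lines that contain `marker`."""
    times = [parse_syslog_time(l) for l in lines if marker in l]
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]

# First line is from the post; the other two are made-up illustration data.
sample = [
    "Nov 14 10:43:43 bkssfree php: /sbin/iptables -A fpbxhosts -s xx.xx.xx.xx/32 -j zone-trusted",
    "Nov 14 12:03:43 bkssfree php: /sbin/iptables -A fpbxhosts -s xx.xx.xx.xx/32 -j zone-trusted",
    "Nov 14 13:23:44 bkssfree php: /sbin/iptables -A fpbxhosts -s xx.xx.xx.xx/32 -j zone-trusted",
]
print(gaps_in_minutes(sample))
```

On a real system the list would come from reading /var/log/messages instead of `sample`; gaps that cluster around 80 would support the interval theory.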

Interesting that the only time the phones show unreachable in the /var/log/asterisk/full log without a matching entry in the messages log is at 5:22. Only one of the URLs is not on DynDNS; it uses the No-IP service via my home internet connection. That IP is also the only one that does not appear in the messages log…

Any ideas?
Anyone having similar issues?
Is there a limit to how many trusted networks you can put into the firewall? (ours currently has 56 entries)

relevant info

FreePBX 13.0.190.2
PBX Firmware:10.13.66-17
System Firewall 13.0.42

Ironically this stopped happening right after I posted this thread. This has happened in the past as well for a few hours and then stopped.

This time I was ready for it. I did the following:

  • had 3 browser tabs open:
      a) URL
      b) external IP
      c) internal IP (VPNd in to the data center)
  • had putty open
  • had winscp open
  • had a call going from my cell phone (was in the conference room just listening to hold music)
  • had isymphony open

Every connection broke, with the (sort of) exception of the internal IP. When I tried to click on something there, it had logged me out; I was able to load the login screen again, but by that point the phones were back up.

I think it’s DNS related but clearly it’s taking the whole system down, not just kicking people out that are coming from those URLs.

http://issues.freepbx.org/browse/FREEPBX-13411 maybe?

(if it is, it was published today, at least on the “edge” track…)

Thanks for the heads up. Just tried upgrading the firewall so we’ll see what happens, but it’s really weird that this is only affecting a couple of the systems.

Hi!

No problem!

It’s timing related, I am not that surprised that the results are unpredictable…

Good luck and have a nice day!

Nick

Upgrading the firewall module has had no effect. Uninstalling and reinstalling the firewall module had no effect. Also tried removing all URLs from the firewall to no avail. In the ticket they declared it fixed… I’m going to just rebuild a couple of these PBXs this evening and hope for the best.

Hi!

@xrobau went on a hunch to fix this, so either his hunch was wrong or it’s another problem altogether…

Good luck and have a nice day!

Nick

Firewall won’t stop existing traffic. I think that something else is going on, and at a guess it’s DNS related.

My guess is you have something in Asterisk (a trunk or something) that has a hostname in it, and that is failing to respond. Asterisk (in chan_sip, and in pjsip prior to 13.11) will lock up temporarily when a DNS lookup is happening.
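
To see whether a hostname used in a trunk is slow to resolve, you can time the lookup from the PBX itself. This is only a rough sketch: it measures how long the system resolver takes, it does not reproduce what Asterisk does internally, and `time_lookup` is a hypothetical helper name:

```python
import socket
import time

def time_lookup(hostname):
    """Resolve `hostname` with the system resolver.
    Returns (seconds_taken, sorted address list or an error string)."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(hostname, None)
        result = sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        result = f"lookup failed: {exc}"
    return time.monotonic() - start, result

# Usage (the hostname is a placeholder, substitute the one from your trunk):
#   elapsed, result = time_lookup("trunk.example.com")
#   print(f"{elapsed:.2f}s {result}")
```

A lookup that takes many seconds, or fails only some of the time, would fit the stall described above.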

it feels like a dns issue, but what has us tearing our hair out is that when it does occur we lose all network connection. we can’t ssh or https into the system. all calls drop. if the asterisk stall can cause the network to stall, then that would explain it. we do have one trunk that uses a url, which i have just disabled. we will see if disabling this trunk changes things.

the other reason we say it feels like a dns issue is that we also have a Trixbox pro installation that experiences something similar. as you may remember, Trixbox pro uses url’s for just about everything, although in this case the trunks themselves are using ip authentication and have actual ip addresses in them.

That’s interesting. When you say you can’t SSH, do you mean the connection establishes, but then SITS THERE? Or do you get a connection timed out? Which exact error are you getting? (The first: DNS, Second: Lower level network issue)
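
If it helps to pin down which of the two cases it is, a small probe run from another machine during an outage can record the exact TCP-level outcome. A minimal sketch, assuming Python 3 is available on the testing machine; `probe_tcp` is a hypothetical helper, and it only separates "TCP connect never completes" from "TCP connects but nothing comes back", it does not prove DNS is the cause:

```python
import socket

def probe_tcp(host, port=22, timeout=5.0):
    """Classify how a TCP connection attempt to host:port behaves."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "connect timed out"        # nothing answered at all
    except ConnectionRefusedError:
        return "connection refused"       # host is up, port is closed
    except OSError as exc:
        return f"network error: {exc}"    # e.g. no route to host
    with sock:
        sock.settimeout(timeout)
        try:
            banner = sock.recv(256)       # an SSH server normally sends a version banner
        except socket.timeout:
            return "connected, then silence"
        except OSError as exc:
            return f"connected, then error: {exc}"
    return "connected, got banner" if banner else "connected, then closed"

# Usage: pass the PBX's IP address (not a hostname) so no lookup is involved:
#   print(probe_tcp("192.0.2.10"))   # placeholder address
```

"connect timed out" points at the lower-level network case; any of the "connected" results mean packets are getting through and the delay is higher up.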

ssh times out. as does http/https, even when using the ip address (no url). we have at least two nics (one external and one internal) and the pbx becomes unreachable via both NIC’s during this time. it is as if the network has been shut down on the pbx. it is always brief, perhaps a minute or so. we do have a couple of url’s in both the firewall and the fail2ban white list. i have not looked at how fail2ban handles the url’s, but what i do know is that this is a weird one. and to make matters even fuzzier, we have not had a problem all day so far (knock on wood). the Trixbox i mentioned and this FreePBX instance are running in different pools in the virtual environment and share nothing except that they both use google as one of their dns servers.

ssh times out.

What, exactly, is the error? Can you copy and paste it please?

The other potential issue is that if both interfaces are using DHCP, they could be fighting over the default route.

the problem is that there is no error reported. the pbx behaves as if both NIC’s were down. we can access the console directly, but any attempt to access the pbx via a NIC does not connect. if you are using winscp, it simply says searching for host; if you use putty, it never brings up the log in prompt; if you use http or https via a browser, the error message depends on the browser - you get something like the url/ip is not responding, or it took too long, etc. both NIC’s have static ip addresses; the external has the gateway address and the internal uses a route statement.

as i said it is as if something stalls the network interfaces on the pbx but only for a minute or so. i did not have a lot of hair to begin with and i have a lot less now.

i am half tempted to simply rebuild the pbx from the ground up and only bring across the system recordings to see if that fixes the issue. a lot of work, and it’s brute force, but at this point i am about out of things to try. last night i uninstalled the firewall and then reinstalled and reconfigured it.

the only data point i have is that after a reboot the fwconsole start script fails, i think because of an issue with fail2ban. i know that if, after a reboot, i restart fail2ban from the console (not putty) and then run fwconsole start, everything works. if i don’t restart fail2ban, the start script says “unable to resolve host name” immediately after validating the license information for xxxxxxxxxxxxxxx (the deployment id). tonight i am going to see if i can capture the output of fwconsole start in the rc.d/rc.local file and perhaps that will give me a clue.

i don’t know if the two things are related, but i do know that restarting fail2ban can sometimes take more than a minute. so my thinking (as wishy washy as it is) is that when either the firewall or fail2ban refreshes things, it temporarily blocks all access to the machine via iptables. normally this is not a huge deal unless it takes an extended amount of time to rebuild all the tables. i am not smart enough on iptables yet to know if part of the refresh process could block traffic on open connections. voice calls go silent when the problem occurs. open ssh sessions time out.

here is the routing table in the pbx.

Kernel IP routing table
Destination     Gateway         Genmask          Flags Metric Ref Use Iface
xx.xxx.xxx.32   *               255.255.255.240  U     0      0   0   eth0
192.168.5.0     *               255.255.255.0    U     0      0   0   eth1
10.100.100.0    *               255.255.255.0    U     0      0   0   tun0
link-local      *               255.255.0.0      U     1002   0   0   eth0
link-local      *               255.255.0.0      U     1003   0   0   eth1
192.168.0.0     192.168.5.1     255.255.0.0      UG    0      0   0   eth1
default         xx.xxx.xxx…33   0.0.0.0          UG    0      0   0   eth0
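
For the "fighting over the default route" theory, the thing to look for in this output is more than one default line. A small sketch that pulls the default routes out of `route` output, assuming the eight-column layout shown above; `default_routes` is a hypothetical helper and the sample rows are the ones from this post, placeholders included:

```python
def default_routes(route_output):
    """Return (gateway, iface) for every default route in `route` output."""
    routes = []
    for line in route_output.splitlines():
        fields = line.split()
        # Data rows have 8 columns; the default route is shown as
        # "default" (or 0.0.0.0 when route is run with -n).
        if len(fields) == 8 and fields[0] in ("default", "0.0.0.0"):
            routes.append((fields[1], fields[7]))
    return routes

sample = """Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
192.168.5.0 * 255.255.255.0 U 0 0 0 eth1
192.168.0.0 192.168.5.1 255.255.0.0 UG 0 0 0 eth1
default xx.xxx.xxx…33 0.0.0.0 UG 0 0 0 eth0
"""
print(default_routes(sample))
```

Here there is a single default route via eth0, which fits the static setup described earlier. Capturing the table again during an outage and comparing would show whether it changes.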

This is the question I’m trying to get you to answer.

If you read back to what I said originally:

You really need to be specific.

You’re IMPLYING that it connects, but just sits there without ever showing a login prompt, rather than coming up with an error from Putty saying ‘connection timed out’ or ‘connection refused’ or something like that.

i will double check next time it happens to verify if we get the screen but no log in prompt

If that’s what is happening, then one of the machines that is meant to be doing DNS is not responding at all. Not connection refused, not host unreachable, nothing is being received.

Remove all DNS entries apart from 127.0.0.1 from that machine (if it’s distro) and let dnsmasq handle it. Otherwise, only have 8.8.8.8 in there.
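
To check whether a given nameserver answers at all (as opposed to staying silent), a hand-built UDP query is enough. A minimal sketch using only the Python standard library; `build_query` and `nameserver_answers` are hypothetical names, the query is a plain A-record lookup for an arbitrary test name, and the check only confirms that some reply with the matching ID came back:

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID, used to match the reply to the query

def build_query(name, query_id=QUERY_ID):
    """Build a minimal DNS query packet for an A record of `name`."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(part)]) + part.encode("ascii") for part in name.split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN

def nameserver_answers(server, name="example.com", timeout=2.0):
    """True if `server` sends back any reply carrying our query ID."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            data, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts: nothing usable came back
            return False
    return data[:2] == struct.pack(">H", QUERY_ID)

# Usage, with the two servers mentioned in this thread:
#   for server in ("127.0.0.1", "8.8.8.8"):
#       print(server, nameserver_answers(server))
```

Running this in a loop during one of the outages would show whether either server goes quiet at that moment.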

our current dns settings are very basic

127.0.0.1
8.8.8.8

but as i said i have to test next time it happens to verify if we actually get the screen or if the connection times out

using 127.0.0.1 as a nameserver requires that you have either dnsmasq or bind working properly, does that apply?

dnsmasq will cache the request. not necessarily bind.

What are you using for name resolution?

dnsmasq, with the default config as installed by the distro. identical to all the other systems we have running.