We have HA configured on PBX Firmware:6.12.65-20, PBX Service Pack:1.0.0.0, Asterisk 13.0.0, HA Module 12.0.1.1.
Nodes freepbx-a and freepbx-b are showing states of WFConnection and NodeDown, respectively.
When we go to the node management panel and click the Online button to bring freepbx-b online, there are no errors and the message returned is “freepbx-b has been set to Online”.
However, nothing changes. It still appears that the second node is offline and if we refresh the page, the Online button is still there, as though we did not change anything.
I would open a support ticket with us at support.schmoozecom.com so someone can take a look. Also make sure you have the latest HA module installed from FreePBX Module Admin, as we fixed a few bugs yesterday in the FreePBX 12 HA module that only affect 12 systems.
In the meantime, I believe I have found a clue. Running pcs status on the freepbx-a node, the following is returned:
[root@freepbx-a ~]# pcs status
Cluster name:
Last updated: Tue Nov 25 12:54:34 2014
Last change: Tue Nov 25 09:07:55 2014 via cibadmin on freepbx-a
Stack: cman
Current DC: freepbx-a - partition WITHOUT quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured
20 Resources configured
Online: [ freepbx-a ]
OFFLINE: [ freepbx-b ]
Full list of resources:
spare_ip (ocf::heartbeat:IPaddr2): Started freepbx-a
floating_ip (ocf::heartbeat:IPaddr2): Started freepbx-a
Master/Slave Set: ms-asterisk [drbd_asterisk]
Masters: [ freepbx-a ]
Stopped: [ freepbx-b ]
Master/Slave Set: ms-mysql [drbd_mysql]
Masters: [ freepbx-a ]
Stopped: [ freepbx-b ]
Master/Slave Set: ms-httpd [drbd_httpd]
Masters: [ freepbx-a ]
Stopped: [ freepbx-b ]
Master/Slave Set: ms-spare [drbd_spare]
Masters: [ freepbx-a ]
Stopped: [ freepbx-b ]
spare_fs (ocf::heartbeat:Filesystem): Started freepbx-a
Resource Group: mysql
mysql_fs (ocf::heartbeat:Filesystem): Started freepbx-a
mysql_ip (ocf::heartbeat:IPaddr2): Started freepbx-a
mysql_service (ocf::heartbeat:mysql): Started freepbx-a
Resource Group: asterisk
asterisk_fs (ocf::heartbeat:Filesystem): Started freepbx-a
asterisk_ip (ocf::heartbeat:IPaddr2): Started freepbx-a
asterisk_service (ocf::heartbeat:freepbx): Started freepbx-a
Resource Group: httpd
httpd_fs (ocf::heartbeat:Filesystem): Started freepbx-a
httpd_ip (ocf::heartbeat:IPaddr2): Started freepbx-a
httpd_service (ocf::heartbeat:apache): Started freepbx-a
PCSD Status:
Error: no nodes found in corosync.conf
And from freepbx-b:
[root@freepbx-b ~]# pcs status
Cluster name:
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Tue Nov 25 13:03:48 2014
Last change: Tue Nov 25 10:12:21 2014 via crmd on freepbx-b
Stack: cman
Current DC: freepbx-b - partition WITHOUT quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured
0 Resources configured
Node freepbx-a: UNCLEAN (offline)
Online: [ freepbx-b ]
Full list of resources:
PCSD Status:
Error: no nodes found in corosync.conf
We have also already tried manually setting freepbx-b to unstandby using pcs and the cluster repair script procedures from the HA wiki.
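For anyone comparing two outputs like the ones above, here is a minimal Python sketch that pulls out the lines that matter and flags the pattern seen here: each node names itself as DC and neither partition has quorum. The function names and the abridged sample strings are my own illustration and are not part of pcs or the HA module. It only reads text that `pcs status` has already printed, and it assumes the line formats shown in this thread (Pacemaker 1.1.10 on cman).

```python
import re

# Abridged copies of the two outputs posted above (illustration only).
NODE_A = """Current DC: freepbx-a - partition WITHOUT quorum
2 Nodes configured
20 Resources configured
Online: [ freepbx-a ]
OFFLINE: [ freepbx-b ]
"""

NODE_B = """Current DC: freepbx-b - partition WITHOUT quorum
2 Nodes configured
0 Resources configured
Node freepbx-a: UNCLEAN (offline)
Online: [ freepbx-b ]
"""


def summarize_pcs_status(text):
    """Extract DC, quorum flag, resource count and node lists from pcs status text."""
    summary = {"dc": None, "quorum": None, "resources": None,
               "online": [], "offline": [], "unclean": []}
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Current DC: (\S+) - partition (WITHOUT|with) quorum", line)
        if m:
            summary["dc"] = m.group(1)
            summary["quorum"] = (m.group(2) == "with")
            continue
        m = re.match(r"(\d+) Resources configured", line)
        if m:
            summary["resources"] = int(m.group(1))
            continue
        m = re.match(r"(Online|OFFLINE): \[\s*(.*?)\s*\]", line)
        if m:
            key = "online" if m.group(1) == "Online" else "offline"
            summary[key].extend(m.group(2).split())
            continue
        m = re.match(r"Node (\S+): UNCLEAN", line)
        if m:
            summary["unclean"].append(m.group(1))
    return summary


def looks_partitioned(status_a, status_b):
    """True when the two nodes disagree on the DC and neither has quorum."""
    a = summarize_pcs_status(status_a)
    b = summarize_pcs_status(status_b)
    different_dc = None not in (a["dc"], b["dc"]) and a["dc"] != b["dc"]
    no_quorum = a["quorum"] is False and b["quorum"] is False
    return different_dc and no_quorum
```

Feeding it the saved output from each node, `looks_partitioned(NODE_A, NODE_B)` flags the situation in this thread: two nodes that each believe they are alone, which points at connectivity between them and not at the Online button.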
This is pretty cool. What’s happened is that the freepbx-a and freepbx-b machines are PARTLY firewalled. The installer for ‘join a cluster’ just does some basic tests - trying to SSH and Ping between the hosts, but doesn’t try everything.
I’m not sure what’s going to happen when you remove the firewall that’s blocking connectivity between the machines. I’d, honestly, suggest that you don’t even try. It’s possible that the ‘blank’ cluster (on -b) may overwrite the full cluster (on -a). It’s UNLIKELY, but as it’s a newer cluster, the timestamps and serial numbers may conflict, and … well, let’s not.
So. I’d suggest just completely reinstalling -b, or, if that’s going to be annoyingly difficult, you can open a commercial support ticket and I’ll manually unravel the cluster from -b. Then, remove the firewall between the machines and reinstall -b, and have it rejoin the cluster.
I would like to write some more checks to validate the network connectivity in more ways, but the nodes talk amongst themselves in so many different ways that it would be a massive undertaking to validate them all!
Okay, reinstall is done. Rejoined to the cluster, went to HA management panel and ran cluster health checks. Returned one error with DRBD symlinks on other node, and requested the following be run on b:
After doing so, all checks pass with green, just as in the screenshots in my first post. Just as before, I click “Online” to bring b into the cluster, and just as before I am told it is now online.
Going back to the status page [also checked using pcs status], we are at square one.
Thanks to both of you for the quick replies! No iptables or other configs have been altered on either node outside the initial setup. We made sure to only configure static IP assignment, module updates and HA.
There is no physical firewall between the nodes (they are in a lab VLAN). I will check with our networking team to make sure there are no ACLs present which might be interfering.
Is there a place in the documentation or wiki that lists the required ports in this type of configuration?
There is, and the answer is ‘no filtering at all, whatsoever, not even a little bit that you think won’t hurt’.
There is at least multicast, and potentially non-IP traffic flowing across that link. It must be 100% open. That’s why we recommend a physical cable, so no-one can accidentally add a filter that they think won’t hurt.
Edit: You do raise a good point, it’s not explicitly spelt out in the documentation. It is now!
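If you want a quick sanity check that multicast gets across the link at all, something like the following Python sketch could help. The group address, port and function names are hypothetical values picked for illustration; they are not the addresses the cluster stack itself uses. A probe that arrives only shows that this one multicast group passes. It says nothing about non-IP traffic, so it does not prove the link is fully open.

```python
import socket
import struct

# Hypothetical test values, NOT the cluster's real multicast settings.
TEST_GROUP = "239.255.42.99"
TEST_PORT = 50000


def membership_request(group, iface_ip="0.0.0.0"):
    """Build the 8-byte ip_mreq structure used by IP_ADD_MEMBERSHIP."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface_ip))


def listen_once(group=TEST_GROUP, port=TEST_PORT, iface_ip="0.0.0.0", timeout=10.0):
    """Run on one node: wait for one probe; return (payload, sender_ip) or None."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        # Join the group on the chosen interface (0.0.0.0 lets the kernel pick).
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        membership_request(group, iface_ip))
        sock.settimeout(timeout)
        try:
            data, (sender, _sender_port) = sock.recvfrom(1024)
        except socket.timeout:
            return None
        return data, sender
    finally:
        sock.close()


def send_probe(payload=b"ha-probe", group=TEST_GROUP, port=TEST_PORT, iface_ip=None):
    """Run on the other node: send one multicast datagram with TTL 1."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        if iface_ip:
            # Force the outgoing interface, e.g. the internal link's address.
            sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                            socket.inet_aton(iface_ip))
        sock.sendto(payload, (group, port))
    finally:
        sock.close()
```

Start `listen_once()` on one node, then call `send_probe()` on the other within the timeout, passing each node's internal interface address as `iface_ip`. Repeat in the opposite direction, since a filter can easily be one-way.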
Fortunately the servers are physically close together and the phrase “as close to a raw networking cable as possible” set off a light bulb.
As soon as I connected the nodes using a crossover cable on the eth1 [internal] interfaces, both are showing online and pcs confirms the cluster is in a healthy state.
Thanks for making the wiki edit, as it directly resolved our issue in this case.