HA Node Management Not Working

ds_scalar · November 25, 2014, 4:11pm

We have HA configured on PBX Firmware:6.12.65-20, PBX Service Pack:1.0.0.0, Asterisk 13.0.0, HA Module 12.0.1.1.

Both nodes freepbx-a and freepbx-b are showing a state of WFConnection / NodeDown, respectively.

When we go to the node management panel and click the Online button to bring freepbx-b online, there are no errors and the message returned is “freepbx-b has been set to Online”.

However, nothing changes. It still appears that the second node is offline and if we refresh the page, the Online button is still there, as though we did not change anything.

Detailed screenshots of the above can be found here.

tonyclewis · November 25, 2014, 4:15pm

I would open a support ticket with us at support.schmoozecom.com so someone can take a look. Also make sure you have the latest HA module installed from FreePBX Module Admin, as yesterday we fixed a few bugs in the FreePBX 12 HA module that only affect 12 systems.

ds_scalar · November 25, 2014, 4:35pm

Yes, we have 12.0.1.1 installed. All modules are up to date and, as mentioned above, we are on build 6.12.65-20.

We already had the bug you mentioned take down one of our nodes during HA setup, so we will reference that case on the new ticket.

ds_scalar · November 25, 2014, 6:00pm

In the meantime, I believe I have found a clue. Running pcs status on the freepbx-a node, the following is returned:

[root@freepbx-a ~]# pcs status
Cluster name: 
Last updated: Tue Nov 25 12:54:34 2014
Last change: Tue Nov 25 09:07:55 2014 via cibadmin on freepbx-a
Stack: cman
Current DC: freepbx-a - partition WITHOUT quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured
20 Resources configured


Online: [ freepbx-a ]
OFFLINE: [ freepbx-b ]

Full list of resources:

 spare_ip       (ocf::heartbeat:IPaddr2):       Started freepbx-a 
 floating_ip    (ocf::heartbeat:IPaddr2):       Started freepbx-a 
 Master/Slave Set: ms-asterisk [drbd_asterisk]
     Masters: [ freepbx-a ]
     Stopped: [ freepbx-b ]
 Master/Slave Set: ms-mysql [drbd_mysql]
     Masters: [ freepbx-a ]
     Stopped: [ freepbx-b ]
 Master/Slave Set: ms-httpd [drbd_httpd]
     Masters: [ freepbx-a ]
     Stopped: [ freepbx-b ]
 Master/Slave Set: ms-spare [drbd_spare]
     Masters: [ freepbx-a ]
     Stopped: [ freepbx-b ]
 spare_fs       (ocf::heartbeat:Filesystem):    Started freepbx-a 
 Resource Group: mysql
     mysql_fs   (ocf::heartbeat:Filesystem):    Started freepbx-a 
     mysql_ip   (ocf::heartbeat:IPaddr2):       Started freepbx-a 
     mysql_service      (ocf::heartbeat:mysql): Started freepbx-a 
 Resource Group: asterisk
     asterisk_fs        (ocf::heartbeat:Filesystem):    Started freepbx-a 
     asterisk_ip        (ocf::heartbeat:IPaddr2):       Started freepbx-a 
     asterisk_service   (ocf::heartbeat:freepbx):       Started freepbx-a 
 Resource Group: httpd
     httpd_fs   (ocf::heartbeat:Filesystem):    Started freepbx-a 
     httpd_ip   (ocf::heartbeat:IPaddr2):       Started freepbx-a 
     httpd_service      (ocf::heartbeat:apache):        Started freepbx-a 


PCSD Status:
Error: no nodes found in corosync.conf

And from freepbx-b:

[root@freepbx-b ~]# pcs status
Cluster name: 
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Tue Nov 25 13:03:48 2014
Last change: Tue Nov 25 10:12:21 2014 via crmd on freepbx-b
Stack: cman
Current DC: freepbx-b - partition WITHOUT quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured
0 Resources configured


Node freepbx-a: UNCLEAN (offline)
Online: [ freepbx-b ]

Full list of resources:



PCSD Status:
Error: no nodes found in corosync.conf

We have also already tried manually setting freepbx-b to unstandby using pcs and the cluster repair script procedures from the HA wiki.
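
The two reports above can be compared mechanically. The following is a minimal sketch, not part of the HA module: `summarize_pcs_status` is a hypothetical helper name, and it assumes the `Current DC:` line has the same layout as in the output shown here.

```shell
# Summarize a `pcs status` report read from stdin.
# Prints: dc=<node> quorum=<yes|no>
# Assumes the "Current DC: <node> - partition WITH[OUT] quorum" layout
# seen in the output above; other pcs versions may differ.
summarize_pcs_status() {
    awk '
        /^Current DC:/ {
            dc = $3
            quorum = ($0 ~ /WITHOUT quorum/) ? "no" : "yes"
        }
        END { printf "dc=%s quorum=%s\n", dc, quorum }
    '
}

# On a live node (pcs is installed on both nodes in this thread):
#   pcs status | summarize_pcs_status
```

If each node names itself as DC and both report `quorum=no`, as in the two reports above, each node is running its own partition and cannot see its peer.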

xrobau · November 25, 2014, 8:08pm

This is pretty cool. What’s happened is that the freepbx-a and freepbx-b machines are PARTLY firewalled. The installer for ‘join a cluster’ just does some basic tests - trying to SSH and Ping between the hosts, but doesn’t try everything.

I’m not sure what’s going to happen when you remove the firewall that’s blocking connectivity between the machines. I’d, honestly, suggest that you don’t even try. It’s possible that the ‘blank’ cluster (on -b) may overwrite the full cluster (on -a). It’s UNLIKELY, but as it’s a newer cluster, the timestamps and serial numbers may conflict, and … well, let’s not.

So. I’d suggest just completely reinstalling -b, or, if that’s going to be annoyingly difficult, you can open a commercial support ticket and I’ll manually unravel the cluster from -b. Then, remove the firewall between the machines and reinstall -b, and have it rejoin the cluster.

I would like to write some more checks to validate the network connectivity in more ways, but, they talk amongst themselves in so many different ways that it would be a massive undertaking to validate them all!

–Rob

ds_scalar · November 25, 2014, 8:22pm

Reinstalling the distro on b as we speak. Will edit this reply once it’s back online and rejoin is attempted…

ds_scalar · November 25, 2014, 9:05pm

Okay, reinstall is done. Rejoined to the cluster, went to HA management panel and ran cluster health checks. Returned one error with DRBD symlinks on other node, and requested the following be run on b:

mv /var/lib/php/session /var/lib/php/session.fix && ln -s /drbd/httpd/session /var/lib/php/session
mv /etc/httpd /etc/httpd.fix && ln -s /drbd/httpd/etc /etc/httpd
mv /var/www /var/www.fix && ln -s /drbd/httpd/www /var/www
mv /tftpboot /tftpboot.fix && ln -s /drbd/httpd/tftpboot /tftpboot

After doing so, all checks pass with green, just as in the screenshots in my first post. Just as before, I click “Online” to bring b into the cluster, and just as before I am told it is now online.

Going back to the status page [also checked using pcs status], we are at square one.
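
The symlink fix above can be double-checked from the shell before going back to the panel. This is a rough sketch: `check_link` and `verify_drbd_links` are hypothetical helper names, and the path pairs are copied from the four commands the health check requested.

```shell
# check_link LINK EXPECTED_TARGET
# Succeeds only if LINK is a symlink whose target is EXPECTED_TARGET.
check_link() {
    [ -L "$1" ] && [ "$(readlink "$1")" = "$2" ]
}

# Check the four links created by the commands in the post above.
# Returns non-zero if any of them is missing or points elsewhere.
verify_drbd_links() {
    rc=0
    while read -r link target; do
        if check_link "$link" "$target"; then
            echo "ok       $link -> $target"
        else
            echo "MISMATCH $link (expected -> $target)"
            rc=1
        fi
    done <<'EOF'
/var/lib/php/session /drbd/httpd/session
/etc/httpd /drbd/httpd/etc
/var/www /drbd/httpd/www
/tftpboot /drbd/httpd/tftpboot
EOF
    return $rc
}
```

Running `verify_drbd_links` on the node that was fixed prints one line per path. It only confirms the links themselves, not that the DRBD resources behind them are mounted.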

tonyclewis · November 25, 2014, 9:32pm

But it should be online already. Do you have some type of firewall between these two systems? Something seems to be stopping them from communicating.

xrobau · November 25, 2014, 10:45pm

You still haven’t removed whatever’s filtering data between the two machines.

Is there possibly some iptables rules on -a that you’ve accidentally added?
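
One way to answer the iptables question is to scan the rule listing for anything that could drop traffic. The sketch below uses a hypothetical helper, `find_blocking_rules`, and assumes the standard `iptables -S` listing format (`-P <chain> <policy>` and `-A <chain> ... -j <target>`). It inspects only the host's own filter table, so it cannot reveal switch ACLs or other filtering on the network between the nodes.

```shell
# Read `iptables -S` output on stdin and print every line that could
# block traffic: a chain policy other than ACCEPT, or a rule that
# jumps to DROP or REJECT.
# Exit status: 0 if nothing suspicious was found, 1 otherwise.
find_blocking_rules() {
    awk '
        $1 == "-P" && $3 != "ACCEPT"      { print; found = 1; next }
        $1 == "-A" && / -j (DROP|REJECT)/ { print; found = 1 }
        END { exit (found ? 1 : 0) }
    '
}

# On each node, as root:
#   iptables -S | find_blocking_rules && echo "no blocking rules found"
```

An empty result here does not prove the link is clean. It only rules out the host firewall as the filter.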

ds_scalar · November 26, 2014, 2:11am

Thanks, both, for the quick replies! No iptables rules or other configs have been altered on either node beyond the initial setup. We made sure to configure only static IP assignment, module updates and HA.

There is no physical firewall between the nodes (they are in a lab VLAN). I will check with our networking team to make sure there are no ACLs present which might be interfering.

Is there a place in the documentation or wiki that lists the required ports in this type of configuration?

xrobau · November 26, 2014, 4:25am

There is, and the answer is ‘no filtering at all, whatsoever, not even a little bit that you think won’t hurt’.

There is at least multicast, and potentially non-IP traffic flowing across that link. It must be 100% open. That’s why we recommend a physical cable, so no-one can accidentally add a filter that they think won’t hurt.

Edit: You do raise a good point, it’s not explicitly spelt out in the documentation. It is now!

http://wiki.freepbx.org/display/FCM/FreePBX+HA-Setting+up+the+Master+and+Slave+Nodes

ds_scalar · November 26, 2014, 12:58pm

Fortunately the servers are physically close together and the phrase “as close to a raw networking cable as possible” set off a light bulb.

As soon as I connected the nodes using a crossover cable on the eth1 [internal] interfaces, both are showing online and pcs confirms the cluster is in a healthy state.

Thanks for making the wiki edit, as it directly resolved our issue in this case.
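
For anyone who wants to script the "both are showing online" check, a small sketch follows. `online_nodes` and `all_online` are hypothetical helper names, and the parsing assumes the `Online: [ ... ]` line format seen in the `pcs status` output earlier in this thread.

```shell
# Print the node names from the "Online: [ ... ]" line of `pcs status`,
# one per line. Fields 1 and 2 are "Online:" and "[", the last is "]".
online_nodes() {
    awk '/^Online: \[/ { for (i = 3; i < NF; i++) print $i }'
}

# Succeed only if every node named on the command line is listed as online.
# Usage: pcs status | all_online freepbx-a freepbx-b
all_online() {
    nodes=$(online_nodes)
    for want in "$@"; do
        echo "$nodes" | grep -qxF "$want" || return 1
    done
}
```

This checks node membership only. Resource state still has to be read from the rest of the `pcs status` report.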

ds_scalar · November 26, 2014, 1:10pm

Had another quick question… There is documentation in /etc/sysconfig/network that explicitly forbids hostname changes, lest we break the cluster.

Is there a supported means of changing node hostnames?

tonyclewis · November 26, 2014, 4:02pm

No, hostnames must be left alone or it breaks the HA setup.

xrobau · November 26, 2014, 8:25pm

As tony said, no. Don’t change the hostname. They need to be set to that.

Why do you want to change the hostname?

–Rob