HA Node Management Not Working

We have HA configured on PBX Firmware 6.12.65-20, Asterisk 13.0.0, with the HA module installed.

The nodes freepbx-a and freepbx-b are showing states of WFConnection and NodeDown, respectively.

When we go to the node management panel and click the Online button to bring freepbx-b online, there are no errors and the message returned is “freepbx-b has been set to Online”.

However, nothing changes. It still appears that the second node is offline and if we refresh the page, the Online button is still there, as though we did not change anything.

Detailed screenshots of the above can be found here.

I would open a support ticket with us at support.schmoozecom.com so someone can take a look. Also make sure you have the latest HA module installed from FreePBX module admin, as yesterday we fixed a few bugs in the FreePBX 12 HA module that only affect 12 systems.

Yes, we have it installed. All modules are patched up to date and, as mentioned above, we are on build 6.12.65-20.

We already had the bug you mentioned take down one of our nodes during HA setup, so we will reference that case on the new ticket.

In the meantime, I believe I have found a clue. Running pcs status on the freepbx-a node, the following is returned:

[[email protected] ~]# pcs status
Cluster name: 
Last updated: Tue Nov 25 12:54:34 2014
Last change: Tue Nov 25 09:07:55 2014 via cibadmin on freepbx-a
Stack: cman
Current DC: freepbx-a - partition WITHOUT quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured
20 Resources configured

Online: [ freepbx-a ]
OFFLINE: [ freepbx-b ]

Full list of resources:

 spare_ip       (ocf::heartbeat:IPaddr2):       Started freepbx-a 
 floating_ip    (ocf::heartbeat:IPaddr2):       Started freepbx-a 
 Master/Slave Set: ms-asterisk [drbd_asterisk]
     Masters: [ freepbx-a ]
     Stopped: [ freepbx-b ]
 Master/Slave Set: ms-mysql [drbd_mysql]
     Masters: [ freepbx-a ]
     Stopped: [ freepbx-b ]
 Master/Slave Set: ms-httpd [drbd_httpd]
     Masters: [ freepbx-a ]
     Stopped: [ freepbx-b ]
 Master/Slave Set: ms-spare [drbd_spare]
     Masters: [ freepbx-a ]
     Stopped: [ freepbx-b ]
 spare_fs       (ocf::heartbeat:Filesystem):    Started freepbx-a 
 Resource Group: mysql
     mysql_fs   (ocf::heartbeat:Filesystem):    Started freepbx-a 
     mysql_ip   (ocf::heartbeat:IPaddr2):       Started freepbx-a 
     mysql_service      (ocf::heartbeat:mysql): Started freepbx-a 
 Resource Group: asterisk
     asterisk_fs        (ocf::heartbeat:Filesystem):    Started freepbx-a 
     asterisk_ip        (ocf::heartbeat:IPaddr2):       Started freepbx-a 
     asterisk_service   (ocf::heartbeat:freepbx):       Started freepbx-a 
 Resource Group: httpd
     httpd_fs   (ocf::heartbeat:Filesystem):    Started freepbx-a 
     httpd_ip   (ocf::heartbeat:IPaddr2):       Started freepbx-a 
     httpd_service      (ocf::heartbeat:apache):        Started freepbx-a 

PCSD Status:
Error: no nodes found in corosync.conf

And from freepbx-b:

[[email protected] ~]# pcs status
Cluster name: 
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Tue Nov 25 13:03:48 2014
Last change: Tue Nov 25 10:12:21 2014 via crmd on freepbx-b
Stack: cman
Current DC: freepbx-b - partition WITHOUT quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured
0 Resources configured

Node freepbx-a: UNCLEAN (offline)
Online: [ freepbx-b ]

Full list of resources:

PCSD Status:
Error: no nodes found in corosync.conf

We have also already tried manually setting freepbx-b to unstandby using pcs and the cluster repair script procedures from the HA wiki.
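
For reference, a quick way to pull just the node state out of that pcs status report, rather than scanning the whole thing by eye, is to match the OFFLINE line. This is only a sketch, shown here against a captured copy of the output above; on a live node you would pipe pcs status straight in instead of using the here-document.

```shell
# Sketch: list nodes reported OFFLINE by pcs status.
# On a live node:  pcs status | offline_nodes
offline_nodes() {
    # Pull the contents of the "OFFLINE: [ ... ]" line, one node per line
    sed -n 's/^OFFLINE: \[ \(.*\) \]$/\1/p' | tr ' ' '\n' | sed '/^$/d'
}

# Demo against a captured fragment of the output above:
offline_nodes <<'EOF'
Online: [ freepbx-a ]
OFFLINE: [ freepbx-b ]
EOF
# prints: freepbx-b
```

Watching that one line while attempting the Online button makes it obvious whether anything actually changed on the cluster side.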

This is pretty cool. What’s happened is that the freepbx-a and freepbx-b machines are PARTLY firewalled. The installer for ‘join a cluster’ just does some basic tests - trying to SSH and ping between the hosts - but doesn’t try everything.

I’m not sure what’s going to happen when you remove the firewall that’s blocking connectivity between the machines. I’d, honestly, suggest that you don’t even try. It’s possible that the ‘blank’ cluster (on -b) may overwrite the full cluster (on -a). It’s UNLIKELY, but as it’s a newer cluster, the timestamps and serial numbers may conflict, and … well, let’s not.

So. I’d suggest just completely reinstalling -b, or, if that’s going to be annoyingly difficult, you can open a commercial support ticket and I’ll manually unravel the cluster from -b. Then, remove the firewall between the machines and reinstall -b, and have it rejoin the cluster.

I would like to write some more checks to validate the network connectivity in more ways, but, they talk amongst themselves in so many different ways that it would be a massive undertaking to validate them all!


Reinstalling the distro on b as we speak. Will edit this reply once it’s back online and rejoin is attempted…

Okay, the reinstall is done. I rejoined freepbx-b to the cluster, went to the HA management panel, and ran the cluster health checks. They returned one error about DRBD symlinks on the other node and requested that the following be run on b:

mv /var/lib/php/session /var/lib/php/session.fix && ln -s /drbd/httpd/session /var/lib/php/session
mv /etc/httpd /etc/httpd.fix && ln -s /drbd/httpd/etc /etc/httpd
mv /var/www /var/www.fix && ln -s /drbd/httpd/www /var/www
mv /tftpboot /tftpboot.fix && ln -s /drbd/httpd/tftpboot /tftpboot
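
A small check along these lines can confirm the symlinks landed where the health check expects. This is only a sketch: the path pair in the demo mirrors the commands above, but the demo builds a scratch directory so it is safe to run anywhere; on the real node you would call check_link once per pair (e.g. check_link /var/www /drbd/httpd/www).

```shell
# Sketch: verify a relocated path is now a symlink to its DRBD target.
check_link() {
    # $1 = path, $2 = expected symlink target
    if [ "$(readlink "$1")" = "$2" ]; then
        echo "OK $1"
    else
        echo "BAD $1" >&2
        return 1
    fi
}

# Demo against a scratch directory (mirrors the /var/www pair above):
demo=$(mktemp -d)
mkdir -p "$demo/drbd/httpd/www"
ln -s "$demo/drbd/httpd/www" "$demo/www"
check_link "$demo/www" "$demo/drbd/httpd/www"
rm -rf "$demo"
```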

After doing so, all checks pass with green, just as in the screenshots in my first post. Just as before, I click “Online” to bring b into the cluster, and just as before I am told it is now online.

Going back to the status page [also checked using pcs status], we are at square one.

But it should be online already. Do you have some type of firewall between these two systems? Something seems to be stopping them from communicating.

You still haven’t removed whatever’s filtering data between the two machines.

Are there possibly some iptables rules on -a that you’ve accidentally added?

Thanks, both for the quick replies! No iptables or other configs have been altered on either node outside the initial setup. We made sure to only configure static IP assignment, module updates and HA.

There is no physical firewall between the nodes (they are in a lab VLAN). I will check with our networking team to make sure there are no ACLs present which might be interfering.

Is there a place in the documentation or wiki that lists the required ports in this type of configuration?

There is, and the answer is ‘no filtering at all, whatsoever, not even a little bit that you think won’t hurt’.

There is at least multicast, and potentially non-IP traffic flowing across that link. It must be 100% open. That’s why we recommend a physical cable, so no-one can accidentally add a filter that they think won’t hurt.
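
One cheap spot-check for the ‘no filtering at all’ rule is to scan the saved iptables ruleset for anything that so much as mentions the cluster link. A sketch, assuming the link is eth1 (an assumption) and shown against a captured ruleset; a clean result on both nodes still says nothing about switches or ACLs elsewhere in the path.

```shell
# Sketch: flag iptables-save rules that reference a given interface.
# On a node:  iptables-save | filter_rules eth1
filter_rules() {
    grep -E -- "-(i|o) $1( |\$)" || echo "no rules mention $1"
}

# Demo against a captured ruleset:
filter_rules eth1 <<'EOF'
-A INPUT -i eth0 -p tcp --dport 22 -j ACCEPT
-A INPUT -i eth1 -p udp -j DROP
EOF
# prints the eth1 DROP rule, which would block corosync traffic
```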

Edit: You do raise a good point, it’s not explicitly spelt out in the documentation. It is now!


Fortunately the servers are physically close together and the phrase “as close to a raw networking cable as possible” set off a light bulb.

As soon as I connected the nodes using a crossover cable on the eth1 [internal] interfaces, both are showing online and pcs confirms the cluster is in a healthy state.

Thanks for making the wiki edit, as it directly resolved our issue in this case.

Had another quick question: there is a note in /etc/sysconfig/network that explicitly forbids hostname changes, lest we break the cluster.

Is there a supported means of changing node hostnames?

No, hostnames must be left alone or it breaks the HA setup.

As tony said, no. Don’t change the hostname. They need to be set to that.

Why do you want to change the hostname?