Downtime for System Admin updates to a High Availability (HA) cluster

Boy, I wish everyone would weigh in with these kinds of things to watch out for on this thread - maybe even a Wiki article listing the ways you can commit cluster hara-kiri!

I really like having HA as an option for the paranoid, but I wish the maintenance were a little less fraught. I keep complete spare machines for all my installed units, so dead-to-working for any customer is no more than the drive time to get there with a replacement - but some places have to have HA as a line item to check off.

I have no fear of working on my stand-alone boxes - there is nothing I haven't seen or can't fix - but the cluster is a whole different beast.

Keep posting and I will too!

We actually have two cluster environments (two servers each) and upgraded them both from 6.12.65 to 10.13.66. The first cluster's upgrade initially seemed to go well (we had to kill a process during the upgrade due to a Pacemaker bug, but that was expected). The second cluster's upgrade went south: the first server was fine, but on the other one we got a kernel panic and, later, unwanted node reboots on cluster failover. We eventually ended up fixing most of it.

However, since the upgrade, even on the machines where it seemed to go well, I feel that HA is now broken and prone to problems - not badly broken, but enough to be annoying.

Sysadmin distro upgrades now often fail unpredictably and trigger unwanted cluster failovers, requiring manual intervention to bring the cluster back up.
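For what it's worth, since the HA stack here is Pacemaker-based, one thing that may help is putting the cluster into maintenance mode before running upgrades, so Pacemaker stops reacting while services bounce during the update. A rough sketch, assuming a pcs-managed cluster (on older setups the equivalent would be crm configure property maintenance-mode=true):

```
# Tell Pacemaker to stop managing resources while we upgrade.
# Services keep running, but restarts during the upgrade won't
# be treated as failures or trigger a failover.
pcs property set maintenance-mode=true

# ... run the sysadmin/distro upgrade here ...

# Sanity-check that everything is still healthy before re-enabling.
pcs status

# Hand control back to Pacemaker.
pcs property set maintenance-mode=false
```

No idea if this covers every failure mode people are seeing, but it has kept upgrade-time restarts from being mistaken for node failures in similar Pacemaker setups.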

Another problem now is that on some cluster failovers (not all), services like Asterisk and httpd won't start automatically. I get errors that I have to clear from the command line first, and only then will things start.
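That "clear the errors first" step sounds like Pacemaker's recorded failed actions and fail counts blocking the automatic start. A hedged sketch of the manual cleanup, assuming pcs and using "asterisk" and "httpd" as placeholder resource IDs (the actual resource names in your cluster may differ - check pcs status):

```
# Show failed actions and the fail count after a failover.
pcs status
pcs resource failcount show asterisk

# Clear the recorded failures so Pacemaker will attempt to start
# the resource again - this is the manual step after each failover.
pcs resource cleanup asterisk
pcs resource cleanup httpd

# Optionally let fail counts expire on their own, so one stale
# failure doesn't block the next automatic start.
pcs resource meta asterisk failure-timeout=120s
```

Setting failure-timeout is a judgment call: it makes recovery more hands-off, but it can also mask a resource that is genuinely flapping, so keep an eye on the logs either way.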

In short, these upgrades messed things up for us a bit, but we are hopeful we can sort it out.