[Resolved] HA fencing test kills both machines

I did this:

“You can test a real world crash that requires fencing by killing the process called ‘cib’ on the standby node. A command such as ‘killall cib’ should immediately trigger a fence event, and will allow you to validate your fencing configuration.”

And it killed both boxes instead of only one. I’m using IPMI. Obviously, it’s working for both, a little too well it seems. Can anyone help with this?

Turns out he had the wrong IP addresses for the IPMI nodes (-A had -B’s IPMI address, and vice versa).
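For anyone who wants to rule out the same mix-up before running the test, here is a rough Python sketch of the cross-check. Everything in it is an assumption for illustration: the addresses are made-up examples, the dict layout is hypothetical, and it assumes fence_a is meant to power off freepbx-a and fence_b to power off freepbx-b. It does not read anything from pcs or FreePBX; you would fill in the values by hand from your own fencing settings.

```python
# Hypothetical inventory: which IPMI address physically belongs to which node.
# All addresses are made-up examples (documentation range 192.0.2.0/24).
ipmi_of_node = {
    "freepbx-a": "192.0.2.11",
    "freepbx-b": "192.0.2.12",
}


def find_mismatches(fence_targets, ipmi_of_node):
    """Return fence devices whose configured IPMI address does not
    belong to the node that device is supposed to power off."""
    problems = []
    for device, (victim, ipaddr) in sorted(fence_targets.items()):
        expected = ipmi_of_node[victim]
        if ipaddr != expected:
            problems.append((device, victim, ipaddr, expected))
    return problems


# fence device -> (node it is meant to fence, IPMI address configured on it)
# This is the swapped case: each device points at the wrong machine.
swapped = {
    "fence_a": ("freepbx-a", "192.0.2.12"),
    "fence_b": ("freepbx-b", "192.0.2.11"),
}

# The same devices with the addresses the right way round.
correct = {
    "fence_a": ("freepbx-a", "192.0.2.11"),
    "fence_b": ("freepbx-b", "192.0.2.12"),
}

for device, victim, got, expected in find_mismatches(swapped, ipmi_of_node):
    print(f"{device}: meant to fence {victim} but points at {got}, expected {expected}")
```

With the swapped values both devices get flagged; with the corrected ones the list comes back empty.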

I must admit that although I am using the HA module, and it’s working very well, I still have not enabled fencing. I was worried about things like this happening. I have had split brain problems a couple of times, so I will enable fencing once I’m happy everything is stable. I do get problems with our overnight backups freezing the servers briefly and the HA stuff trying to fail over, etc.
I can say that when we did have a major problem in one of our server rooms, the FreePBX HA failed over perfectly. It’s a very useful feature.

That’s why we encourage people to check these things, by doing a test failure :sunglasses:

That’s because you DON’T have fencing set up. When you enable fencing, HA is able to automatically resolve split brain issues, 99 times out of 100.

I’m using Distro 6.6 and FreePBX 13 and I am unable to test fencing: when I kill the cib process it gets restarted immediately, so nothing happens… I’m guessing HA has been improved to be more resilient to crashes of the cib process?

What would be the best way to test fencing now?

Thanks,

G

That doesn’t sound right. Killing the CIB process should fence the machine. If you do a ‘pcs status’, do you see the fencing tasks there?

Yes I do see them:

fence_b (stonith:fence_ipmilan): Started freepbx-a
fence_a (stonith:fence_ipmilan): Started freepbx-b
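If it helps anyone script this check, here is a small Python sketch that pulls the stonith entries out of status text shaped like the two lines above. The regular expression is my own guess based only on those two lines, so treat it as an assumption: other pcs versions may format the output differently, and the function name parse_stonith is just something I made up.

```python
import re

# Pattern guessed from the two sample lines in this thread:
#   <name> (stonith:<agent>): <state> [<node>]
STONITH_LINE = re.compile(
    r"^\s*(\S+)\s+\(stonith:([^)]+)\):\s+(\S+)(?:\s+(\S+))?\s*$"
)


def parse_stonith(status_text):
    """Return one dict per stonith resource line found in status_text."""
    found = []
    for line in status_text.splitlines():
        match = STONITH_LINE.match(line)
        if match:
            name, agent, state, node = match.groups()
            found.append(
                {"name": name, "agent": agent, "state": state, "node": node}
            )
    return found


# The sample is copied from the post above.
sample = """fence_b (stonith:fence_ipmilan): Started freepbx-a
fence_a (stonith:fence_ipmilan): Started freepbx-b"""

for entry in parse_stonith(sample):
    print(entry["name"], entry["agent"], entry["state"], entry["node"])
```

An empty result would mean no fencing resources showed up in the text at all. Anything in a state other than Started would be worth a closer look before testing.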

After I do the “killall cib”, the cib process immediately shows up again with a different PID, started by pacemakerd.