High Availability Node powers down instead of going into standby

avayax · February 5, 2016, 2:53am

Having a big problem with our High Availability cluster.

Upgraded both nodes from 10.13.66-1 to 10.13.66-7.

Then got an error on the GUI: SQLSTATE[HY000]: General error: 126 Incorrect key file for table ‘./asterisk/module_xml.MYI’; try to repair it.
We managed to repair that.
After that was done, I tried to put the current master on standby while the slave was online.
What happened was that the new master came online, but the node that should have become the slave just powered down instead of going into standby.
This is now happening every time.
I also find that this file always gets removed after the above happens:
/etc/sysconfig/network-scripts/ifcfg-em1:0

Could it be there is a fencing operation pending? We don’t have fencing configured.

No idea what is happening.

Summary:
When I put one of my servers from master into standby, it doesn’t go to standby but powers down instead and /etc/sysconfig/network-scripts/ifcfg-em1:0 (where my 172.27.91.1 lives) gets removed.

Can you help?

avayax · February 5, 2016, 2:32pm

Can anybody help with this problem?

avayax · February 8, 2016, 2:00pm

Rob Thomas, can you help us on this issue?

In short we have this problem:
When we put freepbx-b node from master into standby, it doesn’t actually go into standby, but powers down.
File /etc/sysconfig/network-scripts/ifcfg-em1:0 (where my 172.27.91.1 lives) gets removed.
We do not have fencing configured.

What could be my problem?

Marbled · February 8, 2016, 3:41pm

avayax, if you want to bring this thread to Rob’s attention then you better do this:

@xrobau

This should, normally, notify Rob that someone mentioned him in this thread…

Good luck and have a nice day!

Nick

xrobau · February 8, 2016, 8:10pm

em1 is not a standard networking interface name. It should be eth*… Have you enabled biosdevnames?

avayax · February 10, 2016, 1:22am

I didn’t intentionally enable biosdevnames.
It worked great under 12 for almost a year. HA worked fine with the em1 device. The device name is a string supplied to the GUI when the cluster is first created, so I would think it shouldn’t matter what it’s called. It’s stored in a config file, and the secondary IP interface is constructed by appending “:0” to whatever device is specified as the interface for clustering.
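
A minimal sketch of that naming rule, for anyone who wants to check whether the alias file is still on disk after a failover. The function names are hypothetical and this is not the HA module's own code; the directory is simply the one from the path quoted earlier in this thread.

```python
import os

# Assumption: RHEL/CentOS-style directory, taken from the path quoted in this thread.
NETWORK_SCRIPTS_DIR = "/etc/sysconfig/network-scripts"


def alias_name(device, index=0):
    """Build the alias interface name, e.g. 'em1' -> 'em1:0'."""
    return "{}:{}".format(device, index)


def alias_config_path(device, index=0, base_dir=NETWORK_SCRIPTS_DIR):
    """Build the ifcfg file path that would hold the alias configuration."""
    return "{}/ifcfg-{}".format(base_dir, alias_name(device, index))


def alias_config_present(device, index=0, base_dir=NETWORK_SCRIPTS_DIR):
    """Return True if the alias ifcfg file currently exists on disk."""
    return os.path.isfile(alias_config_path(device, index, base_dir))


if __name__ == "__main__":
    state = "present" if alias_config_present("em1") else "MISSING"
    print(alias_config_path("em1"), state)
```

Running it before and after putting a node in standby would show whether the file disappears at that moment.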

Is there anything in your 13 code that’s stepping on any device not called “eth”?

To me it seems the machine is acting as though there are stale fencing events that force it off the secondary subnet and force a power down when it is put in standby after being master.
It’s not 100% consistent though. Sometimes it’s just a reboot with no device removal, other times a full power down and device removal.

The real problem is the powerdown. And only on freepbx-b.

xrobau · February 10, 2016, 1:24am

I totally agree. But, because the network interface is called ‘em1’, it means that other stuff has changed. I don’t know what that other stuff is. I’m reasonably sure that changing the name alone will not break it, but, you’re having a problem with whatever ELSE has been changed.

In this situation, I’d just reinstall and rejoin the -b machine to the cluster.

avayax · February 10, 2016, 1:51am

Looking at the /var/log/messages, it seems that after a standby event (only on freepbx-b, not freepbx-a) pacemaker forces a fencing event.
We don’t have fencing configured though.

So, how do we clear out these pending fencing events? Is that a bug in pacemaker?

xrobau · February 10, 2016, 2:18am

I honestly don’t know. However, I still stand by my previous statement. SOMETHING has been done to that machine, and the best thing to do is just reinstall. That’ll fix all your problems, and it’ll be back at a known good state.

xrobau · February 12, 2016, 12:03am

I’m going to have to eat my words here and admit I’m wrong. I’ve been involved in a commercial support ticket today with a similar issue, and they had a Dell machine that had ex/pxx interfaces from a distro install. I didn’t think that was possible, but, apparently it is 8-\

So. Let’s assume that the machine is perfectly untouched, and everything is correct. There are only two ways the machine can shut down. The first is via IPMI/DRAC, with fencing. The second is the sysadmin command ‘shutdown’, which runs a shutdown command from the shell.

Can you check your logs to see if it is actually HA shutting the machines down? If it’s in your DRAC/IPMI logs, then we know where to point the finger.

If it is HA, I’d be interested to see if you’ve accidentally got the fencing configuration around the wrong way. This has happened to a couple of people. You should be able to go into fencing, and turn off the standby machine, and the standby machine should turn off.

As for the ifcfg-em1:0 file being removed… that I have no idea about. There’s no way that should be removed. Nothing ever deletes network configurations.

It might be worth opening a commercial support ticket and I can investigate further.

avayax · February 12, 2016, 12:17am

Thanks for looking into this.
Actually, the commercial support ticket you worked on today was most likely from me. I was talking to Matt from Schmooze and he said he was gonna talk to you.

I will check the IDRAC logs. We have IDRAC 8 on our servers and don’t have fencing configured.

xrobau · February 12, 2016, 12:34am

Then it’s impossible for HA to shut it down. I’m stumped as to what could be doing it. Anyway, if there’s a support ticket, if I haven’t already looked into it, I’m sure I’ll get pulled in shortly!

–Rob

avayax · February 12, 2016, 4:51pm

These are pacemaker logs. I don’t know if there is anything unusual in them or if they help. They should be from around the time we experienced the problems.

Feb 05 20:39:59 [3469] freepbx-b pacemakerd: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
Feb 05 20:39:59 [3469] freepbx-b pacemakerd: notice: pcmk_shutdown_worker: Shuting down Pacemaker
Feb 05 20:39:59 [3469] freepbx-b pacemakerd: notice: stop_child: Stopping crmd: Sent -15 to process 3482
Feb 05 20:46:58 [3469] freepbx-b pacemakerd: error: cfg_connection_destroy: Connection destroyed
Feb 05 20:46:58 [3469] freepbx-b pacemakerd: notice: pcmk_shutdown_worker: Still waiting for crmd (pid=3482, seq=6) to terminate…
Feb 05 20:46:58 [3469] freepbx-b pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 05 20:46:58 [3469] freepbx-b pacemakerd: error: mcp_cpg_destroy: Connection destroyed
Feb 05 20:46:58 [3469] freepbx-b pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
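
One thing that may be worth noting in these lines is the jump in timestamps between the stop request sent to crmd and the next entry. A small sketch, assuming the log prefix format shown above and the year 2016 from the post dates, that flags such silences (the function names are hypothetical):

```python
import re
from datetime import datetime

# Prefix of the pacemaker log lines as quoted in this thread, e.g.
# "Feb 05 20:39:59 [3469] freepbx-b pacemakerd: notice: <message>"
LINE_RE = re.compile(
    r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) \[\d+\] \S+ \S+\s+(\w+): (.*)$"
)

SAMPLE = [
    "Feb 05 20:39:59 [3469] freepbx-b pacemakerd: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated",
    "Feb 05 20:39:59 [3469] freepbx-b pacemakerd: notice: pcmk_shutdown_worker: Shuting down Pacemaker",
    "Feb 05 20:39:59 [3469] freepbx-b pacemakerd: notice: stop_child: Stopping crmd: Sent -15 to process 3482",
    "Feb 05 20:46:58 [3469] freepbx-b pacemakerd: error: cfg_connection_destroy: Connection destroyed",
]


def parse_line(line, year=2016):
    """Return (timestamp, level, message), or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    stamp, level, message = match.groups()
    # The log omits the year, so one is supplied (assumed from the post dates).
    when = datetime.strptime("{} {}".format(year, stamp), "%Y %b %d %H:%M:%S")
    return when, level, message


def find_gaps(lines, threshold_seconds=60):
    """Yield (seconds, message_before, message_after) for long silences."""
    previous = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        if previous is not None:
            gap = (parsed[0] - previous[0]).total_seconds()
            if gap > threshold_seconds:
                yield gap, previous[2], parsed[2]
        previous = parsed


if __name__ == "__main__":
    for seconds, before, after in find_gaps(SAMPLE):
        print("{:.0f}s between '{}' and '{}'".format(seconds, before, after))
```

On the four sample lines this reports one gap of 419 seconds, between the stop request to crmd and the "Connection destroyed" error. Whether that delay is related to the power down is not something the log excerpt alone can confirm.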

avayax · February 12, 2016, 10:30pm

There are repeated /var/log/messages log entries on freepbx-b going back to Feb 3:

Feb 3 03:24:02 freepbx-b fence_pcmk[57637]: Requesting Pacemaker fence freepbx-a (reset)
Feb 3 03:24:02 freepbx-b stonith_admin[57638]: notice: crm_log_args: Invoked: stonith_admin --reboot freepbx-a --tolerance 5s --tag cman
Feb 3 03:24:02 freepbx-b stonith-ng[3392]: notice: handle_request: Client stonith_admin.cman.57638.b8799b8b wants to fence (reboot) 'freepbx-a' with device '(any)'
Feb 3 03:24:02 freepbx-b stonith-ng[3392]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for freepbx-a: 51fd26b3-9d41-4cd3-ac9b-7606cfcf5149 (0)
Feb 3 03:24:02 freepbx-b stonith-ng[3392]: error: remote_op_done: Operation reboot of freepbx-a by freepbx-b for stonith_admin.cman.57638@freepbx-b.51fd26b3: No such device
Feb 3 03:24:02 freepbx-b crmd[3396]: notice: tengine_stonith_notify: Peer freepbx-a was not terminated (reboot) by freepbx-b for freepbx-b: No such device (ref=51fd26b3-9d41-4cd3-ac9b-7606cfcf5149) by client stonith_admin.cman.57638
Feb 3 03:24:02 freepbx-b fence_pcmk[57637]: Call to fence freepbx-a (reset) failed with rc=237

This to me suggests that freepbx-a is trying to force a reboot/shutdown of freepbx-b, which is failing. It’s reporting the command “stonith_admin --reboot freepbx-a --tolerance 5s --tag cman” as being executed on freepbx-b. It may be failing because fencing is not currently configured. The reported error is “No such device”, as though there’s some fencing/admin device that’s not configured.
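
To make it easier to see which node is asking to fence which, here is a rough sketch that pulls the requesting host, the target and the result code out of fence_pcmk lines. It assumes the syslog format quoted above; the function name is hypothetical. In the lines as quoted, the request is logged on freepbx-b and names freepbx-a as the target.

```python
import re

# fence_pcmk request line as quoted in this thread, e.g.
# "Feb 3 03:24:02 freepbx-b fence_pcmk[57637]: Requesting Pacemaker fence freepbx-a (reset)"
REQUEST_RE = re.compile(
    r"^\w{3}\s+\d+ [\d:]{8} (?P<host>\S+) fence_pcmk\[\d+\]: "
    r"Requesting Pacemaker fence (?P<target>\S+) \((?P<action>\w+)\)"
)

# Matching failure line, e.g.
# "... fence_pcmk[57637]: Call to fence freepbx-a (reset) failed with rc=237"
FAILURE_RE = re.compile(
    r"fence_pcmk\[\d+\]: Call to fence (?P<target>\S+) "
    r"\((?P<action>\w+)\) failed with rc=(?P<rc>\d+)"
)

SAMPLE = [
    "Feb 3 03:24:02 freepbx-b fence_pcmk[57637]: Requesting Pacemaker fence freepbx-a (reset)",
    "Feb 3 03:24:02 freepbx-b fence_pcmk[57637]: Call to fence freepbx-a (reset) failed with rc=237",
]


def summarize(lines):
    """Return (requests, failures) found in the given syslog lines.

    requests: list of (logging_host, target, action)
    failures: list of (target, action, return_code)
    """
    requests, failures = [], []
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            requests.append(
                (match.group("host"), match.group("target"), match.group("action"))
            )
            continue
        match = FAILURE_RE.search(line)
        if match:
            failures.append(
                (match.group("target"), match.group("action"), int(match.group("rc")))
            )
    return requests, failures


if __name__ == "__main__":
    reqs, fails = summarize(SAMPLE)
    for host, target, action in reqs:
        print("{} requested {} of {}".format(host, action, target))
    for target, action, rc in fails:
        print("{} of {} failed with rc={}".format(action, target, rc))
```

Feeding it the full /var/log/messages from both nodes would show how often these requests occur and whether they line up with the standby events.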

xrobau · February 16, 2016, 7:20pm

This is fixed with a combination of the new HA Module (13.0.7) and an update to the Sysadmin RPM. (The sysadmin RPM caused the problem)

Simply update the HA Module, and run a cluster check. That will stop it from happening, immediately.

There’s no need to upgrade sysadmin RPM (but, you can if you want - yum update sysadmin) as the only change is “don’t have this bug in you”

This only affected machines that had upgraded distro versions 6.12 to 10.13 (or, possibly, older versions of 10.13 to 10.13-7). If you haven’t upgraded your distro version, or, you installed Distro version 10.13 to start with, you do not have this problem. This is not related to the FreePBX version.

Still, it’s always a good idea to run a health check, just to make sure!

–Rob

avayax · February 20, 2016, 12:41am

That issue is fixed. Node doesn’t power down anymore.

We saw another issue though.
When moving freepbx-a from slave to master, Asterisk wasn’t properly starting. I had to do an amportal restart.
Status page on HA showed an error that Asterisk wasn’t running:

2016-02-19 19:53:50: asterisk_service on freepbx-a was not running
asterisk_service can fail 1 more time(s) until it is blocked on freepbx-a

I can’t replicate this consistently; it has happened twice.
I could go to the Asterisk CLI, but couldn’t issue any commands there.