High Availability - Almost Fixed, SSH Key Issue

I’ve been having issues ever since a series of extended power failures - the primary node refused to start asterisk, so we’ve been running on the backup server since then. After much troubleshooting, swearing, replacing hard drives, gnashing of teeth, reinstalls, and threats to throw the server out into the middle of the road and cheer as semi trucks drive over it, I’ve managed to get the HA resources to stay up. (More on that later.)

The one remaining issue, from what I can tell, is that when I run a cluster health check, the following line is displayed:

Verify other node's SSH Key Critical system error. Unable to continue

I’ve looked in the wiki & can find nothing about this. Wondering if anyone has any ideas or clues about where I need to update the keys, or how to get the system to do this automatically?

For those who might run into this same issue with HA, here’s what happened with mine & how I finally fixed it. While it appears simple, it took me over a month to figure all this out on my own… there’s precious little available online if you don’t know exactly what you’re looking for, & I got few responses to the threads I started here. (A link provided by @dicko was the closest to helping, but was missing vital information, which I’m going to provide here.)

I ran into another issue along the way - the failed server would power on for 3-5 seconds and then shut off. Thankfully it fixed itself - literally, I came in one Monday and it behaved normally. That’s a good thing, as there are still 0 responses to the thread I started asking for documentation on the board of the PBXact 100 appliance… I guess from this that you’re not supposed to work on or troubleshoot these when they have problems. :frowning:

Apparently DRBD keeps its own metadata for each resource. If that metadata gets corrupted, you can still mount the device by hand as a regular file system, but while drbd will mount it at boot, it unmounts it again a few seconds later. (If you’re fast enough logging in, you can watch it do this.) Because drbd won’t keep the resource mounted, it causes problems for the rest of the system. In my case the httpd resource would not stay mounted, so asterisk was unable to start, and there was nothing in any of the log files that pointed me in the right direction. Even blowing away & recreating the file system on the underlying device did not resolve the issue… because it was the metadata kept by drbd that was corrupt, not the file system itself.
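For anyone wanting to confirm this is what’s happening, the commands below are what I’d look at first. They’re stock drbd-utils tools, not anything specific to the HA module, & dump-md generally wants the resource detached or down before it will read the metadata:

cat /proc/drbd     #Overall connection & disk state (on DRBD 8.x this shows everything).
drbdadm cstate httpd     #Connection state of the resource, e.g. Connected or StandAlone.
drbdadm dstate httpd     #Disk state, e.g. UpToDate, Inconsistent, or Diskless.
drbdadm dump-md httpd     #Dumps the on-disk metadata & complains loudly if it can't read it.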

Here are the commands I used to get the drbd resource to stay mounted, along with what each one does. This may not work for your system & I don’t know whether it can cause data loss, though it did not in my case. Use at your own risk… all commands were executed as root. (Which I know is going to thrill some people! :wink: :laughing: )

On the failed node:

drbdadm invalidate httpd     #Tells drbd to not trust the data on the drive.
drbdadm create-md httpd     #Recreates the metadata - you need to type 'yes' twice to confirm.
drbdadm secondary httpd     #Ensures the resource on the failed node is secondary; probably not needed.
drbdadm disconnect httpd     #Disconnects the resource from the cluster.
drbdadm connect --discard-my-data httpd     #Reconnects the resource to the cluster, discarding the local copy of the data.
drbdsetup status     #Shows what drbd is doing at the present time.

The final command showed that drbd was trying to connect to the active node. On the main node:

drbdadm primary httpd     #Make sure that the resource is primary on the good node; also probably not necessary, but won't hurt.
drbdadm connect httpd     #Allows the failed peer to connect & begin replicating the file system.

The file system began synchronizing & I could monitor the progress with:

service drbd status
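If service drbd status isn’t available on your build, the same progress should be visible with the plain drbd tools (again, nothing HA-module specific here):

watch cat /proc/drbd     #Live view of the resync percentage on DRBD 8.x.
drbd-overview     #One-line summary per resource, if your drbd-utils package includes it.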

Even though the other resources showed as Connecting and would probably be OK if I’d simply executed the commands in the second block on the active node, I repeated all of them for each resource - asterisk, mysql, & spare. (I did NOT want to get done & have a mysterious problem caused by corrupt metadata!)
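For what it’s worth, if you’d rather not type all of that four times, a loop along these lines should cover the same steps on the failed node. I did mine by hand, so treat this as a sketch rather than something I ran verbatim - the yes yes bit just answers the two confirmation prompts from create-md:

#Run on the failed node only. Resource names match my system; adjust for yours.
for r in httpd asterisk mysql spare; do
    drbdadm invalidate "$r"     #Tell drbd not to trust the local data.
    yes yes | drbdadm create-md "$r"     #Recreate the metadata, auto-answering the prompts.
    drbdadm secondary "$r"     #Keep the failed node secondary.
    drbdadm disconnect "$r"     #Disconnect from the peer.
    drbdadm connect --discard-my-data "$r"     #Reconnect & take the peer's copy of the data.
done
drbdsetup status     #Confirm each resource is trying to connect.

The primary & connect commands from the second block still need to be run for each resource on the active node.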

Once I figured all of this out and allowed the nodes time to sync up, things began working. The only remaining issue is the aforementioned ssh key problem… I don’t know why the ssh key would have changed, as this is a bitwise clone of the original drive that was in the device & working before the power failure. :thinking:

I have a hunch there’s a simple command the HA setup utility runs when the cluster is first configured, but I’ve no idea what it is or how to figure it out. Really hoping someone here knows how to fix this last remaining error.

Aaaand it’s fixed. :slight_smile: :slight_smile: :slight_smile:

The solution was extremely difficult… ssh from each box to the other one & accept the fingerprint when prompted. :laughing:
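In case it saves someone a step, that’s literally all it amounted to on each node - the address below is just a placeholder for the other node’s IP, & you can pre-seed known_hosts with ssh-keyscan if you’d rather skip the interactive prompt:

ssh root@<other-node-ip>     #Answer 'yes' when asked about the fingerprint, then exit.
ssh-keyscan <other-node-ip> >> /root/.ssh/known_hosts     #Alternative: add the host key without being prompted.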

Hope the above book I wrote ends up helping someone.

Looks like I spoke too soon… :frowning: :cry:

A bunch of stuff came up & I didn’t have a chance to do a real-world failover to the repaired system until a few days ago. Even though DRBD shows all 4 resource groups as being fully synchronized with the running secondary, the httpd & asterisk resource groups (drives & processes) will not start when I either put the running secondary into Standby mode or simulate a failure by pulling the ethernet cables from the failed node. Even if I reboot it after a fresh drive sync, it refuses to start these processes. I have a hunch that asterisk will not start because httpd will not, but this is little more than a hunch. There’s nothing in any log file that I’ve been able to find to indicate what/where the problem is.
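For anyone digging into the same thing: assuming the HA module is the usual DRBD + pacemaker/corosync stack underneath, these are the sort of commands that should show why a resource group won’t start. I’m assuming the pcs tooling is installed (older builds may only have crm), & the resource names shown by pcs status may not exactly match the drbd ones:

pcs status     #Shows which resources are started, stopped, or failed, & on which node.
crm_mon -1     #One-shot view of the same information.
pcs resource debug-start httpd     #Tries to start the resource in the foreground & prints why it fails.
pcs resource cleanup     #Clears old failure counts so the cluster will retry the resources.

debug-start in particular is worth a try, since it prints the resource agent’s output directly instead of burying it in the cluster log.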

The only idea I have at this point is to go in at night, shut down the secondary, do a bitwise clone of the drive in it, then shove this drive into the primary and somehow adjust the licensing / identity of it by hand so it looks like the machine it’s actually in. Has anyone ever tried this, or does anyone have anything else I could try before I go to such an extreme measure??? :question:

I’ve spent over 2 months working on this & while it’s certainly in better shape now than it was, it’s still incapable of processing calls. I’ll admit I’m officially stumped right now… cloning the backup & then adjusting things (if that’s even possible) is clearly a hack to work around whatever the problem is, not a way to find & fix it. :frowning_face:

As it is a paid-for and expensive commercial closed-source module, I would suggest you open a commercial ticket, because few in these fora are in any position to help you here.

The HA module is all but dead. There is a reason it is unavailable beyond v13; Advanced Recovery has taken its place.

It might be ‘almost dead’, but at its price I hope it’s not also ‘almost unsupported’ for those poor suckers who bought it . . .
