For those who might run into this same issue with HA, here’s what happened with mine & how I finally fixed it. While the fix looks simple, it took me over a month to figure all this out on my own… there’s precious little available online if you don’t know exactly what you’re looking for, & I got few responses to the threads I started here. (A link provided by @dicko came closest to helping, but was missing vital information, which I’m going to provide here.) I ran into another issue along the way - the failed server would power on for 3-5 seconds and then shut off. Thankfully it fixed itself - I literally came in one Monday and it behaved normally. That’s a good thing, as there are still 0 responses to the thread I started asking for documentation on the board of the PBXact 100 appliance… I guess you’re not supposed to work on or troubleshoot these yourself when they have problems.
Apparently DRBD keeps metadata for each resource. If that metadata gets corrupted, you can still mount the underlying device as a regular file system by hand; however, while drbd will mount it at boot, it unmounts it again a few seconds later. (If you’re fast enough logging in, you can watch it do this.) Because drbd won’t keep the resource mounted, it causes problems for the rest of the system. In my case, the httpd resource would not stay mounted, so asterisk was unable to start. Nothing in any of the log files pointed me in the right direction. Even blowing away & recreating the file system on the underlying device did not resolve the issue… because it was drbd’s metadata that was corrupt, not the file system.
The following are the commands I used to get the drbd resource to stay mounted, with a note on what each one does. This may not work for your system & I don’t know if it can cause data loss, though it did not in my case. Use at your own risk… all commands were executed as root. (Which I know is going to thrill some people! )
On the failed node:
drbdadm invalidate httpd #Tells drbd not to trust the data on this node.
drbdadm create-md httpd #Recreates the metadata - you need to type yes twice to confirm.
drbdadm secondary httpd #Ensures the resource on the failed node is secondary; probably not needed.
drbdadm disconnect httpd #Disconnects the resource from the cluster.
drbdadm connect --discard-my-data httpd #Reconnects the resource to the cluster, discarding this node's data in favor of the peer's.
drbdsetup status #Shows what drbd is doing at the present time.
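Since I had to repeat this sequence for other resources later, here it is wrapped in a small dry-run shell function - a sketch only: it just prints the commands so you can review them before running anything as root. The function name is mine, not part of drbd; only the drbdadm commands themselves are from the steps above.

```shell
#!/bin/sh
# Dry-run sketch: print the failed-node recovery sequence for a resource
# so it can be reviewed before being run as root for real.
# "failed_node_cmds" is just a name I made up; the drbdadm commands are real.
failed_node_cmds() {
  res="$1"
  printf '%s\n' \
    "drbdadm invalidate $res" \
    "drbdadm create-md $res" \
    "drbdadm secondary $res" \
    "drbdadm disconnect $res" \
    "drbdadm connect --discard-my-data $res"
}

failed_node_cmds httpd
```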
drbdsetup status showed that drbd was trying to connect to the active node. Next, on the main node:
drbdadm primary httpd #Ensures the resource is primary on the good node; also probably not necessary, but won't hurt.
drbdadm connect httpd #Allows the failed peer to connect & begin replicating the file system.
The file system began synchronizing & I could monitor the progress with:
service drbd status
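If you prefer the raw kernel view, DRBD 8 also exposes sync progress in /proc/drbd (e.g. watch -n2 cat /proc/drbd). The snippet below pulls the percentage out of a mocked-up sample of that format - the sample text is my assumption of the layout, not output captured from the appliance; on a live node you would read /proc/drbd itself.

```shell
#!/bin/sh
# Sketch: extract the sync percentage from DRBD 8-style /proc/drbd output.
# $sample is a mock-up of the format, NOT captured from my system;
# on a live node you would use: cat /proc/drbd
sample=" 1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
    ns:0 nr:102400 dw:102400 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0
        [==>.................] sync'ed: 10.3% (921600/1024000)K"
# Print the third field of the line containing the sync percentage.
echo "$sample" | awk "/sync'ed:/ {print \$3}"
```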
Even though the other resources showed as Connecting and probably would have been OK if I’d simply run the second block of commands on the active node, I repeated all of the commands for each resource - asterisk, mysql, & spare. (I did NOT want to finish & then hit a mysterious problem caused by corrupt metadata!)
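For what it’s worth, the active-node half of that per-resource repetition can be sketched as a dry-run loop the same way - it only prints the commands for review. The function name is mine; the resource names are from my system.

```shell
#!/bin/sh
# Dry-run sketch: print the active-node commands for each resource given.
# "active_node_cmds" is a name I invented; the drbdadm commands are real.
active_node_cmds() {
  for res in "$@"; do
    printf '%s\n' \
      "drbdadm primary $res" \
      "drbdadm connect $res"
  done
}

active_node_cmds asterisk mysql spare
```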
Once I figured all this out and allowed the nodes time to sync up, things began working. The only issue remaining is the aforementioned ssh key problem… I don’t know why the ssh key would have changed, as this is a bit-for-bit clone of the original drive that was in the device & working before the power failure.
I have a hunch there’s a simple command that the HA setup utility runs when the cluster is first configured, but I’ve no idea what it is or how to figure it out. Really hoping someone here knows how to fix this last remaining error.