High Availability - One Node Failed, Increase Drive Size

RealRuler2112 · September 8, 2021, 3:47pm

We had a massive storm yesterday and ended up with a brownout situation, then the power failed completely for several hours. The UPSs were drained and things shut down. This morning, the PBX was in a split brain condition, with mysql running on the primary and everything else on the secondary. Unfortunately, asterisk refuses to start on the primary.

So far, I’ve tried looking for corrupt files, file system problems, power cycling it, swearing at it, looking at config & log files, etc - nothing has helped. I’m to the point where I’m making no progress & am pretty much out of ideas of what to try. Unless someone has suggestions, I’m at the point of reinstalling the one node, then rebuilding the cluster.

With all the little tweaks I’ve made since getting the system, many of which are not stored on the mirrored drive space, I’d rather not wipe the drive… I know that I’m going to miss something and end up being bitten by it. Both PBX systems currently have 480 gig SSDs in them - I replaced the 240 gig drives that they came with almost immediately, as they were nowhere near large enough to store the amount of call recordings that sales told me they would. (A salesperson exaggerate? NEVER!) Even after doubling the size, they still only barely hold the 3 months management wants.

Because I’m looking to replace the drive in one already, I was thinking it’d be a good opportunity to increase the size of the drives in both systems. I cannot find a matching 480 gig drive locally, but 1T drives are readily available. Unfortunately, I do not know how the reinstall would go having the new drive be 1T while the currently active node is 480 gig.
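For a rough sense of what the larger drives would buy, here is a back-of-the-envelope sketch in Python. It assumes recordings grow linearly and that the whole drive is available for them (it is not; the OS and everything else take their share), so treat the numbers as ballpark only. The only measured input is the 480 gig drive barely holding 3 months; the function name is made up for illustration.

```python
def months_of_retention(drive_gb, gb_per_month):
    """Estimate how many months of recordings fit on a drive (linear growth assumed)."""
    return drive_gb / gb_per_month


# Observed: a 480 GB drive barely holds 3 months of call recordings.
observed_drive_gb = 480
observed_months = 3
gb_per_month = observed_drive_gb / observed_months  # rough monthly growth

# Compare the original, current, and candidate drive sizes.
for size_gb in (240, 480, 1000):
    months = months_of_retention(size_gb, gb_per_month)
    print(f"{size_gb} GB -> roughly {months:.1f} months of recordings")
```

By this estimate a 1T drive would roughly double the retention window rather than just add a little headroom.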

My assumption is that I’d install PBXact on the new drive, join the cluster, let it sync, then have the freshly installed system take over as the active node and repeat the whole process for the other system. Does anyone have any experience with this or input for me? I read through the VG resizing document on the wiki, but that assumes you’re starting with a healthy cluster and that the hardware on the two machines exactly matches.

RealRuler2112 · September 9, 2021, 8:16pm

Given the lack of response, I’m going to assume that nobody has done this or remembers details about replacing drives in an HA setup with larger ones. Found matching drives & ordered some, both for this replacement & to have spares should something like this happen in the future; going to just do a straight replacement.

Really wish I didn’t have to do a reinstall, but ‘unknown error’ isn’t a whole heck of a lot to go on…

dicko · September 9, 2021, 10:27pm

For such an expensive commercial module, one would expect Sangoma to continue to provide support. In its absence, and putting on my very old and tattered Elastix hat (how’s it hanging, @danardf?), DRBD with two nodes is a recipe for a burgeoning disaster, as you have found.

When we first did this in Elastix a couple of centuries ago (for free), you always needed a recipe to recover; this link

should help. Having resolved the split-brain condition, you can follow the recipes to replace and resync the secondary block storage with a larger device (you are not limited to any filesystem here; check with LINBIT).
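One thing to keep in mind with mismatched drives: as far as I understand it, DRBD sizes the replicated device to the smaller of the two backing devices, so the extra space only becomes usable once both nodes have the larger drive and the device has been grown (the DRBD User’s Guide covers growing a device with `drbdadm resize`). A toy Python model of that rule, with illustrative sizes and ignoring metadata and partitioning overhead; the function name is hypothetical and nothing here talks to DRBD:

```python
def usable_replicated_gb(node_a_gb, node_b_gb):
    """A mirrored device cannot be larger than its smallest backing device."""
    return min(node_a_gb, node_b_gb)


# Walk through the upgrade: original pair, one node swapped, both swapped.
stages = [
    ("both nodes on original drives", 480, 480),
    ("one node replaced with 1T", 480, 1000),
    ("both nodes replaced with 1T", 1000, 1000),
]

for label, a_gb, b_gb in stages:
    print(f"{label}: at most {usable_replicated_gb(a_gb, b_gb)} GB replicated")
```

In other words, a 1T drive paired with a 480 gig one should sync fine, but you gain nothing until the second node is upgraded too.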

Moving on from here, adding a non-storage ‘quorum’ resolver (could be a Raspberry Pi) for your DRBD system would be a good idea.
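To see why a third voter helps, here is a generic majority-vote sketch in Python. It is not DRBD’s or PBXact’s actual mechanism, just the arithmetic behind quorum; the function name and vote counts are illustrative assumptions.

```python
def has_quorum(reachable_votes, total_votes):
    """A partition may stay active only if it can see a strict majority of voters."""
    return reachable_votes > total_votes // 2


# Two-node cluster, link between the nodes is lost: each node sees only itself.
two_node = [has_quorum(1, 2), has_quorum(1, 2)]

# Same failure with a third, storage-less tiebreaker that one side can still reach.
three_voter = [has_quorum(2, 3), has_quorum(1, 3)]

print("two nodes:", two_node)        # neither side has a majority
print("with tiebreaker:", three_voter)  # exactly one side has a majority
```

With two voters, neither side can prove it is the survivor, so either both stop or both carry on and you get split brain. With three, exactly one side keeps a majority.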

Good luck. But basically, any two-node HA system can ultimately fail and need manual resolution, no matter how much you paid for it.

franckdanard · September 10, 2021, 5:48am

Hey Dicko.
Those were the good times.

microchipmatt · September 10, 2021, 10:18am

@RealRuler2112, sounds like a tough situation. I know this isn’t helpful now, and I don’t know how large your environment is. I’m in education, with 1000 staff, 6000 students and 30 sites. I almost went HA across our large geographic locations (think the north and oil), but ultimately decided against it when I saw all the heartbeat and syncing involved, matched versioning, etc.

Instead, I went with daily dumps from my primaries to the cloud, which are restored on my secondaries so their data matches when a secondary has to take over. This is more forgiving with FreePBX 14 and 15, given the way backup and restore was reworked in FreePBX 15. My restore file, with all config settings, database data and greetings for 31 sites and currently 3000 voice messages, is only 1.3 GB. I’m not sure how that all fits in a 1.3 GB backup, but compressed it does, and rclone is my friend for duplicating it to and from the cloud between my primaries and my secondaries. It should be noted that my secondaries are at different locations, with DNS used in the wait for failover. All primaries and secondaries look like the same identity to the trunk provider through the magic of virtual NAT pointers. Only the trunk provider is allowed to communicate with them.

@dicko, in another lifetime I once looked at and considered (T) - HAAst, but my wallet and my skill level ran away screaming when I saw the price and configuration manual, which also matches your very truthful statement about how an HA system can fail no matter how much you pay for it.
