High Availability Cutover-File Corruption Problem

Since the last cutover between nodes on my HA cluster, the file systems on our DRBD devices have shown internal data corruption.
So far I have seen 4 corrupted voicemail files, which cause a Whoops error in the FreePBX GUI when accessing VoiceMail Admin.

Here are the warnings (in dmesg) from our last cutover:

EXT4-fs (drbd1): warning: mounting fs with errors, running e2fsck is recommended
EXT4-fs (drbd2): warning: maximal mount count reached, running e2fsck is recommended
EXT4-fs (drbd4): warning: maximal mount count reached, running e2fsck is recommended
EXT4-fs (drbd3): warning: maximal mount count reached, running e2fsck is recommended
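
To see at a glance which DRBD devices the kernel has flagged, the warnings above can be filtered with something like the following. This is only a minimal sketch: the function name `flagged_drbd_devices` is made up for illustration, and it matches only the message format shown above.

```shell
#!/bin/sh
# Print each DRBD device that an EXT4-fs warning recommends checking,
# one /dev path per line, de-duplicated.
# Reads kernel log text on stdin (for example, piped from dmesg).
# flagged_drbd_devices is a hypothetical helper name, not an existing tool.
flagged_drbd_devices() {
    sed -n 's|.*EXT4-fs (\(drbd[0-9][0-9]*\)): warning: .*e2fsck is recommended.*|/dev/\1|p' \
        | sort -u
}

# Usage: dmesg | flagged_drbd_devices
```

The leading `.*` in the pattern allows for the timestamp prefix that dmesg may put in front of each line.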

How do I solve this and how do I run e2fsck properly?

@xrobau

DRBD isn’t a file system; it is a block device on which file system(s) sit. Unfortunately, this WILL occasionally happen to ext4 on distributed/networked block devices. Because of the way Comedian Mail does its “shell sort” on voicemail files as they are reallocated to “read/deleted”, a cutover at an inconvenient point in time will often cause corruption, especially on a busy system. Any IO error in the voicemail directory (google “bit-rot”) will cripple the voicemail/Comedian Mail subsystem from that point forward, until the underlying FS is repaired.

I suggest this recipe:

Disconnect the secondary node so corosync/heartbeat doesn’t try to be "helpful".
Stop all “replicated” services; your Asterisk system is now DOWN!
Unmount every filesystem that DRBD hosts.
Repair all affected partitions/filesystems with e2fsck while they are unmounted.
Make sure that the primary DRBD subsystem is still primary/active.
Remount all the filesystems you unmounted.
Restart all the stopped services; your system should now be UP again!
Reconnect the secondary node.
Wait for the resync to finish while you ‘watch cat /proc/drbd’.
Check that “cutover” still works :wink:
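
The recipe above could be sketched as a shell script along these lines. Treat it as a rough outline under stated assumptions, not a tested procedure: the resource name, device, mount point and service name are hypothetical placeholders, and it assumes that `drbdadm disconnect` is enough to isolate the secondary on your cluster (you may also need to put the node in standby in your cluster manager). By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the repair recipe. Everything configurable here is an ASSUMPTION:
# adapt resource names, devices, mount points and services to your cluster.
DRY_RUN="${DRY_RUN:-1}"            # 1 = only print commands, 0 = execute them
SERVICES="${SERVICES:-asterisk}"   # hypothetical list of replicated services
# Space-separated "resource:device:mountpoint" triples (hypothetical layout).
VOLUMES="${VOLUMES:-r0:/dev/drbd1:/mnt/drbd1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repair_drbd_filesystems() {
    # Disconnect the secondary (assumed: dropping the DRBD link is sufficient).
    for vol in $VOLUMES; do run drbdadm disconnect "${vol%%:*}"; done
    # Stop the replicated services; the system is DOWN from here.
    for svc in $SERVICES; do run service "$svc" stop; done
    for vol in $VOLUMES; do
        res=${vol%%:*}; rest=${vol#*:}; dev=${rest%%:*}; mnt=${rest#*:}
        run umount "$mnt"          # never fsck a mounted filesystem
        # -f forces a full check, -y answers yes to every repair prompt.
        # e2fsck exits 0 (clean) or 1 (errors corrected) on success.
        run e2fsck -f -y "$dev"
        run drbdadm role "$res"    # should still report Primary on this node
        run mount "$dev" "$mnt"
    done
    # Restart services, then let the secondary reconnect and resync.
    for svc in $SERVICES; do run service "$svc" start; done
    for vol in $VOLUMES; do run drbdadm connect "${vol%%:*}"; done
    run cat /proc/drbd             # repeat (or use watch) until the resync is done
}

# Usage: repair_drbd_filesystems            (prints the plan)
#        DRY_RUN=0 and call again as root   (actually runs it)
```

Reading the dry-run output first is the point of the `run` wrapper: it lets you check the order and the device names before anything touches the filesystems.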

Many prefer ZFS (or maybe Btrfs) over ext4 because self-healing helps ameliorate the problem you just encountered, but that’s another story :slight_smile:


That’s odd. It’s meant to run fsck on every cutover, so you shouldn’t see those warnings at all. Looking at the latest Pacemaker code, it appears that someone has the logic the wrong way round:

That should be ‘true’, not ‘false’.

I’ll see what I can do about getting a patch in place.

Edit, later:
In the interim, I’ve created a ticket so I don’t forget about it 8)

http://issues.freepbx.org/browse/FREEPBX-12437
