Since the last cutover between nodes on our HA cluster, our DRBD-backed file systems have shown internal data corruption.
So far I have seen 4 corrupted voice mail files, causing a Whoops error on the FreePBX GUI when accessing VoiceMail Admin.
Here was the error (in dmesg) at our last cutover:
EXT4-fs (drbd1): warning: mounting fs with errors, running e2fsck is recommended
EXT4-fs (drbd2): warning: maximal mount count reached, running e2fsck is recommended
EXT4-fs (drbd4): warning: maximal mount count reached, running e2fsck is recommended
EXT4-fs (drbd3): warning: maximal mount count reached, running e2fsck is recommended
How do I solve this and how do I run e2fsck properly?
DRBD isn’t a file system; it is a block device on which file system(s) sit. Unfortunately, this WILL occasionally happen to ext4 on distributed/networked block devices. Because of the way ComedianMail does its “Shell sort” on voicemail files as they are reallocated to “read/deleted”, a “cutover” at an inconvenient point in time, especially on a busy system, will often cause corruption. Any IO error in the voicemail directory (google “bit-rot”) will then cripple the voicemail/ComedianMail subsystem from that point forward until the underlying FS is repaired.
I suggest this recipe:
Disconnect the secondary node so corosync/heartbeat doesn’t try to be “helpful”.
Stop all “replicated” services. Your Asterisk system is now DOWN!
Unmount any filesystem(s) that DRBD hosts.
Repair all affected partitions/filesystems.
Make sure that the Primary DRBD subsystem is still primary/active.
Remount all unmounted filesystem(s)
Restart all the stopped services. Your system should now be UP again!
Reconnect the secondary node.
Wait while you ‘watch cat /proc/drbd’ until the resync has finished.
Check that “cutover” still works.
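
As for running e2fsck properly, the recipe above could be sketched as a script along these lines. Treat it as a rough sketch rather than a tested procedure: the resource name `r0`, the mount point `/mnt/drbd1` and the `service asterisk` calls are placeholders I made up, it handles one device per run, and it does not touch corosync/heartbeat at all, so the secondary still has to be kept out of the way by whatever means your cluster stack uses. By default it only prints the commands; nothing runs unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch of the recipe above for ONE DRBD device. By default it only prints
# the commands; set DRY_RUN=0 to really run them (as root, on the primary).
DRY_RUN="${DRY_RUN:-1}"

# Assumed placeholders, not taken from the original post: change them.
RESOURCE="${RESOURCE:-r0}"              # DRBD resource name (hypothetical)
DEVICE="${DEVICE:-/dev/drbd1}"          # device named in the dmesg warning
MOUNTPOINT="${MOUNTPOINT:-/mnt/drbd1}"  # usual mount point (hypothetical)
SERVICE="${SERVICE:-asterisk}"          # replicated service (assumed name)

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Succeeds when every device line on stdin shows ds:UpToDate/UpToDate
# (the /proc/drbd layout of DRBD 8.x is assumed here).
resync_done() {
    awk '/ds:/ { seen = 1; if ($0 !~ /ds:UpToDate\/UpToDate/) bad = 1 }
         END   { if (seen && !bad) exit 0; exit 1 }'
}

repair_device() {
    run drbdadm disconnect "$RESOURCE" || return 1   # cut the replication link
    run service "$SERVICE" stop                      # system is now DOWN
    run umount "$DEVICE" || return 1                 # never fsck a mounted fs
    run e2fsck -f "$DEVICE"                          # forced full check
    fsck_status=$?
    # e2fsck exit codes: 0 clean, 1 or 2 errors corrected, 4 and up means trouble
    if [ "$fsck_status" -ge 4 ]; then
        echo "e2fsck exited with $fsck_status, stopping here" >&2
        return 1
    fi
    run drbdadm role "$RESOURCE"                     # should still show Primary
    run mount "$DEVICE" "$MOUNTPOINT" || return 1
    run service "$SERVICE" start                     # system is UP again
    run drbdadm connect "$RESOURCE"                  # secondary resyncs now
}

repair_device
```

Run it once per affected device, e.g. with DEVICE=/dev/drbd2 and the matching RESOURCE and MOUNTPOINT. After the reconnect, something like `until resync_done < /proc/drbd; do sleep 5; done` would wait for the resync instead of eyeballing `watch`.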
Many prefer using ZFS (maybe Btrfs) over ext4 to ameliorate (self-heal) the problem you just encountered, but that’s another story.