High Availability?

As always, thank you. I’m going to owe you a beer after this.

Looks like things are moving forward smoothly on my end. There was some troubleshooting with the symlinks, but I got everything working nicely. I do still owe you that beer. =)

Ah!! HA beer :wink: how cool would that be?

If you’ve got the pacemaker+corosync+drbd recipe working, can you share the love? Everywhere I turn I find scripts that don’t work or are older than dirt.

Wait for los, he’s almost there.

(I gave you the basic method. you will need to fish from there :wink: )

No problem, as dicko said…I was close to getting it working right. I’ve finally hit it right on and have it documented out the wazoo for recovery if necessary.

Let me see if I can write something up. =)

Sorry it took me a bit, but we’re really hitting a busy time right now as we’re trying to prepare for a move.


Here’s something to get you started:

  • When installing the FreePBX Distro, make sure you pick Advanced and set up an extra partition equal to or LARGER than the DRBD partition you want to use
  • I recommend the “Clusters from Scratch” PCS and LCMC versions
  • Set up CMAN with the guide here: http://clusterlabs.org/quickstart-redhat.html
  • For DRBD, you’ll need to install elRepo, which will give you access to the rpm
  • Once you have all of that set up by following the Clusters from Scratch guides and the CMAN guide (make sure it’s pacemaker + CMAN + DRBD instead of the usual pacemaker + corosync + DRBD, since on RHEL plain corosync is going by the wayside), you’ll need to isolate the files used by FreePBX onto the DRBD drive. Essentially they are the following folders and files:


-Then you need to remove the original files and create symlinks with their original names and locations…and if I remember correctly, there’s one symlink in either /var/www or /var/lib/mysql that is a relative reference that must be re-established to point at the correct location, too.
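As a rough illustration of that move-and-symlink pattern (all paths below are throwaway temp directories, NOT the real FreePBX locations, which I’m not going to list from memory):

```shell
# Demo of the relocate-and-symlink step using temp directories; in the
# real setup the source would be a FreePBX data directory and the
# target would be the mounted DRBD volume.
DRBD_MNT=$(mktemp -d)          # stands in for the mounted DRBD drive
ORIG=$(mktemp -d)              # stands in for an original FreePBX dir
echo "sample" > "$ORIG/data.file"

mv "$ORIG" "$DRBD_MNT/data"    # move the originals onto the DRBD volume
ln -s "$DRBD_MNT/data" "$ORIG" # recreate the original path as a symlink

ls -l "$ORIG"                  # services keep using the old path transparently
```

Any relative symlinks already inside the moved tree are the ones that can end up pointing at the wrong place after the move.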

  • The last step, STONITH, threw me for a loop for a bit…and that was because I was using the CLI without noticing that there was a hardcoded default action that would run for the fencing type I used…and whenever I tried to force a DIFFERENT action for each device, it wouldn’t accept it. I don’t know what specific fencing you’d be using, but if you run into the type of problem that you cannot get it to do anything BUT one specific action…I have the solution.
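For what it’s worth, here’s a sketch of the kind of pcs commands involved. The fence agent, addresses, credentials, and node names below are all hypothetical; the right agent and its parameters depend entirely on your hardware, so check `pcs stonith describe <agent>` first:

```shell
# Hypothetical IPMI-based fencing for two nodes; replace every value
# (agent, IPs, credentials, node names) with your own.
pcs stonith create fence-node1 fence_ipmilan \
    ipaddr=10.0.0.101 login=admin passwd=secret \
    pcmk_host_list=node1

pcs stonith create fence-node2 fence_ipmilan \
    ipaddr=10.0.0.102 login=admin passwd=secret \
    pcmk_host_list=node2

# Keep each fence device off the node it is meant to kill:
pcs constraint location fence-node1 avoids node1
pcs constraint location fence-node2 avoids node2
```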

Now, this may be fairly simple to someone who knows their way around these things, but it took me about a week of constant trial and error to get things about right, and then tiny tweaks over a few more weeks to make sure I had everything nailed down.

As I said, I have complete documentation, but I would need to boil it down into something more tutorial-based. I can answer more questions, though.

Throw your questions out there, and I hope this helps at least somewhat.

And dicko, if you see anything way off, yell at me. =)

Looks pretty good for a CentOS system; some “state” will be in /var/cache on some systems. Make sure your MTA doesn’t get confused, and depending on your network and VLANs, watch out for arping problems. . . .

I see nothing in /var/cache that seems to have any effect on asterisk/mysql/httpd other than perhaps rpcbind. Do you think it would be prudent to add the whole directory to the drbd drive as well?

Also…I must be missing something, because I don’t see the import of the MTA in this situation…

Lastly…I don’t believe I should be running into much in the way of arping problems…but I want to try something: right now the cluster is running across a switch over the client-facing IPs, and I’d like to set it up so that the cluster communicates directly between the nodes via a crossover link, with ONLY the client-facing IPs acting as destinations for the floating IP.

It depends on what you add; for example, the Aastra XML scripts will use /var/cache/aastra for their state, HylaFAX uses /var/lib/hylafax, etc.

The problem with handling mail depends on how it is configured; if inbound is in use, then the postfix (or other MTA) state needs to be preserved and the service also controlled. If only outbound, it’s less of a problem, but you should use an internal mailserver so mail will still work on the secondary plane.
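For the outbound-only case, a minimal sketch of that would be pointing both nodes at an internal relay (the hostname below is a placeholder for whatever internal mailserver you actually run), so voicemail notifications still flow no matter which node is active:

```shell
# Route all outbound mail through an internal relay; the hostname is a
# made-up example, not a real server.
postconf -e 'relayhost = [mail.internal.example]'
postfix reload
```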

For lower-level access to one’s hardware, I always add a “management” interface, occasionally by necessity a VLAN if you are eth(n)-poor; you can use the DRBD link if you are careful.

We don’t tend to use aastra or hylafax…so I don’t see that being an issue.

As for the mail…we use postfix, but it’s only outbound for the voicemails that I know of…anything else that comes to mind?

And I do have one problem I’m running into in trying to figure out the whole cluster communication vs client (lcmc/external management) communication.

I’ve set up the cluster so that it uses eth1 for communication (and, respectively, the names given in /etc/hosts), but I cannot, for the life of me, get LCMC to see that pacemaker is running on the hosts. I keep getting “wait” on LCMC. Do you think this has anything to do with this configuration change?
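For reference, this is the sort of /etc/hosts layout I mean (all names and addresses below are made up): dedicated names for the eth1 cluster link, separate from the client-facing ones.

```
# /etc/hosts excerpt -- example names and addresses only.
# Client-facing addresses:
192.168.1.11   node1
192.168.1.12   node2
# Dedicated eth1 cluster-communication link:
10.10.10.1     node1-cl
10.10.10.2     node2-cl
```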

I suppose I can try using pcs only…but it’s been a real headache with the lack of documentation that seems to be out there…

Well, we differ: I use Debian, you use CentOS. I have for a long time let LCMC do the low-level driver installs, basically because it does it better than I did :slight_smile: and have never had a problem with it recognizing what it did. You seem to prefer another method.

As to documentation, you are doing a great job; me, I just do it off the top of my head each time.

Haha, really, I do not prefer another method. LCMC is really…really nice, it’s just giving me a headache right now! =)

pcs is NOT something I want to move to. I just cannot, for the life of me, figure out why LCMC doesn’t see pacemaker as running.

I am using the same or a similar configuration in testing some HA/FreePBX installs. I’m also not able to get LCMC to see the installation, although I’m fairly certain it’s done correctly. I’ll be installing some new VMs to test using the LCMC package installation, to see if that makes a difference.

Just curious, what hardware are you using? Are you using RAID at all?

You know…I figured it out about a day after I posted that…and it was such a simple stupid thing, too!

LCMC doesn’t like having two different names associated with the cluster nodes. Even though pacemaker/CMAN is running, LCMC will not accept the naming conventions from /etc/hosts and cluster.conf for the internal ethernet link if the hostname of the system is ANY different. So the key was making sure that I just ran “hostname” with the matching node name, and LCMC miraculously got rid of the “wait”!
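In other words, the running hostname has to match the node name used in cluster.conf and /etc/hosts exactly. On a CentOS 6-era box that’s roughly the following (the node name below is a placeholder for whatever your cluster.conf actually uses):

```shell
# 'node1-cl' is a placeholder for the node name from cluster.conf.
uname -n                          # the name LCMC compares against
hostname node1-cl                 # set the running hostname to match
# Persist it across reboots on CentOS 6:
sed -i 's/^HOSTNAME=.*/HOSTNAME=node1-cl/' /etc/sysconfig/network
```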

I actually posted this to the LCMC sourceforge issues page and closed the ticket as well.

As for hardware, one node is using RAID1 (two drives) and the other is just a straight single drive setup.

The nodes are both supermicro systems with Rhino T1 cards with the primary node having a Rhino passive failover installed as well.

Hey Los, what all do you have setup for services on LCMC?
I’m still muddling around trying to get my relocated directories, but that’s the next thing I’m wondering about.

The ‘tree’ that I created in LCMC was essentially this:

IPaddr2 (floating IP) -> Mounted drive <- DRBD/Linbit
                              |
                 Asterisk   MySQL   httpd (apache)

So…in case that doesn’t convey much: the floating IP and the LINBIT/DRBD resource start first, the shared drive is ordered next and colocated with them both, and then Apache, MySQL, and Asterisk are colocated with/ordered after it, with DAHDI following up after Asterisk.
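Expressed as pcs commands, that ordering/colocation would look roughly like this. The resource names here are made up; your DRBD master/slave resource, filesystem resource, and service resources will have whatever names you gave them:

```shell
# Hypothetical resource names: ClusterIP, ms_drbd (master/slave DRBD),
# fs_shared (the mounted drive), httpd, mysql, asterisk, dahdi.
pcs constraint order start ClusterIP then fs_shared
pcs constraint order promote ms_drbd then start fs_shared
pcs constraint colocation add fs_shared with ClusterIP INFINITY
pcs constraint colocation add fs_shared with master ms_drbd INFINITY

pcs constraint order fs_shared then mysql
pcs constraint order fs_shared then httpd
pcs constraint order fs_shared then asterisk
pcs constraint colocation add mysql with fs_shared INFINITY
pcs constraint colocation add httpd with fs_shared INFINITY
pcs constraint colocation add asterisk with fs_shared INFINITY

pcs constraint order asterisk then dahdi
```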

Oh…and do not forget STONITH devices if you have the means to use them.

You can now buy it from Schmooze for 1500 bucks if you hurry :wink:

Way back at the beginning of this thread, the ominous “slit brain” came up.

If you are a real dude, you will seriously mess with your system by pulling cables, unplugging hardware, resetting networks, etc. Call it a “stress test”, but sooner or later you will get slit; make sure you cover that in your split-brain recovery process. . .
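For the DRBD side of that recovery, the usual manual procedure (assuming DRBD 8.4 and a resource named r0, which are my assumptions, not something stated above) is to pick a victim node whose divergent changes you throw away:

```shell
# On the node whose data you are DISCARDING (choose carefully -- its
# divergent writes are lost). 'r0' is a placeholder resource name.
drbdadm disconnect r0
drbdadm secondary r0
drbdadm connect --discard-my-data r0

# On the surviving node (the one whose data you keep):
drbdadm connect r0      # only needed if it sits in StandAlone state
```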

“Slit Brain” is that some fascination with pornography?