High Availability?

izrunas · April 30, 2013, 8:54am

So there are a lot of posts in a variety of places relating to questions of clustering and high availability… but none are very current.

What have users of the FreePBX distro found to be good solutions for either clustering or creating high-availability systems?

I looked at this commercial solution: http://www.generationd.com/?target=HAAST and found it to be too expensive for what it appears to be.

And also at this “simple” solution: http://linuxnotes.us/archives/268 but I am concerned about how long ago it was written, and I share commenters’ concerns about the archaic “dump and restore” method of keeping the second server’s MySQL up to date.

On the other hand, creating a distributed call center environment looks even more appealing: http://www.tmcnet.com/tmc/videos/default.aspx?vid=5821&title=Asterisk-Powered+Distributed+Call+Centers#

So here I am seeking input from other FreePBX users. Which way makes the most sense given the current state of technology?

Thanks!

dicko · April 30, 2013, 3:48pm

I personally use pacemaker/corosync for the HA bit and DRBD/iscsi to hold the persistent “state” of the machine. The open-source project from Linbit:-

http://www.drbd.org/

and “Linux Cluster Management Console”

http://lcmc.sourceforge.net/

to set up and manage it with a pretty GUI.

Given those two sites I think all your questions are covered.

izrunas · April 30, 2013, 9:20pm

Although I must admit some hesitancy about DRBD (memories of many lost days trying to solve split brain issues)…

Will the use of DRBD allow for no interruptions to calls in progress on the server in the event of a failure/switchover? Or do all calls drop, but the system returns to use in 20 seconds?

Thanks.

dicko · April 30, 2013, 9:35pm

Asterisk is a B2BUA, so yes, the calls will drop. But in my experience, if you set it up right, the Primary/Secondary transition restores VoIP service in the time it takes for asterisk to start, a maximum of two or three seconds.

Split brain problems are largely installer induced and there are scripts for automatically resolving that state, but anyway corosync is better than heartbeat while talking to drbd in that area. As suggested you can always use iscsi if you remain leery.

mustardman · May 6, 2013, 5:16pm

Best solution I have seen so far is the high availability backup and restore. Relatively simple and gets the job done. Assuming it works as advertised…I haven’t tried.
http://www.freepbx.org/news/2010-05-30/high-availability-backup-and-restore

JessicaRabbit · May 7, 2013, 1:13am

For a brute force failure response I suggest identical backup hardware and disk imaging software (I use Acronis) which boots from CD. The system has to be taken down and a new DVD created periodically. In case of software or hardware failure, reboot one of the machines, pop in the Acronis CD, select restore, put the DVD in, and in a very few minutes you have a running system.

This is the low tech 95% solution. For those who need more, God bless you and good luck.

dicko · May 7, 2013, 3:12am

@JessicaRabbit

Maybe you can save some bucks and time: use the open-source MondoArchive. No cost, no downtime at all.

Have a cron job back up the system to a USB thumb drive that is ready to boot and do a bare-metal restore onto almost anything. An 8G one will be fine; you don’t need to back up the ephemeral and often large voicemail, monitor files and logs (those belong somewhere else).

Do another cron job to back up the image offsite, as you do for the above ephemeral stuff, for when your machine gets hit with gamma rays.

I find it an ideal tool for migrating a working system to a pre-built “real” HA system with only a few seconds of downtime.

los · August 28, 2013, 9:14pm

Hi Dicko,

Since you seem to have good success in getting pacemaker/drbd working with lcmc for FreePBX, I was wondering if you’d be able to answer a couple questions.

I know this is a bit of an old post, so I apologize.

Background:

We’ve recently realized that our backup situation for our asterisk server wasn’t going to cut it if there really was an outage and we would like to get something setup for HA. Because of this we purchased a Rhino single port failover card that essentially acts as a relay to send a T1 signal to one T1 card on one system or to another on another system if the first system fails. However, we realized that this doesn’t solve the problem of “what does it do when it gets there?” and I’ve been searching around for solutions on this matter.

This is why I thought the pacemaker/drbd/lcmc setup would be the best solution:

DRBD is the mirroring between the servers
pacemaker manages the handover and communication
lcmc makes the process MUCH simpler

Upon trying to get a couple test servers setup, I ran into a couple problems:

DRBD requires a shared logical volume and (forgive me for being obtuse - I’m only a Jr admin and may be way out of my league in even attempting this…but I’m doing whatever I can to get this feather in my cap) the FreePBX distro installs as nearly an embedded system in CentOS. When setting services for pacemaker to monitor, does it use the DRBD shared logical volume as a sort of swap space and mirroring space (like, if asterisk was running, does everything it depends on essentially copy into that LV and then get mirrored for the other system to copy into its normal filesystem via Pacemaker?) or is it ONLY the LV that gets mirrored and everything subsequently must be installed in that space?
— Or, to put it more simply…how would someone with a currently running production system setup DRBD to mirror to another system?
Did you have any problems in using CMAN instead of Corosync in LCMC? I don’t think I’m running into any…but there seems to be only CLI documentation on it and no information on how LCMC works with ‘pcs’ commands

Please forgive my ignorance if I sound rather ‘new’ to this. =) I really appreciate any help.

dicko · August 28, 2013, 9:49pm

The trick is to isolate (and then symlink) all of the “state” of the asterisk/mysql/http/logging/mail/tftp/cache/etc. services on the vanilla system to the drbd based filesystem. That arrangement should work flawlessly on the single machine, even without a working HA system, before you go further.
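
A minimal sketch of that isolate-and-symlink step might look like the following. The `/state` mount point, the `relocate_state` helper, and the directory list in the usage comment are assumptions for illustration only; check where your own installation actually keeps its state, and stop the affected services before moving anything.

```shell
#!/bin/sh
# Move one service's state directory onto the replicated filesystem and
# leave a symlink behind at the original path.
# STATE_ROOT is an assumed mount point for the DRBD-backed filesystem.
STATE_ROOT="${STATE_ROOT:-/state}"

relocate_state() {
    src="$1"                      # e.g. /etc/asterisk
    dest="${STATE_ROOT}${src}"    # e.g. /state/etc/asterisk

    # Already relocated: do nothing, so the function is safe to re-run.
    if [ -L "$src" ]; then
        return 0
    fi
    # Only real directories are handled here.
    if [ ! -d "$src" ]; then
        echo "skip: $src is not a directory" >&2
        return 1
    fi
    # Never overwrite something that already lives on the replicated side.
    if [ -e "$dest" ]; then
        echo "skip: $dest already exists" >&2
        return 1
    fi

    mkdir -p "$(dirname "$dest")" &&
    mv "$src" "$dest" &&
    ln -s "$dest" "$src"
}

# Hypothetical usage, with services stopped and /state mounted:
#   for d in /etc/asterisk /var/lib/asterisk /var/spool/asterisk /var/lib/mysql; do
#       relocate_state "$d"
#   done
```

Because the original paths still resolve through the symlinks, the services should not need any configuration changes, which makes it easy to confirm the single machine behaves normally before the second node is added.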

When you add the other machine, and arrange for corosync to handle the change of IP/gateway/arp as the services are stopped on one node and started on the other, the downtime will be a few seconds. (Calls will usually drop on TDM; use a SIP proxy behind your Asterisk and you probably won’t notice a thing.)

This works easily for network based phone service but as you notice, TDM/analog services are harder:-

Digium, Sangoma and Rhino solutions are basically physical “changeover relays”. Redfone and Xorcom do it with way more slickness, using either layer 2 or USB streams that can “redirect” almost instantaneously without bringing down the underlying T1 local loops.

There is a library of primitives in corosync to handle almost all the services you will need to monitor/switch.

los · August 28, 2013, 10:08pm

Alright, creating the symlinks should be fairly simple as long as I know what to look for…(and you kind of spelled that out for me with the specific directories, so thank you).

Our T1 setup is for the incoming PRI and our actual phone system is going to be all digital. Right now it’s not and it would have been extremely troublesome to work with this setup with the analog cards.

Correct me if I’m wrong, but this shouldn’t be too difficult to work around since we’ll be using it for system failures or direct shutdowns initially since the card isn’t doing much other than allowing for an input. However, if we are able to integrate the rhino switchover module into asterisk (for failures in the filesystem), I believe the config could be locally stored via symlink to the replicated system and have the asterisk service dependent upon the module starting first, correct?

Also…regarding the CMAN issue…I’m assuming you’ve had to do no fancy workarounds to make sure things are working as they should?

dicko · August 28, 2013, 10:12pm

(A small caveat: many commercial “licenses” are keyed to your hardware, so it sometimes helps to buy or install such services only after your base installation is working. MAC address spoofing, which helps with arp problems when switching, will often “fool” the licensing authorities into granting access to both machines without having to buy the services twice when you only use them once.)

dicko · August 28, 2013, 10:24pm

I seriously suggest you look into Xorcom, they have hardware solutions that cover TDM/FXO/FXS and even door relays, all will seamlessly migrate at the same time.

If all your connections are layer three then there will be no problems, recalcitrant layer 2 connections can bitch and whine for a while if the arp table changes, but they usually “get over it”, so no, I rely on the corosync primitives.

dicko · August 28, 2013, 10:31pm

I should have said “either” not “both”, as they are technically the same machine so double charging for one service on one machine I would consider onerous.

los · August 28, 2013, 11:50pm

Thanks for that heads-up. I hadn’t even considered if our service provider would have any awareness of our hardware. I don’t believe they will, but it’s definitely something to check on.

Also, thanks for the Xorcom info, I’ll consider that as well. If this turns out to be more of a headache than anticipated…do you think that something less “quick” might be the better solution or do you believe this is the best failover solution for Asterisk available today?

dicko · August 29, 2013, 12:43am

It is easy to do, costs only the redundant hardware, and is pretty much instantaneous. All other solutions I have seen have the same hardware requirements but are neither automatic nor instantaneous. So all in all it’s a no-brainer decision to me.

los · August 29, 2013, 1:08am

Fantastic. I’ll keep pushing forward then. Thanks for the help! Though, I might be back to ask a question or two if I run into roadblocks. =) Thanks!

dicko · August 29, 2013, 1:20am

Be aware of course that it still suffers from that “single point of failure”, i.e. your network connection. Look into perhaps bgp and a distributed SIP proxy for a more robust solution.

los · August 29, 2013, 9:11pm

Thanks again for all of your help Dicko. I’ve isolated what I believe to be the correct information and moved it into the DRBD device and symlinked them to the vanilla filesystem. This seems to work as advertised except for one small detail:

Unless I have the file system mounted at startup, I need to manually run DRBD, then mount it, then start the services. But, it looks like I could avoid that by mounting it via the UUID. However, DRBD doesn’t seem to be able to connect to the volume (in use?) in that situation. How did you do this type of filesystem acrobatics? (Again forgive me if I sound obtuse.)

dicko · August 29, 2013, 9:41pm

corosync and drbd should always be started. Between them they should take care of who is primary and mounted, and what services are started on the active system and in what order. Look at the filesystem primitive; it should do it all for you.

I strongly suggest you do NOT use UUID, even to the point of replacing them with good old fashioned /dev/sdx both in fstab and grub.
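
As a rough illustration of how the filesystem primitive ties in with DRBD, here is a sketch in crm shell syntax, written to a file for review rather than applied directly. Everything named in it is an assumption for the example: the DRBD resource `r0` on `/dev/drbd0`, the `/state` mount point, the floating IP `192.0.2.10/24`, the `ext4` filesystem type, and an init script called `asterisk`. It also assumes the crm shell is installed; a cluster managed through LCMC or `pcs` would express the same relationships differently.

```shell
#!/bin/sh
# Write a candidate Pacemaker configuration (crm shell syntax) to a file.
# All resource names, the device, mount point and IP are illustrative
# assumptions, not values from a real cluster.
cat > pbx-ha.crm <<'EOF'
primitive p_drbd ocf:linbit:drbd \
    params drbd_resource="r0" \
    op monitor interval="29s" role="Master" \
    op monitor interval="31s" role="Slave"
ms ms_drbd p_drbd \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
primitive p_fs ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/state" fstype="ext4"
primitive p_ip ocf:heartbeat:IPaddr2 \
    params ip="192.0.2.10" cidr_netmask="24"
primitive p_asterisk lsb:asterisk
group g_pbx p_fs p_ip p_asterisk
colocation c_pbx_with_drbd inf: g_pbx ms_drbd:Master
order o_drbd_before_pbx inf: ms_drbd:promote g_pbx:start
EOF

# After reviewing the file, it could be loaded on one cluster node with:
#   crm configure load update pbx-ha.crm
```

The group starts its members in order (mount, then IP, then asterisk) and stops them in reverse, while the colocation and order constraints keep the whole group on whichever node currently holds the DRBD primary role. With this in place nothing DRBD-related needs to appear in fstab at all.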

dicko · August 29, 2013, 9:50pm

If you are going in small steps and HA is “not yet” then in /etc/rc.local you can add for example

mount /dev/drbdN /state

to temporarily synthesise a working active HA system where drbdN is primary on that machine.