High Availability (HA) FreePBX 2020

dicko · August 21, 2020, 3:35pm

Cheapo suggestion: get two $5 droplets and a floating IP from DO (DigitalOcean). Move your DNS nameservers to DO.

Set up an A record for your PBX pointing at your floating IP, and SRV records for SIP over UDP and TCP, on the port of your choice, for both Adrop and Bdrop.
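As a rough sketch, the records described above might look like this in zone-file form. The domain example.com, the address 203.0.113.10, port 5060, the 3600 TTL and the 10/20 priorities are all placeholders, and it assumes adrop and bdrop each have their own A record pointing at the droplet's own address.

```shell
# Print zone-file style records for a floating-IP PBX with two droplets.
# Every name, address, port and priority here is a placeholder (assumption).
srv_record() {
    # usage: srv_record PROTO DOMAIN PRIORITY PORT TARGET
    printf '_sip._%s.%s. 3600 IN SRV %s 0 %s %s.%s.\n' "$1" "$2" "$3" "$4" "$5" "$2"
}

pbx_records() {
    # usage: pbx_records DOMAIN FLOATING_IP PORT
    domain=$1; floating_ip=$2; port=$3
    # The A record follows the floating IP, whichever droplet holds it.
    printf 'pbx.%s. 3600 IN A %s\n' "$domain" "$floating_ip"
    for proto in udp tcp; do
        srv_record "$proto" "$domain" 10 "$port" adrop   # tried first
        srv_record "$proto" "$domain" 20 "$port" bdrop   # tried second
    done
}

pbx_records example.com 203.0.113.10 5060
```

Giving the two targets different priorities is a design choice for this sketch only; equal priorities with weights would spread registrations instead.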

Install doctl, set up an API key, and initialize doctl with it:

cd /usr/local/bin;curl -sL https://github.com/digitalocean/doctl/releases/download/v1.46.0/doctl-1.46.0-linux-amd64.tar.gz | tar -xzv

cd ~
#have your API key handy and paste it
doctl auth init

(

extra credit section :-
You can add ‘completions’ to your shell rc file; for most folks that will be ~/.bashrc and/or ~/.bash_profile

source <(doctl completion bash)

or, if you use zsh (good idea IMHO), ~/.zshrc

source <(doctl completion zsh)

)

Get your droplet IDs and floating IP and remember them:

doctl compute droplet list
doctl compute floating-ip list

Assign that floating IP to one of the droplets, then install your FreePBX of choice and set it up as you like.

Make sure everything is working.

Snapshot the machine and use that snapshot to create the second droplet. Apart from the machine ID and the IP address, they will be identical.

When all done, test it with:

doctl compute floating-ip-action assign fl.oat.ing.ip  1000001
# you are using droplet 1000001
doctl compute floating-ip-action assign fl.oat.ing.ip  1000002
# what do you know, using the other one.
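The manual flip above could be wrapped in a small helper, sketched here. The function names and the DRY_RUN switch are made up for illustration; only the doctl subcommand comes from the commands above.

```shell
# Reassign the floating IP to a droplet. With DRY_RUN=1 the doctl command
# is only printed, which is handy for rehearsing the switch.
assign_floater() {
    # usage: assign_floater FLOATING_IP DROPLET_ID
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "doctl compute floating-ip-action assign $1 $2"
    else
        doctl compute floating-ip-action assign "$1" "$2"
    fi
}

# Given the droplet that currently holds the IP, print the other one.
other_droplet() {
    # usage: other_droplet CURRENT_ID ID_A ID_B
    if [ "$1" = "$2" ]; then echo "$3"; else echo "$2"; fi
}
```

For example, DRY_RUN=1 assign_floater fl.oat.ing.ip "$(other_droplet 1000001 1000001 1000002)" prints the command that would move the floater to droplet 1000002 without running it.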

Now you need to define what a ‘failure’ is. The easy case is that you can no longer ping the floater. But what if it responds but has gone rogue? Maybe Asterisk isn’t running, or maybe it has been penetrated and is making thousands of calls from Reykjavik to Palestine. This needs further discussion, no matter how you do high/medium/lowish availability.
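One way to start that discussion is to write the cases down as a classifier. This is only a sketch: the four categories, the function names and the call threshold are assumptions, not an agreed definition of failure, and the Asterisk probes would have to run on the node itself (or over ssh).

```shell
# Classify a node from three observations. Categories and threshold are
# illustrative assumptions.
classify_node() {
    # usage: classify_node PING_OK ASTERISK_OK ACTIVE_CALLS MAX_CALLS
    if [ "$1" -ne 1 ]; then
        echo down          # floater does not answer ping
    elif [ "$2" -ne 1 ]; then
        echo degraded      # host answers but Asterisk does not
    elif [ "$3" -gt "$4" ]; then
        echo suspicious    # far more calls than expected: possible abuse
    else
        echo healthy
    fi
}

# Probes that could feed classify_node; each prints a number.
probe_ping()     { ping -c 1 -W 2 "$1" >/dev/null 2>&1 && echo 1 || echo 0; }
probe_asterisk() { asterisk -rx 'core show version' >/dev/null 2>&1 && echo 1 || echo 0; }
active_calls() {
    # Reads the "N active call(s)" line of 'core show channels count'.
    asterisk -rx 'core show channels count' 2>/dev/null |
        awk '/active call/ {print $1; found=1} END {if (!found) print 0}'
}
```

What to do with a "suspicious" result is the hard part; moving the floater to an identical clone would not help if the clone shares the same weakness.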

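You also need to keep the machine without the floating address attached synced to the one that is active. Run on the standby,

rsync -a othermachine:/var/spool/asterisk/ /var/spool/asterisk/

will do that. The trailing slashes matter: without them rsync nests a second asterisk directory inside the existing one.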
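Since either droplet can end up as the standby, the pull could be guarded so it only runs on the machine that does not hold the floater. A minimal sketch, assuming the droplet metadata service answers "true" or "false" at the URL below (check that on your own droplet before relying on it); the function names are made up.

```shell
# Pull the spool only when this droplet is NOT holding the floating IP.
# Assumption: the metadata URL below returns "true" or "false".
METADATA_URL=http://169.254.169.254/metadata/v1/floating_ip/ipv4/active

should_pull() {
    # usage: should_pull ACTIVE_FLAG  (succeeds only for a confirmed standby)
    [ "$1" = "false" ]
}

pull_spool() {
    # usage: pull_spool OTHER_HOST
    # Trailing slashes copy the directory contents rather than nesting it.
    rsync -a "$1":/var/spool/asterisk/ /var/spool/asterisk/
}

sync_if_standby() {
    # usage: sync_if_standby OTHER_HOST
    # An empty or unexpected answer counts as "do not pull".
    flag=$(curl -s --max-time 5 "$METADATA_URL")
    if should_pull "$flag"; then pull_spool "$1"; fi
}
```

Calling sync_if_standby othermachine from cron on both droplets would then be harmless on whichever one is active.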

We can add backup and restore on an as-needed basis. Daily might work, but with the new notepad MySQL feature I would build a MySQL trigger to watch an injected ‘note’ set by the switch script.

We are left with the “state” of the machine, which lives in the sqlite3 astdb table. This is a little tricky, as Asterisk’s database access is written to be thread-safe but not multi-user safe. Ideally you would wrap that up in a "rasterisk -x ‘database query …’" export and a corresponding import, but given that complexity and the low likelihood of a ‘switch’, I personally just apologise for a missed CF or DND and blame Cloudflare.
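For anyone who does want to try the export/import route, here is a sketch that uses 'database show' and 'database put' rather than raw SQL, so Asterisk does the writes itself. It assumes the usual "/FAMILY/key : value" layout of 'database show', handles only the CF and DND families, and would break on values containing double quotes; the function name is made up.

```shell
# Turn 'database show' output into 'database put' CLI commands for the
# CF and DND families only. Assumes lines shaped like
#   /CF/1001        : 5551234
# and skips everything else, including the "N results found." footer.
astdb_to_puts() {
    awk '
        /^\/(CF|DND)\// {
            sep = index($0, ": ")
            if (sep == 0) next
            path  = substr($0, 2, sep - 2)     # drop the leading slash
            value = substr($0, sep + 2)
            gsub(/[ \t]+$/, "", path)          # strip column padding
            gsub(/[ \t]+$/, "", value)
            slash  = index(path, "/")
            family = substr(path, 1, slash - 1)
            key    = substr(path, slash + 1)
            printf "database put %s %s \"%s\"\n", family, key, value
        }'
}
```

On the active node, rasterisk -x 'database show' | astdb_to_puts would produce one command per line; on the standby each line could be fed back through rasterisk -x "$cmd".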

Whoops, I was going to paste a few lines, but it “Topsy’d”. In 2013 I was running 3 Proxmox boxes with zfs over glusterfs and dozens of FreePBI that would flawlessly move between any machine on command or on corosync-detected failure. Pros: it worked. Cons: it was expensive in maintenance time, power and hardware (and everything went to hell at 00:00 until the cronjobs were fixed, but that is another story). So by 2014 I had moved to this kinda solution. Pros: it’s cheap, switching takes a second or two, IP-authed trunks and domain-authed extensions won’t ever know, and total disaster recovery is never more than 20 minutes away. Cons: everything is in one DO datacenter.

So to harden all that, always have SRV pointing to something that works, try to find VSPs who will honor your SRV records for IP auth, and add BGP routing if possible.

(Another JM2CWAE)

arielgrin · September 3, 2020, 7:53pm

@mmoo9154 Dear Mark: I just have finished installing a FreePBX 15 Distro and manually creating a 2 node cluster. I have it working and have tested it with some simulated failures like stopping mysqld with mysqladmin shutdown, putting one node on standby, shutting down a node with shutdown now, but by no means it is a completely tested solution, as I didn’t simulate a “hard” failure, like disconnecting the power cable or the network cable. I’m not sure what would happen in those cases, because I don’t want to risk breaking the lab at this particular moment.

If you think it might be useful for you, I could compile some kind of step-by-step guide with the steps I followed. Just let me know, as it is quite a long list of steps.

Again, I have not tested this enough to guarantee a behavior that would be 100% acceptable for a production environment, but at least with my limited testing, the cluster behaved as expected without any occurrences of a split-brain situation.

dicko · September 3, 2020, 8:03pm

It is exactly “hard fails” that cause the schizophrenia: hard fail one, then the other, and when the one that failed first gets back online before the second one, you will wish you had a third node.
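The arithmetic behind wanting a third node, as a small sketch. This is the generic majority rule, not a description of how any particular cluster manager implements it, and the function names are made up.

```shell
# Majority quorum: a partition may act only if it can see more than half
# of the configured nodes (itself included).
quorum_needed() {
    # usage: quorum_needed TOTAL_NODES
    echo $(( $1 / 2 + 1 ))
}

has_quorum() {
    # usage: has_quorum VISIBLE_NODES TOTAL_NODES
    [ "$1" -ge "$(quorum_needed "$2")" ]
}
```

With two nodes, a lone survivor sees 1 of 2 and can never reach the required 2, so it cannot tell a dead peer from a dead link; with three nodes, any pair that can still see each other has a majority.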

arielgrin · September 3, 2020, 9:20pm

I have disabled stonith so a failed node should not be restarted.

TheWebMachine · September 5, 2020, 8:37pm

{shameless plug warning} We have an HA Clustering solution for our AWS FreePBX, if you’d like something more “ready-made.” Automatic failure detection and fail-over support, near real-time sync between nodes, all leveraging AWS services (EFS, RDS, Elastic IP), Unison, and our own custom cluster monitoring agent. We don’t charge extra for HA and you’d have our support team available to you.

PitzKey · September 6, 2020, 10:09am

If you are looking for hot-hot high availability, AFAIK there isn’t any.
There isn’t a system that can recover live calls on the secondary node, or instantly sync queue members, voicemails and other non-system-admin stuff that is constantly changing.

We have been using the Warm Spare solution:

FreePBX 14 and older - https://wiki.freepbx.org/display/F2/Warm+Spare+Setup
FreePBX 15 - https://wiki.freepbx.org/pages/viewpage.action?pageId=185631299

It works great but, as mentioned, when failover happens queue logins are messed up and voicemails left during failover are an issue.

Keep in mind that some phone brands, like Yealink, keep a constant registration to both SIP servers and only process calls on the secondary once the primary stops responding, so failover is almost instant. (Technically, incoming calls will work on both servers all the time.) Whereas if you use Sangoma phones, they only start the registration process to the secondary once the primary isn’t responding, and it usually takes longer until all phones are up on the secondary.

The new HA from Sangoma works very similarly to the Warm Spare setup, and I’m quite disappointed that it is missing instant synchronization of mid-day user changes.

I think that a warm spare would work awesome together with what @arielgrin mentioned, and I am also willing to test and contribute if I can to get it working.

arielgrin · September 7, 2020, 8:07pm

After simulating several “soft” failures with my 2-node cluster, I was confident enough to reconfigure the cluster to a 3-node cluster, so I’m now in the process of reading how to enable quorum and stonith.

dicko · September 8, 2020, 6:49am

Look at glusterfs for your ‘set it and forget it, we are all masters’ data storage in a 2+ way cluster; it’s far more robust than drbd can ever be.

arielgrin · September 8, 2020, 10:35am

Thanks for the tip @dicko
In my case, I would be using the FS over drbd anyway. Do you think there is a difference between GFS and EXT4 when using drbd? Or were you suggesting maybe GFS on a NAS ?

dicko · September 8, 2020, 2:13pm

I was happiest with Proxmox over gfs over zfs. It gives great flexibility with good ‘self-healing’ properties and is blazingly fast on a LAN. gfs’ geo-replication is also comforting, as my LAN was in Southern California. Luckily I never needed it, but it works well.

Were I to do it over, I would probably use Google or Amazon for storage.

arielgrin · September 8, 2020, 2:25pm

In my case it was a proof of concept and an exercise for possible future implementations, none of my current clients are willing to pay for a second node, let alone a third.

dicko · September 8, 2020, 2:36pm

Indeed, a $5/month VM with a $1/month backup gives about a 15-minute time to recover. The ability to grow them or easily move them around the world (offline) makes me very happy. The provider I chose is better than four nines; the few hard restores have been PEBKAC. I use inotifywait on all my /var/spool/asterisk/'s and rsync all changes to one bigole machine.
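A minimal sketch of that inotifywait plus rsync arrangement, not the actual script in use. The host name bigole, the /backups layout and the function names are placeholders, and it needs the inotify-tools package.

```shell
# Watch the spool and push changes to one central machine.
# "bigole" and the /backups layout are placeholder names (assumptions).
SPOOL=/var/spool/asterisk

dest_path() {
    # usage: dest_path NODE_NAME DEST_HOST
    echo "$2:/backups/$1/asterisk/"
}

watch_and_push() {
    # usage: watch_and_push NODE_NAME DEST_HOST
    dest=$(dest_path "$1" "$2")
    # One rsync per event; rsync only transfers what actually changed.
    # Deletions are deliberately not propagated to the backup copy.
    inotifywait -m -r -q -e close_write,create,move "$SPOOL" |
    while read -r _event; do
        rsync -a "$SPOOL/" "$dest"
    done
}
```

Running watch_and_push pbx1 bigole under a service manager would keep a per-node copy under /backups/pbx1/asterisk/ on the central machine; a busy spool may want some batching rather than one rsync per event.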

arielgrin · September 8, 2020, 2:47pm

Unfortunately here in Argentina, it is not common to host your “own” VoIP server on the cloud. Clients who would like to do that, would buy it as SaaS directly from a telco provider.

billsimon · September 8, 2020, 3:13pm

I often see people join discussions about AWS, Azure and other cloud platforms and note that they are more expensive than other popular VPS services, but this – being able to leverage the real “cloud” infrastructure, and not just virtual machines – is where it pays off.

TheWebMachine · September 8, 2020, 6:26pm

Oh sure, you can do it on the cheap…if you’re willing and able to build it all from scratch yourself and comfortable cutting a few corners along the way (libA conflicts with libB on osC so we used libX instead, and so on). However, if you’re in a line of business other than building and maintaining servers all day, you might be fine spending the couple extra dollars on something that’s stable and just works out of the box, built by experts who already did all the work for you, without compromise, on an infrastructure designed for the task. That’s where we come in…and we include fast industry-leading support for that higher price of admission.

Alas, to each their own. Options are good.
