There are a number of very old links from 2010 and 2013 talking about creating a high availability FreePBX cluster. But, they all basically say they don’t work with FreePBX 14 or 15 or the SNG7 based image.
Does anyone out there have an HA configuration on FreePBX 15 or 14 working today? If not, OMFG!!
If you do, or if you have heard of one, please tell us about your configuration and your experiences with it.
I’d love to find a solution that would fail over quickly and would not drop existing calls. If that’s not possible, I’d settle for a configuration that could detect failure and fully cut over in 5 seconds or less (ideally much less).
Have you considered building a high availability setup using this module in part: Announcement: New Advanced Recovery module for FreePBX ?
It may be obvious, but the PBX itself cannot be the only contributor to a high-availability solution. A robust network, servers using VMware or a versatile cloud setup, phones that can be configured to have failover proxies or SRV records, etc., all contribute to the solution.
I’m in the process of manually setting up a Corosync-Pacemaker-DRBD HA cluster for FreePBX 15 Distro. Of course this is just a lab right now, but I can share my findings with you if you are willing to try. It is a very involved process, not a case of just clicking NEXT on a graphical interface.
Just a heads up: some of us did all this several years ago (some didn’t charge $thousands, and the recipes are still ‘out there’), BUT if you don’t include a third node for your ‘quorum’, which can easily be a Pi Zero, sooner or later you will go ‘split-brained’ to the point of ‘very hard to recover’.
(BTDT, and with the advent of $5/month cloud based systems it’s hard to justify this route.)
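For anyone wondering why the third node matters, here is a minimal sketch of the majority rule behind quorum. The function name and node counts are illustrative only; this is not Corosync's actual implementation, just the arithmetic it relies on.

```python
def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    """Return True when this partition sees a strict majority of cluster votes.

    reachable_nodes counts the node itself plus every peer it can still reach.
    """
    return reachable_nodes > total_nodes // 2


# Two-node cluster, link between the nodes fails: each side sees only itself.
two_node_side = has_quorum(1, 2)

# Three-node cluster (two PBX nodes plus a small tie-breaker), one node isolated.
majority_side = has_quorum(2, 3)
isolated_node = has_quorum(1, 3)

print(two_node_side, majority_side, isolated_node)
```

With two nodes, neither side can ever hold a majority after a link failure, so the cluster either stops or ignores quorum and risks both nodes going active. With three votes, exactly one side keeps the majority and the isolated node stands down.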
@dicko, I was going to mention in my OP that you had done this in 2010 and mentioned it again in 2013. It just wasn’t clear that this was still the best approach.
Your tip to set up 3 nodes makes complete sense.
Absolutely clear, @billsimon. But, having an HA instance of Asterisk/FreePBX is a kinda essential piece of the puzzle.
I’d love to hear more about the other pieces it sounds like you have figured out. I encourage you and anyone else to post info about real life HA configurations you have in place.
It worked in 2010, it still worked in 2013, and it would still work in 2020, BUT two-node HA will never work robustly, and ten years later there are so many better ways using SRV and real proxies. A B2BUA can never be really HA: every call will drop on failure. That’s just reality.
Best you can do is
A) choose SRV for your phones’ registration
B) provide all SRV served services (can even be FreePBX)
C) arrange for all SRV peers to sync status when they can
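As a rough illustration of how a phone might walk a set of SRV answers, here is a simplified sketch of the priority/weight ordering described in RFC 2782. The hostnames, priorities, and weights are made up for the example, and real phones vary in how faithfully they implement the weighting.

```python
import random
from typing import List, NamedTuple, Optional


class SrvRecord(NamedTuple):
    priority: int  # lower value is tried first
    weight: int    # relative share among records of equal priority
    port: int
    target: str


def order_srv(records: List[SrvRecord],
              rng: Optional[random.Random] = None) -> List[SrvRecord]:
    """Return records in the order a client would try them (simplified)."""
    rng = rng or random.Random()
    ordered = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        while group:
            weights = [r.weight for r in group]
            if sum(weights) == 0:
                pick = rng.choice(group)  # all weights zero: plain random pick
            else:
                pick = rng.choices(group, weights=weights, k=1)[0]
            ordered.append(pick)
            group.remove(pick)
    return ordered


# Hypothetical answers for _sip._udp.example.com
answers = [
    SrvRecord(20, 0, 5060, "pbx2.example.com"),  # standby
    SrvRecord(10, 0, 5060, "pbx1.example.com"),  # primary
]
attempt_order = [r.target for r in order_srv(answers)]
print(attempt_order)
```

The phone registers to the first target and only moves down the list when that one stops answering, which is what makes (A) through (C) hang together.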
@arielgrin, you read my mind. Based on a very old post from @dicko, I was going to go that path. I’d love to hear how it went.
I do not need a “click next” GUI. I’m just hoping to hear from actual implementations that can take a catastrophic failure and keep on chugging.
In my experience, if it hasn’t been tested/tried, it doesn’t work. Do the phones keep working when you sudo shutdown now the VM? I’m hoping someone out there (looking at you, @dicko) can confirm they’ve done this. I’d love to hear what the failover recovery looks like in real life.
When we’re done with this thread, we should use it to update the pathetic FreePBX High Availability section in the documentation. That’s where tips like dicko’s “use a three node cluster unless you want split-brain hell” should go, imho.
Sorry for the st00pid question, but what does SRV refer to? Is that the configuration server? Or is that the SIP server (aka Asterisk)?
I do understand that the calls are going to drop, especially if the connection is mediated by a central server. Of course, I always thought VoIP/SIP was intentionally designed to keep the server out of the loop except for directory services and optional call status/control. The RTP was supposed to flow directly between endpoints in order to minimize latency and the load on the SIP server.
What’s more interesting is how long it actually takes for the system to recognize failure and cut over to one of the standby servers.
Also, I’d LOVE to hear what kind of heartbeat/system integrity checks are used IRL.
We had a situation where the FreePBXHosting colo facility on the west coast had a network issue that increased ping latency from 30ms to over 150ms and dropped about 15% of the packets. It made the phone system sound like crap, but the system was never cut out because it never dropped 2 packets in a row.
It would be cool if there was a corosync-pacemaker check that measured call quality.
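To make that idea concrete, here is a minimal sketch of a sliding-window check that looks at loss ratio and average round-trip time instead of waiting for consecutive misses. The class name and thresholds are my own assumptions, and it is not wired into Corosync or Pacemaker; it only shows the decision logic.

```python
from collections import deque
from typing import Optional


class LinkHealth:
    """Track recent probe results; a sample of None means the probe was lost."""

    def __init__(self, window: int = 20, max_loss: float = 0.05,
                 max_rtt_ms: float = 100.0) -> None:
        self.samples = deque(maxlen=window)
        self.max_loss = max_loss      # assumed threshold, tune for your link
        self.max_rtt_ms = max_rtt_ms  # assumed threshold, tune for your link

    def record(self, rtt_ms: Optional[float]) -> None:
        self.samples.append(rtt_ms)

    def loss_ratio(self) -> float:
        if not self.samples:
            return 0.0
        return sum(1 for s in self.samples if s is None) / len(self.samples)

    def avg_rtt_ms(self) -> float:
        replies = [s for s in self.samples if s is not None]
        return sum(replies) / len(replies) if replies else float("inf")

    def healthy(self) -> bool:
        if len(self.samples) < self.samples.maxlen:
            return True  # not enough data yet to judge
        return (self.loss_ratio() <= self.max_loss
                and self.avg_rtt_ms() <= self.max_rtt_ms)


def max_consecutive_losses(samples) -> int:
    """Longest run of lost probes, i.e. what a 'two in a row' rule looks at."""
    longest = run = 0
    for s in samples:
        run = run + 1 if s is None else 0
        longest = max(longest, run)
    return longest


# Roughly the incident above: 150 ms replies, every sixth probe lost.
link = LinkHealth()
for i in range(20):
    link.record(None if i % 6 == 5 else 150.0)

print(max_consecutive_losses(link.samples), link.loss_ratio(), link.healthy())
```

In this made-up trace the longest run of lost probes is one, so a "two in a row" rule never fires, while the window check flags the link on both loss and latency.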
I will repeat, sotto voce, that FreePBX HA was a steal from those 10 year old posts. It cost lots from them and never worked, and I never heard of any rebate program . . . .
If you want to go HA with anything, it is not technically hard but committing everything to a two node DRBD will eff you up sooner or later
@dicko, who has a full cloud PBX with six sigma SLA for $5/mo?
SRV records (a DNS record type) help here: when your extension looks for the SIP server, the SRV lookup returns the best place to go. This is best used for HA.
DigitalOcean and Vultr come close. Care to share your suggestion?
I’m thinking about going with 4 nodes: 2 cloud based, and a couple of VMs on my in-house hardware in Arizona and Alabama, since I get those for free. I figure the likelihood of 2 different cloud hosts dying at once is zero. The 2 local copies would be there to keep out of split-brain hell. Does that make sense to you?
Then that’s good. Use a real SIP proxy like Kamailio in front of your FreePBX and you are getting close.
DO and Vultr have VM images for that kind of price, but they don’t have PBX as a service (that I know of).
If I understand you, this is exactly what I want to implement. DO is one of the VM hosts I use.
D’oh! Wasn’t thinking DNS records. Totally get what you’re saying.
Vultr does allow importing an ISO, and DO allows you to snake in an OVA. Once you snake that effer in, you can clone from it.
If you want real DNS manipulation, look at moving your nameservers to DigitalOcean and installing doctl.
My experience is that A records get to 188.8.131.52 closely without the TTL time you set