New PIAF Setup Having Issues

After searching the forum and coming up empty-handed, I’m reaching out to the community in the hope that you can help.

My organization hired an Asterisk integrator to help implement a VoIP solution in our environment on a compressed timeline. Unfortunately, we are experiencing endless headaches with the new system and receiving little concrete feedback from the integrator. I’m turning to the forum to bypass them, as our confidence in them is gone.

Because we relied on the integrator, I’m not fully fluent in PIAF, FreePBX, or Asterisk, so please forgive my ignorance. I am, however, in need of help. I am a very experienced Linux and network engineer, so with your help I should be able to get up to speed quickly.

Here’s the gist of it (two-server cluster):

Each server spec:
1 x AMD Opteron™ Processor 6136
8 GB of memory
HP ProLiant DL385 G7

Heartbeat Version = 2.1.4
DRBD Links Version = 8.3.8

PBX in a Flash Version = 1.7.5.6
FreePBX Version = 2.8.1.4
Running Asterisk Version = 1.8.4.1
Asterisk Source Version = 1.8.4.1
Dahdi Source Version = 2.4.1.2+2.4.1
Libpri Source Version = 1.4.11.5
Operating System = CentOS release 5.6 (Final)
Kernel Version = 2.6.18-238.9.1.el5 - 64 Bit

Separate QueueMetrics cluster running QueueMetrics 1.7.1

The system “crashes” sporadically when changes are applied via the FreePBX WebUI. After clicking Apply Changes and then Continue with Apply, the WebUI hangs. Existing calls are dropped and new calls are not established; even dialing *65, mapped to “Speak your extension number,” doesn’t work. This condition persists until a forced restart is performed at the CLI as follows (I took these from root’s shell history on our PBX; I haven’t run these commands myself):

amportal stop
amportal kill
amportal start

Based on my review, the system isn’t generating core dumps when it crashes, even though Asterisk is running with -vvvg (the g flag asks Asterisk to dump core on a crash).
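One thing worth checking: -g only asks for a core dump, and the kernel still writes nothing if the process’s core file size limit is 0. Below is a minimal Python 3 sketch (not part of PIAF or Asterisk) that reads the soft limit from /proc/<pid>/limits for a running process. It assumes your kernel exposes that file; older kernels such as stock 2.6.18 may not, in which case `ulimit -c` in the shell that launches Asterisk tells you the same thing.

```python
import sys
from pathlib import Path
from typing import Optional

CORE_PREFIX = "Max core file size"

def core_soft_limit(limits_text: str) -> Optional[str]:
    """Return the soft 'Max core file size' field from /proc/<pid>/limits text, or None."""
    for line in limits_text.splitlines():
        if line.startswith(CORE_PREFIX):
            fields = line[len(CORE_PREFIX):].split()
            return fields[0] if fields else None
    return None

def core_dumps_allowed(limits_text: str) -> bool:
    """True if the soft limit permits a non-empty core file ('unlimited' or > 0 bytes)."""
    soft = core_soft_limit(limits_text)
    if soft is None:
        return False
    return soft == "unlimited" or int(soft) > 0

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 corecheck.py <asterisk pid>   (corecheck.py is a hypothetical name)
    text = Path(f"/proc/{sys.argv[1]}/limits").read_text()
    print("soft core limit:", core_soft_limit(text))
    print("core dumps allowed:", core_dumps_allowed(text))
```

If the soft limit comes back as 0, raising it (for example with `ulimit -c unlimited` in whatever starts Asterisk) is the first thing to try before expecting a core file.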

This hadn’t happened in the last 6-7 times we applied the configuration; before that, it happened three or four times. We had resigned ourselves to coordinating minor changes with our integrator, but they made an unannounced change today that brought our 24x7 call center down during peak hours. Needless to say, the integrator’s approach has left me with low confidence in PIAF’s stability.

I’m completely in the dark and looking to be enlightened. Can anybody suggest where I might start, and how I might move forward cautiously? I’m very much an RTFM guy, but the owner of our company has clearly directed me to get this resolved ASAP, and I’m not that fast a reader.

Thanks in advance for any assistance you may be able to provide.

I can’t help you, but I suggest you start a thread on the PIAF forum as well. http://pbxinaflash.com/forum

Lorne

Thank you for the suggestion, I’m registering now.

During a talk with my integrator, I learned that Asterisk isn’t actually crashing; SIP peers are losing their registration. I can’t find any references to failed peer connections, but the following errors exist, which our integrator believes may be the cause:

[2011-08-25 11:39:39] WARNING[8783] res_musiconhold.c: chdir() failed: Permission denied
[2011-08-25 11:39:40] WARNING[32593] config.c: Unknown directive ‘#callevents=yes’ at line 1 of /etc/asterisk/sip_general_custom.conf

Any thoughts?

Are you using FOP2? If so, you would have “callevents=yes” but you should not have “#callevents=yes”. Looks like someone used the wrong character to comment out the line.
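In Asterisk config files, comments start with ;, while a line beginning with # is parsed as a directive such as #include. That is why #callevents=yes produces an “Unknown directive” warning instead of being ignored as a comment. As a rough pre-flight check before a reload, here is a small Python 3 sketch that flags # lines that aren’t a known directive. The KNOWN_DIRECTIVES set (#include and #exec) is an assumption; extend it if your Asterisk version accepts others.

```python
import re
import sys
from pathlib import Path
from typing import List, Tuple

# ASSUMPTION: the '#' directives your Asterisk build accepts. #exec also requires
# execincludes=yes in asterisk.conf. Extend this set if your version supports more.
KNOWN_DIRECTIVES = {"include", "exec"}

def find_bad_directives(conf_text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) for '#' lines that are not a known directive."""
    problems = []
    for lineno, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.strip()
        if not line.startswith("#"):
            continue  # ordinary setting, [section] header, or ';' comment
        match = re.match(r"#(\w+)", line)
        name = match.group(1).lower() if match else ""
        if name not in KNOWN_DIRECTIVES:
            problems.append((lineno, raw))
    return problems

if __name__ == "__main__":
    # Usage: python3 checkconf.py /etc/asterisk/*.conf   (checkconf.py is a hypothetical name)
    for path in sys.argv[1:]:
        for lineno, raw in find_bad_directives(Path(path).read_text(errors="replace")):
            print(f"{path}:{lineno}: unknown directive (use ';' to comment out): {raw}")
```

Run against the file from the log above, it would report line 1 of sip_general_custom.conf; the fix is either `callevents=yes` (if FOP2 needs it) or `;callevents=yes` (if you really meant to comment it out).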

The problem with SIP peers unregistering can be caused by losing access to DNS.
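If you want to rule DNS in or out with data rather than reasoning, one low-effort check is to time the lookups for whatever hostnames your SIP peers and trunks point at, from the PBX itself, around the time changes are applied. A minimal Python 3 sketch using the system resolver via socket.getaddrinfo; the hostnames are yours to supply, and port 5060 is only the conventional SIP port passed as a hint.

```python
import socket
import sys
import time
from typing import Optional, Tuple

def resolve_timed(host: str, port: int = 5060) -> Tuple[Optional[str], float]:
    """Resolve host with the system resolver; return (first address or None, seconds taken)."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_DGRAM)
        addr: Optional[str] = infos[0][4][0]
    except socket.gaierror:
        addr = None  # lookup failed; anything registering by this hostname is likely affected
    return addr, time.monotonic() - start

if __name__ == "__main__":
    # Usage: python3 dnscheck.py sip.example.com   (script name and hostname are hypothetical)
    for host in sys.argv[1:]:
        addr, elapsed = resolve_timed(host)
        print(f"{host}: {addr or 'FAILED'} in {elapsed:.3f}s")
```

Slow (multi-second) or failed lookups that line up with the outages would point toward DNS; consistently fast ones make it unlikely.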

Thanks for the reply!

We are using FOP2; they’ve removed the misconfiguration and set the option in FreePBX instead. I don’t believe the loss of registration is DNS-related, because the system is fully stable until we make changes in FreePBX. No issues occur unless we make changes, and even then the system doesn’t consistently fail during changes.

I spoke with our integrator today, and they are calling the callevents syntax error a smoking gun: they believe it causes Asterisk to hang when it attempts to parse the invalid configuration, and that while Asterisk is hung on that directive, its SIP registrations time out. No DNS issues exist; we use our three main DNS servers, and if there were problems there, they would show up in more than just the phones.

We will be cleaning up the log errors this Monday, when we apply the configuration again during the wee hours. Unfortunately, we won’t know right away whether it’s fixed, since the issue is intermittent. Any further guidance is appreciated!

So, we’ve taken some steps toward stability, hopefully. I made the following changes to eliminate some concerning errors in the logs:

  • Removed #callevents=yes from sip_general_custom.conf
  • Removed echo cancellation from chan_dahdi.conf (we’re using a Redfone, which performs hardware echo cancellation; this was done because of warnings in the logs)
  • Enabled a module that was generating errors in the logs
  • Ran chmod 775 on /var/lib/asterisk/mohmp3 and /var/lib/asterisk/mohmp3/moh to prevent permission errors in the logs
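On the res_musiconhold chdir() warning: chdir() into a directory needs search (execute) permission on that directory and on every parent directory for the user Asterisk runs as, which is why the chmod 775 made the warning go away. To check a path without changing modes blindly, here is a minimal Python 3 sketch of the classic owner/group/other test. It assumes Asterisk runs as a user named asterisk (verify this on your system) and ignores ACLs and SELinux.

```python
import grp
import os
import pwd
import stat
import sys
from pathlib import Path
from typing import List, Set, Tuple

def mode_allows_search(mode: int, owner_uid: int, owner_gid: int, uid: int, gids: Set[int]) -> bool:
    """Classic Unix check: does uid (in groups gids) hold the x bit that applies to it?"""
    if uid == 0:
        return True  # root bypasses directory search checks
    if uid == owner_uid:
        return bool(mode & stat.S_IXUSR)  # owner class decides; group/other not consulted
    if owner_gid in gids:
        return bool(mode & stat.S_IXGRP)  # group class decides; other not consulted
    return bool(mode & stat.S_IXOTH)

def blocked_components(path: str, uid: int, gids: Set[int]) -> List[str]:
    """Return each directory from / down to path (inclusive) that uid cannot search."""
    target = Path(path).resolve()
    blocked = []
    for directory in [*reversed(target.parents), target]:
        st = os.stat(directory)
        if not mode_allows_search(st.st_mode, st.st_uid, st.st_gid, uid, gids):
            blocked.append(str(directory))
    return blocked

def user_ids(user: str) -> Tuple[int, Set[int]]:
    """Look up the uid and all group ids (primary plus supplementary) for a user name."""
    pw = pwd.getpwnam(user)
    gids = {pw.pw_gid} | {g.gr_gid for g in grp.getgrall() if user in g.gr_mem}
    return pw.pw_uid, gids

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 permcheck.py /var/lib/asterisk/mohmp3   (permcheck.py is a hypothetical name)
    uid, gids = user_ids("asterisk")  # ASSUMPTION: Asterisk runs as user "asterisk"
    for path in sys.argv[1:]:
        print(path, "->", blocked_components(path, uid, gids) or "searchable")
```

Note that 775 also grants group write; if Asterisk only needs to read the music files, a mode such as 755 may be enough.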

Per our integrator’s theory, the first change was the most significant and may have resolved the stability issues when applying the configuration. I was able to make and apply changes four times without an outage. I’m only cautiously optimistic, though: we were able to make six changes previously and then experienced an outage on the seventh. Any further thoughts are appreciated!

Quote:
“remove #callevents=yes from sip_general_custom.conf”

According to the FOP2 install directions for PIAF, this line is required, though it should not have the # character at the beginning. I am not sure what will happen without it. More detail here: http://pbxinaflash.com/forum/showthread.php?t=6890

Lorne

First off, we added callevents to the SIPSettings module in FreePBX 2.9, so enable Call Events there. You need it enabled if you want to see hold events in FOP2 or iSymphony.

Second, I have verified that putting an invalid callevents line into sip.conf does not cause Asterisk to hang on reload; it ignores what it cannot parse.

I think you have a much bigger issue. Heartbeat and DRBD have been known to cause a lot of Asterisk reload issues for FreePBX support customers. We at FreePBX have spent hours and hours on this and have never gotten DRBD to run stably.

Tony,

Thanks for the reply and a potential concrete explanation of the cause. If DRBD and Heartbeat don’t jibe, how is an HA Asterisk implementation achieved? My preference was to do synchronized configuration rather than replicated storage; that still provides a failover option, and even if the configs drift apart, the phones would still ring. Do you know whether anyone has isolated the conditions under which a cluster does or doesn’t hang when DRBD and Heartbeat are in play?
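On the synchronized-configuration approach: whatever you use to copy configs between nodes (rsync, for example), it helps to be able to show the two nodes actually match before you rely on a failover. Here is a minimal Python 3 sketch that builds a SHA-256 manifest of the *.conf files under a directory such as /etc/asterisk and lists files that differ or exist on only one node. How the manifest gets from one node to the other (scp, a shared path) is left as an assumption. It only compares files; it does not sync anything, and it does not cover the FreePBX MySQL database the .conf files are generated from.

```python
import hashlib
import json
import sys
from pathlib import Path
from typing import Dict, List

def manifest(root: str, pattern: str = "*.conf") -> Dict[str, str]:
    """Map each matching file (path relative to root) to its SHA-256 hex digest."""
    base = Path(root)
    return {
        str(p.relative_to(base)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob(pattern))
        if p.is_file()
    }

def drift(local: Dict[str, str], remote: Dict[str, str]) -> List[str]:
    """Return paths whose digests differ or that exist on only one side."""
    return sorted(k for k in local.keys() | remote.keys() if local.get(k) != remote.get(k))

if __name__ == "__main__" and len(sys.argv) > 1:
    # Node A: python3 confdrift.py /etc/asterisk > a.json   (confdrift.py is a hypothetical name)
    # Node B: python3 confdrift.py /etc/asterisk a.json     (compare against node A's manifest)
    mine = manifest(sys.argv[1])
    if len(sys.argv) > 2:
        theirs = json.loads(Path(sys.argv[2]).read_text())
        for path in drift(mine, theirs):
            print("DIFFERS:", path)
    else:
        print(json.dumps(mine, indent=2))
```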

Thanks!

Tony,

I’ve been searching the web and can’t find documentation suggesting that Linux-HA with DRBD is unstable for PIAF/FreePBX. Can you point me in the right direction? Thanks!

If you have even minimal experience with Linux, you might consider firing the integrator and installing PBX In A Flash yourself. Its website includes a lot of well-written, easy-to-understand documentation, and I suspect that a fresh install would solve your troubles (or help you figure out where in the configuration process things went wrong).

Alternatively, you could buy a very nice system from Schmooze.com, or try the FreePBX Distro, which is available for download.

P.S. I don’t work for Schmooze.com and don’t have a financial relationship with them, but they are involved in the development of FreePBX.

Finally, it sounds like you need more assistance than you are going to get via a forum. Have you considered FreePBX’s paid support? PBX In A Flash also offers paid support on its website.