Asterisk crashes every 3-5 days - related to queue application?

inc0gnito · May 2, 2020, 3:59am

We’re running FreePBX 14.0.13.6 with Asterisk 13.29.2 in Device and User mode.
Single Xeon 4110 machine w/ 32GB of RAM. Debian 9.9.
Standalone FPBX & Asterisk install from source.

On average, about 60 chan_sip devices online at any one time, 15-40 concurrent channel usage.
One outlier for us is that we have approximately 180 queues set up on this instance.

Every 3-5 days (depending on total call volume or, alternatively, simply on time elapsed), Asterisk “freezes” and becomes unresponsive. Unlike all our other installs over the years, this install exhibits the following issues:

Over the course of a day, no matter what, the manager stasis task processor repeatedly hits 3000 scheduled tasks. The warning recurs at intervals ranging from every 5 minutes down to every 5-10 seconds.

WARNING[99873]: taskprocessor.c:1110 taskprocessor_push: The ‘stasis/m:manager:core-00000006’ task processor queue reached 3000 scheduled tasks again.
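One way to quantify how quickly these warnings are accelerating is to measure the gap between consecutive occurrences per task processor. The following is a minimal sketch, not a tested tool. It assumes the log lines carry a `[YYYY-MM-DD HH:MM:SS]` prefix (that is, `dateformat=%F %T` in logger.conf) and that the log lives at /var/log/asterisk/full; both are assumptions, and the function name `warning_gaps` is purely illustrative.

```python
import re
import sys
from collections import defaultdict
from datetime import datetime

# ASSUMPTION: log lines start with "[YYYY-MM-DD HH:MM:SS]".
# Adjust the pattern and strptime format if logger.conf uses another dateformat.
WARNING_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\].*"
    r"taskprocessor_push: The ['‘](?P<name>[^'’]+)['’] task processor "
    r"queue reached (?P<depth>\d+) scheduled tasks"
)


def warning_gaps(lines):
    """Return {task processor name: [seconds between consecutive warnings]}."""
    last_seen = {}
    gaps = defaultdict(list)
    for line in lines:
        match = WARNING_RE.search(line)
        if match is None:
            continue
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        name = match.group("name")
        if name in last_seen:
            gaps[name].append((ts - last_seen[name]).total_seconds())
        last_seen[name] = ts
    return dict(gaps)


if __name__ == "__main__":
    # ASSUMPTION: default log location; pass another path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/asterisk/full"
    with open(path, errors="replace") as handle:
        for name, seconds in warning_gaps(handle).items():
            if seconds:
                print(name, "warnings:", len(seconds) + 1,
                      "min gap:", min(seconds), "max gap:", max(seconds))
```

If the gaps shrink steadily over the days before a freeze, that would suggest the manager task processor is falling progressively further behind rather than hitting isolated bursts.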

In the hour or so before Asterisk begins crashing, the log fills with the following entry every second:

chan_sip.c: Autodestruct on dialog ‘[email protected]:5060’ with owner SIP/digium-gw-000030e9 in place (Method: BYE). Rescheduling destruction for 10000 ms

This doesn’t just apply to the Digium G200 PRI trunk. It happens to all SIP trunks.

This continues until the phantom channels push utilization to 300-400 channels, at which point Asterisk starts becoming unresponsive.
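Since normal usage here is 15-40 concurrent channels, a climb toward the hundreds could be caught early by polling the channel count. The sketch below is hedged: it assumes `asterisk -rx "core show channels count"` prints a line such as `312 active channels`, and the threshold of 100 is an arbitrary illustrative value, as are the function names.

```python
import re
import subprocess

# ASSUMPTION: the CLI prints a line like "312 active channels"
# (or "1 active channel") for "core show channels count".
COUNT_RE = re.compile(r"^(\d+) active channels?\b", re.MULTILINE)


def parse_channel_count(cli_output):
    """Extract the active channel count from CLI output, or None if absent."""
    match = COUNT_RE.search(cli_output)
    return int(match.group(1)) if match else None


def poll_channel_count(timeout=5):
    """Ask the running Asterisk for its channel count.

    Returns None if the CLI hangs, is missing, or prints nothing usable,
    which is itself worth alerting on given the symptoms described.
    """
    try:
        result = subprocess.run(
            ["asterisk", "-rx", "core show channels count"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None
    return parse_channel_count(result.stdout)


def is_suspicious(count, threshold=100):
    """Flag an unreachable CLI or a count at or above the threshold."""
    return count is None or count >= threshold


if __name__ == "__main__":
    count = poll_channel_count()
    print("channels:", count, "suspicious:", is_suspicious(count))
```

Run from cron every minute or so, this would give a timestamped record of when the phantom channels start accumulating, which can then be lined up against the Autodestruct entries in the log.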

If we try to enter the CLI at that point, when the system is close to becoming fully unresponsive, by typing “asterisk -rvvv”, it connects to the CLI and then instantly kicks us back to the shell. The last 3 lines look as follows:

Connected to Asterisk 13.29.2 currently running on CALLCENTRE01 (pid = 99852)
CALLCENTRE01*CLI>
root@CALLCENTRE01:/usr/src#

There are no other log entries indicating any other error besides the Autodestruct message above.

When this happens, calls already in progress continue and complete. However, all new calls simply hang in silence after going through the IVR, meaning they never reach the queues.

We usually have to manually kill the asterisk process and do a fwconsole start for things to go back to normal.

Previously we were running Asterisk 16.6.1 with the exact same issues. We downgraded to 13.29.2 because we thought it might be an issue with 16 (most of our installs are still on 13 without issues), but the problems persist.

Asterisk typically handles about 25,000 to 50,000 calls, over 3-4 days on average, before it crashes.
I know raw call volume isn’t the issue, as we have other Asterisk instances handling 50k calls a day without problems.

I’m starting to think it may be due to the sheer number of queues that we have (180); that’s the only logical explanation I can think of. I’ve read some reports that the queue application may be unstable at scale, but have yet to read anything that’s truly validated.

How can I begin diagnosing this issue?

mattf · May 2, 2020, 6:36am

Sounds like a potential deadlock of some sort. If it’s in the queue application, you’re probably out of luck getting anyone in Asterisk land to diagnose it (app_queue doesn’t see a lot of love these days). One thing you might try since you’ve already tried Asterisk 16 and seen no difference is migrating to chan_pjsip.

It’s a good idea to do regardless (since chan_sip is marked as deprecated in Asterisk land as of Asterisk 17), and if the deadlock is in the chan_sip code, this might allow you to work around the issue.
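To confirm or rule out a deadlock, a backtrace of every thread taken while Asterisk is frozen is usually the most useful evidence. The following is a minimal sketch, assuming gdb is installed and the script runs as root; the function names and output path are illustrative. Note that attaching gdb pauses the process while the backtraces are written, and the output is far more readable if Asterisk was built with DONT_OPTIMIZE (building with DEBUG_THREADS additionally enables the “core show locks” CLI command).

```python
import subprocess
import sys


def gdb_backtrace_command(pid):
    """Build a gdb invocation that dumps a backtrace for every thread."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def capture_backtraces(pid, out_path, timeout=120):
    """Attach to the process, write all thread backtraces to out_path.

    Returns gdb's exit code. The target process is paused while gdb
    is attached and resumes when gdb detaches.
    """
    with open(out_path, "w") as out:
        result = subprocess.run(
            gdb_backtrace_command(pid),
            stdout=out,
            stderr=subprocess.STDOUT,
            timeout=timeout,
        )
    return result.returncode


if __name__ == "__main__":
    # Usage: python3 capture_bt.py <asterisk pid> [output file]
    # ASSUMPTION: /tmp/asterisk-threads.txt is just an example location.
    target_pid = int(sys.argv[1])
    output = sys.argv[2] if len(sys.argv) > 2 else "/tmp/asterisk-threads.txt"
    print("gdb exit code:", capture_backtraces(target_pid, output))
```

Many threads waiting in lock functions on the same mutex, with chan_sip or app_queue frames above them, would point toward the module involved. Capturing this before killing the process matters, since the evidence is gone once Asterisk is restarted.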

Matthew Fredrickson

cynjut · May 4, 2020, 1:47pm

We’ve seen messages like this in the past. One thing that can cause this is a large number of MWI subscriptions and “hints” being processed on the machine. The thread “Phones Lagged and Unreachable” is an example, and “PJSIP Become unresponsive” is one on the PJSIP side. If you search for “scheduled tasks” using the search icon (top of screen), you’ll find lots of pointers to ways this has been solved for other problems. While yours isn’t exactly the same, it could be something similar.

system · June 4, 2020, 1:47pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.