We’re running FreePBX 14.0.13.6 and Asterisk 13.29.2 in Device & User mode.
Single Xeon 4110 machine w/ 32GB of RAM. Debian 9.9.
Standalone FPBX & Asterisk install from source.
On average, about 60 chan_sip devices are online at any one time, with 15-40 channels in concurrent use.
One outlier for us is that we have approximately 180 queues set up on this instance.
What happens is that every 3-5 days (usually depending on total call volume, or perhaps simply on elapsed time) Asterisk “freezes” and becomes unresponsive. What we see on this install, and have not seen on any of our other installs over the years, is the following:
- Every day, without fail, the manager’s stasis task processor queue repeatedly backs up to 3000 scheduled tasks. The warning below recurs at intervals ranging from every 5 minutes down to every 5-10 seconds:
WARNING[99873]: taskprocessor.c:1110 taskprocessor_push: The ‘stasis/m:manager:core-00000006’ task processor queue reached 3000 scheduled tasks again.
- In the hour or so before Asterisk crashes, the log fills with entries like this every second:
chan_sip.c: Autodestruct on dialog ‘[email protected]:5060’ with owner SIP/digium-gw-000030e9 in place (Method: BYE). Rescheduling destruction for 10000 ms
This isn’t limited to the Digium G200 PRI trunk; it happens on all of our SIP trunks.
It continues until these phantom channels push channel usage to 300-400, at which point Asterisk becomes unresponsive.
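The two warnings above can be tallied from the full log with a short script. This is a minimal Python sketch, not an established tool: the function names are hypothetical, it assumes each log line begins with a bracketed timestamp such as `[2019-10-17 10:15:32]` (adjust `TS_FORMAT` to your `logger.conf` date format), and it matches only the message text quoted in this post. Because each stuck dialog reschedules itself every 10000 ms, raw line counts overstate the problem, so `stuck_dialogs_by_peer` counts distinct owner channels per trunk instead.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: lines start with a bracketed timestamp like "[2019-10-17 10:15:32]".
# Adjust TS_FORMAT if your logger.conf uses a different dateformat.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"
TS_RE = re.compile(r"^\[([^\]]+)\]")

# Accept straight or typographic quotes, since pasted logs vary.
HIGH_WATER_RE = re.compile(
    r"The ['‘](?P<name>[^'’]+)['’] task processor queue reached \d+ scheduled tasks"
)
AUTODESTRUCT_RE = re.compile(
    r"Autodestruct on dialog .* with owner (?P<chan>SIP/\S+) in place"
)


def high_water_gaps(lines, name):
    """Seconds between consecutive high-water warnings for one task processor."""
    times = []
    for line in lines:
        ts = TS_RE.match(line)
        hw = HIGH_WATER_RE.search(line)
        if ts and hw and hw.group("name") == name:
            times.append(datetime.strptime(ts.group(1), TS_FORMAT))
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def stuck_dialogs_by_peer(lines):
    """Count distinct owner channels per SIP peer among Autodestruct warnings,
    i.e. dialogs whose destruction keeps being rescheduled."""
    channels = {m.group("chan") for m in map(AUTODESTRUCT_RE.search, lines) if m}
    # "SIP/digium-gw-000030e9" -> "digium-gw" (drop the trailing -<id> suffix)
    return Counter(chan[len("SIP/"):].rsplit("-", 1)[0] for chan in channels)
```

For example, read `/var/log/asterisk/full` (the usual FreePBX location; check yours) into a list with `readlines()`, then call `high_water_gaps(lines, "stasis/m:manager:core-00000006")` to see whether the gaps shrink as a lockup approaches, and `stuck_dialogs_by_peer(lines)` to see which trunks accumulate stuck dialogs.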
If we try to attach to the CLI with “asterisk -rvvv” when the system is close to fully unresponsive, it connects and then immediately drops back to the shell. The last three lines look like this:
Connected to Asterisk 13.29.2 currently running on CALLCENTRE01 (pid = 99852)
CALLCENTRE01*CLI>
[email protected]:/usr/src#
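Because an interactive `asterisk -rvvv` session drops out near the end, a non-interactive sampler can record the climb from the normal 15-40 channels toward 300-400, and the moment the CLI stops giving a usable answer. This is a minimal Python sketch, assuming it runs as a user allowed to connect to the Asterisk control socket; the function names and the `alert_above` threshold are hypothetical, and it relies only on the first line of `core show channels count` output (e.g. `15 active channels`).

```python
import re
import subprocess
import time

# Matches the first line of "core show channels count", e.g. "15 active channels".
CHANNELS_RE = re.compile(r"^(\d+) active channels?", re.MULTILINE)


def parse_active_channels(output):
    """Return the active channel count, or None if the line is missing."""
    m = CHANNELS_RE.search(output)
    return int(m.group(1)) if m else None


def sample():
    """Query the running Asterisk without an interactive console.
    None means no usable answer (timeout, or no count in the output)."""
    try:
        result = subprocess.run(
            ["asterisk", "-rx", "core show channels count"],
            stdout=subprocess.PIPE, universal_newlines=True, timeout=30,
        )
    except subprocess.TimeoutExpired:
        return None
    return parse_active_channels(result.stdout)


def poll(interval_s=60, alert_above=100):
    """Print one timestamped sample per interval, forever.
    alert_above is a hypothetical threshold; normal load here is 15-40."""
    while True:
        count = sample()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        if count is None:
            note = "  <-- no usable answer from CLI"
        elif count > alert_above:
            note = "  <-- above threshold"
        else:
            note = ""
        print("%s channels=%s%s" % (stamp, count, note), flush=True)
        time.sleep(interval_s)
```

Calling `poll()` in a `screen` session, or under `nohup` with output redirected to a file, leaves a timeline that can be lined up against the Autodestruct entries in the full log.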
Apart from the Autodestruct messages above, no other log entries indicate any kind of error.
When this happens, calls already in progress carry on and complete normally, but every new call passes through the IVR and then sits in silence; it never reaches the queues.
To recover, we usually have to kill the Asterisk process manually and run “fwconsole start”.
We were previously running Asterisk 16.6.1 with exactly the same issues. We downgraded to 13.29.2 because we suspected a problem specific to 16 (most of our installs are still on 13 without issues), but the issues persist.
Asterisk typically handles about 25,000 to 50,000 calls before it crashes, which takes 3-4 days on average.
I know raw call volume isn’t the problem, as we have other Asterisk instances handling 50k calls a day without issue.
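As a quick sanity check on that comparison: 25,000 to 50,000 calls over 3-4 days is at most about 16,700 calls a day, roughly a third of what the other instances handle. The snippet just makes the bounds explicit; the function name is hypothetical.

```python
def calls_per_day_range(calls_min, calls_max, days_min, days_max):
    """Lowest and highest daily rate implied by a calls range and a days range:
    fewest calls over the longest window, most calls over the shortest."""
    return calls_min / days_max, calls_max / days_min


low, high = calls_per_day_range(25000, 50000, 3, 4)
print("%.0f to %.0f calls/day" % (low, high))  # 6250 to 16667 calls/day
```

That is consistent with the lockups tracking cumulative calls or elapsed time rather than call rate.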
I’m starting to think it may be due to the sheer number of queues we have (180); that’s the only logical explanation I can think of. I’ve read some reports that the queue application may be unstable at scale, but I have yet to find anything that truly confirms it.
How can I begin diagnosing this issue?