The recorded messages are choppy after four or five hours of active system use by our users. We experience the choppiness on external calls and on internal calls, such as accessing voicemail from an extension. If I issue a fwconsole restart, then it’s good until about that same time frame. It is FreePBX 16, which also runs the Queuemetrics uniloader and unitracker to connect to our Queuemetrics instance on a separate server.
I have narrowed it down to either something with our users’ connections (20-30 phones plus 30 more users on MicroSIP) or the usage of Queuemetrics and AMI commands throughout the day (login, logout, and pausing in queues). I never see any problems with overall CPU, memory, disk latency, or jitter. The system has no issues during periods without the extra load, such as holidays and weekends, but only if I perform the fwconsole restart beforehand. If I don’t, then the audio keeps clipping.
This is a recent issue within the last few weeks, and I am looking for some guidance on pinpointing the problem. Has anybody else experienced similar issues, or are there other tools I can use to see specifically what is impacting Asterisk so I can correct it?
We considered this, but it’s not just the recordings that are affected; the live call with the other person is too. I have taken pcaps and looked at those as well but am not seeing anything useful. It feels more like Asterisk specifically is getting overwhelmed when the issue happens and doesn’t clear itself until a restart. I was hoping there was something else that could be done inside Asterisk to help pinpoint the cause.
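For anyone wanting to double-check my reading of the pcaps, I summarized the RTP streams with something like this (the capture path is just a placeholder):

```
# Summarize every RTP stream in the capture: packets, lost, max delta,
# max/mean jitter. Clean numbers here point away from the network.
tshark -q -z rtp,streams -r /tmp/choppy-call.pcap
```

Nothing in those summaries jumped out at me, which is part of why I suspect Asterisk itself.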
If audio quality issues are noted on internal calls (and assuming the extensions are on the same LAN as the FreePBX), then that’s more isolated behavior, taking the outside world, SIP trunking, ISP latency, etc. out of the picture. I am wondering, as a test, if you can temporarily disable Queuemetrics and see if the problem still presents itself.
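If you want to try that, something along these lines should take Queuemetrics out of the loop without touching the rest of the PBX. I’m assuming Uniloader runs as its own systemd unit, so the name may differ on your install:

```
# Stop the Queuemetrics data feed on the PBX (unit name may vary).
systemctl stop uniloader
# Reproduce the load for a few hours, then bring it back:
systemctl start uniloader
```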
The choppy audio itself is not the primary diagnostic signal here for me; what is materially relevant is the time-based degradation and the immediate recovery after restarting Asterisk.
That pattern strongly indicates a progressive resource exhaustion or internal queue buildup.
The fact that performance degrades after several hours and is fully reset by fwconsole restart suggests that some internal subsystem is accumulating state (events, tasks, sessions, or descriptors) and not draining it efficiently over time.
That said, I would start with threadpool pressure and queue backlogs, and then move to the OS level if required.
If taskprocessor queues show increasing depth or high water marks over time, that is a clear indication that Asterisk is falling behind on internal event processing. In my understanding, this can eventually impact the whole environment.
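A simple way to watch for that over the course of a day, assuming shell access on the PBX (the sort key assumes the standard column layout of the taskprocessor output):

```
# Snapshot taskprocessor depths every minute; a steadily climbing
# "In Queue" / "Max Depth" on the stasis/* processors means Asterisk
# is not draining its internal event queues.
watch -n 60 'asterisk -rx "core show taskprocessors like stasis" | sort -k3 -nr | head -20'
```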
I am finding messages in the logs which, if I understand correctly, show that Asterisk is having issues. From these event messages, can we pinpoint the cause so we can resolve it?
Did you proceed with the change? Have you observed any improvements?
We are currently working with a very similar scenario, so any feedback would be valuable.
From what you describe, it seems you have already reached the practical upper limit here.
To better understand your setup:
How is your AMI configured?
How many AMI users are currently connected?
Which events are they subscribed to?
Additionally, are you using ARI? If so, how is your architecture designed around it?
Have you been able to identify the origin of the bottleneck?
You can still try to tune stasis.conf, but it is worth noting that many of these events are intrinsic to the design and cannot be easily reduced (or removed) in production.
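To make that concrete, the tunable parts of stasis.conf look roughly like this; the values are illustrative starting points rather than recommendations, and declined types should only be used if you are certain nothing (AMI, ARI, CDR, CEL...) consumes them:

```
; /etc/asterisk/stasis.conf -- illustrative values only
[threadpool]
initial_size = 5        ; threads created at startup
idle_timeout_sec = 20   ; seconds before an idle thread is reaped
max_size = 50           ; upper bound on the stasis threadpool

[declined_message_types]
; Declining a type stops it from being published at all.
;decline = ast_rtp_rtcp_sent_type
;decline = ast_rtp_rtcp_received_type
```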
Running version 20.17 now, and we are still going past the high water mark on a couple of the stasis/m:channel taskprocessors. One is at 625 and the other is at 606. I am also still experiencing the audio choppiness, on the recordings only, although it maybe took about an hour longer to appear.
We have 14 managers configured in AMI. Some, I believe, came with the setup, like the srtapi one. The active AMI users are five admin connections plus one each for the firewall (from localhost), Nagios, and Queuemetrics. All the users are subscribed to all events.
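Based on that, one thing I plan to try is narrowing each account in manager.conf instead of leaving everyone subscribed to everything. A rough sketch; the account name, classes, and filter are hypothetical until I map out what each consumer really needs:

```
; /etc/asterisk/manager.conf -- hypothetical monitoring account
[nagios]
secret = changeme
; Grant only the event classes this consumer needs...
read = system,reporting
write = command
; ...and/or drop individual events (regex match, ! negates).
eventfilter = !Event: Newexten
```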
It looks to me like ARI is enabled, since the HTTP(S) server is enabled and the ARI status shows enabled. I do not see any active applications or dialplan usage of it, though.
It would be nice if it was easier to pinpoint the source causing this. Going through the logs, I also see a lot of PJSIP state subscription failures from extensions that do not exist anymore but are still in the templates for the phones, as well as from cell phone numbers that were put into phones or MicroSIP as contacts.
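For reference, this is how I have been sizing up that subscription noise (assuming the stock log location):

```
# List active inbound subscriptions (BLF/presence) to spot phones still
# watching extensions that no longer exist.
asterisk -rx "pjsip show subscriptions inbound"
# Rough count of subscription-related messages in the full log.
grep -ci "subscription" /var/log/asterisk/full
```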
Here is a little more detailed background: we run six active queues with about 15-20 agents logging in daily through Queuemetrics via AMI. Uniloader runs on the PBX server, sending data from the queues back to Queuemetrics. We process on average 400-600 calls daily, incoming and outgoing combined. We have been running this configuration for about two years now, and we became aware of the choppiness around the beginning of February.
Is that high water issue the probable cause of the audio jitters?
Isn’t the fact that the ‘stasis/m:channel:all’ processors are getting past the high water mark an indication that ARI is being used, or maybe of the FreePBX configuration with IVRs, queues, etc.?
Is my issue with the state subscriptions on PJSIP endpoints contributing to the problem? Or is this purely the system not being able to keep up with the call flow of calls coming into the IVR, playing various messages, ringing extensions in queues, and so on? If that is the case, what would be the logical steps to correct it, because it doesn’t appear that giving the system more resources would help.
It’s just not really possible to have something point out precisely what is going on. There’s a lot more nuance to it.
So it wasn’t happening before? If not, then something has most likely changed to start causing it: an increase in the number of calls, some kind of underlying hardware failure, etc. Identifying that change would isolate it.
It’s a symptom, not a cause.
Stasis is used internally for lots of things, not just ARI. AMI? Built on stasis.
It’s something to do with either the hardware, your usage patterns, or the third party app. Eliminating things can isolate the underlying cause, and once that is determined, suggestions can be provided.
In our case, the high water marks occur when a call enters a FreePBX queue. We are running several external AMI clients that consume specific events. However, reducing the number of AMI clients has not significantly decreased the pressure on the core topics, particularly the channel-related ones (though it did on stasis/manager:core).
Our current hypothesis is that these warnings are triggered by the burst of events generated by FreePBX queue logic. That said, we have not yet confirmed whether the same behavior would occur in the absence of AMI consumers, or with more aggressively filtered event streams. It is also unclear whether this is a root cause or simply an amplification effect.
In addition to the warnings mentioned above by other users, we also observed stasis/pool-control warnings, which disappeared after tuning Stasis parameters.
What remains unclear is why the internal event processing system is unable to drain these queues, given that we are not observing any obvious bottlenecks at the operating system or infrastructure level.
However, it’s also true that our queues are drained correctly after bursts.
@eruzek Is it still the “stasis/m:channel:all” taskprocessor experiencing high water? What’s your complete “core show taskprocessors” output?
@quaquanic A few comments for your usage. In recent versions “stasis/pool-control” doesn’t exist anymore[1]. Secondly, I made a change[2] which should have substantially reduced the messages ending up at app_queue on its subscriptions[3], so the “stasis/p:channel:all” ones shouldn’t hit the high water mark anymore.
The problem with manager is that it’s not asynchronous, and so the implementation has to handle that even if your own usage is really just for events. Messages from inside of Asterisk mostly come in via stasis, get turned into AMI messages, and go into another queue inside of manager (shared by all connections). Each connection then iterates through the manager queue individually - having to temporarily lock it. Once all connections are past a message, the message is then disposed of.
While executing an action, the connection is not iterating through the queue, so events can accumulate; once the action is done, it starts working through the events again.
The queue in manager itself can end up being hammered quite a lot, and individually locked by many threads constantly. It’s a prime area for contention, to the degree that tasks can accumulate on the taskprocessor while it is trying to access the queue of events to add a new one.
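A practical consequence of that design: an AMI client that only issues actions and never consumes events can ask for event delivery to be turned off for its session, so it isn’t competing for that shared event queue. A minimal sketch of such a login (username and secret are placeholders):

```
Action: Login
Username: actiononly
Secret: changeme
Events: off
```

The same can be toggled mid-session with the Events action (EventMask: off).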
You might ask… why wasn’t this a problem in the past?
Before stasis there wasn’t asynchronous publishing of messages; instead, all the logic happened when you sent an AMI message from elsewhere in Asterisk, so things naturally got backpressure, since callers needed to wait on that shared manager queue resource, which slowed things down. The manager queue itself also doesn’t have logging or high water mark handling, so it could grow invisibly if it wanted to (well, until memory ran out). Since the taskprocessor has logging, the problem is more evident there.
I’ve got manager ideas bouncing around my head for the future after I finish up my hints work. The hints work will expose functionality in stasis to allow the backpressure that occurred before, so manager could be switched over; that would remove the manager taskprocessor and have things go directly into the manager queue itself, eliminating a good portion of the stasis part. The con is that it might just end up pushing the problem further into the manager queue.
I dunno yet, it’ll probably be a vacation thing in a few months where I prototype things and see.
That one and “stasis/m:cache_pattern:1/channel” are the only two I am seeing triggered in the logs.
I restarted Asterisk again last night and, for now, disabled Nagios from running AMI commands and monitoring the server. I am seeing a slight improvement in the audio quality at times, but I don’t know whether I would have noticed the same before if I had checked closely enough.
At this point it feels like I have two choices. The first is to start going through our current install: disabling unused modules, cleaning up unused extensions, and reviewing our Queuemetrics install for any improper settings. The other option is to migrate to a new FreePBX 17 instance, which we will end up doing anyway because we really want to get our FreePBX instances off of CentOS 7.
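For the first option, at least the module cleanup part is scriptable; the module name below is just a placeholder:

```
# Inventory installed/enabled FreePBX modules.
fwconsole ma list
# Disable one confirmed to be unused (name is a placeholder).
fwconsole ma disable examplemodule
```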
I will gladly accept any more insight you might have on things to look at or try with our current setup, because I don’t want the problem to follow us to the new server. Thanks!
I don’t really have any more insight. It comes down to trying to isolate the source, which is easiest done by having an environment you can replicate the problem in and disabling things.