Taskprocessor Always High Water / Locking Up

We have a decently busy PBX: an average of 60 concurrent calls and about 200 extensions. However, no matter how high we scale the hardware or how we adjust the thread pools, we continue to receive high-water warnings, and we have had the system lock up and become unable to make outbound calls (when checking the taskprocessors, the queues continue to fill but are not draining). We have always had the occasional taskprocessor warning, but only here and there; recently, after an update to the latest version, the warnings became more frequent and then the lockups started.

The lockups / issues only affect outbound calls; inbound calls continue to work. What we see is that the agents' outbound calls show a “Ring” state and get stuck there with no CID shown. The agents keep trying to call until all 3 of their concurrent-call limit are used, and then it just locks like that, with those calls remaining stuck in “Ring”.

This is how it looks all the time; generally we'll see Max Depths of 1600+ across p:channel:all, with 10+ processors for it:

The Warnings:

[2024-04-26 13:56:32] WARNING[29477][C-00000e8e]: taskprocessor.c:1225 taskprocessor_push: The 'stasis/m:manager:core-00000006' task processor queue reached 3000 scheduled tasks again.
[2024-04-26 13:56:32] WARNING[29474][C-00000e8e]: taskprocessor.c:1225 taskprocessor_push: The 'stasis/p:channel:all-0000247f' task processor queue reached 500 scheduled tasks again.
[2024-04-26 13:56:32] WARNING[29477][C-00000e8e]: taskprocessor.c:1225 taskprocessor_push: The 'stasis/p:channel:all-00001e9d' task processor queue reached 500 scheduled tasks again.
    --     -- LazyMembers debugging - Numbusies: 0, Nummems: 26
    --     -- LazyMembers debugging - Numbusies: 0, Nummems: 8
    --     -- LazyMembers debugging - Numbusies: 0, Nummems: 8
    --     -- LazyMembers debugging - Numbusies: 0, Nummems: 20
    --     -- LazyMembers debugging - Numbusies: 0, Nummems: 8
    --     -- LazyMembers debugging - Numbusies: 0, Nummems: 8
    --     -- LazyMembers debugging - Numbusies: 0, Nummems: 8
    --     -- LazyMembers debugging - Numbusies: 0, Nummems: 8
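
For reference, here is a rough sketch of one way to watch those depths from the shell (the column positions are assumed from the stock core show taskprocessors output):

# Sketch: watch the busiest stasis taskprocessors by current queue depth.
# Assumed "core show taskprocessors" columns:
#   name  processed  in-queue  max-depth  low-water  high-water
watch -n 5 "asterisk -rx 'core show taskprocessors' | grep 'p:channel:all' | sort -n -k3 | tail -n 15"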

Machine Stats:
16 Cores, 32 GB RAM, SSD Storage (Hosted on Azure, F16s v2)
FreePBX 16.0.40.7
Asterisk 16.30.0

CPU/memory averages (1-minute) never peak above 20% on either the PBX or the DB

Output of top -p $(pidof asterisk) -n 1 -H -b: top output - FreePBX Pastebin

We use an external MySQL database, but it is local to the PBX (< 1 ms latency), under minimal load at all times, and scaled well beyond requirements.

The only realtime we use is for queue_log, and we have about 6 queues with basic configurations.
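
For context, that realtime piece is just the usual extconfig.conf mapping. A minimal sketch (the DSN and table names here are assumed, not copied from our actual file):

; extconfig.conf (sketch only; DSN/table names are assumed)
[settings]
; route queue_log writes through ODBC instead of the flat file
queue_log => odbc,asteriskcdrdb,queue_log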

Call Detail Record (CDR) settings
----------------------------------
  Logging:                    Enabled
  Mode:                       Simple
  Log calls by default:       Yes
  Log unanswered calls:       No
  Log congestion:             No
  Ignore bridging changes:    No
  Ignore dial state changes:  No

* Registered Backends
  -------------------
    cdr_manager
    Adaptive ODBC

ODBC DSN Settings
-----------------

  Name:   asterisk-phonemanager
  DSN:    MySQL-asterisk-phonemanager
    Number of active connections: 3 (out of 25)
    Logging: Disabled

  Name:   asteriskcdrdb
  DSN:    MySQL-asteriskcdrdb
    Number of active connections: 4 (out of 5)
    Logging: Disabled
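
The connection caps shown above come from res_odbc. This is roughly the kind of stanza behind them (a sketch only; the file name and credentials are placeholders, not copied from our box):

; res_odbc_custom.conf (sketch; username/password are placeholders)
[asteriskcdrdb]
enabled => yes
dsn => MySQL-asteriskcdrdb
username => freepbxuser
password => *****
pre-connect => yes
; max_connections is what produces the "out of 5" shown above
max_connections => 5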

Our stasis.conf:

[threadpool]
;
; For a busy 8 core PBX, these settings are probably safe.
;
initial_size = 20
idle_timeout_sec = 20
;
; The notes about the pjsip max size apply here as well.  Increasing to 100 threads is probably
; safe, but anything more will probably cause the same thrashing and memory over-utilization.
max_size = 100

Our pjsip_custom.conf (currently, we've tried adjusting this up and down with no effect):

[system]
type=system
;
;  <other settings>
;

; Sets the threadpool size at startup.
; Setting this higher can help Asterisk get through high startup loads
; such as when large numbers of phones are attempting to re-register or
; re-subscribe.
threadpool_initial_size=50

; When more threads are needed, how many should be created?
; Adding 5 at a time is probably safe.
threadpool_auto_increment=5

; Destroy idle threads after this timeout.
; Idle threads do have a memory overhead but it's slight as is the overhead of starting a new thread.
; However, starting and stopping threads frequently can cause memory fragmentation.  If the call volume
; is fairly consistent, this parameter is less important since threads will tend to get continuous
; activity.  In "spikey" situations, setting the timeout higher will decrease the probability
; of fragmentation.  Don't obsess over this setting.  Setting it to 2 minutes is probably safe
; for all PBX usage patterns.
threadpool_idle_timeout=120

; Set the maximum size of the pool.
; This is the most important setting.  Setting it too low will slow the transaction rate, possibly
; causing timeouts on clients.  Setting it too high will use more memory, increase the chances of
; deadlocks and possibly cause other resources such as CPU and I/O to become exhausted.
; For a busy 8 core PBX, 100 is probably safe.  Setting this to 0 will allow the pool to grow
; as high as the system will allow.  This is probably not what you want. :)  Setting it to 500
; is also probably not what you want.  With that many threads, Asterisk will be thrashing and
; attempting to use more memory than can be allocated to a 32-bit process.  If memory starts
; increasing, lowering this value might actually help.
threadpool_max_size=100

We have run this system, generally unchanged, for 3 years on a VM half this size; then last Thursday it just started locking up, and I cannot find a cause…

What I’ve tried:

  1. Migrating the system to a new physical host
  2. Scaling the VM Size (it ran on a VM half this size for the last 3 years)
  3. Migrating the DB to a new physical host
  4. Scaling the DB Size
  5. Adjusting the threadpools for stasis/pjsip up and down
  6. Optimizing the database tables and purging old records in the CDR (roughly as sketched below)
  7. Logical dump and restore of entire db to completely new setup
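
For item 6, the cleanup was roughly along these lines (a sketch; credentials are omitted and the 12-month cutoff is just an example):

# Sketch of the CDR cleanup (item 6); credentials omitted, cutoff is only an example.
mysql asteriskcdrdb -e "DELETE FROM cdr WHERE calldate < DATE_SUB(NOW(), INTERVAL 12 MONTH);"
mysql asteriskcdrdb -e "OPTIMIZE TABLE cdr;"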

Please, if someone can offer a new place to look, I would be grateful; this is crushing our business :pray:

Is this 60 call attempts per minute or 60 answered calls per minute? There is a big difference.

This would be attempted or completed outbound calls, 60 concurrent calls generally.

So when this is happening, how many incoming calls are coming into the PBX? Asterisk doesn't differentiate between “this is my provider trunk” and “this is my phone” when it comes to endpoints; they are just endpoints. If you're attempting 60 outbound calls, that's 60 outgoing channels, but at the same time, an incoming call hitting a queue with 32 members and calling all 32 members at once is going to generate 32 more outgoing channels. Again, to Asterisk that is just 92 outgoing channels; it doesn't matter if the destination is a phone or a provider's server.

When this issue is happening, are outgoing calls generated from queues still working? Can the internal extensions call each other when this issue is happening? Basically, do outgoing calls to your provider seem to be the only thing having this issue?

It would also be helpful to see a full verbose output and pjsip logger output of a call that is failing during this time.

We have an average of 2 inbound calls per minute over a whole day, and that 60 cpm is also a daily average. The taskprocessors, however, are flooded at all times, even at the start of the day with low volume; it's like every call is generating 10k tasks. The lockup happens randomly and without a change in volume: the CDR aggregator taskprocessor fills up, along with m:manager:core and p:channel:all, and they just stop processing, so we're forced to restart the PBX.

Also, we have no custom modules installed other than the stock ones (queues, ring groups, voicemail, etc) and fop2.

When this issue is happening, are outgoing calls generated from queues still working? Can the internal extensions call each other when this issue is happening? Basically, do outgoing calls to your provider seem to be the only thing having this issue?

So inbounds to queues still work, as do any currently connected calls; however, any outbound attempts lock up in a “Ring” state, and it looks like this:

We have another service (non-FreePBX) connected to the same trunk and same provider in the same physical location as this PBX and their connectivity is unaffected during these times, so we’re confident we’ve ruled out the provider.

Since it's random, I haven't been able to gather much in the way of logs besides finding the taskprocessors full, but I will certainly try.

Is there any way to dump the taskprocessors to find out what all of these things are, or to find out what’s filling them up?

So if 314 is making 4 outbound calls at once, how is 314 making said outbound calls?

It's not, actually; this is what happens during the failure event. Essentially, they get dead air when they try to call, so they hang up and make another call, and then another one. Asterisk never lets go of the calls; they just get stuck in the Ring state, then the max concurrent extension limit (3) is hit, and they sit there forever until we restart the PBX.

So we need to see the logs I requested so we can see what happens during the call.

I'm curious as to how that extension made three calls that failed and whose channels stayed up, yet a fourth call got out. What is interesting is that the three stuck channels are stuck in a Gosub call; that could be the PSTN Gosub, the outbound caller ID Gosub, or the hangup Gosub… The next time this happens, not only get the logs of a failed call but also give us the output from the Asterisk CLI of core show channels concise.

The 4th call is an inbound; inbounds continue to work just fine. The first 3 are failed, stuck outbounds that are not connected to the agent's softphone, caused by the agent making 3 call attempts and hanging up: each one is dead air when this happens, so they hang up and try again. The PBX never lets the calls go; they just sit there stuck in “Ring”, which then stops the agent from making any new outbounds until the PBX is restarted, as they're at their 3-concurrent-call limit.

I have put together some scripts to dump the eventq, the pjsip logger, and the full log, ready to fire when it occurs again.
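
Roughly what those scripts do (a sketch only; the paths, the one-minute loop, and the output directory are arbitrary):

#!/bin/bash
# Sketch of the capture script; paths and sample counts are arbitrary.
OUT=/var/log/asterisk/lockup-$(date +%F-%H%M%S)
mkdir -p "$OUT"

# Turn on SIP packet logging so the failing INVITEs land in the full log.
asterisk -rx 'pjsip set logger on'

# One snapshot per second for a minute: taskprocessors, eventq, channels.
for i in $(seq 1 60); do
    {
        date
        asterisk -rx 'core show taskprocessors'
        asterisk -rx 'manager show eventq'
        asterisk -rx 'core show channels concise'
    } >> "$OUT/snapshots.txt"
    sleep 1
done

asterisk -rx 'pjsip set logger off'
# Grab the full log afterwards.
cp /var/log/asterisk/full "$OUT/"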

Another strange item of note being spammed is:

    --     -- LazyMembers debugging - Numbusies: 0, Nummems: 7

Everything online says this comes from VQ Plus, which we do not have installed on the machine.

LazyMembers is a FreePBX-based patch to Asterisk. The FreePBX distros ship this patch in Asterisk, so you will still see things like this, but it doesn't technically “work” until you install VQ Plus to configure it. It's something you're stuck with, and really, I've seen it cause more problems than it solves since it impacts everyone using basic queues on the distro.

Ah, thanks for explaining that. I appreciate your insights thus far!

Do you know any way to find out what is flooding these taskprocessors?

I dumped manager show eventq every 1s for a short while (maybe 10-20 seconds) and compiled a breakdown of the unique events:

Here are the counts for each event type found in the file:

Event Count
VarSet 21282
Newexten 9284
RTCPSent 466
RTCPReceived 380
NewConnectedLine 177
NewCallerid 172
QueueMemberStatus 157
Newchannel 150
Newstate 125
DeviceStateChange 109
Cdr 109
DialState 104
DialBegin 93
Hangup 90
ExtensionStatus 71
MixMonitorStart 61
BridgeLeave 50
DialEnd 45
BridgeEnter 42
SoftHangupRequest 39
HangupRequest 40
LocalBridge 36
AgentCalled 36
ChallengeSent 23
SuccessfulAuth 25
BridgeDestroy 25
VarSet 21
BridgeCreate 21
DTMFBegin 8
DTMFEnd 7
AgentRingNoAnswer 6
AgentComplete 4
MusicOnHoldStop 3
QueueCallerJoin 2
MusicOnHoldStart 2
ChanSpyStop 1
AgentConnect 1
ChanSpyStart 1
Unhold 1
QueueCallerLeave 1
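
For reference, a tally like the one above can be produced with something along these lines (a sketch; it assumes the eventq dumps were appended to a single file and that each queued event body starts with an "Event:" line):

# Count event types across the saved eventq dumps ("eventq-dump.txt" is just the assumed file name).
tr -d '\r' < eventq-dump.txt | awk '/^Event: /{count[$2]++} END {for (e in count) print count[e], e}' | sort -rn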