TaskProcessor Crash?

GSnover · April 22, 2021, 6:21pm

Customer box (100% PJSIP) stopped accepting calls this morning - looking through the logs, I am seeing a lot of warnings about stasis/m:

taskprocessor.c: The 'stasis/m:channel:all-00000103' task processor queue reached 500 scheduled tasks again.

It’s always the 102 or 103 processor - Here are their stats after the box has been up for about 2 hours:

Processor                      Processed  In Queue  Max Depth  Low water  High water
stasis/m:channel:all-000000fd          5         0          2        450         500
stasis/m:channel:all-00000100          4         0          2        450         500
stasis/m:channel:all-00000101     349791         0        981        450         500
stasis/m:channel:all-00000102     349790         0       1092        450         500
stasis/m:channel:all-00000107          1         0          1        450         500
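
For anyone who wants to check stats like these without eyeballing them, here is a minimal Python sketch that parses rows in the layout shown above and lists the processors whose max depth went past their high water mark. The column order is assumed from the output posted here (as printed by `core show taskprocessors`), and the helper names `parse_rows` and `over_high_water` are made up for illustration.

```python
import re

# Rows copied from the stats posted above. The column layout
# (name, processed, in queue, max depth, low water, high water)
# is an assumption based on that output.
SAMPLE = """\
Processor Processed In Queue Max Depth Low water High water
stasis/m:channel:all-000000fd 5 0 2 450 500
stasis/m:channel:all-00000100 4 0 2 450 500
stasis/m:channel:all-00000101 349791 0 981 450 500
stasis/m:channel:all-00000102 349790 0 1092 450 500
stasis/m:channel:all-00000107 1 0 1 450 500
"""

# One name token followed by exactly five integer columns.
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$")


def parse_rows(text):
    """Return one dict per data row; header and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue
        processed, in_queue, max_depth, low, high = map(int, match.groups()[1:])
        rows.append({
            "name": match.group(1),
            "processed": processed,
            "in_queue": in_queue,
            "max_depth": max_depth,
            "low_water": low,
            "high_water": high,
        })
    return rows


def over_high_water(rows):
    """Names of processors whose max depth exceeded their high water mark."""
    return [r["name"] for r in rows if r["max_depth"] > r["high_water"]]


if __name__ == "__main__":
    for name in over_high_water(parse_rows(SAMPLE)):
        print(name)
```

A queue that is empty now can still show a large max depth, so this only says which processors backed up at some point since startup, not which ones are backed up at the moment.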

Just noticed that 103 is not running now? What do these processors do?

Found this: PJSIP stops accepting calls and console flooded with WARNING messages - General Help - FreePBX Community Forums

But it’s not the same processor and I have made sure there are no loops and already had gotten rid of all the extraneous stuff as suggested.

I also looked here: Asterisk Task Processor Queue Size Warnings ⋆ Asterisk

But I kind of need CDRs. Do I need CEL? Some threads say to kill it, but other threads say CDR is eventually going away, so I don’t know what to believe on that account.

This is the first actual production CRASH I have had in probably 8 years - freaked me out! Asterisk 18.2.2/FreePBX 15 - everything current, with all modules and packages up to date.

fwconsole restart said it could not stop Asterisk so that is why I rebooted.

They have no Queues - Only Ring Groups. They are using Sangoma Connect with 9 users and 25 regular extensions.

They do three unusual things:

All Calls Ring All Phones - Crazy, I know but they insist.
They do All-Pages (every extension) about 15 times an hour after they park a call.
Two extensions are a BLF on every single phone there.

Machine is a Hyper-V machine with 8G and 4 Processors - CPU never gets above 5% in top.

Any ideas? I can’t have it crashing…

GSnover · April 22, 2021, 9:00pm

The plot thickens - It looks like a user (that is in the incoming ring group) set something on their Sangoma Connect app that caused the calls to hang up - As soon as we took her phone out of the Ring Group, happiness was achieved! As soon as I figure out what she did (and why???) I will post back so other people can avoid it.

GSnover · April 22, 2021, 10:06pm

Here is what I think is the relevant part of the Logs for the failure:

&PJSIP/122/sip:[email protected]:37599;transport=TCP;rinstance=5CEE98C9;x-ast-orig-host=10.xx.x.x:37599
&PJSIP/122/sip:[email protected]:15115;x-ast-orig-host=192.168.xx.xx:0
&Local/90122@zulu-call
&Locb(func-apply-sipheaders^s^1),") in new stack
[2021-04-22 14:37:51] WARNING[26186][C-000001b6] app_dial.c: Dial argument takes format (technology/resource)
[2021-04-22 14:37:51] VERBOSE[26186][C-000001b6] app_macro.c: Spawn extension (macro-dial, s, 33) exited non-zero on ‘PJSIP/WLC-00001e47’ in macro ‘dial’

I don’t know how to read that - but it seems like something is mis-formatted - I know it’s to do with extension 122 because as soon as I took it out, happiness ensued.

billsimon · April 22, 2021, 10:14pm

Curious why you marked the issue “solved” in post 2.

How many extensions do you have in the ring group? It looks like a very long dial string might have gotten truncated.

GSnover · April 22, 2021, 11:25pm

I marked it solved because it’s not a TaskProcessor problem - I am thinking truncation too. How do I find out what the limit is on a Dial Line?

billsimon · April 22, 2021, 11:37pm

It was just a guess. I would start by posting logs in their actual form and not like halfway through a log line.

GSnover · April 23, 2021, 1:46am

For sure in case anyone else sees this post - It was too many phones/devices in a Ring Group - with PJSIP extensions, the device handle is much longer, and then you add in the extra handles of the Connect softphones, and whammo - hangs up the call quickly.

I wish I could figure out the exact length of the truncation, but the error message doesn’t say where the error was - just that it was malformed.
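
One way to get a feel for how fast the dial string grows is to build one offline and measure it. The Python sketch below is only an estimate under stated assumptions: the contact URIs are made up (documentation addresses in 192.0.2.x), each extension is assumed to get one `Local/90<ext>@zulu-call` entry like the one in the log above, and the helpers `build_members` and `dial_string` are hypothetical. It does not say where Asterisk actually truncates.

```python
def build_members(count, contacts_per_ext, first_ext=100):
    """Make a hypothetical ring group: extension -> list of contact URIs."""
    members = {}
    for i in range(count):
        ext = str(first_ext + i)
        members[ext] = [
            f"sip:{ext}@192.0.2.{10 + n}:5060;transport=TCP"
            for n in range(contacts_per_ext)
        ]
    return members


def dial_string(members, with_local=True):
    """Join every destination with '&', as a ring-all group would."""
    parts = []
    for ext, contacts in members.items():
        for uri in contacts:
            parts.append(f"PJSIP/{ext}/{uri}")
        if with_local:
            # Assumed extra softphone leg, modelled on the log line above.
            parts.append(f"Local/90{ext}@zulu-call")
    return "&".join(parts)


if __name__ == "__main__":
    for count in (1, 10, 34):
        length = len(dial_string(build_members(count, 2)))
        print(f"{count} extensions -> {length} characters")
```

Comparing the printed lengths against the group size at which calls start failing would at least narrow down where the cutoff sits.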

Now as to why the Task Processor is logging queue overload so often, that I guess is a different problem.

BlazeStudios · April 23, 2021, 12:36pm

Honestly, everything you have shown so far has a single extension being dialed; it just has multiple contacts, including the dumbass 90XXX extra extensions for Zulu/Connect.

So we probably should see this for real, with all the extensions/contacts being loaded, because so far this shows 1 extension in the ring group being dialed.

PitzKey · April 23, 2021, 12:37pm

Josh Colp explained the Task Processor here:

While that helped me get a better understanding what it does, I still don’t fully understand how it works, how to troubleshoot, or how to prevent it from getting overloaded and when it is necessary to tweak it.

I just wish one of the AstriCon speakers would pick this subject.

jcolp · April 23, 2021, 12:41pm

The problem is that there isn’t an answer to that. Taskprocessors are a core tool, and it’s everything that uses them that can cause overloads and they’re used everywhere. How to troubleshoot/prevent/etc is therefore dependent on what is using the taskprocessors and how they’re being used which encompasses… well… loads of Asterisk. I wish I could write a wiki page about this kind of thing, but I haven’t been able to - at least one that actually helps.

billsimon · April 23, 2021, 12:50pm

It would be nice if we could directly dial an AOR and not have to expand it into a list of contacts.

BlazeStudios · April 23, 2021, 1:12pm

Then how would you call all the contacts? Also, you can: Dial(PJSIP/100) will dial the first listed contact for the AOR. The function PJSIP_DIAL_CONTACTS() will grab all the contacts of the endpoint to dial. It also allows you to pick an endpoint AND an AOR to use, so you could pick endpoint 100 and use AOR 200 to grab the contacts from.

billsimon · April 23, 2021, 1:13pm

I know how it works, thanks.

I’m saying this would be better if it dialed all the contacts. What’s the point of just dialing the first (perhaps arbitrary) one in the list?

BlazeStudios · April 23, 2021, 1:24pm

That’s not what you said. You wanted a way to dial an AOR and not have it expand into a list of contacts. I said how to. But you need the contacts to dial, so how do you think this should be done to dial AOR 100 and hit all 3 contacts it has?

Just for the record, this is how other systems work. You look up the AOR to get all the contacts, and that builds a destination set to send the request to, including NOTIFYs, etc.
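
As a rough illustration of the lookup described here, the Python toy model below contrasts using only the first listed contact with building the full destination set. It is not how Asterisk implements it; the `Aor` class, the helper names and the contact URIs are all made up, and the `PJSIP/<endpoint>/<contact URI>` form is copied from the log fragment posted earlier in the thread.

```python
from dataclasses import dataclass, field


@dataclass
class Aor:
    """Hypothetical address of record with its registered contact URIs."""
    name: str
    contacts: list = field(default_factory=list)


def first_contact(aor):
    """Destinations when only the first listed contact is used."""
    return aor.contacts[:1]


def destination_set(aor):
    """Destinations when every contact of the AOR is used."""
    return list(aor.contacts)


def as_dial_string(endpoint, uris):
    """Format destinations as '&'-joined PJSIP/<endpoint>/<uri> entries."""
    return "&".join(f"PJSIP/{endpoint}/{uri}" for uri in uris)


if __name__ == "__main__":
    aor = Aor("100", [
        "sip:100@192.0.2.10:5060",
        "sip:100@192.0.2.11:5060;transport=TCP",
        "sip:100@192.0.2.12:5060",
    ])
    print(as_dial_string("100", first_contact(aor)))
    print(as_dial_string("100", destination_set(aor)))
```

The second string grows with every registered contact, which is where the length concern in this thread comes from.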

billsimon · April 23, 2021, 1:30pm

The problem is that PJSIP_DIAL_CONTACTS() could expand into a string that is a thousand characters long and (as may be the case here) cause a surprise, maybe a failure.

I think it would be a good default and should be done “behind the scenes” in the PJSIP stack. If you want to enumerate contacts and work with them individually there should be functions for that but if you just want to fork a call to all contacts it should be as easy as dialing PJSIP/AOR. Just my opinion.

BlazeStudios · April 23, 2021, 1:56pm

Well, that is more an Asterisk thing, considering that all the tech in Asterisk up until chan_pjsip was introduced had a 1:1 relationship, meaning that a single chan_sip/IAX peer had a single contact location associated with it. So it looks up a single contact, hence the behavior of Dial(PJSIP/device) only pulling the top listed contact. So I can see the need for PJSIP_DIAL_CONTACTS(), as a method was needed for the now 1:many relationship of an AOR having numerous contact locations.

@GSnover, I think we need to see more of this: a full call from start to finish, so we can see just how many contacts are being loaded into the dial string. I do believe that Dial() has character limitations and this could be hitting them. However, I am seeing a few issues in the Dial() that was posted.

I don’t see the ringtimer variable set. That should be in between the dial string and the dial options. If not set there should be a ,, between the dial string and options.
The GLOBAL dial options string is missing. There are no other dial options being prefixed to the predial handler (b option). There should be more options like Tt or Hh or even an r, those are all set via the global variable and any updates to it in the dialplan. I doubt it was reset to null.
The dial command is malformed. There are missing commas between all the dial settings, except for at the end of the predial handler.

So we need to see a properly formatted debug output of this to see what is actually happening. Is this really a Dial() application limitation (string being too long) or is the dialplan messed up? Because if this is an issue with the dial string being too long, this is going to require an Asterisk bug report not a FreePBX bug report.
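
To make the "malformed" point concrete, here is a small Python sketch that checks a Dial() argument string the way the warning in the log suggests: every '&'-separated destination should look like technology/resource. This is a simplified stand-in, not the parser Asterisk uses, and the helper names and sample strings are hypothetical, loosely modelled on the fragment posted earlier.

```python
def split_dial_args(args):
    """Split Dial() arguments on commas that are not inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in args:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def bad_destinations(args):
    """Destinations in the first argument that lack a '/' separator."""
    destinations = split_dial_args(args)[0].split("&")
    return [d for d in destinations if "/" not in d]


if __name__ == "__main__":
    # Well-formed: destinations, empty timeout, then options.
    good = ("PJSIP/122/sip:122@192.0.2.10:5060&Local/90122@zulu-call"
            ",,Ttb(func-apply-sipheaders^s^1)")
    # Cut short: the last destination has run into the options.
    cut = ("PJSIP/122/sip:122@192.0.2.10:5060&Local/90122@zulu-call"
           "&Locb(func-apply-sipheaders^s^1)")
    print(bad_destinations(good))
    print(bad_destinations(cut))
```

Running a check like this on the full Dial() line from the log would show whether the last destination has run into the options, which is what a cut-off string would look like.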
