[SOLVED] PJSIP Service Unavailable

I am seeing something very odd happening on a system that has been running very well for months. We recently started seeing large chunks of phones (35-40 phones at a time) go unavailable. I started looking at some pjsip logging details and I’m seeing that these phones are attempting to re-REGISTER before their expiration time. What is weird is that Asterisk is responding to their re-REGISTER packet with “503 Service Unavilable”. Then a few seconds later Asterisk removes the pjsip contact due to “expiration”. Then a little later the phone will start over trying to register and this time the process looks normal and it works for a while until eventually this odd behavior happens again.

Here is the log output…

[2022-02-17 10:13:49] VERBOSE[2445] res_pjsip_logger.c: <--- Received SIP request (649 bytes) from TCP:xxx.xxx.xxx.20:11890 --->
REGISTER sip:SCRUBBED.DNS.ENTRY:5060 SIP/2.0
Via: SIP/2.0/TCP 10.0.0.192:11890;branch=z9hG4bK3881921094
From: "John Doe (7065)" ;tag=2536490602
To: "John Doe (7065)" 
Call-ID: [email protected]
CSeq: 1 REGISTER
Contact: 
Allow: INVITE, INFO, PRACK, ACK, BYE, CANCEL, OPTIONS, NOTIFY, REGISTER, SUBSCRIBE, REFER, PUBLISH, UPDATE, MESSAGE
Max-Forwards: 70
User-Agent: Yealink SIP-T54W 96.84.0.30
Expires: 60
Allow-Events: talk,hold,conference,refer,check-sync
Content-Length: 0

[2022-02-17 10:13:49] VERBOSE[2445] res_pjsip_logger.c: <--- Transmitting SIP response (400 bytes) to TCP:xxx.xxx.xxx.20:11890 --->
SIP/2.0 503 Service Unavailable
Via: SIP/2.0/TCP 10.0.0.192:11890;received=xxx.xxx.xxx.20;branch=z9hG4bK3881921094
Call-ID: [email protected]
From: "John Doe (7065)" ;tag=2536490602
To: "John Doe (7065)" ;tag=z9hG4bK3881921094
CSeq: 1 REGISTER
Server: FPBX-14.0.16.11(16.20.0)
Content-Length: 0

[2022-02-17 10:14:04] VERBOSE[53826] res_pjsip_logger.c: <--- Transmitting SIP request (463 bytes) to TCP:xxx.xxx.xxx.20:11890 --->
OPTIONS sip:[email protected]:11890;transport=TCP SIP/2.0
Via: SIP/2.0/TCP xxx.xxx.xxx.12:5060;rport;branch=z9hG4bKPj88134694-2580-4968-b27b-a4e00a7c2abf;alias
From: ;tag=65cc0190-02c9-44fa-bbff-5a1015e2eb92
To: 
Contact: 
Call-ID: a255fd4e-a752-42d0-9668-9b68445826e1
CSeq: 35299 OPTIONS
Max-Forwards: 70
User-Agent: FPBX-14.0.16.11(16.20.0)
Content-Length: 0


[2022-02-17 10:14:04] VERBOSE[2445] res_pjsip_logger.c: <--- Received SIP response (370 bytes) from TCP:xxx.xxx.xxx.20:11890 --->
SIP/2.0 200 OK
Via: SIP/2.0/TCP xxx.xxx.xxx.12:5060;rport=5060;branch=z9hG4bKPj88134694-2580-4968-b27b-a4e00a7c2abf;alias
From: ;tag=65cc0190-02c9-44fa-bbff-5a1015e2eb92
To: ;tag=1837962263
Call-ID: a255fd4e-a752-42d0-9668-9b68445826e1
CSeq: 35299 OPTIONS
User-Agent: Yealink SIP-T54W 96.84.0.30
Content-Length: 0

. . .

[2022-02-17 10:14:05] VERBOSE[2463] res_pjsip_registrar.c: Removed contact 'sip:[email protected]:11890;transport=TCP;x-ast-orig-host=10.0.0.192:11890' from AOR '7065' due to expiration

[2022-02-17 10:14:20] VERBOSE[2445] res_pjsip_logger.c: <--- Received SIP request (648 bytes) from TCP:xxx.xxx.xxx.20:11890 --->
REGISTER sip:SCRUBBED.DNS.ENTRY:5060 SIP/2.0
Via: SIP/2.0/TCP 10.0.0.192:11890;branch=z9hG4bK3913095958
From: "John Doe (7065)" ;tag=795045674
To: "John Doe (7065)" 
Call-ID: [email protected]
CSeq: 1 REGISTER
Contact: 
Allow: INVITE, INFO, PRACK, ACK, BYE, CANCEL, OPTIONS, NOTIFY, REGISTER, SUBSCRIBE, REFER, PUBLISH, UPDATE, MESSAGE
Max-Forwards: 70
User-Agent: Yealink SIP-T54W 96.84.0.30
Expires: 60
Allow-Events: talk,hold,conference,refer,check-sync
Content-Length: 0


[2022-02-17 10:14:20] VERBOSE[45806] res_pjsip_logger.c: <--- Transmitting SIP response (550 bytes) to TCP:xxx.xxx.xxx.20:11890 --->
SIP/2.0 401 Unauthorized
Via: SIP/2.0/TCP 10.0.0.192:11890;rport=11890;received=xxx.xxx.xxx.20;branch=z9hG4bK3913095958
Call-ID: [email protected]
From: "John Doe (7065)" ;tag=795045674
To: "John Doe (7065)" ;tag=z9hG4bK3913095958
CSeq: 1 REGISTER
WWW-Authenticate: Digest realm="asterisk",nonce="SCRUBBED",opaque="SCRUBBED",algorithm=md5,qop="auth"
Server: FPBX-14.0.16.11(16.20.0)
Content-Length: 0


[2022-02-17 10:14:20] VERBOSE[2445] res_pjsip_logger.c: <--- Received SIP request (932 bytes) from TCP:xxx.xxx.xxx.20:11890 --->
REGISTER sip:SCRUBBED.DNS.ENTRY:5060 SIP/2.0
Via: SIP/2.0/TCP 10.0.0.192:11890;branch=z9hG4bK530506181
From: "John Doe (7065)" ;tag=795045674
To: "John Doe (7065)" 
Call-ID: [email protected]
CSeq: 2 REGISTER
Contact: 
Authorization: Digest username="7065", realm="asterisk", nonce="SCRUBBED", uri="sip:SCRUBBED.DNS.ENTRY:5060", response="SCRUBBED", algorithm=MD5, cnonce="SCRUBBED", opaque="SCRUBBED", qop=auth, nc=00000001
Allow: INVITE, INFO, PRACK, ACK, BYE, CANCEL, OPTIONS, NOTIFY, REGISTER, SUBSCRIBE, REFER, PUBLISH, UPDATE, MESSAGE
Max-Forwards: 70
User-Agent: Yealink SIP-T54W 96.84.0.30
Expires: 60
Allow-Events: talk,hold,conference,refer,check-sync
Content-Length: 0


[2022-02-17 10:14:20] VERBOSE[81373] res_pjsip_registrar.c: Added contact 'sip:[email protected]:11890;transport=TCP;x-ast-orig-host=10.0.0.192:11890' to AOR '7065' with expiration of 60 seconds
[2022-02-17 10:14:20] VERBOSE[81373] res_pjsip_logger.c: <--- Transmitting SIP response (505 bytes) to TCP:xxx.xxx.xxx.20:11890 --->
SIP/2.0 200 OK
Via: SIP/2.0/TCP 10.0.0.192:11890;rport=11890;received=xxx.xxx.xxx.20;branch=z9hG4bK530506181
Call-ID: [email protected]
From: "John Doe (7065)" ;tag=795045674
To: "John Doe (7065)" ;tag=z9hG4bK530506181
CSeq: 2 REGISTER
Date: Thu, 17 Feb 2022 15:14:20 GMT
Contact: ;expires=59
Expires: 60
Server: FPBX-14.0.16.11(16.20.0)
Content-Length: 0

What would cause Asterisk to respond with 503 Service Unavailable randomly to some phones some of the time?

If the system is overloaded and tasks are piling up (they are being created faster than the system can handle it), then new requests will receive a 503 to give the system time to work off the queue.

@jcolp Thank you for that. I was suspicious of that myself but when I look at the system utilization I can’t see what is being “starved”. The system has 8 cores and 16GB of memory a few hundred phones with no mor than 40 concurrent calls (usually only 20). The system seems to idle around a cpu load of 3-5 and like 20% memory utilized.

Can you point me in a direction of where I should look for what would be causing this system not to be able to keep up?

The “core show traskprocessors” would show if taskprocessors are being overloaded and which ones, and it also appears in the log. I can’t point you at a specific thing as to the cause even with that, as it depends on how the system is used as well.

@jcolp I have seen messages intermittently in the logs like:

(Last 50000 lines filtered on “taskprocessor”)

[2022-02-17 10:02:12] WARNING[120959] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:04:20] WARNING[122079][C-00000124] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:05:12] WARNING[122529][C-00000128] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:06:26] WARNING[122937][C-00000129] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:08:06] WARNING[123745][C-0000012b] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:08:47] WARNING[124045][C-0000012d] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:09:25] WARNING[124359][C-0000012e] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:09:47] WARNING[124508][C-0000012f] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:10:12] WARNING[124706][C-00000130] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:10:50] WARNING[125073][C-00000132] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:13:03] WARNING[126103][C-00000132] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:13:31] WARNING[126353][C-00000134] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:14:44] WARNING[126888][C-00000135] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:15:20] WARNING[127197][C-00000137] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:16:11] WARNING[127497][C-00000139] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:18:18] WARNING[128343][C-0000013a] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:19:18] WARNING[128763][C-0000013b] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:27:39] WARNING[1378][C-00000147] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:28:03] WARNING[1478][C-00000148] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:36:38] WARNING[5595][C-00000153] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:38:53] WARNING[6681][C-00000155] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:40:48] WARNING[7650][C-00000157] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:41:32] WARNING[2371] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:42:14] WARNING[8418][C-00000159] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:43:08] WARNING[8832][C-0000015a] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:44:08] WARNING[9381][C-0000015b] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:44:54] WARNING[9699][C-0000015c] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:45:33] WARNING[10070][C-0000015d] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:46:06] WARNING[10290][C-0000015e] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:48:20] WARNING[11192][C-0000015f] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 10:49:15] WARNING[11618][C-00000160] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.
[2022-02-17 12:24:52] WARNING[39152][C-000001c8] taskprocessor.c: The 'stasis/m:manager:core-00000007' task processor queue reached 3000 scheduled tasks again.

Does that help you be able to give me any further direction on what to look at?

That would mean there are a ton of manager events being created and they aren’t being processed fast enough. There is an option in Asterisk to restrict the PJSIP 503 behavior to only overloads in PJSIP, I don’t know how or where that is exposed in FreePBX. I do believe in recent versions the default was changed to PJSIP only.

@jcolp Where is the option exposed in Asterisk? Meaning do you know the setting and config file where that is set? We’re running Asterisk 16.

@jcolp you are awesome! There are options in the GUI of FreePBX 15+ to do this but since I’m on 14 at the moment I’d like to do it manually. Asyou point out, the setting seems to be taskprocessor_overload_trigger in pjsip.conf in the [global] section. Since I can’t edit pjsip.conf on a FreePBX system without it being overwritten I need another way to do it.

In pjsip.conf I see this:

[global]
type=global
user_agent=FPBX-14.0.16.11(16.20.0)
use_callerid_contact=no
endpoint_identifier_order=ip,username,anonymous,header,auth_username
#include pjsip_custom_post.conf

Does that mean I can add this setting in pjsip_custom_post.conf? If so, can you confirm how to do it?

Does this look correct?

[global](+)
taskprocessor_overload_trigger=none

UPDATE: I can confirm the above content in pjsip_custom_post.conf do properly configure this setting.

That seems vaguely correct, but I don’t do such configuration ever.

@jcolp Hmm, pjsip show settings reveals that this is already set to none.

lobal Settings:

ParameterName                              : ParameterValue
=======================================================================================
contact_expiration_check_interval          : 30
debug                                      : no
default_from_user                          : asterisk
default_outbound_endpoint                  : default_outbound_endpoint
default_realm                              : asterisk
default_voicemail_extension                :
disable_multi_domain                       : false
endpoint_identifier_order                  : ip,username,anonymous,header,auth_username
ignore_uri_user_options                    : false
keep_alive_interval                        : 90
max_forwards                               : 70
max_initial_qualify_time                   : 0
mwi_disable_initial_unsolicited            : false
mwi_tps_queue_high                         : 500
mwi_tps_queue_low                          : -1
norefersub                                 : yes
regcontext                                 :
send_contact_status_on_update_registration : no
taskprocessor_overload_trigger             : none
unidentified_request_count                 : 5
unidentified_request_period                : 5
unidentified_request_prune_interval        : 30
use_callerid_contact                       : no
user_agent                                 : FPBX-14.0.16.11(16.20.0)

System Settings:

ParameterName               : ParameterValue
============================================
accept_multiple_sdp_answers : false
compact_headers             : false
disable_rport               : false
disable_tcp_switch          : true
follow_early_media_fork     : true
threadpool_auto_increment   : 5
threadpool_idle_timeout     : 60
threadpool_initial_size     : 0
threadpool_max_size         : 50
timer_b                     : 32000
timer_t1                    : 500

Shouldn’t that mean that asterisk wouldn’t have responded with a 503?

The “pjsip show settings” CLI command would show it.

@jcolp Sorry, I realized that after posting and updated my post. The setting is currently set to none, shouldn’t that mean that asterisk wouldn’t have responded with a 503?

If the option is set to “none” then the behavior with 503 on overload should not occur.

@jcolp So, what do you make of that fact that it did happen even though that setting was set to none?

I’d need to see an actual full log[1] to be able to answer further.

[1] Collecting Debug Information - Asterisk Project - Asterisk Project Wiki

@jcolp I don’t want to take any more of your time so let me continue to do as much research as I can and if I get completely stuck I will add a debug log and let it run while we’re experiencing the issue then I can send it directly to you if you’re still willing to take a look at that time.

@jcolp Since my last posts we have completely rebuilt the system in question and it is now on FreePBX 15 (FreePBX Distro) and Asterisk 16.20.0. I have seen you post many times regarding task processors and what they are. I’ve also seen you say that understanding what is causing them to get overloaded is not a simple task. Given that this server seems to have ample resources (CPU and MEMORY are at low utilization) and we’re seeing messages such as:

[2022-02-21 13:09:28] WARNING[117154][C-0000038f] taskprocessor.c: The 'stasis/m:channel:all-0000037a' task processor queue reached 500 scheduled tasks.

AND

[2022-02-21 12:55:57] WARNING[111477][C-00000367] taskprocessor.c: The 'stasis/m:ari:application/intulse-hold-unhold-0000039a' task processor queue reached 500 scheduled tasks again.

If we were to share more information with you would you be able to point us to where they may be an issue? Or even at a higher level are you available as a consultant?

It means lots of channels, and an ARI application is not handling the events fast enough. That’s the extent of what I can say. I’m not available for consulting.

@jcolp Thanks for the response. Can you expand at all on “not handling events fast enough”? For example, are you saying that the events are being dispatched the the ARI application and it’s not processing them fast enough? Or is Asterisk having a hard time keeping up with sending the events to the ARI app?

Regarding channels… what app or part of Asterisk is not keeping up? This is where I think the community who are not Asterisk developers ourselves are struggling with understanding. In this case this system is pretty “out of the box” FreePBX. The one added thing is that we do have an ARI application which subscribes to All Channels and is set to only allow ChannelHold and ChannelUnhold events. See Below:

[
    {
        "name": "intulse-hold-unhold",
        "channel_ids": [
            "__AST_CHANNEL_ALL_TOPIC"
        ],
        "bridge_ids": [],
        "endpoint_ids": [],
        "device_names": [],
        "events_allowed": [
            {
                "type": "ChannelHold"
            },
            {
                "type": "ChannelUnhold"
            }
        ],
        "events_disallowed": []
    }
]