[SOLVED] PJSIP Service Unavailable

Not really, I’ve given as much as I can to this post.

Not directly related… Do you have FastAGI enabled on your FreePBX? FastAGI makes a significant difference in performance.

Something is subscribed to them but is failing to read them.

@PitzKey In the Advanced Settings I see Launch local AGIs through FastAGI Server is set to NO. I know nothing about FastAGI. When would you want to enable it, and when wouldn’t you? Are there additional steps to “enable” it, or is it as simple as setting that to YES?

I’m definitely learning things here… but I still don’t have my head around this yet. I do not use the traditional Asterisk manager interface at all, so if there is something connected to that it must be FreePBX. HOWEVER, I do use the Asterisk HTTP XML Interface, which got me thinking: could that be an issue? So I ran manager show connected and got this back:

Username         IP Address                                               Start       Elapsed     FileDes   HttpCnt   Read   Write
admin                                                  1645522103  37848       46        0         1073750015  1073750015
firewall                                                1645528089  31862       83        0         2147483647  00064
my_custom_user   xxx.xx.x.xxx                                             1645528749  31202       -1        0         08191  08191
[...thousands of similar lines omitted...]
my_custom_user   xxx.xx.x.xxx                                             1645554560  5391        -1        0         08191  08191
admin                                                  1645559951  0           82        0         1073750015  1073750015
67425 users connected.

I do recognize “my_custom_user” (not real) as something I created and that we use to connect to the aforementioned HTTP Manager.

My understanding of how that manager works is that you send it a simple command and it gives you the response… there is no mechanism to get a stream of events that I’m aware of. So are these messages stacking up because you make a single request to that interface and the server keeps that connection around forever?!? Or am I missing something that I need to be doing to “close” that connection?
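For anyone following along, the single-shot shape of that interface can be sketched like this. This is a minimal illustration only; the host, port, and action here are placeholders, not our real setup:

```python
from urllib.parse import urlencode

def build_amxml_url(host, port, action, **params):
    """Build the URL for one single-shot Manager HTTP XML request."""
    query = urlencode({"action": action, **params})
    return f"https://{host}:{port}/amxml?{query}"

# One request in, one XML document out -- there is no follow-up
# message a caller can send on the same exchange.
url = build_amxml_url("pbx.example.com", 8089, "QueueStatus")
```

A real client would now issue a single GET against that URL (e.g. with urllib.request) and read one XML document back; nothing in the request itself offers a way to close the underlying manager session.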

This is a good read: Performance Improvements in FreePBX

I enable it on all systems, IIRC it’s enabled by default in FreePBX 15+

You’ll need to reboot your PBX afterwards.

I have a feeling that after enabling FastAGI the task processor will be able to handle all of your requests without any issues.

I would suggest eliminating usage of the HTTP interface and seeing if it changes. That interface is ancient, used by few, barely maintained, and likely not understood by many people. If eliminating its usage results in manager working as expected then it would point to that.


I see you’re using the default port. I know it worked fine for a long time, but I would suggest using a different, non-standard port. I didn’t see in the question whether the phones are co-located with the PBX. Is this going through a firewall? Not sure if it applies, but I have had horrendous problems with the standard port 5060 (pjsip) and some routers/ISPs: it works fine for weeks and then just stops (especially FiOS Quantum and Spectrum routers). Rebooting the router didn’t help; it felt like the MAC was stuck in an ARP cache or a SIP ALG, I don’t know, but changing to a new non-standard port cleared it up and I haven’t had an issue in months. Additionally, if the PBX is on a public network it’s possible hackers are constantly trying to connect or place calls (even though they fail). If it’s public facing, turn on intrusion detection and set the limits fairly low so bad actors get blocked for a significant amount of time. That could reduce the workload on the server if it’s processing lots of calls. Also, what’s the CPU, and how much RAM?

The port shouldn’t matter with the firewall configured the way it is, we only allow traffic from branch offices.

The server has 8 cores and 16 GB memory. Which are both way under utilized as far as I can see.

@PitzKey This system is a fresh FreePBX 15 distro install and it’s not enabled, so I don’t think it’s enabled by default. We’ll test it in our lab and if it seems to work well we’ll give it a try here, but I still think that this is related to the HTTP manager.

Did you restart Asterisk without your AMI client running yet?

Does your client conclude with a proper Action: Logoff\n\n ?

@dicko Again, we don’t have an AMI client; we only use the HTTP XML Manager Interface, so there is no opportunity to send subsequent messages such as “Logoff”. On Monday we didn’t have our services running that make those HTTP calls, and we were still seeing taskprocessor warnings. However, I can’t say for certain if the same exact taskprocessor was backing up or if there were manager events backed up like we’re seeing now.

Right now we’re going to try to replicate this in our lab by making a huge number of Manager HTTP requests.

Unless I’m missing something, is there a way to force a logoff when using the HTTP interface?

How Asterisk returns the results is not the same as how the request is made. With AMI, I believe that if you don’t have two CRLFs at the end of a transaction (either over raw TCP or wrapped in HTTP), the connection is considered incomplete.
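For illustration, the classic AMI wire format looks roughly like this. A hedged sketch, not production code; the credentials are placeholders:

```python
# Each AMI action is a block of "Key: Value" header lines terminated
# by a blank line, i.e. the packet ends in \r\n\r\n.
def frame_action(action, **headers):
    """Frame one AMI action as it would be written to the TCP socket."""
    lines = [f"Action: {action}"]
    lines += [f"{k}: {v}" for k, v in headers.items()]
    return "\r\n".join(lines) + "\r\n\r\n"

login = frame_action("Login", Username="admin", Secret="secret")
logoff = frame_action("Logoff")  # the blank line is what completes it
```

Without that trailing blank line, the manager keeps waiting for more headers, which is why an unterminated packet leaves the transaction incomplete.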

I think you may not understand how the Manager HTTP XML Interface works. You make a single HTTP GET request, such as https://YOURSERVER:8089/amxml?action=QueueStatus, and it returns an XML document such as:

    <response type='object' id='unknown'>
        <generic response='Success' eventlist='start' message='Queue status will follow' />
    </response>
    <response type='object' id='unknown'>
        <generic event='QueueParams' queue='default' max='0' strategy='ringall' calls='0' holdtime='0' talktime='0' completed='0' abandoned='0' servicelevel='0' servicelevelperf='0.0' servicelevelperf2='0.0' weight='0' />
    </response>
    <response type='object' id='unknown'>
        <generic event='QueueParams' queue='5000' max='0' strategy='ringall' calls='0' holdtime='0' talktime='0' completed='0' abandoned='0' servicelevel='60' servicelevelperf='0.0' servicelevelperf2='0.0' weight='0' />
    </response>
    <response type='object' id='unknown'>
        <generic event='QueueMember' queue='5000' name='Test Agent 1' location='PJSIP/510' stateinterface='PJSIP/510' membership='static' penalty='0' callstaken='0' lastcall='0' lastpause='0' incall='0' status='5' paused='0' pausedreason='' wrapuptime='0' />
    </response>
    <response type='object' id='unknown'>
        <generic event='QueueStatusComplete' eventlist='Complete' listitems='21' />
    </response>

Unless I’m mistaken I’m not sure where the double \r\n which is required in the standard AMI connection comes into play here.

Quite likely you are right, but you are probably an outlier here. Let us know how it works out for you after closing the connections.

The way the port may matter is that some routers try and insert an application level gateway for SIP, which is usually a bad thing. They will probably only do so for port 5060.

Well, we finally figured out what was going on here and we put a work around in place. I figured that I would post what we found and what we did for the benefit of others who may find this thread in the future.

The Asterisk Manager HTTP interface is crap, don’t use it. The Asterisk team should really remove it, mark it as obsolete, or better yet… fix it.

Because this thread is semi-long, I highly recommend you read it carefully, so you don’t miss any of the details of the journey that led to figuring this out.

Let’s start with a review of the symptoms. Phones were going “offline” in huge chunks and then would come back online. We already knew that there were no firewall or NAT issues in this deployment, so we didn’t have to spend too much time looking at those kinds of things. But we decided to do some SIP debugging anyway to see what else could be going on. We saw that the phones going offline were doing so because, when they attempted to refresh their registration, Asterisk would respond with 503 Service Unavailable.

We learned that this SHOULD only happen when Asterisk’s task processors are backed up. We took a look at the task processors and found one that was enormously backed up: stasis/m:manager:core-00000007, which we guessed has something to do with the Asterisk Manager Interface. I’m not sure why a backup in that task processor would cause PJSIP to stop working, but I’m not familiar with the internals of Asterisk. First, we tried the setting taskprocessor_overload_trigger=none (available in the FreePBX 15+ GUI under Settings > SIP Settings), which should tell Asterisk to never respond with 503, but that setting seemed to have no effect. After a system restart we could see that the setting was applied properly (using pjsip show settings), but those phones were still going offline for the same reason.
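For reference, setting that option by hand outside the GUI would look something like this. This is my best understanding of where it lives in pjsip.conf; double-check the section placement against your Asterisk version’s sample configs:

```ini
; Hypothetical pjsip.conf fragment (res_pjsip global options).
[global]
type=global
taskprocessor_overload_trigger=none   ; none | global | pjsip
```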

Remember, this entire time the system was operating WAY under its available memory, disk IOPS, and total processor. Note what I said there: TOTAL processor. We started noticing that a single Asterisk thread was sitting at around 250%. This server is running virtually on a host with 32-core 2.4 GHz Xeon processors, 320 gigs of RAM, and a RAID 10 SSD array. There are currently NO other workloads on this host. And still this task processor couldn’t keep up (likely because it was all being done by a single thread).

Next we took a look at manager show eventq and saw approximately 34 thousand messages listed. Then we took a look at manager show connected and saw that there were 65 thousand connections!!! On a FreePBX system there should only be 2 or 3 unless you’ve written something custom which connects to the manager. So at this point it seemed like there were only a few possibilities: a service which is part of FreePBX and stays connected to the manager is broken, Asterisk itself is broken, or something we built is broken. Since we don’t have any custom code that connects directly to the standard AMI (Asterisk Manager Interface), we initially assumed it couldn’t be something we wrote. We DO have custom code which makes calls to the Asterisk Manager HTTP Interface, but we didn’t think that could be related because it doesn’t hold a connection open to receive any events.

We now understand what the HTTP interface really is. It’s just a wrapper that accepts manager actions: it opens a connection to the AMI, authenticates, sends your action, waits for the response, and returns it to you. However, it doesn’t log off and close the connection right away; I have no idea why not. In fact, in testing which can be replicated on ANY Asterisk system, the connection it creates to the AMI will remain open and visible in manager show connected for 60 seconds. We found that you can lower that time using the httptimeout setting in manager.conf. However, even if you set it to something like 5 seconds, Asterisk still doesn’t clear those connections right at 5 seconds. More importantly, you can replicate the connections stacking up by simply “spamming” (1-2 requests per second) the HTTP interface with any simple action. The connections will stack up, and even though they’re supposed to time out at 5 seconds, they will remain open for DAYS (yes, literally) if more connections continue to be opened.
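For anyone who wants to experiment with that timeout, the relevant knob sits in manager.conf. The values here are illustrative, and as described above, the timeout is not a hard guarantee under load:

```ini
; Illustrative manager.conf fragment.
[general]
enabled = yes
webenabled = yes
httptimeout = 5   ; default is 60 seconds
```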

This caused the entire death spiral.

Now, onto the solution. We built our own HTTP interface which does the exact same thing: accept an HTTP POST, open a connection to the AMI, run an action, wait for all the expected events, then log off immediately, close the connection, and return the events as nicely formatted JSON. We spammed our version at a rate of 20 requests per second for 30 minutes with no errors, and when we finished there were only the 3 connections open from FreePBX. There are still a few events that show up in manager show eventq, but they seem to be queued for the FreePBX connections. There are no task processors backed up, no threads maxed out, and my team and I can finally get some sleep after 2 weeks of hell.
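For anyone building something similar, here is a rough, hypothetical sketch of the core of what we did; this is illustrative, not our production code. The parser splits the raw AMI byte stream into packets and stops at the EventList: Complete terminator, after which the caller would send a Logoff action and close the socket:

```python
import json

def parse_response(raw):
    """Split an AMI response into packets (dicts of Key: Value pairs),
    stopping once the EventList: Complete terminator packet arrives."""
    packets = []
    for block in raw.split("\r\n\r\n"):
        if not block.strip():
            continue
        packet = {}
        for line in block.split("\r\n"):
            key, _, value = line.partition(": ")
            packet[key] = value
        packets.append(packet)
        if packet.get("EventList") == "Complete":
            break
    return packets

# Abbreviated sample of what a QueueStatus exchange might return.
RAW = ("Response: Success\r\nEventList: start\r\n\r\n"
       "Event: QueueParams\r\nQueue: 5000\r\n\r\n"
       "Event: QueueStatusComplete\r\nEventList: Complete\r\n\r\n")

payload = json.dumps(parse_response(RAW))
```

In the full flow, the wrapper logs in (Action: Login), writes one action, feeds everything it reads into a parser like this, then writes Action: Logoff and closes the connection before returning the JSON payload, so no session lingers in manager show connected.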

In closing I will say that it’s frustrating that using Asterisk and FreePBX, each on their own, feels like trying to navigate a minefield even after 15 years of using them. For example, we read the most up-to-date documentation about features in Asterisk and built a solution on top of them, only to find out later that they don’t work well and have HUGE performance issues; being told “don’t use that, it’s old and unmaintained” when asking for help feels bad. I can only encourage the veterans of this community to continue to stay involved and be helpful and kind; you’re the only thing that will keep this community from slowly falling apart, IMHO. Thank you to those of you who do just that!

P.S. Should anyone at Sangoma be interested in what we learned or wants to replicate this issue don’t hesitate to reach out I’m happy to help.


Hi Darren, we literally had a similar issue yesterday. I am curious whether this is a bug introduced in recent Asterisk versions.

A client called us to complain that calls are randomly dropping.
I checked core show uptime and it showed that Asterisk had only been running for a couple of minutes, but the actual Linux uptime was way longer. I checked /tmp and there was a nice number of core dumps, which indicates that Asterisk kept on crashing. To confirm that, I grepped for the Asterisk version string in all the log files (grep "16.10.3" /var/log/asterisk/full*); on a healthy system, it should only return one line per log file. It returned multiple matches…

The first thing we did was a system reboot, but Asterisk was still crashing here and there.

Next, I opened htop and was able to watch Asterisk eat all system resources until it finally crashed.

In /tmp I also saw a couple of ast-ami-XXXXX files, which, from what I understand, get created if AMI events/commands are not delivered in a timely manner.

So I thought, alright, this is probably FOP2, since it is the only third-party application that uses AMI. But it wasn’t that.

This client uses ARI for call generation/origination. We went ahead, disabled the ARI user, and restarted Asterisk. Once we did that, the average CPU went down from 99% to 8%, and obviously Asterisk hasn’t crashed since.

We thought that they had written an ARI app full of bugs, which is why this was happening, so we asked them to share their source code with us for review… But now I am thinking that their code might actually be written very well and there’s a bug in ARI.

Would you mind checking if Asterisk has also crashed on your system?

And this question is for the community: what’s the best practice to debug ARI connections?
In our case, shutting off the ARI user was a guess; we didn’t see any indication that ARI was eating resources. How can one tell which application within Asterisk is misbehaving?

@darrenhollick If you wish to file an issue on the Asterisk issue tracker[1] feel free, I can’t guarantee anything will happen as a result. It would also be good to know how exactly you knew of the HTTP interface, how to use it, etc. I personally forgot it even existed until you brought it up.

Aside from the HTTP interface - Asterisk is an old old project, with many different things of differing quality as it has developed over time. Newer things are generally of better quality, more well documented, more tested, and more well used. Some things, such as the HTTP interface, are from a time where there was no code review, no huge testing, it was just throw it in.

I spent some time last year putting into place a policy[2] so we can actually deprecate things in a controlled reasonable fashion, because as a large well used project we have to try not to just rip things out without notice. There’s a balance that we ultimately have to strike.

@PitzKey The issue referred to by @darrenhollick is unlikely to have been a recent regression, and is unrelated to yours. There is no per-application or per-thread CPU monitoring/profiling built into Asterisk, which you’d likely need to find a resource consumption issue with an ARI application, and even if we did have it, there’s no guarantee it would show anything, depending on what is being done. What version of Asterisk is in use? If it’s not a recent version I would first update to a more recent one, as any subsequent investigation would need it. Another thing with Originate is that, depending on what is being originated, a computationally expensive transcode path can occur if formats are not provided in the “formats” parameter.

[1] System Dashboard - Digium/Asterisk JIRA
[2] Asterisk Module Deprecation ⋆ Asterisk

Asterisk was not crashing for us.

core show taskprocessors CAN point you in the right direction.

Since this seems like it may be a different issue, I suggest a new thread, in which I will happily work with you.