[SOLVED] PJSIP Service Unavailable

The way the port may matter is that some routers try and insert an application level gateway for SIP, which is usually a bad thing. They will probably only do so for port 5060.

Well, we finally figured out what was going on here and we put a work around in place. I figured that I would post what we found and what we did for the benefit of others who may find this thread in the future.

TL;DR;
The Asterisk Manager HTTP interface is crap, don’t use it. The Asterisk team should really remove it, mark it as obsolete, or better yet… fix it.

Because this thread is semi-long, I highly recommend you read it carefully, so you don’t miss any of the details of the journey that led to figuring this out.

Let’s start with a review of the symptoms. Phones were going “offline” in huge chunks and then would come back online. We already knew that there were not firewall or NAT issues in this deployment so we didn’t have to spend too much time looking at those kinds of things. But we decided to do some SIP debugging anyway to see what else could be going on. We saw that the phones that were going offline were doing so because when they attempted to refresh their registration Asterisk would respond with 503 Service Unavailable.

We learned that this SHOULD only happen when Asterisk’s task processors are backed up. We took a look at the task processors and found that there was one that was enormously backed up. That task processor was stasis/m:manager:core-00000007 which we guessed has something to do with the Asterisk Manager Interface. I’m not sure why a backup in that task processor would cause PJSIP to stop working but I’m not familiar with the internals of Asterisk. First, we tried using the setting taskprocessor_overload_trigger=none (this is available in FreePBX 15+ GUI under Settings > SIP Settings) which should tell Asterisk not respond with 503 ever but that setting seemed to have no affect. After a system restart would see that the setting was in set properly (using pjsip show settings) but those phones were still going offline for the same reason.

Remember, this entire time the system was operating WAY under its available memory, disk IOPS, and total processor. Note what I said there, TOTAL processor. We started noticing that that a single asterisk thread was sitting at around 250%. This server is running virtually on a host with 32 core 2.4Ghtz Xeons processor, 320 gigs of RAM, and a RAID 10 SSD array. There are currently NO other workloads on this host. And still this task processor couldn’t keep up. (likely because it all being done by a single thread)

Next we took a look at manager show eventq and saw that had approximantely 34 thousand messages listed. Next we took a look at manage show connected and saw that there were 65 thousand connections!!! On a FreePBX system there should be 2 or 3 only unless you’ve written something custom which connects to the manager. So at this point it seemed like there were only a few possibilities, a service which is a part of FreePBX which stays connected to the manager is broken, Asterisk itself is broken, or something we built is broken. Since we don’t have any custom code that connects directly to the standard AMI (Asterisk Manager Interface) we initially assumed that it couldn’t be something we wrote. We DO have custom code which makes calls to the Asterisk Manager HTTP Interface but didn’t think that could be related because it doesn’t hold a connection open to receive any events.

We now understand what the HTTP interface really is. It’s just a wrapper that accepts manager actions. Opens a connection to the AMI, authenticates, sends your action, waits for the response, and returns it to you. However, it doesn’t logoff and close the connection right away. I have no idea why not. In fact in testing which can be replicated on ANY Asterisk system. The connection it creates to the AMI will remain open and visible in manage show connected for 60 seconds. We found that you can lower that time using the httptimeout setting in manager.conf. However, even if you set to something like 5 seconds Asterisk still doesn’t clear those connections right at 5 seconds. And more importantly you can replicate the connections stacking up by simply “spamming” (1-2 requests per second) the HTTP interface with any simple action. The connections will stack up and even though they’re supposed to timeout at 5 seconds they will remain open for DAYS (yes, literally) if more connections continue to be opened.

This caused the entire spiral into death.

Now, onto the solution. We built our own HTTP interface which does the exact same thing. Accept an HTTP post, open a connection to the AMI, run a action, wait for all the expected events, and then log off immediately and close the connection and return the events as nicely formatted JSON. We spammed our version of this at a rate of 20 requests per second for 30 minutes with no errors and when we finished there were only the 3 connections open from FreePBX. There are still a few events that show up in manage show eventq but they seem to be queued for the FreePBX connections. There are no task processors backed up, no threads maxed out, and me and my team can finally get some sleep after 2 weeks of hell.

In closing I will say that it’s frustrating that using Asterisk and FreePBX each on their own feels like trying to navigate a mine field even after 15 years of using them. For example, we read the most up-to-date documentation about features in Asterisk, built a solution on top of them only to find out later that they don’t work well, have HUGE performance issues, and when asking for help to be told “don’t use that it’s old and unmaintained” feels bad. I can only encourage the veterans of this community to continue to stay involved and be helpful and kind, you’re the only thing that will keep this community from slowly falling apart IMHO. Thank you to those of you who do just that!

P.S. Should anyone at Sangoma be interested in what we learned or wants to replicate this issue don’t hesitate to reach out I’m happy to help.

1 Like

Hi Darren, we literally had yesterday a similar issue. I am curious if this is a bug introduced in recent Asterisk versions.

A client called us to complain that calls are randomly dropping.
I checked core show uptime and it showed that Asterisk was running for a couple of minutes. But the actual Linux uptime was way longer, I checked /tmp and there was a nice amount of core dumps which indicates that Asterisk kept on crashing. To confirm that, I first grepped for the Asterisk version from all log files (grep "16.10.3" /var/log/asterisk/full*), on a healthy system, it should only return one line per log file. It returned multiple times…

The first thing we did, was a system reboot. But Asterisk was still.crashing every here and there.

Next, I opened htop and was able to watch Asterisk eating all system resources until it finally crashes.

In /tmp I also saw a couple of ast-ami-XXXXX files, which from what I understand, gets created if AMI events/commands are not delivered in a timely manner.

So I thought, airtight, this is probably FOP2 since it is the only third party application that uses AMI. But it wasn’t that.

This client uses ARI for call generation/origination. We went ahead and disabled the ARI user and restarted Asterisk. Once we have done that, the average CPU went down from 99% to 8% and obviously Asterisk didn’t crash since.

We thought that they have written an ARI app full of bugs which is why this is happening, so we have asked them to share with us their source code for review… But now I am thinking that their code might actually be written very well and there’s a bug in ARI.

Would you mind checking if Asterisk has also crashed on your system?

And these question is to the community: what’s the best practice to debug ARI connections?
In our case, shutting the ARI user was a guess, we didn’t see any indications that ARI is eating resources. How can one tell which application within Asterisk is misbehaving?

@darrenhollick If you wish to file an issue on the Asterisk issue tracker[1] feel free, I can’t guarantee anything will happen as a result. It would also be good to know how exactly you knew of the HTTP interface, how to use it, etc. I personally forgot it even existed until you brought it up.

Aside from the HTTP interface - Asterisk is an old old project, with many different things of differing quality as it has developed over time. Newer things are generally of better quality, more well documented, more tested, and more well used. Some things, such as the HTTP interface, are from a time where there was no code review, no huge testing, it was just throw it in.

I spent some time last year putting into place a policy[2] so we can actually deprecate things in a controlled reasonable fashion, because as a large well used project we have to try not to just rip things out without notice. There’s a balance that we ultimately have to strike.

@PitzKey The issue referred to by @darrenhollick is unlikely to have been a recent regression, and is unrelated to yours. There is no per-application or per-thread CPU monitoring/profiling built into Asterisk which you’d likely need to be able to find a resource consumption issue with an ARI application, but even then that’s no guarantee if we did have it that it would show depending on what is being done. What version of Asterisk is in use? If it’s not a recent version I would first update to a more recent version, as any subsequent investigation would need it. Another thing with the Originate option is that depending on what is being originated a computationally expensive transcode path can occur if formats are not provided to the “formats” parameter.

[1] System Dashboard - Digium/Asterisk JIRA
[2] Asterisk Module Deprecation ⋆ Asterisk

Asterisk was not crashing for us.

core show taskprocessors CAN point you the right direction.

Since this seems like it may be a different issue I suggest a new thread, which I will happily work with you in.

@jcolp Thanks for the reply. I will think about filing an issue but I doubt it’s worth it. I do understand it’s hard to remove functionality from a project, I don’t envy having to figure that out for a project like Asterisk.

I would suggest updating any documentation references to it such as:

Maybe add a warning about the state/age of this feature. Thanks again for engaging with the community Josh!

The HTTP server itself has other purposes, such as for WebRTC. I’ve updated the specific page that talks about AJAM with a warning at the top.

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.