Asterisk 16.30.0 OOM killing Asterisk processes every 2 weeks breaking phones

Hi All,

We upgraded to the following version in hopes it would resolve the issue, but the problem persists:
Asterisk 20.5.2 built by mockbuild @ jenkins7 on a x86_64 running Linux on 2024-01-02 13:17:54 UTC

It’s gotten to the point I had to create a cronjob to periodically reboot the server to keep our phones working during business hours. Can anyone give a hand with troubleshooting?
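
For reference, the “reboot” cronjob is nothing fancy, just a nightly root crontab entry roughly like this (illustrative sketch, the exact shutdown path may vary by distro):

  # root crontab: reboot nightly at 03:00 so memory never gets the chance to run out
  0 3 * * * /sbin/shutdown -r now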

What process triggers the OOM?
How much memory do you have?

What unusual usage do you have?

What activities cause the problem process’s virtual memory size to increase, and, if in discrete amounts, by how much each time?
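
The simplest way to capture that progression is to log the process’s memory every few minutes and watch the trend, e.g. a crontab entry roughly like the following (process name and log path are just examples):

  # every 5 minutes, append a timestamp plus Asterisk's virtual and resident sizes (in kB)
  */5 * * * * (date; ps -C asterisk -o pid=,vsz=,rss=) >> /var/tmp/asterisk-mem.log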

Asterisk is the process consuming all the RAM / triggering the OOM. We have 4 GB of RAM (hosted on AWS). We used to run an old version of FreePBX (can’t recall which version) and it worked very reliably (we had uptimes longer than a year). The only reason for the upgrade was to get security patches… Any pointers on identifying whether this is a memory leak, or whether something in the code base changed so that it now requires significantly more RAM, would be much appreciated!

The screenshot below is after 5 hrs of uptime (the cronjob rebooted the server at 3am), and likely no phone calls have been made yet as it’s before business hours.

Sorry, I don’t understand the question… The only change I’m aware of is upgrading to a supported version of FreePBX. As for what activities cause the problem, I don’t know how we could determine that.

We’re a small business with about 20 people with phones (most rarely ever use the phone), and I’d guesstimate the maximum is about 5 simultaneous calls, and that’s infrequent. I’m not sure what other factors would play into how much RAM Asterisk consumes… I just know from looking at htop that /usr/sbin/asterisk is the process consuming all the RAM.
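
For what it’s worth, the same ranking can be pulled without htop with something like:

  # top memory consumers, sorted by resident set size
  ps aux --sort=-rss | head -n 10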

This is the standard question when something stops working, but was working before.

Unusual behaviour under such circumstances normally indicates a security problem, e.g. large numbers of toll fraud attempts getting through your firewall.

(htop screenshot)

When debugging memory leaks, what is important is the time progression of the memory usage.

Incidentally, I note that your real memory use is only 9.4% and you have essentially zero swap, so I think most of the virtual memory here represents memory-mapped files. That may not be an issue, unless they are log files, in which case one probably comes back to a security failure causing those to grow rapidly, or a failure to rotate them properly.
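
A quick way to check whether that is what is going on, on the assumption of a stock install layout:

  # largest mappings in the asterisk process, by virtual size; big file-backed regions will stand out
  pmap -x $(pidof asterisk) | sort -k2 -n | tail -n 20
  # are the Asterisk logs growing abnormally, or failing to rotate?
  ls -lhS /var/log/asterisk/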

The htop output shows less than 30% of memory in use. The blue and yellow bars can be sacrificed if memory becomes tight, although that might cause performance problems. Memory-mapped files are represented by the yellow bar, which also includes pure executable code.
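
free shows the same breakdown without the colour coding; when deciding whether memory is actually tight, the “available” column is the one to watch:

  # buff/cache is reclaimable; 'available' estimates what can still be handed to processes
  free -h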

From what version of FreePBX to what version? Asterisk was not touched? (I should also add that 16 is end of life and no longer receives security fixes)

I’ll also go into memory usage a bit. Asterisk is a huge application with lots of functionality. What parts of Asterisk are used determines the code paths, and the memory. There shouldn’t be a leak of course but determining the code path and how to reproduce it can be helpful in identifying it.

Please keep in mind I’m using the really big hammer approach of rebooting via cron every night, so there isn’t enough time for the RAM to grow; we know that if I leave it running, it will eventually result in the OOM killer killing Asterisk, and our business phones will go offline as a result.

The basic principle here is that, if normal use resulted in memory leaks, there would be lots of reports and it would already be fixed. I think you are using the last non-security release of 16, which means that, although it wouldn’t have been fixed in 16, if the leak were new in 16.30, enough people are still running it beyond end of life that it would have produced multiple reports if it occurred in normal use.

Therefore, the assumption has to be that there is something unusual about your system and it will need a lot of information to track down the cause.

I apologize, after reviewing the history of events my initial description is slightly incorrect.

  • Earlier this year we noticed the OOM killer was killing Asterisk processes, and we didn’t know what changed. We assumed a module update… Prior to this our phone server had over a year of uptime with no issues.
  • We upgraded from version 16.30 → 20.5.2 to get security patches, and in hopes it might also fix the memory problems. It didn’t…

How can we “size” the RAM? Maybe we just don’t have enough RAM?

Datapoint below FWIW.

OOM killer kernel messages:

Mar 27 01:42:39 ip-172-31-2-102 kernel: httpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Mar 27 01:42:40 ip-172-31-2-102 kernel: httpd cpuset=/ mems_allowed=0
Mar 27 01:42:40 ip-172-31-2-102 kernel: CPU: 0 PID: 1626 Comm: httpd Kdump: loaded Tainted: G           OE  ------------   3.10.0-1127.19.1.el7.x86_64 #1
Mar 27 01:42:40 ip-172-31-2-102 kernel: Hardware name: Amazon EC2 t3.medium/, BIOS 1.0 10/16/2017
Mar 27 01:42:40 ip-172-31-2-102 kernel: Call Trace:

Mar 27 01:42:41 ip-172-31-2-102 kernel: Out of memory: Kill process 26026 (asterisk) score 776 or sacrifice child
Mar 27 01:42:41 ip-172-31-2-102 kernel: Killed process 26026 (asterisk), UID 995, total-vm:7802388kB, anon-rss:2295836kB, file-rss:0kB, shmem-rss:0kB

RAM consumption graph from CloudWatch (AWS):
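
The full history of OOM kills can be pulled with something like this (the box is EL7, so syslog goes to /var/log/messages):

  # OOM events still in the kernel ring buffer, with human-readable timestamps
  dmesg -T | grep -iE 'out of memory|oom-killer'
  # further back, from syslog
  grep -i 'killed process' /var/log/messages*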

I can’t speak to how to size the RAM, but memory usage shouldn’t just keep going up and up until it gets killed - unless you really are somehow pushing things to the extreme, and even then I haven’t seen that in practice.

You would need to identify the specific usage of FreePBX that causes the memory to increase like that in order to understand the cause.

Thanks for the help! How can I identify the specific usage of FreePBX that is causing memory to increase?

One option is to reproduce the same setup, confirm that the memory usage increases on it, and then eliminate usage of functionality in FreePBX to determine the source of it.

Right now it’s a bit of looking for a needle in a haystack. Even if I were to set up a FreePBX, I probably wouldn’t have the same issue since I wouldn’t set it up the same. The same goes for Asterisk itself without FreePBX.

Thanks, that’s what I was worried was the case… If you were in my shoes, what would you do? Do we take a really big hammer and do a clean install, or just use the not-so-elegant approach of leaving the cron job running that reboots the server every night… Or maybe something else?

I’d like to determine if it is a memory leak and a bug in Asterisk, but I can’t do that without details and further isolation. It’s up to you what is acceptable long term - be it the nightly cronjob, or isolating things so that information can be provided.
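
If rebuilding Asterisk on the sandbox is an option, one avenue is the MALLOC_DEBUG compiler flag (enabled in menuselect); with such a build the CLI can break allocations down by source file, which helps narrow down where a leak lives:

  # only available when Asterisk was compiled with MALLOC_DEBUG
  asterisk -rx "memory show summary"
  asterisk -rx "memory show allocations"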

Is anyone else reading this having the same issue?

It would seem that your httpd process triggered the OOM. As a diagnostic, the web server is not needed for FreePBX to keep ‘ticking along’, so you could try stopping it and see whether that returns the memory, or at least stops the leaking.
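
Since your OOM log shows Apache as httpd, stopping it for the test should just be a matter of:

  # the FreePBX web GUI will be unavailable while Apache is stopped
  systemctl stop httpd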

I think what is happening is that Asterisk’s memory consumption grows and grows until there is no RAM left, and then whatever process next tries to allocate memory triggers the OOM killer. In this case httpd was simply the last process to ask for memory when none was left.

A couple of screen captures of htop sorted by memory consumption show Asterisk as consuming the most RAM, and Asterisk’s RAM consumption growing:

Baseline:

12 hrs later:

I’ll shut down httpd as I’m grasping at straws at this point, but I don’t have a lot of hope given the data above.

I’m happy to do any testing / troubleshooting / data collection if you can give me pointers. I think the recommendation is to clone the VM (create a sandbox for troubleshooting), then verify the sandbox’s memory consumption is growing (i.e. the same problem exists), and then start experimenting with disabling features to isolate where the leak is coming from. Which features, in your experience, are most likely to consume the most RAM? I’m thinking I should start with disabling modules first?
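
For example, something like this on the sandbox, one module at a time, while watching whether the growth stops (the module name is only an example):

  # runtime test from the shell; to persist it across restarts, add a matching noload line to modules.conf
  asterisk -rx "module unload res_hep.so"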

I can’t speak for FreePBX; that is what directs Asterisk in what it does.

After standing up a large number of boxes, I have had similar results with 14, 15, 16. I might spin up 5 or 10 at a time, identical distro installs, yet one out of the bunch would just continue to grow memory until Asterisk failed and a reboot was required. The number of calls didn’t matter as it could grow overnight when there were no calls and 16GB of RAM did nothing. Tried to poke at it a few times, never really got to the root of it.

We discarded the boxes that showed the behavior and spun up a new VM. For a while we did what you did with full reboots, but then changed to clearing the RAM, which seemed to do the job without the need for a full reboot: Clear RAM Memory Cache, Buffer, and Swap Space on Linux | Hostbillo
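
For reference, what those guides boil down to is dropping the kernel’s page cache and slab caches (run as root):

  # flush dirty pages to disk, then drop pagecache, dentries and inodes
  sync && echo 3 > /proc/sys/vm/drop_caches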
