FRACK and other serious issues (maybe mixmonitor? maybe tasksprocessor?)

amcoit · August 17, 2018, 8:23pm

We’ve had a production level freepbx distro (all modules up to date and also all yum updates applied) running for about 1 month, no issues. Last week things went crazy.
Now several times per day extensions unregister and can no longer register. No incoming calls work either. The only solution Ive found is either a full system restart or a fwconsole restart.

Things that changed last week.
I activated the “Queues Pro” commercial module, but I have disabled and removed it from the system since. Removing it made no difference.
I played with asternic’s convert to MP3 settings.
I might have updated modules last week. I don’t remember.

So now, a few things I see (though I can’t be sure they weren’t there before)… I get the warnings about task processor being over 3000 all the time. I’ve checked and sometimes tasks have the queue column up to the hundreds of thousands. The system is a Xen Server virtual machine, with 4 cores and 16GB, but only about 100 extensions and Ive never seen more than 40 simultaneous calls.

Things seem to work normally and then the unregistration thing happens. I thought it was the queued up tasks, but Ive now seen the unregistration happen with queues at zero.

What I discovered now is that it seems mixmonitor starts recording and never stops. It also creates dozens of tiny 44byte files for each call.

My latest test is stopping all recordings and see if there are any problems, but I just made that change so nothing to report yet.

While in the CLI Ive also seen this FRACK and some backtracing, though this wasnt folliowed by the unregistration so no clue if this is related.

Any tips on what I should try to look for in order to debug this issue?
thank you!
-amco

PS.- problem just happened even though the call recordings are off, so maybe those 44 byte files are unrelated

some of the tasks that give warnings are:

subm:ast_channel_topic_all-cached-0000057b
subm:ast_channel_topic_all-00000583
subp:ast_channel_topic_all-00000895
subm:cel_aggregation_topic-00000006
this following one happens most often
subm:manager_topic-00000007 5818352 1650025 1663207 2700 3000

here is the frack i caught

[2018-08-17 14:59:11] ERROR[32159][C-00000035]: pbx_functions.c:701 ast_func_write: Function AUDIOHOOK_INHERIT not registered
[2018-08-17 14:59:12] ERROR[32207][C-00000011]: astobj2.c:518 __ao2_ref: Excessive refcount 100000 reached on ao2 object 0x2a189d0
[2018-08-17 14:59:12] ERROR[32207][C-00000011]: astobj2.c:518 __ao2_ref: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x2a189d0 (0)
Got 21 backtrace records
#0: [0x5fcc55] asterisk __ast_assert_failed() (0x5fcbd1+84)
#1: [0x45d4bd] asterisk ()
#2: [0x45d816] asterisk __ao2_ref() (0x45d7e8+2E)
#3: [0x5cf8e6] asterisk ()
#4: [0x5cfa69] asterisk ()
#5: [0x5c7786] asterisk ()
#6: [0x5c8338] asterisk ()
#7: [0x5c8620] asterisk ()
#8: [0x5c868f] asterisk stasis_publish() (0x5c8667+28)
#9: [0x5d4569] asterisk ast_channel_publish_snapshot() (0x5d44b7+B2)
#10: [0x582e74] asterisk pbx_exec() (0x582dda+9A)
#11: [0x570787] asterisk ()
#12: [0x573c6a] asterisk ast_spawn_extension() (0x573c06+64)
#13: [0x574945] asterisk ()
#14: [0x576115] asterisk ()
#15: [0x5f9c08] asterisk ()
[2018-08-17 14:59:13] ERROR[32223][C-00000031]: pbx_functions.c:701 ast_func_write: Function AUDIOHOOK_INHERIT not registered
[2018-08-17 14:59:15] ERROR[32336][C-0000002d]: pbx_functions.c:701 ast_func_write: Function AUDIOHOOK_INHERIT not registered
[2018-08-17 14:59:16] ERROR[32326][C-0000003a]: astobj2.c:518 __ao2_ref: Excessive refcount 100000 reached on ao2 object 0x2a18d20
[2018-08-17 14:59:16] ERROR[32326][C-0000003a]: astobj2.c:518 __ao2_ref: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x2a18d20 (0)
Got 23 backtrace records
#0: [0x5fcc55] asterisk __ast_assert_failed() (0x5fcbd1+84)
#1: [0x45d4bd] asterisk ()
#2: [0x45d816] asterisk __ao2_ref() (0x45d7e8+2E)
#3: [0x5d7ca2] asterisk stasis_message_create_full() (0x5d7c2e+74)
#4: [0x5d7d2a] asterisk stasis_message_create() (0x5d7d02+28)
#5: [0x5d3d4c] asterisk ()
#6: [0x5d3dab] asterisk ast_channel_blob_create_from_cache() (0x5d3d62+49)
#7: [0x5d45bf] asterisk ast_channel_publish_cached_blob() (0x5d457c+43)
#8: [0x5d4716] asterisk ast_channel_publish_varset() (0x5d466c+AA)
#9: [0x58ca5e] asterisk pbx_builtin_setvar_helper() (0x58c6fc+362)
#10: [0x7fd6f5b5912e] app_macro.so ()
#11: [0x7fd6f5b59552] app_macro.so ()
#12: [0x582ef6] asterisk pbx_exec() (0x582dda+11C)
#13: [0x570787] asterisk ()
#14: [0x573c6a] asterisk ast_spawn_extension() (0x573c06+64)
#15: [0x574945] asterisk ()
#16: [0x576115] asterisk ()
#17: [0x5f9c08] asterisk ()
[2018-08-17 14:59:16] ERROR[32338][C-0000003b]: astobj2.c:518 __ao2_ref: Excessive refcount 100000 reached on ao2 object 0x2a18d20
[2018-08-17 14:59:16] ERROR[32338][C-0000003b]: astobj2.c:518 __ao2_ref: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x2a18d20 (0)
Got 23 backtrace records
#0: [0x5fcc55] asterisk __ast_assert_failed() (0x5fcbd1+84)
#1: [0x45d4bd] asterisk ()
#2: [0x45d816] asterisk __ao2_ref() (0x45d7e8+2E)
#3: [0x5d7ca2] asterisk stasis_message_create_full() (0x5d7c2e+74)
#4: [0x5d7d2a] asterisk stasis_message_create() (0x5d7d02+28)
#5: [0x5d3d4c] asterisk ()
#6: [0x5d3dab] asterisk ast_channel_blob_create_from_cache() (0x5d3d62+49)
#7: [0x5d45bf] asterisk ast_channel_publish_cached_blob() (0x5d457c+43)
#8: [0x5d4716] asterisk ast_channel_publish_varset() (0x5d466c+AA)
#9: [0x58ca3e] asterisk pbx_builtin_setvar_helper() (0x58c6fc+342)
#10: [0x7fd6f5b58220] app_macro.so ()
#11: [0x7fd6f5b59552] app_macro.so ()
#12: [0x582ef6] asterisk pbx_exec() (0x582dda+11C)
#13: [0x570787] asterisk ()
#14: [0x573c6a] asterisk ast_spawn_extension() (0x573c06+64)
#15: [0x574945] asterisk ()
#16: [0x576115] asterisk ()
#17: [0x5f9c08] asterisk ()
[2018-08-17 14:59:16] ERROR[32370][C-0000003b]: pbx_functions.c:701 ast_func_write: Function AUDIOHOOK_INHERIT not registered

amcoit · August 17, 2018, 8:41pm

update: I am not sure if the “calls processed” number that shows at the bottom of “core show channels” is literally number of phone calls, but its increasing at a rate of dozens per second even though our call volume is just a few hundred per day.

adtopkek · August 17, 2018, 8:47pm

We have been fighting this for a while. I haven’t reported this yet, lack of time, but for us we have a customer with 13 phones in a queue. Those phones are part of a company with 20 phones with BLF’s all watching one another. Whenever they get an inbound call that rings their queue it throws:

[2018-07-27 23:17:20] WARNING[28680][C-00000000]: taskprocessor.c:895 taskprocessor_push: The ‘subm:manager_topic-00000007’ task processor queue reached 3000 scheduled tasks.

Without fail. 13 phones in a queue plus all those BLF updates seem to overwhelm the ‘subm:manager_topic-00000007’. Do you have a similar setup when this happens?

Asterisk mentioned that you can do:

If the task processor name happens to be of the ‘subm:<topic>’ form where the topic is your stasis application name then that task processor is servicing the stasis message bus communication with your application. One thing you can try is to register multiple copies of your stasis application and randomly spread calls to the different copies of the application. e.g., my_stasis_app1, my_stasis_app2,…

But I don’t know if its a configuration option or a Freepbx change.

adtopkek · August 17, 2018, 8:49pm

I think every phone that rings increments it by 1 so a queue with 13 members would add 13 to it. Not 100% sure but it seems like that.

tm1000 · August 17, 2018, 9:00pm

Freepbx doesn’t have any stasis applications besides Zulu. So it’s not freepbx.

amcoit · August 17, 2018, 9:03pm

I wrote about this cause it looks weird, but its one of several things going on. I have had this situation where phone unregister and cannot register again with 0 on the queue for the taskprocessor, so I can’t say this is specifically the problem.
We have asternic and fop2 installed, maybe they are using the BLF on our side. We use mostly softphones (Zoiper).

amcoit · August 17, 2018, 9:06pm

Is there anything you can recommend we should look into? the system worked perfectly for over a month with the same loads. Things started going crazy last week on friday with one of these “crashes” during the day and it has gotten continuously worse, with several per day now.
thank you,
amco

amcoit · August 17, 2018, 10:19pm

now I am getting thousands of this message

Spawn extension (from-queue-exten-internal, 7229, 10) exited non-zero on ‘Local/7229@from-queue-00004f8b;2’
== Spawn extension (macro-hangupcall, s, 6) exited non-zero on ‘Local/7239@from-queue-00004f8a;2’ in macro ‘hangupcall’
== Spawn extension (from-queue-exten-internal, h, 1) exited non-zero on ‘Local/7239@from-queue-00004f8a;2’
== Spawn extension (macro-hangupcall, s, 6) exited non-zero on ‘Local/7229@from-queue-00004f8b;2’ in macro ‘hangupcall’
== Spawn extension (from-queue-exten-internal, h, 1) exited non-zero on ‘Local/7229@from-queue-00004f8b;2’
== Spawn extension (from-queue-exten-internal, 7239, 10) exited non-zero on ‘Local/7239@from-queue-00004f8c;2’
== Spawn extension (macro-hangupcall, s, 6) exited non-zero on ‘Local/7239@from-queue-00004f8c;2’ in macro ‘hangupcall’
== Spawn extension (from-queue-exten-internal, h, 1) exited non-zero on ‘Local/7239@from-queue-00004f8c;2’
== Spawn extension (from-queue-exten-internal, 7204, 10) exited non-zero on ‘Local/7204@from-queue-00004f8d;2’
== Spawn extension (macro-hangupcall, s, 6) exited non-zero on ‘Local/7204@from-queue-00004f8d;2’ in macro ‘hangupcall’
== Spawn extension (from-queue-exten-internal, h, 1) exited non-zero on ‘Local/7204@from-queue-00004f8d;2’
== Spawn extension (from-queue-exten-internal, 7238, 10) exited non-zero on ‘Local/7238@from-queue-00004f8e;2’
== Spawn extension (macro-hangupcall, s, 6) exited non-zero on ‘Local/7238@from-queue-00004f8e;2’ in macro ‘hangupcall’
== Spawn extension (from-queue-exten-internal, h, 1) exited non-zero on ‘Local/7238@from-queue-00004f8e;2’
== Spawn extension (from-queue-exten-internal, 7133, 10) exited non-zero on ‘Local/7133@from-queue-00004f8f;2’
== Spawn extension (macro-hangupcall, s, 6) exited non-zero on ‘Local/7133@from-queue-00004f8f;2’ in macro ‘hangupcall’
== Spawn extension (from-queue-exten-internal, h, 1) exited non-zero on ‘Local/7133@from-queue-00004f8f;2’
== Spawn extension (from-queue-exten-internal, 7204, 10) exited non-zero on ‘Local/7204@from-queue-00004f90;2’
== Spawn extension (from-queue-exten-internal, 7236, 10) exited non-zero on ‘Local/7236@from-queue-00004f91;2’
== Spawn extension (macro-hangupcall, s, 6) exited non-zero on ‘Local/7204@from-queue-00004f90;2’ in macro ‘hangupcall’
== Spawn extension (from-queue-exten-internal, h, 1) exited non-zero on ‘Local/7204@from-queue-00004f90;2’
== Spawn extension (from-queue-exten-internal, 7238, 10) exited non-zero on ‘Local/7238@from-queue-00004f92;2’
== Spawn extension (macro-hangupcall, s, 6) exited non-zero on ‘Local/7236@from-queue-00004f91;2’ in macro ‘hangupcall’
== Spawn extension (from-queue-exten-internal, h, 1) exited non-zero on ‘Local/7236@from-queue-00004f91;2’
== Spawn extension (from-queue-exten-internal, 7256, 10) exited non-zero on ‘Local/7256@from-queue-00004f93;2’
== Spawn extension (macro-hangupcall, s, 6) exited non-zero on ‘Local/7238@from-queue-00004f92;2’ in macro ‘hangupcall’
== Spawn extension (from-queue-exten-internal, h, 1) exited non-zero on ‘Local/7238@from-queue-00004f92;2’
== Spawn extension (from-queue-exten-internal, 7133, 10) exited non-zero on 'Local/7133@from-queue-000
indent preformatted text by 4 spaces

tonyclewis · August 17, 2018, 10:34pm

Stop FOP and other 3rd party add ok and see if that fixes your problem.

ordinatrum · August 19, 2018, 5:39pm

Me too! When this problem first occurred it caused new calls to return unavailable and no logfiles were updated when calls were received. Outgoing calls failed too (with no log messages).

I rebuilt the system from scratch and although I still get the warnings the system hasn’t failed since. I have just installed FreePBX on a ProLiant 12-core 32GB RAM that Is laying around unused so will see if the extra horsepower helps.

amcoit · August 19, 2018, 6:53pm

Jon
Do you happen to remember if you did any major changes before this started happening? Do you also use Asternic or fop2? My system ran fine for about a month before these problems started and though I did a few changes the days before, the problems happened 2 days after the latest change. I have now reverted all changes from the week prior to the problem starting and the problem continues.

amcoit · August 19, 2018, 6:54pm

Tony Lewis, we stopped the services but it failed again anyway. Any other things you can suggest we should try? Or what should we document when the problem arises so maybe the info can help debug it?

ordinatrum · August 21, 2018, 1:02pm

No Astemic or FOP2

It’s difficult to say what prompted the issue… what happened was…

My system is virtual, running on a VMware ESX installation. I had been gradually adding more “features” to the installation and, mid-week, decided to make a “bare-metal” backup. To do this, I shut the FreePBX guest down from the FreePBX system admin power options menu, then copied the VM folder on the host creating a duplicate copy. I then powered the original back on again, it started without an issue but the next day began to act very strangely. After a few queues had come into the queue (sequentially, we rarely get more than two simultaneous calls in the queue), new device registrations would fail, incoming calls didn’t ring and the FreePBX logs showed no activity.

I started to “unpick” my activities the previous week but was unable to correct the problem. I finally created a new VM with identical architecture, re-installed FreePBX, activated it, restored the config from the failing system, copied moh, voicemail and some other bits over… since then I have lots of taskprocessor warnings but no failures (although I am nervous).

So - a brief description, see what we have in common. Several queues. Each queue has another queue set not to answer as it’s destination. The second queue in line only has static agents. Two of these queues have 3 statics, the other two of these queues have about 12 statics. We are using PJSIP for device connections and CHANSIP for trunk connections. We use one trunk for all outgoing calls since out provider supports CLIP masquerading. There are another five trunks used for incoming calls.

We have multiple MOH groups, one for each queue. We have chosen to edit the MOH with Audacity to include “press * if you don’t want to wait” type announcements rather than letting the queue do the announcements. The MOH wavs are 8Khz and were uploaded to the moh folder with WinSCP and then the Settings\Music-on-hold module performed format changes creating alaw, sln, sln16, sln48 and g722 versions.

Each of the “front” queues (that receive inbound routes) have an IVR attached which allows the caller to leave the queue and go to voicemail.

All inbound and outbound routes have call recording set to Force and call recording format is wav (lowercase). The call recording location is as per default.

I really don’t know what to say - it would be a help if anyone could say WHAT these taskprocessor metrics are monitoring. If I restart Asterisk and make a test call, as soon as the handsets start ringing, the high water limit is hit and normally exceeded by 200-300%.

Oh, one other thing - the PBX isn’t local to the site - all the handsets (about 18) are registered through the responsive firewall. I have been wondering if these taskprocessor queues have something to do with BLF subscribers at the end of the remote links.

As I said, I have build FreePBX on a test server (a ProLiant G8, 32GB, 16C) which would normally be good for 20,000+ devices. I’ll try and switch over one day this week with the VM on hot-standby to see if the figures look better - or whether this is a network bottleneck etc.

ordinatrum · August 21, 2018, 1:09pm

Oh - something else that I wouldn’t class as “unusual” but maybe it is - neither PJSIP or CHANSIP are set to the default ports - we use UDP ports in the 3000-4000 range for SIP but keep to 10000-20000 for RTP.

Security by obscurity I know it’s hardly foolproof but it keeps some script kiddies at bay

ordinatrum · August 23, 2018, 2:49pm

I’m not sure if I’ve found something that may be causing the problem - but given that I suspect the large taskprocessor queues are due to notifying extensions that the call is ringing… Four of my extensions are showing over 100 watchers for their hint… I only have six handsets which support BLF subscriptions - the rest are all DECT handsets with no BLF functionality! Clearly the persistent information in astdb for hint/subscription watchers is in error… I will dig more tonight.

amcoit · August 23, 2018, 4:49pm

thats above my current knowledge. I will research a bit and then check my own install.
thank you.

amcoit · August 23, 2018, 7:59pm

Jon, we just found my system had hundreds of Conference Rooms that had been automatically created (not by us). Might be from an old bug, might be something else. Can you check your conference room menu and see if you have them too please?

ordinatrum · August 23, 2018, 8:32pm

I just had a look, no conference rooms on the system at all mate.

So from what I said earlier, I went to Reports, Info, Subscriptions - the “watchers” column was indicating an impossibly high number of watchers on several hints.

I deleted the persistent subscriptions by:-

Stop Asterisk (core stop gracefully)
SSH onto the server
cd /var/lib/asterisk
sqlite3 astdb.sqlite3

At the sqlite3 > prompt

delete from astdb where key like ‘/subscription_persistence/%’;
.quit

Restart Asterisk (fwconsole start)

The watchers count for the subscriptions is correct now. I must admit that with a few test calls, I’m not convinced it’s reduced the taskprocessors count but will see what happens tomorrow.

Regards, Jon

ordinatrum · August 23, 2018, 8:39pm

OK, you peaked my interest and I had a closer look at the astdb. Given there are no calls in progress, these are a surprise…

/RG/302/Local/302@from-queue-0000000c;2 : RINGING
/RG/302/Local/302@from-queue-0000001c;2 : RINGING
/RG/302/Local/302@from-queue-00000032;2 : RINGING
/RG/302/Local/302@from-queue-00000042;2 : RINGING
/RG/302/Local/302@from-queue-00000056;2 : RINGING
/RG/302/Local/302@from-queue-00000067;2 : RINGING
/RG/302/Local/302@from-queue-0000006b;2 : RINGING
/RG/302/Local/302@from-queue-0000006d;2 : RINGING
/RG/302/Local/302@from-queue-00000076;2 : RINGING
/RG/302/Local/302@from-queue-0000007c;2 : RINGING
/RG/302/Local/302@from-queue-00000088;2 : RINGING
/RG/302/Local/302@from-queue-0000008b;2 : RINGING
/RG/302/Local/302@from-queue-00000090;2 : RINGING
/RG/302/Local/302@from-queue-0000009f;2 : RINGING
/RG/302/Local/302@from-queue-000000a0;2 : RINGING
/RG/302/Local/302@from-queue-000000c0;2 : RINGING
/RG/302/Local/302@from-queue-000000d6;2 : RINGING
/RG/302/Local/302@from-queue-000000ea;2 : RINGING
/RG/302/Local/302@from-queue-00000111;2 : RINGING
/RG/302/Local/302@from-queue-00000117;2 : RINGING
/RG/302/Local/302@from-queue-0000011d;2 : RINGING
/RG/302/Local/302@from-queue-00000123;2 : RINGING
/RG/302/Local/302@from-queue-00000132;2 : RINGING
/RG/302/Local/302@from-queue-000001f0;2 : RINGING
/RG/302/Local/302@from-queue-00000200;2 : RINGING
/RG/302/Local/302@from-queue-00000208;2 : RINGING
/RG/302/Local/302@from-queue-0000021c;2 : RINGING
/RG/302/Local/302@from-queue-00000248;2 : RINGING
/RG/302/Local/302@from-queue-0000026c;2 : RINGING
/RG/302/Local/302@from-queue-0000031c;2 : RINGING
/RG/302/Local/302@from-queue-00000350;2 : RINGING
/RG/302/Local/302@from-queue-00000358;2 : RINGING
/RG/302/Local/302@from-queue-0000053c;2 : RINGING
/RG/302/Local/302@from-queue-0000054c;2 : RINGING
/RG/302/Local/302@from-queue-0000060e;2 : RINGING
/RG/302/Local/302@from-queue-0000062e;2 : RINGING
/RG/302/Local/302@from-queue-00000662;2 : RINGING
/RG/302/Local/302@from-queue-00000692;2 : RINGING
/RG/302/Local/302@from-queue-0000069e;2 : RINGING
/RG/302/Local/302@from-queue-000006de;2 : RINGING
/RG/302/Local/302@from-queue-0000071e;2 : RINGING
/RG/302/PJSIP/201-0000015d : RINGING
/RG/401/Local/401@from-queue-0000003f;2 : RINGING
/RG/401/Local/401@from-queue-00000045;2 : RINGING
/RG/401/Local/401@from-queue-000001d1;2 : RINGING
/RG/401/Local/401@from-queue-000001df;2 : RINGING
/RG/401/Local/401@from-queue-000001e5;2 : RINGING
/RG/401/Local/401@from-queue-000001eb;2 : RINGING
/RG/401/Local/401@from-queue-000001f1;2 : RINGING
/RG/401/Local/401@from-queue-000001f7;2 : RINGING
/RG/401/Local/401@from-queue-00000201;2 : RINGING
/RG/401/Local/401@from-queue-00000207;2 : RINGING
/RG/401/Local/401@from-queue-00000211;2 : RINGING
/RG/401/Local/401@from-queue-00000217;2 : RINGING
/RG/401/Local/401@from-queue-0000021d;2 : RINGING
/RG/401/Local/401@from-queue-00000223;2 : RINGING
/RG/401/Local/401@from-queue-0000022f;2 : RINGING
/RG/401/Local/401@from-queue-00000235;2 : RINGING
/RG/401/Local/401@from-queue-0000023b;2 : RINGING
/RG/401/PJSIP/401-000001ba : RINGING
/RG/401/PJSIP/401-00000339 : RINGING
/RG/401/PJSIP/401-00000348 : RINGING
/RINGGROUP/280/changecid : default
/RINGGROUP/280/fixedcid :
/RINGGROUP/285/changecid : default
/RINGGROUP/285/fixedcid :
/RINGGROUP/380/changecid : default
/RINGGROUP/380/fixedcid

ordinatrum · August 23, 2018, 8:45pm

So, deleted these entries with the above process but the following sqllite commands

delete from astdb where key like ‘/RG/%’;
delete from astdb where key like ‘/RINGGROUP/%’;
.quit

astDB looks much cleaner now - but not sure the taskprocessor overcommitment has gone away…

FRACK and other serious issues (maybe mixmonitor? maybe tasksprocessor?)

PS.- problem just happened even though the call recordings are off, so maybe those 44 byte files are unrelated

subm:ast_channel_topic_all-cached-0000057b subm:ast_channel_topic_all-00000583 subp:ast_channel_topic_all-00000895 subm:cel_aggregation_topic-00000006 this following one happens most often subm:manager_topic-00000007 5818352 1650025 1663207 2700 3000

subm:ast_channel_topic_all-cached-0000057b
subm:ast_channel_topic_all-00000583
subp:ast_channel_topic_all-00000895
subm:cel_aggregation_topic-00000006
this following one happens most often
subm:manager_topic-00000007 5818352 1650025 1663207 2700 3000