FRACK and other serious issues (maybe MixMonitor? maybe taskprocessors?)

Incidentally, if you don't want to dive right into running sqlite3, go to the Asterisk CLI and run "database show" (without the quotes).
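From the Linux shell you can do the same thing without touching sqlite3 at all (a rough sketch; the grep pattern is just an example of counting one family of keys):

#!/bin/sh
# Dump astdb through Asterisk itself instead of opening the sqlite3 file
asterisk -rx "database show" | tail -n 1        # the last line reports the number of entries
asterisk -rx "database show" | grep -c "^/RG/"  # count entries in one key family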

I just checked the subscriptions and there are 484 hints, and in the watchers column they are either 0 or 1, nothing is over that. But the system is pretty much at a standstill now and is using a recreated astdb, so I don't know if the problem is still there or has gone away.

A (hard-earned) word of caution to anyone following this thread:

NEVER, ever, use the sqlite3 interface while Asterisk is running. The code is thread-safe for Asterisk but in no way multi-user safe; you will, sooner or later, set up a lock on the database that is VERY hard to resolve. Asterisk has its built-in

database query 
Usage: database query "<SQL Statement>"
       Run a user-specified SQL query on the database. Be careful.

so if you want to mess with the database while Asterisk is running, wrap it up that way while "being careful" (otherwise you will all learn what VACUUM does, and when it can be run safely while Asterisk is running: NEVER) :slight_smile:
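For example, from the shell (a sketch of the same idea; substitute whatever key pattern you actually need):

#!/bin/sh
# Let Asterisk run the SQL itself so all astdb locking stays inside Asterisk
asterisk -rx "database query \"DELETE FROM astdb WHERE key LIKE '/RG/%'\""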

The underlying problem is still open: what is generating those spurious taskprocessors? Until it is found and fixed, then . . .


Hey Dicko - GREAT POINT. I just found out myself (the "hard" way) that this is a big issue… I don't mean to hijack the thread, but maybe some of these leaking astdb fields are a problem? We get thousands of these, so I had set up a cronjob like this:
#!/bin/sh
SQLITE3=/usr/bin/sqlite3
$SQLITE3 /var/lib/asterisk/astdb.sqlite3 "DELETE FROM astdb WHERE key LIKE '/RG/%';"
$SQLITE3 /var/lib/asterisk/astdb.sqlite3 "DELETE FROM astdb WHERE key LIKE '/BLKVM/%';"
$SQLITE3 /var/lib/asterisk/astdb.sqlite3 "VACUUM;"

BUT, I found out that sometimes this causes Asterisk to lose access to its database, and it does so in a way that does not even clearly log that there is an issue. When it happens, the "fix" is to restart Asterisk.

So, given that the sqlite3 database does need to be maintained, and that Asterisk bloats it up with various status entries, is there a recommended way to handle this? I guess my delete statements could be moved to "database query" calls… but what about VACUUM? I guess we could periodically shut down, vacuum, and restart, but we've never had to do that before - I don't really want to start.
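If we did go that route, I imagine it would look something like this (just a sketch, assuming a FreePBX box where fwconsole stops and starts Asterisk, and the default astdb path):

#!/bin/sh
# Sketch only: VACUUM astdb while Asterisk is fully stopped, never while it runs
fwconsole stop
/usr/bin/sqlite3 /var/lib/asterisk/astdb.sqlite3 "VACUUM;"
fwconsole start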

Any thoughts? If I should restart this as a separate thread please let me know.

#!/bin/sh
ASTERISK=$(which asterisk)
$ASTERISK -rx "database query \"DELETE FROM astdb WHERE key LIKE '/RG/%'\""
$ASTERISK -rx "database query \"DELETE FROM astdb WHERE key LIKE '/BLKVM/%'\""
#DICKO SAYS NO! $ASTERISK -rx "database query \"VACUUM\""

Thanks,

Mitch

No thoughts on this or the FRACK issue. Out of the literally hundreds of machines, large and small, that I have built, I have never experienced that problem… Some of those machines have hundreds of thousands of rows in astdb; it's never been a problem.

Nonetheless dicko, there are at least three users on the FreePBX forums, all with the same issue, in the last week… I realise this is an Asterisk problem, not a FreePBX one - but it seems very real to me.

Incidentally, this is a clean install on a physical (not virtualised) ProLiant G8, 2x6-core, 32 GB - and when I say clean, I used Bulk Handler to create extensions etc. and configured everything else manually; no backup was restored. On the first test call, which terminated at a queue, core show taskprocessors reports (columns: Processor, Processed, In Queue, Max Depth, Low water, High water):

subm:ast_bridge_topic_all-cached-0000007d 1 0 1 450 500
subm:ast_channel_topic_all-00000081 6501 0 139 450 500
subm:ast_channel_topic_all-00000087 6500 0 89 450 500
subm:ast_channel_topic_all-cached-0000007e 8459 0 235 450 500
subm:ast_channel_topic_all-cached-0000007f 8458 0 674 450 500
subm:ast_channel_topic_all-cached-00000080 8457 0 86 450 500
subm:ast_device_state_topic-00000002 454 0 6 450 500
subm:ast_device_state_topic-00000004 453 0 18 450 500
subm:ast_device_state_topic-00000086 256 0 12 450 500
subm:ast_parking-0000005e 1 0 1 450 500
subm:ast_presence_state_topic_all-00000005 15 0 9 450 500
subm:ast_security-00000084 1296 0 2 450 500
subm:ast_system-00000071 172 0 4 450 500
subm:ast_system-00000072 171 0 3 450 500
subm:ast_system-0000007c 136 0 1 450 500
subm:ast_system-00000088 129 0 1 450 500
subm:cdr_engine-00000003 8530 0 344 4500 5000
subm:cel_aggregation_topic-00000006 8462 0 5052 2700 3000
subm:endpoint_topic_all-cached-00000008 2172 0 37 450 500
subm:endpoint_topic_all-cached-00000082 2055 0 64 450 500
subm:manager_topic-00000007 10646 0 4487 2700 3000

Both the subm:ast_channel_topic_all-cached-0000007f and subm:manager_topic-00000007 look poor - as does the subm:cel_aggregation_topic-00000006
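(For anyone else checking their own box, this one-liner is roughly how I'm picking them out - a sketch, assuming the standard six-column output of core show taskprocessors:)

#!/bin/sh
# Show taskprocessors whose Max Depth (column 4) has exceeded their
# high-water mark (column 6); header and summary lines are skipped
asterisk -rx 'core show taskprocessors' | awk '$2 ~ /^[0-9]+$/ && ($4+0) > ($6+0)'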

I think sooner or later most users with queues will start reporting issues. My installation is tiny compared to most; there is no way this hardware should be showing warnings that the system is overstressed…

While you have a legitimate issue with taskprocessors on your system, I don't think any of the issues are related to astdb.


I concur with @tm1000. Yes, there are generated rows in astdb; you can set the high-water mark higher, or you can try chan_sip, which will likely alleviate the problem, but I would take these problems to Asterisk/soon-to-be Sangoma :slight_smile:

I can cause subm:manager_topic-00000007 to throw an error by simply calling a queue with 13 members out of 20 phones with BLFs watching one another. It doesn't take much to overload it.

subm:manager_topic-00000007 does not improve with chan_sip. Converted from chan_sip to PJSIP recently and that did not fix my issue with it.

I fail to see what manager_topic has to do with astdb

Edit: Nothing, these frack issues aren’t related to Astdb

Hello

One of our engineers (@sgriepentrog) has, I believe, figured out the cause of this issue: [FREEPBX-18078] Unnecessary AMI sockets may cause Asterisk taskprocessor backup - Sangoma Issue Tracker

As suspected, it is not astdb related. The key point is "subm:manager_topic-00000007", which is the manager queue. See the ticket for more information.

From @sgriepentrog

[2:04 PM] Scott Griepentrog: The problem is almost certainly that we have an open astman socket that isn’t being read
[2:04 PM] Scott Griepentrog: that causes the taskprocessor queue backup
[2:05 PM] Scott Griepentrog: This will tell you what process has the queue backed up:
netstat -anp |grep "127.*127.0.0.1:5038"
[2:05 PM] Scott Griepentrog: Then look for a non-zero value in the first column (recv-q)
[2:05 PM] Scott Griepentrog: That process is the culprit for the bug – kill it and everything goes back to normal.
[2:06 PM] Scott Griepentrog: Then figure out how to keep that process from leaving a connection to 5038 (AMI) open.
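(A one-line sketch of the same check, assuming Linux net-tools netstat and the default AMI port 5038:)

#!/bin/sh
# List AMI client connections whose Recv-Q (column 2) is non-zero, i.e. a
# process that opened a manager socket but is not reading what Asterisk sends
netstat -anp 2>/dev/null | awk '$5 == "127.0.0.1:5038" && ($2+0) > 0'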


Thank you for the feedback Andrew.

I think we were just clutching at straws - and in my case, seeing subscriptions from "ghost" devices, I suspected that Asterisk was attempting to send updates to non-existent subscribers… obviously this wasn't the case.

I’m still not clear what is causing the queues to build up on my installations - given that everything seems pretty standard.

I’ll monitor that ticket with interest… thanks again

Can you run the commands Andrew asked for and post the output in the ticket?

Sorry Tony, I hadn't done so since, although my stats are constantly over the high-water level, they aren't impacting calls at this time. Anyway, the output from netstat (columns: Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, PID/Program name) is:

tcp 0 0 127.0.0.1:38078 127.0.0.1:5038 TIME_WAIT -
tcp 0 0 127.0.0.1:39098 127.0.0.1:5038 ESTABLISHED 17008/php
tcp 0 0 127.0.0.1:38070 127.0.0.1:5038 TIME_WAIT -
tcp 0 0 127.0.0.1:39094 127.0.0.1:5038 ESTABLISHED 17008/php
tcp 0 0 127.0.0.1:38062 127.0.0.1:5038 TIME_WAIT -
tcp 0 0 127.0.0.1:38054 127.0.0.1:5038 TIME_WAIT -
tcp 0 0 127.0.0.1:35790 127.0.0.1:5038 ESTABLISHED 6335/voipfirewalld
tcp 0 0 127.0.0.1:38066 127.0.0.1:5038 TIME_WAIT -
tcp 0 0 127.0.0.1:38086 127.0.0.1:5038 TIME_WAIT -
tcp 0 0 127.0.0.1:35624 127.0.0.1:5038 ESTABLISHED 3180/node /var/www/
tcp 0 0 127.0.0.1:37666 127.0.0.1:5038 ESTABLISHED 6208/php
tcp 0 0 127.0.0.1:38074 127.0.0.1:5038 TIME_WAIT -
tcp 0 0 127.0.0.1:38082 127.0.0.1:5038 TIME_WAIT -
tcp 0 0 127.0.0.1:38058 127.0.0.1:5038 TIME_WAIT -

I will run netstat every so often and see if the figure is different. Do any standard modules make use of the AMI interface?

I'm not sure it's worth posting that output to the ticket, is it, since none of the Recv-Q values are non-zero?

Given that this issue makes our PBX pretty much useless (we have to reboot every 15 minutes) and the bug was assigned as "minor", we have to find a way to avoid this issue from our side. Can somebody recommend things for us to disable/remove/change to avoid this problem? We have a powerful machine with about 30 simultaneous calls, and it doesn't seem like a lot of people have this issue, so what can we do differently?
All our extensions and a few of the trunks are on PJSIP. Would changing everything to chan_sip make a difference?
thank you

What bug are you talking about? All tickets in our system start off as "minor". We do not change the priority of tickets at all, and it is not something we can remove from our issue tracker, so please do not assume that we 'assigned' anything as minor. That is the default, and we rarely change it.

If you are referencing [FREEPBX-18078] Unnecessary AMI sockets may cause Asterisk taskprocessor backup - Sangoma Issue Tracker read the last response and you will see that this is not related to your issues, nor is that really a bug. We spent half a day trying to create a POC and were unable to do so.

Again, for clarification purposes, since you did not go and read the ticket: @sgriepentrog and I went through and proved that FreePBX does not ask for events. It is therefore not getting events, and therefore nothing is blocking at that level.

After investigating the AMI library, it appears there is a disconnect somewhere in the AMI protocol handling. The php-asmanager AMI library does disable events while logging in to the AMI connection, using an undocumented Events parameter. Provided the value of the Events: parameter is "off", no events should arrive on that channel.
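As a concrete illustration (my own sketch, with placeholder credentials), the login that disables events looks roughly like this, and can be reproduced by hand with netcat:

#!/bin/sh
# Log in to AMI with events disabled, then log off (placeholder credentials).
# Some netcat variants need -q 1 so the reply is printed before nc exits.
printf 'Action: Login\r\nUsername: admin\r\nSecret: mysecret\r\nEvents: off\r\n\r\nAction: Logoff\r\n\r\n' \
  | nc 127.0.0.1 5038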

I believe Scott provided a very detailed and thorough explanation, and I encourage you to read it.

I think you need to take a long look back through this thread and read. Several tips have been provided.

Mainly

[2:05 PM] Scott Griepentrog: This will tell you what process has the queue backed up:
netstat -anp |grep "127.*127.0.0.1:5038"

Hi Andrew,

What bug are you talking about? All tickets in our system start off as "minor". We do not change the priority of tickets at all, and it is not something we can remove from our issue tracker, so please do not assume that we 'assigned' anything as minor. That is the default, and we rarely change it.

I don't think I made any assumptions. Few people have reported this bug, which I mentioned, and the bug was indeed marked as Minor. These two facts seem to make sense together. I am not blaming you or attacking you; I just made an observation. I do not know what the rules are for marking bugs serious or minor; I just mentioned what I saw.

If you are referencing [FREEPBX-18078] Unnecessary AMI sockets may cause Asterisk taskprocessor backup - Sangoma Issue Tracker read the last response and you will see that this is not related to your issues, nor is that really a bug. We spent half a day trying to create a POC and were unable to do so.

Yes, this is the bug I am referring to. You say it's not a bug. I don't know what you would call it, but I've already tried the latest Sangoma distro on a virtual machine and a real machine, and both give me the same problem. If there are over 30 simultaneous calls/calls on hold in the queue, then extensions start getting unregistered, cannot register again, and new calls are rejected. This does not fix itself until a reboot. If this isn't a bug, great, it should be fixable with a config change, which is what I was hoping somebody would help me with, because I haven't been able to solve this.

Again, for clarification purposes, since you did not go and read the ticket: @sgriepentrog and I went through and proved that FreePBX does not ask for events. It is therefore not getting events, and therefore nothing is blocking at that level.

I did read the ticket. Several times. Unfortunately, I must admit, I didn't understand most of it. The technical explanation didn't really give me anything actionable to do other than a command to run when the system is down. Unfortunately, since this was a live system, we had to migrate back to our legacy PBX in a Flash version (several years old, without updates applied) where this problem doesn't exist. We returned to the old system before that ticket was opened and before that command was suggested. I can't reproduce the bug on the problem system unless I have "high" load, and I am currently not allowed to return to the Sangoma distro system since it caused so much trouble. I do not know how to simulate high load.

After investigating the AMI library, it appears there is a disconnect somewhere in the AMI protocol handling. The php-asmanager AMI library does disable events while logging in to the AMI connection, using an undocumented Events parameter. Provided the value of the Events: parameter is "off", no events should arrive on that channel.

I do not understand this, unfortunately.

I believe Scott provided a very detailed and thorough explanation, and I encourage you to read it.

Yes, he did provide one, and I did read it, but it's very technical and I wasn't able to get a solution out of it, which is why I asked for help.

I think you need to take a long look back through this thread and read. Several tips have been provided.

Mainly

[2:05 PM] Scott Griepentrog: This will tell you what process has the queue backed up:
netstat -anp |grep "127.*127.0.0.1:5038"

If I can find a way to simulate a high call volume on the system and replicate the failure, I will try this command.
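One thing I may try for generating fake load is dropping Asterisk call files into the outgoing spool (a sketch only; 600 is a made-up queue extension, so adjust the context, extension, and count for your own dialplan):

#!/bin/sh
# Sketch: originate 30 local calls into a queue using Asterisk call files
for i in $(seq 1 30); do
  cat > /tmp/loadtest$i.call <<EOF
Channel: Local/600@from-internal
Application: Playback
Data: demo-congrats
EOF
  chown asterisk:asterisk /tmp/loadtest$i.call
  mv /tmp/loadtest$i.call /var/spool/asterisk/outgoing/
done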
As far as the rest of the system goes, it's a Sangoma distro. The only non-Sangoma things installed were Asternic and FOP2, and both services were stopped when things went crazy; stopping those services did not prevent the problem from happening. All our extensions are on PJSIP; some trunks are PJSIP, others are chan_sip. We use a few of your commercial modules (webcallback, Class of Service, Queue Pro, SysAdmin Pro, maybe others, I don't remember). I've also tried disabling modules I am sure we don't use, but that didn't help either. Other than that, I don't know what else might be relevant.

thank you.

The ticket, at this moment in time, is not a bug. The issue it is reporting is not an issue; it does not happen. It also has nothing to do with your present issue, the FRACK errors. Nor can we replicate your FRACK issues on machines larger than yours. In essence, we are unable to get FRACK errors no matter what we try.

I assumed that the bug report was related to your issue, until we actually tried to fix it, only to discover it had already been fixed. This is why the ticket is minor. When you open a bug report the ticket is set to minor, and staff don't change the priority of tickets, so they most often stay minor. The priority of a ticket is rather irrelevant.

This reminds me of the Yealink PJSIP issue from two years ago. Sangoma staff have spent time trying to figure this one out. We don't get FRACK errors. We never have. We also don't have taskprocessor errors. But again, we have tried to replicate it so we would be able to fix something for you. As it stands, there's nothing for us to fix because we can't replicate it.

I saw the FRACK error twice. I don't know if the system recovers on its own from a FRACK; maybe more were logged but I didn't see them. That I don't know. The system did go into the mode where it rejects all new activity and unregisters everything every 15 minutes, continuously, for several days, until we moved to another machine. Actually, it wasn't every 15 minutes; it was whenever the combined load from active calls and calls on hold in the queue reached about 30.
The machine is still live, and since we expect lower call volume for a few weeks, we migrated back to it. The load has been low and the machine has worked fine all day. If you guys want to see a log or something, let me know.
thank you.

Are these the errors you got? We just got pages of them when we tried to switch Asterisk versions to 15. Asterisk would live for a little while and then die upon activity from a certain client on it.

[2018-09-13 22:57:33] ERROR[12493] astobj2.c: FRACK!, Failed assertion user_data is NULL (0)
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: Got 29 backtrace records
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #0: [0x5fb2c3] asterisk __ast_assert_failed() (0x5fb23f+84)
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #1: [0x45c174] asterisk ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #2: [0x45c83d] asterisk ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #3: [0x45cc96] asterisk __ao2_ref() (0x45cc68+2E)
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #4: [0x45df2c] asterisk ao2_iterator_init() (0x45dee2+4A)
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #5: [0x7f98c6de9882] app_queue.so ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #6: [0x56fe09] asterisk ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #7: [0x57159c] asterisk ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #8: [0x45c9b3] asterisk ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #9: [0x45cc96] asterisk __ao2_ref() (0x45cc68+2E)
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #10: [0x5717b6] asterisk ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #11: [0x57482c] asterisk ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #12: [0x575871] asterisk ast_context_remove_extension_callerid2() (0x575118+759)
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #13: [0x57d5bd] asterisk __ast_context_destroy() (0x57cfa1+61C)
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #14: [0x57da4e] asterisk ast_context_destroy() (0x57da1b+33)
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #15: [0x7f997e8851f9] pbx_config.so ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #16: [0x533abb] asterisk ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #17: [0x456305] asterisk ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #18: [0x455ef1] asterisk ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #19: [0x457074] asterisk ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #20: [0x4d66c5] asterisk ast_cli_command_full() (0x4d6454+271)
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #21: [0x4d6808] asterisk ast_cli_command_multiple_full() (0x4d676c+9C)
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #22: [0x454c6a] asterisk ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #23: [0x5f8276] asterisk ()
[2018-09-13 22:57:33] ERROR[12493] astobj2.c: FRACK!, Failed assertion user_data is NULL (0)
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: Got 29 backtrace records
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #0: [0x5fb2c3] asterisk __ast_assert_failed() (0x5fb23f+84)
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #1: [0x45c174] asterisk ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #2: [0x45c206] asterisk ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #3: [0x45e11c] asterisk ()
[2018-09-13 22:57:33] VERBOSE[12493] logger.c: #4: [0x45e3f9] asterisk __ao2_iterator_next() (0x45e3cc+2D)

The server is still “running” but was abandoned as we moved these clients to another server.

(excerpt from log file full-20180914)