50 calls per second crash astdb with dialparties.agi

Hi,
I have installed FreePBX (with elastix-1.5.2-stable) on an HP server with a quad-core 2 GHz Xeon.
At first everything was OK, but I want to test the performance of Asterisk with sipp.
sipp (http://sipp.sourceforge.net) is an automatic SIP traffic generator.
To simulate the real world, I added 200 extensions with “PBX > Extensions Batch”:

—start-extbatch.txt—
"Display Name","User Extension","Direct DID","Outbound CID","Call Waiting","Secret","Voicemail Status","Voicemail Password","VM Email Address","VM Pager Email Address","VM Options","VM Email Attachment","VM Play CID","VM Play Envelope","VM Delete Vmail","Context"
100,100,"","","ENABLED","","disable","","","","","no","no","no","no","from-internal"
101,101,"","","ENABLED","","disable","","","","","no","no","no","no","from-internal"
…snip…
299,299,"","","ENABLED","","disable","","","","","no","no","no","no","from-internal"
—endof-extbatch.txt—

Now I have 200 extensions for the test: 100–299.
I do not want to write auth processing in my sipp scenario file,
so the “Secret” field in the batch file is empty.
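
For anyone repeating this setup, the batch file above can be generated instead of typed by hand. This is only a minimal Python sketch: the column layout is copied from the excerpt above, while the function names and the use of plain ASCII double quotes are assumptions for illustration.

```python
import csv
import io

# Column names copied from the header line of the batch file shown above.
HEADER = [
    "Display Name", "User Extension", "Direct DID", "Outbound CID",
    "Call Waiting", "Secret", "Voicemail Status", "Voicemail Password",
    "VM Email Address", "VM Pager Email Address", "VM Options",
    "VM Email Attachment", "VM Play CID", "VM Play Envelope",
    "VM Delete Vmail", "Context",
]


def batch_rows(first=100, last=299):
    """Yield one row per extension, with an empty Secret and voicemail off."""
    for ext in range(first, last + 1):
        yield [ext, ext, "", "", "ENABLED", "", "disable",
               "", "", "", "", "no", "no", "no", "no", "from-internal"]


def build_batch(first=100, last=299):
    """Return the whole batch file as text (header plus one line per extension)."""
    buf = io.StringIO()
    # QUOTE_NONNUMERIC leaves the integer extension numbers bare and
    # quotes every string field, including the empty ones.
    writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(batch_rows(first, last))
    return buf.getvalue()


batch_text = build_batch()
```

Writing `batch_text` to a file such as extbatch.txt gives something that can be uploaded through the batch page.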

When I use sipp I must set a static “host” and “port” for every peer,
so I ran some MySQL commands to set “host” and “port” for these extensions.
For simplicity I also disabled “qualify”, so that no OPTIONS requests are sent to sipp.

$ mysql -u root -peLaStIx.2oo7

use asterisk;
update sip set data='20.0.6.103' where keyword='host' and id>=100 and id<200;
update sip set data='20.0.6.103' where keyword='host' and id>=200 and id<300;
update sip set data='5071' where keyword='port' and id>=100 and id<200;
update sip set data='5072' where keyword='port' and id>=200 and id<300;
update sip set data='no' where keyword='qualify' and id>=100 and id<300;

Now I want 100 extensions to call the other 100 extensions, something like:

100 --> 200
101 --> 201
... 
199 --> 299

When the call is established, the calling (UAC) extension sends 20 seconds of
pcap audio, and the called (UAS) side echoes it back (with -rtp_echo) to the calling party.

First, in one bash window, start the sipp UAS script:

$ sipp -i 20.0.6.103 -p 5072 -sf s.xml -aa -rtp_echo

Then, in another bash window, start the sipp UAC script (20.0.6.121 is Asterisk):

$ sipp 20.0.6.121 -i 20.0.6.103 -p 5071 -aa -sf c.xml -inf c.csv -l 100 -r 50 -rp 1000

—start-c.csv—
SEQUENTIAL
100;200
101;201
…snip…
199;299
—endof-c.csv—
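
The injection file above pairs each caller with the extension 100 higher. A minimal Python sketch for generating it, assuming the same "caller;callee" layout and the SEQUENTIAL first line shown above (the function name and parameters are hypothetical):

```python
def build_inf(first_caller=100, count=100, offset=100):
    """Return the text of a sipp injection file: caller;callee per line."""
    lines = ["SEQUENTIAL"]  # first line as in the c.csv shown above
    for caller in range(first_caller, first_caller + count):
        lines.append(f"{caller};{caller + offset}")
    return "\n".join(lines) + "\n"


inf_text = build_inf()
```

With the defaults this yields the 100 pairs from 100;200 to 199;299.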

After a few minutes I found that the number of incomplete calls had risen,
and I saw some verbose messages in the Asterisk CLI like:

When I looked at the code of /var/lib/asterisk/agi-bin/dialparties.agi, I found
the failure is at the line:

So I knew there must be some error in my astdb file, and I dumped it:

$ db_dump185 -p /var/lib/asterisk/astdb

and I found that the line for extension “101” looks like:


/AMPUSER/100/device\00
100\00
/AMP\00SER/101/device\00
101\00

You can see the key is broken: “/AMP\00SER”, the “U” has been replaced!
I also found other similar errors:


/AMPUSER/137/device\00
137\00
/AMPUSER/1\008/device\00
138\00
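
Scanning a large dump by eye is error-prone, so a small script can flag the damaged entries. This is a rough Python sketch that assumes the dump looks like the excerpts above, that is, one key or value per line, each ending in a single \00 terminator; any \00 earlier in the line is treated as corruption. The function name is made up for illustration.

```python
def find_corrupt_entries(dump_lines):
    """Return (line_number, text) for lines with a \\00 before the terminator."""
    bad = []
    for lineno, raw in enumerate(dump_lines, start=1):
        text = raw.strip()
        if not text:
            continue
        # Drop the expected trailing terminator (the 3 characters \00), if present.
        body = text[:-3] if text.endswith("\\00") else text
        if "\\00" in body:
            bad.append((lineno, text))
    return bad


# Sample taken from the dump excerpts in this post.
sample = [
    "/AMPUSER/100/device\\00",
    "100\\00",
    "/AMP\\00SER/101/device\\00",
    "101\\00",
    "/AMPUSER/1\\008/device\\00",
    "138\\00",
]
for lineno, text in find_corrupt_entries(sample):
    print(lineno, text)
```

On a real system the lines would come from the saved output of db_dump185 -p rather than from a hard-coded list.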

So astdb has crashed, and calls to “101” or “138” will fail.
I googled the problem but could not find anything about an astdb crash!
I think there must be some astdb write error when concurrent calls are happening,
so I looked at the code in asterisk-src/main/db.c.
I found that every db operation calls ast_mutex_lock(&dblock);
and dblock is defined with something like:

PTHREAD_RECURSIVE_MUTEX_INITIALIZE_NP
PTHREAD_MUTEX_RECURSIVE_NP

So Asterisk uses a mutex of RECURSIVE type. Is this the source of my problem?
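
As background for that question: a recursive mutex normally only lets the thread that already holds it lock it again; other threads are still kept out. The Python sketch below shows that behaviour with threading.RLock, which is used here as an analogy for a pthread recursive mutex and is not Asterisk's actual code. The helper name is hypothetical.

```python
import threading

lock = threading.RLock()  # re-entrant lock, standing in for a recursive mutex


def other_thread_can_acquire(lk):
    """Try a non-blocking acquire from a second thread and report the result."""
    result = []

    def attempt():
        got = lk.acquire(blocking=False)
        result.append(got)
        if got:
            lk.release()

    worker = threading.Thread(target=attempt)
    worker.start()
    worker.join()
    return result[0]


with lock:
    with lock:  # the owning thread may lock again without deadlocking
        blocked_while_held = not other_thread_can_acquire(lock)

free_after_release = other_thread_can_acquire(lock)
print(blocked_while_held, free_after_release)
```

If the same holds for dblock, the recursive type alone would not let two threads write to astdb at the same time.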

So, can anybody help me with this problem? Thanks.

It is almost certainly your “PBX > Extensions Batch” that is to blame.

Do it all over again, but this time, after you have created all your extensions, dump the database.

The part of the database that contains AMPUSER is created when you create an extension and should look like this (showing extension/user 7197):

database show AMPUSER
/AMPUSER/7197/outboundcid                         : Some User <1234567890>
/AMPUSER/7197/password                            : 7197
/AMPUSER/7197/recording                           : out=Adhoc|in=Adhoc
/AMPUSER/7197/ringtimer                           : 0
/AMPUSER/7197/voicemail                           : novm

If that is OK and all of your other AMPUSER and DEVICE entries are OK, then do the test again.

The AMPUSER and DEVICE entries are read-only attributes used by dialparties.agi; if those are corrupt before your test, then almost anything can happen.

First off, 50 cps is an extreme load. If you need to handle that many calls, you are likely not using the proper platform.

As far as the corrupted AMPUSER records go, as mentioned, your described scenario should be read only. It’s only when you are in deviceanduser mode, logging into and out of phones, that the dialplan would modify anything in that structure. Also - FreePBX goes through either the manager OR AGI to modify anything in astdb (it’s the only way to do it). So all writes go through Asterisk. And as you pointed out, access is locked…

It would appear that something was already corrupt, so it is well worth re-doing your test as suggested above.

First, I redid the extensions batch load after I cleared astdb,
but the crash happened again after 1~2 hours.
But when I configured a simple dialplan manually without any astdb R/W,
the server was OK and calls completed fine.

So I think the bottleneck must be the AGI/manager stuff.
Can anyone tell me the benchmark of FreePBX on a specific hardware
configuration?

There are some performance test reports:

http://www.transnexus.com/White%20Papers/Performance_Test_of_Asterisk_v1-4.htm

100~200 calls per second runs smoothly on a quad-core Xeon CPU.

thanks.

Well, again, what is your goal here? 50 cps, let alone 100-200 cps, is an extreme load. If you have an application that needs that, you probably need a very tuned dialplan for your application vs. FreePBX.

Second, you say it crashes; are you seeing the same corruption? If so, that is very odd because, again, nothing should be written. You can try instrumenting the database_put() commands in the phpagi and phpagi-manager libraries to log any calls that do happen, to see if something unknown is lurking. But as mentioned, it should all be read only for what you are describing.