Queues - dialparties.agi & high CPU load

pscs · July 24, 2009, 9:17pm

Actually after my last post, I think the bug IS in FreePBX, and I’ve reported that here:

http://www.freepbx.org/trac/ticket/3792#preview

along with the fix that is needed in dialparties.agi

a simple one line fix that definitely works is to change

$astman = new AGI_AsteriskManager( );

to

$astman = new AGI_AsteriskManager( );
$astman->pagi = & $AGI;

You may be able to change the line to

$astman = $AGI->new_AsteriskManager();

instead. That does seem to work for me, but it does change some other things so I’m not as confident about it

plindheimer · July 24, 2009, 9:47pm

That’s interesting. What version of Asterisk are you running because there may be an even better fix, which is no connection to the Manager. (Assuming this is a fix to the issues in this thread).

In core module version 2.6, I had made the following change to dialparties.agi:
[php]

<?php $ast_ge_16 = version_compare($ast_version, "1.6", "ge"); $has_extension_state = (get_var( $AGI, "HAS_EXTENSION_STATE" ) || $ast_ge_16); // If we are 1.6 then we have the EXTENSION_STATE() function and don't need to use the manager // if (!$has_extension_state) { $astman = new AGI_AsteriskManager( ); if (!$astman->connect("127.0.0.1", $ampmgruser , $ampmgrpass)) { exit (1); } } [/php] I have back ported the EXTENSION_STATE() function to 1.4 as it was quite trivial, so if you do the same, you could use this on 1.4, thus the other checks. The crux is, dialparties.agi does not use a manager connection any longer if either of the conditions are met, which is an even better solution. Assuming of course, that this is really related to the problems reported here.

pscs · July 25, 2009, 10:15am

I’m using Asterisk 1.6.0.9

The (current?) FreePBX 2.5.1 download doesn’t have code as you describe, and the ‘is_ext_avail’ function uses the manager connection, so that’s what I’ve been working with.

Using the original dialparties.agi, an incoming call to a single extension would have a second or two delay before the internal extension started ringing, even though the outside line would give the ring tone immediately. With my suggested fix, the inside line starts ringing almost immediately. Queues are similar, with them ringing immediately, and no excessive CPU usage, delays or failure.

I don’t know if you changed the code to what you described to try and fix this delay problem (because it would), but using the AGI->new_AsteriskManager() function instead of ‘new AGI_AsteriskManager()’ does fix it, and seems a simpler change, and it will work with Asterisk 1.4 as well

I’m not sure why the PHP ‘error_log’ function is so slow, but if that is different speeds on different computers (eg because of different PHP configurations), that would explain why some people have problems, and other people can’t reproduce it.

I’m also not sure why phpagi uses error_log if the AGI_AsteriskManager isn’t linked to an AGI instance, so maybe that should be reported as a bug in phpagi as well.

plindheimer · July 25, 2009, 2:15pm

pscs,

your observation will be looked at, it is interesting and I was not familiar with creating a manager session like you describe, so don’t know what, if any, implications it has.

The change I described about was put into 2.6 simply because we are usually very conservative in introducing anything into such ‘core’ components of the system, such as dialparties, on released versions. The change was put because we have been wanting to remove the need to hit the manager from dialparties since ‘the beginning of time’ but until the new function, which is 1.6 only, it was the only way to do that.

It would be really nice if Asterisk would simply backport that officially to 1.4. But they almost never add new features to a new release, which I understand and agree with, though one could almost argue it is a ‘bug’ not to have and extension state function at the dialplan level when the have a device state and have had extension state available through the AGI since 1.2 (and maybe even prior to 1.2?).

Lastly, whether a bug in phpagi or FreePBX, last I checked phpagi is not maintained and we have been maintaining it within our project (it has deviated quite a bit since the one on sourcefourge). We have already fixed quite a few bugs. So … the buck stops here either way

plindheimer · July 25, 2009, 3:40pm

Let’s examine this, I’m going to ‘think outloud’ here to see if we can get to the bottom of this. Here are the only differences, to the best that I can tell (and I have only had 1 coffee so …).

Firs, this is influenced by my current phpagi.conf file which includes:

[festival]
text2wave=/usr/src/festival/bin/text2wave
tempdir=/var/lib/asterisk/sounds/tmp/
[asmanager]
server=localhost
port=5038
username=admin
secret=amp111

The config setting of the astman handle as is today are:

ASM config: Array
(
    [festival] => Array
        (
            [text2wave] => /usr/src/festival/bin/text2wave
            [tempdir] => /var/lib/asterisk/sounds/tmp/
        )

    [asmanager] => Array
        (
            [server] => localhost
            [port] => 5038
            [username] => admin
            [secret] => amp111
        )

)

Then with your change of:
[php]

<?php $astman = $AGI->new_AsteriskManager(); [/php] [code] ASM config: Array ( [festival] => Array ( [text2wave] => /usr/src/festival/bin/text2wave [tempdir] => /var/lib/asterisk/sounds/tmp/ ) [asmanager] => Array ( [server] => localhost [port] => 5038 [username] => admin [secret] => amp111 [festival] => Array ( [text2wave] => /usr/src/festival/bin/text2wave [tempdir] => /var/lib/asterisk/sounds/tmp/ ) [asmanager] => Array ( [server] => localhost [port] => 5038 [username] => admin [secret] => amp111 ) [phpagi] => Array ( [error_handler] => 1 [debug] => [admin] => [tempdir] => /var/spool/asterisk//tmp/ ) ) ) [/code] Looking through the code, I'm not seeing the above config differences having an effect. The only difference between the way it is now and your change is the config that is passed into the astman handle resulting from: [php] <?php $this->asm->pagi =& $this; $this->config =& $this->asm->config; [/php] Those change that you pointed out mean that with the new call, the asm->pagi is set telling the astman handle that it has access to AGI, so to speak (e.g. running in an AGI session) So ... this looks like it may be the crux of the difference, because the other configs above don't look like they make a difference for AMI, but his function: [php] <?php function log($message, $level=1) { if($this->pagi != false) $this->pagi->conlog($message, $level); else error_log(date('r') . ' - ' . $message); } [/php] Which means it writes to error_log if you have not set pagi (as is the case today) but to conlog() if you have. The conlog function is: [php] <?php function conlog($str, $vbl=1) { static $busy = false; if($this->config['phpagi']['debug'] != false) { if(!$busy) // no conlogs inside conlog!!! { $busy = true; $this->verbose($str, $vbl); $busy = false; } } } [/php] Since in my case, config['phpagi']['debug'] is false, nothing gets written. And if a lot was otherwise getting written, that could very well bog things down (though it's strange that writing out log messages would have such a dramatic effect. Now be aware that the PHPAGI Config module, if being used, has settings for debug and error which can further influence the behavior. (If you are seeing the debug message as verbose messages, then you probably want to turn that level off as well to reduce the noise). So ... I think you may be right in that we want to change the code, though will probably only do it initially in 2.6 since we always approach such fundamental changes cautiously, but seems like it should be: [code] =================================================================== --- dialparties.agi (revision 7882) +++ dialparties.agi (working copy) @@ -62,7 +62,7 @@ // If we are 1.6 then we have the EXTENSION_STATE() function and don't need to use the manager // if (!$has_extension_state) { - $astman = new AGI_AsteriskManager( ); + $astman = $AGI->new_AsteriskManager(); if (!$astman->connect("127.0.0.1", $ampmgruser , $ampmgrpass)) { exit (1); } [/code]

pscs · July 26, 2009, 8:26am

I agree that it’s very odd that the logging slows things down so much.

On the ‘EventState’ AGI call, about 100+ events are returned, each of those results in a ‘process_event’ (or something) call - those don’t actually do anything since no event processor is registered - however each one does log a line saying ‘no event processor registered’ (or words to that effect).

For some reason, on the PCs I’ve tried it on here, error_log takes about 0.1 seconds per call, so those 100 events take about 10 seconds to log. Simply commenting out the error_log line in phpagi made a dramatic difference to performance.

I really don’t know why error_log takes so long here - and I’m not in the mood to start debugging PHP as well The simpler solution for me was to stop phpagi from calling it unnecessarily. I guess error_log probably hasn’t been optimised since it’s only meant to be used for errors, not for general diagnostic logging in the way that phpagi seems to use it in some circumstances.

(I can put full debug logging on in Asterisk and it makes virtually no difference to performance, so I don’t think the slowness of error_log is to do with my PC’s hardware.)

warlock67 · July 28, 2009, 9:23pm

Question to Phillippe and PSCS,
what changes should I make for testing? Is it sufficient only changing the dialparties.agi or do I need to take out the logging feature as well??

At the mome nt I am still using the SIP instead of Local but as Phillippe mentions there are some drawbacks to this such as the transfer of calls issue so I would like to get back to a working solution with “Local” but without the long wait times eg in transfers etc.

Appreciate your response here.

warlock67 · July 28, 2009, 10:33pm

I have just made that one single change in dialparties.agi and changes back the queueś to use Local instead of SIP, and have been running for an hour without a hitch. I will post tomorrow or in the next couple of days to see if we are error free and without dialparties.agi hanging. Thanks Phillippe and everyone else involved up to now. This is fantastic to see results this way. I will keep contributing with support tickets and time.

plindheimer · July 28, 2009, 10:59pm

warlock67,

the change has already been checked into 2.6 #3792. If you test on 2.5 and have some solid results it will help to drive us to back port it to 2.5. My general approach is to introduce it cautiously, even though it looks pretty straight forward per the long post above. Further testing will help convince us that it should go into 2.5 (and maybe 2.4 also).

I do still plan on getting a tarball of 2.6 beta out within the next week I hope as well, but that will be beta though pretty solid.

warlock67 · July 28, 2009, 11:15pm

Up to now this seems to have taken care of the problem, because I have been up now for more than 3 hours and no hanging dialparties.agi and I have had some peaks of 40 simultaneaous incoming and outgoing calls on this box where it would usually have crippled the box before.

Since the change to dialparties.agi we have processed some 275 incoming calls and over 500 minutes on 4 inbound queues with arround 35 static agents (using penalty within the queues) and no hangs, so this really looks like it did the trick…

plindheimer · July 29, 2009, 12:12am

warlock67,

if you could make sure to ping the ticket in the next day or two once you are pretty certain this is solid, it will trigger me to look at merging that fix into 2.5, it is sounding pretty convincing to me.

warlock67 · July 30, 2009, 1:16am

Philippe,
day 2 of operation with standard Local Static Agents and no dialparties.agi hanging and CPU load normal! Here are todays stats for the queues:
Answered Calls 1012 calls
Transferred Calls 154 calls
Avg Durat.: 368.96 secs
Total Call Time: 6223:10 min

Congratulations and thanks for the fix…looking forward to seeing it in the release and maybe even as a bugfix in 2.5.

Regards

plindheimer · July 30, 2009, 8:18am

well I think kudos goes to the gentleman above who brought it to our attention Please do update the ticket on this if you get a chance to make sure that the data is not lost. Sounds like pushing it to 2.5 makes a lot of sense.

dionv · August 3, 2009, 11:11pm

While I have not had the same call volume previous posters describe, I have a large number of queues (about 65 right now), although each queue only has about 8 static agents. We have had this issue happen twice now in the past month or two.

We are also (coincidence or otherwise) running an HP server, a DL380 G5 with 2GB RAM and dual CPUs. CPU usage is rarely an issue, and call quality is normally very good, so I don’t think we have a resource issue. We are running Trixbox 2.2 CE, upgraded with FreePBX 2.5.x, so we still have Asterisk 1.2 under the hood.

Today was the second time this happened. There were 3 calls in progress and two more calls in queue. All were in different queues. All queues use ringall, 20 seconds, with 5 second wait, no timeouts, periodic announcements every 1 minute 30 seconds, random hold music.

The two calls on hold ended up being there for almost 7 minutes by the time two of the other calls wrapped up. By almost 10 minutes in queue, they still had not rung the available agents, although those agents could call each other, so they weren’t DND.

After the third call wrapped up, I restarted asterisk, and calls started ringing through again.

These calls were coming in via an IAX2 trunk from our client, then routed to the agents’ X-Lite SIP softphones.

I’ve read through this thread, but when the problem was occurring I did not check CPU usage, and would not have known to look for lots of dialparties.agi processes. I did notice in the /var/log/asterisk/full a huge number of lines (63927) of

Of course, the timestamps were different, as well as the DEBUG numbers and the ending number on the trunk…

Since the problem seemed very similar to mine, I checked, and our queues_additional.conf is using “Local” to identify the members. But it also looks like the recent discovery of how the manager call is made may resolve the issue, but will either of these work on Asterisk 1.2? (We are investigating our upgrade path, but this is what we have for now) Or, do we have a different issue altogether? (If so, I’ll start a different thread and apologize for disturbing this one.)

TIA

DionV

(For historical note: In the past, we have had a total of 4 queues with 20 calls in progress and another 15 in queue, all in one queue, with a ringall strategy, with little degradation of service, running on a PIII-700MHz dual CPU system, using Trixbox 2.2 CE stock. All calls were coming in from SIP trunks. This happened from 8am to 9pm Monday to Friday for about 3 weeks. Never had an issue like this. 1) Testimony to the power of the software! 2) Don’t think it’s a lack of hardware resources.)

plindheimer · August 3, 2009, 11:33pm

The reported error seems like you have a completely different problem. But as far as the ‘fix’ discussed here, it is not related to asterisk version so is safe to put on your system. I merged it to 2.5 yesterday and it will work it’s way into an update to core in the not to distant future.

dionv · August 4, 2009, 2:13pm

Thanks for the quick reply, Philippe!

I’ll keep investigating the error message. I found some links regarding it yesterday, but did not have time to follow up on them.

Keep up the great work!

DionV

warlock67 · September 7, 2009, 4:01pm

Philippe,
can I now upgrade my systems with the latest 2.5.2.0?? Will this make the change in the “dialparties.agi” as discussed here???

I have been running for almost a month now with this change and no high CPU usage or hanging of dialparties.agi any more.

Please inform.

Thanks

plindheimer · September 7, 2009, 6:37pm

if I put it into 2.5 it should, you may want to confirm but if I said I was doing it here then I probably did.

warlock67 · September 7, 2009, 6:55pm

You mentioned that you would put it in 2.6 but said it would make sence to do the same for 2.5 but did not confirm that you would do so…could you let me know please before I do the upgrades on the 3 systems I have.

plindheimer · September 7, 2009, 11:42pm

warlock67,

you just need to do an ‘svn log’ on the files in the 2.5 branch (or check the changelog in the new module). I don’t remember but it will be there if I put it in.