We need better HANGUPCAUSE handling!

pwalker · April 7, 2011, 10:37pm

After some customer complaints concerning those infamous “All circuits are busy” messages, I started investigating how FreePBX handles DIALSTATUS and HANGUPCAUSE.
…and found that it’s lacking some polish.
Well, the “Route Congestion Messages” module (outroutemsg) is cool; however, it - together with the FreePBX core - has quite some room for improvement.

Most of the few HANGUPCAUSEs that are handled by FreePBX (17,18,22,23,28) are rarely reached. (Not to mention that one of them isn’t even defined in Asterisk’s causes.h.)
One of the Messages that can be defined doesn’t even seem to be used (unalloc_msg_id).
There have been various forum threads here (and on that other FreePBX-CentOS-based VoIP-distro-Who-Should-Not-Be-Named’s forum) already, and even some TODO remarks by Philippe in at least one commit, but as far as I’ve seen, this hasn’t been taken care of yet.
Therefore, I decided that this should finally be solved, or at least improved, now - and I volunteer to contribute.
So - any suggestions on:
- Which HANGUPCAUSEs should be handled, and how?
- Where (or how) is the best place to handle the different cases?

In my opinion, it would be best if the case handling were configurable (= extending the outroutemsg module?), but that would mean more work than hard-coding it.

~Philipp

mickecarlsson · April 8, 2011, 5:11am

From causes.h, checked both 1.4 and 1.8 of Asterisk:

- AST_CAUSE_USER_BUSY                       17
- AST_CAUSE_NO_USER_RESPONSE                18
- AST_CAUSE_NUMBER_CHANGED                  22
- AST_CAUSE_REDIRECTED_TO_NEW_DESTINATION   23
- AST_CAUSE_INVALID_NUMBER_FORMAT           28

There is a bug in Asterisk 1.8 that affects the hangupcause for at least 28 - Number Incomplete: https://issues.asterisk.org/view.php?id=18681

unalloc_msg_id was removed from core some time ago, since you can get HANGUPCAUSE 1 and still try another trunk. So we need to remove it from the outroutemsg module as well.

pwalker, the issue with catching hangupcauses is that you can have many trunks going to different providers. When a trunk fails and returns a hangupcause, that does not mean you should stop your dial attempt; you should try the other trunks.

Traditional PBXs only have one trunk, and there every hangupcause makes sense. With Asterisk you can have many trunks on one outbound route, and therefore should never halt on the first failure.

plindheimer · April 7, 2011, 11:24pm

Is the issue with the messages being played, or do you have fundamental issues in the call flow?

Call flow issues would be that you think some hangup cause should result in the next trunk being tried vs. giving up even if more trunks are available, as an example, or the flip side of course…

In either case, if there are explicit things you think are outright wrong and need to be addressed as a bug, feel free to file a bug, or otherwise bring them up for discussion to get to the bottom of it.

As far as the one not in causes.h, I wouldn’t know. Could be it was in a different version of Asterisk, could be it was never in there and a bug in Asterisk (e.g. it should be there), could be a bug on our part…

pwalker · April 8, 2011, 6:58am

Okay, my fault:
I was missing “AST_CAUSE_REDIRECTED_TO_NEW_DESTINATION” (23). - It’s there in Asterisk 1.8 (overlooked this, sorry!), but it isn’t in 1.4.

~Philipp

pwalker · April 8, 2011, 11:47am

I have to agree with both of you that we - at least in most cases - should not stop after we get an error while trying the first trunk - but I get quite a few of those “All circuits are busy” messages for HANGUPCAUSEs where another message would be more appropriate imho.
E.g. on “that customer system”:
20 - AST_CAUSE_SUBSCRIBER_ABSENT
21 - AST_CAUSE_CALL_REJECTED / AST_CAUSE_UNREGISTERED

So, in addition to the questions in my initial post:

Which HANGUPCAUSEs should cause the system to try other trunks, and which should make it give up?
(The problem on this system is that I only have one dial-out trunk…)
I think I need to do more analysis of what exactly causes which HANGUPCAUSE. Or does anybody have some documentation on that?

@Philippe: No fundamental issues; these are more like “Product Improvement” suggestions for which HANGUPCAUSEs should cause what (including the continue/give up question and what message should be played in which case), since in my opinion too many cases currently end up in outisbusy. (I’m currently only talking about SIP trunks; I’m not sure how this behaves for other channels/media.)

~Philipp

plindheimer · April 8, 2011, 3:20pm

We always like improvements, so no complaints.

As far as trying other trunks, the basic algorithm is:

- If a hangup cause is interpreted to be something related to the trunk and not the final number (conceptually, the trunk is CONGESTED/DOWN), then we move on to the next trunk.
- If a hangup cause is interpreted to be related to the end number, such as it being busy or not existing, then we stop and don’t try any more trunks.

This algorithm, which is the approach that has always been taken, makes sense: you should not keep trying a ‘busy’ or ‘non-existent’ number on every trunk when the first one already gave you a ‘conclusive’ answer about the availability of the end number.
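To make the two rules concrete, here is a rough sketch in Python rather than dialplan. It is not the FreePBX implementation: the function name, the result labels and the `dial` callback are made up for illustration, and treating exactly the causes mentioned earlier in this thread (17, 18, 22, 23, 28) as "destination-related" is an assumption.

```python
# Assumption for illustration: these causes are treated as conclusive
# statements about the dialed number, so no further trunk is tried.
DESTINATION_CAUSES = {17, 18, 22, 23, 28}


def dial_outbound(trunks, dial):
    """Try trunks in order.

    `dial` is a hypothetical callback returning (answered, hangup_cause)
    for one trunk. Returns a (result, trunk, cause) tuple.
    """
    last_cause = None
    for trunk in trunks:
        answered, cause = dial(trunk)
        if answered:
            return ("ANSWERED", trunk, cause)
        last_cause = cause
        if cause in DESTINATION_CAUSES:
            # The far end is busy, changed, invalid, ...: stop here.
            return ("DESTINATION_FAILED", trunk, cause)
        # Anything else is treated as a trunk problem: fall through
        # to the next trunk on the route.
    return ("ALL_TRUNKS_FAILED", None, last_cause)
```

For example, if the first trunk reports congestion (34) and the second reports busy (17), the loop stops at the second trunk and never touches a third one.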

Since carriers are far from perfect, this will not always work; however, it really is a carrier issue when they signal you wrongly. For example, some SIP carriers will give you something that translates to ‘busy’ instead of ‘congestion’ if you have used up all your channels with them, which is wrong. There are ways to get around this, for example by using the max channels configuration. It’s also possible for a carrier to incorrectly tell you a number does not exist; again, this is a carrier issue that needs to be resolved, but it could obviously affect a working system.

The general consensus is that most users would want to know about such an issue right away, rather than making many possibly expensive calls to some number on a failover carrier before discovering (via a big phone bill) that the calls were going down the wrong trunk because their primary carrier had screwed up. That said, making some of that behavior programmable would obviously be the best of all worlds…
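As a strawman for what "programmable" could look like, here is a small Python sketch of a per-cause rule table with per-route overrides. Nothing here exists in FreePBX or outroutemsg today: `CauseRule`, `resolve_rule`, the default table and the message names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CauseRule:
    try_next_trunk: bool  # continue with the next trunk, or give up
    message: str          # hypothetical name of the prompt to play


# Fallback for any cause without an explicit rule: treat it as a trunk
# problem and keep the familiar congestion message.
DEFAULT_RULE = CauseRule(try_next_trunk=True, message="all-circuits-busy")

# Assumed defaults for illustration; the message names are invented.
DEFAULT_POLICY = {
    17: CauseRule(False, "number-busy"),
    18: CauseRule(False, "no-response"),
    22: CauseRule(False, "number-changed"),
    28: CauseRule(False, "invalid-number"),
}


def resolve_rule(cause, overrides=None):
    """Return the rule for a hangup cause; overrides win over defaults."""
    policy = {**DEFAULT_POLICY, **(overrides or {})}
    return policy.get(cause, DEFAULT_RULE)
```

An admin whose carrier answers ‘busy’ when all channels are in use could then pass `{17: CauseRule(True, "all-circuits-busy")}` as the override for that route, and everyone else keeps the conservative default.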