FreePBX in Amazon EC2 cloud, and SipStation Trunk issues!

Hi Everyone!

I have FPBX running on an AWS EC2 instance with SipStation trunking. It has performed very well for a year and a half, except for one intermittent, intractable problem. Occasionally and spontaneously, my trunks stop working. Incoming and outgoing external calls fail during this time, but echo tests and intra-company extension calls work fine.

The weird thing is that if I ssh into the server and run an “fwconsole restart”, everything immediately works. The trunks re-register, incoming and outgoing calls work, everyone is happy.

This happens occasionally. Sometimes it’s once a month, sometimes as long as two months pass without it happening, and rarely I have a spate where it happens every day! I’m in one of those situations right now. It has happened spontaneously on each of the last three business days and I’m getting exasperated. Usually staff get to work in the morning and realize at opening time that the phones aren’t working.

I’ve had a Sangoma licensed engineer work on this with me and look over the logs. Here’s what he noticed in the logs recently when this happened a couple weeks ago:

—————

“Here, SIPSTATION Trunk 1 failed out at 1:30pm so you failed over to Trunk 2:

[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [[email protected]:27] Dial(“PJSIP/220-000042b0”, "PJSIP/[email protected],300,Tb(func-apply-sipheaders^s^1,(1))U(sub-send-obroute-email^866xxxxxxx^866xxxxxxx^1^162xxxxxxx^^530xxxxxxx)”) in new stack

[2021-07-13 13:30:51] ERROR[16498] res_pjsip.c: Endpoint 'fpbx-1-xxxxxxxxxxxxx’: Could not create dialog to invalid URI 'fpbx-1-xxxxxxxxxxxx’. Is endpoint registered and reachable?

[2021-07-13 13:30:51] ERROR[16498] chan_pjsip.c: Failed to create outgoing session to endpoint 'fpbx-1-xxxxxxxxxxxx’

[2021-07-13 13:30:51] WARNING[3626][C-000017d6] app_dial.c: Unable to create channel of type ‘PJSIP’ (cause 3 - No route to destination)

[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] app_dial.c: No devices or endpoints to dial (technology/resource)

[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [[email protected]:28] NoOp(“PJSIP/220-000042b0”, “Dial failed for some reason with DIALSTATUS = CHANUNAVAIL and HANGUPCAUSE = 3”) in new stack

[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [[email protected]:29] GotoIf(“PJSIP/220-000042b0”, “0?continue,1:s-CHANUNAVAIL,1”) in new stack

[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx_builtins.c: Goto (macro-dialout-trunk,s-CHANUNAVAIL,1)

[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [[email protected]:1] Set(“PJSIP/220-000042b0”, “RC=3”) in new stack

[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [[email protected]:2] Goto(“PJSIP/220-000042b0”, “3,1”) in new stack

[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx_builtins.c: Goto (macro-dialout-trunk,3,1)

[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [[email protected]:1] Goto(“PJSIP/220-000042b0”, “continue,1”) in new stack

[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx_builtins.c: Goto (macro-dialout-trunk,continue,1)

[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [[email protected]:1] NoOp(“PJSIP/220-000042b0”, “TRUNK Dial failed due to CHANUNAVAIL HANGUPCAUSE: 3 - failing through to other trunks”) in new stack

[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [[email protected]:2] ExecIf(“PJSIP/220-000042b0”, “1?Set(CALLERID(number)=220)”) in new stack

[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [[email protected]:13] Macro(“PJSIP/220-000042b0”, “dialout-trunk,2,866xxxxxxx,off”) in new stack

Then it looks like either Trunk 2 failed and/or Trunk 1 reported connection again so the PBX attempted to start routing calls that way at 1:40pm and started failing again:

[2021-07-13 13:40:14] VERBOSE[5181][C-000017dc] app_stack.c: PJSIP/fpbx-1-xxxxxxxxxxxxxxxx-000042c3 Internal Gosub(func-apply-sipheaders,s,1(1)) complete GOSUB_RETVAL=

[2021-07-13 13:40:14] VERBOSE[5181][C-000017dc] app_dial.c: Called PJSIP/[email protected]

[2021-07-13 13:40:15] VERBOSE[5181][C-000017dc] app_dial.c: Everyone is busy/congested at this time (1:1/0/0)

[2021-07-13 13:40:15] VERBOSE[5181][C-000017dc] pbx.c: Executing [[email protected]:28] NoOp(“PJSIP/222-000042c2”, “Dial failed for some reason with DIALSTATUS = BUSY and HANGUPCAUSE = 17”) in new stack

[2021-07-13 13:40:15] VERBOSE[5181][C-000017dc] pbx.c: Executing [[email protected]:29] GotoIf(“PJSIP/222-000042c2”, “0?continue,1:s-BUSY,1”) in new stack

[2021-07-13 13:40:15] VERBOSE[5181][C-000017dc] pbx_builtins.c: Goto (macro-dialout-trunk,s-BUSY,1)

[2021-07-13 13:40:15] VERBOSE[5181][C-000017dc] pbx.c: Executing [[email protected]:1] NoOp(“PJSIP/222-000042c2”, “Dial failed due to trunk reporting BUSY - giving up”) in new stack

From there it looks like Trunk 2 is not tried again (maybe it went unavailable?) and Trunk 1 keeps failing until you restart services. I see two HANGUPCAUSE codes: 3 (No route to destination) and 17 (User busy). Either could indicate that SIPSTATION/Bandwidth.com (their back end provider, since they are effectively a reseller) was having issues and that’s what dropped your calls, but it could also be other network issues somewhere between AWS and the Trunks. You should speak to SIPSTATION Support to see what their logs show for these two timestamps."

—————
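For anyone who wants to see how often each failure type shows up across a whole log, here is a minimal Python sketch. It only relies on the "DIALSTATUS = ... and HANGUPCAUSE = ..." text visible in the excerpt above. The sample lines are abbreviated copies of those entries (the "..." stands in for the dialplan detail), and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the NoOp text logged when a trunk dial fails, e.g.
# "... DIALSTATUS = CHANUNAVAIL and HANGUPCAUSE = 3"
FAILURE = re.compile(
    r"^\[(?P<ts>[\d\- :]+)\].*"
    r"DIALSTATUS = (?P<status>\w+) and HANGUPCAUSE = (?P<cause>\d+)"
)

def summarize_failures(lines):
    """Return (timestamp, DIALSTATUS, HANGUPCAUSE) for each failed trunk dial."""
    hits = []
    for line in lines:
        m = FAILURE.match(line)
        if m:
            hits.append((m.group("ts"), m.group("status"), int(m.group("cause"))))
    return hits

# Abbreviated sample lines based on the excerpt above (not a full log).
sample = [
    "[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: ... "
    "DIALSTATUS = CHANUNAVAIL and HANGUPCAUSE = 3",
    "[2021-07-13 13:40:15] VERBOSE[5181][C-000017dc] app_dial.c: "
    "Everyone is busy/congested at this time (1:1/0/0)",
    "[2021-07-13 13:40:15] VERBOSE[5181][C-000017dc] pbx.c: ... "
    "DIALSTATUS = BUSY and HANGUPCAUSE = 17",
]

hits = summarize_failures(sample)
counts = Counter((status, cause) for _, status, cause in hits)
for ts, status, cause in hits:
    print(ts, status, cause)
```

To run it against a real log, pass an open file object instead of `sample`. On many FreePBX installs the log is /var/log/asterisk/full, but that path is an assumption about your system.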

I’ve contacted SipStation to trace these calls during this time and they say that they see on their end that calls “timed out after starting to establish the call, this timeout generally indicates that there may have been an interruption in network connectivity at the time”.

So the engineer who is my consultant thinks this may be an AWS connectivity problem. He sees no evidence in the logs of Asterisk crashing. AWS internet connectivity interruptions seem surprising to me, since AWS should have some of the most rock solid internet connectivity on earth. A lot of people run large corporate PBXs on AWS cloud with no issues. What’s more, if this were an AWS connectivity problem, why would restarting Asterisk always fix the issue?

Does anyone have any ideas for anything I could try to address this issue?

Thanks!

And one other question. If this is indeed an AWS connectivity problem, is there a way to get FPBX/SipStation to automatically re-register the trunks in a situation like this? I shouldn’t have to restart asterisk to force re-registration. It should detect this and re-register automatically. I have never had a situation where fwconsole restart didn’t fix the problem the first time I ran it!

This is not a connectivity issue. If the EC2 instance were losing connectivity, your internal calls would not work. If freepbx.com were losing connectivity, this board would be jammed with complaints. While it’s conceivable that an internet routing issue would cause traffic between these endpoints to fail, it is very implausible for it to happen repeatedly.

My first guess is a FreePBX firewall issue. In Connectivity -> Firewall -> Intrusion Detection, are there any banned IPs? If so, are either trunk1 [192.159.66.3] or trunk2 [162.253.134.142] on the list? If so, while I don’t know how they could have gotten there, whitelisting them will probably improve things. Also, mark those addresses Trusted on the Networks tab.

If no luck, at the Asterisk command prompt, type
pjsip set logger on
which will log all SIP traffic to the Asterisk log (along with the regular entries) and to the console. With default settings, you should see REGISTER requests going out at least once per minute. If so, what replies, if any, do you see?

Also, run sngrep . This will show traffic ‘outside’ of the FreePBX firewall, i.e. it is ahead of the firewall for incoming packets, and after the firewall for outgoing packets. For example, if it shows replies to REGISTER, but they don’t appear in pjsip logger, the firewall is blocking them.
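If the capture gets long, matching REGISTER requests to their replies by hand is tedious. Below is a rough Python sketch of that matching, assuming you have already split the capture into individual SIP messages. It relies only on standard SIP syntax (start line, Call-ID and CSeq headers); the function names and the sample messages (example.com hosts, "abc123" Call-ID) are invented for illustration, and compact header forms are not handled.

```python
import re

START_REQ = re.compile(r"^(?P<method>[A-Z]+) \S+ SIP/2\.0$")
START_RESP = re.compile(r"^SIP/2\.0 (?P<code>\d{3})(?:\s|$)")

def parse_sip(message):
    """Classify one SIP message and extract its (Call-ID, CSeq) key."""
    lines = message.strip().splitlines()
    if not lines:
        return None
    start = lines[0].strip()
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    key = (headers.get("call-id"), headers.get("cseq"))
    req = START_REQ.match(start)
    if req:
        return {"kind": "request", "method": req.group("method"), "key": key}
    resp = START_RESP.match(start)
    if resp:
        return {"kind": "response", "code": int(resp.group("code")), "key": key}
    return None

def unanswered_registers(messages):
    """Return the (Call-ID, CSeq) of every REGISTER that got no response."""
    parsed = [p for p in map(parse_sip, messages) if p]
    answered = {p["key"] for p in parsed if p["kind"] == "response"}
    return [
        p["key"]
        for p in parsed
        if p["kind"] == "request"
        and p["method"] == "REGISTER"
        and p["key"] not in answered
    ]

# Made-up messages: the first REGISTER is challenged, the second gets no reply.
messages = [
    "REGISTER sip:trunk.example.com SIP/2.0\nCall-ID: abc123\nCSeq: 1 REGISTER\n",
    "SIP/2.0 401 Unauthorized\nCall-ID: abc123\nCSeq: 1 REGISTER\n",
    "REGISTER sip:trunk.example.com SIP/2.0\nCall-ID: abc123\nCSeq: 2 REGISTER\n",
]

unanswered = unanswered_registers(messages)
print(unanswered)
```

Running this once on the messages sngrep shows and once on what the pjsip logger shows would make the comparison concrete: a REGISTER that is answered in the first set but unanswered in the second points at something dropping the reply before Asterisk sees it.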

I believe that SIPStation does not require registration for outgoing calls, even if set up to use registration for incoming. So, it would be interesting to see the outgoing INVITE for a failed call, as well as the error response. That should show whether the INVITE headers are somehow corrupted, or an error response is being inappropriately sent by SIPStation. If the latter, it could indicate a problem with your account, or a response from a security mechanism at SIPStation e.g. because Asterisk is sending excessive traffic, due to a bug or configuration error.

Though I don’t suggest rocking the boat now, in the long term it will be more robust to use IP auth rather than registration. On the SIPStation portal, you configure the IP address of your PBX and incoming calls are automatically sent there. That ensures that if connectivity is ok when a call is received, the call will come in, even if there were previous connectivity issues that could have resulted in loss of registration.

Very interesting! Thank you for your thoughts. I do not use the firewall module in FPBX; it is not enabled at all. The consultant recommended we manage firewalling directly on the EC2 instance with AWS security groups. (He has seen times when the firewall module in FPBX ended up locking a user out of their AWS instance, and with no physical access, there was no way to recover.) The ports and IPs of SipStation are allowed in our AWS security group, and I suppose it’s not an incorrect or missing security group entry, since restarting Asterisk always fixes the problem.

When the issue next occurs, I’ll run the commands above and report any useful info received!

Thanks so much for taking time to respond.

On another note, the second post here: SIP Trunks are de-registering up to 4 times a day!
makes it sound like asterisk only tries to register 10 times and then stops. Does anyone know if that’s still the case?

If so, would setting registerattempts=0 as per the post above potentially be helpful?

Thanks!

That may be the default for chan_sip, but you have pjsip trunks. The analogous parameter is Max Retries, which defaults to 10000. That is enough to keep retrying for several days (assuming sane retry intervals).
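As a quick sanity check on "several days", here is the arithmetic as a small Python sketch. The 60-second retry interval is an assumption chosen for illustration; check the value actually configured on your trunk.

```python
def retry_window_days(max_retries, retry_interval_s):
    """How long retries continue before giving up, in days."""
    return max_retries * retry_interval_s / 86400  # 86400 seconds per day

# 10000 retries at an assumed 60 seconds between attempts
days = retry_window_days(10000, 60)
print(f"{days:.1f} days")  # prints "6.9 days"
```

So with those numbers, registration attempts would continue for about a week before stopping.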

If you are set up to use registration and there are times when it is lost, capturing the REGISTER requests and corresponding replies can be a good way to track down networking, firewall or server issues. However, in most cases, I believe that a server with a static IP should avoid registration where possible. You can still use qualify to monitor the connection.

Thanks! I will turn off SipStation registration and use ACL to limit access to my PBX static IP. That alone may fix the problem.

THANK YOU @Stewart1 !
