Hi Everyone!
I have FreePBX running on an AWS EC2 instance. I’m using SipStation trunking. It’s been performing very well for a year and a half, except for one intermittent, intractable problem. Occasionally, spontaneously, my trunks suddenly stop working. Incoming and outgoing external calls do not work during this time, but echo tests and intra-company extension calling work fine.
The weird thing is that if I ssh into the server and run an “fwconsole restart”, everything immediately works. The trunks re-register, incoming and outgoing calls work, everyone is happy.
This happens occasionally. Sometimes it’s once a month, sometimes as long as 2 months pass without it happening, and rarely I have a spate where it happens every day! I’m in one of those situations right now. It’s happened spontaneously each day for the last three business days and I’m getting exasperated. Usually it’s discovered when staff get to work in the morning and realize at opening time that the phones aren’t working.
I’ve had a Sangoma licensed engineer work on this with me and look over the logs. Here’s what he noticed in the logs recently when this happened a couple weeks ago:
—————
“Here, SIPSTATION Trunk 1 failed out at 1:30pm so you failed over to Trunk 2:
[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [s@macro-dialout-trunk:27] Dial(“PJSIP/220-000042b0”, "PJSIP/866xxxxxxx@fpbx-1-xxxxxxxxxxx,300,Tb(func-apply-sipheaders^s^1,(1))U(sub-send-obroute-email^866xxxxxxx^866xxxxxxx^1^162xxxxxxx^^530xxxxxxx)”) in new stack
[2021-07-13 13:30:51] ERROR[16498] res_pjsip.c: Endpoint 'fpbx-1-xxxxxxxxxxxxx’: Could not create dialog to invalid URI 'fpbx-1-xxxxxxxxxxxx’. Is endpoint registered and reachable?
[2021-07-13 13:30:51] ERROR[16498] chan_pjsip.c: Failed to create outgoing session to endpoint 'fpbx-1-xxxxxxxxxxxx’
[2021-07-13 13:30:51] WARNING[3626][C-000017d6] app_dial.c: Unable to create channel of type ‘PJSIP’ (cause 3 - No route to destination)
[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] app_dial.c: No devices or endpoints to dial (technology/resource)
[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [s@macro-dialout-trunk:28] NoOp(“PJSIP/220-000042b0”, “Dial failed for some reason with DIALSTATUS = CHANUNAVAIL and HANGUPCAUSE = 3”) in new stack
[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [s@macro-dialout-trunk:29] GotoIf(“PJSIP/220-000042b0”, “0?continue,1:s-CHANUNAVAIL,1”) in new stack
[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx_builtins.c: Goto (macro-dialout-trunk,s-CHANUNAVAIL,1)
[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [s-CHANUNAVAIL@macro-dialout-trunk:1] Set(“PJSIP/220-000042b0”, “RC=3”) in new stack
[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [s-CHANUNAVAIL@macro-dialout-trunk:2] Goto(“PJSIP/220-000042b0”, “3,1”) in new stack
[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx_builtins.c: Goto (macro-dialout-trunk,3,1)
[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [3@macro-dialout-trunk:1] Goto(“PJSIP/220-000042b0”, “continue,1”) in new stack
[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx_builtins.c: Goto (macro-dialout-trunk,continue,1)
[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [continue@macro-dialout-trunk:1] NoOp(“PJSIP/220-000042b0”, “TRUNK Dial failed due to CHANUNAVAIL HANGUPCAUSE: 3 - failing through to other trunks”) in new stack
[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [continue@macro-dialout-trunk:2] ExecIf(“PJSIP/220-000042b0”, “1?Set(CALLERID(number)=220)”) in new stack
[2021-07-13 13:30:51] VERBOSE[3626][C-000017d6] pbx.c: Executing [866xxxxxxx@restrictedroute-xxxxxxxxxxxxxxxxxxxxe:13] Macro(“PJSIP/220-000042b0”, “dialout-trunk,2,866xxxxxxx,off”) in new stack
Then it looks like either Trunk 2 failed and/or Trunk 1 reported connection again so the PBX attempted to start routing calls that way at 1:40pm and started failing again:
[2021-07-13 13:40:14] VERBOSE[5181][C-000017dc] app_stack.c: PJSIP/fpbx-1-xxxxxxxxxxxxxxxx-000042c3 Internal Gosub(func-apply-sipheaders,s,1(1)) complete GOSUB_RETVAL=
[2021-07-13 13:40:14] VERBOSE[5181][C-000017dc] app_dial.c: Called PJSIP/530xxxxxxx@fpbx-1-xxxxxxxxxxxxx
[2021-07-13 13:40:15] VERBOSE[5181][C-000017dc] app_dial.c: Everyone is busy/congested at this time (1:1/0/0)
[2021-07-13 13:40:15] VERBOSE[5181][C-000017dc] pbx.c: Executing [s@macro-dialout-trunk:28] NoOp(“PJSIP/222-000042c2”, “Dial failed for some reason with DIALSTATUS = BUSY and HANGUPCAUSE = 17”) in new stack
[2021-07-13 13:40:15] VERBOSE[5181][C-000017dc] pbx.c: Executing [s@macro-dialout-trunk:29] GotoIf(“PJSIP/222-000042c2”, “0?continue,1:s-BUSY,1”) in new stack
[2021-07-13 13:40:15] VERBOSE[5181][C-000017dc] pbx_builtins.c: Goto (macro-dialout-trunk,s-BUSY,1)
[2021-07-13 13:40:15] VERBOSE[5181][C-000017dc] pbx.c: Executing [s-BUSY@macro-dialout-trunk:1] NoOp(“PJSIP/222-000042c2”, “Dial failed due to trunk reporting BUSY - giving up”) in new stack
From there it looks like Trunk 2 is not tried again (maybe it went unavailable?) and Trunk 1 keeps failing until you restart services. I see two HANGUPCAUSE codes: 3 (No route to destination) and 17 (User busy). Either could indicate SIPSTATION/Bandwidth.com (their back end provider, since they are effectively a reseller) was having issues and that’s what dropped your calls, but it could also be other network issues somewhere between AWS and the Trunks. You should speak to SIPSTATION Support to see what their logs show for these two timestamps.”
—————
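In case it helps anyone compare against their own logs, here is a rough Python sketch for pulling the failed trunk dials out of an Asterisk log, based only on the "Dial failed for some reason with DIALSTATUS = ... and HANGUPCAUSE = ..." lines quoted above. The function names (find_trunk_failures, summarize) are made up for illustration, and the cause-code table only covers the two codes the engineer mentioned, so treat it as a starting point rather than a finished tool.

```python
import re
from collections import Counter

# Matches the NoOp line that macro-dialout-trunk logs after a failed Dial(),
# as seen in the excerpt above. Captures timestamp, DIALSTATUS and HANGUPCAUSE.
FAILURE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\].*Dial failed for some reason with "
    r"DIALSTATUS = (?P<status>\w+) and HANGUPCAUSE = (?P<cause>\d+)"
)

# Only the two cause codes named in the engineer's notes; not a full table.
CAUSE_NAMES = {3: "No route to destination", 17: "User busy"}


def find_trunk_failures(lines):
    """Return (timestamp, dialstatus, hangupcause) for each failed trunk dial."""
    failures = []
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            failures.append(
                (match.group("ts"), match.group("status"), int(match.group("cause")))
            )
    return failures


def summarize(failures):
    """Count failures per (dialstatus, hangupcause) pair."""
    return Counter((status, cause) for _, status, cause in failures)


def describe(cause):
    """Human-readable name for a hangup cause, or 'unknown' if not in the table."""
    return CAUSE_NAMES.get(cause, "unknown")
```

To use it, pass any iterable of log lines, for example an open file handle on the Asterisk full log (on my system I would expect that at /var/log/asterisk/full, but that path is an assumption). Lining up the resulting timestamps against the provider's call traces should show whether failures cluster right before each outage.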
I’ve contacted SipStation to trace these calls during this time and they say that they see on their end that calls “timed out after starting to establish the call, this timeout generally indicates that there may have been an interruption in network connectivity at the time”.
So the engineer who is my consultant thinks this may be an AWS connectivity problem. He sees no evidence in the logs of Asterisk crashing. AWS internet connectivity interruptions seem surprising to me, since AWS should have some of the most rock solid internet connectivity on earth. A lot of people run large corporate PBXs on AWS cloud with no issues. What’s more, if this were an AWS connectivity problem, why would restarting Asterisk always fix the issue?
Does anyone have any ideas for anything I could try to address this issue?
Thanks!