Incoming call failures when provider loses a POP

simplydrew · May 3, 2016, 1:24pm

Hi All,

I seem to have come across a problem that might be configuration related in my setup, but haven’t been able to pin it down. FreePBX version is 2.11.0.43, Asterisk version 11.16.0.

I use a few different SIP providers, my primary being a small mom & pop shop, which I have my DIDs with, and route outbound calls to as well. Second in line is Aveno for outbound, and then Vitelity if all else fails. The inbound provider for my DIDs has mostly been solid, and they have a few different POPs across the US - for sake of this explanation, POP1, POP2, and POP3. They’re using Asterisk for call control, it appears, and an A2Billing portal where I set DID destinations, etc.

I have a separate SIP peer set up for each of their three POPs, all using the config below, with “pop1.provider.com” swapped for POP2, POP3, and so forth.

====

type=friend
qualify=yes
insecure=port,invite
host=pop1.provider.com
dtmfmode=rfc2833
deny=all
context=from-trunk
canreinvite=no
====

They’ll route calls from any one of those POPs, depending on which one is least busy/responds quickest. Unfortunately, what I’m seeing is that if they lose a POP (provider had a DoS that affected them, etc.) or have some other connectivity disruption, this supposedly redundant peer configuration doesn’t appear to work. The call will still try to deliver over the peer that’s unreachable, and will not fail over to the others. At first, I thought this was most likely something on their end, since I wasn’t seeing anything in the CLI when making test calls during one of these events, and I have clear iptables allow rules for all of their POPs.

I did some troubleshooting with the provider yesterday, intentionally “breaking” a few things: I had them temporarily ban my IP from one POP at a time while I placed test calls. The issue wouldn’t reproduce at first, but after blocking a POP or two, the provider would eventually see my side returning a busy:

– SIP/[my cloud PBX IP]-00000013 is circuit-busy

== Everyone is busy/congested at this time (1:0/1/0)
Failed to authenticate on INVITE to ‘“WIRELESS CALLER” <sip:[my cell phone test dialing number]@[provider POP IP]>;tag=as13a3b2d7’

Calls are being routed by them to me as SIP/1212XXXXXXX@[my cloud PBX IP], so just sending by DID @ IP.

To them, it looks like I’m actively rejecting invites from their other peers once I’m “stuck” on one, somehow. I’ve gone as far as temporarily turning iptables off altogether while testing with them, and this still seems to happen. They’re under the impression that a packet rejecting the call gets sent back to them at one of the other POPs, and that’s where everything goes sour. I won’t have an opportunity to do a pcap to confirm that until a weekend maintenance window though.

Just throwing this out there to see if anyone has encountered something similar. This, of course, really screws me up when they have a network event at one of their POPs, and while all POPs should be attempting to send me the call (which appears to be happening on their side), it appears that Asterisk is sending something back on my end that is potentially screwing things up.

Any thoughts?

simplydrew · May 4, 2016, 11:51am

Anyone come across this before?

cynjut · May 4, 2016, 1:57pm

There is a checkbox that says “continue on to the next trunk if there’s a failure”, but that doesn’t help you if you’ve hit the last trunk in the list, and that doesn’t sound like what you are describing.

It might help if you were to split your troubleshooting into incoming and outgoing. Your explanation seems to wander back and forth between these two states, and commingling them makes your troubleshooting steps unnecessarily complicated.

The logs (/var/log/asterisk/full) should tell you a lot about what is going on at your end. Knowing the context in which the failure is occurring (both ways to read that are correct) should give you a better understanding of what you are seeing as a result.

simplydrew · May 4, 2016, 3:34pm

Hi Dave,

Thanks for the reply. Sorry that it appears this is wandering back and forth - I’m not having issues with outbound at this point, was just trying to describe to folks how my setup looks.

I do not have the “continue on to next trunk” actually checked, as I didn’t know if that applied for incoming calls or not. I assumed that only applied for outbound calls. Is my interpretation incorrect? That may actually be the resolution to my problem, but couldn’t understand if that applied to my situation for inbound.

I’ll dig further into the full log. I was troubleshooting in real time by watching the CLI with high verbosity set, and was not seeing Asterisk return anything to the incoming provider’s other POPs when one was down.

cynjut · May 4, 2016, 4:00pm

Your incoming trunks should always work independently of one another, and (for that matter) largely independent of your outgoing trunks as well.

Your assumption that “continue to next trunk” only applies to outbound is correct. It wouldn’t apply to inbound, since there’s no such thing as an inbound “next trunk”. The traffic will always come from where it comes from.

One thing to remember is that there are lots of times when the people at the remote end will say “you sent a response” that implies that you are not accepting calls on a specific interface, but in fact your end didn’t send anything at all. I ran into that when I was trying to troubleshoot a firewall issue on one of my new servers.

If an incoming call fails, there shouldn’t be anything on your end that changes - you will simply not get anything from them.

How are you validating your connections? Are you using registrations or are you using IP address validation? If the former, make sure that all of your registrations are unique (not using the same login on all of the connections). If the latter, make sure you have the right IP addresses in all the right places (permits/denys, fail2ban whitelists, and the firewall).
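
A minimal sketch of that last check, assuming IP-based validation: given the source addresses the provider sends calls from and the networks allowed on the PBX side, list any address that no entry covers. The addresses below are hypothetical placeholders from the documentation ranges, and the allow list stands in for whatever permit lines, fail2ban ignore list, or firewall rules are actually in use.

```python
import ipaddress


def uncovered_sources(source_ips, allowed_networks):
    """Return the source IPs that fall outside every allowed network."""
    networks = [ipaddress.ip_network(n, strict=False) for n in allowed_networks]
    return [
        ip for ip in source_ips
        if not any(ipaddress.ip_address(ip) in net for net in networks)
    ]


# Hypothetical provider POP addresses (documentation ranges, not real hosts).
pop_ips = ["192.0.2.10", "198.51.100.20", "203.0.113.30"]
# Hypothetical allow list, e.g. copied from permit lines or a firewall rule set.
allow_list = ["192.0.2.0/24", "198.51.100.20/32"]

missing = uncovered_sources(pop_ips, allow_list)
print(missing)
```

Any address printed here is a POP that could be dropped or challenged before a call ever reaches the dialplan, so it is worth running the same comparison against each place an allow list lives.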

simplydrew · May 4, 2016, 4:09pm

Understood - my inbound trunks are set up differently from the outbound ones. My outbound ones are registration based to one individual server with the provider, whereas the inbound trunks are set up per the original post, with a separate trunk to each individual POP - the intention being to allow inbound connections from whichever POP they send the call out from.

I figured that “continue to next trunk” was just for outbound, but wanted to confirm.

I’m validating the connections with the qualify parameter being defined in the trunk. IP address validation. Calls are being sent to me by IP from the provider - on my end is a static IP. I’ve ruled out the firewall, as all the provider’s POPs have a clear allow rule for all ports. I’ve also turned iptables completely off as well. Fail2ban I haven’t checked - that might be a good start. However, since all the POP IPs are allowed in the firewall, I didn’t think fail2ban would be something to worry about.

simplydrew · May 5, 2016, 7:36pm

The plot thickens - I have a lead!

So, one of the datacenter providers that my provider uses had a connectivity blip today, and it’s the primary point my inbound calls come from. The issue reproduced itself, and I was able to capture sip debugs. Sanitized paste below:

Example where failover didn’t work from “east” when “west” went down. Did a “sip set debug peer west”:

Reliably Transmitting (NAT) to [provider_eastcoast_POP_IP:5060:
OPTIONS sip:east.provider.com SIP/2.0
Via: SIP/2.0/UDP my_cloud_pbx_ip:5060;branch=z9hG4bK0802a920;rport
Max-Forwards: 70
From: “Unknown” sip:Unknown@my_cloud_pbx_ip;tag=as567610e8
To: sip:east.provider.com
Contact: sip:Unknown@my_cloud_pbx_ip:5060
Call-ID: 475bc5356f1efdb65f5e98cf3b8c23a0@my_cloud_pbx_ip:5060
CSeq: 102 OPTIONS
User-Agent: FPBX-2.11.0(11.16.0)
Date: Thu, 05 May 2016 18:10:09 GMT
Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY, INFO, PUBLISH, MESSAGE
Supported: replaces, timer
Content-Length: 0

<— SIP read from UDP:[provider_eastcoast_POP_IP:5060 —>
SIP/2.0 404 Not Found
Via: SIP/2.0/UDP my_cloud_pbx_ip:5060;branch=z9hG4bK0802a920;received=my_cloud_pbx_ip;rport=5060
From: “Unknown” sip:Unknown@my_cloud_pbx_ip;tag=as567610e8
To: sip:east.provider.com;tag=as7496524a
Call-ID: 475bc5356f1efdb65f5e98cf3b8c23a0@my_cloud_pbx_ip:5060
CSeq: 102 OPTIONS
Server: Asterisk PBX 1.8.13.1~dfsg1-3+deb7u3
Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY, INFO, PUBLISH
Supported: replaces, timer
Accept: application/sdp
Content-Length: 0

<------------->
— (11 headers 0 lines) —
Really destroying SIP dialog ‘475bc5356f1efdb65f5e98cf3b8c23a0@my_cloud_pbx_ip:5060’ Method: OPTIONS

<— SIP read from UDP:[provider_eastcoast_POP_IP:5060 —>
INVITE sip:14016546711@my_cloud_pbx_ip SIP/2.0
Via: SIP/2.0/UDP [provider_eastcoast_POP_IP:5060;branch=z9hG4bK7212620e;rport
Max-Forwards: 70
From: “WIRELESS CALLER” sip:[my_cell_number]@[provider_eastcoast_POP_IP;tag=as3d439b9d
To: sip:14016546711@my_cloud_pbx_ip
Contact: sip:[my_cell_number]@[provider_eastcoast_POP_IP:5060
Call-ID: 30fc9dca2b2d8cc728ba282165abc760@[provider_eastcoast_POP_IP:5060
CSeq: 102 INVITE
User-Agent: Asterisk PBX 1.8.13.1~dfsg1-3+deb7u3
Date: Thu, 05 May 2016 18:10:10 GMT
Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY, INFO, PUBLISH
Supported: replaces, timer
Remote-Party-ID: “WIRELESS CALLER” sip:[my_cell_number]@[provider_eastcoast_POP_IP;party=calling;privacy=off;screen=no
Content-Type: application/sdp
Content-Length: 303

v=0
o=root 1733330437 1733330437 IN IP4 [provider_eastcoast_POP_IP
s=Asterisk PBX 1.8.13.1~dfsg1-3+deb7u3
c=IN IP4 [provider_eastcoast_POP_IP
t=0 0
m=audio 16524 RTP/AVP 0 3 8 101
a=rtpmap:0 PCMU/8000
a=rtpmap:3 GSM/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=ptime:20
a=sendrecv
<------------->
— (15 headers 13 lines) —
Sending to [provider_eastcoast_POP_IP:5060 (NAT)
Sending to [provider_eastcoast_POP_IP:5060 (NAT)
Using INVITE request as basis request - 30fc9dca2b2d8cc728ba282165abc760@[provider_eastcoast_POP_IP:5060

Found peer ‘POP1-out’ for ‘[my_cell_number]’ from [provider_eastcoast_POP_IP:5060

<— Reliably Transmitting (NAT) to [provider_eastcoast_POP_IP:5060 —>
SIP/2.0 401 Unauthorized
Via: SIP/2.0/UDP [provider_eastcoast_POP_IP:5060;branch=z9hG4bK7212620e;received=[provider_eastcoast_POP_IP;rport=5060
From: “WIRELESS CALLER” sip:[my_cell_number]@[provider_eastcoast_POP_IP;tag=as3d439b9d
To: sip:14016546711@my_cloud_pbx_ip;tag=as621a28e2
Call-ID: 30fc9dca2b2d8cc728ba282165abc760@[provider_eastcoast_POP_IP:5060
CSeq: 102 INVITE
Server: FPBX-2.11.0(11.16.0)
Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY, INFO, PUBLISH, MESSAGE
Supported: replaces, timer
WWW-Authenticate: Digest algorithm=MD5, realm=“asterisk”, nonce="53f87fbe"
Content-Length: 0

<------------>
Scheduling destruction of SIP dialog ‘30fc9dca2b2d8cc728ba282165abc760@[provider_eastcoast_POP_IP:5060’ in 32000 ms (Method: INVITE)

<— SIP read from UDP:[provider_eastcoast_POP_IP:5060 —>
ACK sip:14016546711@my_cloud_pbx_ip SIP/2.0
Via: SIP/2.0/UDP [provider_eastcoast_POP_IP:5060;branch=z9hG4bK7212620e;rport
Max-Forwards: 70
From: “WIRELESS CALLER” sip:[my_cell_number]@[provider_eastcoast_POP_IP;tag=as3d439b9d
To: sip:14016546711@my_cloud_pbx_ip;tag=as621a28e2
Contact: sip:[my_cell_number]@[provider_eastcoast_POP_IP:5060
Call-ID: 30fc9dca2b2d8cc728ba282165abc760@[provider_eastcoast_POP_IP:5060
CSeq: 102 ACK
User-Agent: Asterisk PBX 1.8.13.1~dfsg1-3+deb7u3
Content-Length: 0

<------------->
— (10 headers 0 lines) —

=====

The line I want to call out - “Found peer ‘POP1-out’ for ‘[my_cell_number]’ from [provider_eastcoast_POP_IP:5060” - is a little interesting, because “POP1-out” is an outbound peer that’s using registration. Inbound calls should be coming into my inbound IP authentication peers. I think this might be part of the problem: a call from another POP is coming in and hitting one of my peers that’s intended for outbound, not inbound. I assume this to be true considering that I see a “401 Unauthorized” shortly after that in the debug.

Is there any way to have inbound calls always hit the peers that I have set up for inbound service?
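
One way to check for this kind of overlap is to look for sections in the SIP configuration that point at the same host, since an INVITE arriving from that address could then be attributed to either of them. The sketch below is illustrative only: it reads sip.conf-style text, ignores templates and includes, and compares host strings rather than resolved addresses. `POP1-out` comes from the trace above; the other section names, the second hostname, and the secret are hypothetical.

```python
from collections import defaultdict


def peers_by_host(conf_text):
    """Group section names by their host= value in sip.conf-style text."""
    hosts = defaultdict(list)
    section = None
    for raw in conf_text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop ; comments
        if line.startswith("[") and "]" in line:
            section = line[1:line.index("]")]
        elif section and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            value = value.lstrip(">").strip()  # tolerate key => value
            if key.lower() == "host":
                hosts[value.lower()].append(section)
    return dict(hosts)


def shared_hosts(conf_text):
    """Return only the hosts that more than one section points at."""
    return {h: names for h, names in peers_by_host(conf_text).items() if len(names) > 1}


# Hypothetical configuration: one outbound and one inbound entry for the same POP.
SAMPLE = """
[POP1-out]
type=friend
host=east.provider.com
secret=hypothetical

[POP1-in]
type=friend
host=east.provider.com
insecure=port,invite

[POP2-in]
type=friend
host=west.provider.com
insecure=port,invite
"""

print(shared_hosts(SAMPLE))
```

Two different hostnames can still resolve to the same address, so a clean result here does not rule out an overlap; it only catches the obvious case.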

cynjut · May 5, 2016, 7:54pm

Double check your ‘peer’ types:

“Peer” is for inbound
“User” is for outbound
“Friend” is bi-directional.

If you used “Friend” as the type (because it will almost always work) instead of using “Peer”/“User” you might find yourself in the jam you’re describing.

SIP Type Information

simplydrew · May 5, 2016, 7:56pm

Aha! Thanks, Dave. That might be the smoking gun. Both my outbound and inbound peers have type=friend.

Hopefully changing the outbound ones to “user” and inbound ones to “peer” won’t break anything. I’ll give it a shot tonight and see.

cynjut · May 5, 2016, 8:11pm

One catch - I can never remember which is which. “Peer” and “User” are both completely opaque to me.

When I set up connections, I almost always use “Friend” connections and just don’t set up incoming calls. Yes, dear reader, I too am lazy as f*ck sometimes.

Please look it up somewhere.

simplydrew · May 5, 2016, 8:30pm

No worries

I’m having a little bit of trouble, so far, testing this out on my local on-prem test system, which also has a connection to the provider. Changing the outbound to the “user” configuration seems to bork it with:

[2016-05-05 16:13:42] NOTICE[4316][C-00000007]: chan_sip.c:23121 handle_response_invite: Failed to authenticate on INVITE to ‘sip:[email protected];tag=as6cb089f1’
– SIP/east.provider.com-00000082 is circuit-busy

I think I may need to restructure the config and see what I can find out. Or, maybe they won’t accept the “user” configuration. I’ll see about also changing just the inbound ones and see if that makes a difference.

Will do on the lookup - what I’ve found so far isn’t clear online either
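
For anyone looking this up later, here is a simplified model of how incoming INVITEs are matched, based on my reading of the comments in chan_sip's sample sip.conf. There, `type=user` entries only authenticate incoming calls and are matched on the From username, `type=peer` entries are used for outgoing calls and are matched on incoming calls by source IP, and `type=friend` is both. That is worth checking against the Peer/User mapping given earlier in the thread. The `Entry` class, the `match_inbound` function, and the addresses are all hypothetical, and the real matching has more rules (port, `insecure` settings, registration) than this sketch covers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    name: str                 # section name in sip.conf
    kind: str                 # "user", "peer" or "friend"
    ip: Optional[str] = None  # address the host= value resolves to


def match_inbound(entries: List[Entry], from_user: str, source_ip: str) -> Optional[str]:
    """Return the name of the entry an incoming INVITE would be attributed to."""
    # Step 1: user-side entries are matched on the username in the From header.
    for entry in entries:
        if entry.kind in ("user", "friend") and entry.name == from_user:
            return entry.name
    # Step 2: peer-side entries are matched on the source IP address.
    for entry in entries:
        if entry.kind in ("peer", "friend") and entry.ip == source_ip:
            return entry.name
    return None


# Hypothetical setup: two type=friend entries that resolve to the same POP address.
entries = [
    Entry("POP1-out", "friend", "192.0.2.10"),
    Entry("POP1-in", "friend", "192.0.2.10"),
]

# The From user is a caller's number, so only the IP match in step 2 can succeed.
print(match_inbound(entries, "4015550100", "192.0.2.10"))
```

In this sketch the first entry in the list wins, which is an artefact of the model. Which entry a real system picks when two share an address is not modelled here, but the ambiguity itself fits the “Found peer ‘POP1-out’” line in the trace.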