Weird PJSIP bug that killed me today

I have been Google-Fu-ing to no avail. This morning my carrier had a data center go down, and we should have failed over cleanly to another center using the SRV records that we were registered to - but we didn’t. Looking at packet traces and such, here is the list of centers in the SRV record:

Identify: customer.vendor.cloud
Match: xxx.xxx.251.41/32
Match: xxx.xxx.251.43/32
Match: xxx.xxx.250.230/32
Match: xxx.xxx.250.43/32

The first two IPs are at the dead data center; the next two are alive and kicking.

But here is the problem - PJSIP registers with the first available gateway (xxx.xxx.250.230/32), but it insists on sending the Qualify packets to the first (dead) server, xxx.xxx.251.41/32 - so incoming works, but outgoing fails.

The solution is to configure an outbound proxy in Advanced Settings (sip:xxx.xxx.250.230); then it registers and qualifies, and outbound calls proceed.

Does anybody know when this will work properly? It totally kills the redundancy of the SIP trunk and causes an outage.

It seems crazy to me that PJSIP has this bug…

I’d suggest filing an Asterisk issue[1]. I found no existing issues reported for such a thing.

From a “how it works” perspective, each SIP request results in resolving the hostname down to a list of targets (IP address + port + transport). The request goes out. It fails after a period of time (or instantly in the case of TCP/TLS). On failure, the request has to invoke a bit of failover logic to move on to the next entry in the list. OPTIONS may not be invoking that logic. I don’t have a time frame for when it would get looked into and resolved.
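
To make the expected behaviour concrete, here is a minimal Python sketch of the "move on to the next entry" logic described above. It is not how PJSIP is implemented; `Target`, `send_with_failover`, and the `send` callable are hypothetical names, and the addresses are documentation-range placeholders rather than the carrier's real IPs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Target:
    host: str       # resolved IP address
    port: int
    transport: str  # "udp", "tcp" or "tls"


def send_with_failover(
    targets: Iterable[Target],
    send: Callable[[Target], bool],
) -> Optional[Target]:
    """Try each resolved target in order and return the first that answers.

    `send` is a hypothetical callable: True means the target responded,
    False means the request timed out or the connection failed.
    """
    for target in targets:
        if send(target):
            return target
    return None  # every target in the list failed


# Placeholder addresses (documentation ranges), not the real carrier IPs.
targets = [
    Target("192.0.2.41", 7000, "udp"),     # dead data center
    Target("192.0.2.43", 7000, "udp"),     # dead data center
    Target("198.51.100.41", 7000, "udp"),  # alive
    Target("198.51.100.43", 7000, "udp"),  # alive
]
dead = {"192.0.2.41", "192.0.2.43"}

# Simulate the outage: requests to the dead hosts never get a response.
chosen = send_with_failover(targets, lambda t: t.host not in dead)
print(chosen)
```

If OPTIONS requests skipped this loop and only ever tried the first target, the symptom would match what was reported: the Qualify keeps going to the dead server.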

[1] Issues · asterisk/asterisk · GitHub

Issue Created:

[bug]: PJSIP not following SRV record properly resulting in degraded performance (Outage) · Issue #1927 · asterisk/asterisk

Let’s see your SRV records for all these IPs. Just want to make sure the priority and weights of the records are correct for a failover setup.

I use SRV and I’m not having these problems. Then again, a REGISTER contact doesn’t have SRV in use anyways (unless an outbound proxy is set and using it)

Here is a sanitized version of the SRV record - it’s right:

root@pbx-xxxx:~# dig SRV _sip._udp.host.cloud.provider

; <<>> DiG 9.18.47-1~deb12u1-Debian <<>> SRV _sip._udp.host.cloud.provider
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10171
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;_sip._udp.host.cloud.provider. IN SRV

;; ANSWER SECTION:
_sip._udp.host.cloud.provider. 3600 IN SRV 10 50 7000 sbc001.z1.las.myvoip.provider.
_sip._udp.host.cloud.provider. 3600 IN SRV 20 50 7000 sbc002.z1.las.myvoip.provider.
_sip._udp.host.cloud.provider. 3600 IN SRV 30 50 7000 sbc001.z1.grr.myvoip.provider.
_sip._udp.host.cloud.provider. 3600 IN SRV 40 50 7000 sbc002.z1.grr.myvoip.provider.

;; Query time: 7 msec
;; SERVER: xxx.xxx.xxx.xxx#53(xxx.xxx.xxx.xxx) (UDP)
;; WHEN: Tue May 12 13:58:55 MDT 2026
;; MSG SIZE rcvd: 248
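
For anyone checking their own records, here is a rough Python sketch that pulls the SRV answers out of dig output like the above and lists them in the order a client should try them (lowest priority number first, per RFC 2782). `parse_srv_answers` and `failover_order` are made-up helper names, and the sketch does not implement the weighted random selection that RFC 2782 describes for records sharing the same priority; that does not matter here because all four priorities differ.

```python
from typing import List, NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def parse_srv_answers(dig_output: str) -> List[SrvRecord]:
    """Extract SRV records from the answer lines of dig output."""
    records = []
    for line in dig_output.splitlines():
        if line.startswith(";"):
            continue  # skip dig comments and the question section
        fields = line.split()
        # Answer lines look like: name TTL IN SRV priority weight port target
        if len(fields) == 8 and fields[2] == "IN" and fields[3] == "SRV":
            records.append(
                SrvRecord(
                    priority=int(fields[4]),
                    weight=int(fields[5]),
                    port=int(fields[6]),
                    target=fields[7].rstrip("."),
                )
            )
    return records


def failover_order(records: List[SrvRecord]) -> List[SrvRecord]:
    """Sort by priority only; lower numbers are tried first."""
    return sorted(records, key=lambda r: r.priority)


# The four answers from the dig output, deliberately shuffled.
answers = """\
_sip._udp.host.cloud.provider. 3600 IN SRV 30 50 7000 sbc001.z1.grr.myvoip.provider.
_sip._udp.host.cloud.provider. 3600 IN SRV 10 50 7000 sbc001.z1.las.myvoip.provider.
_sip._udp.host.cloud.provider. 3600 IN SRV 40 50 7000 sbc002.z1.grr.myvoip.provider.
_sip._udp.host.cloud.provider. 3600 IN SRV 20 50 7000 sbc002.z1.las.myvoip.provider.
"""

ordered = failover_order(parse_srv_answers(answers))
for record in ordered:
    print(record.priority, record.target, record.port)
```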

Per Joshua’s instruction I have opened an Issue on the Asterisk Tracker - it should get fixed at some point…

And it works right if you have the outbound proxy set, correct? Or did I misread that?

Yes - setting the OB proxy to the working server (sip:xxx.xxx.xxx.xxx) makes it work fine. But I found out the hard way this morning that once the primary servers come back online, having it as the forced OB proxy sometimes breaks things again if the secondary servers change, which happened for 2 customers. I just had to pop on and remove it, but calls could not go out until I did.

It’s a work-around, but not a good one. Joshua gave me a link to the SIP spec that they program to, and it specifically says that SIP should step through the list in the SRV record until it gets a valid response to the Qualify - it’s just stopping after the first server on the list.

Weird bug, and it’s only a problem because the carrier’s Data Center went down forcing the fail-over - it worked for Inbound, just not Outbound.

This is my 5th carrier - True Seamless Fail-Over seems very elusive - no one in the past has been able to have it work correctly, but maybe this bug was the problem all along - I don’t know.

I would love to see a packet trace from a carrier that actually worked - I am thinking that because of this bug, they must do it differently - maybe regenerate the SRV record dynamically putting the (working) servers first on the list - That would bypass this bug.

So in the SIP settings, you have the SIP port set to 0 or blank?

Yes. Blank per instructions.

A work-around to this problem - it’s not elegant, but it works!

All of my FreePBX systems are in the cloud, so I had to do three things to make this bug stop happening:

  1. Look at the SRV record under Reports → Asterisk Info and get the list of IPs specified in the record:

xxx.xxx.251.41/32
xxx.xxx.251.43/32
xxx.xxx.250.41/32
xxx.xxx.250.43/32

  2. Add them as trusted IPs to your firewall.

  3. Turn off Qualify on the trunk, and then set the Registration Timeout down to 5 minutes (300 seconds).

And there you go - because the trunk is not being qualified, the PJSIP bug of sending the Qualify to the wrong server (not the one it is registered to) never comes up. It works!