Random issues with phones not updating BLF & stopping ringing when call is cancelled

Good afternoon. Over the past month, we’ve been trying to track down a problem with phones, first on a migrated system (fresh build, then restore of a v15 backup), then on one built from scratch (no restore, all fresh, as if it were a new system). BOTH systems start to have BLF update issues in short order (the phones no longer get updates on the status of various BLFs they’re subscribed to). Phones will also continue to ring in a ring group after the call has actually been answered. This affects multiple brands of phones, with at least 2 different software versions per brand.

To me, both of these are symptoms of the same problem: missed signals from the system to the phone (so the phone holds the last state). The issue seems to happen more when the system is handling more calls (more traffic), but the system’s max CPU has been no more than 25% (measured on the actual host it’s running on, not from the dashboard stats, which show under 5%).

We’re also seeing some issues with phone registrations being maintained (again, also worse with more traffic). The “Users Online” graph looks like a southwestern skyline (jagged mountains/hills), rather than a nice Oklahoma field (flat line). At times, we’ll also see PJSIP report a new registration as 600s (what we’ve set as our max reg), but throughout the day, you’ll start to see that reg time decrement down to 542s, 318s, 142s, eventually going to the MIN reg time of 60s.

The system is in the Azure cloud, and there are others w/o issue (this system didn’t have issues for months, until it did), and we have a support ticket open with FPBX, but thus far have not been able to work through the issue.

One thought I had (after exhausting many others): in looking through the config files manually, I’ve noticed a couple of items that I did not put in there and that really shouldn’t be there.

pjsip.transports.conf:
tos=cs3
cos=3

pjsip.endpoint.conf:
(on a per-extension basis)
tos_audio=ef
tos_video=af41
cos_audio=5
cos_video=4

Long ago, in CHAN_SIP, I had items similar to this (but with different values) in the custom fields. We learned at some point that those settings don’t work well in the cloud (sometimes marked packets were actually dropped rather than having the marking ignored), so we found that removing them worked better. However, I see no place to remove them from the config under SIP Settings. I have no idea how they got there, and if they’re a static element, I’m wondering if a module update made them appear (which would explain why this is a recent issue).
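For reference, the named values in those lines are DSCP code points, and the byte that actually lands in the IP header is the DSCP value shifted left by two bits. A quick sketch of that arithmetic, handy when matching the config against a packet capture (the DSCP numbers are the standard ones; the function name is only for illustration):

```python
# Standard DSCP code points for the names that appear in the generated config.
DSCP = {"cs3": 24, "af41": 34, "ef": 46}

def tos_byte(name: str) -> int:
    """Return the IP TOS/DS byte for a DSCP name (DSCP occupies the top 6 bits)."""
    return DSCP[name.lower()] << 2

for name in ("cs3", "ef", "af41"):
    print(f"{name}: DSCP {DSCP[name]} -> TOS byte 0x{tos_byte(name):02x}")
```

So a capture showing a TOS byte of 0x60 on signalling corresponds to tos=cs3, and 0xb8 on RTP corresponds to tos_audio=ef.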

How can I stop them from being generated (so I can test whether this is actually part of the problem)? I see no place where they were added in the first place. It seems odd that such things would be forced into the config when they may not be compatible with the environment you’re running in.

The other issue we discovered is that the system seems to occasionally send a batch of BLF updates within a single NOTIFY message. Based on traces of an event when this starts to go downhill, it almost seems as if that is what makes the phones go out of sync (in one message, there were 3 BLF updates). It’s almost as if on occasion it’s aggregating the messages - anyone know how to disable it from doing that? The setting on the extension seems to only apply to MWI aggregation, not BLF.

Also, if anyone has a similar issue, please let me know.

Asterisk v16.13.0, as well as 16.11.0, and 16.9.1, using PJSIP with TCP Signalling for endpoints, and UDP signalling for trunks.

Thanks in advance.

Well, for starters, try switching your endpoints to UDP. TCP requires too much overhead that doesn’t scale well on a PBX…mostly because of how TCP holds open connections for each sender/receiver pair to allow for those data sessions. UDP is “connection-less,” requiring far fewer resources (and bandwidth) per client. While it may have been true 5 years ago that TCP was a more stable way to go for NAT’d SIP clients, the internet has evolved by leaps and bounds since that was a problem and UDP will work fine in almost all cases.

Well, I would, but a few things about that:
–TLS (being delivered via TCP) is where we’re headed for security reasons…
–We migrated from UDP to TCP months ago on all our systems because we used to have issues with not only NAT, but certain internet providers would sometimes randomly seem to block calls when in UDP, but calls could progress in TCP (this migration eliminated that problem). That migration worked fine for months, and we’ve had no addition in extensions. That was an initial “test” to validate an ultimate migration to TLS would work.
–While using UDP, we had random BLF updates also lost, another reason we moved to TCP, which seemed to eliminate those issues for months.
–Nothing on the system indicates it’s being hammered. It’s got 100 phones, it averages minimal CPU usage, and network statistics show no lag at all.
–And lastly, we tried switching back for a couple days - it didn’t help. We also tried going back to CHAN_SIP, which helped the registration graphs, but didn’t help the BLFs and random dereg.

Also, found how to edit COS/TOS markings on the outbound:

pjsip_custom_post.conf

[0.0.0.0-tcp](+)
tos=0x00
cos=0

[0.0.0.0-udp](+)
tos=0x00
cos=0

[0.0.0.0-tls](+)
tos=0x00
cos=0

Confirmed this did the job on the markers, but sadly, no impact on the actual problem.

So, it seems the issue might actually be a bug. It is similar to the MWI NOTIFY storm that was happening pre-16.3, and it apparently happens with BLFs as well. Something that seems to reduce it a bit is to set MWI Subscription Type to UNSOLICITED (I know, what does MWI have to do with BLF, but it seems to help a little).

As you can see in the trace (below), it’s sending (only very randomly) a batch of BLF updates all in one packet, causing the phone to parse only the last entry and lose the previous two status updates. From there, the phone loses track and keeps going downhill, eventually re-registering to purge.

Unfortunately, UNSOLICITED didn’t help that much. It seemed to reduce the issue, but eventually it still fails, so it looks like this is going to need to be patched.

This is due to the use of a stream based transport, TCP. TCP can place multiple SIP messages into a single TCP packet, unlike UDP which is message based and thus each one holds a single SIP message. It is up to the receiver of it to properly handle this by parsing and handling each SIP message held within.
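To illustrate what "parsing and handling each SIP message held within" means on the receiving side, here is a minimal Python sketch of stream framing: it walks a receive buffer and cuts it into messages using each message's Content-Length header. This is not code from Asterisk, PJSIP, or any phone firmware. The function names, the address, and the placeholder bodies are made up for illustration, and header folding and malformed input are not handled.

```python
def split_sip_messages(buffer: bytes):
    """Split a TCP receive buffer into complete SIP messages.

    Returns (messages, remainder); remainder is any trailing partial message
    that should be kept until more bytes arrive.
    """
    messages = []
    while True:
        buffer = buffer.lstrip(b"\r\n")        # skip CRLF padding between messages
        head_end = buffer.find(b"\r\n\r\n")
        if head_end < 0:
            break                              # headers not complete yet
        headers = buffer[:head_end].decode("utf-8", errors="replace")
        body_len = 0
        for line in headers.split("\r\n")[1:]:  # skip the request/status line
            name, _, value = line.partition(":")
            if name.strip().lower() in ("content-length", "l"):
                body_len = int(value.strip())
        total = head_end + 4 + body_len
        if len(buffer) < total:
            break                              # body not fully received yet
        messages.append(buffer[:total])
        buffer = buffer[total:]
    return messages, buffer


def notify(entity: str, state: str) -> bytes:
    """Build a toy NOTIFY; the body is a placeholder, not real dialog-info XML."""
    body = f"{entity} is {state}".encode()
    head = ("NOTIFY sip:100@192.0.2.10 SIP/2.0\r\n"
            "Event: dialog\r\n"
            f"Content-Length: {len(body)}\r\n\r\n").encode()
    return head + body


# Three BLF updates arriving back to back in one read, as in the trace described.
segment = notify("201", "confirmed") + notify("202", "terminated") + notify("203", "early")
msgs, rest = split_sip_messages(segment)
print(len(msgs), "messages;", len(rest), "bytes left over")
```

A receiver that instead assumes one read equals one message would act on only one of the three updates, which matches the symptom of BLF state drifting out of sync.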

We just heard a similar indication back on the ticket we’ve got open. But it seems to me there needs to be a provision to disable this consolidation if the clients/endpoints don’t support it yet. What’s really odd is how sporadically it seems to do this consolidation. I get the intent (similar to the efficiency of an RLS), but if it’s causing problems, I would say it needs to be something that can be disabled by the system admin.

The consolidation isn’t happening in Asterisk or PJSIP, but at the TCP stack level of the operating system. I’m not aware of any mechanism we can use to explicitly toggle such behavior on and off.

As well, the packets are likely all being generated and given to the TCP stack at nearly the same time, so it is combining them.

Interesting. I wonder why this only recently became an issue then, and why switching to UNSOLICITED has seemed to help a bit with phones not going out of sync anywhere near as much.

That reduces the amount of state the phone has to keep, thus reducing memory usage. Entirely possible that there is an off-nominal memory leak in the phone in this scenario.

Multiple phones, multiple brands. :(

What may appear to be different brands can ultimately be the same underlying SIP stack and implementation. I don’t know for any specific ones, though.

Otherwise, I can’t say I’ve seen any other reports like this on the Asterisk side, so if there is an issue, the combining of the SIP packets may not be the problem. Working with a phone vendor to determine what is going on with the endpoint would be best; otherwise it is all guesswork from the outside.

We have tickets open with the phone manufacturers as well. The thing that got me was that the phone firmware didn’t change between the old/new systems, but the problem appeared with a migration from one system to another. Thinking it was corruption, we built a system fresh (no restores), and the issue still presented itself. :(

MWI and BLF are effectively the same, despite being handled in Asterisk by separate libraries/macros. They both employ the SUBSCRIBE/NOTIFY mechanism to operate. The only real difference is that MWI doesn’t require explicit subscriptions: a “subscription” of sorts is created on registration, and NOTIFY messages are sent to extensions that have specified that VM number in their mailbox= line.

So, because an endpoint typically only registers one mailbox number, a MWI change will likely only generate a single NOTIFY to the endpoint, regardless of stack used (unless other BLF NOTIFY messages happen to be destined for the endpoint at the same exact time). But, because an extension can have endless BLFs SUBSCRIBEd on the server, this is more likely to result in the consolidation of messages into a single TCP packet, as @jcolp suggests.

This also explains why there is some improvement when you alter the MWI Subscription type options and such…this would potentially affect how all NOTIFY messages are being handled, not just MWI messages.

Ah, gotcha. I knew the two used a similar mechanism (NOTIFY), but was thrown off by the “MWI” wording in the FPBX config. That does bring up an interesting question though…

Extension>Advanced>Aggregate MWI - might this (in theory) then also aggregate BLF events (presumably under UDP only, though)? I wonder…

(That was actually a question I asked a year or two ago when I saw that item, hoping it might also apply to BLFs - at the time, hoping so, as the optimization sounded desirable).

That option does not aggregate BLF events in UDP or any other transport within Asterisk/PJSIP. The option aggregates multiple mailboxes within Asterisk into an aggregated count, which is then sent as a single SIP NOTIFY.
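As a rough illustration of what an "aggregated count" means here, the sketch below sums the new/old counts of several mailboxes and formats one message-summary body of the kind carried in an MWI NOTIFY. It is not Asterisk's implementation. The mailbox names, the function name, and the fixed (0/0) urgent counts are assumptions made for the example.

```python
def aggregate_mwi(mailboxes: dict[str, tuple[int, int]]) -> str:
    """Sum (new, old) counts across mailboxes into one message-summary body."""
    new = sum(n for n, _ in mailboxes.values())
    old = sum(o for _, o in mailboxes.values())
    waiting = "yes" if new > 0 else "no"
    # Urgent counts are hard-coded to 0/0 in this sketch.
    return (f"Messages-Waiting: {waiting}\r\n"
            f"Voice-Message: {new}/{old} (0/0)\r\n")

# Two hypothetical mailboxes watched by one endpoint: (new, old) message counts.
body = aggregate_mwi({"100@default": (2, 5), "199@default": (1, 0)})
print(body)
```

The point is that the aggregation combines mailbox counts into one body; it does not batch unrelated NOTIFY messages such as BLF state changes.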


Gotcha. Thanks - was always curious.
