The caller experience is this: you call an extension, the other end picks up, you hear garbled audio or no audio at all for several seconds, and then the call continues normally. Both parties tend to experience the interruption.
I took a pcap and analyzed it with Wireshark's VoIP/RTP analysis tools. They show a skew appearing at the moment the phone is picked up: the ringing is part of the audio capture, and when the called party answers, the skew shoots up to a value that stays very near 21080251 ms. Do the math and that is roughly 6 hours; I'm in the Central time zone (UTC-6), so this sets off alarm bells that something is mangling the timestamps sent with these packets.
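To make the "do the math" part explicit, here is the conversion; the only inputs are the number Wireshark reported and a hypothetical 8 kHz codec clock (G.711-style) for comparison:

```python
# Convert the reported skew to hours
skew_ms = 21080251
print(f"{skew_ms / 1000 / 3600:.2f} hours")   # ~5.86 hours, suspiciously close to a UTC offset

# For comparison: a 6-hour wall-clock offset expressed in 8 kHz RTP timestamp ticks
# (assuming a G.711-style 8 kHz clock) would be:
print(6 * 3600 * 8000)                        # 172,800,000 ticks
```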
After several seconds, the jitter levels out and descends to the 100ms range, then far below (small building, gigabit network). I’ll attach the picture.
What is causing this skew/jitter? It seems like the packets are coming through with the wrong time sequence, and that is causing garbled audio on a regular basis. Is this FreePBX 17 / Asterisk 20, or our Grandstream 2135/2670 phones? I'm not sure where to look to correct something like this.
I've been reading the RTP RFC definitions, and after looking I guess I don't understand why we have issues with this: RTP packets carry sequence numbers, so as long as they're reassembled on the other end strictly by those sequence numbers, everything should be fine. There is also a timestamp on each packet, which I gather is what Wireshark (at least) uses to measure jitter and skew.
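For anyone else reading along, this is roughly how that timestamp feeds the jitter number, per RFC 3550 section 6.4.1. A simplified sketch, assuming an 8 kHz RTP clock (G.711) and a list of (arrival time, RTP timestamp) pairs pulled from a capture:

```python
# Simplified RFC 3550 interarrival-jitter estimator (assumes an 8 kHz RTP clock, e.g. G.711).
# 'packets' is a list of (arrival_time_seconds, rtp_timestamp) pairs taken from a capture.
RTP_CLOCK_HZ = 8000

def interarrival_jitter(packets):
    jitter = 0.0
    prev_arrival, prev_ts = packets[0]
    for arrival, ts in packets[1:]:
        # D(i-1, i): difference between arrival spacing and timestamp spacing,
        # expressed in RTP timestamp units
        transit_delta = (arrival - prev_arrival) * RTP_CLOCK_HZ - (ts - prev_ts)
        jitter += (abs(transit_delta) - jitter) / 16.0   # RFC 3550 smoothing factor
        prev_arrival, prev_ts = arrival, ts
    return jitter / RTP_CLOCK_HZ * 1000                  # back to milliseconds

# A sudden jump in the RTP timestamp (e.g. the stream changing its timestamp base at answer)
# shows up here as a huge transit_delta, which is exactly the kind of spike in the graph.
```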
I'll also say I've seen values that looked more like a 12-hour difference (11.8) rather than a 6-hour difference (5.9). This only seems to happen on the stream the PBX sends to the phone; the stream from the phone to the PBX does not show these values.
I've been thinking that changing something like the codec or codec negotiation (by callee or caller) might make a difference, or that I could try direct media so the streams are handed off to the endpoints and no packets come from Asterisk except for ringing.
Since I’m not getting much feedback so far, I’ll try this stuff and see if it makes a difference.
Start by checking that your hardware clock and your system clock are synced, and make sure the sync is to UTC and not local time. Then verify that you have a valid NTP source to keep your hardware clock regulated.
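If you want to sanity-check the PBX's clock against an outside source without trusting the local tooling, a quick sketch like this (pure Python stdlib; pool.ntp.org is just an example server, swap in whatever you actually sync against, and outbound UDP/123 has to be open) prints the raw offset:

```python
import socket, struct, time

NTP_SERVER = "pool.ntp.org"    # example server, not necessarily the one you use
NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 (NTP epoch) and 1970-01-01 (Unix epoch)

def ntp_offset(server=NTP_SERVER):
    # Minimal SNTP client request: version 3, mode 3 (client)
    packet = b"\x1b" + 47 * b"\0"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(5)
        t_send = time.time()
        s.sendto(packet, (server, 123))
        data, _ = s.recvfrom(512)
        t_recv = time.time()
    # Transmit timestamp: bytes 40-47 (seconds + fraction, big-endian)
    secs, frac = struct.unpack("!II", data[40:48])
    server_time = secs - NTP_EPOCH_OFFSET + frac / 2**32
    # Rough offset, ignoring network-delay compensation
    return server_time - (t_send + t_recv) / 2

if __name__ == "__main__":
    print(f"Local clock is off by roughly {ntp_offset():+.3f} seconds")
    # Anything near a whole multiple of 3600 seconds suggests a timezone/UTC mix-up,
    # not ordinary drift.
```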
The timestamp is also used by the RTCP code in the receiving entity, to retime packets as they leave the jitter buffers, and to decide when to shrink or grow those buffers. Timestamps should increase by one for every sample time within the stream associated with a given SSRC. Silence can be represented by samples that are never sent but are still counted in the timestamp values.
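To make that concrete (made-up numbers, assuming G.711 at 8 kHz with 20 ms packets):

```python
# Illustration of RTP sequence number vs. timestamp behaviour
CLOCK_HZ = 8000
SAMPLES_PER_PACKET = 160          # 8000 Hz * 0.020 s

seq, ts = 1000, 0
for _ in range(3):                # three back-to-back talk packets
    print(f"seq={seq}  ts={ts}")
    seq, ts = seq + 1, ts + SAMPLES_PER_PACKET

# 500 ms of suppressed silence: nothing is sent, but the timestamp clock keeps counting,
# so the next packet's sequence number goes up by 1 while its timestamp jumps by an extra 4000 ticks.
ts += int(0.5 * CLOCK_HZ)
print(f"seq={seq}  ts={ts}   <- discontinuity the receiver reconciles via the timestamp")
```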
At least in older versions, Asterisk is not good about using different SSRCs for each timestamp sequence, and it misuses the marker bit to indicate discontinuities (marker bits are really hints that this is a good place to shrink or grow jitter buffers, but there are none in your log).
To be sure of what is going on, I think one needs the timestamps across a discontinuity, but I believe the delta represents the arrival-time step, measured against the logging system's own clock, while the jitter peak comes from a very large step in the timestamp. My guess is that there is an unsignalled change of source: the excess delta represents a pause mid-transition, and the jitter comes from an associated change in timestamp source, possibly between an internal source used for call-progress tones and an external source for media being passed through.
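To put rough numbers on that guess: if the skew column is essentially elapsed RTP-timestamp time minus elapsed arrival time (my assumption about how the analysis computes it), one unsignalled jump in the timestamp base is enough to park it at an hours-sized constant. A sketch with invented packet values and an 8 kHz clock:

```python
# Why a single timestamp-base jump parks the skew at a huge, nearly constant value.
# Assumption: skew ~= elapsed RTP-timestamp time minus elapsed arrival time; 8 kHz clock;
# all packet numbers below are invented for the illustration.
CLOCK_HZ = 8000

def skew_ms(first_arrival, first_ts, arrival, ts):
    rtp_elapsed = (ts - first_ts) / CLOCK_HZ    # seconds, according to the sender's timestamps
    wall_elapsed = arrival - first_arrival      # seconds, according to the capture clock
    return (rtp_elapsed - wall_elapsed) * 1000

first_arrival, first_ts = 0.0, 0
# Early media (ringback): timestamps track wall time, so skew stays near zero
print(skew_ms(first_arrival, first_ts, 1.0, 8000))           # ~0 ms

# At answer the sender switches timestamp source; the base jumps by ~6 hours of ticks
# while keeping the same SSRC, so the analysis treats it as one continuous stream.
jump = 6 * 3600 * CLOCK_HZ
print(skew_ms(first_arrival, first_ts, 1.02, 8160 + jump))   # ~21,600,000 ms, the same ballpark as the graph
```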
I don’t know where this is being captured, so I don’t know where the anomaly is being introduced, and whether Asterisk is involved.
I’m sorry I don’t have anything concrete to add at the moment, but your description sounds eerily similar to something I experienced a while back… I’d like to look at the captures I took at the time and see if the same high skew is present. Would you mind telling me where in Wireshark you found that chart listing the packets, jitter, and skew?
If you want to set up an NTP client, feel free to poke me - I've done it many times and can send you a config file that would get it running in a matter of minutes, if you'd like.
Thank you, Dicko. I wasn't sure whether time zone / NTP settings would have much effect on the network hardware, since the packets seem to carry all of that info in their headers.
I found a switch that had NTP enabled and was showing the correct time for my time zone, but was somehow configured with a UTC offset of 0. I'll run another test and report back to the community.
As for OS/time settings, we're using Debian and FreePBX 17 with the commercial Sysadmin module, so we've got that set. Running 'date' on the CLI reports the correct time.
The capture screenshot above was taken from the vm running asterisk.
Thank you for the detailed reply about how RTP/RTCP headers affect packet reassembly and how it’s handled by asterisk! I needed to understand more about how these packets were being affected by their environment.
So yeah, with NTP set time zone shouldn’t matter. I hunted around with netdisco to try and find some unknown network hardware buried in the wall - no dice.
I did notice that one of the switches seems to be connected to all of the clients experiencing the issue; it's a Cisco 2960 that has been up for over 4.5 years. It should be fine, but it seems to be the only common denominator at this point.
Here’s my timedatectl output:
root@pbx:~# timedatectl
Local time: Fri 2025-03-14 08:49:47 CDT
Universal time: Fri 2025-03-14 13:49:47 UTC
RTC time: Fri 2025-03-14 13:49:47
Time zone: America/Chicago (CDT, -0500)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
root@pbx:~#
Still looking at this. We considered some kind of layer 1 issue between our switches. Our current plan is to run fiber between the switches, add another switch, move all the phones to it, and move the hosts to it as well, to eliminate any traffic that has to cross between switches.
However, about 12 hours after rebooting that long-running switch, everything cleared up. A packet capture taken immediately after the reboot still showed the issue, but it has now been 5 days without my being able to gather any evidence of it, where before we were seeing issues multiple times per day.
It looks like a network issue, then, and not a configuration issue. Really appreciate all the input. I'll check back once more after the overhaul and when we finish the final pcaps.