I have a strange situation that, frankly, I’m out of ideas on how to troubleshoot.
We have a PBXact-100 system running in high availability mode, version 13 (the last version that supported the HA module that was sold to us) with 2 physical servers. We have approximately 80 phones, 68 of which are the old Cisco 7960 IP phones flashed with version P0S3-8-12-00 SIP firmware with the remainder being various model Sangoma S-series phones. When putting the system in, I initially wanted to slowly migrate people to more recent IP phones, but pretty much everybody likes the Cisco 7960 phones better than the Sangoma, some managers even having me replace their Sangoma phones with a Cisco 7960. (From my perspective, the Sangoma phones have a LOT more hardware issues per phone than do the old Ciscos, so I’ve had no problem at all with this.) Due to circumstances when I installed the phone system, we also have a considerable number of spare Cisco 7960 phones - at the low hardware failure rate they have, I won’t have to worry about having to buy more phones any time before civilization ends. All of the phones are on their own VLAN with the PCs hanging off the back of the phones being on their own VLAN. The switches are Cisco 3750 providing PoE for all the phones with a Meraki MX64 as the edge device, which also provides routing between VLANs.
A few months ago, our Meraki MX64 edge device had its firmware updated to version 17.10.5 from 17.10.2 to fix an unrelated problem. The day after this happened, the 7960 phones began having this issue. (We know this precise day as the parent of an employee passed and he was off that week; the manager had to work on the floor and noticed it the last day he was covering, which happens to be the day after the firmware update happened.) Unfortunately, nobody told me about this issue until over a month later and Meraki does not allow you to roll back firmware after two weeks of it being applied to the device. I have been in contact with Meraki support and have tried both the 18.107.2 and 17.10.7 firmwares with no impact at all on the issue. We are currently running 17.10.7 as the 17 branch is the ‘gold standard’ for this device according to Meraki support.
The problem is that intermittently, when making an outbound call, you will get 1-2 seconds of normal audio, followed by 1-3 seconds of silence, and then the audio is normal for the rest of the call. It does not matter if you’re on the phone for a minute or for two hours - the audio is clean after the initial few seconds. Inbound calls do not exhibit this issue, nor do outbound calls to local extensions or to the PBX itself. The problem is completely random from what I have been able to tell - sometimes it’ll happen on 3 calls in a row, other times you might make 5 calls before it happens once. The average seems to be that it will happen once about every 4 calls. It is ONLY affecting the Cisco 7960 phones - none of the Sangoma s305, s405, s500, and s705 phones have exhibited this. I did not observe this and had no reports of it happening in the 5 years we’ve been using these phones before the Meraki firmware upgrade to 17.10.5.
To troubleshoot, I first looked at the call recordings on our PBX - they were clean. Then I contacted our SIP trunk provider - they confirmed that there is no gap when the calls hit their server. Then I loaded up a packet sniffer on the PBX, reproduced the problem, transferred the capture file to my PC, reconstructed the RTP streams, and upon listening to them discovered that there’s no gap in the audio. So I moved to the Meraki and repeated the same steps - the audio in the capture done from there is clean as well. Determined to figure out what is going on, I connected a 5-port managed switch at the phone with port mirroring turned on, connected a laptop, began a capture, and reproduced the issue. To my amazement, the reconstructed RTP audio stream is again clean, with no indication of there being any kind of gap or silence on the call! I compared a capture of a clean call to one that had the silence and there was no difference that I could tell. (To me, this would indicate a problem inside the phone itself as it’s receiving audio without a gap, but the Cisco SIP firmware for these 7960 phones hasn’t changed in years, it affected every 7960 we have simultaneously, & I cannot get past the fact that it began the morning after the night the Meraki firmware was updated… seems way too much of a coincidence to be coincidental, even though I cannot think of what the MX64 might be doing to cause such a strange, specific, and device-specific malfunction.)
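Listening to the reconstructed audio would not necessarily reveal a change in the RTP headers themselves, and a phone may react to a header-level discontinuity that a desktop player simply plays through. The following is a minimal sketch of such a check, not something that has been run against these captures. It assumes the UDP payloads of one RTP stream have already been extracted from the capture in order, and that the codec is G.711 with 20 ms packets (160 samples per packet). The function names `parse_rtp_header`, `find_discontinuities`, and `build_packet` are hypothetical helpers made up for this example.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    payload_type: int
    marker: bool
    seq: int
    timestamp: int
    ssrc: int


def parse_rtp_header(data: bytes) -> RtpHeader:
    """Parse the fixed 12-byte RTP header described in RFC 3550."""
    if len(data) < 12:
        raise ValueError("too short for an RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", data[:12])
    if b0 >> 6 != 2:
        raise ValueError("not RTP version 2")
    return RtpHeader(b1 & 0x7F, bool(b1 & 0x80), seq, ts, ssrc)


def find_discontinuities(payloads, samples_per_packet=160):
    """Return (packet index, description) for each header-level change.

    samples_per_packet=160 is an assumption (G.711, 20 ms packets).
    """
    events = []
    prev = None
    for i, raw in enumerate(payloads):
        hdr = parse_rtp_header(raw)
        if prev is not None:
            if hdr.ssrc != prev.ssrc:
                # A new SSRC means the receiver sees a brand-new stream.
                events.append((i, "SSRC changed"))
            else:
                if hdr.payload_type != prev.payload_type:
                    events.append((i, "payload type changed"))
                seq_step = (hdr.seq - prev.seq) & 0xFFFF
                if seq_step != 1:
                    events.append((i, f"sequence step {seq_step}"))
                ts_step = (hdr.timestamp - prev.timestamp) & 0xFFFFFFFF
                if ts_step != samples_per_packet:
                    events.append((i, f"timestamp step {ts_step}"))
            if hdr.marker:
                # The marker bit mid-stream usually signals a new talkspurt.
                events.append((i, "marker bit set"))
        prev = hdr
    return events


def build_packet(seq, timestamp, ssrc, payload_type=0, marker=False,
                 payload=b"\xff" * 160):
    """Build a synthetic RTP packet, used here only to exercise the checker."""
    b0 = 2 << 6  # version 2, no padding, no extension, no CSRCs
    b1 = (0x80 if marker else 0) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", b0, b1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc)
    return header + payload


if __name__ == "__main__":
    # Synthetic stream whose timestamps jump by one extra second at packet 3.
    stream = [build_packet(i, 160 * i + (8000 if i >= 3 else 0), 0x1234)
              for i in range(6)]
    for index, what in find_discontinuities(stream):
        print(index, what)
```

Running the same function over the inbound stream of a clean call and of a call with the silence, both captured at the phone, would show whether the two differ in anything other than the audio payload.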
Finally, I connected a 7960 directly to an open LAN port on the PBX. After enabling the interface, the phone booted, registered, and was able to make calls. Unfortunately, there was no audio either way, most likely due to some kind of firewall issue somewhere. Instead, I connected the phone to the same switch the PBX servers are on, using a switch port that is configured to have both data & voice on the same VLAN, just as the PBX servers do. I tested this phone and made 14 calls that did not have the gap before the 15th had it. Further testing showed that about one out of every 15 calls is average in this configuration. To me, having the problem happen about a quarter as frequently as it does when going through the Meraki from phone->PBX points the finger back at the MX64 as being the source of the issue, but the fact that it happens at all disputes that idea. (Even though calls go through the Meraki when being sent out to our SIP provider’s servers.)
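As a rough sanity check on whether "1 in 15" is really different from "1 in 4" rather than just luck, the arithmetic can be sketched as below. This assumes each call fails independently with a fixed probability, which may not hold here, and the 4-in-60 tally in the second print is a made-up example, not a count from the testing above.

```python
from math import comb


def prob_clean_streak(failure_rate: float, calls: int) -> float:
    """Chance of `calls` consecutive clean calls at a given per-call failure rate."""
    return (1.0 - failure_rate) ** calls


def prob_at_most(failures: int, calls: int, failure_rate: float) -> float:
    """Binomial chance of seeing `failures` or fewer bad calls out of `calls`."""
    return sum(
        comb(calls, k) * failure_rate**k * (1.0 - failure_rate) ** (calls - k)
        for k in range(failures + 1)
    )


if __name__ == "__main__":
    # 14 clean calls in a row, if the true rate were still 1 in 4.
    print(f"{prob_clean_streak(0.25, 14):.4f}")
    # Hypothetical tally: 4 bad calls out of 60, again against a 1-in-4 rate.
    print(f"{prob_at_most(4, 60, 0.25):.6f}")
```

If the first number comes out small, a run of 14 clean calls would be unlikely at the original 1-in-4 rate, which supports the idea that taking the MX64 out of the phone->PBX path changes the frequency without eliminating the problem.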
I’ve sent a Cisco 7960 phone with SIP firmware to someone who has a newer version of FreePBX and he’s going to report back after he’s had a chance to test it in his environment. (Having a different version of the PBX and a different edge device is going to make it difficult to track down what caused any differences, but I’m down to grasping at straws.) No matter what he observes, I honestly don’t know where to go from here. The problem must be inside the 7960 phone because the packet capture done while tapped into the network at the phone itself doesn’t have the silence. At the same time, the problem can’t be inside the phone as it started with a firmware upgrade to the Meraki & wiring a phone directly to a switch on the voice VLAN cuts its frequency to a quarter of what it is going through the Meraki from phone to PBX. Meraki support says that they’ve received no other reports of SIP issues with the firmwares we are on and have basically washed their hands of the problem.
Would be grateful for any suggestions of what to do next.