Very Puzzling Intermittent Retransmission Error

Hey all. I’m at my wit’s end with this one; it’s just so strange! I have one FreePBX server and around 70 phones. The phones are on different subnets, but there are no firewalls or NAT involved.

I have searched high and low for this but I just haven’t found anyone with the same problem, and I don’t know enough about how Asterisk handles situations where SIP messages go unanswered.

So here’s the problem: once or twice a day, for about 2-3 minutes, parking and group pickup do not work at all and BLF lights do not change. At the same time, I get a flood of messages like these:

[2017-08-11 14:49:17] WARNING[1762] chan_sip.c: Retransmission timeout reached on transmission [email protected] for seqno 1 (Critical Response) -- See https://wiki.asterisk.org/wiki/display/AST/SIP+Retransmissions
[2017-08-11 14:49:17] WARNING[1762] chan_sip.c: Retransmission timeout reached on transmission [email protected] for seqno 1 (Critical Response) -- See https://wiki.asterisk.org/wiki/display/AST/SIP+Retransmissions
[2017-08-11 14:49:17] WARNING[1762] chan_sip.c: Retransmission timeout reached on transmission [email protected] for seqno 1 (Critical Response) -- See https://wiki.asterisk.org/wiki/display/AST/SIP+Retransmissions
[2017-08-11 14:49:18] WARNING[1762] chan_sip.c: Retransmission timeout reached on transmission [email protected] for seqno 1 (Critical Response) -- See https://wiki.asterisk.org/wiki/display/AST/SIP+Retransmissions

The seqno varies from message to message; I just happened to pick ones where it is 1.
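For context on what a "retransmission timeout" involves: as far as I understand RFC 3261, a response that must be acknowledged (such as a 200 OK to an INVITE waiting for its ACK) is resent at an interval that starts at T1 and doubles up to T2, and the sender gives up after 64*T1. The sketch below only works through that arithmetic with the RFC default values (T1 = 0.5 s, T2 = 4 s). The function name is made up for illustration, and chan_sip's actual timers and behavior may differ, so treat it as a rough picture rather than what Asterisk does.

```python
def retransmit_schedule(t1=0.5, t2=4.0):
    """Times (seconds after the first send) at which a reliable response
    would be resent under RFC 3261 defaults, plus the give-up deadline.

    Assumption: RFC 3261 default timers; not read from any Asterisk config.
    """
    deadline = 64 * t1          # sender gives up after 64*T1
    times = []
    interval = t1               # first retransmission after T1
    elapsed = 0.0
    while elapsed + interval < deadline:
        elapsed += interval
        times.append(elapsed)
        interval = min(interval * 2, t2)   # double, capped at T2
    return times, deadline


times, deadline = retransmit_schedule()
print("retransmit at (s):", times)
print("give up after (s):", deadline)
```

If that model roughly applies here, each warning would mean the far end stayed silent for about half a minute, which fits an outage of a couple of minutes better than a momentary blip.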

So I’m guessing something is holding up SIP NOTIFY or PUBLISH messages. I thought it might be a routing problem or some other intermittent network issue, but calls in progress continue to work normally: people can still talk and calls do not drop. I have run a packet capture using tcpdump, but the only NOTIFY messages it captured were voicemail updates.

After the two or three minutes, the problem clears right up and everything works great again.

I’m not totally sure, but this seems to happen during times of high call volume. Even then, the CPU load stays under 5-10% and there is plenty of memory still available.
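To pin down whether the warnings really line up with busy periods, one option is to count them per minute and compare the result against call records. This is a minimal sketch, assuming the log lines look like the ones quoted above. The function name is hypothetical, the Call-ID in the sample lines is replaced with a placeholder, and the last sample line is made up purely to show that unrelated lines are skipped.

```python
import re
from collections import Counter

# Captures the timestamp down to the minute from lines shaped like the
# warnings quoted in this post.
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\] WARNING\[\d+\].*?"
    r"Retransmission timeout reached"
)


def timeouts_per_minute(lines):
    """Count 'Retransmission timeout reached' warnings per minute."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input modeled on the warnings above; <call-id> is a placeholder.
sample = [
    "[2017-08-11 14:49:17] WARNING[1762] chan_sip.c: Retransmission timeout "
    "reached on transmission <call-id> for seqno 1 (Critical Response)",
    "[2017-08-11 14:49:17] WARNING[1762] chan_sip.c: Retransmission timeout "
    "reached on transmission <call-id> for seqno 1 (Critical Response)",
    "[2017-08-11 14:49:18] WARNING[1762] chan_sip.c: Retransmission timeout "
    "reached on transmission <call-id> for seqno 1 (Critical Response)",
    "an unrelated line that should be ignored",
]

for minute, count in sorted(timeouts_per_minute(sample).items()):
    print(minute, count)
```

Against a real system, the list would be replaced by the lines of the Asterisk log file (on my install I would expect that to be /var/log/asterisk/full, but that path is an assumption).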

Can anyone help me with this? Thanks in advance!!!