Sip trunk failure every Sunday morning, not re-registering without stoping asterisk

Hi,

I have set up a VM FreePBX 14.0.13.4, and a sip trunk to my local phone provider which is working fine. The pbx is behind a Zyxel firewal which then connects to the router. Every Sunday morning at 04:30 I suppose the phone provider is doing some DSLAM maintenance or firmware upgrades and my SIP trunk goes down and does not re-register until I come to office and stop asterisk for at least 60 minutes and restart it, or stop the pbx’s network or of course reboot the pbx, or reboot my firewall. I have a static IP, NAT Settings as follows:
NAT=yes
IP configuration=Dynamic IP (though changed today that to Static IP)
Override External IP=myhostname (though changed that today to my public IP)

I have setup pbx’s SIP Settings as follows:
Registration Timeout=65
Registration Attempts=0
Minimum Expiry=60
Maximum Expiry=300
Default Expiry=120

Here is what I get in logs:
[2019-07-14 04:31:10] VERBOSE[5271] asterisk.c: Remote UNIX connection
[2019-07-14 04:31:10] VERBOSE[3326] asterisk.c: Remote UNIX connection disconnected
[2019-07-14 04:31:10] VERBOSE[5271] asterisk.c: Remote UNIX connection
[2019-07-14 04:31:10] VERBOSE[3332] asterisk.c: Remote UNIX connection disconnected
[2019-07-14 04:31:41] ERROR[5353] netsock2.c: getaddrinfo(“myhostname”, “(null)”, …): Name or service not known
[2019-07-14 04:31:41] NOTICE[5353] chan_sip.c: Warning: Re-lookup of ‘myhostname’ failed!
[2019-07-14 04:31:45] NOTICE[5353] chan_sip.c: Peer ‘LINE1’ is now UNREACHABLE! Last qualify: 32
[2019-07-14 04:31:55] VERBOSE[5271] asterisk.c: Remote UNIX connection
[2019-07-14 04:31:55] VERBOSE[3343] asterisk.c: Remote UNIX connection disconnected
[2019-07-14 04:32:00] ERROR[5283] netsock2.c: getaddrinfo(“serviceprovider”, “(null)”, …): Name or service not known
[2019-07-14 04:32:00] WARNING[5283] acl.c: Unable to lookup ‘serviceprovider’
[2019-07-14 04:32:13] WARNING[5353] chan_sip.c: Retransmission timeout reached on transmission [email protected][::1] for seqno 5130 (Critical Request)
Packet timed out after 31999ms with no response
[2019-07-14 04:32:56] ERROR[5283] netsock2.c: getaddrinfo(“serviceprovider”, “(null)”, …): Name or service not known
[2019-07-14 04:32:56] WARNING[5283] acl.c: Unable to lookup ‘serviceprovider’
[2019-07-14 04:33:36] VERBOSE[5271] asterisk.c: Remote UNIX connection
[2019-07-14 04:33:36] VERBOSE[3591] asterisk.c: Remote UNIX connection disconnected
[2019-07-14 04:33:42] ERROR[5353] netsock2.c: getaddrinfo(“serviceprovider”, “(null)”, …): Name or service not known
[2019-07-14 04:33:42] WARNING[5353] acl.c: Unable to lookup ‘serviceprovider’
[2019-07-14 04:33:46] ERROR[5283] netsock2.c: getaddrinfo(“myinternalpbx”, “(null)”, …): Name or service not known
[2019-07-14 04:33:46] WARNING[5283] acl.c: Unable to lookup ‘myinternalpbx’
[2019-07-14 04:33:52] NOTICE[5353] chan_sip.c: – Registration for ‘[email protected]’ timed out, trying again (Attempt #2)
[2019-07-14 04:34:24] WARNING[5353] chan_sip.c: Retransmission timeout reached on transmission [email protected][::1] for seqno 5132 (Critical Request)
Packet timed out after 31999ms with no response
[2019-07-14 04:34:29] VERBOSE[5271] asterisk.c: Remote UNIX connection
[2019-07-14 04:34:29] VERBOSE[3733] asterisk.c: Remote UNIX connection disconnected
[2019-07-14 04:34:57] NOTICE[5353] chan_sip.c: – Registration for ‘[email protected]’ timed out, trying again (Attempt #4)
[2019-07-14 04:35:14] VERBOSE[5271] asterisk.c: Remote UNIX connection
[2019-07-14 04:35:14] VERBOSE[3892] asterisk.c: Remote UNIX connection disconnected
[2019-07-14 04:35:29] WARNING[5353] chan_sip.c: Retransmission timeout reached on transmission [email protected][::1] for seqno 5133 (Critical Request)
Packet timed out after 32000ms with no response

Might that be a configuration problem of my pbx, or something is messing with the firewall’s NAT?

Looks like it might be a DNS issue. I’d also check connection timeouts on your firewall.

I have checked the DNS issue, and firewall connections by running a constant ping serviceprovider > logfile

Of course by the time the DSL goes down (provider’s maintenance) at 04:31:41 there is no ping but after the network is up, the ping goes on but trunk does not register.
Also there are other time my DSL internet connection goes down and I get messages from FreePBX (I have set up monit) but when it is re-established the trunk registers succesfully.

I suppose there must be some routine running in the firewall that I cannot think of, keeping asterisk’s UDP connection to the provider active.

I am really stuck, don’t know how to proceed. And it is annoying for the employees to find phones not working every Monday morning.

This shouldn’t be hard to troubleshoot, but as a temporary fix, a cron job that stops Asterisk at 04:45 on Mondays, and another that restarts FreePBX Mondays at 06:00?

As for the temporary fix, if I stop Asterisk for a minute (60 seconds) and restart it then trunk registration is OK. If I restart it in less than 60 seconds then registration keeps failing. I have set up the cron job but then the phones (Yealink T29G) loose the BLF registration and users need to reboot their phones to get BLF working again.

Do you think there might be a trunk setup issue? I suppose not because trunk is working fine and registration is done even after I disconnect the internet and reconnect it.

Just for the temporary fix, found out that I can issue command from asterisk to reboot the phones too.

#Log into the Asterisk CLI
asterisk -r

#send a notify command to a pjsip extension
pjsip send notify reboot-yealink endpoint 103

#send a notify command to a sip extension
sip notify reboot-yealink 103

…need to change reboot to true in sip_notify_additional.conf

[reboot-yealink]
Event=>check-sync\;reboot=true

Can your provider be set up with IP authentication (or otherwise send calls to a hard coded SIP URI), so you don’t depend on registration at all?

Otherwise, just as a guess, try setting Registration Timeout and qualify frequency to 600 (10 minutes). If there is a ‘poisoned’ NAT association in the router or firewall, the typical timeout there is 3 or 5 minutes. If this doesn’t help, confirm with tcpdump that there is no activity, e.g. from qualify that is keeping the bad association alive.

If no luck, tell us some more about your network. Are you double-NATted? Or, is the Zyxel configured as a ‘bridging’ firewall? Or, do you have multiple public IPv4 addresses and the (non NAT) router is handing one to the Zyxel? Finally, what virtualization is being used? Is it bridged networking, or is there something complex happening there, too?

I had the registration timeout to 65s, changed that to 600…
In the trunk I had the qualify=yes, changed that to 600, too.

Will wait to see if it happends again next Sunday, and report back

I’d ask your ISP why they need to take down circuits weekly for maintenance. That is quite insane.

It is the local (national) telecoms provider. Though I have already asked them and they responded that those are all new circuits with new equipment (routers). They do firmware and backbone upgrades that eventually will reach maturity. In fact there were some weeks that they did not any maintenance.

Just to make sure I am ready to inspect with tcpdump if this happens again, by using
tcpdump -nqt -s 0 -A host serviceprovider and port 5060

I can see SIP messages to the specific provider such as REGISTER, OPTIONS, INVITE, ACK, BYE
In case a have a ‘poisoned’ NAT association in my firewall or router, asterisk will still try to register the trunk (every 600s) and tcpdump will show me something different? Should I search for some patern/response that will lead me to the problem?

Instead of changing a default setting, add your own.
go to sip_notify_custom.conf

Add this

[reload-yealink]
Event=>check-sync\;reboot=false

[restart-yealink]
Event=>check-sync\;reboot=true

You can do more than one at a time BTW.

rasterisk -x "pjsip send notify restart-yealink endpoint 103 105 106"

1 Like

When it’s working correctly, you should see:
REGISTER sent
401 UnAuthorized or 407 Proxy Authentication Required received
REGISTER sent with Authorization header
200 OK received

If after the outage you have a bad NAT association, you would expect to see
REGISTER sent
(no reply)
REGISTER resent (identical to first)
(no reply)
(repeated retransmissions until timeout)

If this happens, try to capture traffic at the router to determine if it’s a router or a firewall issue.

1 Like

By firewall issues, I mean “connection tracking.” You can’t test this by pinging; you have to look at the firewall’s NAT / UDP connection table (For example, on Cisco ASA the "show connection address " command). Many firewalls have a terrible time with UDP connection tracking through an ISP or upstream connection outage. I don’t know what your infrastructure options are, but DSL is pretty awful in general for VoIP. I’ve found in situations like these that removing the entry from the firewall’s connection table resolves the issue.

I would agree with most of the posters here that a process reset of some sort is probably your easiest option.

Thatnk for this, I changed the phones’ restart to a single line like

asterisk -rx "pjsip send notify restart-yealink endpoint 103 105 106"

So I changed the Registration Timeout and trunk’s qualify to 600. At 04:30 exactly, Sunday morning, the system lost connection and could not register to the provider. My router shows an active line at that time (has not been down for 20 days) so I pressurme internet was active. Didn’t have the continuous ping in place.
What I found out studying the logs is that the registration process took place up to 04:40 without any success utilizing the same nonce that was given to the system before the disconnection.

So at 04:40 the system got a ‘SIP/2.0 401 Unauthorized’ response and it succesfully registered to the trunk.

So that worked for me, at least for now.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.