How to Diagnose Strange Failure

I have a system running SNG7 Distro. Maybe once a week I will wind up in a situation where none of my phones can register. If I SSH in and running fwconsole reload it just hangs forever. If I run fwconsole restart, it tries to till asterisk for 30 seconds, then it tries to force kill it, then it says “asterisk is still running and we can’t stop it.” After that I do a full system reboot and everything runs normally for a while.

I can’t for the life of me figure out what is causing this. It always seems to happen in the middle of the night. Looking at /var/log/asterisk/full during the problem it looks like asterisk was just restarted (I see the build and version information, then some VERBOSE level messages about finding config files). Nothing shows up in the log when I attempt to register the phones.

Any ideas on what could be causing this or how I should go about debugging it?

Using HTOP them kill -9 will kill asterisk and not require a system reboot.

Are you using chan_sip?

Thanks for the tip. I figured I could manually kill asterisk that way, but I wasn’t sure I could get it back after I did. I’m not 100% sure how FreePBX manages the asterisk process.

I’m using PJSIP only (chan_sip is disabled in Advanced Settings).

https://wiki.freepbx.org/display/SUP/Providing+Great+Debug#ProvidingGreatDebug-Backtraces(Segfaults/CoreDumps/AsteriskCrashing)

Thanks for the link. I don’t currently have any core.* files in /tmp, but I’m guessing they would have been cleared when I rebooted. Do you know if there is a setting that needs to be configured to enable core dump files to be placed in /tmp when asterisk crashes?

It just did it again. This time fwconsole restart was able to restart asterisk, so this wasn’t exactly the same behavior. But, it acted the same in that no phone was able to register. No core.* file in /tmp either, but that is to be expected in this scenario I think.

I cranked up several asterisk logging settings while it was down (core set debug 5, core set verbose 5, pjsip set logger on), and then I ran several registration attempts. Nothing showed in the asterisk logs when I attempted to register a phone.

I connected via ADB to the linphone-android app and saw the following logs when it attempted to register:

05-09 16:00:45.538  4948  4948 I Linphone   : transaction [0x9ab74e60] channel state changed to [RES_DONE]
05-09 16:00:45.538  4948  4948 I Linphone   : channel 0x97923a00: state CONNECTING
05-09 16:00:45.538  4948  4948 I Linphone   : transaction [0x9ab74e60] channel state changed to [CONNECTING]
05-09 16:00:45.539  4948  4948 I Linphone   : Trying to connect to [TLS://::ffff:192.168.1.5:5061]
05-09 16:00:45.626  4948  4948 I Linphone   : Channel [0x97923a00]: Connected at TCP level, now doing TLS handshake with cname=freepbx.server.com
05-09 16:00:45.626  4948  4948 I Linphone   : Channel [0x97923a00]: SSL handshake in progress...
05-09 16:01:39.304  4948  4948 E Linphone   : Could not receive stream packet: Connection reset by peer
05-09 16:01:39.305  4948  4948 E Linphone   : Channel [0x97923a00]: SSL handshake failed : NET - Connection was reset by peer
05-09 16:01:39.305  4948  4948 E Linphone   : Cannot connect to [TLS://freepbx.server.com:5061]
05-09 16:01:39.305  4948  4948 I Linphone   : channel 0x97923a00: state ERROR

So, the SSL handshake is failing because the server is resetting the connection. But, I don’t see any logs about attempted or failed connections on the server side.

Here’s where it gets weird. When I turned on debugging logs in Asterisk I saw a TON of Examining AMI event: type messages. They all seem to be related to the last successful call before the system started failing. When I look at the timestamps from my ITSP I see that the call came around 15:27 local time and ended about a minute later.

In my asterisk logs I see a burst of messages around 15:27 related to the incoming call and then nothing until around 15:55 when I noticed the outage and enabled more verbose debugging. Here is a typical AMI event message:

[2018-05-09 15:55:08] DEBUG[6398] manager.c: Examining AMI event:
Event: RTCPSent
Privilege: reporting,all
Channel: PJSIP/1101-00000262
ChannelState: 6
ChannelStateDesc: Up
CallerIDNum: 1101
CallerIDName: Endpoint Name
ConnectedLineNum: +15558675309
ConnectedLineName: Caller Name Here
Language: en
AccountCode:
Context: macro-dial
Exten: s
Priority: 1
Uniqueid: 1525904848.613
Linkedid: 1525904847.612
To: 192.168.1.10:7077
From: 192.168.1.5:19783
SSRC: 0x1985c1a0
PT: 201(RR)
ReportCount: 1
Report0SourceSSRC: 0xd0b15fff
Report0FractionLost: 0
Report0CumulativeLost: 0
Report0HighestSequence: 3023
Report0SequenceNumberCycles: 0
Report0IAJitter: 8
Report0LSR: 4102905273
Report0DLSR: 1598.1920

So, minutes after the call was supposed to have ended, I’m seeing messages that say the channel is “Up”. Those messages continue until I run fwconsole restart. Stranger still, the extension listed in the RTCPSent event is not the one that the CDR reports show answered the call.

I’m totally stumped. Any ideas?

This just bit me again. It’s the strangest thing. I log in to the FreePBX dashboard and it says all systems are go (green checkmarks across the board). It says that I have the normal number of users and trunks “online” in the status widget. But, no phones can register and no incoming calls are received.

My ITSP shows incoming calls as “No Answer” not “Failed” like it normally would if my PBX was completely down (e.g. powered off).

I run fwconsole restart and it fails to kill asterisk. I then run kill -9 <asterisk pid> and then fwconsole start and we’re back in business.

I realize this kind of thing is almost impossible to debug with weeks/months between occurrences and no actual crash, but if anyone has any ideas I would love to hear them. I just can’t afford for my phone system to randomly drop out for no reason.

It’s a particularly nasty failure mode too because the SIP trunk provider doesn’t even recognize the PBX as being down, so their fallback/disaster recovery mechanisms never kick in.

I’m also facing the same issue.

See this post https://pbxinaflash.com/community/threads/asterisk-is-hanging-crashing-and-i-cant-stop-it-corrupt-database.9118/

Asterisk could behave strangely when there is an issue with DNS. If you have found any solution to this already, please let me know.

I don’t know how you are facing the same issue and can point to that post as proof of your issue. 1) It’s PBX In a Flash which isn’t FreePBX so we have no idea what that dude does to things. 2) It’s a post from EIGHT YEARS AGO. 3) It’s for IAX. So I’m not sure how that post would be relevant to FreePBX 14 running versions of Asterisk that did not exist eight years ago along with the fact you’re using Chan_SIP.

So why don’t you tell us what your actual issues are and we’ll see if they are the same.

It happened again this morning. I’m posting mostly just to keep this open on the off chance someone else has experienced something like this and can help.

I took a look at the debugging output again. This time I still saw some AMI events, but nothing about a previous call. They were all Event: SuccessfulAuth and I saw the exact same type of events continuing after I restarted, so I don’t think they will give any clues into what caused the crash.

I’ve decided that what I need is an external system to regularly ping the server and see if it can register. If it can’t, the script will simply kill asterisk and restart everything. Right now SIPp or pjsua are my top options, but I haven’t had the time to figure out how to use either of them.

Things that will cause this:

  • DNS Failure will cause Chan-SIP to lock up and no longer connect.
  • Your AMI events are almost certainly not the problem.
  • You need to look at the Asterisk log output (/var/log/asterisk/full) and see when the failure occurred.

Almost everything else that can cause this are external to the system, and all of them end up killing the Chan-SIP connection. DNS failures, firewall lock-outs, etc. are some of the culprits. Without more debug, though, it’s going to be you against the world…

Thanks for the insights. I took a look at /var/log/asterisk/full and it’s similar to before. Standard events about handling calls and checking trunks all at the VERBOSE or DEBUG level (no errors or even warnings). Then, absolutely no log entries until I arrive and restart everything.

To give you an idea, I have about 1700 entries in the log file from 04:00 to 16:33 when the outage happened. I have 0 entries from 16:33 to 18:04 when I logged in and started looking at debugging info.

I hadn’t considered a DNS failure impacting this. I use PJSIP exclusively (chan_sip disabled in FreePBX advanced settings). Do you know if the problems you mentioned effect PJSIP as well?

Nope - that’s one of the reasons PJ-SIP exists.

It just happened again right in the middle of the day. This is driving me crazy. The system is stable for months and then out of the blue it locks up with no warning or errors logged.

This time I connected to the Asterisk cli with asterisk -r and ran the following

core show channels verbose
Channel              Context              Extension        Prio State   Application  Data                      CallerID        Duration Accountcode PeerAccount BridgeID
PJSIP/MainTrunk      from-pstn            +15551234567       20 Ring    AGI          /var/www/html/admin/modul +15559876543    00:20:10
1 active channel
1 active call
139 calls processed

The interesting thing to me here is the duration. I had 1 active call that was apparently ringing (no ring was heard on any phone) for 20 minutes. This must be connected, but I can’t see how.

I ran core restart now and that effectively locked up asterisk (no commands generated any output). I exited the cli and tried to reattach but it just froze (Ctrl+C didn’t even work). Then I ran kill -9 $(pidof asterisk) followed by fwconsole restart and I’m up again.

This looks a lot like you are seeing the Asterisk bug recently announced, but that is a very recent issue and would not have been the cause before a few weeks ago. Read about it here:

Interesting. I am on 15.7.2, so that could be related for the most recent failure. But as you mentioned, I have been seeing this problem for a long time (almost a year now).

I might try switching to Asterisk 16. I guess I’ll know in 3-6 months if that solved the problem…

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.