FreePBX | Register | Issues | Wiki | Portal | Support

How to Diagnose Strange Failure

Tags: #<Tag:0x00007fcd1dedb308>

(Dominic) #1

I have a system running SNG7 Distro. Maybe once a week I will wind up in a situation where none of my phones can register. If I SSH in and running fwconsole reload it just hangs forever. If I run fwconsole restart, it tries to till asterisk for 30 seconds, then it tries to force kill it, then it says “asterisk is still running and we can’t stop it.” After that I do a full system reboot and everything runs normally for a while.

I can’t for the life of me figure out what is causing this. It always seems to happen in the middle of the night. Looking at /var/log/asterisk/full during the problem it looks like asterisk was just restarted (I see the build and version information, then some VERBOSE level messages about finding config files). Nothing shows up in the log when I attempt to register the phones.

Any ideas on what could be causing this or how I should go about debugging it?

(ADTopkek) #2

Using HTOP them kill -9 will kill asterisk and not require a system reboot.

Are you using chan_sip?

(Dominic) #3

Thanks for the tip. I figured I could manually kill asterisk that way, but I wasn’t sure I could get it back after I did. I’m not 100% sure how FreePBX manages the asterisk process.

I’m using PJSIP only (chan_sip is disabled in Advanced Settings).

(Itzik) #4

(Dominic) #5

Thanks for the link. I don’t currently have any core.* files in /tmp, but I’m guessing they would have been cleared when I rebooted. Do you know if there is a setting that needs to be configured to enable core dump files to be placed in /tmp when asterisk crashes?

(Dominic) #6

It just did it again. This time fwconsole restart was able to restart asterisk, so this wasn’t exactly the same behavior. But, it acted the same in that no phone was able to register. No core.* file in /tmp either, but that is to be expected in this scenario I think.

I cranked up several asterisk logging settings while it was down (core set debug 5, core set verbose 5, pjsip set logger on), and then I ran several registration attempts. Nothing showed in the asterisk logs when I attempted to register a phone.

I connected via ADB to the linphone-android app and saw the following logs when it attempted to register:

05-09 16:00:45.538  4948  4948 I Linphone   : transaction [0x9ab74e60] channel state changed to [RES_DONE]
05-09 16:00:45.538  4948  4948 I Linphone   : channel 0x97923a00: state CONNECTING
05-09 16:00:45.538  4948  4948 I Linphone   : transaction [0x9ab74e60] channel state changed to [CONNECTING]
05-09 16:00:45.539  4948  4948 I Linphone   : Trying to connect to [TLS://::ffff:]
05-09 16:00:45.626  4948  4948 I Linphone   : Channel [0x97923a00]: Connected at TCP level, now doing TLS handshake with
05-09 16:00:45.626  4948  4948 I Linphone   : Channel [0x97923a00]: SSL handshake in progress...
05-09 16:01:39.304  4948  4948 E Linphone   : Could not receive stream packet: Connection reset by peer
05-09 16:01:39.305  4948  4948 E Linphone   : Channel [0x97923a00]: SSL handshake failed : NET - Connection was reset by peer
05-09 16:01:39.305  4948  4948 E Linphone   : Cannot connect to [TLS://]
05-09 16:01:39.305  4948  4948 I Linphone   : channel 0x97923a00: state ERROR

So, the SSL handshake is failing because the server is resetting the connection. But, I don’t see any logs about attempted or failed connections on the server side.

Here’s where it gets weird. When I turned on debugging logs in Asterisk I saw a TON of Examining AMI event: type messages. They all seem to be related to the last successful call before the system started failing. When I look at the timestamps from my ITSP I see that the call came around 15:27 local time and ended about a minute later.

In my asterisk logs I see a burst of messages around 15:27 related to the incoming call and then nothing until around 15:55 when I noticed the outage and enabled more verbose debugging. Here is a typical AMI event message:

[2018-05-09 15:55:08] DEBUG[6398] manager.c: Examining AMI event:
Event: RTCPSent
Privilege: reporting,all
Channel: PJSIP/1101-00000262
ChannelState: 6
ChannelStateDesc: Up
CallerIDNum: 1101
CallerIDName: Endpoint Name
ConnectedLineNum: +15558675309
ConnectedLineName: Caller Name Here
Language: en
Context: macro-dial
Exten: s
Priority: 1
Uniqueid: 1525904848.613
Linkedid: 1525904847.612
SSRC: 0x1985c1a0
PT: 201(RR)
ReportCount: 1
Report0SourceSSRC: 0xd0b15fff
Report0FractionLost: 0
Report0CumulativeLost: 0
Report0HighestSequence: 3023
Report0SequenceNumberCycles: 0
Report0IAJitter: 8
Report0LSR: 4102905273
Report0DLSR: 1598.1920

So, minutes after the call was supposed to have ended, I’m seeing messages that say the channel is “Up”. Those messages continue until I run fwconsole restart. Stranger still, the extension listed in the RTCPSent event is not the one that the CDR reports show answered the call.

I’m totally stumped. Any ideas?

(Dominic) #7

This just bit me again. It’s the strangest thing. I log in to the FreePBX dashboard and it says all systems are go (green checkmarks across the board). It says that I have the normal number of users and trunks “online” in the status widget. But, no phones can register and no incoming calls are received.

My ITSP shows incoming calls as “No Answer” not “Failed” like it normally would if my PBX was completely down (e.g. powered off).

I run fwconsole restart and it fails to kill asterisk. I then run kill -9 <asterisk pid> and then fwconsole start and we’re back in business.

I realize this kind of thing is almost impossible to debug with weeks/months between occurrences and no actual crash, but if anyone has any ideas I would love to hear them. I just can’t afford for my phone system to randomly drop out for no reason.

It’s a particularly nasty failure mode too because the SIP trunk provider doesn’t even recognize the PBX as being down, so their fallback/disaster recovery mechanisms never kick in.