Consistent Asterisk/FreePBX Crash Issue

stevensedory · October 25, 2017, 12:27am

Thanks Dicko. However, we’ve experienced the issue on 13.15 as well. So it’s hard to decide what’s safe. Go to current 14 version?

Also, our procs are 24 x Intel® Xeon® CPU E5-2620 and 24 x Intel® Xeon® CPU L5640, and both seemed to be supported by CentOS 6.6, so that theory’s out the window.

dicko · October 25, 2017, 12:32am

Personally, I have found that on 13, pretty well anything using cdr-mysql will cause that. Update to ODBC, unload the cdr mysql stuff and maybe . . . . But yes, my machines are quite happy now under 13.18.rc? or 14.6.2 under ProxMox or Vultr (both have been a PITA for a few months).

tm1000 · October 25, 2017, 5:50am

13.17.2-3.shmz65.1.183 is now live, which, as you can see, has BETTER_BACKTRACES.

The same applies to Asterisk 14 as well.

freepbxdev1*CLI> core show settings

PBX Core settings
-----------------
  Version:                     13.17.2
  Build Options:               DONT_OPTIMIZE, COMPILE_DOUBLE, BETTER_BACKTRACES, OPTIONAL_API
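
For anyone who wants to check this across several boxes, here is a minimal sketch that pulls the build options out of `core show settings` output. It assumes the text was captured first (for example with `asterisk -rx "core show settings"`); the function name `build_options` is made up for illustration.

```python
def build_options(settings_text):
    """Return the list of build options found in `core show settings` output."""
    for line in settings_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Build Options":
            # Options are comma separated on a single line.
            return [opt.strip() for opt in value.split(",") if opt.strip()]
    return []


# Sample copied from the output shown above.
sample = """PBX Core settings
-----------------
  Version:                     13.17.2
  Build Options:               DONT_OPTIMIZE, COMPILE_DOUBLE, BETTER_BACKTRACES, OPTIONAL_API
"""

print("BETTER_BACKTRACES" in build_options(sample))
```

An empty list means the "Build Options" line was not found in the captured text, which is worth treating as "unknown" rather than "not enabled".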

stevensedory · October 25, 2017, 6:30am

Great news!

stevensedory · October 25, 2017, 2:32pm

We have run the FreePBX update scripts and can confirm that BETTER_BACKTRACES shows in the build options. Thanks!

stevensedory · October 25, 2017, 2:34pm

Are you referring to the “bad magic number” FRACK Error I mentioned at the top of the thread?

If so, our next crash or FRACK, now that we have BETTER_BACKTRACES enabled, should show that, if it is indeed the cause, correct?

dicko · October 25, 2017, 2:37pm

No, a predictable Asterisk crash on the second ‘core reload’ (I have never seen a FRACK in Asterisk).

stevensedory · November 13, 2017, 1:34am

So we finally had another FRACK. It now appears that the “Serious Network Trouble” error we’ve been having is related. Here’s part of our log:

[2017-11-11 06:03:50] WARNING[8721] chan_sip.c: Unable to cancel schedule ID 0. This is probably a bug (chan_sip.c: do_dialog_unlink_sched_items, line 3266).
[2017-11-11 06:03:50] ERROR[5146] /builddir/build/BUILD/asterisk-13.17.2/include/asterisk/utils.h: Memory Allocation Failure in function ast_str_create at line 655 of /builddir/build/BUILD/asterisk-13.17.2/include/asterisk/strings.h
[2017-11-11 06:03:50] WARNING[5146] chan_sip.c: sip_xmit of 0x7f0428c3af80 (len 139655827686296) to 108.23.78.98:4279 returned -2: Cannot allocate memory
[2017-11-11 06:03:50] ERROR[5146] chan_sip.c: Serious Network Trouble; __sip_xmit returns error for pkt data
[2017-11-11 06:03:50] ERROR[5146] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x7f04286eac38 (0)

More info on Asterisk bug tracker here: https://issues.asterisk.org/jira/browse/ASTERISK-27321
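
A side observation on the log above, offered as a hint rather than a diagnosis: the `len` value in the `sip_xmit` warning is far too large to be a packet length. Converting it to hex suggests it looks like a memory address close to the packet pointer on the same line. The variable names below are only for illustration.

```python
# Values copied from the sip_xmit warning in the log above.
reported_len = 139655827686296
packet_ptr = 0x7F0428C3AF80

# Show the "length" in hex next to the packet pointer.
print(hex(reported_len))
print(hex(packet_ptr))

# Both values share the same upper bits, i.e. they fall in the same address range.
same_region = (reported_len >> 28) == (packet_ptr >> 28)
print(same_region)
```

If that reading is right, the length was taken from a string object that was never properly allocated, which would fit the "Memory Allocation Failure" logged just before it.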

tm1000 · November 13, 2017, 3:13am

You didn’t upload the backtrace to the asterisk ticket. Please ensure you do this.

stevensedory · November 13, 2017, 5:10pm

So there was no core dump file, as Asterisk didn’t crash fully, but FRACKs only. How do I go about getting a backtrace for that?

tm1000 · November 14, 2017, 1:56am

You can’t. Since it didn’t crash, it’s not really related to your original issue.

stevensedory · November 14, 2017, 6:17pm

Did you see the log I attached? Seems to have a lot more info around the FRACKs than previous ones, before BETTER_BACKTRACES was enabled. Do you see anything helpful there?

Particularly here:

[2017-11-11 06:03:50] ERROR[5146] /builddir/build/BUILD/asterisk-13.17.2/include/asterisk/utils.h: Memory Allocation Failure in function ast_str_create at line 655 of /builddir/build/BUILD/asterisk-13.17.2/include/asterisk/strings.h

stevensedory · January 26, 2018, 5:53pm

So we’ve been good since November or so.

Yesterday, seemingly out of nowhere, we had 8,000 FRACKs! And today so far, 4,000!

But sadly, there’s still no useful info. This is what the asterisk cli is showing every several seconds:

[2018-01-26 09:50:48] ERROR[20467]: astobj2.c:131 INTERNAL_OBJ: FRACK!, Failed assertion bad magic number 0x0 for object 0x3fe16a0 (0)
Got 18 backtrace records
#0: [0x607112] asterisk __ast_assert_failed() (0x60708a+88)
#1: [0x45e2c6] asterisk <unknown>()
#2: [0x45e2f3] asterisk <unknown>()
#3: [0x45f5f2] asterisk <unknown>()
#4: [0x45f829] asterisk __ao2_link() (0x45f7e6+43)
#5: [0x45fc9c] asterisk <unknown>()
#6: [0x45ff3f] asterisk __ao2_callback() (0x45fee0+5F)
#7: [0x7fed4997312d] chan_sip.so <unknown>()
#8: [0x7fed49972e6f] chan_sip.so <unknown>()
#9: [0x4dba68] asterisk ast_cli_command_full() (0x4db7f4+274)
#10: [0x4dbbcc] asterisk ast_cli_command_multiple_full() (0x4dbb34+98)
#11: [0x45512a] asterisk <unknown>()
#12: [0x603d14] asterisk <unknown>()

That said, no crashes yet… but calls seem to take a long time to initiate.

Has anything changed to where we can see what is behind the "unknown"s above?
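
Even with the unknown frames, the resolved ones can be pulled out and read in order. The sketch below parses frame lines in the format shown above and separates resolved symbols from unknown ones; the regex and the name `parse_frames` are assumptions based only on this one trace, not on any documented format.

```python
import re

# Matches lines such as: #4: [0x45f829] asterisk __ao2_link() (0x45f7e6+43)
FRAME_RE = re.compile(r"^#(\d+): \[(0x[0-9a-fA-F]+)\] (\S+) (\S+)\(\)")


def parse_frames(text):
    """Return (index, address, module, symbol) tuples; symbol is None if unknown."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            index, addr, module, symbol = m.groups()
            frames.append((int(index), addr, module,
                           None if symbol == "<unknown>" else symbol))
    return frames


# A few frames copied from the trace above.
sample = """#0: [0x607112] asterisk __ast_assert_failed() (0x60708a+88)
#1: [0x45e2c6] asterisk <unknown>()
#4: [0x45f829] asterisk __ao2_link() (0x45f7e6+43)
#7: [0x7fed4997312d] chan_sip.so <unknown>()
#9: [0x4dba68] asterisk ast_cli_command_full() (0x4db7f4+274)
"""

frames = parse_frames(sample)
resolved = [f[3] for f in frames if f[3]]
unknown_modules = sorted({f[2] for f in frames if f[3] is None})
print(resolved)
print(unknown_modules)
```

Read from the bottom up, the resolved frames in the full trace suggest the assertion fired while a CLI command handled inside chan_sip.so was walking an ao2 container (`ast_cli_command_full` to `__ao2_callback` to `__ao2_link`), so whatever is issuing CLI commands every few seconds may be worth a look.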

tm1000 · January 26, 2018, 6:01pm

No. We have followed all of Digium’s recommendations.

stevensedory · January 26, 2018, 6:17pm

Dang…

fetoa · January 26, 2018, 6:20pm

Hi Steven,

Did you update the system recently? Did you change anything in your box?

Regards!

stevensedory · January 26, 2018, 6:30pm

Yes.

We recently read that on Proxmox, which is where we have about 10 FreePBX distro VMs, the default processor type “kvm64” is essentially equivalent to a Pentium 4 in its CPU flag set.

We changed all our VMs to the “host” processor type, as we aren’t doing live migrations. We saw a huge drop in CPU load on all processes within the VMs after we did this.

Probably four hours later, all these FRACKs started, but just on one of the 10 VMs.

As I mentioned near or at the top of this post, we have only had this issue on servers that use TCP for SIP, and a non standard port (not 5060). Our UDP servers have never had the issue. That said, from the several months of research and feedback from experts, that shouldn’t matter. We strongly prefer TCP and the non standard port for security and NAT traversal.

EDIT: Also, eight or so of the other VMs are set up the same way as the one that crashed, and on the same Proxmox host.

CORRECTION: the FRACKs started a little over a day after the processor type change. However, none of the other VMs are having this issue.
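
To see what the processor type change actually exposed to a guest, one option is to compare the `flags` line of `/proc/cpuinfo` captured before and after. This is only a sketch: the two samples below are hypothetical and heavily shortened, not taken from these VMs, and `cpu_flags` is a made-up helper name.

```python
def cpu_flags(cpuinfo_text):
    """Return the flag set of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


# Hypothetical, shortened captures; real flag lists are much longer.
before = "processor\t: 0\nflags\t\t: fpu sse sse2\n"
after = "processor\t: 0\nflags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2 aes\n"

gained = sorted(cpu_flags(after) - cpu_flags(before))
print(gained)
```

Running the same comparison between the affected VM and one of the healthy ones would show whether they really present identical CPUs to the guest.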

stevensedory · January 26, 2018, 9:50pm

So ya, we just keep getting these several times a minute. That said, is there something we can look at or monitor that would help identify the cause, since better backtraces isn’t working?
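
One thing that can be monitored without a core dump is when the FRACKs occur, so that bursts can be lined up against other events (reloads, registrations, CLI commands). The following is a rough sketch that counts FRACK lines per minute; it assumes the timestamp format seen in the log excerpts in this thread, and the log location on a given box (often `/var/log/asterisk/full`) is an assumption to verify.

```python
from collections import Counter
import re

# Timestamp prefix as seen in this thread: [2018-01-26 09:50:48]
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2}) (\d{2}):(\d{2}):\d{2}\]")


def fracks_per_minute(lines):
    """Count log lines containing 'FRACK!' grouped by 'YYYY-MM-DD HH:MM'."""
    counts = Counter()
    for line in lines:
        m = TS_RE.match(line)
        if m and "FRACK!" in line:
            date, hour, minute = m.groups()
            counts[f"{date} {hour}:{minute}"] += 1
    return counts


# Lines copied from the excerpts earlier in this thread.
sample = [
    "[2017-11-11 06:03:50] ERROR[5146] chan_sip.c: Serious Network Trouble; __sip_xmit returns error for pkt data",
    "[2017-11-11 06:03:50] ERROR[5146] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x7f04286eac38 (0)",
    "[2018-01-26 09:50:48] ERROR[20467]: astobj2.c:131 INTERNAL_OBJ: FRACK!, Failed assertion bad magic number 0x0 for object 0x3fe16a0 (0)",
]

counts = fracks_per_minute(sample)
for minute, n in sorted(counts.items()):
    print(minute, n)
```

To run it against a real log, pass an open file object instead of the sample list, since the function only iterates over lines.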

tm1000 · January 26, 2018, 10:10pm

Better Backtraces IS working. It is compiled into Asterisk as described on their wiki and has been cross referenced by two former Digium employees.

stevensedory · January 26, 2018, 10:48pm

So are we alone in ours not showing proper backtrace information? Or is this true for everyone on the FreePBX distro?