PRI is up but D-Channel is down

Callers started getting congestion errors when dialing out on our PRI.
dahdi show status showed:

CLI> dahdi show status
Description                              Alarms  IRQ    bpviol CRC    Fra Codi Options  LBO
Wildcard TE131/TE133 Card 0              OK      226667 0      0      ESF B8ZS          0 db (CSU)/0-133 feet (DSX-1)

All the channels showed in-service, but no calls were able to come in or go out.
After digging through the logs I found:

[2014-08-20 12:54:21] VERBOSE[2230] sig_pri.c: == Primary D-Channel on span 1 down
[2014-08-20 12:54:21] WARNING[2230] sig_pri.c: Span 1: D-channel is down!
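
To see how often the D-channel drops, and at what times, the Asterisk log can be scanned for the warning shown above. This is only a rough sketch in Python: it assumes the log lines have exactly the layout of the two lines quoted here, and the function name and sample text are made up for illustration.

```python
import re

# Matches the WARNING line quoted above. The layout (timestamp in brackets,
# level, thread id, source file) is assumed from that one example.
D_CHANNEL_DOWN = re.compile(
    r"^\[(?P<when>[^\]]+)\]\s+WARNING\[\d+\]\s+sig_pri\.c:\s+"
    r"Span (?P<span>\d+): D-channel is down!"
)


def d_channel_down_events(log_text):
    """Return a list of (timestamp, span number) for each D-channel-down warning."""
    events = []
    for line in log_text.splitlines():
        match = D_CHANNEL_DOWN.match(line.strip())
        if match:
            events.append((match.group("when"), int(match.group("span"))))
    return events


# The two log lines from this post, used as sample input.
SAMPLE_LOG = """\
[2014-08-20 12:54:21] VERBOSE[2230] sig_pri.c: == Primary D-Channel on span 1 down
[2014-08-20 12:54:21] WARNING[2230] sig_pri.c: Span 1: D-channel is down!
"""

if __name__ == "__main__":
    for when, span in d_channel_down_events(SAMPLE_LOG):
        print(when, "span", span)
```

Feeding it the contents of the real log file instead of SAMPLE_LOG would give a list of timestamps to compare against the kernel log.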

Restarting DAHDI and Asterisk didn’t clear the issue. The only way I was able to clear it was to restart the server completely. This is the second time I’ve seen this happen this month. Any idea what could be happening? I was ready to blame the TELCO, but as soon as I reboot, everything clears up.

After rebooting, the only difference I could find was that dahdi show status indicated:

Description                              Alarms  IRQ    bpviol CRC    Fra Codi Options  LBO
Wildcard TE131/TE133 Card 0              OK      0      0      0      ESF B8ZS          0 db (CSU)/0-133 feet (DSX-1)

The IRQ value is now 0; previously it was 226667.

What about IRQ conflicts (interrupt handling) on the system’s PCI bus? Or a misconfiguration of the span (see the span= parameter in /etc/dahdi/system.conf) regarding the clock source (which should, or at least may, be your PRI line provider)?
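
As a rough illustration of what to look for in the span= line, here is a small Python sketch that splits it into fields and reports the timing setting. It assumes the usual field order from the DAHDI sample config (span number, timing source, line build out, framing, coding, optional extras), where a timing value of 0 means the span is not used as a sync source. The example line is hypothetical and not taken from the poster’s file; the function names are made up.

```python
def parse_span_line(line):
    """Parse one 'span=' line in DAHDI system.conf style into a dict.

    Assumed field order: span number, timing source, line build out,
    framing, coding, then any optional extras.
    """
    key, _, value = line.strip().partition("=")
    if key.strip() != "span":
        raise ValueError("not a span= line: %r" % line)
    fields = [field.strip() for field in value.split(",")]
    if len(fields) < 5:
        raise ValueError("expected at least 5 fields, got %d" % len(fields))
    return {
        "span": int(fields[0]),
        "timing": int(fields[1]),
        "lbo": int(fields[2]),
        "framing": fields[3].lower(),
        "coding": fields[4].lower(),
        "extras": fields[5:],
    }


def clock_role(span):
    """Describe where the span takes its clock from, based on the timing field."""
    if span["timing"] == 0:
        return "not a sync source (card does not take clock from this line)"
    return "recovers clock from the line (sync priority %d)" % span["timing"]


if __name__ == "__main__":
    # Hypothetical example line for a T1 PRI, not the poster's actual config.
    example = parse_span_line("span=1,1,0,esf,b8zs")
    print(example["framing"], example["coding"], "-", clock_role(example))
```

On a PRI delivered by a provider, the timing field would normally be non-zero so that the card follows the provider’s clock.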

Reporting back the contents of:

  • /etc/dahdi/system.conf
  • /etc/asterisk/chan_dahdi.conf
  • dahdi_scan (from CLI)
  • dmesg | grep wcte13xp
  • /var/log/messages (when the issue happens)

and the versions of the software components involved (DAHDI at least: dahdi show version) would help forum users diagnose your intermittent issue.

It could also be a problem on your end.

Thanks parnassus,
here are the files you requested:

/etc/dahdi/system.conf
https://www.dropbox.com/s/e2ed9oj0a54kk2j/system.conf.txt?dl=0
/etc/asterisk/chan_dahdi.conf
https://www.dropbox.com/s/k5gofcgnl6o71es/chan_dahdi.conf.txt?dl=0
dahdi_scan (from CLI)
https://www.dropbox.com/s/v0mlzmu1bd91vau/dahdi_scan.txt?dl=0
dmesg | grep wcte13xp
https://www.dropbox.com/s/yqen5qjyz5ctt58/dmesg%20|%20grep%20wcte13xp.txt?dl=0
/var/log/messages (when the issue happens)
https://www.dropbox.com/s/wp1y8ii61sz3hpc/messages.txt?dl=0

What color is the LED on the PRI card? It should be green. If it is red, the line is down. If it is yellow, there’s a configuration problem. If it was working before and it isn’t now, it’s one of two things: the card is bad or the T1 line is bad. In my experience, most problems with previously working T1 lines are at the service provider. Put a T1 loopback connector on and ask your service provider to run a check.

It looks like your system rebooted at Aug 20 14:16:31 and was showing signs of stress for at least an hour prior. Given your previous info, I would investigate your IRQ assignments, maybe with

watch -d cat /proc/interrupts

to look for interrupts shared with your DAHDI device (generally not a good idea), and watch for changes the next time it happens.
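
If watching the screen is not practical, the same check can be scripted. The Python sketch below is only an illustration: it assumes the card’s driver shows up in /proc/interrupts under a name containing wcte13xp (the name used in the dmesg grep earlier in the thread), and that devices sharing an IRQ are listed comma-separated at the end of the line. The sample text is made up.

```python
def devices_sharing_irq(interrupts_text, driver="wcte13xp"):
    """Return {irq_label: [device names]} for lines that mention `driver`
    and list more than one device on the same IRQ."""
    shared = {}
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or driver not in rest:
            continue  # header line, or an IRQ the card is not on
        parts = rest.split(",")
        first = parts[0].split()
        if not first:
            continue
        # The last token before the first comma is the first device name;
        # every further comma-separated part is another device.
        devices = [first[-1]] + [p.strip() for p in parts[1:] if p.strip()]
        if len(devices) > 1:
            shared[label.strip()] = devices
    return shared


# Made-up sample in the style of /proc/interrupts, for illustration only.
SAMPLE_INTERRUPTS = """\
           CPU0       CPU1
  0:         46          0   IO-APIC-edge      timer
 16:     226667          0   IO-APIC-fasteoi   uhci_hcd:usb3, wcte13xp0
 17:       1200          0   IO-APIC-fasteoi   eth0
"""

if __name__ == "__main__":
    print(devices_sharing_irq(SAMPLE_INTERRUPTS))
```

On the real system, pass in the contents of /proc/interrupts. An empty result means the card has its IRQ to itself.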

Your issue looks similar to this one and, as @dicko wrote, there could be a problem with your hardware in terms of IRQ assignments for shared interrupts, or problems caused by the framebuffer (it looks active on your system: “Console: switching to colour frame buffer device 128x48”), as described by Digium here.

Your system (HP ProLiant DL380 G6 with dual Intel Xeon E5504) looks powerful enough, but maybe something is preventing the card’s interrupt handler from running properly, or its interrupts are not being managed reliably.

If you can rule out problems with the T1 line (so the TELCO side is OK and your TE133 is OK), then you can also check your HP System ROM (BIOS): read this (I risk being a little paranoid, but this part looks interesting too: “Addressed an issue where the platform may experience networking issues under heavy workloads with Operating Systems, such as Linux RedHat 6.2, and IRQ Balancing enabled in the Operating System. As a result of this issue, software may lose interrupts, receive spurious interrupts or cause a network disconnect.”) and ensure your system is up to date (your DAHDI is up to date, and your DAHDI firmware is too).

A PRI can’t be up without a D channel; the D channel is part of the signalling.

The T1 that transports the PRI can be up while the PRI itself is down.

Reboots are not a cure. Did you try amportal stop, then service dahdi stop, then amportal start?

My bet is on timing: you are accumulating enough bit errors that Asterisk gives up on the D channel. The B channels are set up call by call, so they would be more tolerant of timing slips.

I worked with Digium and they sent a patch that appears to have cleared the issue. It has been up for 30 days solid now.

I believe we’re having the same issue as munozj. Munozj, please check your private messages; I’m asking you about the patch for the Digium card. Thanks!

I’ve had a few inquiries about the patch. Digium released it on their public repo so here it is:

We have released to the master branch of our public repo a set of firmware updates that address “Latency: Underrun detected by hardware latency at max 20 ms.” (More info here: http://git.asterisk.org/gitweb/?p=dahdi/linux.git;a=commitdiff;h=ae5fa08abd1b898c0c080927e75c7249b3982c2d). The issue addressed is identified by the following message in the kernel log: “I/O error reported by firmware”, followed by continuous underruns and latency bumps.
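
For anyone wanting to check whether they are hitting the same thing, the pattern described above (an I/O error from the firmware followed by underruns) can be looked for in saved dmesg output. The Python sketch below is a guess at a simple check, not something from Digium: it only searches for the two phrases quoted in the message, and the sample lines are made up, so the exact wording in a real kernel log may differ.

```python
def firmware_underrun_signature(kernel_log_text):
    """Return (io_error_seen, underruns_after_it).

    io_error_seen is True if any line mentions the firmware I/O error;
    underruns_after_it counts underrun lines that appear after the first one.
    """
    io_error_seen = False
    underruns_after = 0
    for line in kernel_log_text.splitlines():
        lower = line.lower()
        if "i/o error reported by firmware" in lower:
            io_error_seen = True
        elif io_error_seen and "underrun detected by hardware" in lower:
            underruns_after += 1
    return io_error_seen, underruns_after


# Made-up sample lines for illustration; real kernel log wording may differ.
SAMPLE_KERNEL_LOG = """\
wcte13xp 0000:0e:08.0: I/O error reported by firmware
wcte13xp 0000:0e:08.0: Underrun detected by hardware. Latency bumped
wcte13xp 0000:0e:08.0: Underrun detected by hardware. Latency at max
"""

if __name__ == "__main__":
    seen, count = firmware_underrun_signature(SAMPLE_KERNEL_LOG)
    print("firmware I/O error seen:", seen, "- underruns afterwards:", count)
```

A True result with a growing underrun count would at least be consistent with the issue the firmware update is meant to address.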

This firmware update is for the PCIe cards only, which include the TE131, TE133, TE235, TE435, A4B, and A8B. The PCI versions are unaffected. These firmware updates exist only on the master branch of our public repo.

Could you please install DAHDI from the master branch and confirm that this new software/firmware fixes the problem that you reported.

For more information about how to install from source, see:

https://wiki.asterisk.org/wiki/display/DAHDI/Quick+Start+From+Source

If you experience any issues, please let us know; our support team will be able to assist you.

Original Case Number :
Bug Report: DAHDI-1135