Timing/Interrupt issues under load on Dual Xeon Supermicro board

Fri Dec 29 09:01:45 PST 2006

My company recently purchased several Silicon Mechanics servers with the
Supermicro X7DBR-i+ motherboards. They run the intel 5000P chipset. Two
dual-core 3.0ghz processors in each. 3ware SATA RAID controllers (twa).

I installed 6-STABLE (6.2-PRERELEASE). Under very heavy CPU load on one
or more of the cores (the load is generated by heavy parsing by perl
scripts), the system will start throwing a combination of RAID timeouts
and calcru negative runtime errors. Most of the servers, which have
little-moderate load, have no problems. Here are some examples of the
errors:

Dec 29 10:20:45 butcher-p13 kernel: twa0: ERROR: (0x05: 0x210B): Request
timed out!: request = 0xffffffff881f7610
Dec 29 10:20:45 butcher-p13 kernel: twa0: INFO: (0x16: 0x1108):
Resetting controller...:
Dec 29 10:20:45 butcher-p13 kernel: twa0: INFO: (0x04: 0x0001):
Controller reset occurred: resets=21
Dec 29 10:20:45 butcher-p13 kernel: twa0: INFO: (0x16: 0x1107):
Controller reset done!:
Dec 29 10:20:45 butcher-p13 kernel: twa0: ERROR: (0x05: 0x210B): Request
timed out!: request = 0xffffffff881f36d0
Dec 29 10:20:45 butcher-p13 kernel: twa0: INFO: (0x16: 0x1108):
Resetting controller...:
Dec 29 10:20:45 butcher-p13 kernel: twa0: INFO: (0x04: 0x0001):
Controller reset occurred: resets=22
Dec 29 10:20:45 butcher-p13 kernel: twa0: INFO: (0x16: 0x1107):
Controller reset done!:
Dec 29 10:20:45 butcher-p13 kernel: twa0: ERROR: (0x05: 0x210B): Request
timed out!: request = 0xffffffff881f7400
Dec 29 10:20:45 butcher-p13 kernel: twa0: INFO: (0x16: 0x1108):
Resetting controller...:
Dec 29 10:20:45 butcher-p13 kernel: twa0: INFO: (0x04: 0x0001):
Controller reset occurred: resets=23
Dec 29 10:20:45 butcher-p13 kernel: twa0: INFO: (0x16: 0x1107):
Controller reset done!:
Dec 29 10:20:45 butcher-p13 kernel: twa0: ERROR: (0x05: 0x210B): Request
timed out!: request = 0xffffffff881f59e0
Dec 29 10:20:45 butcher-p13 kernel: twa0: INFO: (0x16: 0x1108):
Resetting controller...:
Dec 29 10:20:45 butcher-p13 kernel: twa0: INFO: (0x04: 0x0001):
Controller reset occurred: resets=24
Dec 29 10:20:45 butcher-p13 kernel: twa0: INFO: (0x16: 0x1107):
Controller reset done!:
---
Dec 29 09:44:15 butcher-p13 kernel: calcru: runtime went backwards from
244314 usec to 236341 usec for pid 24 (irq19: em1 uhci1)
Dec 29 09:44:17 butcher-p13 kernel: calcru: negative runtime of -3744
usec for pid 852 (sshd)
Dec 29 09:44:17 butcher-p13 kernel: calcru: runtime went backwards from
244314 usec to 241130 usec for pid 24 (irq19: em1 uhci1)
Dec 29 09:44:17 butcher-p13 kernel: calcru: negative runtime of -3744
usec for pid 852 (sshd)
Dec 29 09:44:17 butcher-p13 kernel: calcru: runtime went backwards from
244314 usec to 241130 usec for pid 24 (irq19: em1 uhci1)
Dec 29 09:44:19 butcher-p13 kernel: calcru: negative runtime of -3583
usec for pid 852 (sshd)
Dec 29 09:44:19 butcher-p13 kernel: calcru: runtime went backwards from
244314 usec to 241611 usec for pid 24 (irq19: em1 uhci1)
Dec 29 09:44:19 butcher-p13 kernel: calcru: negative runtime of -3583
usec for pid 852 (sshd)
Dec 29 09:44:19 butcher-p13 kernel: calcru: runtime went backwards from
244314 usec to 241611 usec for pid 24 (irq19: em1 uhci1)
Dec 29 09:44:21 butcher-p13 kernel: calcru: negative runtime of -3486
usec for pid 852 (sshd)
Dec 29 09:44:21 butcher-p13 kernel: calcru: negative runtime of -3486
usec for pid 852 (sshd)
Dec 29 09:44:23 butcher-p13 kernel: calcru: negative runtime of -3436
usec for pid 852 (sshd)
Dec 29 09:44:23 butcher-p13 kernel: calcru: negative runtime of -3436
usec for pid 852 (sshd)

Notably, the RAID set suffers no data integrity issues. The timeouts, at
least as far as I can tell, have to do an interrupt timing issue. I'm
led to believe this because of the accompanying calcru errors and the
fact that the processors are under heavy load. Basically anything that's
trying to communicate while the CPUs are loaded up suffers. It'll drop
SSH sessions, etc.

I looked around the forums and list archives and saw several
recommendations to check the kern.timecounter.hardware setting. It's
running on ACPI-fast. Just for the sake of saying that I tried, I ran
the system with both TSC and i8254. The problem persisted.

After this, I called Silicon Mechanics who suggested enabling the
Multimedia Timer (HPET) in the BIOS. I did this, only to see that 6.2
didn't yet support it. I fired up 7.0-current with the HPET support to
no avail. Same issue.

After trying this, I went to Supermicro's site and upgraded the BIOS
from 1.1c to 1.2a. This wonderful upgrade caused the kernel (7.0) to
detect an interrupt storm on em0 and then crash spectacularly (hang). On
 6.2, it would just hang within 5 minutes of booting.

I was able to downgrade to 1.1b and am back to the baseline. The system
doesn't crash, but it pitches the above errors.

Just now I tried increasing the kern.hz loader tunable to 2000 and 5000,
but the problem still occurred. I figured that it was worth a shot.

I haven't been able to test other operating systems with the same load,
so I'm not sure if it's something inherent to the board or if FreeBSD
just isn't playing nicely.

Anything that might lead me in the right direction toward a solution is
most appreciated.

Thank you,

Aaron Pratt
myTriggers.com