8.1 amd64 lockup (maybe zfs or disk related)

Tue Feb 8 06:46:35 UTC 2011

On Mon, Feb 07, 2011 at 10:16:46PM -0800, Greg Bonett wrote:
> ok, I will start trying to locate the cause of the problem.  I've
> attached my dmesg output after boot.  I'm currently downloading a liveCD
> to run memtest from.  When you say "rebuild your kernel with debugging
> enabled" do you mean add the "makeoptions     DEBUG=-g" option to my
> kernel config and rebuild? 

No, but that would be a useful addition as well, assuming you have the
disk space on your root filesystem for modules/kernel with debugging
symbols.  These are the options you want to add to your kernel config:

# Debugging options
options         BREAK_TO_DEBUGGER       # Sending a serial BREAK drops to DDB
options         KDB                     # Enable kernel debugger support
options         KDB_TRACE               # Print stack trace automatically on panic
options         DDB                     # Support DDB
options         GDB                     # Support remote GDB

Documented here:
http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-options.html

> Also, I'll start logging my cpu temp and I'll see if it peaks before a
> lockup. (I have had one of six cores disabled thinking this might
> prevent overheating) 

Disabling a core is unlikely to help.  Present-day operating systems
(including Windows for that matter) are pretty good about halting
processors (cores) which aren't in use/aren't needed, which greatly
helps with reducing power usage and temperatures.  Each CPU model is
different, so you'd have to find someone with an AMD Phenom II X6 1075T
CPU and compare thermals.
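
If you do log temperatures, something along these lines might work as a
starting point.  This is only a sketch: it assumes the amdtemp(4) driver
is loaded and exposes a dev.cpu.N.temperature sysctl that prints a value
like "45.0C" on your CPU, which I haven't verified for your hardware.
The log path and function names are made up for illustration.

```python
import os
import re
import subprocess
import time

# Assumption: amdtemp(4) is loaded and provides this sysctl on your CPU.
SYSCTL_NAME = "dev.cpu.{core}.temperature"


def parse_temperature(text):
    """Turn sysctl output such as '45.0C' into degrees Celsius as a float."""
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)\s*C?\s*$", text)
    if match is None:
        raise ValueError("unrecognised temperature: %r" % text)
    return float(match.group(1))


def read_temperature(core=0):
    """Ask sysctl(8) for one core's temperature."""
    out = subprocess.check_output(
        ["sysctl", "-n", SYSCTL_NAME.format(core=core)])
    return parse_temperature(out.decode("ascii", "replace"))


def log_forever(path="/var/log/cputemp.log", cores=(0,), interval=10):
    """Append a timestamped reading every `interval` seconds."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        temps = " ".join("%.1f" % read_temperature(c) for c in cores)
        with open(path, "a") as log:
            log.write("%s %s\n" % (stamp, temps))
            # Push the line to disk so the last reading before a lockup
            # has a chance of surviving.
            log.flush()
            os.fsync(log.fileno())
        time.sleep(interval)
```

Calling log_forever() as root from a background session would then leave
the final readings before a lockup in the log file, assuming the write
reaches the disk in time.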

> Thank you for your help talking me through this.
> 
> I've attached my dmesg output as dmesg.log.

Let's look at your storage controller setup:

atapci0: <JMicron JMB361 UDMA133 controller> irq 18
atapci1: <AHCI SATA controller> on atapci0
   ata2: <ATA channel 0> on atapci1
   ata3: <ATA channel 1> on atapci1
   ata4: <ATA channel 0> on atapci0
atapci2: <ATI IXP700/800 SATA300 controller> irq 19
atapci2: AHCI v1.20 controller with 4 6Gbps ports, PM supported
   ata5: <ATA channel 0> on atapci2
   ata6: <ATA channel 1> on atapci2
   ata7: <ATA channel 2> on atapci2
   ata8: <ATA channel 3> on atapci2
atapci3: <ATI IXP700/800 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xff00-0xff0f at device 20.1 on pci0
   ata0: <ATA channel 0> on atapci3
   ata1: <ATA channel 1> on atapci3

There have been recent discussions about "problems" with the ATI
IXP700/800 controllers.  I do not buy AMD systems, so I can't comment on
these controllers' reliability.  Just an FYI.  Here's the thread:

http://lists.freebsd.org/pipermail/freebsd-stable/2011-February/thread.html#61348

I also tend to avoid JMicron controllers like the plague.  I've seen too
many problem reports with them over the years, regardless of OS.

Now for the disk layout (I'm excluding da0, which is a USB flash disk of
some kind).

 ad0: 953869MB <WDC WD10EARS-00Y5B1 80.00A80> at ata0-master UDMA133 SATA
 ad1: 953869MB <Seagate ST31000333AS CC1H> at ata0-slave UDMA133 SATA
 ad4: 1430799MB <WL1500GSA6472 05.00F.1> at ata2-master UDMA100 SATA 3Gb/s
 ad8: 15279MB <TRANSCEND 20091215> at ata4-master UDMA66 
acd0: CDRW <NEC CD-RW NR-7900A/1.08> at ata4-slave UDMA33 
ad10: 953869MB <WL1000GSA1672 05.00J05> at ata5-master UDMA100 SATA 3Gb/s
ad12: 953869MB <Seagate ST31000333AS CC1H> at ata6-master UDMA100 SATA 3Gb/s
ad14: 953869MB <SAMSUNG HD103UJ 1AA01118> at ata7-master UDMA100 SATA 3Gb/s
ad16: 953869MB <WL1000GSA1672 HA.00CHA> at ata8-master UDMA100 SATA 3Gb/s

You have a very large number of hard disks in this machine, so I sure
hope you have a decent enough PSU to handle them all.

If I had to make a recommendation, it would be to decrease the number of
hard disks in the machine.  You have 8 of them -- one of which may be a
RAM drive or something similar -- and that isn't including your CDRW
drive.

I would also try getting rid of the JMicron controller; I would
recommend investing in a Silicon Image controller to replace it,
specifically one driven by the 3124, 3132, or 3531 chips.  Avoid the
3112, 3114, and 3512 chips:
http://en.wikipedia.org/wiki/Silicon_Image#Product_alerts

Next we have this:

> ad1: TIMEOUT - READ_DMA retrying (1 retry left) LBA=1
> GEOM: ad1: partition 1 does not start on a track boundary.
> GEOM: ad1: partition 1 does not end on a track boundary.
> GEOM: label/1TBdisk5: partition 1 does not start on a track boundary.
> GEOM: label/1TBdisk5: partition 1 does not end on a track boundary.

This doesn't look good, especially the READ_DMA timeout on ad1.  That's
a different disk than the one you told me about before.  LBA 1 is
literally the 2nd block on the disk, which is a little too close to
block 0 for comfort.  I'd love to see "smartctl -a /dev/ad1" output
here.
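
For what I'd be looking for in that output, here's a rough sketch that
pulls the raw values of a few attributes out of the SMART attribute
table.  It uses "smartctl -A" (attributes only) rather than "-a" to keep
the parsing simple.  The list of attributes to watch is my own choice,
not every drive reports all of them, and a non-zero raw value is a hint
rather than proof of a failing disk.  The function names are
hypothetical.

```python
import subprocess

# Attributes that commonly point at media or cabling trouble when their
# raw value is non-zero.  Which ones exist depends on the drive vendor.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def parse_attributes(report):
    """Return {attribute_name: raw_value_string} from `smartctl -A` text."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have these columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = fields[9]
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is above zero."""
    bad = {}
    for name in WATCHED:
        raw = attrs.get(name, "0")
        if raw.isdigit() and int(raw) > 0:
            bad[name] = int(raw)
    return bad


def check_disk(device="/dev/ad1"):
    """Run smartctl on one device and report suspicious attributes."""
    # smartctl's exit status is a bitmask, so a non-zero status is not
    # treated as a failure here.
    proc = subprocess.Popen(["smartctl", "-A", device],
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return suspicious(parse_attributes(out.decode("ascii", "replace")))
```

Running check_disk("/dev/ad1") as root would return an empty dict if
none of the watched attributes look off.  The full "smartctl -a" output
is still what I'd want to see, since it also includes the error and
self-test logs.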

> calcru: runtime went backwards from 82 usec to 70 usec for pid 20 (flowcleaner)
> calcru: runtime went backwards from 363 usec to 317 usec for pid 8 (pagedaemon)
> calcru: runtime went backwards from 111 usec to 95 usec for pid 7 (xpt_thrd)
> calcru: runtime went backwards from 1892 usec to 1629 usec for pid 1 (init)
> calcru: runtime went backwards from 6786 usec to 6591 usec for pid 0 (kernel)

This is a problem that has plagued FreeBSD for some time.  It's usually
caused by EIST (est) being used, but that's on Intel platforms.  AMD has
something similar called Cool'n'Quiet (see the cpufreq(4) man page).
Are you running powerd(8) on this system?  If so, try disabling it and
see if these messages go away.
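
To compare before and after, it may help to count these messages rather
than eyeball them.  A small sketch that parses the calcru lines in the
format shown above and totals how far the runtime went backwards (the
function names are hypothetical, and the regex only matches this exact
message wording):

```python
import re

# Matches lines such as:
# calcru: runtime went backwards from 82 usec to 70 usec for pid 20 (flowcleaner)
CALCRU = re.compile(
    r"calcru: runtime went backwards from (\d+) usec to (\d+) usec "
    r"for pid (\d+) \((.+)\)")


def parse_calcru(line):
    """Return a dict describing one calcru message, or None if no match."""
    match = CALCRU.search(line)
    if match is None:
        return None
    before, after, pid = (int(match.group(i)) for i in (1, 2, 3))
    return {"pid": pid, "name": match.group(4), "lost_usec": before - after}


def summarize(lines):
    """Return (number of calcru messages, total microseconds lost)."""
    events = [e for e in map(parse_calcru, lines) if e is not None]
    return len(events), sum(e["lost_usec"] for e in events)
```

Feeding it the saved dmesg output, e.g. summarize(open("dmesg.log")),
once with powerd(8) running and once without should show whether the
count drops.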

> GEOM_ELI: Device label/1tbgreendisk.eli created.
> GEOM_ELI: Encryption: AES-CBC 256
> GEOM_ELI:     Crypto: software
> {...}

There was no mention of geli(8) being used on this system until now.
There may be other complexities as a result of this; I don't know.

Good luck.

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |