8.1 amd64 lockup (maybe zfs or disk related)

Tue Feb 8 06:16:49 UTC 2011

OK, I will start trying to locate the cause of the problem.  I've
attached my dmesg output after boot.  I'm currently downloading a live CD
to run memtest from.  When you say "rebuild your kernel with debugging
enabled", do you mean add the "makeoptions     DEBUG=-g" option to my
kernel config and rebuild?
Also, I'll start logging my CPU temperature to see if it peaks before a
lockup.  (I have had one of the six cores disabled, thinking this might
prevent overheating.)
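
A minimal sketch of the temperature logging mentioned above, assuming a
sensor driver such as coretemp(4) or amdtemp(4) is loaded and exposes
dev.cpu.0.temperature.  The log path, interval, and function names are
hypothetical.  Each sample is followed by sync so the last readings have a
better chance of surviving a hard lockup.

```shell
#!/bin/sh
# Assumed defaults; override from the environment as needed.
TEMP_OID="${TEMP_OID:-dev.cpu.0.temperature}"   # assumed sysctl OID
LOGFILE="${LOGFILE:-/var/log/cputemp.log}"      # hypothetical log path
INTERVAL="${INTERVAL:-5}"                       # seconds between samples

# format_sample TIMESTAMP READING: build one log line.
format_sample() {
    printf '%s %s\n' "$1" "$2"
}

# log_once: append one timestamped reading, then flush buffers to disk.
log_once() {
    reading=$(sysctl -n "$TEMP_OID" 2>/dev/null) || reading="unavailable"
    format_sample "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$reading" >> "$LOGFILE"
    sync
}

# log_forever: sample until interrupted (or until the machine locks up).
log_forever() {
    while :; do
        log_once
        sleep "$INTERVAL"
    done
}
```

Calling log_forever from a root shell (or from a script started at boot)
leaves a trail whose last entry shows the temperature shortly before the
hang.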

Thank you for your help talking me through this.

I've attached my dmesg output as dmesg.log.
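
For reference on the debugging question above: "makeoptions DEBUG=-g" only
builds the kernel with debug symbols.  The db> prompt comes from the
in-kernel debugger, which a FreeBSD kernel config enables with
"options KDB" and "options DDB".  A rough way to check whether the running
kernel already has it, assuming the debug.kdb.available sysctl is present
on this release (the function names are hypothetical):

```shell
#!/bin/sh
# has_ddb BACKENDS: succeed if the space-separated list contains "ddb".
has_ddb() {
    for backend in $1; do
        [ "$backend" = "ddb" ] && return 0
    done
    return 1
}

# report_ddb: query the running kernel for its debugger backends.
# Assumes the debug.kdb.available sysctl exists; an empty or missing
# value is treated as "no debugger compiled in".
report_ddb() {
    backends=$(sysctl -n debug.kdb.available 2>/dev/null)
    if has_ddb "$backends"; then
        echo "DDB is available in this kernel"
    else
        echo "DDB not found; rebuild with options KDB and options DDB"
    fi
}
```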

On Mon, 2011-02-07 at 21:52 -0800, Jeremy Chadwick wrote:
> On Mon, Feb 07, 2011 at 09:34:36PM -0800, Greg Bonett wrote:
> > Thank you for the help.  I've implemented your
> > suggested /boot/loader.conf and /etc/sysctl.conf tunings.
> > Unfortunately, after implementing these settings, I experienced another
> > lockup.  By "lockup" I mean nothing responds (sshd, keyboard, Num
> > Lock); I had to reset.
> > 
> > I'm trying to isolate the cause of these lockups.  I rebooted the system
> > and tried to simulate a high-load condition WITHOUT mounting my ZFS pool.
> > First I ran many instances of "dd if=/dev/random of=/dev/null bs=4m" to
> > get high CPU load.  The machine ran for many hours under this condition
> > without a lockup.  Then I added a few "dd if=/dev/adX of=/dev/null bs=4m"
> > to simulate some I/O load.  After doing this, it locked up immediately.
> > Thinking I had figured out the source of the problem, I rebooted and
> > tried to reproduce the lockup but was not able to.  So far it has
> > been running for two hours with six "dd if=/dev/adX" commands (one for
> > each disk) and about a dozen "dd if=/dev/urandom" commands (to keep the CPU
> > near 100%).  I'll let it keep running and see if it locks up again without
> > ever mounting ZFS.
> > 
> > Any ideas?
> 
> No NumLock LED toggling is a pretty good indicator of a hardware-level
> problem.  An extra test would be to rebuild your kernel with debugging
> enabled so that when the machine locks, you could try pressing
> Ctrl-Alt-Esc at the VGA console and see if you drop to a db> prompt.  If
> so, that means the machine is actually alive (well, the kernel anyway).
> 
> As for causes: you could have bad memory (memtest86+ is a decent free
> test, but not infallible), you could have a PSU that doesn't have decent
> voltage ranges on its 3V, 5V, or 12V lines, you could have a PSU that
> doesn't provide enough power for all the devices connected to it, you
> could have a bad motherboard, your CPU could be overheating, you could
> be encountering a strange hardware/silicon bug, there could be a small
> or thin slice of metal lying across a single trace on the motherboard,
> etc...  The list is enormous.  Hardware problems often require a person
> to spend a lot of time and money, replacing a single part at a time,
> until the problem goes away.
> 
> The only thing we know for sure at this point is that your Western
> Digital drive behaves erratically with regard to excessive load
> cycling.  That is almost certainly the reason for your READ_DMA48
> errors.
> 
> So, you may actually be experiencing two separate issues at the
> same time.  It's hard to tell at this point.
> 
> In the meantime, can you please provide output from "dmesg" after the
> machine comes up?  I'm curious to know what sort of hardware is in this
> machine, especially with regard to its storage controller.
>
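
The load test described in the quoted message (CPU load from dd reading
/dev/random, I/O load from dd reading each raw disk) could be scripted
roughly as follows.  This is only a sketch: the function names and the
DRY_RUN switch are hypothetical, and the disk names passed in are
placeholders for the real adX devices.  All reads go to /dev/null, so
nothing is written to the disks.

```shell
#!/bin/sh
# run CMD...: start CMD in the background, or just print it if DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@" &
    fi
}

# start_load NCPU DISK...: NCPU CPU-bound readers plus one reader per disk.
start_load() {
    ncpu=$1
    shift
    i=0
    while [ "$i" -lt "$ncpu" ]; do
        run dd if=/dev/random of=/dev/null bs=4m   # CPU load
        i=$((i + 1))
    done
    for disk in "$@"; do
        run dd "if=/dev/$disk" of=/dev/null bs=4m  # read-only I/O load
    done
    # Block until the background readers exit (skipped in dry-run mode).
    [ -n "${DRY_RUN:-}" ] || wait
}
```

For example, "DRY_RUN=1 start_load 12 ad4 ad6" (with hypothetical disk
names) prints the commands without running them, which makes it easy to
confirm the device list before starting the real test.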