8.1 amd64 lockup (maybe zfs or disk related)

Tue Feb 8 05:52:42 UTC 2011

On Mon, Feb 07, 2011 at 09:34:36PM -0800, Greg Bonett wrote:
> Thank you for the help.  I've implemented your
> suggested /boot/loader.conf and /etc/sysctl.conf tunings.
> Unfortunately, after implementing these settings, I experienced another
> lockup.  And by "lockup" I mean, nothing responding (sshd, keyboard, num
> lock) - had to reset. 
> 
> I'm trying to isolate the cause of these lockups.  I rebooted the system
> and tried to simulate high load condition WITHOUT mounting my zfs pool.
> First I ran many instances of "dd if=/dev/random of=/dev/null bs=4m" to
> get high CPU load.  The machine ran for many hours under this condition
> without lockup.  Then I added a few "dd if=/dev/adX of=/dev/null bs=4m"
> to simulate some io load.  After doing this it locked up immediately.  
> Thinking I had figured out the source of the problem, I rebooted and
> tried to replicate this experience but was not able to.  So far it has
> been running for two hours with six "dd if=/dev/adX" commands (one for
> each disk) and about a dozen "dd if=/dev/urandom" commands (to keep cpu
> near 100%).  I'll let it keep running and see if it locks again without
> ever mounting zfs.
> 
> Any ideas?

A NumLock LED that no longer toggles is a pretty good indicator of a
hardware-level problem.  An extra test would be to rebuild your kernel
with the kernel debugger enabled so that when the machine locks, you can
press Ctrl-Alt-Esc at the VGA console and see if you drop to a db>
prompt.  If so, that means the machine is actually alive (well, the
kernel anyway).
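
For reference, a kernel config fragment along these lines should be
enough to get the db> prompt.  Treat it as a sketch: the ident DEBUGKERN
is a made-up name, and I'm assuming your current config is GENERIC and
does not already carry these options.

```
include GENERIC
ident   DEBUGKERN
options KDB     # kernel debugger framework
options DDB     # interactive debugger backend (the db> prompt)
```

Build and install that kernel the usual way, reboot, and try
Ctrl-Alt-Esc once while the machine is healthy so you know what a
working break into the debugger looks like.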

As for causes: you could have bad memory (memtest86+ is a decent free
test, but not infallible), you could have a PSU that can't hold its
3.3V, 5V, or 12V rails within tolerance, you could have a PSU that
doesn't provide enough power for all the devices connected to it, you
could have a bad motherboard, your CPU could be overheating, you could
be encountering a strange hardware/silicon bug, there could be a small
or thin slice of metal lying across a single trace on the motherboard,
etc...  The list is enormous.  Hardware problems often require a person
to spend a lot of time and money, replacing a single part at a time,
until the problem goes away.

The only thing we know for sure at this point is that your Western
Digital drive behaves erratically with regard to excessive load
cycling.  That is almost certainly the reason for your READ_DMA48
errors.
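
If you want to watch the load cycling yourself, something like the
following might help.  It is only a sketch: it assumes smartmontools is
installed and that the drive reports the usual Load_Cycle_Count
attribute in "smartctl -A" output, with the raw value in the tenth
column.  The function name and the device /dev/ad4 are placeholders.

```shell
#!/bin/sh
# Read "smartctl -A" output on stdin and print the raw value of the
# Load_Cycle_Count attribute.  Assumes the standard ten-column attribute
# table (ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED,
# WHEN_FAILED, RAW_VALUE).
load_cycle_count() {
    awk '$2 == "Load_Cycle_Count" { print $10 }'
}

# Hypothetical usage (device name is a placeholder):
#   smartctl -A /dev/ad4 | load_cycle_count
```

Taking two readings a few minutes apart while the disk is mostly idle
shows how quickly the counter is climbing.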

So, you may actually be experiencing two separate issues at the
same time.  It's hard to tell at this point.

In the meantime, can you please provide output from "dmesg" after the
machine comes up?  I'm curious to know what sort of hardware is in this
machine, especially with regard to its storage controller.
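
The full dmesg is still what I'd like to see, but if you want a quick
look at just the storage bits yourself, a filter along these lines
could do it.  This is a rough sketch: the function name is made up, and
the list of driver prefixes is my guess at what is likely to show up,
so adjust it to whatever your controller attaches as.

```shell
#!/bin/sh
# Filter a dmesg stream (stdin) down to lines that start with an ATA/AHCI
# controller, channel, or disk device name.  The prefix list is an
# assumption and is not exhaustive.
storage_lines() {
    grep -E '^(atapci|ahci|ahcich|ata|ad|ada)[0-9]+:'
}

# Hypothetical usage:
#   dmesg | storage_lines
```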

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |