8.1 amd64 lockup (maybe zfs or disk related)
greg at bonett.org
Tue Feb 8 06:16:49 UTC 2011
OK, I will start trying to locate the cause of the problem. I've
attached my dmesg output after boot. I'm currently downloading a live CD
to run memtest from. When you say "rebuild your kernel with debugging
enabled", do you mean adding the "makeoptions DEBUG=-g" option to my
kernel config and rebuilding?
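To make the question concrete, this is roughly what I have in mind for
the kernel config. Only the makeoptions line is what I asked about
above; the KDB and DDB options are my assumption about what is needed to
get a db> prompt, so please correct me if that is wrong.

```text
# Sketch of the additions to my custom kernel config (assumed, not yet tested)
makeoptions	DEBUG=-g	# build the kernel with gdb(1) debug symbols
options 	KDB		# assumption: kernel debugger framework
options 	DDB		# assumption: interactive debugger behind the db> prompt
```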
Also, I'll start logging my CPU temperature to see whether it peaks before a
lockup. (I have had one of six cores disabled thinking this might
Thank you for your help talking me through this.
I've attached my dmesg output as dmesg.log.
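For the temperature logging, I'm planning a small sh loop along these
lines. It is only a sketch: the sysctl name dev.cpu.0.temperature
assumes a sensor driver such as amdtemp(4) is loaded and exposes that
node on this CPU, and TEMP_CMD, log_line and log_temps are names I made
up for illustration.

```shell
#!/bin/sh
# Sketch: log a timestamped CPU temperature reading at a fixed interval.
# Assumption: a sensor driver (e.g. amdtemp or coretemp) is loaded so that
# the sysctl below exists; override TEMP_CMD for other hardware.
TEMP_CMD=${TEMP_CMD:-"sysctl -n dev.cpu.0.temperature"}

# Print one line: "<UTC timestamp> <reading>".
# If the sensor command fails, the reading is recorded as "unavailable".
log_line() {
    reading=$($TEMP_CMD 2>/dev/null) || reading="unavailable"
    printf '%s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$reading"
}

# log_temps COUNT INTERVAL
# Emit COUNT lines, sleeping INTERVAL seconds between them.
# COUNT=0 means keep going until interrupted.
log_temps() {
    count=$1
    interval=$2
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        log_line
        i=$((i + 1))
        # Skip the sleep after the final line of a bounded run.
        if [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

I'd run it as something like "log_temps 0 10 >> /var/log/cputemp.log",
keeping in mind that the last few lines may never reach the disk if the
box locks hard.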
On Mon, 2011-02-07 at 21:52 -0800, Jeremy Chadwick wrote:
> On Mon, Feb 07, 2011 at 09:34:36PM -0800, Greg Bonett wrote:
> > Thank you for the help. I've implemented your
> > suggested /boot/loader.conf and /etc/sysctl.conf tunings.
> > Unfortunately, after implementing these settings, I experienced another
> > lockup. And by "lockup" I mean, nothing responding (sshd, keyboard, num
> > lock) - had to reset.
> > I'm trying to isolate the cause of these lockups. I rebooted the system
> > and tried to simulate high load condition WITHOUT mounting my zfs pool.
> > First I ran many instances of "dd if=/dev/random of=/dev/null bs=4m" to
> > get high CPU load. The machine ran for many hours under this condition
> > without lockup. Then I added a few "dd if=/dev/adX of=/dev/null bs=4m"
> > to simulate some io load. After doing this it locked up immediately.
> > Thinking I had figured out the source of the problem, I rebooted and
> > tried to replicate this experience but was not able to. So far it has
> > been running for two hours with six "dd if=/dev/adX" commands (one for
> > each disk) and about a dozen "dd if=/dev/urandom" commands (to keep cpu
> > near 100%). I'll let it keep running and see if it locks again without
> > ever mounting zfs.
> > any ideas?
> No NumLock LED toggling is a pretty good indicator of a hardware-level
> problem. An extra test would be to rebuild your kernel with debugging
> enabled so that when the machine locks, you could try pressing
> Ctrl-Alt-Esc at the VGA console and see if you drop to a db> prompt. If
> so, that means the machine is actually alive (well, the kernel anyway).
> As for causes: you could have bad memory (memtest86+ is a decent free
> test, but not infallible), you could have a PSU that doesn't have decent
> voltage ranges on its 3V, 5V, or 12V lines, you could have a PSU that
> doesn't provide enough power for all the devices connected to it, you
> could have a bad motherboard, your CPU could be overheating, you could
> be encountering a strange hardware/silicon bug, there could be a small
> or thin slice of metal lying across a single trace on the motherboard,
> etc... The list is enormous. Hardware problems often require a person
> to spend a lot of time and money, replacing a single part at a time,
> until the problem goes away.
> The only thing we know for sure at this point is that your Western
> Digital drive behaves erratically with regard to excessive load
> cycling. That is almost certainly the reason for your READ_DMA48
> So, you may actually be experiencing two separate issues at the
> same time. It's hard to tell at this point.
> In the meantime, can you please provide output from "dmesg" after the
> machine comes up? I'm curious to know what sort of hardware is in this
> machine, especially with regard to its storage controller.
More information about the freebsd-stable mailing list