ZFS lockup in "zfs" state

Mon Jun 2 06:40:23 UTC 2008

On Mon, Jun 02, 2008 at 04:04:12PM +1000, Andrew Hill wrote:
> On Mon, May 19, 2008 at 1:11 AM, Andrew Hill <lists at thefrog.net> wrote:
> 
> > i tend to find that the timeouts occur on one or two disks at once - e.g.
> > ad0 and 2 will complain of timeouts, and the system locks up shortly
> > thereafter...
> 
> after spitting out the usual errors from ad0 and ad2 (in this case) with
> TIMEOUTs and subsequent FAILUREs on READ_DMA[48] and WRITE_DMA[48]...
> 
> i got the following panic
> 
> vm_fault: pager read error, pid 1552 (tlsmgr)
> ad0: FAILURE - READ_DMA48 timed out LBA=352903900
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 437, size: 4096
> ad2: FAILURE - WRITE_DMA timed out LBA=239717693
> panic: ZFS: I/O failure (write on <unknown> off 0: zio 0xffffff001d47c810
> [L0 ZIL intent log] b000L/b000P DVA[0]=<0:c807795000:d000> zilog
> uncompressed LE contiguous birth=750230 fill=0
> cksum=69f76525a84e1816:f6d86fe1d94cd68c:39:8af): error 5
> KDB: enter: panic
> [thread pid 72 tid 100071 ]
> Stopped at      kdb_enter_why+0x3d:     movq    $0,0x39b248(%rip)
> db>

I would say the ZFS crash is a result of the ad0/ad2 timeouts.  The
panic shows a write to the ZIL (intent log) failing with error 5 (EIO,
an I/O error), which indicates a serious problem -- very likely
hardware-related (or rather, at a lower level than ZFS).

You've read this already, but maybe you missed the DMA error part:

http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues

The DMA errors can actually be legitimate too -- it's very hard to
tell whether they're spurious (e.g. caused by a FreeBSD bug) or real.
If the problem is reproducible, that makes it much easier to give you
additional help.
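
As a rough aid for judging reproducibility, the ATA messages can be
tallied per disk and per command.  The following is only a sketch in
Python: the regular expression is modelled on the two FAILURE lines
quoted above, and treating TIMEOUT lines as having the same layout is
an assumption.  The function name tally_ata_errors is made up for
illustration.

```python
import re
from collections import Counter

# Modelled on lines such as:
#   ad0: FAILURE - READ_DMA48 timed out LBA=352903900
# Assumption: TIMEOUT lines use the same "adN: KIND - COMMAND ... LBA=n" layout.
LINE_RE = re.compile(
    r"(?P<dev>ad\d+): (?P<kind>TIMEOUT|FAILURE) - "
    r"(?P<cmd>[A-Z_0-9]+)\b.*?LBA=(?P<lba>\d+)"
)


def tally_ata_errors(log_text):
    """Count ATA error lines per (device, kind, command) and collect LBAs."""
    counts = Counter()
    lbas = {}
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # not an ATA timeout/failure line
        counts[(m["dev"], m["kind"], m["cmd"])] += 1
        lbas.setdefault(m["dev"], []).append(int(m["lba"]))
    return counts, lbas


# The console excerpt from the quoted panic, used here as sample input.
SAMPLE = """\
vm_fault: pager read error, pid 1552 (tlsmgr)
ad0: FAILURE - READ_DMA48 timed out LBA=352903900
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 437, size: 4096
ad2: FAILURE - WRITE_DMA timed out LBA=239717693
"""

if __name__ == "__main__":
    counts, lbas = tally_ata_errors(SAMPLE)
    for (dev, kind, cmd), n in sorted(counts.items()):
        print(f"{dev} {kind} {cmd}: {n}")
    for dev, values in sorted(lbas.items()):
        print(f"{dev} LBAs: {values}")
```

As a loose heuristic only: the same LBA failing over and over on one
disk would fit a bad sector better, while errors scattered across LBAs
and across several disks at once would fit a cabling, controller, or
driver problem better.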

I really need to sit down and write a huge HOWTO doc for people on how
to diagnose whether or not their disks or cables are bad, etc...  It's a
very hard thing to document, because everyone's situation is different.

The first step is the simplest, though: install
ports/sysutils/smartmontools and provide the output of "smartctl -a
/dev/ad0" and "smartctl -a /dev/ad2".  Actual disk errors will very
likely show up there in one of the counters, or in the SMART log.  I'd
personally like to see the output from smartctl, because it's something
you can do while the system is up/working.
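
For anyone who wants a quick first pass over the counters before
posting, here is a minimal Python sketch that pulls the raw values out
of the attribute table printed by "smartctl -a".  Several things here
are assumptions: the watch list is just a common starting set,
attribute names vary between disk vendors, and the sample table
contains invented values, not output from the disks in this thread.

```python
# Attributes often checked first.  This watch list is an assumption of the
# sketch; names differ between vendors and not every disk reports all of them.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def parse_attributes(smartctl_output):
    """Return {attribute_name: raw_value} from the SMART attribute table."""
    attrs = {}
    in_table = False
    for line in smartctl_output.splitlines():
        if line.startswith("ID#"):
            in_table = True  # header row of the attribute table
            continue
        if not in_table:
            continue
        fields = line.split()
        if not fields:
            in_table = False  # a blank line ends the table
            continue
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            # RAW_VALUE is the tenth column; skip raw values that are not
            # plain integers (some vendors use other formats).
            attrs[fields[1]] = int(fields[9])
        except ValueError:
            continue
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is non-zero."""
    return {k: v for k, v in attrs.items() if k in WATCHED and v > 0}


# Invented sample in the usual table layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    for name, raw in sorted(suspicious(parse_attributes(SAMPLE)).items()):
        print(f"{name}: {raw}")
```

Non-zero reallocated or pending sector counts usually point at the
disk itself, whereas a growing UDMA_CRC_Error_Count more often points
at the cable or the link.  The full smartctl output is still what's
worth posting; this only helps decide where to look first.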

The next step would involve replacing your cables.  If the problem
continues, you've at least removed one piece of the puzzle.

Next, replace the disks -- especially if they were bought at the same
time, and are from the same vendor.  Hard disk vendors are known to have
bad batches of disks.  For the sake of example, I just had two Western
Digital disks (which I bought at the same time) fail a short I/O test,
returning errors at different LBAs (blocks).  The second one only
started showing problems a few weeks after the first.  I obviously got
both of them RMA'd.

Finally, replace the controller or motherboard.  Some people have
reported success with this.

> generally the lockups don't result in a panic (at least not in the short
> term of 5-10 minutes), so i can't be sure that this panic is necessarily
> caused by the same problem, but thought it might be worth posting in case it
> gives an indication of the location/cause of the deadlock

The DMA timeout errors you've seen, others have seen as well --
including me -- even when the hardware, disks, cabling, and controllers
are in a 100% working state.  (In those cases, switching to another OS
on the same hardware makes the errors go away, which indicates there is
a problem with FreeBSD in some way.)

If the problem is reproducible, you should get in contact with Scott
Long and let him poke at things.  (I mentioned this last time.  :-) )
I myself am not familiar enough with the FreeBSD kernel, the device
drivers, or working with the kernel at such a low level to debug things
of this nature.

> unfortunately i couldn't get a backtrace or core dump for 'political'
> reasons (the system was required for use by others) but i'll see if i can
> get a panic happening after-hours to get some more info...

I can't tell you what to do or how to do your job, but honestly you
should be pulling this system out of production and replacing it with a
different one, or a different implementation, or a different OS.  Your
users/employees are probably getting ticked off at the crashes, and it
probably irritates you too.  The added benefit is that you could get
Scott access to the box.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |