ZFS lockup in "zfs" state

Mon Jun 2 19:31:54 UTC 2008

Hi.

I see the same problem very often with one HDD (READ_DMA UDMA ICRC error)
that is part of a ZFS pool. Previously this disk was in the ar0 mirror, not
in a ZFS pool; it sometimes failed there too, but it never caused a panic and
was only detached from the mirror. The panics appeared after I added the disk
to the ZFS pool. I am sure the problem is the hard disk itself.
Smartmontools notified me by mail that UDMA_CRC_Error_Count had increased
after the HDD failure, and according to smartctl the disk has a hardware
problem. I replaced the cable and tried connecting the disk to another port,
with no change, so it is 100% a hard disk problem.
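
For anyone who wants to script the same check instead of waiting for the
smartd mail, here is a minimal sketch. It assumes "smartctl -A <device>"
prints the usual ATA attribute table (ID#, ATTRIBUTE_NAME, ..., RAW_VALUE as
the tenth column); the function names are my own and not part of
smartmontools.

```python
import subprocess


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} parsed from `smartctl -A` output.

    Assumes the standard ATA attribute table, where each row is:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the header row does not.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            attrs[fields[1]] = int(raw) if raw.isdigit() else raw
    return attrs


def read_smart_attributes(device):
    """Run smartctl on `device` (e.g. "/dev/ad16") and parse the table.

    smartctl uses its exit status as a bitmask, so a non-zero status is
    not treated as a failure here.
    """
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return parse_smart_attributes(result.stdout)
```

Calling read_smart_attributes("/dev/ad16").get("UDMA_CRC_Error_Count") before
and after a failure would show whether the counter has grown.
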
I could not get a kernel core dump from the panic ("savecore: no dumps
found"), so only the logs are available.

From the log file:
Jun  1 10:43:11 yalur kernel: ad16: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=233909187
Jun  1 10:43:20 yalur kernel: ad16: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Jun  1 10:43:36 yalur kernel: ad16: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Jun  1 10:43:36 yalur kernel: ad16: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Jun  1 10:43:36 yalur kernel: ad16: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Jun  1 10:43:36 yalur kernel: ad16: WARNING - SET_MULTI taskqueue timeout - completing request directly
Jun  1 10:43:36 yalur kernel: ad16: TIMEOUT - READ_DMA retrying (0 retries left) LBA=233909187

Jun  1 11:07:50 yalur syslogd: restart
Jun  1 11:07:50 yalur syslogd: kernel boot file is /boot/kernel/kernel
Jun  1 11:07:50 yalur kernel: ad16: FAILURE - device detached
Jun  1 11:07:50 yalur kernel: subdisk16: detached
Jun  1 11:07:50 yalur kernel: ad16: detached
Jun  1 11:07:50 yalur kernel:
Jun  1 11:07:50 yalur kernel:
Jun  1 11:07:50 yalur kernel: Fatal trap 12: page fault while in kernel mode
Jun  1 11:07:50 yalur kernel: cpuid = 0; apic id = 00
Jun  1 11:07:50 yalur kernel: fault virtual address     = 0x2c
Jun  1 11:07:50 yalur kernel: fault code                = supervisor write, page not present
Jun  1 11:07:50 yalur kernel: instruction pointer       = 0x20:0x805aab85
Jun  1 11:07:50 yalur kernel: stack pointer             = 0x28:0xed71ac5c
Jun  1 11:07:50 yalur kernel: frame pointer             = 0x28:0xed71ac70
Jun  1 11:07:50 yalur kernel: code segment              = base 0x0, limit 0xfffff, type 0x1b
Jun  1 11:07:50 yalur kernel: = DPL 0, pres 1, def32 1, gran 1
Jun  1 11:07:50 yalur kernel: processor eflags  = interrupt enabled, resume, IOPL = 0
Jun  1 11:07:50 yalur kernel: current process           = 3 (g_up)
Jun  1 11:07:50 yalur kernel: trap number               = 12
Jun  1 11:07:50 yalur kernel: panic: page fault
Jun  1 11:07:50 yalur kernel: cpuid = 0
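
To see at a glance which disk is generating these messages, the log above
can be tallied per device. This is only a sketch: it assumes the syslog line
format shown above ("kernel: adN: WARNING|TIMEOUT|FAILURE - ..."), and the
function name is my own.

```python
import re
from collections import Counter

# Matches the ata(4) messages as they appear in the log above, e.g.
# "... kernel: ad16: WARNING - READ_DMA UDMA ICRC error ..."
ATA_EVENT = re.compile(r"kernel: (ad\d+): (WARNING|TIMEOUT|FAILURE) - ")


def tally_ata_events(lines):
    """Count ATA messages per (device, severity) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ATA_EVENT.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

For example, tally_ata_events(open("/var/log/messages")) returns a Counter
keyed by pairs such as ("ad16", "WARNING"); a disk that dominates the counts
is the first candidate for the smartctl and cable checks.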

[root@yalur /home/ruslan]# zpool status
  pool: data
 state: ONLINE
 scrub: scrub completed with 0 errors on Mon Jun  2 12:05:52 2008
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad6     ONLINE       0     0     0
            ad8     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad12    ONLINE       0     0     0
            ad14    ONLINE       0     0     0
            ad16    ONLINE       0     0     0
            ad20    ONLINE       0     0     0
        spares
          ad26      AVAIL

errors: No known data errors

On Monday 02 June 2008, Jeremy Chadwick wrote:
> On Mon, Jun 02, 2008 at 04:04:12PM +1000, Andrew Hill wrote:
> > On Mon, May 19, 2008 at 1:11 AM, Andrew Hill <lists at thefrog.net> wrote:
> > > i tend to find that the timeouts occur on one or two disks at once -
> > > e.g. ad0 and 2 will complain of timeouts, and the system locks up
> > > shortly thereafter...
> >
> > after spitting out the usual errors from ad0 and ad2 (in this case) with
> > TIMEOUTs and subsequent FAILUREs on READ_DMA[48] and WRITE_DMA[48]...
> >
> > i got the following panic
> >
> > vm_fault: pager read error, pid 1552 (tlsmgr)
> > ad0: FAILURE - READ_DMA48 timed out LBA=352903900
> > swap_pager: indefinite wait buffer: bufobj: 0, blkno: 437, size: 4096
> > ad2: FAILURE - WRITE_DMA timed out LBA=239717693
> > panic: ZFS: I/O failure (write on <unknown> off 0: zio 0xffffff001d47c810
> > [L0 ZIL intent log] b000L/b000P DVA[0]=<0:c807795000:d000> zilog
> > uncompressed LE contiguous birth=750230 fill=0
> > cksum=69f76525a84e1816:f6d86fe1d94cd68c:39:8af): error 5
> > KDB: enter: panic
> > [thread pid 72 tid 100071 ]
> > Stopped at      kdb_enter_why+0x3d:     movq    $0,0x39b248(%rip)
> > db>
>
> I would say the ZFS crash is a result of the ad0/ad2 timeouts.  The ZIL
> log shows a hard checksum failure in the ZIL, which indicates a serious
> problem -- very likely hardware-related (or rather, at a lower level
> than ZFS).
>
> You've read this already, but maybe you missed the DMA error part:
>
> http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues
>
> The DMA errors can actually be legitimate too -- it's very hard to
> troubleshoot if they're superfluous (e.g. a FreeBSD bug) or if they're
> real.  If the problem is reproducible, then this is convenient with
> regards to providing you additional help.
>
> I really need to sit down and write a huge HOWTO doc for people on how
> to diagnose whether or not their disks or cables are bad, etc...  It's a
> very hard thing to document, because everyone's situation is different.
>
> The first piece to start with is simplest, though: install
> ports/sysutils/smartmontools and provide the output of "smartctl -a
> /dev/ad0" and /dev/ad2.  Actual disk errors will very likely show up
> there in one of the counters, or in the SMART log.  I'd personally like
> to see the output from smartctl, because it's something you can do while
> the system is up/working.
>
> The next step would involve replacing your cables.  If the problem
> continues, you've at least removed one piece of the puzzle.
>
> Next, replace the disks -- especially if they were bought at the same
> time, and are from the same vendor.  Hard disk vendors are known to have
> bad batches of disks.  For sake of example, I just had two Western
> Digital disks (which I bought at the same time) fail a short I/O test,
> returning errors at different LBAs (blocks).  The 2nd one only started
> showing problems a few weeks after the first.  I obviously got both of
> them RMA'd.
>
> Finally, replace the controller or motherboard.  Some people have
> reported success with this.
>
> > generally the lockups don't result in a panic (at least not in the short
> > term of 5-10 minutes), so i can't be sure that this panic is necessarily
> > caused by the same problem, but thought it might be worth posting in case
> > it gives an indication of the location/cause of the deadlock
>
> The DMA timeout errors you've seen, others have seen as well --
> including me -- even when the hardware, disks, cabling, and controllers
> are in a 100% working state.  (Even switching OSes results in no errors,
> indicating there is a problem with FreeBSD in some way.)
>
> If the problem is reproducible, you should get in contact with Scott
> Long and let him poke at things.  (I mentioned this last time.  :-) )
> I myself am not familiar with the FreeBSD kernel, the device drivers, or
> working with the kernel at such a low level to debug things of this
> nature.
>
> > unfortunately i couldn't get a backtrace or core dump for 'political'
> > reasons (the system was required for use by others) but i'll see if i can
> > get a panic happening after-hours to get some more info...
>
> I can't tell you what to do or how to do your job, but honestly you
> should be pulling this system out of production and replacing it with a
> different one, or a different implementation, or a different OS.  Your
> users/employees are probably getting ticked off at the crashes, and it
> probably irritates you too.  The added benefit is that you could get
> Scott access to the box.

-- 
________________
Best regards,
Ковтун Руслан mailto <yalur at mail.ru>