New ZFSv28 patchset for 8-STABLE

Sun Jan 9 12:18:03 UTC 2011

On Sun, Jan 09, 2011 at 12:49:27PM +0100, Attila Nagy wrote:
>  On 01/09/2011 10:00 AM, Attila Nagy wrote:
> > On 12/16/2010 01:44 PM, Martin Matuska wrote:
> >>Hi everyone,
> >>
> >>following the announcement of Pawel Jakub Dawidek (pjd at FreeBSD.org) I am
> >>providing a ZFSv28 testing patch for 8-STABLE.
> >>
> >>Link to the patch:
> >>
> >>http://people.freebsd.org/~mm/patches/zfs/v28/stable-8-zfsv28-20101215.patch.xz
> >>
> >>
> >I've got an IO hang with dedup enabled (not sure it's related,
> >I've started to rewrite all data on pool, which makes a heavy
> >load):
> >
> >The processes are in various states:
> >65747   1001      1  54   10 28620K 24360K tx->tx  0   6:58  0.00% cvsup
> >80383   1001      1  54   10 40616K 30196K select  1   5:38  0.00% rsync
> > 1501 www         1  44    0  7304K  2504K zio->i  0   2:09  0.00% nginx
> > 1479 www         1  44    0  7304K  2416K zio->i  1   2:03  0.00% nginx
> > 1477 www         1  44    0  7304K  2664K zio->i  0   2:02  0.00% nginx
> > 1487 www         1  44    0  7304K  2376K zio->i  0   1:40  0.00% nginx
> > 1490 www         1  44    0  7304K  1852K zfs     0   1:30  0.00% nginx
> > 1486 www         1  44    0  7304K  2400K zfsvfs  1   1:05  0.00% nginx
> >
> >And everything which wants to touch the pool is/becomes dead.
> >
> >Procstat says about one process:
> ># procstat -k 1497
> >  PID    TID COMM             TDNAME           KSTACK
> > 1497 100257 nginx            -                mi_switch
> >sleepq_wait __lockmgr_args vop_stdlock VOP_LOCK1_APV null_lock
> >VOP_LOCK1_APV _vn_lock nullfs_root lookup namei vn_open_cred
> >kern_openat syscallenter syscall Xfast_syscall
> No, it's not related. One of the disks in the RAIDZ2 pool went bad:
> (da4:arcmsr0:0:4:0): READ(6). CDB: 8 0 2 10 10 0
> (da4:arcmsr0:0:4:0): CAM status: SCSI Status Error
> (da4:arcmsr0:0:4:0): SCSI status: Check Condition
> (da4:arcmsr0:0:4:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered
> read error)
> and it seems it froze the whole zpool. Removing the disk by hand
> solved the problem.
> I've seen this previously on other machines with ciss.
> I wonder why ZFS didn't throw it out of the pool.

Hold on a minute.  An unrecovered read error does not necessarily mean
the drive is bad; it could mean that a read of one individual LBA
failed with sense key MEDIUM ERROR and ASC 0x11 (e.g. a bad block was
encountered).  I would check SMART stats on the disk (since these are
probably SATA given use of arcmsr(4)) and provide those.  *That* will
tell you if the disk is bad.  I'll help you decode the attribute
values if you provide them.
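
For reference, here is a rough sketch of how the CAM messages quoted
above can be picked apart.  It assumes that CAM prints the CDB bytes
and the asc/ascq pair in hex, and the function names (decode_read6,
decode_sense) are made up purely for illustration.  The lookup table
holds only the one ASC/ASCQ pair seen in this log, so treat it as a
reading aid, not a full decoder.

```python
import re

# Only the ASC/ASCQ pair that appears in the quoted log is listed here;
# a complete table would come from the SCSI Primary Commands spec.
ASC_TEXT = {(0x11, 0x00): "Unrecovered read error"}


def decode_read6(cdb_text):
    """Decode a READ(6) CDB given as space-separated hex bytes (assumed format)."""
    b = [int(x, 16) for x in cdb_text.split()]
    if len(b) != 6 or b[0] != 0x08:
        raise ValueError("not a READ(6) CDB")
    # READ(6) carries a 21-bit LBA: low 5 bits of byte 1, then bytes 2 and 3.
    lba = ((b[1] & 0x1F) << 16) | (b[2] << 8) | b[3]
    # A transfer length byte of 0 means 256 blocks for READ(6).
    blocks = b[4] or 256
    return lba, blocks


def decode_sense(line):
    """Split a 'SCSI sense:' line into sense key, ASC, ASCQ and a description."""
    m = re.search(r"SCSI sense: (.+?) asc:([0-9a-fA-F]+),([0-9a-fA-F]+)", line)
    if m is None:
        raise ValueError("no sense data found in line")
    key = m.group(1)
    asc = int(m.group(2), 16)
    ascq = int(m.group(3), 16)
    return key, asc, ascq, ASC_TEXT.get((asc, ascq), "unknown")


if __name__ == "__main__":
    lba, blocks = decode_read6("8 0 2 10 10 0")
    print("READ(6): LBA %d, %d blocks" % (lba, blocks))
    print(decode_sense(
        "(da4:arcmsr0:0:4:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered"))
```

Under those assumptions the failed command was a 16-block read
starting at LBA 528, which points at one small region of the disk
rather than at the drive as a whole.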

My understanding is that a single LBA read failure should not warrant
ZFS marking the disk UNAVAIL in the pool.  It should have incremented
the READ error counter and that's it.  Did you receive a *single* error
for the disk and then things went catatonic?

If the entire system got wedged (a soft wedge, e.g. kernel is still
alive but nothing's happening in userland), that could be a different
problem -- either with ZFS or arcmsr(4).  Does ZFS have some sort of
timeout value internal to itself where it will literally mark a disk
UNAVAIL in the case that repeated I/O transactions take "too long"?
What is its error recovery methodology?

Speaking strictly about Solaris 10 and ZFS: I have seen many, many times
a system "soft wedge" after repeated I/O errors (read or write) are
spewed out on the console for a single SATA disk (via AHCI), but only
when the disk is used as a sole root filesystem disk (no mirror/raidz).
My impression is that ZFS isn't the problem in this scenario.  In most
cases, post-mortem debugging on my part shows that disks encountered
some CRC errors (indicating cabling issues, etc.), sometimes as few as
2, but "something else" went crazy -- or possibly ZFS couldn't mark the
disk UNAVAIL (if it has that logic) because it's a single disk
associated with root.  Hardware in this scenario is Hitachi SATA disks
with an ICH ESB2 controller; software is Solaris 10 (Generic_142901-06)
with ZFS v15.

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |