zfs mirror recognizing disk failures

Tue Nov 16 10:37:55 UTC 2010

On Tue, Nov 16, 2010 at 11:24:48AM +0100, Olivier Smedts wrote:
> 2010/11/15 Michael Boers <michaelscotttech at gmail.com>:
> > Is there anything I can do to make a zfs mirror quicker to give up on a
> > flaky disk?
> >
> > I recently had a 100% zfs system crash when started to have some disk
> > errors.  I had hoped that by having a mirror, the system would survive this
> > type of error.  Instead it just hung.
> 
> You can offline the faulty drive.
> Also, I think you're interested in a feature like TLER :
> http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery
> But typical (cheap) drives don't implement it.

TLER wouldn't have helped with this problem.  TLER causes the drive to
internally "time out" a request submitted by the controller.  If you
read the output below, it appears that the CDB submitted to the drive was
intentionally aborted, and retries were exhausted.  Continued command
submissions to mpt0 kept timing out.

There's absolutely nothing (that I'm aware of) that TLER provides which
will cause the drive to "disconnect itself from the bus".  Furthermore,
since TLER is on a per-command basis, there's no guarantee that repeated
commands sent from the controller to the disk won't continue to
run into problems.  Just because TLER times out the command quicker
than the OS driver doesn't mean the drive will suddenly become usable.

So we're back to the original question, which is why mpt(4) didn't
choose to drop the SCSI drive from the LUN or bus, given the repetitive
nature of the failure and mpt's own internal timeouts getting reached.

And to answer the OP's original question: "is this a type of error zfs
can survive or do I need a hardware mirror to handle this type of
problem?", the answer is yes, ZFS can survive this situation perfectly
fine, but ZFS is at the mercy of the storage controller and controller
driver you choose to use.  It's not the job of the filesystem to tell
the storage controller "I hate this disk, get rid of it".
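
For what it's worth, the failure signature in the log quoted below is
easy to spot from userland, which would at least tell you when to
offline the disk by hand (as Olivier suggested).  What follows is only a
rough sketch, not anything mpt(4) or ZFS does for you.  The regular
expressions and the summarize() helper are my own assumptions, modelled
on the messages in this thread, and the sample lines are the OP's log
lines with the mail client's line wrapping undone.

```python
import re
from collections import Counter

# Assumed patterns, modelled on the kernel messages quoted in this thread.
# Other drivers or FreeBSD versions may word these messages differently.
RETRIES_RE = re.compile(r"\((da\d+):(mpt\d+):\d+:\d+:\d+\): Retries Exhausted")
TIMEOUT_RE = re.compile(r"kernel: (mpt\d+): request \S+ timed out")


def summarize(lines):
    """Count exhausted retries per disk and request timeouts per controller."""
    exhausted = Counter()
    timeouts = Counter()
    for line in lines:
        match = RETRIES_RE.search(line)
        if match:
            exhausted[match.group(1)] += 1
            continue
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts[match.group(1)] += 1
    return exhausted, timeouts


# Three lines taken from the log in this thread, unwrapped.
SAMPLE = [
    "Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): Retries Exhausted",
    "Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003c87a0:2838 "
    "timed out for ccb 0xffffff0103acc000 (req->ccb 0xffffff0103acc000)",
    "Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003c5110:2839 "
    "timed out for ccb 0xffffff035cab0800 (req->ccb 0xffffff035cab0800)",
]

if __name__ == "__main__":
    exhausted, timeouts = summarize(SAMPLE)
    for disk, count in exhausted.items():
        print(f"{disk}: retries exhausted {count} time(s)")
    for controller, count in timeouts.items():
        print(f"{controller}: {count} request timeout(s)")
```

Note the timeout messages only name the controller (mpt0), not the disk,
so a script like this can tie them to da2 only by way of the earlier
"Retries Exhausted" line.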

> > Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): SYNCHRONIZE CACHE(10).
> > CDB: 35 0 0 0 0 0 0 0 0 0
> > Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): CAM Status: SCSI Status
> > Error
> > Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): SCSI Status: Check
> > Condition
> > Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): ABORTED COMMAND asc:0,0
> > Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): No additional sense
> > information
> > Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): Retries Exhausted
> > Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003c87a0:2838 timed
> > out for ccb 0xffffff0103acc000 (req->ccb 0xffffff0103acc000)
> > Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003c5110:2839 timed
> > out for ccb 0xffffff035cab0800 (req->ccb 0xffffff035cab0800)
> > Nov 11 10:05:53 caprica kernel: mpt0: attempting to abort req
> > 0xffffff80003c87a0:2838 function 0
> > Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003bef30:2840 timed
> > out for ccb 0xffffff0007986800 (req->ccb 0xffffff0007986800)
> > Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003c8560:2841 timed
> > out for ccb 0xffffff032d985000 (req->ccb 0xffffff032d985000)
> > Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003bf320:2842 timed
> > out for ccb 0xffffff0103af2000 (req->ccb 0xffffff0103af2000)
> > Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003cbda0:2843 timed
> > out for ccb 0xffffff0103b0b000 (req->ccb 0xffffff0103b0b000)
> > Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003bfd40:2844 timed
> > out for ccb 0xffffff00102bf800 (req->ccb 0xffffff00102bf800)
> > Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003cad50:2845 timed
> > out for ccb 0xffffff01e6f33000 (req->ccb 0xffffff01e6f33000)
> > Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003caf00:2846 timed
> > out for ccb 0xffffff01e6f24800 (req->ccb 0xffffff01e6f24800)
> > Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003ccd60:2847 timed
> > out for ccb 0xffffff01308a4000 (req->ccb 0xffffff01308a4000)
> >
> > Is this a type of error zfs can survive or do I need a hardware mirror to
> > handle this type of problem?

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |