zfs mirror recognizing disk failures

Tue Nov 16 08:47:34 UTC 2010

On Mon, Nov 15, 2010 at 05:03:30PM -0500, Michael Boers wrote:
> Is there anything I can do to make a zfs mirror quicker to give up
> on a flaky disk?
> 
> I recently had a 100% zfs system crash when it started to have some
> disk errors.  I had hoped that by having a mirror, the system would
> survive this type of error.  Instead it just hung.
> 
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): SYNCHRONIZE
> CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): CAM Status: SCSI
> Status Error
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): SCSI Status: Check
> Condition
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): ABORTED COMMAND
> asc:0,0
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): No additional
> sense information
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): Retries Exhausted
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003c87a0:2838 timed out for ccb 0xffffff0103acc000
> (req->ccb 0xffffff0103acc000)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003c5110:2839 timed out for ccb 0xffffff035cab0800
> (req->ccb 0xffffff035cab0800)
> Nov 11 10:05:53 caprica kernel: mpt0: attempting to abort req
> 0xffffff80003c87a0:2838 function 0
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003bef30:2840 timed out for ccb 0xffffff0007986800
> (req->ccb 0xffffff0007986800)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003c8560:2841 timed out for ccb 0xffffff032d985000
> (req->ccb 0xffffff032d985000)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003bf320:2842 timed out for ccb 0xffffff0103af2000
> (req->ccb 0xffffff0103af2000)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003cbda0:2843 timed out for ccb 0xffffff0103b0b000
> (req->ccb 0xffffff0103b0b000)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003bfd40:2844 timed out for ccb 0xffffff00102bf800
> (req->ccb 0xffffff00102bf800)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003cad50:2845 timed out for ccb 0xffffff01e6f33000
> (req->ccb 0xffffff01e6f33000)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003caf00:2846 timed out for ccb 0xffffff01e6f24800
> (req->ccb 0xffffff01e6f24800)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003ccd60:2847 timed out for ccb 0xffffff01308a4000
> (req->ccb 0xffffff01308a4000)
> 
> Is this a type of error zfs can survive or do I need a hardware
> mirror to handle this type of problem?

This looks to me like a problem/quirk with mpt(4) and not ZFS.  What
happened after this point?  Did the mpt driver drop the disk off the
bus (in CAM)?  ZFS would notice once that happens.  So I think the
fault lies with either the mpt card or its driver.

My point: ZFS shouldn't be responsible for "figuring out if
communication with the disk is messed up" -- that's the job of the
storage controller and the storage controller driver.
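
To see which layer is complaining, it can help to tally the kernel
messages per controller and per disk.  Below is a minimal Python sketch
of that idea, not a tested tool.  The regular expressions are
assumptions modelled only on the excerpt quoted above, and the names
unwrap() and tally() are hypothetical.  It rejoins lines wrapped by the
mail client, then counts "timed out" requests per controller and
"Retries Exhausted" events per CAM device.

```python
import re
from collections import Counter

# Assumed patterns, derived only from the log excerpt in this thread.
STAMP_RE = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")
TIMEOUT_RE = re.compile(r"kernel: (\w+): request \S+ timed out")
EXHAUSTED_RE = re.compile(r"kernel: \((\w+):\w+:\d+:\d+:\d+\): Retries Exhausted")


def unwrap(lines):
    """Rejoin wrapped continuation lines onto the syslog entry they belong to."""
    entries = []
    for raw in lines:
        # Drop mail quote markers ("> ") and surrounding whitespace.
        line = raw.lstrip("> ").rstrip()
        if not line:
            continue
        if STAMP_RE.match(line) or not entries:
            entries.append(line)          # a new timestamped entry
        else:
            entries[-1] += " " + line     # continuation of the previous entry
    return entries


def tally(lines):
    """Return (timeouts per controller, exhausted retries per device)."""
    timeouts = Counter()
    exhausted = Counter()
    for entry in unwrap(lines):
        m = TIMEOUT_RE.search(entry)
        if m:
            timeouts[m.group(1)] += 1
            continue
        m = EXHAUSTED_RE.search(entry)
        if m:
            exhausted[m.group(1)] += 1
    return timeouts, exhausted
```

Calling tally() on the lines of a saved kernel log returns two
counters.  Many timeouts attributed to the controller with few or no
per-disk errors would be consistent with a controller or driver
problem rather than anything ZFS can act on.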

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |