zfs mirror recognizing disk failures

Tue Nov 16 13:58:41 UTC 2010

On Tue, Nov 16, 2010 at 08:32:35AM -0500, Michael Boers wrote:
> To answer Jeremy's question of "what happened next?"
> 
> The machine was not serving web requests
> The machine was not responsive via ssh
> The machine was pingable
> 
> after waiting about 15 minutes, I used the ipmi protocol to power
> down the machine.
> When it came back up, I found the enclosed errors in the log.
> 
> If I am following your comments correctly, the fault for this lies
> in the mpt system not giving up which could either be a driver or a
> firmware issue.  Is that correct?
> 
> How do I protect against that?

The fault, in my opinion -- and I urge others (especially those familiar
with the driver) to correct me, because I am often wrong -- lies with
either the controller itself or mpt(4) not truly "giving up"
after repetitive errors.  It could be a firmware bug/quirk, sure.  It
could be a lot of things, or a combination of things.  I don't want to
rule out anything.

For example, at my workplace we use Solaris with Adaptec controllers,
using a multitude of Fujitsu disks.  Everything is SCSI-3.  We regularly
(at least once a week, usually more than that) see disk problems where
either the disk falls off the bus unexpectedly, the drive itself
"wedges" (resulting in the controller getting stuck in an infinite loop
trying to talk to it) and won't unwedge without a full power-cycle (soft
reset doesn't work), or in certain bad block circumstances the drive
wedges long enough for the controller driver to break in a strange way
(resulting in a system panic).  Each situation appears to be different;
there are definitely situations where the disk is responsible, others
which look like the controller is responsible, and others which look
like driver issues.

I'm not familiar with (read: have not used) mpt(4) controllers, but if my
memory serves me right, people post about problems with them from time
to time on FreeBSD.  Each incident has to be addressed separately.

If you're asking for a workaround or "what should I do", the solution is
to either change controllers (read: avoid mpt(4)), or figure out how/why
the disk became wedged (or if it even did in the first place).

Your original post contains no useful information about the hardware
itself (mpt handles many controllers yet we know not what model, we know
nothing about disk da2, etc.).  You're going to need to provide this.
Relevant dmesg output, camcontrol devlist, camcontrol inquiry, and
smartctl -a output for the disk would be useful (assuming the controller
supports passthrough).
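
Something along these lines would gather all of that into one file you
can attach to a reply.  Treat it as a rough sketch rather than a tested
tool: the run helper, the output filename, and the default of da2 (taken
from your post) are placeholders I picked, and smartctl has to be
installed separately (it is part of smartmontools).

```shell
#!/bin/sh
# Collect the diagnostics listed above for one disk into a single file.
# DISK defaults to da2 (the disk from the original report); OUT is a
# hypothetical output filename chosen for this sketch.
DISK="${1:-da2}"
OUT="${2:-diag-$DISK.txt}"

# Print a header naming the command, then run it.  A failing or missing
# command is noted in the output instead of aborting the whole run.
run() {
    echo "### $*"
    "$@" 2>&1 || echo "(command failed or unavailable: $1)"
}

{
    run dmesg
    run camcontrol devlist
    run camcontrol inquiry "$DISK"
    run smartctl -a "/dev/$DISK"
} > "$OUT"
```

Run it as root, since camcontrol and smartctl generally need access to
the device nodes.  If the controller does not support passthrough, the
smartctl section will just contain an error, which is itself worth
knowing.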

Finally, be aware that trying to chase down a problem of this nature is
often time-consuming.  Sometimes it's not worth it at all, and the time
is better spent replacing all of the hardware involved.  If it happens
again after that, change vendors or hardware controllers (or disks)
used.  That's just how it goes.  I tend to stick to Intel ICHxx or ESB
SATA controllers for this reason; they're well-tested on FreeBSD.  And I
don't use hardware RAID at all for many reasons (separate topic).

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |