zfs mirror recognizing disk failures

Tue Nov 16 15:29:44 UTC 2010

On Nov 16, 2010, at 8:58 AM, Jeremy Chadwick wrote:

> On Tue, Nov 16, 2010 at 08:32:35AM -0500, Michael Boers wrote:
>> To answer Jeremy's question of "what happened next?"
>>
>> The machine was not serving web requests
>> The machine was not responsive via ssh
>> The machine was pingable
>>
>> After waiting about 15 minutes, I used IPMI to power down the
>> machine.
>> When it came back up, I found the enclosed errors in the log.
>>
>> If I am following your comments correctly, the fault for this lies
>> in the mpt system not giving up, which could be either a driver or a
>> firmware issue.  Is that correct?
>>
>> How do I protect against that?
>
> The fault, in my opinion -- and I urge others (especially those
> familiar with the driver) to correct me, because I am often wrong --
> lies with either the controller itself or mpt(4) not truly "giving up"
> after repetitive errors.  It could be a firmware bug/quirk, sure.  It
> could be a lot of things, or a combination of things.  I don't want to
> rule out anything.
>
> For example, at my workplace we use Solaris with Adaptec controllers,
> using a multitude of Fujitsu disks.  Everything is SCSI-3.  We
> regularly (at least once a week, usually more than that) see disk
> problems where either the disk falls off the bus unexpectedly, the
> drive itself "wedges" (resulting in the controller getting stuck in an
> infinite loop trying to talk to it) and won't unwedge without a full
> power-cycle (soft reset doesn't work), or in certain bad block
> circumstances the drive wedges long enough for the controller driver
> to break in a strange way (resulting in a system panic).  Each
> situation appears to be different; there are definitely situations
> where the disk is responsible, others which look like the controller
> is responsible, and others which look like driver issues.
>
> I'm not familiar with (read: have not used) mpt(4) controllers, but if
> my memory serves me right, people post about problems with them from
> time to time on FreeBSD.  Each incident has to be addressed separately.
>
> If you're asking for a workaround or "what should I do", the solution
> is to either change controllers (read: avoid mpt(4)), or figure out
> how/why the disk became wedged (or if it even did in the first place).
>
> Your original post contains no useful information about the hardware
> itself (mpt handles many controllers yet we know not what model, we
> know nothing about disk da2, etc.).  You're going to need to provide
> this.  Relevant dmesg output, camcontrol devlist, camcontrol inquiry,
> and smartctl -a output for the disk would be useful (assuming the
> controller supports passthrough).
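
For anyone gathering the same reports, here is a minimal sketch of collecting them for one disk. It assumes Python 3.7 or later on the affected host; the function names are made up for illustration, smartctl comes from smartmontools rather than the base system, and it may need an extra -d option depending on how the controller passes commands through.

```python
import shutil
import subprocess


def diagnostic_commands(disk):
    """Return the argv lists for the reports requested above, for one disk (e.g. 'da2')."""
    return [
        ["dmesg"],
        ["camcontrol", "devlist"],
        ["camcontrol", "inquiry", disk],
        ["smartctl", "-a", f"/dev/{disk}"],
    ]


def collect(disk):
    """Run each command that is installed and return {command line: combined output}."""
    report = {}
    for argv in diagnostic_commands(disk):
        key = " ".join(argv)
        if shutil.which(argv[0]) is None:
            # Tool is not installed (or not in PATH); note it and move on.
            report[key] = "(command not found)"
            continue
        result = subprocess.run(argv, capture_output=True, text=True)
        report[key] = result.stdout + result.stderr
    return report
```

Calling collect("da2") as root and pasting the resulting text into a reply would cover the items listed.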

Thanks for the detailed response; it has given me some things to think
about.  You are right, I had not posted much about the machine in
question.  For those interested now, or who may run across this in the
archives, here it is (edited and partially reconstructed from backups
of the log files):

The machine is a Dell PowerEdge 2970 with SAS 6/iR Integrated, x6 Backplane

Aug 24 05:40:41 caprica kernel: FreeBSD 8.0-RELEASE #0: Fri Jan 29 14:17:29 EST 2010
Aug 24 05:40:41 caprica kernel: CPU: Quad-Core AMD Opteron(tm) Processor 2387 (2793.03-MHz K8-class CPU)
Aug 24 05:40:41 caprica kernel: real memory  = 17179869184 (16384 MB)
Aug 24 05:40:41 caprica kernel: FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
Aug 24 05:40:41 caprica kernel: FreeBSD/SMP: 1 package(s) x 4 core(s)
Aug 24 05:40:41 caprica kernel: mpt0: <LSILogic SAS/SATA Adapter> port 0xec00-0xecff mem 0xe9fec000-0xe9feffff,0xe9ff0000-0xe9ffffff irq 37 at device 0.0 on pci7
Aug 24 05:40:41 caprica kernel: mpt0: [ITHREAD]
Aug 24 05:40:41 caprica kernel: mpt0: MPI Version=1.5.18.0
Aug 24 05:40:41 caprica kernel: mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
Aug 24 05:40:41 caprica kernel: mpt0: 0 Active Volumes (2 Max)
Aug 24 05:40:41 caprica kernel: mpt0: 0 Hidden Drive Members (14 Max)
Aug 24 05:40:41 caprica kernel: ZFS filesystem version 13
Aug 24 05:40:41 caprica kernel: ZFS storage pool version 13
Aug 24 05:40:41 caprica kernel: Timecounters tick every 1.000 msec
Aug 24 05:40:41 caprica kernel: da0: <ATA WDC WD1602ABKS-1 3B04> Fixed Direct Access SCSI-5 device
Aug 24 05:40:41 caprica kernel: da0: 300.000MB/s transfers
Aug 24 05:40:41 caprica kernel: da0: Command Queueing enabled
Aug 24 05:40:41 caprica kernel: da0: 152587MB (312500000 512 byte sectors: 255H 63S/T 19452C)
Aug 24 05:40:41 caprica kernel: da1 at mpt0 bus 0 target 1 lun 0
Aug 24 05:40:41 caprica kernel: da1: <ATA WDC WD5002ABYS-1 3B04> Fixed Direct Access SCSI-5 device
Aug 24 05:40:41 caprica kernel: da1: 300.000MB/s transfers
Aug 24 05:40:41 caprica kernel: da1: Command Queueing enabled
Aug 24 05:40:41 caprica kernel: da1: 476940MB (976773168 512 byte sectors: 255H 63S/T 60801C)
Aug 24 05:40:41 caprica kernel: ses0 at mpt0 bus 0 target 8 lun 0
Aug 24 05:40:41 caprica kernel: ses0: <DP BACKPLANE 1.05> Fixed Enclosure Services SCSI-5 device
Aug 24 05:40:41 caprica kernel: ses0: 300.000MB/s transfers
Aug 24 05:40:41 caprica kernel: ses0: SCSI-3 SES Device

I added the mirror disks later:

Oct 15 10:47:21 caprica kernel: da2 at mpt0 bus 0 target 3 lun 0
Oct 15 10:47:21 caprica kernel: da2: <ATA WDC WD5002ABYS-1 3B04> Fixed Direct Access SCSI-5 device
Oct 15 10:47:21 caprica kernel: da2: 300.000MB/s transfers
Oct 15 10:47:21 caprica kernel: da2: Command Queueing enabled
Oct 15 10:47:21 caprica kernel: da2: 476940MB (976773168 512 byte sectors: 255H 63S/T 60801C)
Oct 15 10:47:21 caprica kernel: da3 at mpt0 bus 0 target 2 lun 0
Oct 15 10:47:21 caprica kernel: da3: <ATA WDC WD1602ABKS-1 3B05> Fixed Direct Access SCSI-5 device
Oct 15 10:47:21 caprica kernel: da3: 300.000MB/s transfers
Oct 15 10:47:21 caprica kernel: da3: Command Queueing enabled
Oct 15 10:47:21 caprica kernel: da3: 152587MB (312500000 512 byte sectors: 255H 63S/T 19452C)

I then started getting the occasional error on da3.  I did not notice
these until after the crash; I am now using swatch to check for mpt
errors.

Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): WRITE(10). CDB: 2a 0 2 4 58 a2 0 0 80 0
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): CAM Status: SCSI Status Error
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): SCSI Status: Check Condition
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): UNIT ATTENTION asc:29,0
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): Power on, reset, or bus device reset occurred
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): Retrying Command (per Sense Data)
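
For the archives, a rough sketch of the kind of check I mean, written in Python rather than as a swatch rule. It only counts "CAM Status" reports per disk from syslog-style lines shaped like the da3 messages above; the regular expression and the function name are my own assumptions for illustration, not anything provided by swatch or mpt(4), and the sample lines are copied from the log excerpt.

```python
import re
from collections import Counter

# Matches CAM peripheral messages such as
# "(da3:mpt0:0:2:0): CAM Status: SCSI Status Error"
# and captures the disk name, the controller, and the message text.
CAM_LINE = re.compile(r"\((da\d+):(mpt\d+):\d+:\d+:\d+\):\s*(.*)")


def count_cam_errors(lines):
    """Count 'CAM Status' reports per disk attached to an mpt controller."""
    errors = Counter()
    for line in lines:
        match = CAM_LINE.search(line)
        if match and match.group(3).startswith("CAM Status:"):
            errors[match.group(1)] += 1
    return errors


sample = [
    "Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): WRITE(10). CDB: 2a 0 2 4 58 a2 0 0 80 0",
    "Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): CAM Status: SCSI Status Error",
    "Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): SCSI Status: Check Condition",
    "Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): Retrying Command (per Sense Data)",
]

for disk, count in sorted(count_cam_errors(sample).items()):
    print(f"{disk}: {count} CAM error report(s)")
```

Feeding it the lines of /var/log/messages instead of the sample list would give a per-disk tally, which is enough to notice a disk that starts erroring days before a hang.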

camcontrol devlist output (partially reconstructed, as the drives are
currently on my desk):

<ATA WDC WD1602ABKS-1 3B04>        at scbus0 target 0 lun 0 (pass0,da0)
<ATA WDC WD5002ABYS-1 3B04>        at scbus0 target 1 lun 0 (pass1,da1)
<ATA WDC WD5002ABYS-1 3B04>        at scbus0 target 2 lun 0 (pass2,da2)
<ATA WDC WD1602ABKS-1 3B04>        at scbus0 target 3 lun 0 (pass3,da3)
<DP BACKPLANE 1.05>                at scbus0 target 8 lun 0 (ses0,pass4)

This is all I can provide at this time.  I appreciate all of the help
provided thus far and in the future.  I am going to check into BIOS
updates for the SAS 6/iR, and I am in the process of moving to 8.1 for
better mpt support.

Thanks, again

>
> Finally, be aware that trying to chase down a problem of this nature
> is often time-consuming.  Sometimes it's not worth it at all, and the
> time is instead better spent replacing all of the hardware involved.
> If it happens again after that, change vendors or hardware controllers
> (or disks) used.  That's just how it goes.  I tend to stick to Intel
> ICHxx or ESB SATA controllers for this reason; they're well-tested on
> FreeBSD.  And I don't use hardware RAID at all for many reasons
> (separate topic).
>
> -- 
> | Jeremy Chadwick                                   jdc at parodius.com |
> | Parodius Networking                       http://www.parodius.com/ |
> | UNIX Systems Administrator                  Mountain View, CA, USA |
> | Making life hard for others since 1977.              PGP: 4BD6C0CB |
>