zfs mirror recognizing disk failures
Michael Boers
michaelscotttech at gmail.com
Tue Nov 16 15:29:44 UTC 2010
On Nov 16, 2010, at 8:58 AM, Jeremy Chadwick wrote:
> On Tue, Nov 16, 2010 at 08:32:35AM -0500, Michael Boers wrote:
>> To answer Jeremy's question of "what happened next?"
>>
>> The machine was not serving web requests
>> The machine was not responsive via ssh
>> The machine was pingable
>>
>> After waiting about 15 minutes, I used IPMI to power
>> down the machine.
>> When it came back up, I found the enclosed errors in the log.
>>
>> If I am following your comments correctly, the fault for this lies
>> in the mpt system not giving up which could either be a driver or a
>> firmware issue. Is that correct?
>>
>> How do I protect against that?
>
> The fault, in my opinion -- and I urge others (especially those familiar
> with the driver) to correct me, because I am often wrong -- lies either
> with the controller itself or with mpt(4) not truly "giving up" after
> repeated errors. It could be a firmware bug/quirk, sure. It could be
> a lot of things, or a combination of things. I don't want to rule out
> anything.
>
> For example, at my workplace we use Solaris with Adaptec controllers,
> using a multitude of Fujitsu disks. Everything is SCSI-3. We regularly
> (at least once a week, usually more than that) see disk problems where
> either the disk falls off the bus unexpectedly, the drive itself
> "wedges" (resulting in the controller getting stuck in an infinite loop
> trying to talk to it) and won't unwedge without a full power-cycle
> (soft reset doesn't work), or in certain bad block circumstances the
> drive wedges long enough for the controller driver to break in a
> strange way (resulting in a system panic). Each situation appears to
> be different; there are definitely situations where the disk is
> responsible, others which look like the controller is responsible, and
> others which look like driver issues.
>
> I'm not familiar with (read: have not used) mpt(4) controllers, but if
> my memory serves me right, people post about problems with them from
> time to time on FreeBSD. Each incident has to be addressed separately.
>
> If you're asking for a workaround or "what should I do", the solution
> is to either change controllers (read: avoid mpt(4)), or figure out
> how/why the disk became wedged (or if it even did in the first place).
>
> Your original post contains no useful information about the hardware
> itself (mpt handles many controllers yet we know not what model, we
> know nothing about disk da2, etc.). You're going to need to provide
> this. Relevant dmesg output, camcontrol devlist, camcontrol inquiry,
> and smartctl -a output for the disk would be useful (assuming the
> controller supports passthrough).
Thanks for the detailed response; it has given me some things to think
about. You are right, I had not posted much about the machine in
question. For those interested now, or who may run across this in the
archives, here it is (edited and partially reconstructed from
backups of the log files):
The machine is a Dell PowerEdge 2970 with SAS 6/iR Integrated, x6 Backplane
Aug 24 05:40:41 caprica kernel: FreeBSD 8.0-RELEASE #0: Fri Jan 29 14:17:29 EST 2010
Aug 24 05:40:41 caprica kernel: CPU: Quad-Core AMD Opteron(tm) Processor 2387 (2793.03-MHz K8-class CPU)
Aug 24 05:40:41 caprica kernel: real memory = 17179869184 (16384 MB)
Aug 24 05:40:41 caprica kernel: FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
Aug 24 05:40:41 caprica kernel: FreeBSD/SMP: 1 package(s) x 4 core(s)
Aug 24 05:40:41 caprica kernel: mpt0: <LSILogic SAS/SATA Adapter> port 0xec00-0xecff mem 0xe9fec000-0xe9feffff,0xe9ff0000-0xe9ffffff irq 37 at device 0.0 on pci7
Aug 24 05:40:41 caprica kernel: mpt0: [ITHREAD]
Aug 24 05:40:41 caprica kernel: mpt0: MPI Version=1.5.18.0
Aug 24 05:40:41 caprica kernel: mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
Aug 24 05:40:41 caprica kernel: mpt0: 0 Active Volumes (2 Max)
Aug 24 05:40:41 caprica kernel: mpt0: 0 Hidden Drive Members (14 Max)
Aug 24 05:40:41 caprica kernel: ZFS filesystem version 13
Aug 24 05:40:41 caprica kernel: ZFS storage pool version 13
Aug 24 05:40:41 caprica kernel: Timecounters tick every 1.000 msec
Aug 24 05:40:41 caprica kernel: da0: <ATA WDC WD1602ABKS-1 3B04> Fixed Direct Access SCSI-5 device
Aug 24 05:40:41 caprica kernel: da0: 300.000MB/s transfers
Aug 24 05:40:41 caprica kernel: da0: Command Queueing enabled
Aug 24 05:40:41 caprica kernel: da0: 152587MB (312500000 512 byte sectors: 255H 63S/T 19452C)
Aug 24 05:40:41 caprica kernel: da1 at mpt0 bus 0 target 1 lun 0
Aug 24 05:40:41 caprica kernel: da1: <ATA WDC WD5002ABYS-1 3B04> Fixed Direct Access SCSI-5 device
Aug 24 05:40:41 caprica kernel: da1: 300.000MB/s transfers
Aug 24 05:40:41 caprica kernel: da1: Command Queueing enabled
Aug 24 05:40:41 caprica kernel: da1: 476940MB (976773168 512 byte sectors: 255H 63S/T 60801C)
Aug 24 05:40:41 caprica kernel: ses0 at mpt0 bus 0 target 8 lun 0
Aug 24 05:40:41 caprica kernel: ses0: <DP BACKPLANE 1.05> Fixed Enclosure Services SCSI-5 device
Aug 24 05:40:41 caprica kernel: ses0: 300.000MB/s transfers
Aug 24 05:40:41 caprica kernel: ses0: SCSI-3 SES Device
I added the mirror disks later:
Oct 15 10:47:21 caprica kernel: da2 at mpt0 bus 0 target 3 lun 0
Oct 15 10:47:21 caprica kernel: da2: <ATA WDC WD5002ABYS-1 3B04> Fixed Direct Access SCSI-5 device
Oct 15 10:47:21 caprica kernel: da2: 300.000MB/s transfers
Oct 15 10:47:21 caprica kernel: da2: Command Queueing enabled
Oct 15 10:47:21 caprica kernel: da2: 476940MB (976773168 512 byte sectors: 255H 63S/T 60801C)
Oct 15 10:47:21 caprica kernel: da3 at mpt0 bus 0 target 2 lun 0
Oct 15 10:47:21 caprica kernel: da3: <ATA WDC WD1602ABKS-1 3B05> Fixed Direct Access SCSI-5 device
Oct 15 10:47:21 caprica kernel: da3: 300.000MB/s transfers
Oct 15 10:47:21 caprica kernel: da3: Command Queueing enabled
Oct 15 10:47:21 caprica kernel: da3: 152587MB (312500000 512 byte sectors: 255H 63S/T 19452C)
I then started getting the occasional error on da3. I did not notice
these until after the crash; I am now using swatch to check for mpt
errors.
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): WRITE(10). CDB: 2a 0 2 4 58 a2 0 0 80 0
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): CAM Status: SCSI Status Error
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): SCSI Status: Check Condition
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): UNIT ATTENTION asc: 29,0
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): Power on, reset, or bus device reset occurred
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): Retrying Command (per Sense Data)
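
For anyone reading this in the archives, the CDB in the first error line can be decoded by hand: in a 10-byte READ/WRITE command, bytes 2-5 hold the big-endian logical block address and bytes 7-8 hold the transfer length in blocks. The following is a minimal Python sketch of that decoding, not the swatch configuration mentioned above. It assumes each kernel message has been rejoined onto a single line, and the names `CDB_RE`, `decode_cdb10`, and `find_failed_commands` are made up for illustration.

```python
import re

# Two lines taken from the log excerpt, each rejoined onto one line.
LOG = """\
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): WRITE(10). CDB: 2a 0 2 4 58 a2 0 0 80 0
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): CAM Status: SCSI Status Error
"""

# Matches the "(daN:mptN:bus:target:lun): READ(10)/WRITE(10). CDB: ..." form
# seen in this log only; other drivers or FreeBSD versions may print differently.
CDB_RE = re.compile(
    r"\((da\d+):mpt\d+:[\d:]+\): (READ|WRITE)\(10\)\. CDB: ([0-9a-f ]+)$"
)


def decode_cdb10(hex_bytes):
    """Return (lba, blocks) for a 10-byte READ/WRITE CDB given as hex words."""
    cdb = [int(word, 16) for word in hex_bytes.split()]
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2-5: block address
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7-8: transfer length
    return lba, blocks


def find_failed_commands(text):
    """Collect (device, operation, lba, blocks) for every logged 10-byte CDB."""
    found = []
    for line in text.splitlines():
        match = CDB_RE.search(line.strip())
        if match:
            lba, blocks = decode_cdb10(match.group(3))
            found.append((match.group(1), match.group(2), lba, blocks))
    return found


if __name__ == "__main__":
    for dev, op, lba, blocks in find_failed_commands(LOG):
        print(f"{dev}: {op} of {blocks} blocks at LBA {lba}")
```

For the line above this gives a WRITE of 128 blocks (64 KiB with 512-byte sectors) at LBA 33839266 on da3. Note that the sense data reports a reset (UNIT ATTENTION, 29,0) rather than a media error at that address, so the LBA by itself does not point to a bad block.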
camcontrol devlist output (partially reconstructed, as the drives are
currently on my desk):
<ATA WDC WD1602ABKS-1 3B04> at scbus0 target 0 lun 0 (pass0,da0)
<ATA WDC WD5002ABYS-1 3B04> at scbus0 target 1 lun 0 (pass1,da1)
<ATA WDC WD5002ABYS-1 3B04> at scbus0 target 2 lun 0 (pass2,da2)
<ATA WDC WD1602ABKS-1 3B04> at scbus0 target 3 lun 0 (pass2,da3)
<DP BACKPLANE 1.05> at scbus0 target 8 lun 0 (ses0,pass4)
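
Since the listing above is a reconstruction, it is worth sanity-checking mechanically. The Python sketch below parses lines in this format and reports any device name that appears on more than one line. The regular expression is an assumption based only on the five lines shown here, and `LINE_RE`, `parse_devlist`, and `duplicate_devices` are illustrative names.

```python
import re
from collections import Counter

# The listing exactly as given above.
DEVLIST = """\
<ATA WDC WD1602ABKS-1 3B04> at scbus0 target 0 lun 0 (pass0,da0)
<ATA WDC WD5002ABYS-1 3B04> at scbus0 target 1 lun 0 (pass1,da1)
<ATA WDC WD5002ABYS-1 3B04> at scbus0 target 2 lun 0 (pass2,da2)
<ATA WDC WD1602ABKS-1 3B04> at scbus0 target 3 lun 0 (pass2,da3)
<DP BACKPLANE 1.05> at scbus0 target 8 lun 0 (ses0,pass4)
"""

# Pattern derived from the lines above; real output may vary in spacing.
LINE_RE = re.compile(
    r"^<(?P<ident>[^>]+)>\s+at\s+scbus(?P<bus>\d+)\s+target\s+(?P<target>\d+)"
    r"\s+lun\s+(?P<lun>\d+)\s+\((?P<devs>[^)]+)\)$"
)


def parse_devlist(text):
    """Turn each recognised line into a dict; unrecognised lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        entries.append({
            "ident": match.group("ident"),
            "bus": int(match.group("bus")),
            "target": int(match.group("target")),
            "lun": int(match.group("lun")),
            "devices": match.group("devs").split(","),
        })
    return entries


def duplicate_devices(entries):
    """Return device names that are attached to more than one entry."""
    counts = Counter(dev for entry in entries for dev in entry["devices"])
    return sorted(dev for dev, seen in counts.items() if seen > 1)


if __name__ == "__main__":
    entries = parse_devlist(DEVLIST)
    print(len(entries), "devices; duplicated names:", duplicate_devices(entries))
```

Run against the listing, this flags pass2, which is attached to both da2 and da3. That is most likely a slip in the reconstruction rather than something the system reported. The listing also disagrees with the dmesg excerpt, which shows da2 at target 3, da3 at target 2, and firmware 3B05 on da3, so the dmesg lines are probably the more reliable record.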
This is all I can provide at this time. I appreciate all of the help
provided thus far and in future. I am going to check into BIOS
updates for the SAS 6/iR and I am in the process of moving to 8.1 for
better mpt support.
Thanks, again
>
> Finally, be aware that trying to chase down a problem of this nature
> is often time-consuming. Sometimes it's not worth it at all, and the
> time is better spent replacing all of the hardware involved. If it
> happens again after that, change vendors or hardware controllers (or
> disks) used. That's just how it goes. I tend to stick to Intel ICHxx
> or ESB SATA controllers for this reason; they're well-tested on
> FreeBSD. And I don't use hardware RAID at all for many reasons
> (separate topic).
>
> --
> | Jeremy Chadwick jdc at parodius.com |
> | Parodius Networking http://www.parodius.com/ |
> | UNIX Systems Administrator Mountain View, CA, USA |
> | Making life hard for others since 1977. PGP: 4BD6C0CB |
>
More information about the freebsd-fs
mailing list