Outage related to hard drive failure

Sat Feb 8 21:28:24 UTC 2014

Dear Brian,

Unfortunately I just can report about same issue I ran into a few weeks 
ago. We run a FreeNAS server which hosts the VM images, and serves them 
over NFS. One time I began receiving notifications that the virtual 
hosts served by this NAS went down. I checked the NAS, found that one 
drive attached to a mps/lsi HBA stopped responding to the HBA at all, 
and thus blocked the whole pool. That was also strange to me that 
neither a timeout event happened, so actually zfs thought that all 
drives are fine, just one blocked the whole pool IOs. And unfortunately 
I even could not offline that drive, only a hard reset helped. The drive 
was so unresponsive that the bootup for FreeNAS also took long, but at 
least that time zfs somehow noticed the drive is missing, and removed it 
from the pool. And after that the pool worked fine, of course in a 
degraded state, but healthy, we could initiate a replacement.

I only have the logs for the bootup:
mps0: mpssas_scsiio_timeout checking sc 0xffffff8000fd1000 cm 
0xffffff800100c0a0
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 
500 command timeout cm 0xffffff800100c0a0 ccb 0xfffffe0010f03000
mps0: mpssas_alloc_tm freezing simq
mps0: timedout cm 0xffffff800100c0a0 allocated tm 0xffffff8000fe47b0
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 
500 completed timedout cm 0xffffff800100c0a0 ccb 0xfffffe0010f03000 
during recovery ioc 8048 scsi 0 state c xfer 0
(noperiph:mps0:0:14:0): SMID 6 abort TaskMID 500 status 0x0 code 0x0 
count 1
(noperiph:mps0:0:14:0): SMID 6 finished recovery after aborting TaskMID 
500
mps0: mpssas_free_tm releasing simq
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe14:mps0:0:14:0): CAM status: Command timeout
(probe14:mps0:0:14:0): Retrying command
mps0: mpssas_scsiio_timeout checking sc 0xffffff8000fd1000 cm 
0xffffff8001032f50
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 
986 command timeout cm 0xffffff8001032f50 ccb 0xfffffe0010f03000
mps0: mpssas_alloc_tm freezing simq
mps0: timedout cm 0xffffff8001032f50 allocated tm 0xffffff8000fe48f8
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 
986 completed timedout cm 0xffffff8001032f50 ccb 0xfffffe0010f03000 
during recovery ioc 8048 scsi 0 state c xfer 0
(noperiph:mps0:0:14:0): SMID 7 abort TaskMID 986 status 0x0 code 0x0 
count 1
(noperiph:mps0:0:14:0): SMID 7 finished recovery after aborting TaskMID 
986
mps0: mpssas_free_tm releasing simq
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe14:mps0:0:14:0): CAM status: Command timeout
(probe14:mps0:0:14:0): Retrying command
mps0: mpssas_scsiio_timeout checking sc 0xffffff8000fd1000 cm 
0xffffff8001010c38
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 
559 command timeout cm 0xffffff8001010c38 ccb 0xfffffe0010f03000
mps0: mpssas_alloc_tm freezing simq
mps0: timedout cm 0xffffff8001010c38 allocated tm 0xffffff8000fe4a40
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 
559 completed timedout cm 0xffffff8001010c38 ccb 0xfffffe0010f03000 
during recovery ioc 8048 scsi 0 state c xfer 0
(noperiph:mps0:0:14:0): SMID 8 abort TaskMID 559 status 0x0 code 0x0 
count 1
(noperiph:mps0:0:14:0): SMID 8 finished recovery after aborting TaskMID 
559
mps0: mpssas_free_tm releasing simq
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe14:mps0:0:14:0): CAM status: Command timeout
(probe14:mps0:0:14:0): Retrying command
mps0: mpssas_scsiio_timeout checking sc 0xffffff8000fd1000 cm 
0xffffff8001007278
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 
439 command timeout cm 0xffffff8001007278 ccb 0xfffffe0010f03000
mps0: mpssas_alloc_tm freezing simq
mps0: timedout cm 0xffffff8001007278 allocated tm 0xffffff8000fe4b88
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00 length 36 SMID 
439 completed timedout cm 0xffffff8001007278 ccb 0xfffffe0010f03000 
during recovery ioc 8048 scsi 0 state c xfer 0
(noperiph:mps0:0:14:0): SMID 9 abort TaskMID 439 status 0x0 code 0x0 
count 1
(noperiph:mps0:0:14:0): SMID 9 finished recovery after aborting TaskMID 
439
mps0: mpssas_free_tm releasing simq
(probe14:mps0:0:14:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe14:mps0:0:14:0): CAM status: Command timeout
(probe14:mps0:0:14:0): Retrying command

A side note is that the drive was removed from its hot-swap bay, then 
reinserted, and since then it is working fine. That is a seagate 
ST32000645SS. And no errors reported.

Regards,

2014-02-08 21:02 időpontban Brian Gardner ezt írta:
> Last year upgraded my production servers from ufs on adaptec raid
> controllers to zfs raidz on lsi controllers.  Last week I had an
> outage, the culprit being a failed hard in my raidz.  My log was
> littered with kernel messages such as the ones below.  About 20
> minutes after the first message some aspects of my host where hung,
> and I was notified that my site was down.  I noticed these messages in
> my log but running a zfs status however showed only a few checksum
> errors and the failed drive didn't get degraded.  Manually degrading
> the drive solved my problems.  This seems very odd to me.  It's almost
> as if zfs wasn't getting messages from the lsi driver regarding these
> read/write failures.  Do I need to tune something in the mps/lsi or
> zfs drivers to help it deal with failures?  I'm running FreeBSD
> 8.3-Release-p3 with the generic kernel, nothing unusual about my setup
> other than I use jails extensively.  There are four drives in the
> raidz configuration in question:
> 
> zpool status
>   pool: storage
>  state: ONLINE
>   scan: resilvered 67.6G in 2h21m with 0 errors on Tue Aug  6 00:44:14 
> 2013
> config:
> 
> 	NAME        STATE     READ WRITE CKSUM
> 	storage     ONLINE       0     0     0
> 	  raidz1-0  ONLINE       0     0     0
> 	    da0     ONLINE       0     0     0
> 	    da1     ONLINE       0     0     0
> 	    da2     ONLINE       0     0     0
> 	    da3     ONLINE       0     0     0
> 
> Jan 29 19:03:18 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a
> 0 9 72 d8 13 0 0 1d 0
> Jan 29 19:03:18 host2 kernel: (da0:mpslsi0:0:0:0): CAM status: SCSI 
> Status Error
> Jan 29 19:03:18 host2 kernel: (da0:mpslsi0:0:0:0): SCSI status: Check 
> Condition
> Jan 29 19:03:18 host2 kernel: (da0:mpslsi0:0:0:0): SCSI sense:
> Deferred error: MEDIUM ERROR info:97ef13b asc:15,1 (Mechanical
> positioning error) actual retry count: 15
> Jan 29 19:04:00 host2 kernel: (da0:mpslsi0:0:0:0): READ(10). CDB: 28 0
> 9 75 b9 96 0 0 b 0
> Jan 29 19:04:00 host2 kernel: (da0:mpslsi0:0:0:0): CAM status: SCSI 
> Status Error
> Jan 29 19:04:00 host2 kernel: (da0:mpslsi0:0:0:0): SCSI status: Check 
> Condition
> Jan 29 19:04:00 host2 kernel: (da0:mpslsi0:0:0:0): SCSI sense: MEDIUM
> ERROR info:975b99b asc:15,1 (Mechanical positioning error) actual
> retry count: 15
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_scsiio_timeout checking
> sc 0xffffff80005ee000 cm 0xffffff800060d408
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): READ(10). CDB: 28 0
> 2 e3 6a 1a 0 0 1 0 length 512 SMID 153 command timeout cm
> 0xffffff800060d408 ccb 0xffffff000a1ce800
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_alloc_tm freezing simq
> Jan 29 19:04:54 host2 kernel: mpslsi0: timedout cm 0xffffff800060d408
> allocated tm 0xffffff8000604718
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): READ(10). CDB: 28 0
> 2 e3 6a 1a 0 0 1 0 length 512 SMID 153 completed timedout cm
> 0xffffff800060d408 ccb 0xffffff000a1ce800 during recovery ioc 8048
> scsi 0 state c xfer 0
> Jan 29 19:04:54 host2 kernel: (noperiph:mpslsi0:0:0:0): SMID 43 abort
> TaskMID 153 status 0x0 code 0x0 count 1
> Jan 29 19:04:54 host2 kernel: (noperiph:mpslsi0:0:0:0): SMID 43
> finished recovery after aborting TaskMID 153
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_free_tm releasing simq
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_scsiio_timeout checking
> sc 0xffffff80005ee000 cm 0xffffff800060d928
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): READ(10). CDB: 28 0
> 9 41 c7 53 0 0 b 0 length 5632 SMID 157 command timeout cm
> 0xffffff800060d928 ccb 0xffffff0176ccb000
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_alloc_tm freezing simq
> Jan 29 19:04:54 host2 kernel: mpslsi0: timedout cm 0xffffff800060d928
> allocated tm 0xffffff8000604860
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): READ(10). CDB: 28 0
> 9 41 c7 53 0 0 b 0 length 5632 SMID 157 completed timedout cm
> 0xffffff800060d928 ccb 0xffffff0176ccb000 during recovery ioc 8048
> scsi 0 state c xfer 0(noperiph:mpslsi0:0:0:0): SMID 44 abort TaskMID
> 157 status 0x0 code 0x0 count 1
> Jan 29 19:04:54 host2 kernel: (noperiph:mpslsi0:0:0:0): SMID 44
> finished recovery after aborting TaskMID 157
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_free_tm releasing simq
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_scsiio_timeout checking
> sc 0xffffff80005ee000 cm 0xffffff80006235a8
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a
> 0 9 72 e0 d3 0 1 0 0 length 131072 SMID 429 command timeout cm
> 0xffffff80006235a8 ccb 0xffffff000c600000
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_alloc_tm freezing simq
> Jan 29 19:04:54 host2 kernel: mpslsi0: timedout cm 0xffffff80006235a8
> allocated tm 0xffffff80006049a8
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_scsiio_timeout checking
> sc 0xffffff80005ee000 cm 0xffffff800063e2e0
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a
> 0 9 72 e1 d3 0 0 3 0 length 1536 SMID 764 command timeout cm
> 0xffffff800063e2e0 ccb 0xffffff01dfa19800
> Jan 29 19:04:54 host2 kernel: mpslsi0: queued timedout cm
> 0xffffff800063e2e0 for processing by tm 0xffffff80006049a8
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a
> 0 9 72 e0 d3 0 1 0 0 length 131072 SMID 429 completed timedout cm
> 0xffffff80006235a8 ccb 0xffffff000c600000 during recovery ioc 8048
> scsi 0 state c xfe(noperiph:mpslsi0:0:0:0): SMID 45 abort TaskMID 429
> status 0x0 code 0x0 count 1
> Jan 29 19:04:54 host2 kernel: (noperiph:mpslsi0:0:0:0): SMID 45
> continuing recovery after aborting TaskMID 429
> Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_scsiio_timeout checking
> sc 0xffffff80005ee000 cm 0xffffff8000610890
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a
> 0 9 72 e3 31 0 0 be 0 length 97280 SMID 194 command timeout cm
> 0xffffff8000610890 ccb 0xffffff013d0f3800
> Jan 29 19:04:54 host2 kernel: mpslsi0: queued timedout cm
> 0xffffff8000610890 for processing by tm 0xffffff80006049a8
> Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a
> 0 9 72 e1 d3 0 0 3 0 length 1536 SMID 764 completed timedout cm
> 0xffffff800063e2e0 ccb 0xffffff01dfa19800 during recovery ioc 8048
> scsi 0 state c xfer (noperiph:mpslsi0:0:0:0): SMID 45 abort TaskMID
> 764 status 0x0 code 0x0 count 1
> _______________________________________________
> zfs-devel at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/zfs-devel
> To unsubscribe, send any mail to "zfs-devel-unsubscribe at freebsd.org"