Outage related to hard drive failure

Brian Gardner brian at timestudybuddy.com
Sat Feb 8 20:32:49 UTC 2014


Last year upgraded my production servers from ufs on adaptec raid controllers to zfs raidz on lsi controllers.  Last week I had an outage, the culprit being a failed hard in my raidz.  My log was littered with kernel messages such as the ones below.  About 20 minutes after the first message some aspects of my host where hung, and I was notified that my site was down.  I noticed these messages in my log but running a zfs status however showed only a few checksum errors and the failed drive didn't get degraded.  Manually degrading the drive solved my problems.  This seems very odd to me.  It's almost as if zfs wasn't getting messages from the lsi driver regarding these read/write failures.  Do I need to tune something in the mps/lsi or zfs drivers to help it deal with failures?  I'm running FreeBSD 8.3-Release-p3 with the generic kernel, nothing unusual about my setup other than I use jails extensively.  There are four drives in the raidz configuration in question:

zpool status
  pool: storage
 state: ONLINE
  scan: resilvered 67.6G in 2h21m with 0 errors on Tue Aug  6 00:44:14 2013
config:

	NAME        STATE     READ WRITE CKSUM
	storage     ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    da0     ONLINE       0     0     0
	    da1     ONLINE       0     0     0
	    da2     ONLINE       0     0     0
	    da3     ONLINE       0     0     0

Jan 29 19:03:18 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a 0 9 72 d8 13 0 0 1d 0 
Jan 29 19:03:18 host2 kernel: (da0:mpslsi0:0:0:0): CAM status: SCSI Status Error
Jan 29 19:03:18 host2 kernel: (da0:mpslsi0:0:0:0): SCSI status: Check Condition
Jan 29 19:03:18 host2 kernel: (da0:mpslsi0:0:0:0): SCSI sense: Deferred error: MEDIUM ERROR info:97ef13b asc:15,1 (Mechanical positioning error) actual retry count: 15
Jan 29 19:04:00 host2 kernel: (da0:mpslsi0:0:0:0): READ(10). CDB: 28 0 9 75 b9 96 0 0 b 0 
Jan 29 19:04:00 host2 kernel: (da0:mpslsi0:0:0:0): CAM status: SCSI Status Error
Jan 29 19:04:00 host2 kernel: (da0:mpslsi0:0:0:0): SCSI status: Check Condition
Jan 29 19:04:00 host2 kernel: (da0:mpslsi0:0:0:0): SCSI sense: MEDIUM ERROR info:975b99b asc:15,1 (Mechanical positioning error) actual retry count: 15
Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_scsiio_timeout checking sc 0xffffff80005ee000 cm 0xffffff800060d408
Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): READ(10). CDB: 28 0 2 e3 6a 1a 0 0 1 0 length 512 SMID 153 command timeout cm 0xffffff800060d408 ccb 0xffffff000a1ce800
Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_alloc_tm freezing simq
Jan 29 19:04:54 host2 kernel: mpslsi0: timedout cm 0xffffff800060d408 allocated tm 0xffffff8000604718
Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): READ(10). CDB: 28 0 2 e3 6a 1a 0 0 1 0 length 512 SMID 153 completed timedout cm 0xffffff800060d408 ccb 0xffffff000a1ce800 during recovery ioc 8048 scsi 0 state c xfer 0
Jan 29 19:04:54 host2 kernel: (noperiph:mpslsi0:0:0:0): SMID 43 abort TaskMID 153 status 0x0 code 0x0 count 1
Jan 29 19:04:54 host2 kernel: (noperiph:mpslsi0:0:0:0): SMID 43 finished recovery after aborting TaskMID 153
Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_free_tm releasing simq
Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_scsiio_timeout checking sc 0xffffff80005ee000 cm 0xffffff800060d928
Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): READ(10). CDB: 28 0 9 41 c7 53 0 0 b 0 length 5632 SMID 157 command timeout cm 0xffffff800060d928 ccb 0xffffff0176ccb000
Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_alloc_tm freezing simq
Jan 29 19:04:54 host2 kernel: mpslsi0: timedout cm 0xffffff800060d928 allocated tm 0xffffff8000604860
Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): READ(10). CDB: 28 0 9 41 c7 53 0 0 b 0 length 5632 SMID 157 completed timedout cm 0xffffff800060d928 ccb 0xffffff0176ccb000 during recovery ioc 8048 scsi 0 state c xfer 0(noperiph:mpslsi0:0:0:0): SMID 44 abort TaskMID 157 status 0x0 code 0x0 count 1
Jan 29 19:04:54 host2 kernel: (noperiph:mpslsi0:0:0:0): SMID 44 finished recovery after aborting TaskMID 157
Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_free_tm releasing simq
Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_scsiio_timeout checking sc 0xffffff80005ee000 cm 0xffffff80006235a8
Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a 0 9 72 e0 d3 0 1 0 0 length 131072 SMID 429 command timeout cm 0xffffff80006235a8 ccb 0xffffff000c600000
Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_alloc_tm freezing simq
Jan 29 19:04:54 host2 kernel: mpslsi0: timedout cm 0xffffff80006235a8 allocated tm 0xffffff80006049a8
Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_scsiio_timeout checking sc 0xffffff80005ee000 cm 0xffffff800063e2e0
Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a 0 9 72 e1 d3 0 0 3 0 length 1536 SMID 764 command timeout cm 0xffffff800063e2e0 ccb 0xffffff01dfa19800
Jan 29 19:04:54 host2 kernel: mpslsi0: queued timedout cm 0xffffff800063e2e0 for processing by tm 0xffffff80006049a8
Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a 0 9 72 e0 d3 0 1 0 0 length 131072 SMID 429 completed timedout cm 0xffffff80006235a8 ccb 0xffffff000c600000 during recovery ioc 8048 scsi 0 state c xfe(noperiph:mpslsi0:0:0:0): SMID 45 abort TaskMID 429 status 0x0 code 0x0 count 1
Jan 29 19:04:54 host2 kernel: (noperiph:mpslsi0:0:0:0): SMID 45 continuing recovery after aborting TaskMID 429
Jan 29 19:04:54 host2 kernel: mpslsi0: mpssas_scsiio_timeout checking sc 0xffffff80005ee000 cm 0xffffff8000610890
Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a 0 9 72 e3 31 0 0 be 0 length 97280 SMID 194 command timeout cm 0xffffff8000610890 ccb 0xffffff013d0f3800
Jan 29 19:04:54 host2 kernel: mpslsi0: queued timedout cm 0xffffff8000610890 for processing by tm 0xffffff80006049a8
Jan 29 19:04:54 host2 kernel: (da0:mpslsi0:0:0:0): WRITE(10). CDB: 2a 0 9 72 e1 d3 0 0 3 0 length 1536 SMID 764 completed timedout cm 0xffffff800063e2e0 ccb 0xffffff01dfa19800 during recovery ioc 8048 scsi 0 state c xfer (noperiph:mpslsi0:0:0:0): SMID 45 abort TaskMID 764 status 0x0 code 0x0 count 1


More information about the zfs-devel mailing list