zpool raidz2 stopped working after failure of one drive
Marek Salwerowicz
marek.salwerowicz at misal.pl
Sat Nov 19 20:22:32 UTC 2016
Hi all,
I run the following server:
- Supermicro 6047R-E1R36L
- 96 GB RAM
- 1x INTEL CPU E5-2640 v2 @ 2.00GHz
- FreeBSD 10.3-RELEASE-p11
Drive for OS:
- HW RAID1: 2x KINGSTON SV300S37A120G
zpool:
- 18x WD RED 4TB @ raidz2
- log: mirrored Intel 730 SSD
- cache: single Intel 730 SSD
Today, after one drive failed, the whole vdev was removed from the
zpool (basically the zpool was down; zpool/zfs commands were not
responding):
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): READ(10). CDB: 28
00 29 e7 b5 79 00 00 10 00
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): CAM status: SCSI
Status Error
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): SCSI status: Check
Condition
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): SCSI sense: MEDIUM
ERROR asc:11,0 (Unrecovered read error)
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): Info: 0x29e7b579
Nov 19 12:19:51 storage2 kernel: (da14:
Nov 19 12:19:52 storage2 kernel: mps0:0:22:0): Error 5, Unretryable error
Nov 19 12:20:03 storage2 kernel: mps0: mpssas_prepare_remove: Sending
reset for target ID 22
Nov 19 12:20:03 storage2 kernel: da14 at mps0 bus 0 scbus0 target 22 lun 0
Nov 19 12:20:04 storage2 kernel: da14: <ATA WDC WD4000FYYZ-0 1K02>
s/n WD-WCC131430652 detached
Nov 19 12:20:04 storage2 kernel: (da14:mps0:0:22:0): SYNCHRONIZE
CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 547
terminated ioc 804b scsi 0 st
Nov 19 12:20:13 storage2 kernel: ate c xfer 0
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): READ(6). CDB: 08 00
02 10 10 00 length 8192 SMID 292 terminated ioc 804b scsi 0 state c xfer 0
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): SYNCHRONIZE
CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): READ(16). CDB: 88
00 00 00 00 01 d1 c0 bc 10 00 00 00 10 00 00 length 8192 SMID 248
terminated ioc 804b s(da14:mps0:0:22:0): CAM status: Unconditionally
Re-queue Request
Nov 19 12:20:13 storage2 kernel: csi 0 state c xfer 0
Nov 19 12:20:13 storage2 kernel: (da14: (da14:mps0:0:22:0): READ(16).
CDB: 88 00 00 00 00 01 d1 c0 ba 10 00 00 00 10 00 00 length 8192 SMID
905 terminated ioc 804b smps0:0:csi 0 state c xfer 0
Nov 19 12:20:13 storage2 kernel: 22:mps0: 0): IOCStatus = 0x4b while
resetting device 0x18
Nov 19 12:20:13 storage2 kernel: Error 5, Periph was invalidated
Nov 19 12:20:13 storage2 kernel: mps0: (da14:mps0:0:22:0): READ(6). CDB:
08 00 02 10 10 00
Nov 19 12:20:13 storage2 kernel: Unfreezing devq for target ID 22
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): CAM status:
Unconditionally Re-queue Request
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0):
Nov 19 12:20:17 storage2 kernel: Error 5, Periph was invalidated
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): READ(16). CDB: 88
00 00 00 00 01 d1 c0 bc 10 00 00 00 10 00 00
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): CAM status:
Unconditionally Re-queue Request
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): Error 5, Periph was
invalidated
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): READ(16). CDB: 88
00 00 00 00 01 d1 c0 ba 10 00 00 00 10 00 00
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): CAM status:
Unconditionally Re-queue Request
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): Error 5, Periph was
invalidated
Nov 19 12:20:17 storage2 kernel: (da14:
Nov 19 12:20:17 storage2 devd: Executing 'logger -p kern.notice -t ZFS
'vdev is removed, pool_guid=15598571108475493154
vdev_guid=2747493726448938619''
Nov 19 12:20:17 storage2 ZFS: vdev is removed,
pool_guid=15598571108475493154 vdev_guid=2747493726448938619
Nov 19 12:20:17 storage2 kernel: mps0:0:22:
Nov 19 12:20:17 storage2 kernel: 0): Periph destroyed
There was no other option than hard-rebooting the server.
The SMART value "Raw_Read_Error_Rate" for the failed drive has
increased from 0 to 1. I am about to replace it - it is still under warranty.
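For reference, this is roughly how I checked the drive (device name da14 as in the kernel logs above; exact attribute names depend on the drive's SMART table):

```shell
# Read the SMART attribute table of the suspect drive and pick out
# the error-related counters (Raw_Read_Error_Rate went 0 -> 1 here):
smartctl -A /dev/da14 | grep -E 'Raw_Read_Error_Rate|Reallocated|Pending'

# A long offline self-test can confirm the medium error before an RMA:
smartctl -t long /dev/da14
```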
I have now offlined the failing drive in the zpool and it works fine (of
course, in a DEGRADED state until I replace the drive).
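These are the replacement steps I plan to follow (a sketch; the pool name "tank" is a placeholder, and the device name of the new disk may differ from da14):

```shell
# Take the failed disk offline (if ZFS has not already faulted it):
zpool offline tank da14

# After physically swapping the disk, kick off the resilver.
# With the same device name, a single-argument replace works:
zpool replace tank da14

# Watch resilver progress and pool health:
zpool status -v tank
```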
However, I am concerned that a single drive's failure completely
blocked the zpool.
Is this normal behaviour for zpools?
Also, does ZFS already support automatic hot spares? If I had a hot-spare
drive in my zpool, would it have been activated automatically?
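From what I have read, a spare can be added like this ("tank" and "da20" are placeholders), though I believe automatic spare activation on FreeBSD depends on zfsd(8), which may not be available on 10.3:

```shell
# Add a dedicated hot spare to the pool:
zpool add tank spare da20

# Ask ZFS to replace a failed/removed vdev automatically
# (requires a fault-management daemon such as zfsd to act on events):
zpool set autoreplace=on tank
```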
Cheers,
Marek