Loss of disk made zfs pool unavailable

Thu Apr 21 06:26:03 UTC 2016

Hi all,

Yesterday I lost one disk in one of our zfs pools. dmesg shows the
following output:

Apr 20 16:07:37 sto01 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 Aborting command 0xfffffe00015db000
Apr 20 16:07:37 sto01 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 49
Apr 20 16:07:40 sto01 kernel: mpr0: mprsas_prepare_remove: Sending reset for target ID 49
Apr 20 16:07:40 sto01 kernel: da22 at mpr0 bus 0 scbus12 target 49 lun 0
Apr 20 16:07:40 sto01 kernel: da22: <ATA ST4000NM0033-9ZM SN04> s/n             Z1Z9QZJB detached
Apr 20 16:07:41 sto01 kernel: (da22:mpr0:0:49:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Apr 20 16:07:41 sto01 kernel: mpr0: (da22:mpr0:0:49:0): CAM status: Command timeout
Apr 20 16:07:41 sto01 kernel: IOCStatus = 0x4b while resetting device 0x1d
Apr 20 16:07:41 sto01 kernel: (da22:mpr0: mpr0:0:Unfreezing devq for target ID 49
Apr 20 16:07:41 sto01 kernel: 49:0): Error 5, Periph was invalidated
Apr 20 16:07:41 sto01 kernel: mpr0: Unfreezing devq for target ID 49
Apr 20 16:07:41 sto01 kernel: (da22:mpr0:0:49:0): Periph destroyed
Apr 20 16:07:41 sto01 devd: Executing 'logger -p kern.notice -t ZFS 'vdev is removed, pool_guid=8487159098644794736 vdev_guid=5745784146163956924''
Apr 20 16:07:41 sto01 ZFS: vdev is removed, pool_guid=8487159098644794736 vdev_guid=5745784146163956924

This disk was part of a raidz1. That in itself is not really a problem: I
expect disks to crash, which is why we are using zfs in the first place.
What I did not expect was for the pool to become unusable, and to have to
reboot the server to be able to replace the disk.

After the disk crashed, all zpool and zfs commands just hung, with no
output at all. The clients connected to this server also lost their
ability to read from and write to the pool.

Is this the expected behavior or was I just unlucky?

I see that the zpool property failmode is set to "wait". Would using
"continue" have solved the issue with the server hanging?
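
For reference, inspecting and changing the property would look roughly
like this. It is only a sketch: "tank" is a placeholder pool name and the
two helper function names are made up for illustration. Whether "continue"
would actually have avoided the hang is exactly what I am unsure about.

```shell
#!/bin/sh
# Sketch only: "tank" in the usage notes is a placeholder pool name,
# and both function names are illustrative, not part of any tool.

# Show the current failmode of a pool (one of: wait, continue, panic).
show_failmode() {
    zpool get failmode "$1"
}

# Set failmode=continue. On catastrophic pool failure, new write
# requests then get EIO instead of blocking until the device returns.
set_failmode_continue() {
    zpool set failmode=continue "$1"
}

# Example usage (as root):
#   show_failmode tank
#   set_failmode_continue tank
```

As far as I understand the zpool(8) man page, failmode only applies to
catastrophic pool failure, so it may not be what decides the behaviour
when a single raidz1 member disappears.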

Once the server was rebooted, I was able to replace the disk with a
spare. The resilver and scrub finished without any errors.

Some system info:
# freebsd-version -ku
10.3-RELEASE
10.3-RELEASE
# uname -a
FreeBSD sto01 10.3-RELEASE FreeBSD 10.3-RELEASE #0 r297264: Fri Mar 25 02:10:02 UTC 2016     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64

Regards,
Alexander