ZFS Commands In "D" State

Tim Gustafson tjg at ucsc.edu
Thu Jun 8 21:13:04 UTC 2017


We have a ZFS server that we've been running for a few months now.
The server is a backup server that receives ZFS sends from its
primary daily.  This mechanism has worked for us for years across
several pairs of servers, and for several months on this particular
piece of hardware.

A few days ago, our nightly ZFS send failed.  When I looked at the
server, I saw that the "zfs receive" command was in a "D" wait state:

1425  -  D       0:02.75 /sbin/zfs receive -v -F backup/export
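
For what it's worth, the kernel stack of the stuck process can be
captured with procstat (using the PID from the ps output above); this
is just the stock base-system tool:

procstat -kk 1425
# one line per thread, showing the kernel functions it is sleeping in;
# a stack parked in something like zio_wait or txg_wait_synced would
# point at ZFS itself rather than at a hung disk command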

I rebooted the system, checked that "zpool status" and "zfs list"
both came back correctly (which they did), and then restarted the
"zfs send" on the master server.  At first, the "zfs receive" command
did not enter the "D" state, but once the master server started
sending actual data (which I could see because I was running "zfs
send" with the -v option), the receiving process entered the "D"
state again and another reboot was required.  Only about 2 MB of data
was sent before this happened.
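
For context, the nightly job is essentially a pipeline of this shape.
The dataset and snapshot names below are made up for illustration,
and the ssh transport here is likewise only illustrative:

# incremental send from the primary into the receive shown above
zfs send -v -i tank/export@2017-06-07 tank/export@2017-06-08 | \
  ssh backup /sbin/zfs receive -v -F backup/export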

I've rebooted several times, always with the same result.  I ran
"zpool scrub os" (there's a separate zpool that the OS lives on) and
it completed in a few minutes, but when I ran "zpool scrub backup",
that process immediately went into the "D+" state:

895  0  D+     0:00.04 zpool scrub backup
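
While the scrub is wedged like that, the pool and the disks beneath
it can still be poked with read-only tools to see whether anything is
actually moving:

zpool status backup   # does status still answer, and is any scrub progress reported?
gstat -p              # live per-disk I/O; all-zero rows mean nothing is hitting the disks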

We run smartd on this machine, and it is showing no disk errors.  The
devd process is logging a stream of ZFS vdev state-change events, but
they don't appear to be very helpful:

Jun  8 13:52:49 backup ZFS: vdev state changed, pool_guid=2176924632732322522 vdev_guid=11754027336427262018
Jun  8 13:52:49 backup ZFS: vdev state changed, pool_guid=2176924632732322522 vdev_guid=11367786800631979308
Jun  8 13:52:49 backup ZFS: vdev state changed, pool_guid=2176924632732322522 vdev_guid=18407069648425063426
Jun  8 13:52:49 backup ZFS: vdev state changed, pool_guid=2176924632732322522 vdev_guid=9496839124651172990
Jun  8 13:52:49 backup ZFS: vdev state changed, pool_guid=2176924632732322522 vdev_guid=332784898986906736
Jun  8 13:52:50 backup ZFS: vdev state changed, pool_guid=2176924632732322522 vdev_guid=16384086680948393578
Jun  8 13:52:50 backup ZFS: vdev state changed, pool_guid=2176924632732322522 vdev_guid=10762348983543761591
Jun  8 13:52:50 backup ZFS: vdev state changed, pool_guid=2176924632732322522 vdev_guid=8585274278710252761
Jun  8 13:52:50 backup ZFS: vdev state changed, pool_guid=2176924632732322522 vdev_guid=17456777842286400332
Jun  8 13:52:50 backup ZFS: vdev state changed, pool_guid=2176924632732322522 vdev_guid=10533897485373019500
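
To cross-check what smartd is reporting, each disk can also be
queried directly; the device name here is illustrative, so substitute
the real members of the backup pool:

smartctl -a /dev/da0   # full SMART attributes and error log for one disk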

No word on which state it changed "from" or "to".  Also, the system
only has three vdevs (the OS one, plus the two raidz2 vdevs that make
up the "backup" pool), so I'm not sure how it's coming up with more
than three vdev GUIDs.
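
If it helps with decoding those events, my understanding is that each
disk in a raidz2 vdev carries its own guid in its on-disk label, next
to the top-level vdev and pool guids, and zdb can dump that label;
the device name below is illustrative:

zdb -l /dev/da0 | grep -E 'guid|path'
# compare these guid values against the vdev_guid numbers devd logged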

What's my next step in diagnosing this?

-- 

Tim Gustafson
BSOE Computing Director
tjg at ucsc.edu
831-459-5354
Baskin Engineering, Room 313A

To request BSOE IT support, please visit https://support.soe.ucsc.edu/
or send e-mail to help at soe.ucsc.edu.

