hardware fault during ZFS send/receive blocks /dev/zfs indefinitely

Simon Campese freebsd_fs at campese.de
Tue May 19 14:28:49 UTC 2015


Hello,

I tried to send/receive a ZFS filesystem from a raidz2-pool to another
pool with just a single disk, when this disk failed. As a result, now
both, the zfs send and zfs receive processes are in uninterruptible
sleep state and all new zpool and zfs commands which I issue immediately
enter uninterruptible sleep. Is this just bad luck (i.e. my disk failed
in the wrong moment) or might this be a bug? 

Anyway, my only solution is to schedule a reboot soon as the machine is
a file server and the operational status of zfs is critical.  

I'm not very experienced with zfs or the FreeBSD kernel, so I just try
to supply as much relevant information as possible. Please tell me if
there is more I can do. 

The system I run is FreeBSD 10.1-RELEASE-p6, the machine is a small intel
file server (eight core Atom, 64G Ram, Supermicro board, two raidz2
pools connected via reflashed IBM M1015 controllers).  Here are the
relevant lines from "ps ax" (with anonymized pool/filesystem names):  

79776  1- DE        0:05.27 zfs send -R zpool0/fs0
79777  1- D+        0:09.80 zfs receive -e zpool1/fs1

a "lsof -p" for the send process yields no output, for the receive
process I get:

COMMAND   PID USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
zfs     79777 root  cwd   VDIR      37,1520762947       29    4 /
zfs     79777 root  rtd   VDIR      37,1520762947       29    4 /
zfs     79777 root  txt   VREG      37,1520762947    92952  608
/sbin/zfs
zfs     79777 root  txt   VREG      37,1520762947   118520  137
/libexec/ld-elf.so.1
zfs     79777 root  txt   VREG      37,1520762947    20768   41
/lib/libgeom.so.5
zfs     79777 root  txt   VREG      37,1520762947    20272   75
/lib/libjail.so.1
zfs     79777 root  txt   VREG      37,1520762947    82888   51
/lib/libnvpair.so.2
zfs     79777 root  txt   VREG      37,1520762947     5880   57
/lib/libumem.so.2
zfs     79777 root  txt   VREG      37,1520762947    70816   69
/lib/libutil.so.9
zfs     79777 root  txt   VREG      37,1520762947    37984   52
/lib/libuutil.so.2
zfs     79777 root  txt   VREG      37,1520762947    15840   49
/lib/libzfs_core.so.2
zfs     79777 root  txt   VREG      37,1520762947   279968   76
/lib/libzfs.so.2
zfs     79777 root  txt   VREG      37,1520762947  1631216  556
/lib/libc.so.7
zfs     79777 root  txt   VREG      37,1520762947   157664   77
/lib/libbsdxml.so.4
zfs     79777 root  txt   VREG      37,1520762947    10976   34
/lib/libsbuf.so.6
zfs     79777 root  txt   VREG      37,1520762947    66200   56
/lib/libmd.so.6
zfs     79777 root  txt   VREG      37,1520762947   105248   71
/lib/libthr.so.3
zfs     79777 root  txt   VREG      37,1520762947   181840   64
/lib/libm.so.5
zfs     79777 root  txt   VREG      37,1520762947     8784   45
/lib/libavl.so.2
zfs     79777 root    0u  PIPE 0xfffff800462db000    65536
->0xfffff800462db160, cnt=65224, out=312
zfs     79777 root    1u  VBAD
(revoked)
zfs     79777 root    2u  VBAD
(revoked)
zfs     79777 root    3u  VCHR               0,72      0t0   72 /dev/zfs
zfs     79777 root    4r  VCHR                0,8      0t0    8
/dev/zero
zfs     79777 root    5r  VREG      37,1520762947     4505  954
/etc/zfs/exports
zfs     79777 root    6u  VCHR               0,72      0t0   72 /dev/zfs
zfs     79777 root    7r  VCHR                0,8      0t0    8
/dev/zero
zfs     79777 root    8u  VCHR               0,72      0t0   72 /dev/zfs

Here are some more lines of "ps ax" showing some zpool and zfs commands
which immediately entered uninterruptible sleep:

81000 19- D+        0:00.00 zpool list
80828 15- D+        0:00.01 zpool destroy zpool1
80753 14- D+        0:00.09 zpool export zpool1

The errors showing up in /var/log/messages when my harddisk went west
are (excerpt):

May 19 14:57:47 srv0 kernel: ahcich7: Timeout on slot 2 port 0
May 19 14:57:47 srv0 kernel: ahcich7: is 00000000 cs 000007f0 ss
000007fc rs 000007fc tfd 40 serr 00000000 cmd 0004c317
May 19 14:57:47 srv0 kernel: (ada7:ahcich7:0:0:0):
WRITE_FPDMA_QUEUED. ACB: 61 00 8c bf 6a 40 00 00 00 01 00 00
May 19 14:57:47 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command
timeout
May 19 14:57:47 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
May 19 14:58:17 srv0 kernel: ahcich7: Timeout on slot 18 port 0
May 19 14:58:17 srv0 kernel: ahcich7: is 00000000 cs 00040000 ss
00000000 rs 00040000 tfd c0 serr 00000000 cmd 0004d217
May 19 14:58:17 srv0 kernel: (ada7:ahcich7:0:0:0): WRITE_DMA. ACB: ca 00
8c c8 6a 40 00 00 00 00 00 00
May 19 14:58:17 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command
timeout
May 19 14:58:17 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
May 19 14:58:47 srv0 kernel: ahcich7: Timeout on slot 19 port 0
May 19 14:58:47 srv0 kernel: ahcich7: is 00000000 cs 00000000 ss
00080000 rs 00080000 tfd 40 serr 00000000 cmd 0004d317
May 19 14:58:47 srv0 kernel: (ada7:ahcich7:0:0:0):
WRITE_FPDMA_QUEUED. ACB: 61 00 8c ca 6a 40 00 00 00 01 00 00
May 19 14:58:47 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command
timeout
May 19 14:58:47 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
May 19 14:59:17 srv0 kernel: ahcich7: Timeout on slot 26 port 0
May 19 14:59:17 srv0 kernel: ahcich7: is 00000000 cs 04000000 ss
00000000 rs 04000000 tfd c0 serr 00000000 cmd 0004da17
May 19 14:59:17 srv0 kernel: (ada7:ahcich7:0:0:0): WRITE_DMA. ACB: ca 00
8c d4 6a 40 00 00 00 00 00 00
May 19 14:59:17 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command
timeout
May 19 14:59:17 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
May 19 14:59:47 srv0 kernel: ahcich7: Timeout on slot 26 port 0
May 19 14:59:47 srv0 kernel: ahcich7: is 00000000 cs f0000007 ss
fc000007 rs fc000007 tfd 40 serr 00000000 cmd 0004db17
May 19 14:59:47 srv0 kernel: (ada7:ahcich7:0:0:0):
WRITE_FPDMA_QUEUED. ACB: 61 00 8c d5 6a 40 00 00 00 01 00 00
May 19 14:59:47 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command
timeout
May 19 14:59:47 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
May 19 15:00:17 srv0 kernel: ahcich7: Timeout on slot 10 port 0
May 19 15:00:17 srv0 kernel: ahcich7: is 00000000 cs 0007f000 ss
0007fc00 rs 0007fc00 tfd 40 serr 00000000 cmd 0004cb17
May 19 15:00:17 srv0 kernel: (ada7:ahcich7:0:0:0):
WRITE_FPDMA_QUEUED. ACB: 61 00 8c e0 6a 40 00 00 00 01 00 00
May 19 15:00:17 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command
timeout
May 19 15:00:17 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
May 19 15:00:48 srv0 kernel: ahcich7: Timeout on slot 0 port 0
May 19 15:00:48 srv0 kernel: ahcich7: is 00000000 cs c000001f ss
f800001f rs f800001f tfd 40 serr 00000000 cmd 0004dd17
May 19 15:00:48 srv0 kernel: (ada7:ahcich7:0:0:0):
WRITE_FPDMA_QUEUED. ACB: 61 0b 8c f3 6a 40 00 00 00 00 00 00
May 19 15:00:48 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command
timeout
May 19 15:00:48 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command

Lines of this form continued for some minutes and after a while, my geli
volume on this hdd began complaining as well:

May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3593285120, length=131072)]
May 19 15:03:08 srv0 kernel:
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3593416192, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3593547264, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3593678336, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3593809408, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3593940480, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594071552, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594202624, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594333696, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594464768, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: g_eli_read_done() failed
label/bkp101.eli[READ(offset=270336, length=8192)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: g_eli_read_done() failed
label/bkp101.eli[READ(offset=1500301238272, length=8192)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: g_eli_read_done() failed
label/bkp101.eli[READ(offset=1500301500416, length=8192)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594595840, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI
May 19 15:03:08 srv0 kernel: : Crypto WRITE request failed (error=6).
May 19 15:03:08 srv0 kernel:
May 19 15:03:08 srv0 kernel: label/bkp101.eli[WRITE(offset=3595251200,
length=131072)]
May 19 15:03:09 srv0 kernel:
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594857984, length=131072)]
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3595513344, length=131072)]
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3595382272, length=131072)]
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3595120128, length=131072)]
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594989056, length=131072)]
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594726912, length=131072)]
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3595644416, length=131072)]
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3595775488, length=131072)]

Is there any hope for me to resolve this issue without a reboot?

Thanks for your help,

Simon


More information about the freebsd-fs mailing list