hardware fault during ZFS send/receive blocks /dev/zfs indefinitely
Simon Campese
freebsd_fs at campese.de
Tue May 19 14:28:49 UTC 2015
Hello,
I tried to send/receive a ZFS filesystem from a raidz2-pool to another
pool with just a single disk, when this disk failed. As a result, now
both, the zfs send and zfs receive processes are in uninterruptible
sleep state and all new zpool and zfs commands which I issue immediately
enter uninterruptible sleep. Is this just bad luck (i.e. my disk failed
in the wrong moment) or might this be a bug?
Anyway, my only solution is to schedule a reboot soon as the machine is
a file server and the operational status of zfs is critical.
I'm not very experienced with zfs or the FreeBSD kernel, so I just try
to supply as much relevant information as possible. Please tell me if
there is more I can do.
The system I run is FreeBSD 10.1-RELEASE-p6, the machine is a small intel
file server (eight core Atom, 64G Ram, Supermicro board, two raidz2
pools connected via reflashed IBM M1015 controllers). Here are the
relevant lines from "ps ax" (with anonymized pool/filesystem names):
79776 1- DE 0:05.27 zfs send -R zpool0/fs0
79777 1- D+ 0:09.80 zfs receive -e zpool1/fs1
a "lsof -p" for the send process yields no output, for the receive
process I get:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
zfs 79777 root cwd VDIR 37,1520762947 29 4 /
zfs 79777 root rtd VDIR 37,1520762947 29 4 /
zfs 79777 root txt VREG 37,1520762947 92952 608
/sbin/zfs
zfs 79777 root txt VREG 37,1520762947 118520 137
/libexec/ld-elf.so.1
zfs 79777 root txt VREG 37,1520762947 20768 41
/lib/libgeom.so.5
zfs 79777 root txt VREG 37,1520762947 20272 75
/lib/libjail.so.1
zfs 79777 root txt VREG 37,1520762947 82888 51
/lib/libnvpair.so.2
zfs 79777 root txt VREG 37,1520762947 5880 57
/lib/libumem.so.2
zfs 79777 root txt VREG 37,1520762947 70816 69
/lib/libutil.so.9
zfs 79777 root txt VREG 37,1520762947 37984 52
/lib/libuutil.so.2
zfs 79777 root txt VREG 37,1520762947 15840 49
/lib/libzfs_core.so.2
zfs 79777 root txt VREG 37,1520762947 279968 76
/lib/libzfs.so.2
zfs 79777 root txt VREG 37,1520762947 1631216 556
/lib/libc.so.7
zfs 79777 root txt VREG 37,1520762947 157664 77
/lib/libbsdxml.so.4
zfs 79777 root txt VREG 37,1520762947 10976 34
/lib/libsbuf.so.6
zfs 79777 root txt VREG 37,1520762947 66200 56
/lib/libmd.so.6
zfs 79777 root txt VREG 37,1520762947 105248 71
/lib/libthr.so.3
zfs 79777 root txt VREG 37,1520762947 181840 64
/lib/libm.so.5
zfs 79777 root txt VREG 37,1520762947 8784 45
/lib/libavl.so.2
zfs 79777 root 0u PIPE 0xfffff800462db000 65536
->0xfffff800462db160, cnt=65224, out=312
zfs 79777 root 1u VBAD
(revoked)
zfs 79777 root 2u VBAD
(revoked)
zfs 79777 root 3u VCHR 0,72 0t0 72 /dev/zfs
zfs 79777 root 4r VCHR 0,8 0t0 8
/dev/zero
zfs 79777 root 5r VREG 37,1520762947 4505 954
/etc/zfs/exports
zfs 79777 root 6u VCHR 0,72 0t0 72 /dev/zfs
zfs 79777 root 7r VCHR 0,8 0t0 8
/dev/zero
zfs 79777 root 8u VCHR 0,72 0t0 72 /dev/zfs
Here are some more lines of "ps ax" showing some zpool and zfs commands
which immediately entered uninterruptible sleep:
81000 19- D+ 0:00.00 zpool list
80828 15- D+ 0:00.01 zpool destroy zpool1
80753 14- D+ 0:00.09 zpool export zpool1
The errors showing up in /var/log/messages when my harddisk went west
are (excerpt):
May 19 14:57:47 srv0 kernel: ahcich7: Timeout on slot 2 port 0
May 19 14:57:47 srv0 kernel: ahcich7: is 00000000 cs 000007f0 ss
000007fc rs 000007fc tfd 40 serr 00000000 cmd 0004c317
May 19 14:57:47 srv0 kernel: (ada7:ahcich7:0:0:0):
WRITE_FPDMA_QUEUED. ACB: 61 00 8c bf 6a 40 00 00 00 01 00 00
May 19 14:57:47 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command
timeout
May 19 14:57:47 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
May 19 14:58:17 srv0 kernel: ahcich7: Timeout on slot 18 port 0
May 19 14:58:17 srv0 kernel: ahcich7: is 00000000 cs 00040000 ss
00000000 rs 00040000 tfd c0 serr 00000000 cmd 0004d217
May 19 14:58:17 srv0 kernel: (ada7:ahcich7:0:0:0): WRITE_DMA. ACB: ca 00
8c c8 6a 40 00 00 00 00 00 00
May 19 14:58:17 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command
timeout
May 19 14:58:17 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
May 19 14:58:47 srv0 kernel: ahcich7: Timeout on slot 19 port 0
May 19 14:58:47 srv0 kernel: ahcich7: is 00000000 cs 00000000 ss
00080000 rs 00080000 tfd 40 serr 00000000 cmd 0004d317
May 19 14:58:47 srv0 kernel: (ada7:ahcich7:0:0:0):
WRITE_FPDMA_QUEUED. ACB: 61 00 8c ca 6a 40 00 00 00 01 00 00
May 19 14:58:47 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command
timeout
May 19 14:58:47 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
May 19 14:59:17 srv0 kernel: ahcich7: Timeout on slot 26 port 0
May 19 14:59:17 srv0 kernel: ahcich7: is 00000000 cs 04000000 ss
00000000 rs 04000000 tfd c0 serr 00000000 cmd 0004da17
May 19 14:59:17 srv0 kernel: (ada7:ahcich7:0:0:0): WRITE_DMA. ACB: ca 00
8c d4 6a 40 00 00 00 00 00 00
May 19 14:59:17 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command
timeout
May 19 14:59:17 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
May 19 14:59:47 srv0 kernel: ahcich7: Timeout on slot 26 port 0
May 19 14:59:47 srv0 kernel: ahcich7: is 00000000 cs f0000007 ss
fc000007 rs fc000007 tfd 40 serr 00000000 cmd 0004db17
May 19 14:59:47 srv0 kernel: (ada7:ahcich7:0:0:0):
WRITE_FPDMA_QUEUED. ACB: 61 00 8c d5 6a 40 00 00 00 01 00 00
May 19 14:59:47 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command
timeout
May 19 14:59:47 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
May 19 15:00:17 srv0 kernel: ahcich7: Timeout on slot 10 port 0
May 19 15:00:17 srv0 kernel: ahcich7: is 00000000 cs 0007f000 ss
0007fc00 rs 0007fc00 tfd 40 serr 00000000 cmd 0004cb17
May 19 15:00:17 srv0 kernel: (ada7:ahcich7:0:0:0):
WRITE_FPDMA_QUEUED. ACB: 61 00 8c e0 6a 40 00 00 00 01 00 00
May 19 15:00:17 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command
timeout
May 19 15:00:17 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
May 19 15:00:48 srv0 kernel: ahcich7: Timeout on slot 0 port 0
May 19 15:00:48 srv0 kernel: ahcich7: is 00000000 cs c000001f ss
f800001f rs f800001f tfd 40 serr 00000000 cmd 0004dd17
May 19 15:00:48 srv0 kernel: (ada7:ahcich7:0:0:0):
WRITE_FPDMA_QUEUED. ACB: 61 0b 8c f3 6a 40 00 00 00 00 00 00
May 19 15:00:48 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command
timeout
May 19 15:00:48 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
Lines of this form continued for some minutes and after a while, my geli
volume on this hdd began complaining as well:
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3593285120, length=131072)]
May 19 15:03:08 srv0 kernel:
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3593416192, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3593547264, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3593678336, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3593809408, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3593940480, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594071552, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594202624, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594333696, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594464768, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: g_eli_read_done() failed
label/bkp101.eli[READ(offset=270336, length=8192)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: g_eli_read_done() failed
label/bkp101.eli[READ(offset=1500301238272, length=8192)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: g_eli_read_done() failed
label/bkp101.eli[READ(offset=1500301500416, length=8192)]
May 19 15:03:08 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594595840, length=131072)]
May 19 15:03:08 srv0 kernel: GEOM_ELI
May 19 15:03:08 srv0 kernel: : Crypto WRITE request failed (error=6).
May 19 15:03:08 srv0 kernel:
May 19 15:03:08 srv0 kernel: label/bkp101.eli[WRITE(offset=3595251200,
length=131072)]
May 19 15:03:09 srv0 kernel:
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594857984, length=131072)]
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3595513344, length=131072)]
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3595382272, length=131072)]
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3595120128, length=131072)]
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594989056, length=131072)]
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3594726912, length=131072)]
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3595644416, length=131072)]
May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed
(error=6). label/bkp101.eli[WRITE(offset=3595775488, length=131072)]
Is there any hope for me to resolve this issue without a reboot?
Thanks for your help,
Simon
More information about the freebsd-fs
mailing list