hardware fault during ZFS send/receive blocks /dev/zfs indefinitely

Tue May 19 16:03:10 UTC 2015

(trimmed)

On 05/19/2015 10:20, Simon Campese wrote:
> Hello,
> 
> I was sending/receiving a ZFS filesystem from a raidz2 pool to another
> pool consisting of a single disk when that disk failed. As a result,
> both the zfs send and zfs receive processes are now in uninterruptible
> sleep, and every new zpool or zfs command I issue immediately enters
> uninterruptible sleep as well. Is this just bad luck (i.e. my disk
> failed at the wrong moment), or might this be a bug?
> 
> Anyway, my only solution is to schedule a reboot soon, as the machine
> is a file server and the operational status of ZFS is critical.
> 
> I'm not very experienced with ZFS or the FreeBSD kernel, so I'll just
> try to supply as much relevant information as possible. Please tell me
> if there is more I can do.
> 
> The system I run is FreeBSD 10.1-RELEASE-p6; the machine is a small
> Intel file server (eight-core Atom, 64 GB RAM, Supermicro board, two
> raidz2 pools connected via reflashed IBM M1015 controllers). Here are
> the relevant lines from "ps ax" (with anonymized pool/filesystem names):
> 
> The errors that showed up in /var/log/messages when my hard disk
> failed are (excerpt):
> 
> May 19 15:00:48 srv0 kernel: ahcich7: Timeout on slot 0 port 0
> May 19 15:00:48 srv0 kernel: ahcich7: is 00000000 cs c000001f ss f800001f rs f800001f tfd 40 serr 00000000 cmd 0004dd17
> May 19 15:00:48 srv0 kernel: (ada7:ahcich7:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 0b 8c f3 6a 40 00 00 00 00 00 00
> May 19 15:00:48 srv0 kernel: (ada7:ahcich7:0:0:0): CAM status: Command timeout
> May 19 15:00:48 srv0 kernel: (ada7:ahcich7:0:0:0): Retrying command
> 
> Lines of this form continued for some minutes, and after a while the
> geli volume on this disk began complaining as well:
> 
> May 19 15:03:09 srv0 kernel: GEOM_ELI: Crypto WRITE request failed (error=6). label/bkp101.eli[WRITE(offset=3595775488, length=131072)]
> 
> Is there any hope for me to resolve this issue without a reboot?
> 
> Thanks for your help,
> 
> Simon

Can you try using geli and/or glabel to force-detach label/bkp101.eli,
so that ZFS treats the device as failed? I'm not sure how geli and
glabel will handle it, but you could also try
sysctl kern.cam.ada.retry_count=0 to make the kernel give up on the disk
sooner; the "failure" might then cascade up to ZFS, which should
hopefully give up on the disk as well. I think the problem here is that
ZFS does not know about the incomplete failures in the lower layers.
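
A rough sketch of those two steps, in case it helps. The provider name
is taken from the GEOM_ELI log line; the run helper and the DRY_RUN
variable are made up for illustration and are not part of any FreeBSD
tool. It is only an assumption that "geli detach -f" will return while
I/O to the dead disk is still outstanding. By default the script only
prints the commands; run it as root with DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set explicitly.

PROVIDER="label/bkp101.eli"   # geli provider named in the GEOM_ELI log line

# Hypothetical helper: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Tell CAM not to retry timed-out commands on ada(4) disks, so that
# errors are reported to the upper layers sooner.
run sysctl kern.cam.ada.retry_count=0

# Force-detach the geli provider so that ZFS sees the device disappear.
run geli detach -f "$PROVIDER"
```

Remember to set kern.cam.ada.retry_count back to its previous value
afterwards, since the sysctl applies to all ada disks and not just the
failed one.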