This diskfailure should not panic a system, but just disconnect disk from ZFS

Mon Jun 22 01:21:53 UTC 2015

> Or do I have too high hopes of ZFS?
> And is a hung disk a 'catastrophic pool failure'?

Yes to both.

I encountered this exact same issue a couple of years ago and complained 
about it to this list as well, although I didn't get a complete answer 
at the time. (I can provide links to the conversation if interested.)

Basically, the heart of the issue is the way the kernel, drivers, and ZFS 
deal with IO and DMA. There's currently no way to tell what's going on 
with the disks and what outstanding IO to the pool can be dropped or 
ignored. As currently designed, there's no safe way to just kick out the 
pool and keep going, so the only options are to wait, panic, or wait and 
then panic. Fixing this would require a major rewrite of a lot of code, 
which isn't going to happen any time soon. The failmode setting and 
deadman timer were implemented as a bandage to prevent the system from 
hanging forever.
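
For reference, failmode is a per-pool property with three values: wait 
(the default), continue, and panic. A minimal sketch of checking and 
changing it, assuming a pool named "tank" (the pool name is only a 
placeholder):

```sh
# Show the current failmode of the pool ("tank" is a placeholder name).
zpool get failmode tank

# wait:     block all IO until the device comes back (default)
# continue: return EIO to new writes, still allow reads from healthy devices
# panic:    print a message and crash-dump the system
zpool set failmode=continue tank
```

None of these makes a hung disk harmless; they only choose which of the 
bad outcomes you get.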

See this page for more info:
http://comments.gmane.org/gmane.os.illumos.zfs/61

> All failmode settings result in a seriously handicapped system...

Yes. Again, this is a design issue/flaw with how DMA works. There's no 
real way to continue on gracefully when a pool completely dies due to 
hung IO.

We're all pretty much stuck with this problem, at least for quite a while.

> Is waiting only meant to wait a limited time? And then panic anyways?

By default, yes. However, if you know that on your system the issue will 
eventually resolve itself given several hours (and you want to wait that 
long) you can change the deadman timeout or disable it completely. Look 
at "vfs.zfs.deadman_enabled" and "vfs.zfs.deadman_synctime".
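
A rough sketch of how to inspect those settings. The exact names and 
units vary by release (I believe some versions call the timeout 
"vfs.zfs.deadman_synctime_ms"), so list what your kernel actually 
exposes rather than trusting the names above. Whether they can be 
changed at runtime or only at boot is also version dependent, so the 
loader.conf line is an assumption to check against your system:

```sh
# List every deadman-related ZFS knob this kernel exposes, with values.
sysctl -a | grep vfs.zfs.deadman

# If the knob is a boot-time tunable on your release, disabling the
# deadman timer would mean adding a line like this to /boot/loader.conf
# and rebooting (assumption: verify the name with the command above):
#
#   vfs.zfs.deadman_enabled="0"
```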