A failed drive causes system to hang

Jeremy Chadwick jdc at koitsu.org
Thu Apr 11 21:24:10 UTC 2013


On Thu, Apr 11, 2013 at 10:47:32PM +0200, Radio młodych bandytów wrote:
> Seeing a ZFS thread, I decided to write about a similar problem that
> I experience.
> I have a failing drive in my array. I need to RMA it, but don't have
> time and it fails rarely enough to be just another annoyance.
> The failure is simple: it fails to respond.
> When it happens, the only thing I found I can do is switch consoles.
> Any command fails, login fails, apps hang.
> 
> On the 1st console I see a series of messages like:
> 
> (ada0:ahcich0:0:0:0): CAM status: Command timeout
> (ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
> (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED
> 
> I use RAIDZ1 and I'd expect that no single failure would cause the
> system to fail...

You need to provide full output from "dmesg", and you need to define
what the word "fails" means (re: "any command fails", "login fails").

I've already demonstrated that loss of a disk in raidz1 (or even 2 disks
in raidz2) does not cause "the system to fail" on stable/9.  However,
if you lose enough members or vdevs to cause catastrophic failure, there
may be anomalies depending on how your system is set up:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016814.html

If the pool has failmode=wait, any I/O to that pool will block (wait)
indefinitely.  This is the default.

If the pool has failmode=continue, existing write I/O operations will
fail with EIO (I/O error) (and hopefully applications/daemons will
handle that gracefully -- if not, that's their fault) but any subsequent
I/O (read or write) to that pool will block (wait) indefinitely.

If the pool has failmode=panic, the kernel will immediately panic.
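
To see which of these modes a pool is actually using, something like the
following rough sketch may help.  The pool name "tank" and the helper
names failmode_value and describe_failmode are placeholders made up for
illustration; the descriptions simply restate the three cases above.

```sh
#!/bin/sh
# Sketch only: "tank" is a placeholder pool name, not taken from this thread.

# Read `zpool get failmode <pool>` output on stdin and print the VALUE
# column of the failmode row (the header row is skipped automatically).
failmode_value() {
    awk '$2 == "failmode" { print $3 }'
}

# Restate what each mode does once the pool can no longer serve I/O.
describe_failmode() {
    case "$1" in
        wait)     echo "all I/O to the pool blocks indefinitely (default)" ;;
        continue) echo "outstanding writes fail with EIO; later I/O blocks" ;;
        panic)    echo "kernel panics immediately" ;;
        *)        echo "unknown failmode: $1" ;;
    esac
}

# Usage on a live system (commented out so the sketch has no side effects):
#   mode=$(zpool get failmode tank | failmode_value)
#   describe_failmode "$mode"
#   zpool set failmode=continue tank    # change the mode, as root
```

Changing the property does not fix a failing disk; it only changes how
the pool reacts once too many members are gone.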

If the CAM layer is what's wedged, that may be a different issue (and
not related to ZFS).  I would suggest running stable/9 as many
improvements in this regard have been committed recently (some related
to CAM, others related to ZFS and its new "deadman" watcher).

Bottom line: terse output of the problem does not help.  Be verbose,
provide all output (commands you type, everything!), as well as any
physical actions you take.
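
One possible way to capture commands together with their output is a
small wrapper like the sketch below.  The helper name run_logged and the
choice of commands in the usage comment are assumptions for illustration
only; dmesg is the one output explicitly requested above.

```sh
#!/bin/sh
# Sketch only: run_logged is a hypothetical helper, not an existing tool.

# Print the command line with a "# " banner, then its stdout and stderr,
# then a blank line, so the report shows exactly what was typed.
run_logged() {
    echo "# $*"
    "$@" 2>&1
    echo
}

# Usage on the affected machine (commented out; adjust to your setup):
#   {
#       run_logged uname -a
#       run_logged dmesg
#       run_logged zpool status -v
#       run_logged camcontrol devlist
#   } > report.txt
```

Physical actions (pulling a cable, reseating a drive) still have to be
written down by hand next to the point in the output where they happened.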

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |

