ZFS: Failed pool causes system to hang

Quartz quartz at sneakertech.com
Thu Mar 21 18:11:07 UTC 2013


>> I'm not
>> messing with partitions yet because I don't want to complicate things.
>> (I will eventually be going that route though as the controller tends
>> to renumber drives in a first-come-first-serve order that makes some
>> things difficult).
>
> Solving this is easy, WITHOUT use of partitions or labels.  There is a
> feature of CAM(4) called "wired down" or "wiring down", where you can in
> essence statically map a SATA port to a static device number regardless
> if a disk is inserted at the time the kernel boots

My wording implied the wrong thing here: the dev ID mapping issue is 
*one* of the reasons I'm going to go with partitions. Another is the 
"replacement disk is one sector too small" issue, and another is that 
gpt labels let me reference drives by an arbitrary string, so I don't 
have to remember which dev ID corresponds to which physical bay.
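
For what it's worth, the per-drive plan is roughly the following (just a 
sketch: the bay labels, size, and alignment are made-up values, with the 
partition left a little short of the disk to dodge the "one sector too 
small" problem):

  # GPT scheme plus a ZFS partition labeled after the physical bay,
  # deliberately a bit smaller than the raw disk
  gpart create -s gpt da0
  gpart add -t freebsd-zfs -a 1m -s 2794G -l bay01 da0

  # the pool then references labels instead of raw device IDs
  zpool create array raidz2 gpt/bay01 gpt/bay02 gpt/bay03 gpt/bay04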

I probably want to know about this trick anyway though, it looks useful.
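
(For the archives: the wiring-down being referred to is done with hints 
in /boot/device.hints. The sketch below assumes an ahci(4) controller 
with ada disks, so the exact channel and device names for any given 
controller will differ.)

  # pin each SATA channel to a fixed scbus, and each disk to that scbus,
  # so the device number no longer depends on probe order
  hint.scbus.0.at="ahcich0"
  hint.scbus.1.at="ahcich1"
  hint.ada.0.at="scbus0"
  hint.ada.1.at="scbus1"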


> I can help you with this, but I need to see a dmesg (everything from
> boot to the point mountroot gets done).

Can do, but I'll need to reinstall again first. Gimme a little while.
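
(I'm assuming the saved boot buffer is the easiest way to capture that 
once the box is back up:)

  # boot-time messages, from device probe through mountroot
  cat /var/run/dmesg.boot > ~/dmesg-boot.txt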



>> I'm experiencing fatal issues with pools hanging my machine requiring a
>> hard-reset.
>
> This, to me, means something very different than what was described in a
> subsequent follow-up:

Well, what I meant here is that when the pool fails, it takes the entire 
machine down with it in short order. Having a machine become 
unresponsive and require a panel-button hard reset (with subsequent 
fsck-ing and possible corruption) counts as a fatal problem in my book. 
I don't accept this type of behavior in *any* system, even a Windows 
desktop.


> S1. In your situation, when a ZFS pool loses enough vdev or vdev members
> to cause permanent pool damage (as in completely 100% unrecoverable,
> such as losing 3 disks of a raidz2 pool), any I/O to the pool results in
> those applications hanging.

Sorta. Yes, the command I issued hangs, but so do a lot of other things 
as well. I can't kill -9 any of them, or reboot, or anything.


>The system is still functional/usable (e.g.
> I/O to other pools and non-ZFS filesystems works fine),

Assuming I do those *first*. Once something touches the pool, all bets 
are off. 'ps' and 'top' seem safe, but things like 'cd' are a gamble. 
Admittedly, though, I haven't spent any time testing exactly what does 
and doesn't work, or whether there's a pattern to it.
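
Next time it wedges I'll try to grab the wait channels and a kernel 
stack or two before everything stops responding; something like this, 
with the pid obviously being whatever happens to be stuck at the time:

  # wait channel for every process; the hung ones tend to share a wchan
  ps -axo pid,state,wchan,command

  # kernel stack of one stuck process
  procstat -kk 1234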


> A1. This is because "failmode=wait" on the pool, which is the default
> property value.  This is by design; there is no ZFS "timeout" for this
> sort of thing.  "failmode=continue" is what you're looking for (keep
> reading).
>
> S2. If the pool uses "failmode=continue", there is no change in
> behaviour, (i.e. EIO is still never returned).
>
> A2. That sounds like a bug then.  I test your claim below, and you might
> be surprised at the findings.

As far as I'm aware, "wait" will hang all I/O, reads and writes alike, 
whereas "continue" is supposed to return EIO on new writes while still 
allowing reads. My problem (as near as I can tell) is that nothing 
informs processes or stops them from trying to write to the pool, so 
"continue" effectively only delays the inevitable by several seconds.
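
For reference, this is all I'm doing to flip the property ("array" here 
is just a stand-in for the pool name):

  # check the current setting, then switch it
  zpool get failmode array
  zpool set failmode=continue array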


> S3. If the previously-yanked disks are reinserted, the issue remains.
>
> A3. What you're looking for is the "autoreplace" pool property.

No it's not. I *don't* want the pool trying to suck up a freshly 
inserted drive without my explicit say-so. I only mentioned this because 
some other thread I was reading implied that ZFS would come back to life 
if it could talk to the drive again.
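
As I understand it, the default already behaves the way I want, and 
re-adding a pulled disk is something I'd rather do by hand (again, 
"array" and da2 are stand-ins):

  # confirm the pool is NOT grabbing inserted disks on its own
  # (autoreplace defaults to off)
  zpool get autoreplace array

  # when I *do* want a re-inserted disk back, do it explicitly
  zpool online array da2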


> And in the other window where dd is running, it immediately terminates
> with EIO:

IIRC I only tried popping a third disk during activity once... It was 
during an scp from another machine, and it just paused. During all other 
tests, I've waited to make sure everything had settled down first.
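
Next round I'll try to reproduce your dd case too, something along the 
lines of the following (the mountpoint is hypothetical):

  # keep writes going against the pool while the third disk gets pulled
  dd if=/dev/zero of=/array/ddtest bs=1m count=100000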


> One thing to note (and it's important) above is that da2 is still
> considered "ONLINE".  More on that in a moment.

Yeah I noticed that in my testing.

> root at testbox:/root # zpool replace array da2
> cannot open 'da2': no such GEOM provider
> must be a full path or shorthand device name
>
> This would indicate a separate/different bug, probably in CAM or its
> related pieces.

I don't even get as far as this. Most of the time, once something has 
caused the hang, not much works past that point. If I had followed your 
example to the letter and typed 'ls' first, 'zpool replace' would have 
just hung as well without printing anything.
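
(If I ever get far enough to see that error myself, I assume the 
full-path form it's asking for looks something like this, with da3 being 
a made-up name for whatever the replacement disk shows up as:)

  # old member first, then the full path of the new device
  zpool replace array da2 /dev/da3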


> I'll end this Email with (hopefully) an educational statement:  I hope
> my analysis shows you why very thorough, detailed output/etc. needs to
> be provided when reporting a problem, and not just some "general"
> description.  This is why hard data/logs/etc. are necessary, and why
> every single step of the way needs to be provided, including physical
> tasks performed.

Oh, I agree, but etiquette dictates that I not spam people with 5 KB of 
unsolicited text covering every possible detail about everything, 
especially when I'm not even sure this is the right mailing list.
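
Point taken; next time I'll gather the hard data up front before 
posting, roughly this sort of thing ("array" again being a stand-in for 
the pool name):

  # state of the pool, the partitioning, and the attached devices
  zpool status -v
  zpool get all array
  gpart show
  camcontrol devlist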


> P.S. -- I started this Email at 23:15 PDT.  It's now 01:52 PDT.  To whom
> should I send a bill for time rendered?  ;-)

Ha, I think I have you beat there :)
I'll frequently spend hours writing single emails.

______________________________________
it has a certain smooth-brained appeal

