ZFS 'read-only' device / pool scan / import?

Karl Pielorz kpielorz_lst at tdx.co.uk
Tue Oct 19 15:30:43 UTC 2010


--On 19 October 2010 08:16 -0700 Jeremy Chadwick <freebsd at jdc.parodius.com> 
wrote:

> Experts here might be able to help, but you're really going to need to
> provide every little detail, in chronological order.  What commands were
> done, what output was seen, what physical actions took place, etc..
>
> 1) Restoring from backups is probably your best bet (IMHO; this is what I
> would do as well).

I didn't provide much detail, as there isn't much detail left to provide 
(the pool's since been destroyed and rebuilt). How it got messed up is 
almost certainly a case of human error plus a controller 'oddity' with 
failed devices [which is now suitably noted for that machine!]...

It was more a 'for future reference' kind of question: does attempting to 
import a pool (or even running something as simple as a 'zpool status' when 
ZFS has not been 'loaded') actually write to the disks? i.e. could it cause 
a pool that is currently 'messed up' to become permanently 'messed up', 
because ZFS will change metadata on the pool if, at the time, it deems 
devices to be faulted / corrupt etc.? And, if it does, is there any way of 
doing a 'test mount/import' (i.e. with the underlying devices only opened 
read-only), or does ZFS [as I suspect] *need* r/w access to those devices 
as part of the work to actually import/mount?
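
For the archive, something along these lines is what I was hoping for - a 
sketch only, assuming a ZFS version new enough to support a read-only 
import (newer pool versions grew this; I'm not sure the version on that box 
had it), with the pool name 'vol' just as an example:

  # list pools visible on the attached devices; this scan should only read
  # the on-disk labels
  zpool import

  # import without allowing any writes to the pool (newer ZFS only)
  zpool import -o readonly=on vol

  # poke around, then export again
  zpool status vol
  zpool export vol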

> There's a lot of other things I could add to the item list here
> (probably reach 9 or 10 if I tried), but in general the above sounds
> like its what happened.  raidz2 would have been able to save you in this
> situation, but would require at least 4 disks.

It was RAIDZ2 - it got totally screwed:

"
    vol           UNAVAIL      0     0     0  insufficient replicas
      raidz2      UNAVAIL      0     0     0  insufficient replicas
        da3       FAULTED      0     0     0  corrupted data
        da4       FAULTED      0     0     0  corrupted data
        da5       FAULTED      0     0     0  corrupted data
        da6       FAULTED      0     0     0  corrupted data
        da7       FAULTED      0     0     0  corrupted data
        da8       FAULTED      0     0     0  corrupted data
      raidz2      UNAVAIL      0     0     0  insufficient replicas
        da1       ONLINE       0     0     0
        da2       ONLINE       0     0     0
        da9       FAULTED      0     0     0  corrupted data
        da10      FAULTED      0     0     0  corrupted data
        da11      FAULTED      0     0     0  corrupted data
        da11      ONLINE       0     0     0
"


As there is such a large aspect of human error (and controller behaviour), 
I don't think it's worth digging into any deeper. It's the first pool we've 
ever "lost" under ZFS, and like I said, with the combination of the 
controller collapsing devices and humans replacing the wrong disks, 'twas 
doomed to fail from the start.

We've replaced failed drives on this system before - but never rebooted 
between a failure and its replacement, and never replaced the wrong drive 
:)

Definitely a good advert for backups though :)

-Karl
