zfs tasting dropped a stripe out of my pool. help getting it back?

GM gildenman at gmail.com
Thu Jul 14 17:37:52 UTC 2011


Hi,
Whilst the way zfs looks for its data everywhere can be useful when devices change,
I've been rather stung by it.
I have a raidz2 made up of 4x 2TB drives plus two gstripe devices (each 2x 1TB), giving 6x 2TB members in total.
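In case it helps to see the layout as commands, the setup is roughly equivalent to this (the stripesize is a guess on my part; the names and ordering are read back from the status output below):

# gstripe label -s 65536 1TB_drive0+1 /dev/gpt/1TB_drive1 /dev/gpt/1TB_drive0
# gstripe label -s 65536 1TB_drive2+3 /dev/gpt/1TB_drive3 /dev/gpt/1TB_drive2
# zpool create pool2 raidz2 gpt/2TB_drive0 gpt/2TB_drive1 gpt/2TB_drive2 \
      stripe/1TB_drive0+1 stripe/1TB_drive2+3 gpt/2TB_drive3
# zpool add pool2 cache gpt/cache0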
I currently have this:

   pool: pool2
  state: DEGRADED
status: One or more devices could not be used because the label is missing or
         invalid.  Sufficient replicas exist for the pool to continue
         functioning in a degraded state.
action: Replace the device using 'zpool replace'.
    see: http://www.sun.com/msg/ZFS-8000-4J
  scan: resilvered 1.83M in 0h0m with 0 errors on Thu Jul 14 14:59:22 2011
config:

         NAME                      STATE     READ WRITE CKSUM
         pool2                     DEGRADED     0     0     0
           raidz2-0                DEGRADED     0     0     0
             gpt/2TB_drive0        ONLINE       0     0     0
             gpt/2TB_drive1        ONLINE       0     0     0
             gpt/2TB_drive2        ONLINE       0     0     0
             13298804679359865221  UNAVAIL      0     0     0  was /dev/gpt/1TB_drive0
             12966661380732156057  UNAVAIL      0     0     0  was /dev/gpt/1TB_drive2
             gpt/2TB_drive3        ONLINE       0     0     0
         cache
           gpt/cache0              ONLINE       0     0     0

The two UNAVAIL entries used to be stripes. The system helpfully removed them for me.
These are the stripes that used to be in the pool:

# gstripe status
                Name  Status  Components
stripe/1TB_drive0+1      UP  gpt/1TB_drive1
                              gpt/1TB_drive0
stripe/1TB_drive2+3      UP  gpt/1TB_drive3
                              gpt/1TB_drive2

They still exist and have all the data in them.

It started when I booted up with the drive holding gpt/1TB_drive1 missing; zfs helpfully replaced the
stripe/1TB_drive0+1 device with gpt/1TB_drive0 and then told me it had corrupt data on it.

Am I right in thinking that, because one drive was missing (which meant stripe/1TB_drive0+1
was also missing), zfs tasted around and found that gpt/1TB_drive0 had what looked like
the right label on it? However, 64k in it would hit incorrect data, as the next 64k of the
stripe lives on the missing gpt/1TB_drive1.
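
I assume I could check this theory by dumping the labels on both the bare component and the assembled stripe, something like:

# zdb -l /dev/gpt/1TB_drive0
# zdb -l /dev/stripe/1TB_drive0+1

My guess is the bare component would show a plausible-looking but incomplete label, while the assembled stripe would still show all its labels intact.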

I was contemplating how to get the stripe back into the pool without having to do a complete
resilver on it. It seemed unnecessary to do that when the data was all still there.
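The first thing that crossed my mind was simply onlining the old vdev by the GUID shown above, though I have no idea whether that can work once the label has been flagged as invalid (pure guesswork on my part):

# zpool online pool2 13298804679359865221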

I thought an export and import might help it find the stripe again. However, for some reason that did the same
thing to the other stripe, stripe/1TB_drive2+3, which got replaced with gpt/1TB_drive2.
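
If another export/import is the answer, I wonder whether pointing the import only at the directories it should be using would stop it grabbing the raw components, something along these lines (untested):

# zpool export pool2
# zpool import -d /dev/stripe -d /dev/gpt pool2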

Now I am left without parity.
Any ideas on what commands will bring this back?
I know I can do a replace on both, but if there is some undetected corruption on the other devices then I will
lose some data, as any parity that could fix it is currently missing. I do scrub regularly, but I'd prefer not
to take that chance, especially as I have all the data sitting there!
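
For completeness, the replace I'm trying to avoid would presumably look something like this, using the stale GUIDs from the status output above:

# zpool replace pool2 13298804679359865221 stripe/1TB_drive0+1
# zpool replace pool2 12966661380732156057 stripe/1TB_drive2+3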

I'm hoping someone has some magic zfs commands to make all this go away :)

What can I do to prevent this in future? I've run pools with stripes for years without this happening.
It seems zfs has started to look far and wide for its devices. In the past, if the stripe was broken,
it would just tell me the device was missing, and once the stripe was back all was fine. However,
with this tasting everywhere it seems like stripes are now a no-no for zpools?
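
The only prevention I can think of is making sure the stripes are assembled before the pool gets imported at boot, e.g. loading geom_stripe from /boot/loader.conf rather than later from rc (again, just a guess on my part):

geom_stripe_load="YES"
zfs_load="YES"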

Thanks.

