Problem with zpool

Adam Stylinski
Thu Aug 26 18:33:06 UTC 2010

I was sending snapshots into the pool and may have sent a bad one, because
at some point I got a "bad magic number" error.  At the time I figured it
was my disk (ad6) going bad.  No matter what I tried, it would not let me
offline ad6; it claimed insufficient replicas existed.  So I pulled the
device and put a new one on the same port of the same controller to replace
it.  I then ran the replace command, only to have it sit there pretending to
replace while the zpool process sat in top in state g_wait.

I googled around and found someone on one of the mailing lists with a
similar bug.  The reply said it was fixed in a revision to zfs (I can't
remember which MFC) and that upgrading to FreeBSD 8.1 would fix the problem.
This is a v13 pool.  Now that I've upgraded to 8.1, I ran a scrub, which
forced the resilver, and it claims to be "replacing".  As it turns out, I
had a cron job running freebsd-update cron, which happened to kick in for
8.0 partway through the freebsd-update upgrade.  The result was an 8.1
kernel with an 8.0 userland, which did still let me scrub (I guess the ABI
between zpool and libzpool didn't break in this update).

It finished the scrub, but it still claims this:

              8991447011275450347  UNAVAIL      0 10.0K     0  was
              ada2                 ONLINE       0     0     0  14.1G

Meaning it's still trying to write to the old device, which it won't let me
offline, while resilvering to ada2.  It told me the resilver completed
successfully with no checksum, read, or write errors, yet the old device is
still listed.  I am now scrubbing with 8.1's userland and kernel.  Do you
guys think it will finally let me remove the device it's replacing?

Also, it claims there is corruption in the pool, but the files reported as
"affected" are 100% fine: I've md5'd them against remote copies.  I realize
the metadata could be bad, so I'm not sure how to go about fixing that.
Anyway, please don't tl;dr, and let me know whatever advice you can give.  I
have moderate backups of some of the data (the entire pool has 2.7TB
occupied), but I'd like to avoid a destroy-and-create cycle.  So far this is
what zpool status reports:

[adam@nasbox ~]$ zpool status
  pool: share
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
 scrub: resilver in progress for 0h9m, 2.16% done, 6h53m to go

        NAME                       STATE     READ WRITE CKSUM
        share                      DEGRADED     0     0     0
          raidz1                   DEGRADED     0     0     0
            ada1                   ONLINE       0     0     0  60.9M
            ada3                   ONLINE       0     0     0  60.9M
            replacing              DEGRADED     0     0     0
              8991447011275450347  UNAVAIL      0 10.0K     0  was
              ada2                 ONLINE       0     0     0  14.1G
            ada4                   ONLINE       0     0     0  56.7M
          raidz1                   ONLINE       0     0     0
            da0                    ONLINE       0     0     0
            da2                    ONLINE       0     0     0
            da3                    ONLINE       0     0     0
            da1                    ONLINE       0     0     0
          raidz1                   ONLINE       0     0     0
            aacd0                  ONLINE       0     0     0
            aacd1                  ONLINE       0     0     0
            aacd2                  ONLINE       0     0     0
            aacd3                  ONLINE       0     0     0
        logs                       DEGRADED     0     0     0
          ada0                     ONLINE       0     0     0
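
For anyone who wants to watch for the stuck device without reading the whole
table each time, here is a minimal sketch that pulls every row whose state is
not ONLINE out of saved zpool status text.  The function name
non_online_devices and the SAMPLE excerpt are made up for illustration, and
the parsing assumes the column layout shown above (name, state, then the
three error counters); it is not part of any ZFS tooling.

```python
# Hypothetical helper: list config rows of `zpool status` text that are not ONLINE.
# Assumes each config row looks like: NAME STATE READ WRITE CKSUM [extra...]
KNOWN_STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}

def non_online_devices(status_text):
    """Return (name, state) pairs for config rows whose state is not ONLINE."""
    rows = []
    for line in status_text.splitlines():
        fields = line.split()
        # Config rows have at least name, state and three counters.
        if len(fields) >= 5 and fields[1] in KNOWN_STATES and fields[1] != "ONLINE":
            rows.append((fields[0], fields[1]))
    return rows

# Excerpt copied from the status output above.
SAMPLE = """
 state: DEGRADED
        NAME                       STATE     READ WRITE CKSUM
            replacing              DEGRADED     0     0     0
              8991447011275450347  UNAVAIL      0 10.0K     0  was
              ada2                 ONLINE       0     0     0  14.1G
"""

if __name__ == "__main__":
    for name, state in non_online_devices(SAMPLE):
        print(name, state)
```

The "state: DEGRADED" summary line and the column header are skipped because
they do not have a known state in the second column followed by counters.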


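The md5 check mentioned above can be sketched roughly as follows.  This is
only an illustration of the idea, not the exact commands I ran: the function
names md5_of_file and mismatched are hypothetical, and it assumes the md5
sums of the remote copies have already been collected into a dict keyed by
local path.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def mismatched(local_paths, expected):
    """Return the paths whose local md5 differs from the expected (remote) md5.

    `expected` maps path -> hex md5 string; a path missing from it counts
    as a mismatch.
    """
    return [p for p in local_paths if md5_of_file(p) != expected.get(p)]
```

An empty list from mismatched() means every file's contents match the remote
copy.  It says nothing about pool metadata, which is the part I am still
unsure about.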