Perhaps odd, perhaps trouble, perhaps not? (ZFS and Mirrored Configurations)

Karl Denninger karl at denninger.net
Mon Oct 24 18:54:40 UTC 2016


Contemplate the following:

1. Mirrored ZFS pool called "external" with *three* components (call
them "Primary", "Sec1" and "Sec2").

2. Write data to said pool.

3. "zpool offline external Sec2"

4. Physically remove "Sec2" and place it in a vault somewhere;
"Primary" and "Sec1" remain in the computer.

5. Run the system for some fairly long period of time (days, weeks,
perhaps months).

6. "Zpool scrub external" (make sure both drives are ok); note zero
errors at completion.

6. "zpool offline external Sec1"

8. Remove "Sec1", exchange it with "Sec2" at the vault, and place
"Sec2" back in the computer.

8. "zpool online external (long series of numbers that zpool status says
was Sec1 last time it was mounted)"

Wait for the resilver to complete, which _*should*_ only require that
the blocks which _*changed*_ on the pool while "Sec2" was out be
rewritten to "Sec2" (the whole rotation is sketched below).

Well, this works if the elapsed time is relatively short.  However, what
I have observed is that if the time is relatively long, or some unknown
event(s) take place in the interim, then you can get this:

zpool status external
  pool: external
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Oct 24 12:32:40 2016
        487G scanned out of 2.26T at 127M/s, 4h6m to go
        96.1G resilvered, 21.05% done
config:

        NAME                     STATE     READ WRITE CKSUM
        external                 DEGRADED     0     0     0
          mirror-0               DEGRADED     0     0     0
            label/Primary.eli    ONLINE       0     0     0
            9568232509714437622  OFFLINE      0     0     0  was /dev/label/Sec1.eli
            label/Sec2.eli       ONLINE       0     0  555K  (resilvering)

Note the enormous number of checksum errors, all on the *just-attached*
disk.  Do you think it's possible that I have *that many* actual
checksum errors on those blocks and yet *zero* I/O errors logged by the
driver *or* in the "smart" data queried from the disk itself, *and*
geli is not complaining about corruption of the data it's reading
either?  Uh, not damn likely.  Further, the re-write is successful
(i.e. no write errors wind up being logged during this process.)
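
For the record, "zero errors anywhere else" means checks along these
lines (the ada3 device name is only an example; substitute whatever
provider actually sits under the GELI label):

# ZFS's own per-device error counters
zpool status -v external

# kernel / driver messages for the underlying disk
dmesg | grep ada3

# error log and attributes reported by the drive itself
smartctl -a /dev/ada3        # from sysutils/smartmontools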

Any idea what's going on here when this happens?  I suspect that some
event has cleared the record of "pending" changes between the devices
in that mirror (the per-vdev "dirty time log", if I understand the
resilver machinery correctly), such that you can no longer successfully
online the device without at least part of it (but not all of it) being
re-written -- but I have no idea what event that would be (and thus how
to avoid it, if it can be avoided.)
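
The only forensic handle I can think of for catching such an event is
the pool history; something like the following (including internally
logged operations) should show whether anything touched the pool or
vdev configuration while the disk was out:

# long-format pool history, including internally logged events
zpool history -il external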

Why do something like this in the first place?  Because it's a very
convenient way to take a device offsite for fire/disaster protection
while keeping the resync fast when most of the data in the pool has not
changed, since (in theory) the system only needs to rewrite the blocks
that changed while that disk was out.

-- 
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/