Concern: ZFS Mirror issues (12.STABLE and firmware 19 v. 20) [[UPDATE w/more tests]]

Karl Denninger karl at denninger.net
Sun Apr 28 22:02:07 UTC 2019


On 4/20/2019 15:56, Steven Hartland wrote:
> Thanks for extra info, the next question would be have you eliminated
> that corruption exists before the disk is removed?
>
> Would be interesting to add a zpool scrub to confirm this isn't the
> case before the disk removal is attempted.
>
>     Regards
>     Steve
>
> On 20/04/2019 18:35, Karl Denninger wrote:
>>
>> On 4/20/2019 10:50, Steven Hartland wrote:
>>> Have you eliminated geli as possible source?
>> No; I could conceivably do so by re-creating another backup volume
>> set without geli-encrypting the drives, but I do not have an extra
>> set of drives of the capacity required laying around to do that.  I
>> would have to do it with lower-capacity disks, which I can attempt if
>> you think it would help.  I *do* have open slots in the drive
>> backplane to set up a second "test" unit of this sort.  For reasons
>> below it will take at least a couple of weeks to get good data on
>> whether the problem exists without geli, however.
>>
Ok, following up on this with more data....

First step taken was to create a *second* backup pool (I have the
backplane slots open, fortunately) with three different disks but *no
encryption.*
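
For reference, it was set up along these lines -- the daN device names and
the "backup-test" labels/pool name below are illustrative placeholders, not
the exact ones used, but the layout matches the real pool (a three-way
mirror), just without geli:

    # gpart create -s gpt da10
    # gpart add -t freebsd-zfs -l backup-test1 da10
      (likewise for da11 and da12, labeled backup-test2 and backup-test3)
    # zpool create backup-test mirror gpt/backup-test1 gpt/backup-test2 gpt/backup-test3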

I ran both side-by-side for several days, with the *unencrypted* one
operating with one disk detached and offline (pulled physically) just as
I do normally.  Then I swapped the two using the same paradigm.

The difference was *dramatic* -- the resilver did *not* scan the entire
disk; it only copied the changed blocks and was finished FAST.  A
subsequent scrub came up 100% clean.
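
(By "clean" I mean a manual scrub followed by a status check, along the
lines of the below against the illustrative test-pool name, with "repaired 0"
and "No known data errors" reported afterward:

    # zpool scrub backup-test
    # zpool status backup-test
)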

Next I put THOSE disks in the vault (so as to make sure I didn't get
hosed if something went wrong) and re-initialized the pool in question,
leaving only the geli layer alone (in other words I zpool destroy'd the
pool and then re-created it with all three disks connected and
geli-attached.)  The purpose of doing this was to eliminate the
possibility of old corruption somewhere, or of some sort of problem left
behind by multiple in-place "zpool upgrade" commands spanning several
years.  Then I ran a base backup to initialize all three volumes, took
one offline and yanked it out of the backplane, as is the usual practice
here, leaving the other two online and operating.
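
A minimal sketch of that sequence, assuming the geli providers stayed
attached throughout (geli options not shown):

    # zpool destroy backup
    # zpool create backup mirror gpt/backup61.eli gpt/backup62-1.eli gpt/backup62-2.eli
      (run the base backup to populate all three)
    # zpool offline backup gpt/backup62-2.eli
    # geli detach gpt/backup62-2.eli
      (camcontrol standby, then pull the 62-2 disk from the backplane)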

I ran backups as usual for most of last week after doing this, with the
61.eli and 62-1.eli volumes online, and 62-2 physically out of the
backplane.

Today I swapped them again as I usually do: offline 62-1, geli detach,
camcontrol standby, and then yank it; then insert the 62-2 volume, geli
attach, and zpool online.
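
In concrete terms the swap is roughly the following (daN is a placeholder
for the actual device and the geli keyfile/passphrase options are omitted;
the gpt labels are the real ones):

    # zpool offline backup gpt/backup62-1.eli
    # geli detach gpt/backup62-1.eli
    # camcontrol standby daN
      (pull the 62-1 disk, insert the 62-2 disk)
    # geli attach gpt/backup62-2
    # zpool online backup gpt/backup62-2.eli

This is what is happening after the online: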

[\u at NewFS /home/karl]# zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Apr 28 12:57:47 2019
        2.48T scanned at 202M/s, 1.89T issued at 154M/s, 3.27T total
        1.89T resilvered, 57.70% done, 0 days 02:37:14 to go
config:

        NAME                      STATE     READ WRITE CKSUM
        backup                    DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            gpt/backup61.eli      ONLINE       0     0     0
            11295390187305954877  OFFLINE      0     0     0  was /dev/gpt/backup62-1.eli
            gpt/backup62-2.eli    ONLINE       0     0     0

errors: No known data errors

The "3.27T" number is accurate (by "zpool list") for the space in use.

There is not a snowball's chance in Hades that anywhere near 1.89T of
that data (thus far, and it ain't done as you can see!) was modified
between when all three disks were online and when the 62-2.eli volume
was swapped back in for 62-1.eli.  No possible way.  Maybe some
100-200GB of data has been touched across the backed-up filesystems in
the last three-ish days, but there's just flat-out no way it's more than
that; anything else would imply churn across well over 50% of the
writeable data on this box in less than a week!  That's NOT possible.
Further, it's not even copying 100% of the pool: it shows 2.48T scanned
but only 1.89T actually issued to the other drive so far.

So something is definitely foooged here, and it does appear that geli is
involved.  Whatever is foooging ZFS, the resilver process thinks it has
to recopy MOST (but not all!) of the blocks in use from the 61.eli
volume to the 62-2.eli volume.

The question is what would lead ZFS to think it has to do that -- it
clearly DOES NOT, as a *much* smaller percentage of the total TXG set on
61.eli was modified while 62-2.eli was offline and 62-1.eli was online.

Again I note that on 11.1 and earlier this resilver was a rapid
operation; whatever had actually changed got copied, but the system
never copied *nearly everything* on a resilver of a mirrored set,
including data that had not been changed at all.

Obviously on a RAID-Z volume you have to go through the entire data
structure because parity has to be recomputed and blocks regenerated,
but on a mirror only the changed TXGs need to be looked at and copied.
TXGs that were on both disks at the time the second one was taken
offline do not need to be touched *at all*, since they're already on the
target!

What's going on here?

-- 
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/