Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20) [[UPDATE w/more tests]]

Mon Apr 29 14:26:43 UTC 2019

On Sun, Apr 28, 2019 at 4:03 PM Karl Denninger <karl at denninger.net> wrote:

> On 4/20/2019 15:56, Steven Hartland wrote:
> > Thanks for extra info, the next question would be have you eliminated
> > that corruption exists before the disk is removed?
> >
> > Would be interesting to add a zpool scrub to confirm this isn't the
> > case before the disk removal is attempted.
> >
> >     Regards
> >     Steve
> >
> > On 20/04/2019 18:35, Karl Denninger wrote:
> >>
> >> On 4/20/2019 10:50, Steven Hartland wrote:
> >>> Have you eliminated geli as possible source?
> >> No; I could conceivably do so by re-creating another backup volume
> >> set without geli-encrypting the drives, but I do not have an extra
> >> set of drives of the capacity required laying around to do that.  I
> >> would have to do it with lower-capacity disks, which I can attempt if
> >> you think it would help.  I *do* have open slots in the drive
> >> backplane to set up a second "test" unit of this sort.  For reasons
> >> below it will take at least a couple of weeks to get good data on
> >> whether the problem exists without geli, however.
> >>
> Ok, following up on this with more data....
>
> First step taken was to create a *second* backup pool (I have the
> backplane slots open, fortunately) with three different disks but *no
> encryption.*
>
> I ran both side-by-side for several days, with the *unencrypted* one
> operating with one disk detached and offline (pulled physically) just as
> I do normally.  Then I swapped the two using the same paradigm.
>
> The difference was *dramatic* -- the resilver did *not* scan the entire
> disk; it only copied the changed blocks and was finished FAST.  A
> subsequent scrub came up 100% clean.
>
> Next I put THOSE disks in the vault (so as to make sure I didn't get
> hosed if something went wrong) and re-initialized the pool in question,
> leaving only the "geli" alone (in other words I zpool destroy'd the pool
> and then re-created it with all three disks connected and
> geli-attached.)  The purpose for doing this was to eliminate the
> possibility of old corruption somewhere, or some sort of problem with
> multiple, spanning years, in-place "zpool upgrade" commands.  Then I ran
> a base backup to initialize all three volumes, detached one and yanked
> it out of the backplane, as would be the usual, leaving the other two
> online and operating.
>
> I ran backups as usual for most of last week after doing this, with the
> 61.eli and 62-1.eli volumes online, and 62-2 physically out of the
> backplane.
>
> Today I swapped them again as I usually do (e.g. offline 62.1, geli
> detach, camcontrol standby and then yank it -- then insert the 62-2
> volume, geli attach and zpool online) and this is happening:
>
> [\u at NewFS /home/karl]# zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices is currently being resilvered.  The pool will
>         continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>   scan: resilver in progress since Sun Apr 28 12:57:47 2019
>         2.48T scanned at 202M/s, 1.89T issued at 154M/s, 3.27T total
>         1.89T resilvered, 57.70% done, 0 days 02:37:14 to go
> config:
>
>         NAME                      STATE     READ WRITE CKSUM
>         backup                    DEGRADED     0     0     0
>           mirror-0                DEGRADED     0     0     0
>             gpt/backup61.eli      ONLINE       0     0     0
>             11295390187305954877  OFFLINE      0     0     0  was
> /dev/gpt/backup62-1.eli
>             gpt/backup62-2.eli    ONLINE       0     0     0
>
> errors: No known data errors
>
> The "3.27T" number is accurate (by "zpool list") for the space in use.
>
> There is not a snowball's chance in Hades that anywhere near 1.89T of
> that data (thus far, and it ain't done as you can see!) was modified
> between when all three disks were online and when the 62-2.eli volume
> was swapped back in for 62.1.eli.  No possible way.  Maybe some
> 100-200Gb of data has been touched across the backed-up filesystems in
> the last three-ish days but there's just flat-out no way it's more than
> that; this would imply an entropy of well over 50% of the writeable data
> on this box in less than a week!  That's NOT possible.  Further it's not
> 100%; it shows 2.48T scanned but 1.89T actually written to the other drive.
>
> So something is definitely foooged here and it does appear that geli is
> involved in it.  Whatever is foooging zfs the resilver process thinks it
> has to recopy MOST (but not all!) of the blocks in use, it appears, from
> the 61.eli volume to the 62-2.eli volume.
>
> The question is what would lead ZFS to think it has to do that -- it
> clearly DOES NOT as a *much* smaller percentage of the total TXG set on
> 61.eli was modified while 62-2.eli was offline and 62.1.eli was online.
>
> Again I note that on 11.1 and previous this resilver was a rapid
> operation; whatever was actually changed got copied but the system never
> copied *nearly everything* on a resilver, including data that had not
> been changed at all, on a mirrored set.
>
> Obviously on a Raidz volume you have to go through the entire data
> structure because parity has to be recomputed and blocks regenerated but
> on a mirror only the changed TXGs need to be looked at and copied.  TXGs
> that were on both disks at the time second one was taken offline do not
> need to be touched *at all* since they're already on the target!
>
> What's going on here?
>

Silly question, and maybe I missed you doing this, but have you tried
dd'ing zeros each of the eli volumes that you are putitng the zpool on?
Does that make the problem go away? Maybe there's something weird going on
with un-written sectors and/or reusing sectors with weird content that is
accentuated with geli, or that causes an error path to trigger that doesn't
with !geli.

Otherwise, this testing sounds quite complete.

Warner