Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

Karl Denninger karl at denninger.net
Sat Apr 20 17:36:06 UTC 2019


On 4/20/2019 10:50, Steven Hartland wrote:
> Have you eliminated geli as possible source?
No; I could conceivably do so by re-creating another backup volume set
without geli-encrypting the drives, but I do not have an extra set of
drives of the required capacity lying around to do that.  I would have
to do it with lower-capacity disks, which I can attempt if you think it
would help.  I *do* have open slots in the drive backplane to set up a
second "test" unit of this sort.  For reasons below it will take at
least a couple of weeks to get good data on whether the problem exists
without geli, however.
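
If it's useful, the non-geli test unit would be set up roughly like
this (a sketch only; da10/da11 and the GPT labels are placeholder names
for whatever smaller spare drives end up in the open backplane slots):

    # Label two spare drives and build an unencrypted test mirror --
    # the same layout as the backup pool but with no geli layer at all.
    gpart create -s gpt da10
    gpart add -t freebsd-zfs -l testback0 da10
    gpart create -s gpt da11
    gpart add -t freebsd-zfs -l testback1 da11
    zpool create testbackup mirror gpt/testback0 gpt/testback1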
>
> I've just setup an old server which has a LSI 2008 running and old FW
> (11.0) so was going to have a go at reproducing this.
>
> Apart from the disconnect steps below is there anything else needed
> e.g. read / write workload during disconnect?

Yes.  An attempt to recreate this on my sandbox machine using smaller
disks (WD RE-320s) and a decent amount of read/write activity (tens to
~100 gigabytes) on a root mirror of three disks with one taken offline
did not succeed.  It *reliably* appears, however, on my backup volumes
with every drive swap.  The sandbox machine is identical hardware other
than the physical disks; both are Xeons with ECC RAM.
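
For reference, the synthetic attempt boiled down to roughly the
following (a sketch, not the exact commands; the pool and device names
are placeholders and the dd churn stands in for the mixed read/write
load I actually ran):

    # Take one leg of the three-way root mirror offline, push tens of
    # gigabytes of writes through the pool, bring the leg back, let it
    # resilver, then scrub and check the CKSUM column.
    POOL=zroot                  # placeholder pool name
    DISK=gpt/sandbox2.eli       # placeholder mirror leg
    zpool offline ${POOL} ${DISK}
    dd if=/dev/urandom of=/sandbox/churn.dat bs=1m count=50000  # ~50GB written
    zpool online ${POOL} ${DISK}
    # ...wait for the resilver to finish...
    zpool scrub ${POOL}
    zpool status -v ${POOL}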

The only operational difference is that the backup volume sets have a
*lot* of data written to them via zfs send | zfs recv over the
intervening period, whereas the "ordinary" I/O activity on my sandbox
produces a materially different I/O pattern.  The root pool on the
sandbox where I tried to reproduce it synthetically *is* using geli (in
fact it boots native-encrypted.)

The "ordinary" resilver on a disk swap typically covers ~2-3Tb and is a
~6-8 hour process.

The usual process for the backup pool looks like this:

Have 2 of the 3 physical disks mounted; the third is in the bank vault.

Over the space of a week, the backup script is run daily.  It first
imports the pool and then, for each zfs filesystem it is backing up
(not all of them; I have a few volatile ones I don't care about losing,
such as object directories for builds, plus some R/O data sets that are
backed up separately), it does:

If there is no "...@zfs-base":

    zfs snapshot -r ...@zfs-base
    zfs send -R ...@zfs-base | zfs receive -Fuvd $BACKUP

else

    zfs rename -r ...@zfs-base ...@zfs-old
    zfs snapshot -r ...@zfs-base

    zfs send -RI ...@zfs-old ...@zfs-base | zfs recv -Fudv $BACKUP

    .... if ok then zfs destroy -vr ...@zfs-old, otherwise print a
    complaint and stop.
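
Expressed as a rough shell sketch (placeholder names only; the real
script also skips the volatile and R/O datasets mentioned above and
iterates over the whole filesystem list):

    #!/bin/sh
    # One pass of the backup cycle for a single dataset.
    FS=$1            # source dataset, e.g. zroot/usr/home (placeholder)
    BACKUP=backup    # destination pool

    if ! zfs list -t snapshot "${FS}@zfs-base" >/dev/null 2>&1; then
        # First pass: full recursive replication stream.
        zfs snapshot -r "${FS}@zfs-base"
        zfs send -R "${FS}@zfs-base" | zfs receive -Fuvd "${BACKUP}"
    else
        # Later passes: roll the base snapshot forward, send the increment.
        zfs rename -r "${FS}@zfs-base" "${FS}@zfs-old"
        zfs snapshot -r "${FS}@zfs-base"
        if zfs send -RI "${FS}@zfs-old" "${FS}@zfs-base" | \
            zfs recv -Fudv "${BACKUP}"; then
            zfs destroy -vr "${FS}@zfs-old"
        else
            echo "incremental send of ${FS} failed; keeping @zfs-old" >&2
            exit 1
        fi
    fi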

When all are complete it then does a "zpool export backup" to detach the
pool in order to reduce the risk of "stupid root user" (me) accidents.

In short, I send an incremental of the changes since the last backup,
which in many cases includes a bunch of automatic snapshots that are
taken on a frequent basis from cron.  Typically a week's worth of these
accumulate between swaps of the disk to the vault, and the offline'd
disk remains that way for a week.  I also wait for the zfs destroy on
each of the targets to drain before continuing; not doing so back in
the 9 and 10.x days was a good way to provoke an instant panic on
re-import the next day due to kernel stack page exhaustion if the
previous operation destroyed hundreds of gigabytes of snapshots.  That
does routinely happen, since part of the backed-up data is Macrium
images from PCs, so when a new month rolls around the PCs' backup
routine removes a huge amount of old data from the filesystem.
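
One way to check that the destroy backlog has actually drained before
exporting (a minimal sketch, assuming the pool has the async_destroy
feature so the "freeing" property is meaningful; the sleep interval is
arbitrary):

    #!/bin/sh
    # Poll the pool's "freeing" property (bytes still pending
    # asynchronous destroy) and only export once it has fallen to zero.
    POOL=backup
    while [ "$(zpool get -H -p -o value freeing ${POOL})" != "0" ]; do
        sleep 30
    done
    zpool export ${POOL}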

Trying to simulate the checksum errors in a few hours' time has thus
far failed.  But every time I swap the disks on a weekly basis I get a
handful of checksum errors on the scrub.  If I export and re-import the
backup mirror after that, the counters are zeroed -- the checksum error
count does *not* persist across an export/import cycle, although the
"scrub repaired" line remains.

For example after the scrub completed this morning I exported the pool
(the script expects the pool exported before it begins) and ran the
backup.  When it was complete:

root@NewFS:~/backup-zfs # zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
config:

        NAME                      STATE     READ WRITE CKSUM
        backup                    DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            gpt/backup61.eli      ONLINE       0     0     0
            gpt/backup62-1.eli    ONLINE       0     0     0
            13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli

errors: No known data errors

It knows it fixed the checksums, but the error count is zero -- and I
did NOT run "zpool clear".

This may have been present in 11.2; I didn't run that long enough in
this environment to know.  It definitely was *not* present in 11.1 and
before; the same data structure and script for backups has been in use
for a very long time without any changes, and this first appeared when
I upgraded from 11.1 to 12.0 on this specific machine, with the exact
same physical disks having been in use for over a year (they're
currently 6Tb units; the last change-out was ~1.5 years ago when I went
from 4Tb to 6Tb volumes.)  I have both HGST-NAS and He-Enterprise disks
in the rotation and both show identical behavior, so it doesn't appear
to be related to a firmware problem in one disk vs. the other (e.g.
firmware that fails to flush the on-drive cache before going to standby
even though it was told to.)

>
> mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem
> 0xfaf3c000-0xfaf3ffff,0xfaf40000-0xfaf7ffff irq 26 at device 0.0 on pci3
> mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd
> mps0: IOCCapabilities:
> 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
>
>     Regards
>     Steve
>
> On 20/04/2019 15:39, Karl Denninger wrote:
>> I can confirm that 20.00.07.00 does *not* stop this.
>> The previous write/scrub on this device was on 20.00.07.00.  It was
>> swapped back in from the vault yesterday, resilvered without incident,
>> but a scrub says....
>>
>> root@NewFS:/home/karl # zpool status backup
>>    pool: backup
>>   state: DEGRADED
>> status: One or more devices has experienced an unrecoverable error.  An
>>          attempt was made to correct the error.  Applications are
>> unaffected.
>> action: Determine if the device needs to be replaced, and clear the
>> errors
>>          using 'zpool clear' or replace the device with 'zpool replace'.
>>     see: http://illumos.org/msg/ZFS-8000-9P
>>    scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr
>> 20 08:45:09 2019
>> config:
>>
>>          NAME                      STATE     READ WRITE CKSUM
>>          backup                    DEGRADED     0     0     0
>>            mirror-0                DEGRADED     0     0     0
>>              gpt/backup61.eli      ONLINE       0     0     0
>>              gpt/backup62-1.eli    ONLINE       0     0    47
>>              13282812295755460479  OFFLINE      0     0     0  was
>> /dev/gpt/backup62-2.eli
>>
>> errors: No known data errors
>>
>> So this is firmware-invariant (at least between 19.00.00.00 and
>> 20.00.07.00); the issue persists.
>>
>> Again, in my instance these devices are never removed "unsolicited" so
>> there can't be (or at least shouldn't be able to) unflushed data in the
>> device or kernel cache.  The procedure is and remains:
>>
>> zpool offline .....
>> geli detach .....
>> camcontrol standby ...
>>
>> Wait a few seconds for the spindle to spin down.
>>
>> Remove disk.
>>
>> Then of course on the other side after insertion and the kernel has
>> reported "finding" the device:
>>
>> geli attach ...
>> zpool online ....
>>
>> Wait...
>>
>> If this is a boogered TXG that's held in the metadata for the
>> "offline"'d device (maybe "off by one"?) that's potentially bad in that
>> if there is an unknown failure in the other mirror component the
>> resilver will complete but data has been irrevocably destroyed.
>>
>> Granted, this is a very low probability scenario (the area where the bad
>> checksums are has to be where the corruption hits, and it has to happen
>> between the resilver and access to that data.)  Those are long odds but
>> nonetheless a window of "you're hosed" does appear to exist.
>>
>
-- 
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/