Concern: ZFS Mirror issues (12-STABLE and firmware 19 vs. 20)

Karl Denninger karl at denninger.net
Tue Apr 9 19:01:38 UTC 2019


I've run into something often enough -- and repeatably enough -- since
updating to 12-STABLE that I suspect there may be a code problem lurking
in the ZFS stack, or in the interaction between the driver and the
firmware on various HBAs based on the LSI/Avago parts.

The scenario is this -- I have pools that are RaidZ2 and form my
"normal" working set; one is made up of SSD volumes and one of
spinning rust.  These are all normal and scrubs never show
problems.  I've had physical failures with them over the years (although
none since moving to 12-STABLE as of yet) and have never had trouble
with resilvers or other misbehavior.

I also have a "backup" pool that is a 3-member mirror, to which the
volatile data (that is, the zfs filesystems not set read-only) is
replicated via zfs send.  Call the members backup-i, backup-e1 and backup-e2.
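
Roughly, the setup looks like this (pool, dataset and label names here
are illustrative, not my actual ones):

    # backup pool: a 3-way mirror of the geli-encrypted partitions
    zpool create backup mirror /dev/gpt/backup-i.eli \
        /dev/gpt/backup-e1.eli /dev/gpt/backup-e2.eli

    # periodic replication of the volatile filesystems
    zfs snapshot -r pool/volatile@backup-new
    zfs send -R -I @backup-old pool/volatile@backup-new | \
        zfs receive -F backup/volatile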

All disks in these pools are geli-encrypted, with the geli provider
created with -s 4096 (4k) "sectors" on top of a freebsd-zfs partition
in a GPT partition table.

Two of the backup mirror members are always in the machine; backup-i
(the base internal drive) is never removed.  The third is in a bank
vault.  Every week the vault drive is exchanged with the other, so that
the "first" member is never removed from the host, but the other two
(-e1 and -e2) alternate.  If the building burns I have a full copy of
all the volatile data in the vault.  (I also keep two mirrored copies
each of all the operationally read-only datasets in the vault; those get
updated quarterly if that portion of the data store changes.)  Since the
drive in the vault is swapped weekly, a problem should be detected almost
immediately, before it can bugger me.

Before removing the disk intended to go to the vault I "offline" it and
spin it down (camcontrol standby), which issues a STANDBY IMMEDIATE to
the drive, ensuring that its cache is flushed and the spindle spun down,
and then pull it.  I exchange the drives at the bank, insert the other
one, and "zpool online...." it, which automatically resilvers it.

The disk resilvers and all is well -- no errors.

Or is it all ok?

If I run a scrub on the pool as soon as the resilver completes, the disk
I just inserted will /invariably/ have a few checksum errors on it that
the scrub fixes.  It's not a large number, anywhere from a couple dozen
to a hundred or so, but it's not zero -- and it damn well should be,
because the resilver JUST COMPLETED with no errors, which means the
ENTIRE in-use area of the disk was examined and compared, and any blocks
missing from or stale on the "new" member were copied over.  The "-i"
disk (the one that is never pulled) is NEVER the one with the checksum
errors on it -- it's ALWAYS the one I just inserted and resilvered to.

If I zpool clear the errors and scrub again all is fine -- no errors. 
If I scrub again before pulling the disk the next time to do the swap
all is fine as well.  I swap the two, resilver, and I'll get a few more
errors on the next scrub, ALWAYS on the disk I just put in.
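
The sequence is, each and every time:

    zpool scrub backup       # first scrub right after the resilver finishes
    zpool status -v backup   # a few dozen CKSUM errors, always on the new disk
    zpool clear backup
    zpool scrub backup       # second scrub comes back completely clean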

Smartctl shows NO errors on the disk.  No ECC, no reallocated sectors,
no interface errors, no resets, nothing.  Smartd is running and never
posts any real-time complaints, other than the expected one a minute or
two after I yank the drive to take it to the bank.  There are no
CAM-related errors printed on the console either.  So ZFS says there is
a *silent* data error (a bad checksum; never a read or write error) in a
handful of blocks, but the disk says there have been no errors and the
driver does not report any either.  There have also been no power
failures -- the disk was in a bank vault -- so it COULDN'T have had a
write-back cache corruption event or similar occur.
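
The drive itself can be interrogated directly and it still claims all is
well; for example (da5 again purely illustrative):

    smartctl -a /dev/da5        # full SMART report: no reallocated or pending
                                # sectors, no CRC/interface errors
    smartctl -l error /dev/da5  # drive error log is empty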

I never had trouble with this under 11.1 or before and have been using
this paradigm for something on the order of five years running on this
specific machine without incident.  Now I'm seeing it repeatedly and
*reliably* under 12.0-STABLE.  I swapped the first disk that did it,
thinking it was physically defective -- the replacement did it on the
next swap.  In fact I've yet to record a swap-out on 12-STABLE that
*hasn't* done this and yet it NEVER happened under 11.1.  At the same
time I can run scrubs until the cows come home on the multiple Raidz2
packs on the same controller and never get any checksum errors on any of
them.

The firmware in the card was 19.00.00.00 -- again, this firmware *has
been stable for years.* 

I have just rolled the firmware on the card forward to 20.00.07.00,
which is the "latest" available.  I had previously not moved to 20.x
because earlier versions had known issues (some severe and potentially
fatal to data integrity) and 19 had been working without problem -- I
thus had no reason to move to 20.00.07.00.
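
For anyone else with one of these cards: the 19.x/20.x phase numbering
implies a SAS2-generation (SAS2008/2308) board, and the flash is done
with LSI's sas2flash utility, something along these lines (the image
names depend on the exact board; 2118it.bin is just an example for an
IT-mode SAS2008 card):

    sas2flash -listall                          # confirm controller and current firmware
    sas2flash -o -f 2118it.bin -b mptsas2.rom   # flash the 20.00.07.00 firmware and BIOS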

But there apparently are some fairly significant timing differences
between the driver code in 11.1 and 11.2/12.0, as I discovered when the
SAS expander I used to have in these boxes started returning timeout
errors that were false.  Again -- this same configuration was completely
stable under 11.1 and previous over a period of years.

With 20.00.07.00 I have yet to have this situation recur -- so far --
but I have limited time with 20.00.07.00 and as such my confidence that
the issue is in fact resolved by the card firmware change is only modest
at this point.  Over the next month or so, if it doesn't happen again,
my confidence will of course improve.

Checksum errors on ZFS volumes are extraordinarily uncool for the
obvious reason -- they imply the disk thinks the data is fine (since it
is not recording any errors on the interface or at the drive level) BUT
ZFS thinks the data off that particular record was corrupt as the
checksum fails.  Silent corruption is the worst sort in that it can hide
for months or even years before being discovered, long after your
redundant copies have been re-used or overwritten.

Assuming I do not see a recurrence with the 20.00.07.00 firmware, I
would suggest that an entry be added to UPDATING, the Release Notes, or
the Errata stating that, for 12.x forward, card firmware revisions prior
to 20.00.07.00 carry *strong* cautions, and that those with these HBAs
be strongly urged to flash the card forward to 20.00.07.00 before
upgrading or installing.  If you get a surprise of this sort and have no
second copy that is not impacted you could find yourself severely hosed.

-- 
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/