zpool question -- resilvering doesn't fully check on-disk data for corruption?

Wed Apr 15 06:03:15 UTC 2020

Hi,

I recently had a drive go bad on my home storage server.  It would
occasionally time out, and the CAM subsystem would eventually detach
it, like this:

(ada1:ahcich11:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada1:ahcich11:0:0:0): CAM status: Command timeout
(ada1:ahcich11:0:0:0): Retrying command, 0 more tries remain
ada1 at ahcich11 bus 0 scbus11 target 0 lun 0
ada1: <WDC WD40EFRX-68WT0N0 80.00A80> s/n WD-WMC4E0090978 detached
(ada1:ahcich11:0:0:0): Periph destroyed

When this happened, a full 'camcontrol reset all' followed by
'camcontrol rescan all' would bring the drive back, and ZFS would start
a resilver as expected.  After the resilver, zpool would report several
checksum errors (also expected).

As a precaution, I would usually start a zpool scrub afterwards to
check data integrity again.  To my surprise, the last few times the
drive timed out, the zpool scrub also found some checksum errors and
corrected them (the drive is in a RAID-Z pool).  A second run of
'zpool scrub' after that would no longer find any checksum errors.
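
To compare the counters between runs, the CKSUM column of 'zpool
status' can be pulled out with something like the rough sketch below.
The pool layout and the numbers in SAMPLE_STATUS are made up for
illustration (they are not my actual output), and the function name is
hypothetical.  It assumes the counters are printed as plain integers
and skips anything else.

```python
# Hypothetical 'zpool status' output, for illustration only.
SAMPLE_STATUS = """\
  pool: tank
 state: ONLINE
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    ada0    ONLINE       0     0     0
	    ada1    ONLINE       0     0     7

errors: No known data errors
"""


def parse_cksum_counts(status_text):
    """Return {device name: CKSUM counter} from the config table."""
    counts = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        # The table starts right after the column header line.
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
            continue
        if not in_table:
            continue
        if not fields:
            # A blank line ends the table.
            in_table = False
            continue
        if len(fields) < 5:
            continue
        name, cksum = fields[0], fields[4]
        # Assumption: plain integer counters; skip abbreviated values.
        if cksum.isdigit():
            counts[name] = int(cksum)
    return counts


if __name__ == "__main__":
    counts = parse_cksum_counts(SAMPLE_STATUS)
    for name, value in counts.items():
        if value > 0:
            print(name, value)
```

Saving the result after the resilver and again after each scrub would
show which devices picked up new checksum errors in between.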

I initially assumed this was because of bad blocks on the failing
drive, and didn't pay much attention since I had already ordered a
replacement.  When the new drive arrived, I initiated a 'zpool replace'
with both the bad and the new drive attached (which starts a resilver
too).  I had not run a zpool scrub after the most recent timeout,
because the scrub was very slow and I feared it might cause more damage
to the bad drive before the new drive arrived.  To my surprise,
however, the zpool scrub after the replacement resilver detected new
checksum errors on the newly attached drive.

Is this expected?  (My understanding is that both resilver and scrub
read all data from a RAID-Z pool and therefore verify the checksums of
all blocks, so after a replace, checksum errors shouldn't really show
up on the new drive, because the data written to it had already been
checksummed.  The system is equipped with ECC RAM, etc.; I know the
disk controller or the disk itself could still introduce bit flips if
I'm really unlucky, but if that were the case I think I would have seen
errors more often...)

Cheers,
