[Bug 262421] zfs checksum errors and panic with invalid abd_t

From: <bugzilla-noreply_at_freebsd.org>
Date: Tue, 08 Mar 2022 14:34:05 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262421

            Bug ID: 262421
           Summary: zfs checksum errors and panic with invalid abd_t
           Product: Base System
           Version: 13.0-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: jfc@mit.edu

During a scrub my zfs pool reported a few dozen checksum errors per
disk, about 1 per 200 GB scanned:

$ zpool status -v data
  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Mar  6 19:16:15 2022
        13.6T scanned at 942M/s, 11.7T issued at 202M/s, 18.2T total
        2.42M repaired, 64.64% done, 09:16:24 to go
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada0    ONLINE       0     0    18  (repairing)
            ada1    ONLINE       0     0    17  (repairing)
            ada2    ONLINE       0     0    12  (repairing)
            ada3    ONLINE       0     0    23  (repairing)
        cache
          ada4p5    ONLINE       0     0     0

errors: No known data errors
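
As a rough check of the "about 1 per 200 GB" figure, the arithmetic can be
redone from the status output above. This is only a sketch: it assumes the
rate is meant pool-wide (all four disks summed) and that zpool's "T" is a
binary unit (TiB); neither assumption is stated in the output itself.

```python
# Checksum error counts per disk, copied from the zpool status output above.
cksum_errors = {"ada0": 18, "ada1": 17, "ada2": 12, "ada3": 23}

# "13.6T scanned" from the scan line. Treating T as TiB and converting to GiB
# is an assumption about how to read zpool's units.
scanned_gib = 13.6 * 1024

total_errors = sum(cksum_errors.values())
gib_per_error = scanned_gib / total_errors

print(f"{total_errors} errors in total, one per {gib_per_error:.0f} GiB scanned")
```

Under those assumptions the result is close to one error per 200 GiB scanned,
which agrees with the estimate in the report.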

This affects all disks so it is not a single bad disk (unless the cache disk
is bad).  More likely it is data corruption in the controller, the data path
from controller to kernel ZFS code, or the ZFS data structures.

After several hours the system crashed with

VERIFY3(abd->abd_size <= SPA_MAXBLOCKSIZE) failed (930062841 <= 16777216)

This indicates a corrupt abd_t structure (see abd.c line 113).
savecore did not generate a stack trace.
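
To put the failed assertion in perspective, the two numbers from the panic
message can be compared directly. A minimal sketch: the only fact it relies on
beyond the message is that SPA_MAXBLOCKSIZE is 1 << 24 (16 MiB), which matches
the 16777216 printed by the kernel.

```python
# Upper bound checked by the VERIFY3; 1 << 24 equals the 16777216 in the panic.
SPA_MAXBLOCKSIZE = 1 << 24

# abd_size value reported by the failed assertion.
abd_size = 930062841

# Re-evaluate the condition the kernel verified.
condition_holds = abd_size <= SPA_MAXBLOCKSIZE
ratio = abd_size / SPA_MAXBLOCKSIZE

print(f"limit    = {SPA_MAXBLOCKSIZE} ({SPA_MAXBLOCKSIZE:#x})")
print(f"abd_size = {abd_size} ({abd_size:#x})")
print(f"condition holds: {condition_holds}, abd_size is {ratio:.1f}x the limit")
```

The size is dozens of times larger than any valid block, so it looks like an
overwritten field rather than a legitimately oversized buffer.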

After rebooting, the checksum error counters had reset to zero and the scrub
finished without error.  The cause was probably some transient, irreproducible
corruption of kernel state during that one boot.

My kernel was up to date on stable/13:

FreeBSD flaviventris 13.1-PRERELEASE FreeBSD 13.1-PRERELEASE #8
stable/13-n249920-d1f3afc4a47: Mon Mar  7 10:10:37 EST 2022    
root@flaviventris:/usr/obj/usr/src/amd64.amd64/sys/CALIGATA amd64

Worth noting:

1. I have dedup enabled.

2. I have encryption enabled.

3. Since the previous scrub I did a zfs send | zfs receive of close to
50% of the pool size to enable encryption.  The pool was very nearly full
while I had both an encrypted and an unencrypted copy around.  Now it is
half full.

4. In /etc/make.conf I set "CPUTYPE?=amdfam10", appropriate for the
HP MicroServer hardware.

ada0 to ada3 are identical spinning disks, ada4 (cache) is SSD.

ahci0: <Marvell 88SE9230 AHCI SATA controller> port 0xe050-0xe057,0xe040-0xe043,0xe030-0xe037,0xe020-0xe023,0xe000-0xe01f mem 0xfea40000-0xfea407ff at device 0.0 on pci1
ahci0: AHCI v1.20 with 8 6Gbps ports, Port Multiplier not supported
ahci0: quirks=0x1000900<NOBSYRES,ALTSIG,IOMMU_BUSWIDE>
ada3: <ST10000VN0008-2JJ101 SC60> ACS-4 ATA SATA 3.x device
ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 9537536MB (19532873728 512 byte sectors)
ada4: <Samsung SSD 860 EVO 1TB RVT03B6Q> ACS-4 ATA SATA 3.x device
ada4: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada4: Command Queueing enabled
ada4: 953869MB (1953525168 512 byte sectors)
