[Bug 262421] zfs checksum errors and panic with invalid abd_t
Date: Tue, 08 Mar 2022 14:34:05 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262421
Bug ID: 262421
Summary: zfs checksum errors and panic with invalid abd_t
Product: Base System
Version: 13.0-STABLE
Hardware: amd64
OS: Any
Status: New
Severity: Affects Only Me
Priority: ---
Component: kern
Assignee: bugs@FreeBSD.org
Reporter: jfc@mit.edu
During a scrub my zfs pool reported a few dozen checksum errors per
disk, about 1 per 200 GB scanned:
$ zpool status -v data
pool: data
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: scrub in progress since Sun Mar 6 19:16:15 2022
13.6T scanned at 942M/s, 11.7T issued at 202M/s, 18.2T total
2.42M repaired, 64.64% done, 09:16:24 to go
config:
NAME STATE READ WRITE CKSUM
data ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ada0 ONLINE 0 0 18 (repairing)
ada1 ONLINE 0 0 17 (repairing)
ada2 ONLINE 0 0 12 (repairing)
ada3 ONLINE 0 0 23 (repairing)
cache
ada4p5 ONLINE 0 0 0
errors: No known data errors
This affects all disks so it is not a single bad disk (unless the cache disk
is bad). More likely it is data corruption in the controller, the data path
from controller to kernel ZFS code, or the ZFS data structures.
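The quoted error rate can be checked against the status output. The following is a minimal Python sketch added for illustration, not part of the original report. It assumes the "T" printed by zpool means TiB (1024 GiB) and that the rate was taken against the 13.6T scanned figure rather than the 11.7T issued figure.

```python
# Per-disk CKSUM counters copied from the `zpool status -v data` output.
cksum_errors = {"ada0": 18, "ada1": 17, "ada2": 12, "ada3": 23}

# Assumption: zpool's "T" is TiB, so 13.6T scanned is 13.6 * 1024 GiB.
scanned_gib = 13.6 * 1024

total_errors = sum(cksum_errors.values())
gib_per_error = scanned_gib / total_errors

# Spread of the counters: similar counts on every disk argue against
# a single failing drive.
lowest = min(cksum_errors.values())
highest = max(cksum_errors.values())

print(f"total checksum errors: {total_errors}")
print(f"about one error per {gib_per_error:.0f} GiB scanned")
print(f"per-disk range: {lowest} to {highest}")
```

Under those assumptions the pool-wide figure comes out just under 200 GiB per error, in line with the estimate given with the status output.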
After several hours the system crashed with:
VERIFY3(abd->abd_size <= SPA_MAXBLOCKSIZE) failed (930062841 <= 16777216)
This indicates a corrupt abd_t structure (see abd.c line 113).
savecore did not generate a stack trace.
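To make the failed assertion concrete, the two numbers in the panic message can be compared directly. This is a small Python sketch added for illustration, not code from abd.c. The variable names are hypothetical, and the only inputs are the values printed in the message.

```python
# Values copied from the panic message.
abd_size = 930062841          # reported abd->abd_size
spa_maxblocksize = 16777216   # reported SPA_MAXBLOCKSIZE

# The limit printed in the message is exactly 2**24 bytes (16 MiB).
assert spa_maxblocksize == 1 << 24

# The comparison that failed in the kernel's VERIFY3 check.
size_is_valid = abd_size <= spa_maxblocksize

# Bits at position 24 and above; a valid size has at most bit 24 set here.
high_bits = abd_size >> 24
high_bits_set = bin(high_bits).count("1")

print(f"abd_size = {abd_size:#x}, valid = {size_is_valid}")
print(f"ratio to limit: {abd_size / spa_maxblocksize:.1f}x")
print(f"bits set at or above bit 24: {high_bits_set}")
```

Several bits are set above the limit, so a single flipped bit in an otherwise valid size would not produce this value. That is consistent with the field having been overwritten, though it does not identify what overwrote it.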
After rebooting, the checksum error counters had reset to zero and the scrub
finished without error. The cause was probably something mysterious and
irreproducible in the state of my kernel that one time.
My kernel was up to date on stable/13:
FreeBSD flaviventris 13.1-PRERELEASE FreeBSD 13.1-PRERELEASE #8
stable/13-n249920-d1f3afc4a47: Mon Mar 7 10:10:37 EST 2022
root@flaviventris:/usr/obj/usr/src/amd64.amd64/sys/CALIGATA amd64
Worth noting:
1. I have dedup enabled.
2. I have encryption enabled.
3. Since the previous scrub I did a zfs send | zfs receive of close to
50% of the pool size to enable encryption. The pool was very nearly full
while I had both an encrypted and an unencrypted copy around. Now it is
half full.
4. In /etc/make.conf I set "CPUTYPE?=amdfam10", appropriate for the
HP MicroServer hardware.
ada0 through ada3 are identical spinning disks; ada4 (cache) is an SSD.
ahci0: <Marvell 88SE9230 AHCI SATA controller> port 0xe050-0xe057,0xe040-0xe043,0xe030-0xe037,0xe020-0xe023,0xe000-0xe01f mem 0xfea40000-0xfea407ff at device 0.0 on pci1
ahci0: AHCI v1.20 with 8 6Gbps ports, Port Multiplier not supported
ahci0: quirks=0x1000900<NOBSYRES,ALTSIG,IOMMU_BUSWIDE>
ada3: <ST10000VN0008-2JJ101 SC60> ACS-4 ATA SATA 3.x device
ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 9537536MB (19532873728 512 byte sectors)
ada4: <Samsung SSD 860 EVO 1TB RVT03B6Q> ACS-4 ATA SATA 3.x device
ada4: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada4: Command Queueing enabled
ada4: 953869MB (1953525168 512 byte sectors)