Date: Tue, 08 Mar 2022 14:34:05 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262421

            Bug ID: 262421
           Summary: zfs checksum errors and panic with invalid abd_t
           Product: Base System
           Version: 13.0-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: firstname.lastname@example.org

During a scrub my zfs pool reported a few dozen checksum errors per disk,
about 1 per 200 GB scanned:

$ zpool status -v data
  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Mar  6 19:16:15 2022
        13.6T scanned at 942M/s, 11.7T issued at 202M/s, 18.2T total
        2.42M repaired, 64.64% done, 09:16:24 to go
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada0    ONLINE       0     0    18  (repairing)
            ada1    ONLINE       0     0    17  (repairing)
            ada2    ONLINE       0     0    12  (repairing)
            ada3    ONLINE       0     0    23  (repairing)
        cache
          ada4p5    ONLINE       0     0     0

errors: No known data errors

This affects all disks so it is not a single bad disk (unless the cache
disk is bad).  More likely it is data corruption in the controller, the
data path from controller to kernel ZFS code, or the ZFS data structures.

After several hours the system crashed with

VERIFY3(abd->abd_size <= SPA_MAXBLOCKSIZE) failed (930062841 <= 16777216)

This indicates a corrupt abd_t structure (see abd.c line 113).  savecore
did not generate a stack trace.

After rebooting the checksum error counters had reset to zero and the
scrub finished without error.  Probably something mysterious and
irreproducible in the state of my kernel that one time.
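
For reference, the comparison in the failed assertion can be restated outside
the kernel. The sketch below is not the ZFS code; it only repeats the bound
check from the panic message, assuming SPA_MAXBLOCKSIZE is 1 << 24 (which
equals the 16777216 printed in the panic). The helper name
abd_size_in_bounds is made up for illustration.

```python
# Assumption: SPA_MAXBLOCKSIZE is 1 << 24, i.e. 16 MiB. This equals the
# right-hand value (16777216) printed in the panic message.
SPA_MAXBLOCKSHIFT = 24
SPA_MAXBLOCKSIZE = 1 << SPA_MAXBLOCKSHIFT


def abd_size_in_bounds(abd_size: int) -> bool:
    """Hypothetical stand-in for the upper-bound check that failed."""
    return abd_size <= SPA_MAXBLOCKSIZE


reported = 930062841  # abd_size value taken from the panic message

print(abd_size_in_bounds(reported))              # False
print(f"{reported / SPA_MAXBLOCKSIZE:.1f}x the limit")
print(hex(reported))                             # raw bit pattern of the bad size
```

The reported size is far above the largest block ZFS allows, which is
consistent with the abd_t having been overwritten rather than with a
legitimately large buffer.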

My kernel was up to date on stable/13:

FreeBSD flaviventris 13.1-PRERELEASE FreeBSD 13.1-PRERELEASE #8
stable/13-n249920-d1f3afc4a47: Mon Mar  7 10:10:37 EST 2022
root@flaviventris:/usr/obj/usr/src/amd64.amd64/sys/CALIGATA amd64

Worth noting:

1. I have dedup enabled.

2. I have encryption enabled.

3. Since the previous scrub I did a zfs dump | zfs restore of close to 50%
of the pool size to enable encryption.  The pool was very nearly full when
I had both an encrypted and an unencrypted copy around.  Now it is half
full.

4. In /etc/make.conf I set "CPUTYPE?=amdfam10", appropriate for the HP
MicroServer hardware.

ada0 to ada3 are identical spinning disks, ada4 (cache) is SSD.

ahci0: <Marvell 88SE9230 AHCI SATA controller> port 0xe050-0xe057,0xe040-0xe043,0xe030-0xe037,0xe020-0xe023,0xe000-0xe01f mem 0xfea40000-0xfea407ff at device 0.0 on pci1
ahci0: AHCI v1.20 with 8 6Gbps ports, Port Multiplier not supported
ahci0: quirks=0x1000900<NOBSYRES,ALTSIG,IOMMU_BUSWIDE>
ada3: <ST10000VN0008-2JJ101 SC60> ACS-4 ATA SATA 3.x device
ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 9537536MB (19532873728 512 byte sectors)
ada4: <Samsung SSD 860 EVO 1TB RVT03B6Q> ACS-4 ATA SATA 3.x device
ada4: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada4: Command Queueing enabled
ada4: 953869MB (1953525168 512 byte sectors)

-- 
You are receiving this mail because:
You are the assignee for the bug.