Re: unusual ZFS issue
- Reply: Rich : "Re: unusual ZFS issue"
- In reply to: Lexi Winter : "unusual ZFS issue"
Date: Fri, 15 Dec 2023 00:05:05 UTC
On 14/12/2023 22:17, Lexi Winter wrote:
> hi list,
>
> i’ve just hit this ZFS error:
>
> # zfs list -rt snapshot data/vm/media/disk1
> cannot iterate filesystems: I/O error
> NAME                                                           USED  AVAIL  REFER  MOUNTPOINT
> data/vm/media/disk1@autosnap_2023-12-13_12:00:00_hourly         0B      -  6.42G  -
> data/vm/media/disk1@autosnap_2023-12-14_10:16:00_hourly         0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_11:17:00_hourly         0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_12:04:00_monthly        0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_12:15:00_hourly         0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_13:14:00_hourly         0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_14:38:00_hourly         0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_15:11:00_hourly         0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_17:12:00_hourly       316K      -  6.47G  -
> data/vm/media/disk1@autosnap_2023-12-14_17:29:00_daily       2.70M      -  6.47G  -
>
> the pool itself also reports an error:
>
> # zpool status -v
> pool: data
> state: ONLINE
> status: One or more devices has experienced an error resulting in data
> corruption. Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
> entire pool from backup.
> see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
> scan: scrub in progress since Thu Dec 14 18:58:21 2023
> 11.5T / 18.8T scanned at 1.46G/s, 6.25T / 18.8T issued at 809M/s
> 0B repaired, 33.29% done, 04:30:20 to go
> config:
>
> NAME          STATE     READ WRITE CKSUM
> data          ONLINE       0     0     0
>   raidz2-0    ONLINE       0     0     0
>     da4p1     ONLINE       0     0     0
>     da6p1     ONLINE       0     0     0
>     da5p1     ONLINE       0     0     0
>     da7p1     ONLINE       0     0     0
>     da1p1     ONLINE       0     0     0
>     da0p1     ONLINE       0     0     0
>     da3p1     ONLINE       0     0     0
>     da2p1     ONLINE       0     0     0
> logs
>   mirror-2    ONLINE       0     0     0
>     ada0p4    ONLINE       0     0     0
>     ada1p4    ONLINE       0     0     0
> cache
>   ada1p5      ONLINE       0     0     0
>   ada0p5      ONLINE       0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
> (it doesn’t list any files, the output ends there.)
>
> my assumption is that this indicates some sort of metadata corruption issue, but i can’t find anything that might have caused it. none of the disks report any errors, and while all the disks are on the same SAS controller, i would have expected controller errors to be flagged as CKSUM errors.
>
> my best guess is that this might be caused by a CPU or memory issue, but the system has ECC memory and hasn’t reported any issues.
>
> - has anyone else encountered anything like this?
I've never seen "cannot iterate filesystems: I/O error". Could it be
that the system has too many snapshots / not enough memory to list them?
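If it is a resource problem, a quick check is to count the snapshots;
something like this should do it (the dataset names are taken from your
output, and what counts as "too many" is of course just a guess on my
part):

# zfs list -H -r -t snapshot -o name data | wc -l
# zfs list -H -r -t snapshot -o name data/vm/media/disk1 | wc -l

-H drops the header so wc only counts snapshot names; a very large
number here could explain slow or failing listings.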
But I have seen a pool report an error in an unknown file without
showing any READ / WRITE / CKSUM errors. This is from my notes, taken
10 years ago:
=============================
# zpool status -v
pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:
NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
  raidz1    ONLINE       0     0     0
    ad0     ONLINE       0     0     0
    ad1     ONLINE       0     0     0
    ad2     ONLINE       0     0     0
    ad3     ONLINE       0     0     0
errors: Permanent errors have been detected in the following files:
        <0x2da>:<0x258ab13>
=============================
As you can see, there are no CKSUM errors, and where there should be
a path to a filename there is only <0x2da>:<0x258ab13>.
Maybe it was an error in a snapshot which had already been deleted?
Just my guess.
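If it shows up again, my understanding is that the first number is the
dataset ID and the second is the object number inside that dataset, and
zdb can usually map them back to something readable. For my old example
it would be roughly this (0x2da is 730 and 0x258ab13 is 39365395 in
decimal; tank/some/fs is only a placeholder, and I have not re-checked
the exact zdb output format, so take it as a sketch):

# zdb -d tank | grep 'ID 730'
# zdb -ddddd tank/some/fs 39365395

The first command lists the datasets in the pool with their IDs; the
second dumps the object, and for a regular file the dump should include
its path.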
I ran a scrub on that pool; it finished without any errors, and the
pool status was OK afterwards.
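One scrub was enough for me, but if I remember correctly ZFS keeps the
error list from the last completed scrub as well as the current one, so
a stale entry can sometimes survive the first scrub and only disappear
after the second:

# zpool scrub tank
# zpool status -v tank     (the old entry may still be listed here)
# zpool scrub tank
# zpool status -v tank     (after the second clean scrub it should be gone)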
A similar error reappeared after a month, and then again after about
6 months. The machine had ECC RAM. After these 3 incidents, I never
saw it again. I still have this machine in working condition; only the
disk drives were replaced, from 4x 1TB to 4x 4TB and then to 4x 8TB :)
Kind regards
Miroslav Lachman