Re: unusual ZFS issue

From: Miroslav Lachman <000.fbsd_at_quip.cz>
Date: Fri, 15 Dec 2023 00:05:05 UTC
On 14/12/2023 22:17, Lexi Winter wrote:
> hi list,
> 
> i’ve just hit this ZFS error:
> 
> # zfs list -rt snapshot data/vm/media/disk1
> cannot iterate filesystems: I/O error
> NAME                                                       USED  AVAIL  REFER  MOUNTPOINT
> data/vm/media/disk1@autosnap_2023-12-13_12:00:00_hourly      0B      -  6.42G  -
> data/vm/media/disk1@autosnap_2023-12-14_10:16:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_11:17:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_12:04:00_monthly     0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_12:15:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_13:14:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_14:38:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_15:11:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_17:12:00_hourly    316K      -  6.47G  -
> data/vm/media/disk1@autosnap_2023-12-14_17:29:00_daily    2.70M      -  6.47G  -
> 
> the pool itself also reports an error:
> 
> # zpool status -v
>    pool: data
>   state: ONLINE
> status: One or more devices has experienced an error resulting in data
> 	corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
> 	entire pool from backup.
>     see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
>    scan: scrub in progress since Thu Dec 14 18:58:21 2023
> 	11.5T / 18.8T scanned at 1.46G/s, 6.25T / 18.8T issued at 809M/s
> 	0B repaired, 33.29% done, 04:30:20 to go
> config:
> 
> 	NAME        STATE     READ WRITE CKSUM
> 	data        ONLINE       0     0     0
> 	  raidz2-0  ONLINE       0     0     0
> 	    da4p1   ONLINE       0     0     0
> 	    da6p1   ONLINE       0     0     0
> 	    da5p1   ONLINE       0     0     0
> 	    da7p1   ONLINE       0     0     0
> 	    da1p1   ONLINE       0     0     0
> 	    da0p1   ONLINE       0     0     0
> 	    da3p1   ONLINE       0     0     0
> 	    da2p1   ONLINE       0     0     0
> 	logs
> 	  mirror-2  ONLINE       0     0     0
> 	    ada0p4  ONLINE       0     0     0
> 	    ada1p4  ONLINE       0     0     0
> 	cache
> 	  ada1p5    ONLINE       0     0     0
> 	  ada0p5    ONLINE       0     0     0
> 
> errors: Permanent errors have been detected in the following files:
> 
> (it doesn’t list any files, the output ends there.)
> 
> my assumption is that this indicates some sort of metadata corruption issue, but i can’t find anything that might have caused it.  none of the disks report any errors, and while all the disks are on the same SAS controller, i would have expected controller errors to be flagged as CKSUM errors.
> 
> my best guess is that this might be caused by a CPU or memory issue, but the system has ECC memory and hasn’t reported any issues.
> 
> - has anyone else encountered anything like this?

I've never seen "cannot iterate filesystems: I/O error". Could it be 
that the system has too many snapshots / not enough memory to list them?
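If you want to rule that out, counting the snapshots is cheap. Something 
like this (dataset name taken from your listing) should tell you how 
many there are:

# zfs list -H -rt snapshot -o name data/vm/media/disk1 | wc -l
# zfs list -H -t snapshot -o name | wc -l

The first counts the snapshots under the affected dataset, the second 
all snapshots on the system.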

But I have seen a pool report an error in an unknown file without 
showing any READ / WRITE / CKSUM errors. This is from my notes taken 10 
years ago:

=============================
# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad2     ONLINE       0     0     0
            ad3     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x2da>:<0x258ab13>
=============================

As you can see, there are no CKSUM errors, and where a path to a 
filename should appear there is only <0x2da>:<0x258ab13>.
Maybe it was an error in a snapshot which had already been deleted? Just 
my guess.
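If I remember correctly, the two hex numbers are the dataset id and the 
object id, so as long as the dataset still exists, zdb can sometimes map 
them back to a name. Roughly like this (ids converted to decimal; 
SOMEDATASET is just a placeholder for whatever the first command finds):

# zdb -d tank | grep "ID 730"
(0x2da = 730; finds which dataset the error belongs to)
# zdb -dddd tank/SOMEDATASET 39365395
(0x258ab13 = 39365395; dumps the object, including its path if it still 
has one)

If the dataset is already gone, for example a deleted snapshot, there is 
nothing left to resolve, which would explain the raw numbers in the 
status output.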
I ran a scrub on that pool; it finished without any errors and then the 
status of the pool was OK.
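In my case a single scrub was enough, but if I understand the persistent 
error log correctly it also keeps the results of the previous scrub, so 
sometimes the entry only disappears after a second one. The whole 
sequence was nothing more than:

# zpool scrub tank
# zpool status -v tank
(the second command after the scrub had finished)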
A similar error reappeared after a month and then again after about 6 
months. The machine had ECC RAM. After these 3 incidents I never saw it 
again. I still have this machine in working condition; only the disk 
drives were replaced, from 4x 1TB to 4x 4TB and then to 4x 8TB :)

Kind regards
Miroslav Lachman