Re: unusual ZFS issue

From: Rich <rincebrain_at_gmail.com>
Date: Fri, 15 Dec 2023 06:41:22 UTC
Native encryption decryption errors won't show up as r/w/c errors, but will
show up as "things with errors" in the status output.

That wouldn't be triggered by scrub noticing them, though, since scrub
doesn't decrypt things.

It's just the only thing I know of offhand where it'll decide there are
errors but the counters will stay at zero...
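
A quick way to rule that in or out (using the dataset name from your
output; nothing here is destructive) is to check whether the affected
dataset is actually encrypted and whether its keys are loaded:

# zfs get -r encryption,keystatus data/vm/media/disk1

If that reports encryption=off everywhere, then this guess doesn't
apply to your case.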

- Rich

On Thu, Dec 14, 2023 at 7:05 PM Miroslav Lachman <000.fbsd@quip.cz> wrote:

> On 14/12/2023 22:17, Lexi Winter wrote:
> > hi list,
> >
> > i’ve just hit this ZFS error:
> >
> > # zfs list -rt snapshot data/vm/media/disk1
> > cannot iterate filesystems: I/O error
> > NAME                                                       USED  AVAIL  REFER  MOUNTPOINT
> > data/vm/media/disk1@autosnap_2023-12-13_12:00:00_hourly      0B      -  6.42G  -
> > data/vm/media/disk1@autosnap_2023-12-14_10:16:00_hourly      0B      -  6.46G  -
> > data/vm/media/disk1@autosnap_2023-12-14_11:17:00_hourly      0B      -  6.46G  -
> > data/vm/media/disk1@autosnap_2023-12-14_12:04:00_monthly     0B      -  6.46G  -
> > data/vm/media/disk1@autosnap_2023-12-14_12:15:00_hourly      0B      -  6.46G  -
> > data/vm/media/disk1@autosnap_2023-12-14_13:14:00_hourly      0B      -  6.46G  -
> > data/vm/media/disk1@autosnap_2023-12-14_14:38:00_hourly      0B      -  6.46G  -
> > data/vm/media/disk1@autosnap_2023-12-14_15:11:00_hourly      0B      -  6.46G  -
> > data/vm/media/disk1@autosnap_2023-12-14_17:12:00_hourly    316K      -  6.47G  -
> > data/vm/media/disk1@autosnap_2023-12-14_17:29:00_daily    2.70M      -  6.47G  -
> >
> > the pool itself also reports an error:
> >
> > # zpool status -v
> >    pool: data
> >   state: ONLINE
> > status: One or more devices has experienced an error resulting in data
> >       corruption.  Applications may be affected.
> > action: Restore the file in question if possible.  Otherwise restore the
> >       entire pool from backup.
> >     see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
> >    scan: scrub in progress since Thu Dec 14 18:58:21 2023
> >       11.5T / 18.8T scanned at 1.46G/s, 6.25T / 18.8T issued at 809M/s
> >       0B repaired, 33.29% done, 04:30:20 to go
> > config:
> >
> >       NAME        STATE     READ WRITE CKSUM
> >       data        ONLINE       0     0     0
> >         raidz2-0  ONLINE       0     0     0
> >           da4p1   ONLINE       0     0     0
> >           da6p1   ONLINE       0     0     0
> >           da5p1   ONLINE       0     0     0
> >           da7p1   ONLINE       0     0     0
> >           da1p1   ONLINE       0     0     0
> >           da0p1   ONLINE       0     0     0
> >           da3p1   ONLINE       0     0     0
> >           da2p1   ONLINE       0     0     0
> >       logs
> >         mirror-2  ONLINE       0     0     0
> >           ada0p4  ONLINE       0     0     0
> >           ada1p4  ONLINE       0     0     0
> >       cache
> >         ada1p5    ONLINE       0     0     0
> >         ada0p5    ONLINE       0     0     0
> >
> > errors: Permanent errors have been detected in the following files:
> >
> > (it doesn’t list any files, the output ends there.)
> >
> > my assumption is that this indicates some sort of metadata corruption
> > issue, but i can’t find anything that might have caused it.  none of
> > the disks report any errors, and while all the disks are on the same
> > SAS controller, i would have expected controller errors to be flagged
> > as CKSUM errors.
> >
> > my best guess is that this might be caused by a CPU or memory issue,
> > but the system has ECC memory and hasn’t reported any issues.
> >
> > - has anyone else encountered anything like this?
>
> I've never seen "cannot iterate filesystems: I/O error". Could it be
> that the system has too many snapshots / not enough memory to list them?
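> If it is just the sheer number of snapshots, a quick (and admittedly
> crude) check is something like:
>
> # zfs list -H -t snapshot | wc -l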
>
> But I have seen a pool report an error in an unknown file without
> showing any READ / WRITE / CKSUM errors. This is from my notes taken 10
> years ago:
>
> =============================
> # zpool status -v
>    pool: tank
>   state: ONLINE
> status: One or more devices has experienced an error resulting in data
>          corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>          entire pool from backup.
>     see: http://www.sun.com/msg/ZFS-8000-8A
>   scrub: none requested
> config:
>
>          NAME        STATE     READ WRITE CKSUM
>          tank        ONLINE       0     0     0
>            raidz1    ONLINE       0     0     0
>              ad0     ONLINE       0     0     0
>              ad1     ONLINE       0     0     0
>              ad2     ONLINE       0     0     0
>              ad3     ONLINE       0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
>          <0x2da>:<0x258ab13>
> =============================
>
> As you can see, there are no CKSUM errors. Where there should be a path
> to a filename, there is only <0x2da>:<0x258ab13>.
> Maybe it was an error in a snapshot that had already been deleted? Just
> my guess.
> I ran a scrub on that pool; it finished without any errors, and then
> the pool status was OK again.
> A similar error reappeared after a month, and then again after about 6
> months. The machine had ECC RAM. After those 3 incidents I never saw it
> again. I still have this machine in working condition, only the disk
> drives were replaced, from 4x 1TB to 4x 4TB and then 4x 8TB :)
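>
> If it really was a stale entry for something already deleted, my
> understanding (not verified against current OpenZFS) is that the
> persistent error log is only rewritten when a scrub completes, so it
> can take one or even two full scrubs before such an entry disappears:
>
> # zpool scrub tank
> # zpool status -v tank      (run again after the scrub finishes)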
>
> Kind regards
> Miroslav Lachman
>