Re: FreeBSD 13.2-STABLE can not boot from damaged mirror AND pool stuck in "resilver" state even without new devices.

From: Warner Losh <imp_at_bsdimp.com>
Date: Sun, 07 Jan 2024 21:15:14 UTC
On Sun, Jan 7, 2024 at 1:57 PM Lev Serebryakov <lev@freebsd.org> wrote:

> On 07.01.2024 21:49, Lev Serebryakov wrote:
>
> > On 07.01.2024 19:34, Warner Losh wrote:
> >
> >> I must have missed it. What were the diagnostics?
>
>   Oh, and two "nvlist inconsistency" before that vvvv
>
> > zio_read error: 5
> > zio_read error: 5
> > zio_read error: 5
>

5 is EIO, which the loader uses internally for any error that the disk
reports. I've not read through all the code involved here, but I think that
means there might be real read errors.
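
For reference, the number can be decoded from the standard errno tables. A
minimal Python sketch (this is not the loader's code; `describe_errno` is a
name made up for illustration, and it assumes the host running it uses the
usual BSD/Linux errno numbering):

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value such as the 5 printed by zio_read."""
    # errno.errorcode maps numeric values to symbolic names on this host.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


print(describe_errno(5))
```

Note that this only names the code. Since the loader folds every disk error
into EIO, it says nothing about which underlying failure occurred.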

Though the nvlist inconsistency might be an issue.

So, if this is a mirror, with ada0 blank and ada1 holding good data, in
theory you should be fine. However, perhaps ZFS really is getting an error
from ada1. Does all of ada1 read with a simple dd?
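
If dd just stops at the first error, something like the following rough
Python sketch does the same front-to-back read but keeps going and records
the offsets that fail. The function name `scan_readable` and the 1 MiB chunk
size are made up for illustration, and it assumes that seeking to the end of
the device node reports the media size, which I have not verified for every
platform:

```python
import os


def scan_readable(path: str, chunk_size: int = 1 << 20):
    """Read `path` front to back; return (bytes_read, failed_offsets)."""
    failed = []
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Assumption: SEEK_END yields the size of the file or disk device.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            try:
                data = os.pread(fd, chunk_size, offset)
            except OSError:
                # Record the failing offset and skip past this chunk.
                failed.append(offset)
                offset += chunk_size
                continue
            if not data:
                break  # unexpected early end of data
            total += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return total, failed

# Hypothetical usage, run as root against the surviving disk:
#   total, failed = scan_readable("/dev/ada1")
```

An empty list of failed offsets would suggest the disk itself reads cleanly
and the problem is more likely in how the loader interprets the pool.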

Not sure about the losing devices you described later on.

> > ZFS: i/o error - all block copies unavailable
> > ZFS: can't read MOS of pool zroot
> >
> >
> >   To be honest, I think there is something else, because the sequence of
> > events was (sorry, too long, but I think that every detail matters here):
>

Yeah. Something is failing, and zio_read woefully under-reports it for our
diagnostic purposes. And/or something is getting confused by the blank disk
and/or the partially resilvered disk.


> > (1) Update to 13.2 from 12.4. With installation of new gptzfsboot with
> > gpart on both disks. It could place new /boot far away, but see (2).
> > (2) Reboot, which completed, but showed that ada0 has problems.
> > (3) Replacement of ada0 by DC technicians, new disk is 512/4096, old
> > disk is 512/512, pool has ashift=9.
> > (4) Server refuses to boot from ada1 (ada0 is empty) with diagnostics
> > (see above).
> > (5) Linux rescue system, passing 2 devices to qemu with FreeBSD (because
> > Linux shows that ZFS is on whole disk, not on partition!).
> > (6) Re-creation of GPT on ada0, start of resilver (with sub-optimal
> > ashift!).
> > (7) Interruption of resilver with reboot, because it is painfully slow
> > under qemu.
> > (8) Wipe of ada0 (at this point resilver status of pool becomes crazy)
> > to put live FreeBSD image to boot somehow.
> > (9) Many tries to cancel resilver and boot from single-disk "historical"
> > pool on ada1, no success. I've attributed it to the strange state of pool:
> > one component, no mirror, but "resilvering".
> > (10) Boot from small UFS partition (which replaces swap partition).
> > (11) Pool on ada1 (old, live, 512/512 disk) is still "Resilvering"
> > without any additional components (with zero speed, of course).
> > (12) Prepare partitions on ada0 again, creating new pool with ashift=12,
> > send|receive.
> > (13) Removing partition table on ada1 (with old pool, ashift=9, still
> > resilvering after many-many reboots with only one device in it).
>
>   And please note: this pool on ada1 (old, live disk) was NOT upgraded
> after 12-STABLE. It was old, 12-STABLE "level" pool with all new features
> disabled.
>

Yeah, this isn't *THAT* *OTHER* problem :).

Warner


> --
> // Lev Serebryakov
>
>