Repairing a bad ZFS free list

From: John F Carr <jfc_at_mit.edu>
Date: Sun, 06 Feb 2022 18:09:24 UTC

I have a corrupt root ZFS pool on my ARM server (Ampere eMAG) running
a recent version of stable/13.  Is there any way to repair my system
short of wiping the disk and reinstalling?

All filesystems mount and zpool reports no errors, but there is bad
metadata, apparently a block that has been allocated twice.  Running
"zfs destroy" tends to cause crashes like

panic: VERIFY3(l->blk_birth == r->blk_birth) failed (9269896 == 9269889)

The assertion is in dsl_deadlist.c:livelist_compare().  There are two
livelist_entry_t objects containing blkptr_t objects with the same
DVA_GET_VDEV and DVA_GET_OFFSET but distinct blk_birth.  Apparently
this is a bad thing.
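
To make the failing check concrete, here is a minimal sketch of the kind
of comparison described above.  It is not the code from dsl_deadlist.c:
struct sketch_entry, sketch_cmp_u64() and sketch_livelist_compare() are
hypothetical names, the three fields merely stand in for DVA_GET_VDEV,
DVA_GET_OFFSET and blk_birth, and the sketch reports the conflict through
a flag where the kernel would panic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a livelist entry. */
struct sketch_entry {
	uint64_t vdev;		/* stands in for DVA_GET_VDEV */
	uint64_t offset;	/* stands in for DVA_GET_OFFSET */
	uint64_t birth;		/* stands in for blk_birth */
};

/* Three-way comparison returning -1, 0 or 1. */
static int
sketch_cmp_u64(uint64_t a, uint64_t b)
{
	return ((a > b) - (a < b));
}

/*
 * Order entries by vdev, then by offset.  If both match, the two
 * entries name the same on-disk location, so their birth values are
 * expected to agree; *conflict is set when they do not.
 */
static int
sketch_livelist_compare(const struct sketch_entry *l,
    const struct sketch_entry *r, bool *conflict)
{
	*conflict = false;
	if (l->vdev != r->vdev)
		return (sketch_cmp_u64(l->vdev, r->vdev));
	if (l->offset == r->offset && l->birth != r->birth)
		*conflict = true;
	return (sketch_cmp_u64(l->offset, r->offset));
}
```

With the two birth values from the panic message and an identical
(made-up) vdev and offset, the sketch returns 0 and sets the conflict
flag, which corresponds to the condition the assertion rejects.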

spa_livelist_delete_cb appears in the stack trace.  I think the kernel
is telling me the same block has been allocated twice and it doesn't
want to free it twice.

This problem persists across reboots.  Since I want to use poudriere,
"stop running zfs destroy" is not a good workaround.

Is it safe to disable the assertion, or will that spread the
corruption even further?

In the old days I would use clri or fsdb to make the problematic part
of a UFS filesystem go away.  How do I repair ZFS?

This crash has been reported as bug
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261538