kernel panic loop after zpool remove

Thu Jun 18 17:55:34 UTC 2020

I managed to recover the data from this zpool.  You can read about it on
the forum if you are curious:
https://forums.FreeBSD.org/threads/kernel-panic-loop-after-zpool-remove.75632/post-466625

Andrew

On Thu, Jun 4, 2020 at 10:10 AM Andrew Chanler <achanler at gmail.com> wrote:

> Hi,
>
> Is there any way to get zfs and the zpools into a recovery 'safe-mode'
> where it stops trying to do any manipulations of the zpools?  I'm stuck in
> a kernel panic loop after 'zpool remove'.  With the recent discussions
> about openzfs I was also considering trying to switch to openzfs but don't
> want to get myself into more trouble!  I just need to stop the remove vdev
> operation and this 'metaslab free' code flow from running on the new vdev
> so I can keep the system up long enough to read the data out.
>
> Thank you,
> Andrew
>
>
> The system is running 'FreeBSD 12.1-RELEASE-p5 GENERIC amd64'.
> -------
> panic: solaris assert: ((offset) & ((1ULL << vd->vdev_ashift) - 1)) == 0
> (0x400 == 0x0), file:
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c, line:
> 3593
> cpuid = 1
> time = 1590769420
> KDB: stack backtrace:
> #0 0xffffffff80c1d307 at kdb_backtrace+0x67
> #1 0xffffffff80bd063d at vpanic+0x19d
> #2 0xffffffff80bd0493 at panic+0x43
> #3 0xffffffff82a6922c at assfail3+0x2c
> #4 0xffffffff828a3b83 at metaslab_free_concrete+0x103
> #5 0xffffffff828a4dd8 at metaslab_free+0x128
> #6 0xffffffff8290217c at zio_dva_free+0x1c
> #7 0xffffffff828feb7c at zio_execute+0xac
> #8 0xffffffff80c2fae4 at taskqueue_run_locked+0x154
> #9 0xffffffff80c30e18 at taskqueue_thread_loop+0x98
> #10 0xffffffff80b90c53 at fork_exit+0x83
> #11 0xffffffff81082c2e at fork_trampoline+0xe
> Uptime: 7s
> -------
>
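
For readers following the assertion, here is a minimal sketch (in Python,
not the kernel's C) of the alignment test behind the panic.  It assumes
P2PHASE(x, align) reduces to x & (align - 1), which matches the expression
printed in the panic message.  The offset and the two ashift values are
the ones kgdb prints further down in this message; the name p2phase is
only illustrative.

```python
def p2phase(x: int, align: int) -> int:
    """Remainder of x modulo a power-of-two alignment (assumed P2PHASE behaviour)."""
    return x & (align - 1)


OFFSET = 0x2000000400  # offset reported by kgdb in metaslab_free_concrete

# ashift 9 is the old mirror-0, ashift 12 is the new mirror-1.
for ashift in (9, 12):
    phase = p2phase(OFFSET, 1 << ashift)
    print(f"ashift={ashift}: phase={phase:#x}")
```

With these inputs the offset is aligned for ashift 9 (phase 0x0) but not
for ashift 12 (phase 0x400), which is the "0x400 == 0x0" in the panic.
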
> I have a core dump and can provide more details to help debug the issue
> and open a bug tracker too:
> -------
> (kgdb) bt
> #0  __curthread () at /usr/src/sys/amd64/include/pcpu.h:234
> #1  doadump (textdump=<optimized out>) at
> /usr/src/sys/kern/kern_shutdown.c:371
> #2  0xffffffff80bd0238 in kern_reboot (howto=260) at
> /usr/src/sys/kern/kern_shutdown.c:451
> #3  0xffffffff80bd0699 in vpanic (fmt=<optimized out>, ap=<optimized out>)
> at /usr/src/sys/kern/kern_shutdown.c:877
> #4  0xffffffff80bd0493 in panic (fmt=<unavailable>) at
> /usr/src/sys/kern/kern_shutdown.c:804
> #5  0xffffffff82a6922c in assfail3 (a=<unavailable>, lv=<unavailable>,
> op=<unavailable>, rv=<unavailable>, f=<unavailable>, l=<optimized out>)
>     at /usr/src/sys/cddl/compat/opensolaris/kern/opensolaris_cmn_err.c:91
> #6  0xffffffff828a3b83 in metaslab_free_concrete (vd=0xfffff80004623000,
> offset=137438954496, asize=<optimized out>, checkpoint=0)
>     at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:3593
> #7  0xffffffff828a4dd8 in metaslab_free_dva (spa=<optimized out>,
> checkpoint=0, dva=<optimized out>)
>     at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:3863
> #8  metaslab_free (spa=<optimized out>, bp=0xfffff800043788a0,
> txg=41924766, now=<optimized out>)
>     at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:4145
> #9  0xffffffff8290217c in zio_dva_free (zio=0xfffff80004378830) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:3070
> #10 0xffffffff828feb7c in zio_execute (zio=0xfffff80004378830) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1786
> #11 0xffffffff80c2fae4 in taskqueue_run_locked (queue=0xfffff80004222800)
> at /usr/src/sys/kern/subr_taskqueue.c:467
> #12 0xffffffff80c30e18 in taskqueue_thread_loop (arg=<optimized out>) at
> /usr/src/sys/kern/subr_taskqueue.c:773
> #13 0xffffffff80b90c53 in fork_exit (callout=0xffffffff80c30d80
> <taskqueue_thread_loop>, arg=0xfffff800041d90b0, frame=0xfffffe004dcf9bc0)
>     at /usr/src/sys/kern/kern_fork.c:1065
> #14 <signal handler called>
> (kgdb) frame 6
> #6  0xffffffff828a3b83 in metaslab_free_concrete (vd=0xfffff80004623000,
> offset=137438954496, asize=<optimized out>, checkpoint=0)
>     at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:3593
> 3593            VERIFY0(P2PHASE(offset, 1ULL << vd->vdev_ashift));
> (kgdb) p /x offset
> $1 = 0x2000000400
> (kgdb) p /x vd->vdev_ashift
> $2 = 0xc
> (kgdb) frame 9
> #9  0xffffffff8290217c in zio_dva_free (zio=0xfffff80004378830) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:3070
> 3070            metaslab_free(zio->io_spa, zio->io_bp, zio->io_txg, B_FALSE);
> (kgdb) p /x zio->io_spa->spa_root_vdev->vdev_child[0]->vdev_removing
> $3 = 0x1
> (kgdb) p /x zio->io_spa->spa_root_vdev->vdev_child[0]->vdev_ashift
> $4 = 0x9
> (kgdb) p /x zio->io_spa->spa_root_vdev->vdev_child[1]->vdev_ashift
> $5 = 0xc
> (kgdb) frame 8
> #8  metaslab_free (spa=<optimized out>, bp=0xfffff800043788a0,
> txg=41924766, now=<optimized out>)
>     at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:4145
> 4145                            metaslab_free_dva(spa, &dva[d], checkpoint);
> (kgdb) p /x *bp
> $6 = {blk_dva = {{dva_word = {0x100000002, 0x10000002}},
>     {dva_word = {0x0, 0x0}}, {dva_word = {0x0, 0x0}}},
>     blk_prop = 0x8000020200010001, blk_pad = {0x0, 0x0},
>     blk_phys_birth = 0x0, blk_birth = 0x4, blk_fill = 0x0,
>     blk_cksum = {zc_word = {0x0, 0x0, 0x0, 0x0}}}
> -------
>
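
The first dva_word pair in the block pointer above can be decoded by hand.
The sketch below is a hedged illustration, not the kernel's code: it
assumes the usual ZFS DVA layout (vdev id in the upper 32 bits of word 0,
allocated size in the low 24 bits of word 0, gang flag in bit 63 of word 1,
offset in the low 63 bits of word 1, sizes and offsets stored in 512-byte
units).  decode_dva is a hypothetical helper name.

```python
SPA_MINBLOCKSHIFT = 9  # assumed: DVA sizes/offsets are stored in 512-byte units


def decode_dva(word0: int, word1: int) -> dict:
    """Split a two-word DVA into its fields (layout is an assumption, see above)."""
    return {
        "vdev": (word0 >> 32) & 0xFFFFFFFF,
        "asize": (word0 & 0xFFFFFF) << SPA_MINBLOCKSHIFT,
        "gang": bool(word1 >> 63),
        "offset": (word1 & ((1 << 63) - 1)) << SPA_MINBLOCKSHIFT,
    }


# Values copied from "p /x *bp" in frame 8.
dva = decode_dva(0x100000002, 0x10000002)
print(f"vdev={dva['vdev']} offset={dva['offset']:#x} asize={dva['asize']:#x}")
```

Under those assumptions the DVA decodes to vdev 1 (mirror-1, ashift 12),
offset 0x2000000400 and asize 0x400.  The offset equals the one kgdb shows
in frame 6, and a 1 KiB allocation at a 1 KiB boundary is valid for
ashift 9 but not 4 KiB aligned, which is consistent with the failed
assertion.
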
> Some notes on how I got into this state, from the forum post
> (https://forums.freebsd.org/threads/kernel-panic-loop-after-zpool-remove.75632/):
>
> I had two 2TB drives in a mirror configuration for the last 7 years,
> upgrading FreeBSD and zfs as time went by.  I finally needed more storage
> and tried to add two 4TB drives as a second vdev mirror:
> 'zpool add storage mirror /dev/ada3 /dev/ada4'
>
> Next, 'zpool status' showed:
> ---
>   pool: storage
> state: ONLINE
> status: One or more devices are configured to use a non-native block size.
>         Expect reduced performance.
> action: Replace affected devices with devices that support the
>         configured block size, or migrate data to a properly configured
>         pool.
>   scan: scrub repaired 0 in 0 days 08:39:28 with 0 errors on Sat May  9 01:19:54 2020
> config:
>
>         NAME                             STATE     READ WRITE CKSUM
>         storage                          ONLINE       0     0     0
>           mirror-0                       ONLINE       0     0     0
>             ada1                         ONLINE       0     0     0  block size: 512B configured, 4096B native
>             ada2                         ONLINE       0     0     0  block size: 512B configured, 4096B native
>           mirror-1                       ONLINE       0     0     0
>             ada3                         ONLINE       0     0     0
>             ada4                         ONLINE       0     0     0
>
> errors: No known data errors
> ---
>
> I should have stopped there, but saw the block size warning and thought I
> would try to fix it.  The zdb showed mirror-0 with ashift 9 (512 byte
> alignment) and mirror-1 with ashift 12 (4096 byte alignment).  I issued
> 'zpool remove storage mirror-0' and quickly went into a panic reboot loop.
> When I reboot into single user mode, the first zfs or zpool command loads
> the driver and it panics again.  With the new drives powered off, the
> system boots without panicking, but the pool fails to import because it is
> missing a top-level vdev (mirror-1).
>
>
>