Re: Debugging a (potentially?) ZFS-related panic, and discussion about large patchsets

From: Mark Johnston <markj_at_freebsd.org>
Date: Tue, 11 Jan 2022 02:50:42 UTC

On Tue, Jan 11, 2022 at 12:43:06AM +0100, Mateusz Guzik wrote:
> On 1/11/22, Mark Johnston <markj@freebsd.org> wrote:
> > On Mon, Jan 10, 2022 at 05:11:16PM -0500, Shawn Webb wrote:
> >> Hey all,
> >>
> >> So I'm getting an interesting ZFS-related kernel panic. I've uploaded
> >> the core.txt at [0]. I suspect it's related to FreeBSD commit
> >> 681ce946f33e75c590e97c53076e86dff1fe8f4a (zfs: merge
> >> openzfs/zfs@f291fa658 (master) into main).
> >>
> >> I'm able to reproduce it on a single system with some level of
> >> determinism: I'm building the security appliance firmware at ${DAYJOB}
> >> in a bhyve VM that's backed by a zvol. The host is a Dell Precision
> >> 7540 laptop with a single NVMe drive in it. The VM is configured with
> >> a single zvol, booting with UEFI.
> >>
> >> Looking at the commit email sent to dev-commits-src-all@, I see this:
> >> 146 files changed, 4933 insertions(+), 1572 deletions(-)
> >>
> >> Strangely, when I run `git show
> >> 681ce946f33e75c590e97c53076e86dff1fe8f4a`, I only see a small subset
> >> of those changes.
> >
> > That is a merge commit.  You need to specify that you want a diff
> > against the first parent (the preceding FreeBSD), so something
> > equivalent to "git diff --stat 681ce946f^ 681ce946f".  Use
> > "git log 681ce946f^2" to see the merged OpenZFS commits.
> >
> >> As a downstream consumer of 14-CURRENT, how am I supposed to even
> >> start debugging such a large patchset in any manner that respects my
> >> time?
> >>
> >> It seems to me that breaking up commits into smaller, bite-size chunks
> >> would make life easier for those experiencing bugs, especially ones
> >> that result in kernel panics.
> >
> > That's up to the upstream project, in this case OpenZFS.
> >
> >> ZFS in and of itself is a beast, and I've yet to study any of its
> >> code, so when there's a commit that large, even thinking about
> >> debugging it is a daunting task.
> >>
> >> Needless to say, I'm going to need some hand holding here for
> >> debugging this. Anyone have any idea what's going on?
> >
> > To start, you'll need to look at the stack trace for the thread with tid
> > 100061.
> >
> 
> imo the kernel should be patched to obtain the trace on its own. As
> the target has interrupts disabled it will have to do it with NMI, but
> support for that got scrapped in
> 
> commit 1c29da02798d968eb874b86221333a56393a94c3
> Author: Mark Johnston <markj@FreeBSD.org>
> Date:   Fri Jan 31 15:43:33 2020 +0000
> 
>     Reimplement stack capture of running threads on i386 and amd64.

More general and useful, to me at least, is having "acttrace" output
available in core.txt.  So I propose https://reviews.freebsd.org/D33817

I don't think the NMI-based stack(9) machinery for capturing stacks is
very useful here anyway.  We already raise NMIs on all CPUs during a
panic, so we can just reuse that and add a handler in ipi_nmi_handler()
that calls kdb_backtrace() on the target CPU.
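
For reference, the merge-commit inspection described in the quoted text
above can be tried on a throwaway repository.  This is only a sketch: the
branch name "vendor", the file names, and the commit messages are made up
for illustration and stand in for the OpenZFS merge.  By default "git show"
prints a combined diff for a merge, which typically lists only files that
differ from every parent; that is probably why only a small subset of the
changes shows up.

```shell
#!/bin/sh
# Build a scratch repository with one merge commit, then inspect it the
# same way as the FreeBSD/OpenZFS merge discussed above.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.org"   # hypothetical identity
git config user.name "Demo"

# Mainline history (stands in for FreeBSD main).
echo base > file.txt
git add file.txt
git commit -q -m "mainline: base"

# Side branch (stands in for the upstream project being merged).
git checkout -q -b vendor
echo a > a.txt; git add a.txt; git commit -q -m "vendor: add a"
echo b > b.txt; git add b.txt; git commit -q -m "vendor: add b"

# Back on the mainline, add a local change and merge the side branch.
git checkout -q -
echo local > local.txt; git add local.txt
git commit -q -m "mainline: local change"
git merge -q --no-ff -m "merge vendor into main" vendor
merge=$(git rev-parse HEAD)

# Full diff of the merge relative to its first parent (the mainline).
stat=$(git diff --stat "$merge^" "$merge")
echo "$stat"

# History reachable from the second parent (the merged branch).
merged_log=$(git log --format=%s "$merge^2")
echo "$merged_log"
```

Note that "git log <merge>^2" also lists any older commits the two
branches share; "git log <merge>^1..<merge>^2" restricts the output to
commits that the merge brought in.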