[Bug 267028] kernel panics when booting with both (zfs,ko or vboxnetflt,ko or acpi_wmi.ko) and amdgpu.ko

From: <bugzilla-noreply_at_freebsd.org>
Date: Mon, 20 Mar 2023 23:03:19 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=267028

--- Comment #140 from Mark Millard <marklmi26-fbsd@yahoo.com> ---
(In reply to George Mitchell from comment #137)

All 4 are examples related to dbuf_evict_thread (a.k.a.
zfs dbuf related crashes), as I feared. All 4 look like:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x7
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff82600ba6

Looks to be in:

 5    1 0xffffffff82600000   3df128 zfs.ko
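
The check above (is the instruction pointer inside a module's
load range?) can be sketched as follows. This is only an
illustration: the helper names parse_kldstat_line and
module_for_address are hypothetical, and it assumes the kldstat
columns are Id, Refs, Address, Size, Name with Size printed in
hex. The kldstat line and the address are the ones quoted in
this comment.

```python
# Sketch: find which kernel module contains a faulting instruction pointer.
# Helper names are hypothetical; the kldstat column layout
# (Id Refs Address Size Name, Size in hex) is an assumption.

def parse_kldstat_line(line):
    """Parse one kldstat line into (name, base, size)."""
    fields = line.split()
    base = int(fields[2], 16)   # load address, hex with 0x prefix
    size = int(fields[3], 16)   # module size, hex without prefix
    return fields[4], base, size

def module_for_address(addr, kldstat_lines):
    """Return (name, offset) for the module covering addr, or None."""
    for line in kldstat_lines:
        name, base, size = parse_kldstat_line(line)
        if base <= addr < base + size:
            return name, addr - base
    return None

kldstat_lines = [" 5    1 0xffffffff82600000   3df128 zfs.ko"]
fault_ip = 0xffffffff82600ba6   # instruction pointer from the trap report

result = module_for_address(fault_ip, kldstat_lines)
if result is None:
    print("address is not inside any listed module")
else:
    print(result[0], hex(result[1]))
```

Feeding it the full kldstat report taken after a load order
change would show which module each reported address lands in.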


panic: page fault
cpuid = 1
time = 1679349400
KDB: stack backtrace:
#0 0xffffffff80c66ee5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbef at vpanic+0x17f
#2 0xffffffff80c1ba63 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff810ade4f at trap_pfault+0x4f
#5 0xffffffff81084fd8 at calltrap+0x8
#6 0xffffffff827ac768 at zap_evict_sync+0x68
#7 0xffffffff8267d74a at dbuf_destroy+0xba
#8 0xffffffff82683129 at dbuf_evict_one+0xf9
#9 0xffffffff8267b43d at dbuf_evict_thread+0x31d
#10 0xffffffff80bd8abe at fork_exit+0x7e
#11 0xffffffff8108604e at fork_trampoline+0xe

#6  0xffffffff810ade4f in trap_pfault (frame=0xfffffe00b3bb6d00, 
    usermode=false, signo=<optimized out>, ucode=<optimized out>)
    at /usr/src/sys/amd64/amd64/trap.c:763
#7  <signal handler called>
#8  avl_destroy_nodes (tree=tree@entry=0xfffff8001a80b5a0, 
    cookie=cookie@entry=0xfffffe00b3bb6dd0)
    at /usr/src/sys/contrib/openzfs/module/avl/avl.c:1023
#9  0xffffffff827ac768 in mze_destroy (zap=0xfffff8001a80b480)
    at /usr/src/sys/contrib/openzfs/module/zfs/zap_micro.c:402

One question is whether this repeats when amdgpu has been
loaded (again last) but no X11-like activity has ever been
started: that is, with amdgpu use limited to just the load
itself, or as close to that as is possible. (This is separate
from your zfs load time adjustment test.)

My guess is that the content of some memory area(s) is being
trashed in your context. I'm not sure how to track down
what is doing the trashing or where all the trashed area(s)
are, if that is what is going on.

At least we now have a clue about how to get this specific type
of crash. Before, I had no clue what an example initial context
might look like.


Note: Any change to the load order should come with a matching
kldstat report, to show the address ranges that end up involved.
