[Bug 267028] kernel panics when booting with both (zfs,ko or vboxnetflt,ko or acpi_wmi.ko) and amdgpu.ko

From: <bugzilla-noreply_at_freebsd.org>
Date: Tue, 21 Mar 2023 01:07:55 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=267028

--- Comment #145 from Mark Millard <marklmi26-fbsd@yahoo.com> ---
(In reply to George Mitchell from comment #144)

Looking at your full list of attachments, it appears that . . .

All the shutdown time crashes have:

fault virtual address   = 0x0

(And we might now have a known context for
reproducing this type of failure: late amdgpu
loading but no XFCE use.)

All the dbuf_evict_thread related crashes have:

fault virtual address   = 0x7

(Late amdgpu loading but having used XFCE.)

All the kldload related crashes have:

Fatal trap 9: general protection fault while in kernel mode
(but no explicit fault address listed)

(Early amdgpu loading.)
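
As a rough aid for sorting further attachments into the
three groups above, a small script along the following
lines could be used. It is only a sketch: the function name
classify_panic and the group labels are made up for
illustration, the patterns are taken from the panic lines
quoted above, and the mapping reflects only what has been
observed in this report so far.

```python
import re

# Patterns based on the panic text quoted in this comment.
FAULT_ADDR_RE = re.compile(r"fault virtual address\s*=\s*(0x[0-9a-fA-F]+)")
GPF_RE = re.compile(r"Fatal trap 9: general protection fault")


def classify_panic(text):
    """Return a label for a panic log (hypothetical helper).

    The labels mirror the three groups seen in this bug's
    attachments; anything else is reported as unclassified.
    """
    match = FAULT_ADDR_RE.search(text)
    if match:
        addr = int(match.group(1), 16)
        if addr == 0x0:
            return "shutdown-time group (fault address 0x0)"
        if addr == 0x7:
            return "dbuf_evict_thread group (fault address 0x7)"
        return "unclassified fault address %#x" % addr
    if GPF_RE.search(text):
        return "kldload group (general protection fault)"
    return "unclassified"
```

A log that matches none of the known patterns is reported
as unclassified rather than forced into a group, since a
new fault address would itself be useful information.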


My guess is something is trashing memory in a way
that involves writing zeros over some pointer values
that it should not be touching. Later code extracts
such zeros and applies any offset and then tries to
dereference the result, resulting in a crash.
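
To illustrate the arithmetic behind that guess (not any
actual kernel code), here is a minimal sketch. The helper
fault_address and the offsets passed to it are assumptions
chosen to match the observed fault addresses; no specific
kernel structure or field has been identified.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF  # amd64 virtual addresses are 64 bits wide


def fault_address(pointer_value, field_offset):
    """Address touched when code accesses pointer_value + field_offset.

    Hypothetical helper: models "extract the pointer, apply
    an offset, dereference the result".
    """
    return (pointer_value + field_offset) & MASK_64


ZEROED_POINTER = 0  # a pointer that was overwritten with zeros

# No offset applied: matches the shutdown-time crashes.
print(hex(fault_address(ZEROED_POINTER, 0x0)))  # 0x0

# Assumed offset of 7: matches the dbuf_evict_thread crashes.
print(hex(fault_address(ZEROED_POINTER, 0x7)))  # 0x7
```

Under this model the small fault addresses are just the
offsets themselves, which is why they point at a zeroed
pointer rather than at a random bad pointer value.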

That you got "fault virtual address = 0x0" for shutdown
without having involved XFCE, suggests that a problem is
already in place before XFCE is potentially involved:
XFCE is not required. (XFCE use might lead to more
trashed memory than otherwise, leading to the 0x7
fault address cases.)

But I do not see how to get solid evidence for or
against such a hypothesis (or related ones).

The only thing I can identify that is likely unique to
your context --but is involved with amdgpu-- is the
involvement of the amdgpu_raven_gpu_*.ko modules.

Unfortunately, moving your context to a different system
that avoids such module use, or finding someone with a
separate system that does use such modules (and who is
willing to set up experiments), is non-trivial for both
directions of testing.

Beyond possibly some checking on the degree/ease of
repeatability, I do not see how to gather better
information, much less get anywhere near directly
actionable information for fixing the crashes.

The one thing we have not looked at is the crash
dumps themselves, examining what memory looks like
and such. But I do not know what to do for that
either, relative to known-useful information. Such a
direction would be very exploratory and likely very
time consuming.
