[Bug 237544] graphics/drm-fbsd12.0-kmod: panic on 12-STABLE with Radeon HD 7450 (but not with drm-fbsd11.2-kmod)

From: <bugzilla-noreply_at_freebsd.org>
Date: Thu, 30 Dec 2021 20:55:13 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237544

--- Comment #11 from Bill Paul <noisetube@gmail.com> ---
So, since I'm off work this week and have not much else to do, I decided to try
isolating the actual problem here. Now that I have a known working set of code
(drm-fbsd11.2-kmod) I thought I could compare it to the non-working code
(drm-fbsd12.0-kmod) and gradually bisect things to narrow down the fault.

After much hair-pulling and gnashing of teeth, I finally isolated things down
to the dma-fence module in the linuxkpi code.

Here's what I tried:

- Replaced the contents of the drivers/gpu/drm/radeon directory in
drm-fbsd12.0-kmod with the contents from the radeon directory in
drm-fbsd11.2-kmod
- Result: no change, panic still occurred

- Replaced the contents of the drivers/gpu/drm/ttm directory in
drm-fbsd12.0-kmod with the contents of the ttm directory in drm-fbsd11.2-kmod
(as well as the associated header files)
- Result: no change, panic still occurred

- Replaced the contents of the linuxkpi and drivers/gpu/drm/ttm directories in
drm-fbsd12.0-kmod with the contents of linuxkpi and ttm directories from
drm-fbsd11.2-kmod (as well as the associated header files)
- Result: No panic

- Replaced _just_ the contents of the linuxkpi directory in drm-fbsd12.0-kmod
with the contents of the linuxkpi directory in drm-fbsd11.2-kmod (this time
taking care to preserve the ttm module; they are somewhat tightly coupled so
this took a bit more effort)
- Result: No panic

- Replaced _just_ the dma-fence.h and linux_dmafence.c modules in the linuxkpi
directory in drm-fbsd12.0-kmod with the ones from drm-fbsd11.2-kmod, and also
tweaked linux_sync_file.c a little (it uses an API from the 12.0 code which
isn't in the 11.2 code)
- Result: No panic

I'm still not exactly sure what's wrong here, but there seems to be a problem
in the dma-fence module with locking and/or reference counting that causes
fence structures to be deleted unexpectedly. This is what leads to the traps on
bad pointers.
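
To illustrate the general failure mode (not the actual bug, which has not been
pinned down), here is a minimal userspace sketch of how one unbalanced "put"
leaves another holder with a pointer to a released object. The struct and
function names are hypothetical stand-ins, not linuxkpi code, and a flag is set
where real code would free the memory.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted fence object. */
struct fence {
	int  refcount;
	bool released;	/* set where real code would free the memory */
};

static void
fence_get(struct fence *f)
{
	assert(!f->released);
	f->refcount++;
}

static void
fence_put(struct fence *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0)
		f->released = true;
}

/*
 * Returns true if a holder that still believes it owns a reference
 * ends up pointing at a released fence.
 */
static bool
unbalanced_put_leaves_dangling_holder(void)
{
	struct fence f = { .refcount = 1, .released = false };

	fence_get(&f);	/* holder B takes a reference: count is 2 */
	fence_put(&f);	/* holder A drops its reference: count is 1 */
	fence_put(&f);	/* stray extra put: count is 0, fence released */

	/* Holder B never called fence_put(), yet its fence is gone. */
	return (f.released);
}
```

In a kernel, the next access through holder B's pointer would touch freed
memory, which is consistent with a trap on a bad pointer.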

I created a custom tarball of the drm-fbsd12.0-kmod port which includes patches
to the FreeBSDDesktop 4.16 code to revert the dma-fence code as described
above. You can download it from here:

http://people.freebsd.org/~wpaul/radeon/drm-fbsd12.0-kmod.tar.gz

The specific things I did are:

1) Replaced dma-fence.h and linux_dmafence.c in the drm-fbsd12.0-kmod port with
the versions from drm-fbsd11.2-kmod.

2) Added a compat wrapper function in dma-fence.h for dma_fence_get_rcu_safe()
which just calls dma_fence_get_rcu().

3) Added a compat macro in dma-fence.h for dma_fence_is_signaled_locked() which
just calls dma_fence_is_signaled().

4) In linux_sync_file.c, changed the sync_fill_fence_info() function back to
how it looked in the 11.2 codebase, because it uses dma_fence_get_status() and
DMA_FENCE_FLAG_TIMESTAMP_BIT, which were not available in the older 11.2
dma-fence code.
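
For readers who want a picture of what 2) and 3) amount to, here is a minimal
userspace sketch. It is not the code from the tarball: struct dma_fence and the
two "older" functions are simplified stand-ins, and the pointer-to-pointer
signature of dma_fence_get_rcu_safe() is an assumption based on the newer Linux
API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in; the real struct dma_fence has many more fields. */
struct dma_fence {
	int  refcount;	/* stands in for the kernel's reference count */
	bool signaled;	/* stands in for the "signaled" flag bit */
};

/* Stand-in for the older API: take a reference unless the fence is dying. */
static struct dma_fence *
dma_fence_get_rcu(struct dma_fence *fence)
{
	if (fence == NULL || fence->refcount == 0)
		return (NULL);
	fence->refcount++;
	return (fence);
}

/* Stand-in for the older signaled check. */
static bool
dma_fence_is_signaled(struct dma_fence *fence)
{
	return (fence->signaled);
}

/*
 * Item 2: compat wrapper. Assumes the newer callers pass a pointer to
 * the fence pointer; the wrapper just forwards to the older call.
 */
static struct dma_fence *
dma_fence_get_rcu_safe(struct dma_fence **fencep)
{
	return (dma_fence_get_rcu(*fencep));
}

/* Item 3: compat macro that maps the newer name onto the older call. */
#define	dma_fence_is_signaled_locked(fence)	dma_fence_is_signaled(fence)
```

Note that a shim like this gives up whatever extra guarantees the newer
functions were meant to provide; it only keeps the newer callers compiling
against the older module.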

Just unpack the tarball under /usr/ports/graphics in place of the old one and
then run make, followed by "make deinstall" and "make reinstall".

It occurred to me that instead of taking the older 11.2 dma-fence module and
porting it forward, it might make more sense to take the 13.0 module and port
it back. But this assumes that the drm-fbsd13.0-kmod code doesn't have the same
stability problem in it as drm-fbsd12.0-kmod, and I don't know if that's true.
(So far nobody has said whether or not they're using a Radeon card with 13.0
and whether or not they've encountered the same problems.) I may still try this
anyway if I'm still sufficiently bored.

So far I've tested this on two devices:

vgapci0@pci0:1:0:0:     class=0x030000 card=0x21261028 chip=0x68f91002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Cedar [Radeon HD 5000/6000/7350/8350 Series]'
    class      = display
    subclass   = VGA

vgapci0@pci0:0:1:0:     class=0x030000 card=0x168b103c chip=0x96481002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Sumo [Radeon HD 6480G]'
    class      = display
    subclass   = VGA

I'm using the machine with the CEDAR device right now. The laptop with the SUMO
device is much more prone to crashing. Usually what I do to provoke it is:

- Boot and load the driver
- Plug in my phone and set up tethering over USB
- Start KDE5
- Start Firefox
- Browse Facebook or Reddit for a while

It usually panics within a few minutes.

Lastly, I have a question: I followed up to this particular PR because it
seemed to most closely match the problems I was having, but it's been closed.
Should I open a new PR? This bug is still present with 12.3 and I'm clearly not
the only one affected by it. (I also still can't explain why it doesn't seem to
affect the i915kms driver.)
