[Bug 253461] [AMD/ATI] RV730 PRO [Radeon HD 4650] panic kernel

From: <bugzilla-noreply_at_freebsd.org>
Date: Tue, 04 Jan 2022 22:52:02 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253461

Bill Paul <noisetube@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |noisetube@gmail.com

--- Comment #3 from Bill Paul <noisetube@gmail.com> ---
I believe I have a fix for this bug. It is a problem with the linuxkpi code in
the FreeBSDDesktop-kms-drm-4.16.g20201016-8843e1fc5_GH0.tar.gz distribution.

Notes:

- This problem has been there for some time. I've had it happen in FreeBSD
12.2-RELEASE and FreeBSD 12.3-RELEASE.

- It's not confined to a single Radeon card. I've observed the problem with the
following hardware on different machines:

vgapci0@pci0:1:0:0:     class=0x030000 card=0x21261028 chip=0x68f91002 rev=0x00
hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Cedar [Radeon HD 5000/6000/7350/8350 Series]'
    class      = display
    subclass   = VGA

vgapci0@pci0:0:1:0: class=0x030000 card=0x168b103c chip=0x96481002 rev=0x00
hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Sumo [Radeon HD 6480G]'
    class      = display
    subclass   = VGA

vgapci1@pci0:131:0:0:   class=0x030000 card=0x90b8103c chip=0x67711002 rev=0x00
hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Caicos XTX [Radeon HD 8490 / R5 235X OEM]'
    class      = display
    subclass   = VGA

(Note that the Sumo device is built into a laptop, an HP ProBook 4535S.)

- This problem has been reported by others. PR 237544 is a duplicate. The
panics I experienced had the same stack traces as shown in both PRs.

- PR 237544 provides an important hint that this crash did _not_ happen with
the drm-fbsd11.2-kmod port/package. Although it has been deprecated, I was able
to build and install the drm-fbsd11.2-kmod code on my FreeBSD 12.3-RELEASE
system (the laptop) and the crashes went away.

- In my case, the panics were more likely to occur when the system was under
load. The laptop seemed to trigger it more frequently (which actually made it
easier to track it down).

I tried to track the problem down by comparing the drm-fbsd11.2-kmod and
drm-fbsd12.0-kmod code and swapping bits of the 11.2 code into the 12.0 tree to
see what effect that would have. Eventually I traced the problem to the
linuxkpi code, and then to the dma-fence code, and then finally, to this
function in linuxkpi/gplv2/include/linux/dma-fence.h:

static inline void
dma_fence_signal_locked_sub(struct dma_fence *fence)
{
        struct dma_fence_cb *cur;

        while ((cur = list_first_entry_or_null(&fence->cb_list,
                    struct dma_fence_cb, node)) != NULL) {
                list_del_init(&cur->node);
                spin_unlock(fence->lock);   /* <-- No! */
                cur->func(fence, cur);
                spin_lock(fence->lock);     /* <-- No! */
        }
}

Note the two lines highlighted above.

The dma_fence_signal_locked_sub() routine is shared by both dma_fence_signal()
and dma_fence_signal_locked(). The latter function is intended to be used when
the caller is already holding the fence spinlock. The former takes the spinlock
itself.

The problem is that the above code causes the spinlock to be dropped in the
case where dma_fence_signal() is called. This is not the same behavior as the
older 11.2 code: in that case, the lock is held while the callbacks are invoked.
(I *think* this is also the case in the later code in FreeBSD 13.) I
believe that dropping the lock before calling the callbacks opens a race
condition window and this is what leads to the crash. It's difficult to
ascertain that this is what's happening from the crash stack traces, but in
my analysis I found that at least sometimes the problem was that something was
trying to dereference a NULL DMA fence pointer.

I patched my copy of the code to remove the spin_unlock() and spin_lock() calls
shown above, and that seemed to fix the problem. The laptop has not crashed
since I did this. I also made the same change to the 12.2-RELEASE system with
the "Cedar" card and exercised it a bit, and that one seemed to run ok too. I
have just patched the "Caicos" machine today and so far it's running stable as
well (this is my work machine and this is my first day back at the office for
the new year).
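
To illustrate the effect of the change, here is a minimal standalone sketch in
C. It does not use the real linuxkpi headers: every mock_* type and function
below is a hypothetical stand-in invented for this example, and the callback
list is a simple singly linked list rather than the real list_head. It only
shows the shape of the patched loop, where the callbacks run with the fence
lock still held.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the linuxkpi types; NOT the real definitions. */
struct mock_lock { int held; };

struct mock_fence;
struct mock_fence_cb {
        struct mock_fence_cb *next;
        void (*func)(struct mock_fence *, struct mock_fence_cb *);
};

struct mock_fence {
        struct mock_lock *lock;
        struct mock_fence_cb *cb_list;  /* head of the pending callbacks */
        int cbs_run_locked;             /* callbacks that saw the lock held */
};

static void
mock_lock_acquire(struct mock_lock *l)
{
        assert(!l->held);               /* non-recursive, like a spinlock */
        l->held = 1;
}

static void
mock_lock_release(struct mock_lock *l)
{
        assert(l->held);
        l->held = 0;
}

/*
 * Shape of the patched loop: the caller already holds the lock, and it
 * is NOT dropped around the callback invocation.
 */
static void
mock_fence_signal_locked_sub(struct mock_fence *fence)
{
        struct mock_fence_cb *cur;

        while ((cur = fence->cb_list) != NULL) {
                fence->cb_list = cur->next;     /* unlink before calling */
                cur->next = NULL;
                cur->func(fence, cur);          /* lock is still held here */
        }
}

/* Counterpart of the variant that takes the lock itself. */
static void
mock_fence_signal(struct mock_fence *fence)
{
        mock_lock_acquire(fence->lock);
        mock_fence_signal_locked_sub(fence);
        mock_lock_release(fence->lock);
}

/* Example callback: records whether it was invoked under the lock. */
static void
count_if_locked(struct mock_fence *fence, struct mock_fence_cb *cb)
{
        (void)cb;
        if (fence->lock->held)
                fence->cbs_run_locked++;
}

/* Queues two callbacks, signals, and reports how many ran under the lock. */
static int
demo_callbacks_run_locked(void)
{
        struct mock_lock lock = { 0 };
        struct mock_fence_cb second = { NULL, count_if_locked };
        struct mock_fence_cb first = { &second, count_if_locked };
        struct mock_fence fence = { &lock, &first, 0 };

        mock_fence_signal(&fence);
        return (fence.cbs_run_locked);
}
```

In this sketch both queued callbacks observe the lock as held. One consequence
of this arrangement is that a callback must not try to acquire the fence lock
itself, since the lock is not recursive.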

I created a version of the drm-fbsd12.0-kmod port with this change included as
a patch, which can be downloaded from here:

http://people.freebsd.org/~wpaul/radeon/drm-fbsd12.0-kmod.tar.gz

I will also attach the patch to this PR.

Can someone please test this to see if it fixes the problem for them too?

Note: I happen to have about 3 or 4 extra Radeon cards as spares (I rescued
these from the e-waste bin) and would be happy to send one to a developer if
that would help (assuming they have a machine with a slot that can accommodate
it).
