[Bug 253461] LinuxKPI: [AMD/ATI] RV730 PRO [Radeon HD 4650] crashes kernel

From: <bugzilla-noreply_at_freebsd.org>
Date: Sun, 09 Jan 2022 23:30:52 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253461

--- Comment #9 from Bill Paul <noisetube@gmail.com> ---
(In reply to Vladimir Kondratyev from comment #7)

It looked to me that originally Linux had the dma_fence_signal() API and later
a new API dma_fence_signal_locked() was added. According to what I've read, the
idea is that dma_fence_signal_locked() can be used if the caller is already
holding the DMA fence object spinlock, while the older dma_fence_signal()
function takes the lock for you.

The question here is: when you signal a dma fence object, and you invoke its
attached callout routines, do you hold the spinlock or do you drop it?

The older linuxkpi code in drm-fbsd11.2-kmod was based on Linux 4.11 and only
had the dma_fence_signal() API, and that code always held the fence spinlock
when invoking the callouts.

In drm-fbsd12.0-kmod, based on Linux 4.16, both dma_fence_signal() and
dma_fence_signal_locked() are present. HOWEVER, the logic is now such that both
functions drop the dma fence spinlock when calling the callouts.

This changes the behavior of dma_fence_signal(), and I think the change was
wrong (though likely unintentional). Now, dma_fence_signal() drops the spinlock
when invoking the callouts. This does not seem to harm the Intel i915kms.ko
driver, but it seems to cause the radeonkms.ko driver driver to panic when the
system is under load. I must assume that dropping the lock leads to a race
condition when two different threads try to access the same dma fence object.

If you browse the most recent Linux kernel code, you can also see that this
behavior is inconsistent with the native Linux implementations of
dma_fence_signal() and dma_fence_signal_locked():

https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-fence.c#L376

The dma_fence_signal_timestamp_locked() function shown here is used by both
dma_fence_signal() and dma_fence_signal_locked(). dma_fence_signal() takes the
fence spinlock before calling it. Note that the fence spinlock is _not_
released when invoking the callbacks.

From this I am forced to conclude:

- When calling dma_fence_signal(), the fence spinlock is supposed to be held
until the function returns, including when the callbacks are called.
- When calling dma_fence_signal_locked(), the same is true, except it is the
caller that's expected to take the fence spinlock.
- The current behavior in drm-fbsd12.0-kmod where the lock is dropped when
invoking the callouts is therefore wrong on two counts: it deviates from the
Linux behavior, which breaks synchronization in the Radeon driver.

I think my fix preserves the expected behavior of both routines, because
dma_fence_signal_unlocked() does not call dma_fence_signal_unlocked_sub() with
the spinlock held, while dma_fence_signal() does.

My office machine with the CAICOS chipset has been running with this fix for a
week now and has been stable. I've also been using the same fix on my laptop
with the SUMO chipset with the same fix for a bit longer and it also hasn't
crashed. Before the laptop would not last more than 5 minutes before it would
panic.

-Bill

-- 
You are receiving this mail because:
You are the assignee for the bug.