Reproducible page faults with drm-kmod on 12-Stable/amd64

From: Philipp Ost <pj_at_smo.de>
Date: Wed, 28 Jul 2021 15:50:28 UTC

Hi stable@!

Since switching back to my Radeon HD 5450, I get reproducible page 
faults and the occasional panic.

I am running FreeBSD 12.2-STABLE (stable/12-n233459-0f97f2a1857, amd64)
with a stripped-down GENERIC kernel built with DEBUG=-g.

I have installed these DRM modules:
drm-fbsd12.0-kmod-4.16.g20201016_2
drm-kmod-g20190710_1
gpu-firmware-kmod-g20210330

I built these after updating my machine to the above-mentioned
revision. Since then, I have rebuilt drm-fbsd12.0-kmod with DEBUG=on.

The radeonkms module gets loaded via /etc/rc.conf:
kld_list="/boot/modules/radeonkms.ko"

The graphics card gets identified as follows:
vgapci0@pci0:1:0:0:     class=0x030000 card=0xe164174b chip=0x68f91002 rev=0x00 hdr=0x00
     vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
     device     = 'Cedar [Radeon HD 5000/6000/7350/8350 Series]'
     class      = display
     subclass   = VGA

Most page faults are DRM related:

1.
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 12
fault virtual address	= 0xfffff803a38c9180
fault code		= supervisor read instruction, protection violation
instruction pointer	= 0x20:0xfffff803a38c9180
stack pointer		= 0x28:0xfffffe00a89036a8
frame pointer		= 0x28:0xfffffe00a89036a0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 951 (Renderer)
trap number 		= 12
panic: page fault
cpuid = 2
time = 1627412707
KDB: stack backtrace:
#0 0xffffffff8076af45 at kdb_backtrace+0x65
#1 0xffffffff8071f21b at vpanic+0x17b
#2 0xffffffff8071f093 at panic+0x43
#3 0xffffffff80a7e9a1 at trap_fatal+0x391
#4 0xffffffff80a7e9ff at trap_pfault+0x4f
#5 0xffffffff80a7e046 at trap+0x286
#6 0xffffffff80a56a08 at calltrap+0x8
#7 0xffffffff81cf681c at reservation_object_test_signaled_rcu+0x1dc
#8 0xffffffff81bc2350 at radeon_gem_busy_ioctl+0x30
#9 0xffffffff81cad2e1 at drm_ioctl_kernel+0xf1
#10 0xffffffff81cad589 at drm_ioctl+0x289
#11 0xffffffff809788b0 at linux_file_ioctl+0x330
#12 0xffffffff80788e47 at kern_ioctl+0x2b7
#13 0xffffffff80788aea at sys_ioctl+0xfa
#14 0xffffffff80a7f557 at amd64_syscall+0x387
#15 0xffffffff80a5732e at fast_syscall_common+0xf8
Uptime: 12m19s
Automatic reboot in 15 seconds - press a key on the console to abort
--> Press a key on the console to reboot,
--> or switch off the system now.

2. This one happened during `make index`:
Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 13
fault virtual address	= 0x60045dabb18
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff8163b2a7
stack pointer		= 0x28:0xfffffe00a7fb7380
frame pointer		= 0x28:0xfffffe00a7fb73b0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 85505 (sh)
trap number		= 12
panic: page fault
cpuid = 3
time = 1627414262
KDB: stack backtrace:
#0 0xffffffff8076af45 at kdb_backtrace+0x65
#1 0xffffffff8071f21b at vpanic+0x17b
#2 0xffffffff8071f093 at panic+0x43
#3 0xffffffff80a7e9a1 at trap_fatal+0x391
#4 0xffffffff80a7e9ff at trap_pfault+0x4f
#5 0xffffffff80a7e046 at trap+0x286
#6 0xffffffff80a56a08 at calltrap+0x8
#7 0xffffffff816f75b2 at zfs_freebsd_write+0xb72
#8 0xffffffff80b2039b at VOP_WRITE_APV+0xeb
#9 0xffffffff80801961 at vn_write+0x261
#10 0xffffffff80801433 at vn_io_fault_doio+0x43
#11 0xffffffff807fee0c at vn_io_fault1+0x15c
#12 0xffffffff807fce05 at vn_io_fault+0x185
#13 0xffffffff80788750 at dofilewrite+0xb0
#14 0xffffffff807882d0 at sys_write+0xc0
#15 0xffffffff80a7f557 at amd64_syscall+0x387
#16 0xffffffff80a5732e at fast_syscall_common+0xf8
Uptime: 7m48s
Automatic reboot in 15 seconds - press a key on the console to abort
--> Press a key on the console to reboot,
--> or switch off the system now.

3. The lone kernel panic:
panic: BUG ON !list_empty(&fence->cb_list) failed at /usr/ports/graphics/drm-fbsd12.0-kmod/work/kms-drm-8843e1fc5/linuxkpi/gplv2/include/linux/dma-fence.h:91
cpuid = 1
time = 1627415383
KDB: stack backtrace:
#0 0xffffffff8076af45 at kdb_backtrace+0x65
#1 0xffffffff8071f21b at vpanic+0x17b
#2 0xffffffff8071f093 at panic+0x43
#3 0xffffffff81cf5c84 at reservation_object_add_shared_fence+0x274
#4 0xffffffff81d0b289 at ttm_eu_fence_buffer_objects+0x69
#5 0xffffffff81bb2b72 at radeon_cs_parser_fini+0x52
#6 0xffffffff81bb26eb at radeon_cs_ioctl+0x8fb
#7 0xffffffff81cad2e1 at drm_ioctl_kernel+0xf1
#8 0xffffffff81cad589 at drm_ioctl+0x289
#9 0xffffffff809788b0 at linux_file_ioctl+0x330
#10 0xffffffff80788e47 at kern_ioctl+0x2b7
#11 0xffffffff80788aea at sys_ioctl+0xfa
#12 0xffffffff80a7f557 at amd64_syscall+0x387
#13 0xffffffff80a5732e at fast_syscall_common+0xf8
Uptime: 1m58s
Automatic reboot in 15 seconds - press a key on the console to abort
--> Press a key on the console to reboot,
--> or switch off the system now.

4. The most recent one:
Fatal trap 12: page fault while in kernel mode
cpuid = 4; apic id = 14
fault virtual address	= 0x18
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff81d0e43f
stack pointer		= 0x0:0xfffffe00a8908750
frame pointer		= 0x0:0xfffffe00a89087e0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 1120 (Renderer)
trap number 		= 12
panic: page fault
cpuid = 4
time = 1627484091
KDB: stack backtrace:
#0 0xffffffff8076af45 at kdb_backtrace+0x65
#1 0xffffffff8071f21b at vpanic+0x17b
#2 0xffffffff8071f093 at panic+0x43
#3 0xffffffff80a7e9a1 at trap_fatal+0x391
#4 0xffffffff80a7e9ff at trap_pfault+0x4f
#5 0xffffffff80a7e046 at trap+0x286
#6 0xffffffff80a56a08 at calltrap+0x8
#7 0xffffffff81bd8dac at radeon_ttm_fault+0x4c
#8 0xffffffff8097b685 at linux_cdev_pager_populate+0x125
#9 0xffffffff80a21fee at vm_fault+0x53e
#10 0xffffffff80a21990 at vm_fault_trap+0x60
#11 0xffffffff80a7eb4c at trap_pfault+0x19c
#12 0xffffffff80a7e1d0 at trap+0x410
#13 0xffffffff80a56a08 at calltrap+0x8
Uptime: 1h32m55s
Automatic reboot in 15 seconds - press a key on the console to abort
--> Press a key on the console to reboot,
--> or switch off the system now.

These are all I could capture till now (transcribed by hand, any typos 
are my fault...).
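
To compare the traces at a glance, here is a rough sketch that pulls
out the first frame below the panic/trap handling frames of a KDB
backtrace. The helper names and the list of "handler" frames are my own
assumptions, taken only from the traces in this mail; nothing here
comes from the kernel or drm-kmod.

```python
import re

# Frames that belong to panic/trap handling rather than the faulting code.
# Assumption: this set is derived only from the traces quoted in this mail.
HANDLER_FRAMES = {
    "kdb_backtrace", "vpanic", "panic",
    "trap_fatal", "trap_pfault", "trap", "calltrap",
}

# Matches lines such as "#7 0xffffffff81bd8dac at radeon_ttm_fault+0x4c".
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-f]+)\s+at\s+(\w+)\+(0x[0-9a-f]+)$")


def parse_frames(trace):
    """Return (index, address, symbol, offset) for every backtrace line."""
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((int(m.group(1)), int(m.group(2), 16),
                           m.group(3), int(m.group(4), 16)))
    return frames


def first_interesting_frame(trace):
    """Return the first symbol that is not a panic/trap handler, or None."""
    for _, _, symbol, _ in parse_frames(trace):
        if symbol not in HANDLER_FRAMES:
            return symbol
    return None


# Excerpt of trace 4 from this mail, used as sample input.
SAMPLE = """KDB: stack backtrace:
#0 0xffffffff8076af45 at kdb_backtrace+0x65
#1 0xffffffff8071f21b at vpanic+0x17b
#2 0xffffffff8071f093 at panic+0x43
#3 0xffffffff80a7e9a1 at trap_fatal+0x391
#4 0xffffffff80a7e9ff at trap_pfault+0x4f
#5 0xffffffff80a7e046 at trap+0x286
#6 0xffffffff80a56a08 at calltrap+0x8
#7 0xffffffff81bd8dac at radeon_ttm_fault+0x4c
#8 0xffffffff8097b685 at linux_cdev_pager_populate+0x125
"""

if __name__ == "__main__":
    print(first_interesting_frame(SAMPLE))
```

Run over the four traces above, this should single out the DRM/TTM
functions in traces 1, 3 and 4 and the ZFS write path in trace 2.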

Unfortunately, I was not able to get any sort of crash dump. I have
dumpdev=AUTO
dumpdir=/var/crash
savecore_enable=YES
in my /etc/rc.conf, but /var/crash is empty save for a file named minfree.

As I said, this is 100% reproducible. The time until something goes
haywire ranges from almost immediately to around two hours. Any
advice on how to fix this?

I'm happy to provide more information if needed.

Thanks in advance!
Philipp