Re: ZFS + FreeBSD XEN dom0 panic

From: Roger Pau Monné <roger.pau_at_citrix.com>
Date: Sat, 26 Mar 2022 14:38:29 UTC
On Sat, Mar 26, 2022 at 02:08:06PM +0200, Ze Dupsys wrote:
> On 2022.03.26. 11:11, Roger Pau Monné wrote:
> >
> > Hm, do you think you could upload (or attach) your
> > /usr/lib/debug/boot/kernel/kernel.debug and provide an updated panic
> > trace using that same exact kernel?
> 
> Yes, it is just too big for email attachment.
> Uploaded at: https://files.fm/f/mp3v3qa22
> 
> This time I starved Dom0 of RAM (2G) to speed the panic up. The panic trace is
> the same.
> 
> Trace:
> Fatal trap 12: page fault while in kernel mode
> cpuid = 2; apic id = 04
> fault virtual address	= 0x22710028
> fault code		= supervisor read data, page not present
> instruction pointer	= 0x20:0xffffffff80c6a2b2
> stack pointer	        = 0x28:0xfffffe009e486b30
> frame pointer	        = 0x28:0xfffffe009e486b30
> code segment		= base 0x0, limit 0xfffff, type 0x1b
> 			= DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags	= interrupt enabled, resume, IOPL = 0
> current process		= 3995 (devmatch)
> trap number		= 12
> panic: page fault
> cpuid = 2
> time = 1648293768
> KDB: stack backtrace:
> #0 0xffffffff80c7c285 at kdb_backtrace+0x65
> #1 0xffffffff80c2e2e1 at vpanic+0x181
> #2 0xffffffff80c2e153 at panic+0x43
> #3 0xffffffff810c8b97 at trap+0xba7
> #4 0xffffffff810c8bef at trap+0xbff
> #5 0xffffffff810c8243 at trap+0x253
> #6 0xffffffff810a0848 at calltrap+0x8
> #7 0xffffffff80c86ed1 at rman_is_region_manager+0x241
> #8 0xffffffff80c3eb41 at sbuf_new_for_sysctl+0x101
> #9 0xffffffff80c3df8c at kernel_sysctl+0x3ec
> #10 0xffffffff80c3e603 at userland_sysctl+0x173
> #11 0xffffffff80c3e44f at sys___sysctl+0x5f
> #12 0xffffffff810c949c at amd64_syscall+0x10c
> #13 0xffffffff810a115b at Xfast_syscall+0xfb
> Uptime: 10m19s

It's weird, because here you get a page fault, but there are also
traces with:

general protection fault while in kernel mode
cpuid = 3; apic id = 06
instruction pointer     = 0x20:0xffffffff810c5d64
stack pointer           = 0x28:0xfffffe00a20fe990
frame pointer           = 0x28:0xfffffe00a20fe990
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 8998 (devmatch)
trap number             = 9
panic: general protection fault
cpuid = 3
time = 1647416577
KDB: stack backtrace:
#0 0xffffffff80c7ca05 at kdb_backtrace+0x65
#1 0xffffffff80c2ea11 at vpanic+0x181
#2 0xffffffff80c2e883 at panic+0x43
#3 0xffffffff810c9b97 at trap+0xba7
#4 0xffffffff810c907b at trap+0x8b
#5 0xffffffff810a0dc8 at calltrap+0x8
#6 0xffffffff80c83067 at kvprintf+0x1007
#7 0xffffffff80c83df9 at snprintf+0x59
#8 0xffffffff80c8768b at rman_is_region_manager+0x27b
#9 0xffffffff80c3f271 at sbuf_new_for_sysctl+0x101
#10 0xffffffff80c3e6bc at kernel_sysctl+0x3ec
#11 0xffffffff80c3ed33 at userland_sysctl+0x173
#12 0xffffffff80c3eb7f at sys___sysctl+0x5f
#13 0xffffffff810ca49c at amd64_syscall+0x10c
#14 0xffffffff810a16db at Xfast_syscall+0xfb

That shows a general protection fault instead of a page fault.

I've built a hypervisor with debug enabled for you, it's at:

https://people.freebsd.org/~royger/xen-debug

This is the same as the one in ports, just built with debug=y. If you
can place it in /boot/ and change your xen_kernel to:

xen_kernel="/boot/xen-debug"

It might provide some additional info.

I've also noticed that the process triggering the panic always seems
to be 'devmatch'.

> 
> cat /tmp/panic.log| sed -Ee 's/^#[0-9]* //' -e 's/ .*//' | xargs addr2line
> -e /usr/lib/debug/boot/kernel/kernel.debug
> /usr/src/sys/kern/subr_kdb.c:443
> /usr/src/sys/kern/kern_shutdown.c:0
> /usr/src/sys/kern/kern_shutdown.c:844
> /usr/src/sys/amd64/amd64/trap.c:944
> /usr/src/sys/amd64/amd64/trap.c:0
> /usr/src/sys/amd64/amd64/trap.c:0
> /usr/src/sys/amd64/amd64/exception.S:292
> /usr/src/sys/kern/subr_rman.c:0
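
For reference, the sed/xargs step quoted above can also be sketched in
Python. This is only an illustration, not something used in this thread:
the function names parse_frames and addr2line_command are made up here,
and the frame format is assumed to match the KDB backtraces shown in this
message ("#N 0xADDR at symbol+0xOFFSET").

```python
import re

# Matches KDB backtrace frames such as:
#   "#7 0xffffffff80c86ed1 at rman_is_region_manager+0x241"
FRAME_RE = re.compile(
    r"^#(\d+)\s+(0x[0-9a-fA-F]+)\s+at\s+(\S+?)\+(0x[0-9a-fA-F]+)\s*$"
)


def parse_frames(log_text):
    """Return (frame number, address, symbol, offset) for each frame line."""
    frames = []
    for line in log_text.splitlines():
        # Drop email quote markers ("> ") and surrounding whitespace.
        line = line.lstrip("> ").strip()
        m = FRAME_RE.match(line)
        if m:
            frames.append(
                (int(m.group(1)), m.group(2), m.group(3), int(m.group(4), 16))
            )
    return frames


def addr2line_command(frames,
                      debug_file="/usr/lib/debug/boot/kernel/kernel.debug"):
    """Build the argument list for addr2line; it is not executed here."""
    return ["addr2line", "-e", debug_file] + [addr for _, addr, _, _ in frames]
```

The resulting list could be handed to subprocess.run() on a machine that
has both addr2line and the matching kernel.debug file.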

I've been able to get a better trace with gdb and your debug symbols:

(gdb) info line *0xffffffff80c6a2b2
Line 1386 of "/usr/src/sys/kern/subr_bus.c" starts at address 0xffffffff80c6a2b2 <device_get_name+18>
   and ends at 0xffffffff80c6a2b6 <device_get_name+22>.
(gdb) info line *0xffffffff80c86ed1
Line 1052 of "/usr/src/sys/kern/subr_rman.c" starts at address 0xffffffff80c86ecc <sysctl_rman+540>
   and ends at 0xffffffff80c86ed5 <sysctl_rman+549>.

The page fault happens exactly at:

https://cgit.freebsd.org/src/tree/sys/kern/subr_bus.c?h=stable/13#n1386

Which is called from

https://cgit.freebsd.org/src/tree/sys/kern/subr_rman.c?h=stable/13#n1052

I'm trying to figure out how the device could be removed or
disconnected from the rman. I will try to create a patch to catch the
device that leaves rman regions when destroyed/removed.

Thanks, Roger.