-CURRENT fatal trap caused by cxgbe module

Tue Mar 3 02:47:42 UTC 2020

On Mon, Mar 2, 2020 at 6:55 PM Ryan Libby <rlibby at freebsd.org> wrote:
>
> On Sun, Mar 1, 2020 at 8:07 PM Dustin Marquess <dmarquess at gmail.com> wrote:
> >
> > So I've been fighting with every -CURRENT build from the last month or
> > so crashing instantly at boot.  I did notice that the kernels in the
> > various snapshot images were working, however, so I was trying to
> > figure out why.  At first I thought it was because I had INVARIANTS
> > and such disabled, but no, I finally figured it out.
> >
> > I've had in my /boot/loader.conf for a while now:
> >
> > if_cxgbe_load="YES"
> >
> > I guess that's because the stock installer kernels don't have cxgbe
> > enabled by default.  I added "device cxgbe" to my kernels a while ago.
> > Normally the kernel would give some error about the module already
> > being loaded and just continue.  As of the last month or so, however,
> > it just crashes instead:
> >
> > FreeBSD clang version 9.0.1 (git at github.com:llvm/llvm-project.git
> > c1a0a213378a458fbea1a5c77b315c7dce08fd05) (based on LLVM 9.0.1)
> > WARNING: WITNESS option enabled, expect reduced performance.
> > kernel trap 12 with interrupts disabled
> >
> >
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 0; apic id = 00
> > fault virtual address = 0x8
> > fault code = supervisor read data, page not present
> > instruction pointer = 0x20:0xffffffff80622931
> > stack pointer         = 0x28:0xffffffff8241c9a0
> > frame pointer         = 0x28:0xffffffff8241c9e0
> > code segment = base 0x0, limit 0xfffff, type 0x1b
> > = DPL 0, pres 1, long 1, def32 0, gran 1
> > processor eflags = resume, IOPL = 0
> > current process = 0 ()
> > trap number = 12
> > panic: page fault
> > cpuid = 0
> > time = 1
> >
> > KDB: stack backtrace:
> > db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff8241c600
> > vpanic() at vpanic+0x18a/frame 0xffffffff8241c660
> > panic() at panic+0x43/frame 0xffffffff8241c6c0
> > trap_fatal() at trap_fatal+0x386/frame 0xffffffff8241c720
> > trap_pfault() at trap_pfault+0x99/frame 0xffffffff8241c7a0
> > trap() at trap+0x4e9/frame 0xffffffff8241c8d0
> > calltrap() at calltrap+0x8/frame 0xffffffff8241c8d0
> > --- trap 0xc, rip = 0xffffffff80622931, rsp = 0xffffffff8241c9a0, rbp = 0xffffffff8241c9e0 ---
> > malloc() at malloc+0x51/frame 0xffffffff8241c9e0
> > sysctl_handle_string() at sysctl_handle_string+0x12d/frame 0xffffffff8241ca20
> > sysctl_root_handler_locked() at sysctl_root_handler_locked+0xa2/frame 0xffffffff8241ca70
> > sysctl_register_oid() at sysctl_register_oid+0x54c/frame 0xffffffff8241cd80
> > sysctl_register_all() at sysctl_register_all+0x88/frame 0xffffffff8241cda0
> > mi_startup() at mi_startup+0xf2/frame 0xffffffff8241cdf0
> > btext() at btext+0x2c
> > KDB: enter: panic
> > [ thread pid 0 tid 0 ]
> > Stopped at      kdb_enter+0x37: movq    $0,0xa5f4a6(%rip)
> > db>
> >
> > If I take the if_cxgbe_load out, however, it boots fine.
>
> Maybe you also have something defined in your /boot/loader.conf that
> causes a tunable to be set?
>
> It looks like there's just an ordering bug in kern_sysctl.c, where we
> call sysctl_register_all() with SI_SUB_KMEM, SI_ORDER_FIRST but we do
> MALLOC_DEFINE() with SI_SUB_KMEM, SI_ORDER_THIRD.  If
> sysctl_register_all() is going to malloc(), it needs to run after
> malloc_init(), and it looks like populating a string tunable causes it
> to malloc().

Ah, indeed, I do! That explains why Navdeep couldn't reproduce it.

-Dustin
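
The ordering problem Ryan describes can be modelled in a few lines of userland C. This is only a rough sketch, not kernel code: the struct, the model_* functions, boot_with_sysctl_order() and ORDER_LATER are hypothetical names made up for illustration, and the numeric constants are stand-ins rather than values to rely on. It assumes startup hooks are sorted by (subsystem, order) and run in that sequence, as the explanation above implies.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in constants for illustration; the real ones are in sys/kernel.h. */
enum { SUB_KMEM = 0x1800000 };
enum { ORDER_FIRST = 0, ORDER_SECOND = 1, ORDER_THIRD = 2,
       ORDER_LATER = 3 /* hypothetical: any slot after ORDER_THIRD */ };

/* Hypothetical, simplified startup hook record. */
struct sysinit {
    unsigned subsystem;
    unsigned order;
    void (*func)(void);
};

static bool malloc_ready;       /* true once the allocator model is set up */
static int early_malloc_calls;  /* allocations attempted before that point */

static void model_malloc_init(void)
{
    malloc_ready = true;
}

static void model_sysctl_register_all(void)
{
    /* Populating a string tunable needs a buffer, so it allocates. */
    if (!malloc_ready)
        early_malloc_calls++;   /* where the real kernel would fault */
}

/* Sort by subsystem first, then by order within the subsystem. */
static int sysinit_cmp(const void *a, const void *b)
{
    const struct sysinit *x = a, *y = b;

    if (x->subsystem != y->subsystem)
        return (x->subsystem < y->subsystem) ? -1 : 1;
    if (x->order != y->order)
        return (x->order < y->order) ? -1 : 1;
    return 0;
}

/*
 * Run both hooks in sorted order with sysctl registration placed at
 * sysctl_order, and return how many allocations happened too early.
 */
int boot_with_sysctl_order(unsigned sysctl_order)
{
    struct sysinit table[] = {
        { SUB_KMEM, ORDER_THIRD, model_malloc_init },
        { SUB_KMEM, sysctl_order, model_sysctl_register_all },
    };
    size_t n = sizeof(table) / sizeof(table[0]);

    malloc_ready = false;
    early_malloc_calls = 0;
    qsort(table, n, sizeof(table[0]), sysinit_cmp);
    for (size_t i = 0; i < n; i++)
        table[i].func();
    return early_malloc_calls;
}
```

In this model, boot_with_sysctl_order(ORDER_FIRST) reports one too-early allocation, while boot_with_sysctl_order(ORDER_LATER) reports none. That matches the observation that the panic only shows up when a string tunable is actually set: without one, the registration step never needs to allocate.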