bge panic in 8.0

Wed Jan 20 23:42:20 UTC 2010

On Wed, Jan 20, 2010 at 03:12:51PM -0800, Erik Klavon wrote:
> On Thu, Jan 14, 2010 at 03:26:18PM -0800, Erik Klavon wrote:
> > On Wed, Jan 13, 2010 at 06:06:40PM -0800, Pyun YongHyeon wrote:
> > > On Wed, Jan 13, 2010 at 05:47:19PM -0800, Erik Klavon wrote:
> > > > One of my amd64 machines running 8.0p1 acting as a NAT system for many
> > > > network clients dropped into kdb today. tr indicates a problem in
> > > > bge.
> > > > 
> > > > Tracing pid 12 tid 100033 td 0xffffff0001687000
> > > > pmap_kextract() at pmap_kextract+0x4e
> > > > bus_dmamap_load() at bus_dmamap_load+0xab
> > > > bge_newbuf_std() at bge_newbuf_std+0xcc
> > > > bge_rxeof() at bge_rxeof+0x36a
> > > > bge_intr() at bge_intr+0x1c0
> > > > intr_event_execute_handlers() at intr_event_execute_handlers+0xfd
> > > > ithread_loop() at ithread_loop+0x8e
> > > > fork_exit() at fork_exit+0x118
> > > > fork_trampoline() at fork_trampoline+0xe
> > > > --- trap 0, rip = 0, rsp = 0xffffff8074c01d30, rbp = 0 ---
> > > > 
> > > > I haven't been able to find a PR that matches this particular trace.
> > > > 
> > > > Pyun recently MFCd to stable (hence my post to this list) some changes
> > > > to bge that involve functions in the above trace and according to the
> > > > commit log (r201685) may address a kernel panic. Is there any
> > > > indication in the above trace that this is the type of panic the
> > > > commit attempts to address? I don't have a core dump for this
> > > > panic. This machine has been unstable on 8, so I may be able to get a
> > > > core dump in the future. If there is other information you'd like me
> > > > to gather, please let me know.
> > > 
> > > Yes, the code in the trace above was rewritten to address
> > > bus_dma(9) issues, so it would be great if you could try the latest
> > > bge(4) from stable/8 and let me know how it goes on your box. I think
> > > downloading if_bge.c and if_bgereg.h from stable/8 and rebuilding
> > > bge(4) should be enough to run it on 8.0-RELEASE.
> > 
> > Great, I will try this out on a test machine today. If it holds up
> > under testing, I will put it into production. These crashes can happen
> > weeks after a machine boots, so I won't know if the problem is solved
> > for some time. Thanks for your help,
> 
> I didn't run into any problems while testing. I started running bge(4)
> from stable in production this morning. I had three kernel panics in a
> couple of hours; here's an example:
> 
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0x18
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff805ccf17
> stack pointer           = 0x28:0xffffff800004f830
> frame pointer           = 0x28:0xffffff800004f890
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0 pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 13 (ng_queue0) 
> [thread pid 13 tid 100009 ]
> Stopped at      m_copym+0x37:   movl    0x18(%r12),%eax
> 
> db> tr
> Tracing pid 13 tid 100009 td 0xffffff000189aab0
> m_copym() at m_copym+0x37
> ip_fragment() at ip_fragment+0x131
> ip_output() at ip_output+0xeec
> ip_forward() at ip_forward+0x16a
> ip_input() at ip_input+0x57d
> ng_ipfw_rcvdata() at ng_ipfw_rcvdata+0xb9
> ng_apply_item() at ng_apply_item+0x220
> ngthread() at ngthread+0x16b
> fork_exit() at fork_exit+0x118
> fork_trampoline() at fork_trampoline+0xe
> --- trap 0, rip = 0, rsp = 0xffffff800004fd30, rbp = 0 ---
> 
> I tried the kdb command 'panic' to dump core, but this command only
> produced further faults. After the third panic related to m_copym, I
> reverted to the previous version of bge(4) from 8.0p1. A couple of
> hours have passed without these panics repeating while running the
> previous version of bge(4).
> 

I guess this is a NULL pointer dereference in m_copym(9). I also
see you're using netgraph(4). Can you run the server without
netgraph(4) in your configuration and see whether that makes any
difference? I'm not familiar with netgraph(4), but other developers
may be able to comment on this.
Another way to narrow down the cause would be to try other
controllers and see whether you can reproduce the issue. But I think the
above panic is not related to bge(4).
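
To illustrate why a fault virtual address of 0x18 together with
"movl 0x18(%r12),%eax" suggests a NULL pointer: when a structure
member is read through a NULL base pointer, the faulting address is
just the member's byte offset. The sketch below uses a hypothetical
structure (fake_mbuf) whose layout is an assumption chosen for
illustration; it is not copied from the kernel sources, and the helper
names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the leading fields of a packet buffer
 * header. Pointer-sized fields are modeled as 64-bit integers so the
 * layout is the same on any host. The field order is an assumption
 * for illustration only.
 */
struct fake_mbuf {
	uint64_t next;     /* offset 0x00 */
	uint64_t nextpkt;  /* offset 0x08 */
	uint64_t data;     /* offset 0x10 */
	int32_t  len;      /* offset 0x18 */
	int32_t  flags;    /* offset 0x1c */
};

/*
 * Address that would be accessed when reading a member at byte offset
 * `off` through pointer `base`. With base == NULL the result is simply
 * the member offset.
 */
static uintptr_t
field_address(const void *base, size_t off)
{
	return ((uintptr_t)base + off);
}

/*
 * Returns 1 if a fault address falls inside the first `struct_size`
 * bytes of the address space, which is what a member access through a
 * NULL pointer produces; returns 0 otherwise.
 */
static int
looks_like_null_deref(uintptr_t fault_addr, size_t struct_size)
{
	return (fault_addr < struct_size);
}
```

With this assumed layout, field_address(NULL, offsetof(struct fake_mbuf, len))
evaluates to 0x18, the same value as the reported fault address. The
same reasoning applies to any small fault address paired with an
instruction that uses a register plus a small displacement.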

> There is a long open PR, 89070, that looks to be related to the above
> panic. I don't have any proof that these panics resulted from the
> newer version of bge(4). I haven't seen kernel panics such as these on
> any of the other machines with this same configuration.
> 
> I have seen a kernel panic on systems running 8.0p1 with a different
> stack trace than the one I posted previously that also appears to be
> related to bge(4).
> 
> Fatal trap 12: page fault while in kernel mode
> cpuid = 1; apic id = 01
> fault virtual address   = 0x28
> fault code              = supervisor write data, page not present
> instruction pointer     = 0x20:0xffffffff802cdf0e
> stack pointer           = 0x28:0xffffff8074c1ab10
> frame pointer           = 0x28:0xffffff8074c1ab70
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 12 (irq25: bge1)
> [thread pid 12 tid 100034 ]
> Stopped at      bge_rxeof+0x1be:        movq    %r15,0x28(%r14)
> 
> db> trace
> Tracing pid 12 tid 100034 td 0xffffff0001680ab0
> bge_rxeof() at bge_rxeof+0x1be
> bge_intr() at bge_intr+0x1c0
> intr_event_execute_handlers() at intr_event_execute_handlers+0xfd
> ithread_loop() at ithread_loop+0x8e
> fork_exit() at fork_exit+0x118
> fork_trampoline() at fork_trampoline+0xe

I think this is a real bug in bge(4), and I believe it was fixed in
stable.

> --- trap 0, rip = 0, rsp = 0xffffff8074c1ad30, rbp = 0 ---
> 
> Erik