Kernel panic (page fault) on 10.3-STABLE with IB & VIMAGE modules

Justin Clift justin at postgresql.org
Thu Apr 21 14:16:18 UTC 2016


Hi all,

I have been hitting a kernel panic (page fault) with the IB modules loaded
on 10.3-STABLE.  (Compiled multiple times over the last few days, all panicking.)

I spent several hours narrowing down the cause, and it's definitely a bad
interaction between the IB modules (unsure which one) and the "VIMAGE" module.

I'll fill out a bug report in a bit.  In the meantime, does the below have any
useful info in it that I can use for further investigation?  (commands taken from
https://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-gdb.html)

***********************************************************************************

root@cluster1:/usr/obj/usr/src/sys/CONNECTX # kgdb kernel.debug /var/crash/vmcore.0
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 12 (irq271: mlx4_core0)
trap number		= 12
panic: page fault
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff807263d0 at kdb_backtrace+0x60
#1 0xffffffff806e8c76 at vpanic+0x126
#2 0xffffffff806e8b43 at panic+0x43
#3 0xffffffff80b8bf3b at trap_fatal+0x36b
#4 0xffffffff80b8c23d at trap_pfault+0x2ed
#5 0xffffffff80b8b8ba at trap+0x47a
#6 0xffffffff80b71892 at calltrap+0x8
#7 0xffffffff807be1a2 at netisr_dispatch_src+0x62
#8 0xffffffff808f89fa at ipoib_cm_handle_rx_wc+0x22a
#9 0xffffffff808fcc98 at ipoib_ib_completion+0x78
#10 0xffffffff80930c43 at mlx4_cq_completion+0x63
#11 0xffffffff80933d43 at mlx4_eq_int+0x2c3
#12 0xffffffff80932fac at mlx4_msi_x_interrupt+0xc
#13 0xffffffff806b35cb at intr_event_execute_handlers+0xab
#14 0xffffffff806b3a16 at ithread_loop+0x96
#15 0xffffffff806b104a at fork_exit+0x9a
#16 0xffffffff80b71dce at fork_trampoline+0xe
Uptime: 3m47s
Dumping 485 out of 7857 MB:..4%..14%..24%..33%..43%..53%..63%..73%..83%..93%

Reading symbols from /boot/kernel/ums.ko.symbols...done.
Loaded symbols for /boot/kernel/ums.ko.symbols
#0  doadump (textdump=<value optimized out>) at pcpu.h:219
219		__asm("movq %%gs:%1,%0" : "=r" (td)
(kgdb) list *0xffffffff808f89fa
0xffffffff808f89fa is in ipoib_cm_handle_rx_wc (/usr/src/sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c:565).
560		mb->m_pkthdr.rcvif = dev;
561		proto = *mtod(mb, uint16_t *);
562		m_adj(mb, IPOIB_ENCAP_LEN);
563	
564		IPOIB_MTAP_PROTO(dev, mb, proto);
565		ipoib_demux(dev, mb, ntohs(proto));
566	
567	repost:
568		if (has_srq) {
569			if (unlikely(ipoib_cm_post_receive_srq(priv, wr_id)))
Current language:  auto; currently minimal
(kgdb) list *0xffffffff807be1a2
0xffffffff807be1a2 is in netisr_dispatch_src (/usr/src/sys/net/netisr.c:976).
971		if (dispatch_policy == NETISR_DISPATCH_DIRECT) {
972			nwsp = DPCPU_PTR(nws);
973			npwp = &nwsp->nws_work[proto];
974			npwp->nw_dispatched++;
975			npwp->nw_handled++;
976			netisr_proto[proto].np_handler(m);
977			error = 0;
978			goto out_unlock;
979		}
980	
(kgdb) list *0xffffffff80b71892
0xffffffff80b71892 is at /usr/src/sys/amd64/amd64/exception.S:238.
233		.type	calltrap,@function
234	calltrap:
235		movq	%rsp,%rdi
236		call	trap
237		MEXITCOUNT
238		jmp	doreti			/* Handle any pending ASTs */
239	
240		/*
241		 * alltraps_noen entry point.  Unlike alltraps above, we want to
242		 * leave the interrupts disabled.  This corresponds to
(kgdb) list *0xffffffff80b8b8ba
0xffffffff80b8b8ba is in trap (/usr/src/sys/amd64/amd64/trap.c:447).
442	
443			KASSERT(cold || td->td_ucred != NULL,
444			    ("kernel trap doesn't have ucred"));
445			switch (type) {
446			case T_PAGEFLT:			/* page fault */
447				(void) trap_pfault(frame, FALSE);
448				goto out;
449	
450			case T_DNA:
451				KASSERT(!PCB_USER_FPU(td->td_pcb),
(kgdb)

***********************************************************************************
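
For anyone following the trace: the listing for frame #7 (netisr.c:971-979)
is the direct-dispatch path, where the per-protocol handler is called through
a function pointer in the context of the caller (here the mlx4 interrupt
thread).  Below is a minimal user-space sketch of that pattern.  It is not
kernel code: every sketch_* name is hypothetical, and the NULL/range check is
something I added for illustration (the listed lines show no check right
before the call; earlier lines are not shown).  It does not reproduce the
panic or identify its cause.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real netisr structures. */
#define SKETCH_PROTO_MAX 4

struct sketch_pkt {
	int len;			/* payload length in bytes */
};

typedef void (*sketch_handler_t)(struct sketch_pkt *);

struct sketch_proto {
	sketch_handler_t handler;	/* plays the role of np_handler */
	unsigned long dispatched;	/* plays the role of nw_dispatched */
	unsigned long handled;		/* plays the role of nw_handled */
};

static struct sketch_proto sketch_protos[SKETCH_PROTO_MAX];
static int sketch_bytes_seen;

/* Example handler: just accumulates the packet length. */
static void
sketch_count(struct sketch_pkt *p)
{
	sketch_bytes_seen += p->len;
}

/* Register a handler for a protocol slot; returns 0 on success, -1 on error. */
static int
sketch_register(unsigned int proto, sketch_handler_t h)
{
	if (proto >= SKETCH_PROTO_MAX || h == NULL)
		return (-1);
	sketch_protos[proto].handler = h;
	return (0);
}

/*
 * Direct dispatch: run the handler immediately in the caller's context
 * instead of queueing the packet.  The range and NULL checks are an
 * assumption added for this sketch.
 */
static int
sketch_dispatch(unsigned int proto, struct sketch_pkt *p)
{
	if (proto >= SKETCH_PROTO_MAX || sketch_protos[proto].handler == NULL)
		return (-1);
	sketch_protos[proto].dispatched++;
	sketch_protos[proto].handled++;
	sketch_protos[proto].handler(p);
	return (0);
}
```

Dispatching to a slot with no registered handler returns -1 in this sketch;
after sketch_register(1, sketch_count), a dispatch on slot 1 runs the handler
straight away and bumps both counters.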

Regards and best wishes,

Justin Clift

--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi



More information about the freebsd-infiniband mailing list