kern/88725: netinet6 updates in -CURRENT cause panic when using user-level ppp

Wed Nov 9 05:20:26 PST 2005

>Number:         88725
>Category:       kern
>Synopsis:       netinet6 updates in -CURRENT cause panic when using user-level ppp
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Nov 09 13:20:19 GMT 2005
>Closed-Date:
>Last-Modified:
>Originator:     Victor Snezhko
>Release:        7.0-CURRENT
>Organization:
IndorSoft Ltd.
>Environment:
FreeBSD freebsd.indorsoft.ru 7.0-CURRENT FreeBSD 7.0-CURRENT #12: Sat Nov  5 19:24:55 NOVT 2005     root@freebsd.indorsoft.ru:/home/vvs/obj/usr/src/sys/VVS  i386
Sources were cvsupped as of 2005.10.21.16.25.00; the problem is still present as of 2005.11.06.
I use a custom kernel config, but the problem also occurs with GENERIC.
The problem is reproducible at least on i386 (including in a virtual machine) and on amd64.
>Description:
The changes to netinet6 committed on 2005.10.21.16.23.01 break user-level ppp.
After these changes, the kernel panics when I start /usr/sbin/ppp. Here is the backtrace analysis:

/var/crash # kgdb /usr/obj/usr/src/sys/VVS/kernel /var/crash/vmcore.27
[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd".

Unread portion of the kernel message buffer:
kernel trap 12 with interrupts disabled


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0xdeadc0e6
fault code		= supervisor read, page not present
instruction pointer	= 0x20:0xc066c182
stack pointer	        = 0x28:0xc6082cc0
frame pointer	        = 0x28:0xc6082ce8
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, def32 1, gran 1
processor eflags	= resume, IOPL = 0
current process		= 36 (swi4: clock sio)
panic: from debugger
cpuid = 0
Uptime: 1m25s
Dumping 63 MB (3 chunks)
  chunk 0: 1MB (159 pages) ... ok
  chunk 1: 62MB (15856 pages) 46 30 14 ... ok
  chunk 2: 1MB (256 pages)

#0  doadump () at pcpu.h:165
165	pcpu.h: No such file or directory.
	in pcpu.h
(kgdb) bt
#0  doadump () at pcpu.h:165
#1  0xc0660824 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xc0660b39 in panic (fmt=0xc0856f00 "from debugger")
    at /usr/src/sys/kern/kern_shutdown.c:555
#3  0xc046cee1 in db_panic (addr=-1067007614, have_addr=0, count=-1, 
    modif=0xc6082abc "") at /usr/src/sys/ddb/db_command.c:434
#4  0xc046ce78 in db_command (last_cmdp=0xc0947984, cmd_table=0x0, 
    aux_cmd_tablep=0xc08bd97c, aux_cmd_tablep_end=0xc08bd998)
    at /usr/src/sys/ddb/db_command.c:403
#5  0xc046cf40 in db_command_loop () at /usr/src/sys/ddb/db_command.c:454
#6  0xc046eb59 in db_trap (type=12, code=0) at /usr/src/sys/ddb/db_main.c:221
#7  0xc06793a4 in kdb_trap (type=12, code=0, tf=0xc6082c80)
    at /usr/src/sys/kern/subr_kdb.c:473
#8  0xc0821ac8 in trap_fatal (frame=0xc6082c80, eva=3735929062)
    at /usr/src/sys/i386/i386/trap.c:846
#9  0xc0821152 in trap (frame=
      {tf_fs = 8, tf_es = 40, tf_ds = 40, tf_edi = -1054618496, tf_esi = -1054756736, tf_ebp = -972542744, tf_isp = -972542804, tf_ebx = 1, tf_edx = -1030106232, tf_ecx = -559038242, tf_eax = 83559, tf_trapno = 12, tf_err = 0, tf_eip = -1067007614, tf_cs = 32, tf_eflags = 589826, tf_esp = -1054618496, tf_ss = 0})
    at /usr/src/sys/i386/i386/trap.c:269
#10 0xc080ec2a in calltrap () at /usr/src/sys/i386/i386/exception.s:139
#11 0xc066c182 in softclock (dummy=0x0)
    at /usr/src/sys/kern/kern_timeout.c:220
#12 0xc064e260 in ithread_loop (arg=0xc121b080)
    at /usr/src/sys/kern/kern_intr.c:547
#13 0xc064d668 in fork_exit (callout=0xc064e118 <ithread_loop>, 
    arg=0xc121b080, frame=0xc6082d38) at /usr/src/sys/kern/kern_fork.c:789
#14 0xc080ec8c in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:208
(kgdb) up 11
#11 0xc066c182 in softclock (dummy=0x0)
    at /usr/src/sys/kern/kern_timeout.c:220
220				if (c->c_time != curticks) {
(kgdb) list
215			curticks = softticks;
216			bucket = &callwheel[curticks & callwheelmask];
217			c = TAILQ_FIRST(bucket);
218			while (c) {
219				depth++;
220				if (c->c_time != curticks) {
221					c = TAILQ_NEXT(c, c_links.tqe);
222					++steps;
223					if (steps >= MAX_SOFTCLOCK_STEPS) {
224						nextsoftcheck = c;
(kgdb) print c
$1 = (struct callout *) 0xdeadc0de
(kgdb) print *bucket
$2 = {tqh_first = 0xc1644020, tqh_last = 0xc1644020}
(kgdb) print steps
$3 = 1
(kgdb) print *(bucket->tqh_first)
$4 = {c_links = {sle = {sle_next = 0xdeadc0de}, tqe = {tqe_next = 0xdeadc0de, 
      tqe_prev = 0xdeadc0de}}, c_time = -559038242, c_arg = 0xdeadc0de, 
  c_func = 0xdeadc0de, c_mtx = 0xdeadc0de, c_flags = -559038242}
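
The fault address follows from the values printed above: c holds the junk value 0xdeadc0de, and line 220 reads c->c_time. Below is a minimal sketch of that arithmetic. It assumes the i386 layout (4-byte pointers) and uses a hypothetical mirror struct, callout_i386, that copies only the leading fields kgdb printed; it is not the kernel's struct callout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Junk pattern seen in every field of the trashed callout above. */
#define JUNK 0xdeadc0deu

/*
 * Hypothetical mirror of the leading fields of struct callout as printed
 * by kgdb, with pointers modelled as 32-bit integers (i386 assumption).
 */
struct callout_i386 {
	union {
		struct { uint32_t sle_next; } sle;
		struct { uint32_t tqe_next; uint32_t tqe_prev; } tqe;
	} c_links;
	int32_t c_time;
};

/* Address touched by reading c->c_time when c holds the given value. */
static uint32_t
c_time_address(uint32_t c)
{
	/* c_links spans two 32-bit pointers, so c_time sits at offset 8. */
	return c + (uint32_t)offsetof(struct callout_i386, c_time);
}

/* The junk pattern reinterpreted as a signed 32-bit integer. */
static int32_t
junk_as_int(void)
{
	uint32_t u = JUNK;
	int32_t s;

	memcpy(&s, &u, sizeof(s));
	return s;
}
```

Under these assumptions c_time_address(JUNK) gives 0xdeadc0e6, the reported fault virtual address (eva=3735929062 in frame #8), and junk_as_int() gives -559038242, the value shown for c_time, c_flags and tf_ecx.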



The following patch from John Baldwin (intended for testing only) does not help; the symptoms remain the same:

Index: nd6.c
===================================================================
RCS file: /usr/cvs/src/sys/netinet6/nd6.c,v
retrieving revision 1.62
diff -u -r1.62 nd6.c

--- nd6.c       22 Oct 2005 05:07:16 -0000      1.62
+++ nd6.c       3 Nov 2005 19:56:42 -0000
@@ -398,7 +398,7 @@
        if (tick < 0) {
                ln->ln_expire = 0;
                ln->ln_ntick = 0;
-               callout_stop(&ln->ln_timer_ch);
+               callout_drain(&ln->ln_timer_ch);
        } else {
                ln->ln_expire = time_second + tick / hz;
                if (tick > INT_MAX) {
======================================================================

I made two attempts to find the cause of the callwheel corruption:
1) I wrote a checking function that searches the callwheel for corrupted entries and panics if it finds any. This function was called from every place in kern/kern_timeout.c that could modify the callwheel. No success: the callwheel is being modified elsewhere.
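
The checking function from attempt 1 is not included in this report. The following is only a userland sketch of the idea, not the code that was run. The names fake_callout, wheel, trash() and wheel_find_corruption() are hypothetical, the wheel size is arbitrary, and "trashing" is simulated by overwriting a still-valid object with the junk pattern instead of freeing it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/queue.h>

#define WHEEL_SIZE 8            /* hypothetical size for the sketch */
#define JUNK32     0xdeadc0deu  /* junk pattern seen in the dump */

/* Hypothetical stand-in for struct callout: a list link and a time. */
struct fake_callout {
	TAILQ_ENTRY(fake_callout) link;
	int c_time;
};
TAILQ_HEAD(fake_tailq, fake_callout);

static struct fake_tailq wheel[WHEEL_SIZE];

static void
wheel_init(void)
{
	for (int i = 0; i < WHEEL_SIZE; i++)
		TAILQ_INIT(&wheel[i]);
}

/*
 * Overwrite an element with the junk pattern, one 32-bit word at a time.
 * The memory itself stays valid; only its contents are destroyed.
 */
static void
trash(struct fake_callout *c)
{
	const uint32_t junk = JUNK32;
	unsigned char *p = (unsigned char *)c;

	for (size_t off = 0; off + sizeof(junk) <= sizeof(*c); off += sizeof(junk))
		memcpy(p + off, &junk, sizeof(junk));
}

/* Heuristic: the low 32 bits of the pointer carry the junk pattern. */
static bool
looks_trashed(const void *p)
{
	return ((uintptr_t)p & 0xffffffffu) == JUNK32;
}

/* Return the index of the first bucket with a trashed link, or -1. */
static int
wheel_find_corruption(void)
{
	for (int i = 0; i < WHEEL_SIZE; i++) {
		struct fake_callout *c = TAILQ_FIRST(&wheel[i]);

		while (c != NULL) {
			/* Test the pointer before it is ever dereferenced. */
			if (looks_trashed(c))
				return i;
			c = TAILQ_NEXT(c, link);
		}
	}
	return -1;
}
```

In the kernel, a check like this would run with callout_lock held and would panic or enter the debugger instead of returning an index. It can only report corruption at the moment it is called, which is consistent with it not catching a modification made outside kern_timeout.c.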

2) I extended trash_dtor() in vm/uma_dbg.c as follows, to find which element of the callwheel is freed before being disarmed.
(Warning: the pointer casts/comparisons in this patch may not be 64-bit-ready.)
--- uma_dbg.c.orig	Mon Nov  7 23:05:09 2005
+++ uma_dbg.c	Tue Nov  8 17:37:24 2005
@@ -41,6 +41,8 @@
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/malloc.h>
+#include <sys/callout.h>
+#include <sys/kdb.h>
 
 #include <vm/vm.h>
 #include <vm/vm_object.h>
@@ -86,8 +88,33 @@
 {
 	int cnt;
 	u_int32_t *p;
+	struct callout *c;
+	struct callout_tailq *bucket;
+	int i;
 
 	cnt = size / sizeof(uma_junk);
+
+	mtx_lock_spin(&callout_lock);
+ 
+	for (i = 0; i < callwheelsize; ++i) {
+		bucket = &callwheel[i];
+		for (c = TAILQ_FIRST(bucket); c != NULL;
+		     c = TAILQ_NEXT(c, c_links.tqe)) {
+			long c2 = (long)c;
+			long mem2 = (long)mem;
+			if ((u_int32_t)c == uma_junk) {
+				kdb_enter("trash_dtor: uma_junk found in a "\
+					  "callwheel element");
+				break;
+			}
+			if (c2 >= mem2 && c2 < mem2 + size) {
+				kdb_enter("trash_dtor: found invalid "\
+					  "callwhel element");
+			}
+		}
+	}
+
+	mtx_unlock_spin(&callout_lock);
 
 	for (p = mem; cnt > 0; cnt--, p++)
 		*p = uma_junk;
======================================================================
With this patch applied, kdb_enter() is called here:
	if ((u_int32_t)c == uma_junk) {
==>		kdb_enter("trash_dtor: uma_junk found in a "\
			  "callwheel element");

I.e., this check finds a callwheel element that has already been freed and filled with uma_junk.
There is a side effect: with this last patch applied, the panic is much harder to reproduce. When the panic does not occur, ppp works.
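
Regarding the 64-bit warning on the trash_dtor() patch: the two tests it performs could be written with uintptr_t instead of long and u_int32_t casts. This is a sketch only; ptr_in_region() and ptr_is_junk() are hypothetical helper names. It assumes that on a 64-bit machine a trashed pointer reads as the 32-bit junk word repeated twice, since trash_dtor() fills memory in 32-bit words.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if p points into the size-byte region that starts at mem. */
static bool
ptr_in_region(const void *p, const void *mem, size_t size)
{
	uintptr_t a = (uintptr_t)p;
	uintptr_t base = (uintptr_t)mem;

	/* The subtraction cannot wrap because a >= base is tested first. */
	return a >= base && a - base < size;
}

/* True if the pointer's bits consist solely of the 32-bit junk word. */
static bool
ptr_is_junk(const void *p, uint32_t junk)
{
	uintptr_t v = (uintptr_t)p;

#if UINTPTR_MAX > 0xffffffffu
	/* 64-bit pointers: two consecutive junk words. */
	return v == (((uintptr_t)junk << 32) | junk);
#else
	/* 32-bit pointers: a single junk word. */
	return v == junk;
#endif
}
```

With these helpers the two conditions in the patch would read ptr_is_junk(c, uma_junk) and ptr_in_region(c, mem, size), with no truncating casts.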

>How-To-Repeat:
cvsup to -CURRENT as of 2005.10.21.16.25.00 or later, then rebuild and install the kernel using the GENERIC config.
Boot the new kernel and start /usr/sbin/ppp.
A few seconds after start (up to 3 on my Celeron-600), once the callwheel in kern/kern_timeout.c has been cycled through, the panic occurs.

>Fix:
There is no fix yet, only a workaround: disabling INET6 in the kernel config avoids the panic.

>Release-Note:
>Audit-Trail:
>Unformatted: