Repeated similar panics on -STABLE

Wed Apr 2 07:13:19 PST 2003

Dmitry Sivachenko wrote:
> We have three machines under relatively high load.  They are running -STABLE
> on the same hardware with 2 processors (and SMP kernel).
> Periodically (approximately once a week) they panic with similar symptoms:

[ ... ]

Panic.

> #18 0xc0162549 in panic (fmt=0xc028e3b9 "%s")
>     at /mnt/se3/releng_4/src/sys/kern/kern_shutdown.c:595
> #19 0xc0251b1a in trap_fatal (frame=0xeb278e04, eva=1558020096)
>     at /mnt/se3/releng_4/src/sys/i386/i386/trap.c:974
> #20 0xc0251775 in trap_pfault (frame=0xeb278e04, usermode=0, eva=1558020096)
>     at /mnt/se3/releng_4/src/sys/i386/i386/trap.c:867
> #21 0xc02512b7 in trap (frame={tf_fs = -1072300008, tf_es = -361627632,
>       tf_ds = 16, tf_edi = -1070989600, tf_esi = -349729108,
>       tf_ebp = -349729176, tf_isp = -349729232, tf_ebx = -1070870564,
>       tf_edx = 1558020096, tf_ecx = 7, tf_eax = 128, tf_trapno = 12,
>       tf_err = 0, tf_eip = -1072309505, tf_cs = 8, tf_eflags = 66054,
>       tf_esp = 0, tf_ss = -349729108})
>     at /mnt/se3/releng_4/src/sys/i386/i386/trap.c:466

Page not present error.

> #22 0xc015daff in malloc (size=72, type=0xc029fee0, flags=0)
>     at /mnt/se3/releng_4/src/sys/kern/kern_malloc.c:243

The source code did not check for an allocation failure at this point;
probably the kbp list had just been refilled, and while you were in
the failing malloc call, the list was emptied again.

What this generally means is that KVA was exhausted, and the
caller did not expect that.

To work around: don't exhaust the KVA space; you have probably tuned
some kernel parameter way too high.

To fix: at line 243, you need to check if va is NULL; if it is,
you need to check whether the caller asked for M_WAITOK (that is,
M_NOWAIT is not set), and if so, restart the allocation.
This has to be done before the next line, where "va" is dereferenced.

Maybe something like:

Change:
	va = kbp->kb_next;
	kbp->kb_next = ((struct freelist *)va)->next;

To:

	va = kbp->kb_next;
	if (va == NULL) {
		if (flags & M_NOWAIT) {
			splx(s);
			return ((void *) NULL);
		}
		goto restart;	/* put this label above the "while" */
	}
	kbp->kb_next = ((struct freelist *)va)->next;

Working around the problem is easier (IMO): just change your tuning
parameters to avoid running out of KVA.  Probably your mbufs or
mbuf clusters are way too large for your amount of physical RAM;
remember that, except in very special circumstances, kernel memory
is non-pageable.

> #23 0xc015a3fe in exit1 (p=0xea726820, rv=15)
>     at /mnt/se3/releng_4/src/sys/kern/kern_exit.c:166

It was trying to allocate a "zombie" structure.

> #24 0xc0164011 in sigexit (p=0xea726820, sig=15)
>     at /mnt/se3/releng_4/src/sys/kern/kern_sig.c:1503

This was for a process that someone had sent a SIGTERM to, to kill it.

> #25 0xc0163d9c in postsig (sig=15)
>     at /mnt/se3/releng_4/src/sys/kern/kern_sig.c:1406
> #26 0xc0251fc5 in syscall2 (frame={tf_fs = 47, tf_es = 47, tf_ds = 47,
>       tf_edi = 174, tf_esi = 1049187701, tf_ebp = -1077936960,
>       tf_isp = -349728812, tf_ebx = 1, tf_edx = 3, tf_ecx = -1078002496,
>       tf_eax = 3, tf_trapno = 7, tf_err = 2, tf_eip = 672039098, tf_cs = 31,
>       tf_eflags = 659, tf_esp = -1078069180, tf_ss = 47})
>     at /mnt/se3/releng_4/src/sys/i386/i386/trap.c:174

Looks like you caused a floating point exception, and died when
exit1 failed to create a zombie structure for the process.

-- Terry