Repeated similar panics on -STABLE

Sun Apr 20 04:04:53 PDT 2003

On Sun, Apr 20, 2003 at 03:18:05AM -0700, Don Lewis wrote:
> On 20 Apr, Dmitry Sivachenko wrote:
> > On Sun, Apr 20, 2003 at 01:16:16AM -0700, Don Lewis wrote:
> 
> >> If kbp is pointing to a non-existent page, why does Terry's patch seem
> >> to fix the problem for you?
> > 
> > Well, there is probably a misunderstanding here.
> > We did NOT apply Terry's patch.  Let me quote a bit from my e-mail to Terry:
> > 
> > TL> Did my patch fix your problem?
> > TL>
> > TL> Or did you tune your kernel, as I suggested, to fix your problem?
> > TL>
> > TL> Or is it still a problem?
> > 
> > DS>We changed maxusers from 512 to 0 and decreased the number of
> > DS>NMBCLUSTERS.  Now everything is working fine, but since these panics occurred
> > DS>about once a week I can't say for sure they are completely gone.
> > DS>Let's wait at least one more week...
> > 
> > What I meant to say is that we only tuned maxusers and NMBCLUSTERS.  We
> > run a stock -STABLE kernel without any patches.  My English probably leaves
> > much to be desired ;-((
> 
> Your English seems just fine to me.
> 
> I just got the impression from Terry that the patch is what fixed the
> problem for you.
> 
> 
> >> I wonder if things are getting further munged after the trap occurs?
> >> That would make it more difficult to track down the problem from the
> >> core file.
> >> 
> >> Something else of interest to print is
> >> 	bucket[7]
> >> bucket[7].kb_next and bucket[7].kb_last might shed some light.
> >> 
> > 
> > (kgdb) up 22
> > #22 0xc015daff in malloc (size=72, type=0xc029fee0, flags=0)
> >     at /mnt/se3/releng_4/src/sys/kern/kern_malloc.c:243
> > 243             va = kbp->kb_next;
> > (kgdb) p bucket[7]
> > $1 = {kb_next = 0x5cdd8000 <Address 0x5cdd8000 out of bounds>,
> >   kb_last = 0xc8fcb000 "", kb_calls = 2127276, kb_total = 4256,
> >   kb_elmpercl = 32, kb_totalfree = 1264, kb_highwat = 160, kb_couldfree = 5497}
> > (kgdb) p bucket[7].kb_next
> > $2 = 0x5cdd8000 <Address 0x5cdd8000 out of bounds>
> > (kgdb) p bucket[7].kb_last
> > $3 = 0xc8fcb000 ""
> > (kgdb)
> 
> That explains quite a bit.  The free list is somehow getting
> corrupted. That's why the 0x5cdd8000 value shows up in both stack
> frames. The value of kb_last looks ok, though.  Because kb_next is not
> NULL, we skip the "if" block that allocates more memory and proceed to
> line 243. Gdb is lying a bit, though: the trap isn't happening on line
> 243; va is just getting the garbage value there.  The trap is actually
> happening on the next line when we try to dereference this garbage
> pointer:
> 	kbp->kb_next = ((struct freelist *)va)->next;
> 
> It sure would be nice to know the source of this weird value.  It's
> obviously not a pointer, but it's not obvious to me what it might be.
> 
> It sure looks to me like something is writing to memory that has already
> been put back on the free list and is stomping on the next pointer in
> one of the memory blocks on the list.  When this block gets allocated
> again, malloc() does:
> 	va = kbp->kb_next;
> 	kbp->kb_next = ((struct freelist *)va)->next;
> and stores the garbage in kb_next, where we trip over it on the next
> allocation.
> 
> 
> >> One other question ... is your kernel compiled with INVARIANTS?  That
> >> changes the definition of struct freelist.
> > 
> > Without.
> 
> This could potentially be difficult to track down.  Probably the best
> bet is to go back to the previous configuration and compile with
> INVARIANTS and hope that this will catch the problem a bit closer to the
> source.
> 

OK, I'll restore our previous configuration tomorrow and add INVARIANTS.

I'll let you know when a fresh crash dump is ready.
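
For anyone following the thread, the failure mode Don describes in the quoted
text (a write to a block that is already back on the free list, which only
shows up one allocation later) can be illustrated with a small user-space
sketch. This is not the kern_malloc.c code. The names bucket_sketch,
sketch_malloc, sketch_free, pool, in_pool and demo_corruption are hypothetical,
the push-at-head ordering is a simplification that may not match the kernel,
and the only value taken from the dump is 0x5cdd8000.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 128                      /* arbitrary size for the sketch */
#define NBLOCKS    3
#define GARBAGE    ((uintptr_t)0x5cdd8000u) /* value seen in kb_next in the dump */

/* Hypothetical stand-in for a bucket: just the head of the free list. */
struct bucket_sketch {
    char *kb_next;
};

static char pool[NBLOCKS][BLOCK_SIZE];

/* True if p points at the start of one of the pool blocks. */
static int in_pool(const char *p)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)&pool[0][0];
    return a >= lo && a < lo + sizeof pool && (a - lo) % BLOCK_SIZE == 0;
}

/* The first word of a free block holds the link to the next free block. */
static char *get_link(const char *block)
{
    char *next;
    memcpy(&next, block, sizeof next);
    return next;
}

static void set_link(char *block, char *next)
{
    memcpy(block, &next, sizeof next);
}

/* Simplification: push the block at the head of the list. */
static void sketch_free(struct bucket_sketch *kbp, char *addr)
{
    assert(in_pool(addr));
    set_link(addr, kbp->kb_next);
    kbp->kb_next = addr;
}

static char *sketch_malloc(struct bucket_sketch *kbp)
{
    char *va = kbp->kb_next;     /* only copies the pointer, cannot fault */
    if (va == NULL)
        return NULL;
    kbp->kb_next = get_link(va); /* dereferences va; a bad va faults here */
    return va;
}

/*
 * Models the suspected bug and returns how many allocations still succeed
 * after the stale write, before the list head stops being a valid block.
 */
static int demo_corruption(struct bucket_sketch *kbp)
{
    kbp->kb_next = NULL;
    for (int i = 0; i < NBLOCKS; i++)
        sketch_free(kbp, pool[i]);

    char *p = sketch_malloc(kbp);
    sketch_free(kbp, p);         /* p is now stale but still held */

    /* The bug: a write through the stale pointer clobbers the link word. */
    uintptr_t garbage = GARBAGE;
    memcpy(p, &garbage, sizeof garbage);

    int ok = 0;
    while (kbp->kb_next != NULL && in_pool(kbp->kb_next)) {
        sketch_malloc(kbp);      /* hands out p and loads garbage into kb_next */
        ok++;
    }
    return ok;                   /* the next sketch_malloc() would fault */
}
```

Calling demo_corruption() on a struct bucket_sketch from a small main() shows
the delay: the allocation that hands out the clobbered block succeeds and
leaves the garbage in kb_next, and it is the following allocation that would
trap. The sketch stops before that dereference instead of crashing, which is
also why the faulting stack frame says little about who did the stale write.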