Repeated similar panics on -STABLE

Sat Apr 19 01:59:32 PDT 2003

On 18 Apr, Terry Lambert wrote:
> Don Lewis wrote:
>> On 18 Apr, Terry Lambert wrote:
>> > Since the thread occurred on -hackers, thought I'd post here, too.
>> >
>> > The thread (see subject) around 02 Apr 2003 demonstrated a panic
>> > in kern_malloc.c in malloc() as a result in the increased default
>> > KVA space in the 4.x branch.
>> >
>> > I just posted a patch to -stable.
>> >
>> > If there is a committer interested in fixing this problem for
>> > people who doesn't read -stable, they may want to grab it and
>> > commit it.
>> 
>> The patch looks ok, but I don't understand the failure mechanism.  If
>> kbp->kb_next is initially NULL, then kmem_malloc() should get called. It
>> doesn't look possible for kmem_malloc() to return NULL in the !M_NOWAIT
>> case with the kmem_map.  It looks like kmem_malloc() will either retry
>> or panic.
> 
> It's hard to see.  But it can happen.  The way it happens is to
> take a fault that causes the freep to be NULL in the case that
> kbp->kb_last == kbp->kb_next.
> 
> We know it happens, because the "va" dereference fails in the
> same place on several machines, and it's set to kbp->kb_next
> immediately before the failure.
> 
> If you hear hoofbeats, look for horses, but once you've eliminated
> the possibility of horses, well, zebras are a good guess.  8-).

Ok, here's a Reader's Digest version of the code in question:

        if (kbp->kb_next == NULL) {
                kbp->kb_last = NULL;
                if (size > MAXALLOCSAVE)
                        allocsize = roundup(size, PAGE_SIZE);
                else
                        allocsize = 1 << indx;
                npg = btoc(allocsize);
                va = (caddr_t) kmem_malloc(kmem_map, (vm_size_t)ctob(npg), flags
);
                /*
                 * Just in case we blocked while allocating memory,
                 * and someone else also allocated memory for this
                 * bucket, don't assume the list is still empty.
                 */
                savedlist = kbp->kb_next;
                kbp->kb_next = cp = va + (npg * PAGE_SIZE) - allocsize;
                for (;;) {
                        freep = (struct freelist *)cp;
                        if (cp <= va)
                                break;
                        cp -= allocsize;
                        freep->next = cp;
                }
                freep->next = savedlist;
                if (kbp->kb_last == NULL)
                        kbp->kb_last = (caddr_t)freep;
        }
        va = kbp->kb_next;
        kbp->kb_next = ((struct freelist *)va)->next;

> Basically, if I allocate 1 page, put it on the freelist, then it
> gets grabbed by something else meanwhile, then cp <= va is true,
> and it breaks with kbp->kb_next = NULL.

This whole section of code is protected by splmem() except that it may
sleep inside kmem_malloc(), so kbp->kb_next and kbp->kb_last could be
modified by another process during that time.  freep is a local variable
and from the time we update kbp->kb_next with the new allocation through
the end of the "for" loop and beyond is protected by splmem(), so
nothing else should be able to steal anything from the free list during
this time.

> This can happen for zalloci() zones, for which there are page
> mappings, but no backing pages (mbufs, whatever).  All that's
> required is that you run out of physical RAM before you run out
> of kmem_map (a distinct possibility with a large KVA, and a
> stair function in the scaling -- or even if you just follow the
> handbook, and increase the number of mbufs/nmbclusters).

That should just cause kmem_malloc() to block.

> What the patch basically does is forces the caller to sleep
> until it's possible to satisfy the request (yeah, this could be
> "never", but it's more likely to be a transient condition.

Yup, I understand that part.

> It didn't happen with the smaller KVA, because the page mapping
> overhead vs. pages ratio vs. users of zalloci() happened to keep
> things happy.  Plus you have to have a pretty big load to see it.

>> Something else that bothers me is
>> 
>> >> fault virtual address   = 0x5cdd8000
>> 
>> If the trap is being caused by va being NULL, I'd expect the fault
>> adress to be 0x12 because the next member of struct freelist is at
>> offset 12, at least for the i386 architecture.
> 
> No; the problem is kbp->kb_next is NULL, not kbp itself.  Consider
> if "cp -= allocsize" is zero.  No matter what va is, that's going
> to drop out of the "for(;;)" loop.

I don't see how that can happen.  allocsize will either be a power of
two that is less than PAGE_SIZE and npg will be 1, or allocsize will be
some multiple of the page size.  In the first case
	kbp->kb_next = cp = va + (npg * PAGE_SIZE) - allocsize
will be > va and the "for" loop will iterate a power of two times, and
on the last iteration cp == va.  In the second case,
	allocsize == (npg * PAGE_SIZE)
so that cp and kbp->kb_next will be equal to va and the "for" loop will
terminate on its first iteration.

Also, if "cp -= allocsize" is zero, this happens after the loop
termination test so we'd start the next iteration and do the
	freep = (struct freelist *)cp;
assignment and then blow up on this assignment
	freep->next = savedlist;
immediately after the "for" loop.

This code bothers me, though, because if allocsize ever happened to not
evenly divide (npg * PAGE_SIZE), the loop termination condition would be
wrong and the end of the previous page (or more) would end up on the
free list.  If va were small enough and allocsize were big enough, it
would even be possible to wrap around.