cvs commit: src/sys/i386/i386 pmap.c

Peter Wemm peter at
Sat Apr 29 17:21:15 UTC 2006

On Friday 28 April 2006 10:36 pm, Nate Lawson wrote:
> Peter Jeremy wrote:
> > On Fri, 2006-Apr-28 14:22:34 -0700, Peter Wemm wrote:
> >> On Friday 28 April 2006 12:05 pm, Peter Wemm wrote:
> >>>   ups@ had a truly evil idea that I'll investigate.  It should
> >>> allow freeing unused pages again by giving us a no-cost way to
> >>> track the holes in the kva block.
> >>
> >> FWIW, this idea appears to work.  For the curious:
> >>
> >
> > Care to explain how this works in slightly more detail than "truly
> > evil".
> The PTE (page-table entry) is an appropriate size to store a pointer.
> So you link them all together in a freelist.

In more detail.  There is one PTE per page of virtual address space.  It 
is used by the hardware to translate a virtual address access into a 
physical address.  One normally stores the corresponding physical 
address in there, along with various control bits (read/write mode, etc.) 
and, most importantly, the 'valid' (PG_V) bit.  This tells the CPU that 
the physical address it found is valid.

The problem we have is that we have a block of virtual address space.  
If we allow random pages to be freed and unmapped from that block, 
there is no easy way to keep track of what has been freed.  You can't 
thread a freelist through the unused blocks like one would do with 
malloc(3), because we'd like to return the pages if possible and you 
can't store something in nothing.  Storing a 'first free' pointer 
doesn't scale because we free pages randomly, and after the first-free 
has been used, we have no way to find the next one without an 
exhaustive search.  We could use a tree or list structure to track the 
free address space, except that requires memory allocation, which we 
cannot do from this level of pmap.  We could reserve another tailq and 
length in the headers of the pv chunks, and use the chunk preceding a 
freed block to record the length of the free space and then use the 
tailq to link the blocks together.  This is non-trivial code though and 
incurs a constant runtime cost maintaining it.  A bitmap doesn't scale 
because there are potentially hundreds of thousands of entries - an 
exhaustive search there hurts.  And so on.

Stephan realized that the kernel already allocates one PTE per virtual 
page.  Although it normally holds physical addresses plus attributes, 
as long as we don't set PG_V, then there are 31 other bits that we 
could use for data storage.  We could put virtual addresses in there so 
long as we didn't set PG_V, and abuse that to have a singly linked 
freelist threaded through the PTEs.  It turned out even easier though.  
As long as the virtual addresses are page aligned, we neatly avoid all 
the PG_* mode bits as well.  We can avoid locking and atomic ops when 
updating these because they are only accessed under the page queues 
mutex.  We can avoid doing pte invalidations (tlb shootdowns) because 
there can never be any tlb entries corresponding to them. 

So I wrapped this in pmap_ptelist_{alloc/free/init} functions and it 
turned out quite neat and tidy.  And it's virtually free - we can grab a 
page of address space by unlinking the head of the ptelist.  Freeing a 
page is as simple as storing it on the head.  I'm running my laptop 
with that code right now.
Peter Wemm
"All of this is for nothing if we don't go to the stars" - JMS/B5
