tlb, tsb & ...stuff

Tue Apr 15 14:22:15 PDT 2003

On Mon, 14 Apr 2003, Jake Burkholder wrote:

> [ ... ]
> > >
> > > With 3 levels this can support a 32 gigabyte virtual address space (fully
> > > resident), but doesn't penalize processes with sparse address spaces due
> > > to not requiring intermediate "page table pages" in all cases.  Basically
> > > like traditional page tables but backwards :).
> > >
> >
> > This would have rather bad locality of access though, no? The conflict
> > side of the thing is resolvable (to an extent) using it as a sum or xor
> > addressed cache. I guess I should whip up some code to show what I mean.
> > Missing in L2 can delay you for say 100 cycles. Also, this scheme causes
> > the pages to accumulate with basically no eviction of 'visit once' pages.
>
> Yes, you're right.  The current scheme has ok cache locality and the
> eviction properties are nice.  How does an xor addressed cache work?
> I would be interested to see a simple example.  The hash function needs
> to be as simple as possible because it needs to fit in 1 or 2 instructions
> in the tlb fault handlers.  So '&' is attractive but may not have great
> properties.
>

xor and sum addressed caches rely on additional information beyond just
the address bits that cover the cache size to compute the index, so simple
examples might be:

xor:	&pm->tsb[((vpn >> (TTE_SHIFT + TSB_SHIFT)) ^
		 (vpn >> TTE_SHIFT)) & TSB_MASK]

sum:	&pm->tsb[((vpn >> (TTE_SHIFT + TSB_SHIFT)) +
		 (vpn >> TTE_SHIFT)) & TSB_MASK]

so essentially it would add a shift and an add/xor. The aim is to make most
of the power-of-two and other obvious conflict patterns non-conflicting
in the cache. An xor based cache has some interesting properties (it doesn't
matter whether you use n consecutive or "conflicting" entries, the first
n-1 next conflicts get mapped to different lines) and it's also nice in a
skew cache; the expense is bad locality. With a four-way cache, sum might
be better when you consider locality.
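
To make the conflict behaviour concrete, here is a rough C sketch (not
taken from the pmap code) that compares plain '&' indexing with the xor
and sum variants. It assumes vpn is already a page number, so TTE_SHIFT
drops out, and it picks a made-up TSB_SHIFT of 9; idx_and, idx_xor,
idx_sum and distinct_slots are names invented for this illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical size for illustration: a TSB with 2^9 = 512 entries. */
#define TSB_SHIFT 9u
#define TSB_SIZE  (1u << TSB_SHIFT)
#define TSB_MASK  (TSB_SIZE - 1u)

typedef uint32_t (*tsb_hash_fn)(uint64_t vpn);

/* Plain mask: only the low TSB_SHIFT bits of the page number are used. */
static uint32_t idx_and(uint64_t vpn)
{
    return (uint32_t)(vpn & TSB_MASK);
}

/* xor addressed: fold the next TSB_SHIFT bits into the index. */
static uint32_t idx_xor(uint64_t vpn)
{
    return (uint32_t)(((vpn >> TSB_SHIFT) ^ vpn) & TSB_MASK);
}

/* sum addressed: same idea, with an add instead of an xor. */
static uint32_t idx_sum(uint64_t vpn)
{
    return (uint32_t)(((vpn >> TSB_SHIFT) + vpn) & TSB_MASK);
}

/* Count how many distinct slots n pages, spaced stride pages apart, hit. */
static unsigned distinct_slots(tsb_hash_fn h, uint64_t stride, unsigned n)
{
    unsigned char seen[TSB_SIZE] = {0};
    unsigned count = 0;

    for (unsigned i = 0; i < n; i++) {
        uint32_t slot = h((uint64_t)i * stride);
        if (!seen[slot]) {
            seen[slot] = 1;
            count++;
        }
    }
    return count;
}
```

With 16 pages spaced exactly TSB_SIZE pages apart, distinct_slots()
reports a single slot for idx_and but 16 slots for both idx_xor and
idx_sum, while 16 consecutive pages use 16 slots under all three.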

I guess the part that worries me with only one TSB strategy is that on
UltraSparc I/II the misses will be capacity misses as the TLB is fully
associative, but on USPARC3 you get a 1024 entry 4-way TLB that will give
you a completely different pattern on many programs.

> [ ... ]
> >
> > Another way to overcome this would be to allocate the different "ways" of
> > the TSB separately, so that the area need not be contiguous.
>
> This has got me thinking.  The main problem is how to map the tsb into
> contiguous virtual address space so it can be indexed easily by the tlb
> fault handlers, ideally base + offset, where offset is a hash of the
> virtual address, with no indirection.  But the only requirement for
> contiguity is to make the tlb fault handlers fast.  When it's accessed by
> the kernel for pmap processing it doesn't much matter because there are
> other slow things going on, adding indirection there won't make much difference.
> Just mapping it into the kernel address space is easy because faults on
> the tsb inside of the user tlb fault handler are handled uniformly as normal
> kernel tlb faults.  If they were handled specially it might be possible
> to use a multi level structure that's more transparent.
>

Something I really liked from your post about the tree-like structure was the
idea of having a simple (let's call it L0TSB?) buffer at the front with the
complex lookup happening later. We could even simplify it to the point
where the L1 table would only contain the 8KB page number and a pointer to
the entry in the TSB, so the 'fast load' handler would basically be (in
pseudocode):

	offset = ((v >> 10) & L0TSB_MASK);

	/* miss: the slot is empty or holds a different page */
	if ((l0tsb[offset+1] == 0) || (l0tsb[offset] != v))
		do_real_tsb_load(v);

	load_from_tsb(&l0tsb[offset+1]);
	return_from_exception();

which is almost as simple in sparc assembler (sorry, I'm slow writing
sparc asm), short, and can accommodate a two-way cache just as easily. It
does mean that loading of non-8K pages will always be slower, in exchange
for a fast simple case for 8KB pages. This should also provide for very
fast TLB loads for cases where the LRU replacement TLB has zero locality,
either due to repetitive loads that cause the LRU to throw out the entries
that are about to be used again (matrix codes/bitmap ops that are not
blocked for the TLB) or sparse access to memory (yay for tree / linked
list structures linking data present all over the heap).
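
A slightly more fleshed out C version of the same idea, as a sketch only:
it assumes a direct-mapped front table where each slot holds an 8KB page
number and a pointer into the real TSB. struct tte here is just a
stand-in for a TSB entry, and L0TSB_ENTRIES, l0tsb_lookup and
l0tsb_insert are hypothetical names, not anything in the current pmap.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT_8K 13u                 /* 8KB pages */
#define L0TSB_ENTRIES 256u                /* hypothetical size, power of two */
#define L0TSB_MASK    (L0TSB_ENTRIES - 1u)

/* Stand-in for a real TSB entry; the layout is an assumption. */
struct tte {
    uint64_t tag;
    uint64_t data;
};

/* One front-table slot: the page number and where its TTE lives. */
struct l0tsb_entry {
    uint64_t    vpn;   /* 8KB virtual page number */
    struct tte *tte;   /* entry in the real TSB, NULL if the slot is empty */
};

static struct l0tsb_entry l0tsb[L0TSB_ENTRIES];

/*
 * Fast path: return the TTE to load into the TLB, or NULL when the
 * caller has to fall back to the full (slow) TSB lookup.
 */
static struct tte *l0tsb_lookup(uint64_t va)
{
    uint64_t vpn = va >> PAGE_SHIFT_8K;
    struct l0tsb_entry *e = &l0tsb[vpn & L0TSB_MASK];

    if (e->tte == NULL || e->vpn != vpn)
        return NULL;
    return e->tte;
}

/* Slow path epilogue: remember where the TTE for this page was found. */
static void l0tsb_insert(uint64_t va, struct tte *tte)
{
    uint64_t vpn = va >> PAGE_SHIFT_8K;
    struct l0tsb_entry *e = &l0tsb[vpn & L0TSB_MASK];

    e->vpn = vpn;
    e->tte = tte;
}
```

The slow path would call l0tsb_insert() once it has found the entry, so
the next fault on the same 8KB page hits in l0tsb_lookup(). Going
two-way would mean checking a second slot in the lookup before giving up.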

It also means that the "real" tsb can be more complex than now without
causing slowdowns for applications, as long as most cases are handled at
the L0 level, and the tsb is no longer required to be contiguous.

> [ ... ]
> >
> > I was more thinking about say using 64K pages for user stack or malloc or
> > similar - it doesn't quite have the same complexity as general mapping
> > scheme, in the malloc case could be requested by the user and would keep
> > pressure on TLB & faults down. But sounds like this is not simple/feasible
> > for now.
>
> I'd really like to see this, but yes it is hard to do in a generalized way.
> The vm system needs to be aware of the contiguity of physical pages, and
> try to make reservations of large contiguous regions, expecting all the
> pages to be faulted in with the same protections so the mapping can be
> "upgraded" to use a larger page size.  The best work that I've seen on
> this topic so far is this paper:
> 	http://www.cs.rice.edu/~ssiyer/r/superpages/
> That Alan Cox is a FreeBSD guy, so we may see this in FreeBSD at some point.
>

Well, for me the problem part starts where people start talking about page
size promotion/demotion, and not simply sneakily using some page size (say
64K) for specific memory regions and never promoting / demoting pages. The
64K pages are sort of magic in that way though, and in general it would
suck to use the other page sizes that way, except for possibly mapping
framebuffers and similar.

> Jake
>