tlb, tsb & ...stuff

Fri Apr 11 21:40:05 PDT 2003

Hmm, so someone else has read that code.  :)

Apparently, On Sat, Apr 12, 2003 at 02:12:43AM +0300,
	Narvi said words to the effect of;

> 
> ok, I'm a lamer and couldn't think of a nice & spiffy subject line.
> 
> 
> 	TLB / TSB statistics:
> 
> Presently we only get statistics on entries being moved into TSB, with no
> dtlb/itlb separation. Unless people think this is a bad idea, I'd like to
> make an option that would expose dTLB/iTLB and related TSB misses as
> statistics. This would allow you to get Solaris 9 style 'trapstat -t'
> information. The counters would need to be per-processor.

Well, the problem is the current tlb fault handlers are really tight on
space in the trap table.  I think the tl0_immu_miss and tl0_dmmu_prot
have 0 or 1 instructions free.  Incrementing counters to track dTLB
misses will take 3 instructions minimum, so you'd have to do something
like ifdef the handlers to just branch to code at the end of the trap
table if the counters are enabled, which gets pretty ugly.  You're welcome
to do this and report results, but I'm not sure I want it to be committed.

There are some ad hoc statistics on tsb replacements with options PMAP_STATS,
under sysctl debug.pmap_stats.  In my experience few replacements occur unless
you are using a lot of memory.

Adding statistics in the page fault path sounds fine, but lower than that I'm
not so sure.
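
As a rough illustration of per-processor counters kept in C code (the page
fault path rather than the trap table), something like the following would
do.  Every name here (MODEL_MAXCPU, miss_stats, miss_stats_record) is made up
for the sketch and is not an existing kernel interface:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAXCPU    4       /* hypothetical cpu count for the sketch */

enum miss_kind { MISS_ITLB, MISS_DTLB, MISS_KINDS };

/* One set of counters per processor, split by itlb/dtlb. */
struct miss_stats {
        uint64_t count[MISS_KINDS];
};

static struct miss_stats miss_stats[MODEL_MAXCPU];

/* Bump the counter for the given cpu; each cpu only touches its own slot. */
static void
miss_stats_record(int cpu, enum miss_kind kind)
{
        assert(cpu >= 0 && cpu < MODEL_MAXCPU);
        miss_stats[cpu].count[kind]++;
}

/* Sum one kind of miss over all processors, e.g. for a sysctl read. */
static uint64_t
miss_stats_total(enum miss_kind kind)
{
        uint64_t total = 0;

        for (int cpu = 0; cpu < MODEL_MAXCPU; cpu++)
                total += miss_stats[cpu].count[kind];
        return (total);
}
```

The increment is still a load, an add and a store, which is why doing the same
thing inside the trap table handlers costs at least 3 instructions.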

> 
> 
> 	TSB & replacement:
> 
> From what I gather (please correct me if I'm wrong!) the present TSB
> consists of 2K entries, organised into buckets with each bucket containing
> 4 entries. On replacement/entry we enter into an entry that was
> empty/invalid or pick one "randomly" based on the lower digits of tick. We
> try 4 times (for each page size) so up to 16 places get probed before a
> miss / hit.
> 
> Making it a 4-way random replacement software managed unified L2 tlb (with
> slight oddness for multiple page sizes).

Yes, this is correct.  The multiple page size stuff doesn't work as well as
I'd like, and the vm system isn't set up to use it yet (this is a lot of
work).  I consider the current tsb implementation to be a bit of an experiment
(I'd never dealt with pmap or tlb fault handlers when I started) and worth
throwing out completely if we can think of something better.  It's decent and
fast in most cases, but the fixed size of the tsb, which causes the
replacements, limits the RSS that a process can have without causing soft
faults into the vm system.  The kernel gets bogged down with soft faults
pretty fast if you go beyond the RSS that fits in the tsb.  You can really see
this if you reduce the size of the tsb.  It works well enough for current
workloads, but once we start supporting things like X I'm not sure that it
will fly.
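
For reference, the bucket scheme described in the quoted text can be modelled
in a few lines of C.  This is only a sketch under assumptions: the struct
layout, the modulo hash and all of the model_* names are invented here, it
handles a single page size (so one bucket probe instead of one per page size),
and the tick value is passed in as a plain argument:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUCKET_WAYS     4
#define NBUCKETS        512     /* 2K entries / 4 entries per bucket */

struct model_tte {
        uint64_t vpn;           /* virtual page number, used as the tag */
        uint64_t data;          /* translation payload */
        int      valid;
};

struct model_tsb {
        struct model_tte bucket[NBUCKETS][BUCKET_WAYS];
};

/* Probe the bucket for vpn; returns NULL on a miss. */
static struct model_tte *
model_tsb_lookup(struct model_tsb *tsb, uint64_t vpn)
{
        struct model_tte *b = tsb->bucket[vpn % NBUCKETS];

        for (int i = 0; i < BUCKET_WAYS; i++)
                if (b[i].valid && b[i].vpn == vpn)
                        return (&b[i]);
        return (NULL);
}

/*
 * Enter a mapping: reuse a matching entry, else take an invalid one, else
 * replace the way selected by the low bits of tick.
 */
static struct model_tte *
model_tsb_enter(struct model_tsb *tsb, uint64_t vpn, uint64_t data,
    uint64_t tick)
{
        struct model_tte *b = tsb->bucket[vpn % NBUCKETS];
        struct model_tte *slot = NULL;

        for (int i = 0; i < BUCKET_WAYS; i++) {
                if (b[i].valid && b[i].vpn == vpn) {
                        slot = &b[i];
                        break;
                }
                if (!b[i].valid && slot == NULL)
                        slot = &b[i];
        }
        if (slot == NULL)
                slot = &b[tick % BUCKET_WAYS];  /* "random" replacement */
        slot->vpn = vpn;
        slot->data = data;
        slot->valid = 1;
        return (slot);
}
```

Once a fifth page hashes to a full bucket, one of the four residents is thrown
out, and touching it again is one of the soft faults described above.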

I've been planning to replace it with something that's more like page tables
and not so reliant on hashing in the same sense.  The way it works is in the
base case you have a 1 page direct mapped tsb (ie no buckets), indexed by the
first 8 bits of virtual address above page size (call this level 0).  On a
miss in the tsb the tlb fault handler would check a bit in the tte which
indicates that there's actually another level and the tte just loaded (the
"miss" tte) contains a pointer to it.  So it would restart the lookup using
the new tsb page.  The twist is that as you go to the next level you use the
next higher "page spread" virtual address bits to index the tsb pages.
Basically collisions in the address bits used to index a given level cause
another level to be added which is indexed by the next higher set of virtual
address bits, instead of causing replacements.  The lookup function for an
arbitrary level looks something like this:

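/* Number of tsb levels, and virtual page number bits used to index each. */
#define TSB_MAX_LEVEL                   (3)
#define TSB_PAGE_ADDRESS_BITS           (8)

/* Return the tte slot for vpn in the tsb page used at the given level. */
static __inline struct tte *
tsb_vpntotte(struct tte *tsb, vm_offset_t vpn, int level)
{
        return (&tsb[(vpn >> (TSB_PAGE_ADDRESS_BITS * level)) &
            ((1 << TSB_PAGE_ADDRESS_BITS) - 1)]);
}

/* Same as above, but starting from a virtual address. */
static __inline struct tte *
tsb_vtotte(struct tte *tsb, vm_offset_t va, int level)
{
        return (tsb_vpntotte(tsb, va >> TAR_VPN_SHIFT, level));
}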

With 3 levels this can support a 32 gigabyte virtual address space (fully
resident), but doesn't penalize processes with sparse address spaces due
to not requiring intermediate "page table pages" in all cases.  Basically
like traditional page tables but backwards :).
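
To make the walk concrete, here is a small user-space model of the multi-level
lookup.  It is a sketch only: struct mtte, the kind field standing in for the
"another level" bit, and in particular the enter policy (push the colliding
mapping down one level and leave a link behind) are assumptions made for the
example, not a worked-out design:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TSB_MAX_LEVEL           3
#define TSB_PAGE_ADDRESS_BITS   8
#define TSB_PAGE_ENTRIES        (1 << TSB_PAGE_ADDRESS_BITS)

/* Hypothetical tte: empty, a mapping, or a link to the next level. */
enum mtte_kind { MTTE_EMPTY = 0, MTTE_MAP, MTTE_LINK };

struct mtte {
        enum mtte_kind  kind;
        uint64_t        vpn;
        uint64_t        data;
        struct mtte    *next;   /* valid when kind == MTTE_LINK */
};

/* Slot for vpn in a tsb page at the given level (8 vpn bits per level). */
static struct mtte *
mtsb_slot(struct mtte *page, uint64_t vpn, int level)
{
        return (&page[(vpn >> (TSB_PAGE_ADDRESS_BITS * level)) &
            (TSB_PAGE_ENTRIES - 1)]);
}

/* Follow links until a mapping or an empty slot is found. */
static struct mtte *
mtsb_lookup(struct mtte *root, uint64_t vpn)
{
        struct mtte *page = root;

        for (int level = 0; level < TSB_MAX_LEVEL; level++) {
                struct mtte *tp = mtsb_slot(page, vpn, level);

                if (tp->kind == MTTE_MAP)
                        return (tp->vpn == vpn ? tp : NULL);
                if (tp->kind != MTTE_LINK)
                        return (NULL);
                page = tp->next;        /* restart with the next level */
        }
        return (NULL);
}

/* Enter a mapping; a collision adds a level instead of replacing. */
static int
mtsb_enter(struct mtte *root, uint64_t vpn, uint64_t data)
{
        struct mtte *page = root;

        for (int level = 0; level < TSB_MAX_LEVEL; level++) {
                struct mtte *tp = mtsb_slot(page, vpn, level);

                if (tp->kind == MTTE_EMPTY ||
                    (tp->kind == MTTE_MAP && tp->vpn == vpn)) {
                        tp->kind = MTTE_MAP;
                        tp->vpn = vpn;
                        tp->data = data;
                        return (0);
                }
                if (tp->kind == MTTE_MAP) {
                        struct mtte *np;

                        if (level + 1 == TSB_MAX_LEVEL)
                                return (-1);    /* out of levels */
                        np = calloc(TSB_PAGE_ENTRIES, sizeof(*np));
                        if (np == NULL)
                                return (-1);
                        /* Move the old mapping down, leave a link here. */
                        *mtsb_slot(np, tp->vpn, level + 1) = *tp;
                        tp->kind = MTTE_LINK;
                        tp->next = np;
                }
                page = tp->next;
        }
        return (-1);
}
```

In this model a process whose pages never collide in the low 8 vpn bits only
ever has the single level 0 page, which is the sparse address space case.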

This has an added advantage of not requiring more than 1 page of contiguous
virtual or physical address space for any part of the tsb.  With the current
implementation you can't increase the tsb size too much because it allocates
a large chunk of contiguous virtual address space and as the kernel address
space gets fragmented you start to run out.  I'm not sure if you've looked at
the kernel tlb fault handlers, but the same technique that's used in MIPS
and alpha kernels is used to provide a direct mapped address space region,
which corresponds to the upper VA hole on UltraSPARC II.  What this does is
maps all of physical memory into the upper portion of the address space using
4 meg tlb entries.  The physical address is encoded in the virtual address,
so the fault handler just needs to extract it and whip up a tlb entry on the
fly.  No page tables, no lookups, no nothing.  This would allow the tsb pages
to be mapped with the direct mapped address space, so no mappable kva would
be required for the tsb.
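
The encoding trick can be shown with a little address arithmetic.  This is a
sketch with made-up constants: DMAP_BASE and the dmap_* names are hypothetical
and the base address is only a plausible example, not the value the kernel
uses:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical start of the direct mapped region (upper part of the VA). */
#define DMAP_BASE       UINT64_C(0xfffff80000000000)
#define PAGE_SHIFT_4M   22
#define PAGE_MASK_4M    ((UINT64_C(1) << PAGE_SHIFT_4M) - 1)

/* What a fault handler would load into the tlb for a direct mapped va. */
struct dmap_tlb_entry {
        uint64_t va_tag;        /* 4 meg aligned virtual address */
        uint64_t pa_base;       /* 4 meg aligned physical address */
};

/* The physical address is simply placed in the low bits of the va. */
static uint64_t
dmap_phys_to_virt(uint64_t pa)
{
        assert((pa & DMAP_BASE) == 0);
        return (DMAP_BASE | pa);
}

static uint64_t
dmap_virt_to_phys(uint64_t va)
{
        assert(va >= DMAP_BASE);
        return (va & ~DMAP_BASE);
}

/* Build the entry from the faulting address alone: no table lookup. */
static struct dmap_tlb_entry
dmap_fault(uint64_t va)
{
        struct dmap_tlb_entry e;

        e.va_tag = va & ~PAGE_MASK_4M;
        e.pa_base = dmap_virt_to_phys(va) & ~PAGE_MASK_4M;
        return (e);
}
```

Everything the handler needs is in the faulting address itself, so a tsb page
reached through this region costs no mappable kva.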

> 
> It would imho be interesting to support a couple of different and
> selectable entry indexing policies, say at least:
> 
> 	* hashed
> 	* skew-associative
> 
> to cater for various access patterns & tsb lookup loads. Again, if this
> would be a bad idea, let me know.

It's a good idea and I'd be interested in the results, but what I'm more
interested in is new data structures to support virtual memory that give
improvements in design or architecture, rather than heuristics such as
tweaking the hashing algorithms.

> 
> 	Usparc3(cu)
> 
> What will happen there? Do we use any of the large page sizes enough to
> make one of the large TLB-s cache a large(r) page size?

Yes, see above about the direct mapped address space.  I've read papers on
generalized schemes for using large page sizes for user mappings, but I'm
not sure if we'll see this anytime soon in FreeBSD in a big way.  However, the
2 512 entry tlbs with programmable page sizes on USIII+ should work very well
with one programmed for 4meg tlb entries and the other for 8K.  The direct
mapping technique is hooked into the kernel zone allocator, uma, which is
also the back end allocator for malloc(9), so allocations of objects that are
less than a page minus some overhead use it, which for the most part would
give the kernel an entire 512 entry tlb for itself.  This may or may not be
faster than just using it as a single 1024 entry tlb for 8K mappings, have
to see.

Anyway, hope I didn't completely blow over your question.

Jake