tlb, tsb & ...stuff

Narvi narvi at haldjas.folklore.ee
Sat Apr 12 09:12:38 PDT 2003


On Sat, 12 Apr 2003, Jake Burkholder wrote:

>
> Hmm, so someone else has read that code.  :)
>
> Apparently, On Sat, Apr 12, 2003 at 02:12:43AM +0300,
> 	Narvi said words to the effect of;
>
> >
> > ok, I'm a lamer and couldn't think of a nice & spiffy subject line.
> >
> >
> > 	TLB / TSB statistics:
> >
> > Presently we only get statistics on entries being moved into TSB, with no
> > dtlb/itlb separation. Unless people think this is a bad idea, I'd like to
> > make an option that would expose dTLB/iTLB and related TSB misses as
> > statistics. This would allow you to get Solaris 9 style 'trapstat -t'
> > information. The counters would need to be per-processor.
>
> Well, the problem is the current tlb fault handlers are really tight on
> space in the trap table.  I think the tl0_immu_miss and tl0_dmmu_prot
> have 0 or 1 instructions free.  Incrementing counters to track dTLB
> misses will take 3 instructions minimum, so you'd have to do something
> like ifdef the handlers to just branch to code at the end of the trap
> table if the counters are enabled, which gets pretty ugly.  You're welcome
> to do this and report results, but I'm not sure I want it to be committed.
>

Ah, yes, this would be bad -
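Just to be concrete about the per-processor counters I had in mind - a sketch
only, with made-up names; as you say, the real cost is the extra load/add/store
per miss in the assembly handlers, not anything on the C side:

/*
 * Sketch only: hypothetical per-processor TLB/TSB miss counters.  The
 * struct, field and array names are made up; the interesting cost is the
 * few extra instructions per miss in the trap-table assembly, which this
 * C view hides.
 */
#include <sys/param.h>		/* for MAXCPU and u_long */

struct tlb_miss_stats {
	u_long	itlb_miss;	/* instruction TLB misses */
	u_long	dtlb_miss;	/* data TLB misses */
	u_long	dtlb_prot;	/* protection faults (tl0_dmmu_prot) */
	u_long	tsb_miss;	/* misses that also missed the TSB */
};

/* One instance per processor, so the handlers never share a counter. */
static struct tlb_miss_stats tlb_miss_stats[MAXCPU];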

> There are some ad hoc statistics on tsb replacements with options PMAP_STATS,
> under sysctl debug.pmap_stats.  In my experience few replacements occur unless
> you are using a lot of memory.
>
> Adding statistics in the page fault path sounds fine, but lower than that I'm
> not so sure.
>
> >
> >
> > 	TSB & replacement:
> >
> > From what I gather (please correct me if I'm wrong!) the present TSB
> > consists of 2K entries, organised into buckets with each bucket containing
> > 4 entries. On replacement/entry we enter into an entry that was
> > empty/invalid or pick one "randomly" based on the lower digits of tick. We
> > try 4 times (for each page size) so up to 16 places get probed before a
> > miss / hit.
> >
> > Making it a 4-way random replacement software managed unified L2 tlb (with
> > slight oddness for multiple page sizes).
>
> Yes, this is correct.  The multiple page size stuff doesn't work as well as
> I'd like, and the vm system isn't set up to use it yet (this is a lot of
> work).  I consider the current tsb implementation to be a bit of an experiment
> (I'd never dealt with pmap or tlb fault handlers when I started) and worth
> throwing out completely if we can think of something better.  It's decent and
> fast in most cases but the fixed size of the tsb, which causes the
> replacements, limits the RSS that a process can have without causing soft
> faults into the vm system.  The kernel gets bogged down with soft faults
> pretty fast if you outgrow the RSS that fits in the tsb.  You can really see
> this if you reduce the size of the tsb.  It works well enough for current
> workloads but once we start supporting things like X I'm not sure that it
> will fly.

So these are mainly capacity and not conflict misses? The obvious way to
improve this would be to increase the size and eliminate the TAILQ, unless
removing the TAILQ linking would cause massive changes. But entry-count-wise
we are probably at the minimal end of the range right now.
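
For reference, my reading of the current lookup/replacement described above,
as a sketch - made-up names and constants, not the real tsb.c, and ignoring
the extra probes for the other page sizes:

/*
 * Sketch of the bucketed TSB described above: 2K entries in buckets of 4,
 * prefer an invalid slot, otherwise evict one picked from the low bits of
 * the tick register.  tte_is_valid() and rd_tick() are made-up helpers.
 */
#define	TSB_BUCKET_SIZE		4
#define	TSB_BUCKETS		(2048 / TSB_BUCKET_SIZE)

static struct tte *
tsb_choose_slot(struct tte *tsb, vm_offset_t va)
{
	struct tte *bucket;
	int i;

	/* Hash the virtual page number down to one bucket of 4 entries. */
	bucket = &tsb[((va >> PAGE_SHIFT) % TSB_BUCKETS) * TSB_BUCKET_SIZE];

	/* Use an empty/invalid entry if there is one... */
	for (i = 0; i < TSB_BUCKET_SIZE; i++)
		if (!tte_is_valid(&bucket[i]))
			return (&bucket[i]);

	/* ...otherwise pick a "random" victim from the low bits of tick. */
	return (&bucket[rd_tick() & (TSB_BUCKET_SIZE - 1)]);
}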

>
> I've been planning to replace it with something that's more like page tables
> and not so reliant on hashing in the same sense.  The way it works is in the
> base case you have a 1 page direct mapped tsb (ie no buckets), indexed by the
> first 8 bits of virtual address above page size (call this level 0).  On a
> miss in the tsb the tlb fault handler would check a bit in the tte which
> indicates that there's actually another level and the tte just loaded (the
> "miss" tte) contains a pointer to it.  So it would restart the lookup using
> the new tsb page.  The twist is that as you go to the next level you use the
> next higher "page spread" virtual address bits to index the tsb pages.
> Basically collisions in the address bits used to index a given level cause
> another level to be added which is indexed by the next higher set of virtual
> address bits, instead of causing replacements.  The lookup function for an
> arbitrary level looks something like this:
>
> #define TSB_MAX_LEVEL                   (3)
> #define TSB_PAGE_ADDRESS_BITS           (8)
>
> static __inline struct tte *
> tsb_vpntotte(struct tte *tsb, vm_offset_t vpn, int level)
> {
>         return (&tsb[(vpn >> (TSB_PAGE_ADDRESS_BITS * level)) &
>             ((1 << TSB_PAGE_ADDRESS_BITS) - 1)]);
> }
>
> static __inline struct tte *
> tsb_vtotte(struct tte *tsb, vm_offset_t va, int level)
> {
>         return (tsb_vpntotte(tsb, va >> TAR_VPN_SHIFT, level));
> }
>
> With 3 levels this can support a 32 gigabyte virtual address space (fully
> resident), but doesn't penalize processes with sparse address spaces due
> to not requiring intermediate "page table pages" in all cases.  Basically
> like traditional page tables but backwards :).
>
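
To make the walk concrete, here is an illustrative descent over the levels
using the lookup above - tte_match(), TD_NEXT and tte_next_tsb() are names
assumed for this sketch, and the real handler would be trap-table assembly:

/*
 * Sketch only: walk the multi-level tsb described above.  TD_NEXT stands in
 * for the "another level exists" bit and tte_next_tsb() for the pointer to
 * the next tsb page; tte_match() is assumed to compare the tag against va.
 */
static struct tte *
tsb_walk(struct tte *tsb, vm_offset_t va)
{
	struct tte *tp;
	int level;

	for (level = 0; level <= TSB_MAX_LEVEL; level++) {
		tp = tsb_vtotte(tsb, va, level);
		if (tte_match(tp, va))
			return (tp);		/* hit at this level */
		if ((tp->tte_data & TD_NEXT) == 0)
			break;			/* no deeper level: miss */
		tsb = tte_next_tsb(tp);		/* descend one level */
	}
	return (NULL);
}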

This would have rather bad locality of access though, no? The conflict
side of the thing is resolvable (to an extent) by using it as a sum- or
xor-addressed cache. I guess I should whip up some code to show what I
mean. A miss in L2 can delay you for, say, 100 cycles. Also, this scheme
causes the pages to accumulate with basically no eviction of 'visit once'
pages.
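
Roughly, by xor addressing I mean something like this (a sketch only,
constants and names made up):

/*
 * Sketch only: xor-addressed index, folding a higher slice of the virtual
 * page number into the bucket index so that addresses which collide under a
 * plain modulo hash get spread out.  Constants and names are made up.
 */
#define	TSB_INDEX_BITS	9			/* e.g. 512 buckets */
#define	TSB_INDEX_MASK	((1UL << TSB_INDEX_BITS) - 1)

static __inline u_long
tsb_xor_index(vm_offset_t vpn)
{
	return ((vpn ^ (vpn >> TSB_INDEX_BITS)) & TSB_INDEX_MASK);
}

/* A skew-associative variant would give each "way" its own mixing shift. */
static __inline u_long
tsb_skew_index(vm_offset_t vpn, int way)
{
	return ((vpn ^ (vpn >> (TSB_INDEX_BITS + way))) & TSB_INDEX_MASK);
}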

> This has an added advantage of not requiring more than 1 page of contiguous
> virtual or physical address space for any part of the tsb.  With the current
> implementation you can't increase the tsb size too much because it allocates
> a large chunk of contiguous virtual address space and as the kernel address
> space gets fragmented you start to run out.  I'm not sure if you've looked at
> the kernel tlb fault handlers, but the same technique that's used in MIPS
> and alpha kernels is used to provide a direct mapped address space region,
> which corresponds to the upper VA hole on UltraSPARC II.  What this does is
> map all of physical memory into the upper portion of the address space using
> 4 meg tlb entries.  The physical address is encoded in the virtual address,
> so the fault handler just needs to extract it and whip up a tlb entry on the
> fly.  No page tables, no lookups, no nothing.  This would allow the tsb pages
> to be mapped with the direct mapped address space, so no mappable kva would
> be required for the tsb.
>

Another way to overcome this would be to allocate the different "ways" of
the TSB separately, so that the area need not be contiguous.
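
For reference, the direct-map trick described above reduces to plain
arithmetic, roughly like this - the base address and names here are
placeholders, not the actual sparc64 definitions:

/*
 * Sketch of the direct-mapped region: the physical address is encoded in
 * the virtual address above the hole, so conversion is pure arithmetic and
 * the fault handler can build a tlb entry on the fly.  DIRECT_BASE and the
 * function names are placeholders for this example.
 */
#define	DIRECT_BASE	0xfffff80000000000UL	/* assumed base, above the VA hole */

static __inline vm_offset_t
direct_phys_to_virt(vm_offset_t pa)
{
	return (DIRECT_BASE + pa);
}

static __inline vm_offset_t
direct_virt_to_phys(vm_offset_t va)
{
	return (va - DIRECT_BASE);
}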

> >
> > It would imho be interesting to support a couple of different and
> > selectable entry indexing policies, say at least:
> >
> > 	* hashed
> > 	* skew-associative
> >
> > to cater for various access patterns & tsb lookup loads. Again, if this
> > would be a bad idea, let me know.
>
> It's a good idea and I'd be interested in the results, but what I'm more
> interested in is new data structures to support virtual memory that give
> improvements in design or architecture, rather than heuristics such as
> tweaking the hashing algorithms.
>

The problem is that almost anything you try is guaranteed to lose under
different loads.

> >
> > 	Usparc3(cu)
> >
> > What will happen there? Do we use any of the large page sizes enough to
> > make one of the large TLB-s cache a large(r) page size?
>
> Yes, see above about the direct mapped address space.  I've read papers on
> generalized schemes for using large page sizes for user mappings, but I'm
> not sure if we'll see this anytime soon in FreeBSD in a big way.  However, the
> two 512-entry tlbs with programmable page sizes on USIII+ should work very well
> with one programmed for 4meg tlb entries and the other for 8K.  The direct
> mapping technique is hooked into the kernel zone allocator, uma, which is
> also the back end allocator for malloc(9), so allocations of objects that are
> less than a page minus some overhead use it, which for the most part would
> give the kernel an entire 512-entry tlb for itself.  This may or may not be
> faster than just using it as a single 1024-entry tlb for 8K mappings; we'll
> have to see.
>

I was thinking more about, say, using 64K pages for the user stack or malloc
or similar - it doesn't quite have the same complexity as a general mapping
scheme, and in the malloc case it could be requested by the user and would
keep TLB pressure & faults down. But it sounds like this is not
simple/feasible for now.

> Anyway, hope I didn't completely blow over your question.
>
> Jake
>



