superpages for UMA

Mon Aug 18 22:35:53 UTC 2014

On Mon, Aug 18, 2014 at 3:26 PM, Warner Losh <imp at bsdimp.com> wrote:

>
> On Aug 18, 2014, at 2:13 PM, Peter Grehan <grehan at freebsd.org> wrote:
>
> >> Newer Intel CPUs have more entries, and AMD CPUs have long (since
> >> Barcelona) had more.  In particular, they allow 2 MB page mappings to be
> >> cached in a larger L2 TLB.  Nowadays, the trouble is with the 1 GB pages.
> >> A lot of CPUs still only support an 8 entry, 1 level TLB for 1 GB pages.
> >
> > There are new(ish) ones effectively without 1GB pages. From the
> > "Software Optimization Guide for AMD Family 16h Processors"
> >
> > "Smashing"
> >  ...
> > "when the Family 16h processor encounters a 1-Gbyte page size, it will
> > smash translations of that 1-Gbyte region into 2-Mbyte TLB entries, each
> > of which translates a 2-Mbyte region of the 1-Gbyte page."
>
> “we’ll emulate this feature designed to make things go faster in hardware
> in software by doing the very thing that makes it go slow in hardware.”
>
> Fun times. Performance Smashing!
>
>

I'm guessing that these are low-power processors, where they don't want to
have another CAM consuming power.  Under those circumstances, it's still
better to support 1 GB page mappings in the page table, even if the TLB
doesn't support them, than not to support 1 GB page mappings at all.  With
the hierarchical page tables on x86, you get a 512x reduction in page table
size with each increase in page size (4 KB to 2 MB to 1 GB).  So, on a TLB
miss, the page table walk is more likely to hit in the L2 data cache at
every level, rather than missing and going all the way to DRAM.
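
To put rough numbers on the 512x figure, here is a minimal sketch of the arithmetic.  It assumes the usual x86-64 layout of 8-byte page table entries, and it counts only the leaf entries needed to map a 1 GB region, ignoring the upper levels of the hierarchy.  The function and constant names are made up for illustration.

```python
# Back-of-the-envelope sketch: bytes of leaf page table entries needed to
# map a region at each x86-64 page size. Assumes 8-byte entries; the names
# below are illustrative only.

KB = 1024
MB = 1024 * KB
GB = 1024 * MB

PTE_BYTES = 8  # size of one x86-64 page table entry


def leaf_entry_bytes(region_bytes, page_bytes):
    """Return the bytes of leaf entries needed to map region_bytes
    using pages of page_bytes (rounding the entry count up)."""
    entries = (region_bytes + page_bytes - 1) // page_bytes
    return entries * PTE_BYTES


if __name__ == "__main__":
    for label, page in (("4 KB", 4 * KB), ("2 MB", 2 * MB), ("1 GB", GB)):
        print(f"{label} pages: {leaf_entry_bytes(GB, page)} bytes of leaf entries")
```

Under those assumptions, mapping 1 GB takes 2 MB of leaf entries with 4 KB pages, 4 KB with 2 MB pages, and a single 8-byte entry with a 1 GB page.  The last two fit comfortably in a typical L2 data cache, which is why the walk stays cheap even when the TLB itself cannot hold a 1 GB entry.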

One feature that I always liked about the AMD performance counters was that
they allowed you to count L2 cache misses caused by page table walks on a
TLB miss.  This was often a better predictor of whether large pages were
going to be beneficial than counting TLB misses.