expanding amd64 past the 1TB limit

Chris Torek torek at torek.net
Fri Jun 28 20:34:02 UTC 2013


[combining two messages and adding kib and alc to cc per Oliver Pinter]

>> the CPU's CR4 on entry to the kernel.
>It is %cr3.

Oops, well, easily fixed. :-)

>> (If we used bcopy() to copy the kernel pmap's NKPML4E and NDMPML4E
>> entries into the new pmap, the L3 pages would not have to be
>> physically contiguous, but the KVA ones would still all have to
>> exist.  It's free to allocate physically contiguous pages here
>> anyway though.)
>I do not see how the physical continuity of the allocated page table
>pages is relevant there.

Not in create_pagetables(), no, but later in pmap_pinit(), which has
loops to set pmap->pm_pml4[x] for the kernel and direct-map.  And:

>Copying the L4 or L3 PTEs would cause serious complications.

Perhaps what I wrote was a little fuzzy.  Here's the pmap_pinit()
code I was referring to, as modified (the original version has only
the second loop -- it assumes NKPML4E is always 1 so it just sets
pml4[KPML4I]):

	pmap->pm_pml4 = (pml4_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(pml4pg));

	if ((pml4pg->flags & PG_ZERO) == 0)
		pagezero(pmap->pm_pml4);

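	/* Wire in the kernel's global address entries: KVA, then direct map. */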
	for (i = 0; i < NKPML4E; i++) {
		pmap->pm_pml4[KPML4BASE + i] = (KPDPphys + (i << PAGE_SHIFT)) |
		    PG_RW | PG_V | PG_U;
	}
	for (i = 0; i < NDMPML4E; i++) {
		pmap->pm_pml4[DMPML4I + i] = (DMPDPphys + (i << PAGE_SHIFT)) |
		    PG_RW | PG_V | PG_U;
	}

	/* install self-referential address mapping entry(s) */

These require that KPDPphys and DMPDPphys both point to the first
of n physically-contiguous pages.  But suppose we did this (this
is deliberately simple for illustration, and furthermore I am
assuming here that vmspace0 never acquires any user-level L4
entries):

	pmap->pm_pml4 = (pml4_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(pml4pg));

	/* Clear any junk and wire in kernel global address entries. */
	bcopy(vmspace0.vm_pmap.pm_pml4, pmap->pm_pml4, PAGE_SIZE);

	/* install self-referential address mapping entry(s) */

Now whatever we set up in create_pagetables() is simply copied to
new (user) pmaps, so we could go totally wild if we wanted. :-)
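
(One thing the bcopy cannot carry over is the recursive slot: that entry
must point at each new pmap's own L4 page, so it would still be written
per-pmap after the copy, roughly as the existing code under that comment
does today -- quoting from memory:)

	pmap->pm_pml4[PML4PML4I] = VM_PAGE_TO_PHYS(pml4pg) | PG_V | PG_RW |
	    PG_A | PG_M;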

>> So, the last NKPML4E slots in KPML4phys point to the following
>> page tables, which use all of L3, L2, and L1 style PTEs.  (Note
>> that we did not need any L1 PTEs for the direct map, which always
>> uses 2MB or 1GB super-pages.)
>This is not quite true. In the initial state, indeed all PTEs for direct
>map are superpages, either 1G or 2M. But Intel states that a situation
>when the physical page has mappings with different caching modes causes
>undefined behaviour. As result, if a page is remapped with non-write
>back caching attributes, the direct map has to demote the superpage and
>adjust the mapping attribute of the page frame for the page.

Yes, this particular bit of description was restricted to the setup
work in create_pagetables().

(Perhaps I should take out "always", or substitute "initially"?)

Also, I think I left out a description of the loop where some KPDphys
entries are overwritten with 2MB mappings.
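
(For completeness, the loop I mean is the one in create_pagetables()
that re-covers the low kernel region with 2MB superpage entries right
after the 4K KPTphys mappings are built -- roughly, from memory:)

	/* Map from zero to end of allocations under 2M pages. */
	/* This replaces some of the KPTphys entries above. */
	for (i = 0; (i << PDRSHIFT) < *firstaddr; i++) {
		((pd_entry_t *)KPDphys)[i] = i << PDRSHIFT;
		((pd_entry_t *)KPDphys)[i] |= PG_RW | PG_V | PG_PS | PG_G;
	}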

>> +AMD64_HUGE		opt_global.h

>Is this option needed ? The SMAP is already parsed at the time of
>pmap_bootstrap() call, so you could determine the amount of physical
>memory and size the KVA map accordingly ?

Mostly I was afraid of the consequences of changing VM_MIN_KERNEL_ADDRESS,
whose header is #included so widely, and of any complaints people might have
about:

  - wasting NKPML4E-2 (i.e., 14) pages on small AMD64 systems (for
    the new, mostly empty KPDPphys L3 pages that will likely never be used);

  - "wasting" yet another page because dynamic memory will start
    at the first new L3 page (via KPML4BASE) instead of just using
    the KPML4I'th one because VM_MIN_KERNEL_ADDRESS is now at -8TB
    instead of -.5TB -- with VM_MIN_KERNEL_ADDRESS at -.5TB, all
    KVAs use the single KPML4I'th slot;

  - wasting 30 more pages because NDMPML4E grew from 2 to 32; and

  - adding a loop to set up NKPML4E entries in every pmap, instead
    of the single "shove KPDphys into one slot" code that used to
    be there, and making the pmap_pinit loop run 32 times instead
    of just 2 for the direct map.

Adding these up, the option chews up 45 pages, or 180 kbytes, when
compared to the current setup (1 TB direct map, .5 TB kernel VM).
180 kbytes is pretty trivial if you're planning to have a couple
of terabytes of RAM, but on a tiny machine ... of course if it's
that tiny you could run as i386, in 32 bit mode. :-)

If we copied the kernel's L4 table to new pmaps -- or even just
put in a new "ndmpdpphys" variable -- we could avoid allocating
any pages for DMPDPphys that we know won't actually be used.  That
would fix the "30 extra" pages above, and even regain one page on
many amd64 setups (those with <= 512 GB).  We'd be down to just 14
extra pages = 56 kbytes, and the new loop in pmap_pinit().  Here's
prototype code for sizing DMPDPphys, for illustration:

old:	DMPDPphys = allocpages(firstaddr, NDMPML4E);

new:	ndmpdpphys = howmany(ndmpdp, NPML4EPG);
	if (ndmpdpphys > NDMPML4E)
		panic("something or other"); /* or shrink to fit? */
	DMPDPphys = allocpages(firstaddr, ndmpdpphys);

and then instead of connecting NDMPML4E pages, connect ndmpdpphys
of them.  Would that break anything?  The direct mapped VA range
is known in advance; if you get a bad value it will "look" direct-
mapped, but will no longer have an L3 page under it, whereas before it
would always have an L3 page, just no L2 page.  (Offhand, I think
this would only affect pmap_enter(), and calling that for an
invalid physical address would be bad anyway.)
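
(Concretely, the "connect the direct map slots" loop in
create_pagetables() -- and the matching pmap_pinit() loop quoted near
the top -- would simply run to ndmpdpphys instead of NDMPML4E; a sketch,
assuming ndmpdpphys becomes a global alongside ndmpdp:)

	for (i = 0; i < ndmpdpphys; i++) {
		((pdp_entry_t *)KPML4phys)[DMPML4I + i] = (DMPDPphys +
		    (i << PAGE_SHIFT)) | PG_RW | PG_V | PG_U;
	}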

[I'm also not sure if we might be able to tweak the KPTphys usage
slightly to eliminate whole pages full of L1 PTEs, e.g., if the
GENERIC kernel occupies about 15 MB, we can map it with 7 2MB big
page entries in KPDphys, then just one "regular" PTE-page and 256
"regular" PTEs in the first actually-used page of KPTphys.  (This
would recover another 7 pages in this particular example.)  But
this would at least affect pmap_init()'s loop over nkpt entries to
initialize the vm page array entries that describe the KPTphys
area, so I did not attempt it.]
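
(Back-of-the-envelope, with the same illustrative 15 MB figure -- the
7 recovered pages come from comparing against the 8 KPTphys pages a
fully 4K-mapped 15 MB kernel needs, since each KPTphys page covers 2 MB:

	7 * 2MB PDEs in KPDphys        = 14 MB  (no KPTphys pages behind them)
	1 KPTphys page, 256 * 4K PTEs  =  1 MB  (the remainder)
	today: 8 KPTphys pages to cover 15 MB, so 8 - 1 = 7 pages saved.)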

Chris

