expanding amd64 past the 1TB limit
Oliver Pinter
oliver.pntr at gmail.com
Thu Jun 27 22:32:03 UTC 2013
On 6/27/13, Chris Torek <torek at elf.torek.net> wrote:
> OK, I wasted :-) way too much time, but here's a text file that
> can be comment-ified or stored somewhere alongside the code or
> whatever...
>
> (While drawing this I realized that there's at least one "wasted"
> page if the machine has .5 TB or less: we can just leave zero
> slots in the corresponding L4 direct-map entries. But that would
> require switching to the bcopy() method also mentioned below. Or
> indexing into vmspace0.vm_pmap.pm_pml4, which is basically the
> same thing.)
>
> Chris
>
> -----
>
> There are six -- or sometimes five -- sets of pages allocated here
> at boot time to map physical memory in two ways. Note that each
> page, regardless of level, stores 512 PTEs (or PDEs or PDPs, but
> let's just use PTE here and prefix it with "level" as needed: 4,
> 3, 2, or 1.)
>
> There is one page for the top level, L4, page table entries. Each
> L4 PTE maps 512 GB of space. Unless it's marked "invalid", no L4
> PTE can be marked "stop here": it either is marked as "this
> address is invalid", or it points to one physically-addressed page
> full of L3 PTEs. Eventually, those L3 PTEs will map-or-reject
> half a terabyte. 512 entries, each mapping .5 TB, allow us to map
> 256 TB, which is as much as the hardware supports (there are, in
> effect, only 48 virtual address bits: the top 16 bits must match
> the 47th bit).
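
As a sanity check on the numbers above, here is a minimal Python sketch of the size arithmetic. The function and constant names are made up for illustration and are not taken from the kernel source.

```python
# Size arithmetic for 4-level amd64 page tables (illustrative names only).
PAGE_SHIFT = 12          # 4 KB base pages
ENTRIES_PER_PAGE = 512   # each table page holds 512 entries, i.e. 9 index bits

GB = 1 << 30
TB = 1 << 40

def bytes_mapped_by_entry(level):
    """Bytes covered by one entry at the given level (1 = lowest, 4 = top)."""
    return 1 << (PAGE_SHIFT + 9 * (level - 1))

def is_canonical(va):
    """True if bits 63..48 of a 64-bit address all match bit 47."""
    top17 = va >> 47     # bits 63..47 of the address
    return top17 == 0 or top17 == (1 << 17) - 1
```

With these, `bytes_mapped_by_entry(4)` is 512 GB, and 512 such entries cover 256 TB, which matches the 48 usable virtual address bits.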
>
> The L4 entry halfway down, at PML4PML4I, is set to point back to
> this page itself; that's the "recursive page table" for user
> space, which we do nothing else with at boot time.
>
> We need (up to) NDMPML4E pages, each holding 512 L3 PTEs, for the
> direct map space. If the processor supports 1 GB pages, an L3 PTE
> can be marked with "stop here" and these L3 PTEs each grant (or
> forbid) access to 1 GB of physical space at a time. A system
> with, say, 3 GB of RAM starting at 0 can map it all with three L3
> PTEs: "address 0 is valid for 1GB", "address 1GB is valid for
> 1GB", "address 2GB is valid for 1GB". The remaining L3 PTEs are
> zero, making the remaining address space invalid.
>
> If the processor does not support 1 GB pages, or if there is less
> than 1 GB of RAM "at the end" (e.g., if the system has 4.5 GB),
> the L3 PTEs may need to point to more pages holding L2 PTEs.
> These L2 PTEs always support 2 MB pages. Each page of L2 PTEs
> maps 1 GB. So a machine with 4.5 GB and 1 GB mappings needs one L3
> page with four valid 1 GB L3 PTEs and then one L3 PTE pointing to
> one page of L2 PTEs. That one page of L2 PTEs is half-filled,
> containing 256 2MB-sized PTEs, mapping the last 512 MB. The remaining
> half of that page is zero, making the remaining addresses invalid.
>
> Pictorially, and adding the names of the physical page(s), thus
> far we have this. (Note, the L4 PTE page is drawn more than twice
> as tall as the L3 and L2 pages, just to get space for arrows.)
>
>             LEVEL 4:               LEVEL 3:            LEVEL 2:
>                       _._
>            KPML4phys v   \
>             +---------+   |
>             | 0:      |   |
>             |---------|   |
>             | 1:      |   |        DMPDPphys           DMPDphys
>             (   ...   )   |    .-> +---------+         +----------------+
>             | 127:    |   |   /    | 0: 0GB  |     .-> | 0: 4GB         |
>             |---------|   |  |     | 1: 1GB  |    /    | 1: 4GB+2MB     |
> PML4PML4I:  | 128: *--|--/   |     | 2: 2GB  |   /     | 2: 4GB+4MB     |
>             |---------|      |     | 3: 3GB  |  /      (      ...       )
>             | 129:    |      |     | 4:   *--|-/       | 255: 4.5GB-2MB |
>             |  ...    |      |     | 5:      |         | 256:           |
>   ________  |---------|      |     (   ...   )         | 257:           |
>  / DMPML4I: |      *--|-----/      | 511:    |         (      ...       )
> NDMPML4E    |---------|            +---------+         +----------------+
>  \________  |      *--|----------> | 0:      |
>             |---------|            | 1:      |
>             |         |            | 2:      |  (These are used only
>             |---------|            | 3:      |   if the system has more
>             |  ...    |            (   ...   )   than 512 GB)
>           ( |---------|      )     | 509:    |
>           ( | 510: see below )     | 510:    |
>           ( |---------|      )     | 511:    |
>           ( | 511: see below )     +---------+
>             +---------+
>
> If the hardware supports 1GB pages, "ndm1g" is the number of
> gigabyte entries (4 in the example above). Otherwise it's just
> zero. Meanwhile "ndmpdp" is the number of gigabytes of RAM that
> need to be mapped, in this case 5. Thus, if ndmpdp > ndm1g, we
> need ndmpdp-ndm1g pages to hold some L2 PTEs.
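
To make the counting concrete, here is a rough Python sketch of the bookkeeping described above. Only the names ndm1g and ndmpdp come from the text. The function names are hypothetical, and the sketch ignores any minimum or rounding rules the real boot code may apply.

```python
GB = 1 << 30
MB2 = 2 << 20      # one L2 entry maps a 2 MB superpage
NPDPEPG = 512      # entries per page-table page

def direct_map_counts(ram_bytes, have_1g_pages):
    """Return (ndmpdp, ndm1g, l2_pages, l3_pages) for a RAM size in bytes."""
    ndmpdp = -(-ram_bytes // GB)                     # gigabytes to map, rounded up
    ndm1g = ram_bytes // GB if have_1g_pages else 0  # whole GBs mapped by 1 GB L3 entries
    l2_pages = ndmpdp - ndm1g                        # one page of L2 entries per remaining GB
    l3_pages = -(-ndmpdp // NPDPEPG)                 # one page of L3 entries per 512 GB
    return ndmpdp, ndm1g, l2_pages, l3_pages

def tail_l2_entries(ram_bytes):
    """Valid 2 MB entries in the last, partly filled page of L2 entries."""
    return -(-(ram_bytes % GB) // MB2)
```

For the 4.5 GB example with 1 GB pages this gives ndmpdp = 5, ndm1g = 4, one page of L2 entries (256 of them valid) and one page of L3 entries. Without 1 GB pages, all five gigabytes need their own page of L2 entries.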
>
> Now we get to the weirder case of the kernel itself (both its
> non-direct-mapped dynamically allocated virtual memory, and its
> text/data/bss). The branch offset limitations encourage the
> placement of the kernel's text, etc., in the last 2 GB of virtual
> space, i.e., starting at 0xffff.ffff.8000.0000. But, we want
> a reasonable amount of room for dynamic VM. So we give the kernel
> at least 512 GB of VM -- that's one L4 PTE -- while making sure that
> the text snuggles up close to the end of the space, in that last 2 GB
> of the at-least-512-GB area.
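
A small Python sketch may help show where the last 2 GB lands in the tables. It splits a virtual address into its four 9-bit table indices. The helper name pt_indices is invented here, and the start of the last 2 GB is computed as 2^64 - 2^31 rather than copied from any header.

```python
def pt_indices(va):
    """Split a 64-bit virtual address into (L4, L3, L2, L1) table indices."""
    va48 = va & ((1 << 48) - 1)   # only the low 48 bits are translated
    return tuple((va48 >> shift) & 0x1ff for shift in (39, 30, 21, 12))

# Start of the last 2 GB of the 64-bit address space.
LAST_2GB = (1 << 64) - (2 << 30)

# Start of the last 1 GB.
LAST_1GB = (1 << 64) - (1 << 30)
```

Here `pt_indices(LAST_2GB)` is `(511, 510, 0, 0)` and `pt_indices(LAST_1GB)` is `(511, 511, 0, 0)`: the kernel text area uses the final L4 slot and, within it, the final two L3 slots.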
>
> Meanwhile, the boot loader has loaded the kernel into relatively
> low physical memory addresses.
>
> If KPML4I is 511 (and it actually is), this uses the final L4 slot
> to map the kernel. If we want to allow kernel VM to have more
> than 512 GB available, though, we need extra space below KPML4I,
> i.e., starting at KPML4BASE. So we allocate NKPML4E pages that
> we set up as L3 PTEs, and point the last NKPML4E slots in the L4
> page table here. If NKPML4E is 4, for instance, we will have
> this:
>
>     last part of KPML4phys:
>         (   ...   )     .----> [page #0 of all-zero L3 PTEs]
>         | DMPML4I |    /
>         (   ...   )   |   .--> [page #1 of all-zero L3 PTEs]
>         | 507:    |   |  /
>         | 508: *--|--/  |  .-> [page #2 of all-zero L3 PTEs]
>         | 509: *--|----/  |
>         | 510: *--|------/
>         | 511: *--|---------> [page #3 of L3 PTEs, see below]
>         +---------+
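
The slot numbers in the picture follow from simple arithmetic, sketched below in Python. The helper names are invented for illustration. The only facts assumed are that the kernel takes the last NKPML4E slots of the 512-entry L4 page and that each slot covers 512 GB.

```python
NPML4EPG = 512   # entries in the L4 page

def kernel_l4_slots(nkpml4e):
    """L4 slot numbers used for kernel VM: the last nkpml4e slots."""
    return list(range(NPML4EPG - nkpml4e, NPML4EPG))

def l4_slot_base_va(index):
    """Lowest virtual address covered by an L4 slot, sign-extended to 64 bits."""
    va = index << 39                 # each L4 slot covers 2**39 bytes (512 GB)
    if va & (1 << 47):               # upper half: copy bit 47 into bits 63..48
        va |= 0xffff << 48
    return va
```

With NKPML4E = 4 the slots are 508 through 511, so kernel VM would start at `l4_slot_base_va(508)`, i.e. 0xfffffe0000000000, and span 2 TB.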
>
> The reason for having those "empty" (all-zero) PTE pages is that
> whenever new processes are created, in pmap_pinit(), they have
> their (new) L4 PTE page set up to point to the *same* physical
> pages that the kernel is using. Thus, if the kernel creates or
> destroys any level-3-or-below mapping by writing into any of the
> above four pages, that mapping is also created/destroyed in all
> processes. Similarly, the NDMPML4E pages starting at DMPDPphys are
> mapped identically in all processes. The kernel can therefore
> "borrow" a user pmap at any time, i.e., there's no need to adjust
> the CPU's CR3 on entry to the kernel.
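
The sharing argument can be seen in a toy model. The Python below is not the real pmap_pinit(); it only mimics the idea that every new L4 page gets its kernel slots pointed at the same L3 pages, using Python lists as stand-ins for physical pages. All class and function names are hypothetical.

```python
class Pmap:
    """Toy pmap: just an L4 page with 512 slots (None = invalid)."""
    def __init__(self):
        self.pml4 = [None] * 512

def make_kernel_pmap(nkpml4e):
    """Pre-allocate one all-zero L3 page for each of the last nkpml4e slots."""
    kpmap = Pmap()
    for i in range(512 - nkpml4e, 512):
        kpmap.pml4[i] = [0] * 512
    return kpmap

def pinit(kernel_pmap, nkpml4e):
    """Create a process pmap whose kernel slots share the kernel's L3 pages."""
    new = Pmap()
    for i in range(512 - nkpml4e, 512):
        new.pml4[i] = kernel_pmap.pml4[i]   # same page object, not a copy
    return new
```

Because the L3 pages exist before any process is created, a later write such as `kpmap.pml4[509][3] = entry` shows up in every process pmap without touching their L4 pages. Had slot 509 been left empty at boot, each existing L4 page would need updating when it was first filled in.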
>
> (If we used bcopy() to copy the kernel pmap's NKPML4E and NDMPML4E
> entries into the new pmap, the L3 pages would not have to be
> physically contiguous, but the KVA ones would still all have to
> exist. It's free to allocate physically contiguous pages here
> anyway though.)
>
> So, the last NKPML4E slots in KPML4phys point to the following
> page tables, which use all of L3, L2, and L1 style PTEs. (Note
> that we did not need any L1 PTEs for the direct map, which always
> uses 2MB or 1GB super-pages.)
>
>         LEVEL 3:              LEVEL 2:              LEVEL 1:
>
>     (assuming NKPML4E=4)                              (nkpt pages)
>          KPDPphys                                        KPTphys
>         +---------+                                 +---------------+
> page 0  | 0:      |                             .-> | 0: 0 KB       |
>         | 1:      |                            /    | 1: 4 KB       |
>         | 2:      |                           /     | 2: 8 KB       |
>         | 3:      |                          /      | 3: 12 KB      |
>         (   ...   )                         |       (      ...      )
>         | 509:    |                         |       | 509: 2MB-12KB |
>         | 510:    |                         |       | 510: 2MB-8KB  |
>         | 511:    |                         |       | 511: 2MB-4KB  |
>         +---------+                         |       +---------------+
> page 1  | 0:      |                         |   .-> | 0: 2 MB       |
>         | 1:      |                         |  /    | 1: 2MB+4KB    |
>         | 2:      |                         | |     (      ...      )
>         | 3:      |                         | |     (      ...      )
>         (   ...   )                         | |     +---------------+
>         | 509:    |                         | | .-> (      ...      )
>         | 510:    |                         | | |   (      ...      )
>         | 511:    |             KPDphys     | | |   +---------------+
>         +---------+           +---------+   | | | ..(  ... ... ...  )
> page 2  | 0:      |     .---> | 0:   *--|--/  | | .   [etc]
>         | 1:      |    /      | 1:   *--|----/  | .
>         | 2:      |   |       | 2:   *--|------/ .
>         | 3:      |   |       | 3:   *--|---.....
>         (   ...   )   |       (   ...   )
>         | 509:    |   |       | 509: ...|
>         | 510:    |   |       | 510: ...|
>         | 511:    |   |       | 511: ...|
>         +---------+   |       +---------+
> page 3  | 0:      |   |   .-> | 0:   ...|
>         | 1:      |   |  /    (   ...   )
>         | 2:      |   | |     (   ...   )
>         | 3:      |   | |     (   ...   )
>         (   ...   )   | |     (   ...   )
>         | 509:    |   | |     (   ...   )
>         | 510: *--|--/  |     (   ...   )
>         | 511: *--|----/      | 511:    |
>         +---------+           +---------+
>
> There are nkpdpe pages at KPDphys, where nkpdpe is either 1 or 2.
> One page maps 1 GB, and the other page maps the remaining 1 GB.
> Remember that kernel text+data+bss lives in the final 2 GB of the
> virtual address space, so there cannot be more than 2 GB. These
> one or two pages map nkpt pages at KPTphys.
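
For what it's worth, the relation between nkpt and nkpdpe can be sketched as below. The text only says nkpdpe is 1 or 2; deriving it as a rounded-up division of nkpt by 512 is my assumption, and the function names are made up.

```python
PAGE_SIZE = 4096
NPTEPG = 512     # L1 entries per page: one L1 page maps 512 * 4 KB = 2 MB
NPDEPG = 512     # L2 entries per page: one L2 page maps 512 * 2 MB = 1 GB

def l2_pages_needed(nkpt):
    """Pages of L2 entries needed to point at nkpt pages of L1 entries (assumed rule)."""
    return -(-nkpt // NPDEPG)

def bytes_mapped_by_l1_pages(nkpt):
    """Virtual space covered by nkpt pages of L1 entries."""
    return nkpt * NPTEPG * PAGE_SIZE
```

Under that assumption, up to 512 L1 pages (1 GB) fit under one L2 page, and the 2 GB ceiling corresponds to nkpt = 1024 with two L2 pages.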
Added two VM gurus to CC.