git: 811f08c1cf41 - main - amd64 pmap: add comment explaining TLB invalidation modes.

Konstantin Belousov kib at FreeBSD.org
Sun Jan 10 00:45:18 UTC 2021


The branch main has been updated by kib:

URL: https://cgit.FreeBSD.org/src/commit/?id=811f08c1cf417306fd9590e062625ca847cb572c

commit 811f08c1cf417306fd9590e062625ca847cb572c
Author:     Konstantin Belousov <kib at FreeBSD.org>
AuthorDate: 2020-12-21 17:51:02 +0000
Commit:     Konstantin Belousov <kib at FreeBSD.org>
CommitDate: 2021-01-10 00:44:42 +0000

    amd64 pmap: add comment explaining TLB invalidation modes.
    
    Requested and reviewed by:      alc
    Discussed with: markj
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D25815
---
 sys/amd64/amd64/pmap.c | 161 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 144 insertions(+), 17 deletions(-)

diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 245a94f1d5b9..5267203d7473 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -2693,28 +2693,155 @@ pmap_update_pde_invalidate(pmap_t pmap, vm_offset_t va, pd_entry_t newpde)
 		invltlb_glob();
 	}
 }
-#ifdef SMP
 
 /*
- * For SMP, these functions have to use the IPI mechanism for coherence.
+ * The amd64 pmap uses different approaches to TLB invalidation
+ * depending on the kernel configuration, available hardware features,
+ * and known hardware errata.  The kernel configuration option that
+ * has the greatest operational impact on TLB invalidation is PTI,
+ * which is enabled automatically on affected Intel CPUs.  The most
+ * impactful hardware features are first PCID, and then INVPCID
+ * instruction presence.  PCID usage is quite different for PTI
+ * vs. non-PTI.
  *
- * N.B.: Before calling any of the following TLB invalidation functions,
- * the calling processor must ensure that all stores updating a non-
- * kernel page table are globally performed.  Otherwise, another
- * processor could cache an old, pre-update entry without being
- * invalidated.  This can happen one of two ways: (1) The pmap becomes
- * active on another processor after its pm_active field is checked by
- * one of the following functions but before a store updating the page
- * table is globally performed. (2) The pmap becomes active on another
- * processor before its pm_active field is checked but due to
- * speculative loads one of the following functions stills reads the
- * pmap as inactive on the other processor.
- * 
- * The kernel page table is exempt because its pm_active field is
- * immutable.  The kernel page table is always active on every
- * processor.
+ * * Kernel Page Table Isolation (PTI or KPTI) is used to mitigate
+ *   the Meltdown bug in some Intel CPUs.  Under PTI, each user address
+ *   space is served by two page tables, user and kernel.  The user
+ *   page table only maps user space and a kernel trampoline.  The
+ *   kernel trampoline includes the entirety of the kernel text but
+ *   only the kernel data that is needed to switch from user to kernel
+ *   mode.  The kernel page table maps the user and kernel address
+ *   spaces in their entirety.  It is identical to the per-process
+ *   page table used in non-PTI mode.
+ *
+ *   User page tables are only used when the CPU is in user mode.
+ *   Consequently, some TLB invalidations can be postponed until the
+ *   switch from kernel to user mode.  In contrast, the user
+ *   space part of the kernel page table is used for copyout(9), so
+ *   TLB invalidations on this page table cannot be similarly postponed.
+ *
+ *   The existence of a user mode page table for the given pmap is
+ *   indicated by a pm_ucr3 value that differs from PMAP_NO_CR3, in
+ *   which case pm_ucr3 contains the %cr3 register value for the user
+ *   mode page table's root.
+ *
+ * * The pm_active bitmask indicates which CPUs currently have the
+ *   pmap active.  A CPU's bit is set on context switch to the pmap, and
+ *   cleared when the CPU switches away from it.  For the kernel page table,
+ *   the pm_active field is immutable and contains all CPUs.  The
+ *   kernel page table is always logically active on every processor,
+ *   but not necessarily in use by the hardware, e.g., in PTI mode.
+ *
+ *   When requesting invalidation of virtual addresses with
+ *   pmap_invalidate_XXX() functions, the pmap sends shootdown IPIs to
+ *   all CPUs recorded as active in pm_active.  Updates to and reads
+ *   from pm_active are not synchronized, and so they may race with
+ *   each other.  Shootdown handlers are prepared to handle the race.
+ *
+ * * PCID is an optional feature of the long mode x86 MMU where TLB
+ *   entries are tagged with the 'Process ID' of the address space
+ *   they belong to.  This feature provides a limited namespace for
+ *   process identifiers, 12 bits, supporting 4095 simultaneous IDs
+ *   total.
+ *
+ *   Allocation of a PCID to a pmap is done by an algorithm described
+ *   in section 15.12, "Other TLB Consistency Algorithms", of
+ *   Vahalia's book "Unix Internals".  A PCID cannot be allocated for
+ *   the whole lifetime of a pmap in pmap_pinit() due to the limited
+ *   namespace.  Instead, a per-CPU, per-pmap PCID is assigned when
+ *   the CPU is about to start caching TLB entries from a pmap,
+ *   i.e., on the context switch that activates the pmap on the CPU.
+ *
+ *   The PCID allocator maintains a per-CPU, per-pmap generation
+ *   count, pm_gen, which is incremented each time a new PCID is
+ *   allocated.  On TLB invalidation, the generation counters for the
+ *   pmap are zeroed, which signals the context switch code that the
+ *   previously allocated PCID is no longer valid.  Effectively,
+ *   zeroing any of these counters triggers a TLB shootdown for the
+ *   given CPU/address space, due to the allocation of a new PCID.
+ *
+ *   Zeroing can be performed remotely.  Consequently, if a pmap is
+ *   inactive on a CPU, then a TLB shootdown for that pmap and CPU can
+ *   be initiated by an ordinary memory access to reset the target
+ *   CPU's generation count within the pmap.  The CPU initiating the
+ *   TLB shootdown does not need to send an IPI to the target CPU.
+ *
+ * * PTI + PCID.  The available PCIDs are divided into two sets: PCIDs
+ *   for complete (kernel) page tables, and PCIDs for user mode page
+ *   tables.  A user PCID value is obtained from the kernel PCID value
+ *   by setting the highest bit, 11, to 1 (0x800 == PMAP_PCID_USER_PT).
+ *
+ *   User space page tables are activated on return to user mode, by
+ *   loading pm_ucr3 into %cr3.  If the PCPU(ucr3_load_mask) requests
+ *   clearing bit 63 of the loaded ucr3, this effectively causes
+ *   complete invalidation of the user mode TLB entries for the
+ *   current pmap, in which case local invalidations of individual
+ *   pages in the user page table are skipped.
+ *
+ * * Local invalidation, all modes.  If the requested invalidation is
+ *   for a specific address or the total invalidation of a currently
+ *   active pmap, then the TLB is flushed using INVLPG for a kernel
+ *   page table, and INVPCID(INVPCID_CTXGLOB)/invltlb_glob() for
+ *   user space page tables.
+ *
+ *   If the INVPCID instruction is available, it is used to flush entries
+ *   from the kernel page table.
+ *
+ * * PTI disabled, PCID present.  The kernel reserves PCID 0 for its
+ *   address space; the other 4095 PCIDs are used for user mode spaces
+ *   as described above.  A context switch allocates a new PCID if the
+ *   recorded PCID is zero or the recorded generation does not match the
+ *   CPU's generation, effectively flushing the TLB for this address space.
+ *   Total remote invalidation is performed by zeroing pm_gen for all CPUs.
+ *	local user page: INVLPG
+ *	local kernel page: INVLPG
+ *	local user total: INVPCID(CTX)
+ *	local kernel total: INVPCID(CTXGLOB) or invltlb_glob()
+ *	remote user page, inactive pmap: zero pm_gen
+ *	remote user page, active pmap: zero pm_gen + IPI:INVLPG
+ *	(Both actions are required to handle the aforementioned pm_active races.)
+ *	remote kernel page: IPI:INVLPG
+ *	remote user total, inactive pmap: zero pm_gen
+ *	remote user total, active pmap: zero pm_gen + IPI:(INVPCID(CTX) or
+ *          reload %cr3)
+ *	(See note above about pm_active races.)
+ *	remote kernel total: IPI:(INVPCID(CTXGLOB) or invltlb_glob())
+ *
+ * * PTI enabled, PCID present.
+ *	local user page: INVLPG for kpt, INVPCID(ADDR) or (INVLPG for ucr3)
+ *          for upt
+ *	local kernel page: INVLPG
+ *	local user total: INVPCID(CTX) or reload %cr3 for kpt, clear PCID_SAVE
+ *          on loading UCR3 into %cr3 for upt
+ *	local kernel total: INVPCID(CTXGLOB) or invltlb_glob()
+ *	remote user page, inactive pmap: zero pm_gen
+ *	remote user page, active pmap: zero pm_gen + IPI:(INVLPG for kpt,
+ *          INVPCID(ADDR) for upt)
+ *	remote kernel page: IPI:INVLPG
+ *	remote user total, inactive pmap: zero pm_gen
+ *	remote user total, active pmap: zero pm_gen + IPI:(INVPCID(CTX) for kpt,
+ *          clear PCID_SAVE on loading UCR3 into %cr3 for upt)
+ *	remote kernel total: IPI:(INVPCID(CTXGLOB) or invltlb_glob())
+ *
+ * * No PCID.
+ *	local user page: INVLPG
+ *	local kernel page: INVLPG
+ *	local user total: reload %cr3
+ *	local kernel total: invltlb_glob()
+ *	remote user page, inactive pmap: -
+ *	remote user page, active pmap: IPI:INVLPG
+ *	remote kernel page: IPI:INVLPG
+ *	remote user total, inactive pmap: -
+ *	remote user total, active pmap: IPI:(reload %cr3)
+ *	remote kernel total: IPI:invltlb_glob()
+ *  Since, on return to user mode, the reload of %cr3 with ucr3 causes TLB
+ *  invalidation, no specific action is required for the user page table.
+ *
+ * EPT.  EPT pmaps do not map KVA; all mappings are userspace.
+ * XXX TODO
  */
 
+#ifdef SMP
 /*
  * Interrupt the cpus that are executing in the guest context.
  * This will force the vcpu to exit and the cached EPT mappings

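The per-CPU, per-pmap generation counters described above (pm_gen, and the
lazy PCID allocation on context switch) can be modeled with a short userspace
sketch.  The code below is illustrative only, not FreeBSD code: every name
(toy_pmap, toy_switch_to, toy_invalidate_inactive, toy_cpu_gen, ...) is
hypothetical, and the model ignores locking, PCID namespace wrap, and the IPI
path needed for active pmaps.  It shows only the central trick: zeroing a
pmap's generation counter from another CPU forces the next switch-in on that
CPU to take a fresh PCID, discarding the stale TLB entries without an IPI.

/*
 * Editorial sketch, not FreeBSD code: a toy model of lazy PCID
 * allocation driven by per-CPU, per-pmap generation counters.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define	TOY_NCPU	4
#define	TOY_NPCID	4096		/* 12-bit PCID namespace */

struct toy_pcpu_pcid {
	uint64_t	gen;		/* plays the role of pm_gen */
	uint32_t	pcid;		/* plays the role of the per-CPU PCID */
};

struct toy_pmap {
	struct toy_pcpu_pcid	pcids[TOY_NCPU];
	_Atomic uint64_t	active;	/* plays the role of pm_active */
};

/* Per-CPU allocator state; generation 0 is reserved to mean "stale". */
static uint64_t	toy_cpu_gen[TOY_NCPU] = { 1, 1, 1, 1 };
static uint32_t	toy_cpu_next[TOY_NCPU];

/*
 * Context-switch path: if the recorded generation is stale, take a fresh
 * PCID, which makes any TLB entries cached under the old PCID unreachable.
 * Returns true when the caller would have to renew/flush the TLB.
 */
static bool
toy_switch_to(struct toy_pmap *pm, int cpu)
{
	struct toy_pcpu_pcid *p = &pm->pcids[cpu];
	bool flushed = false;

	if (p->gen != toy_cpu_gen[cpu]) {
		p->pcid = 1 + toy_cpu_next[cpu]++ % (TOY_NPCID - 1);
		p->gen = toy_cpu_gen[cpu];
		flushed = true;
	}
	atomic_fetch_or(&pm->active, UINT64_C(1) << cpu);
	return (flushed);
}

/*
 * Remote total invalidation for a CPU where the pmap is not active:
 * an ordinary store zeroing the generation is sufficient, no IPI is
 * needed; the stale generation is noticed on the next switch-in.
 */
static void
toy_invalidate_inactive(struct toy_pmap *pm, int cpu)
{
	pm->pcids[cpu].gen = 0;
}

int
main(void)
{
	static struct toy_pmap pm;

	printf("cpu0 switch-in, flush=%d\n", toy_switch_to(&pm, 0));
	printf("cpu0 switch-in again, flush=%d\n", toy_switch_to(&pm, 0));
	toy_invalidate_inactive(&pm, 0);	/* "shootdown" by a plain store */
	printf("cpu0 after remote zeroing, flush=%d\n", toy_switch_to(&pm, 0));
	return (0);
}

Built with any C11 compiler, the program prints flush=1, flush=0, flush=1,
mirroring the "remote user total, inactive pmap: zero pm_gen" rows in the
tables above.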

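Similarly, the PTI + PCID rules (user PCID = kernel PCID | 0x800, and the
bit 63 flush control applied when pm_ucr3 is loaded into %cr3) reduce to a
bit of arithmetic.  The sketch below is again illustrative and hypothetical
(toy_kcr3, toy_ucr3, the TOY_* constants, and the sample physical addresses
are made up); it assumes the architectural %cr3 layout with PCID enabled,
where bits 11:0 carry the PCID and a set bit 63 asks the CPU to preserve the
TLB entries tagged with that PCID.

/*
 * Editorial sketch, not FreeBSD code: composing kernel and user %cr3
 * values for a PTI + PCID pmap, and forcing a full flush of the user
 * PCID by clearing bit 63 on the next ucr3 load.
 */
#include <stdint.h>
#include <stdio.h>

#define	TOY_PCID_MASK		0xfffULL	/* PCID lives in CR3 bits 11:0 */
#define	TOY_PCID_USER_PT	0x800ULL	/* user-PT PCID = kernel PCID | bit 11 */
#define	TOY_CR3_PCID_SAVE	(1ULL << 63)	/* set: keep entries for this PCID */

/* Build the kernel-page-table %cr3 value: root | PCID, entries preserved. */
static uint64_t
toy_kcr3(uint64_t pml4_phys, uint64_t pcid)
{
	return (pml4_phys | (pcid & TOY_PCID_MASK) | TOY_CR3_PCID_SAVE);
}

/*
 * Build the user-page-table %cr3 value.  The mask argument models
 * PCPU(ucr3_load_mask): keeping bit 63 preserves the user TLB entries,
 * clearing it turns the load into a total invalidation of that PCID.
 */
static uint64_t
toy_ucr3(uint64_t upml4_phys, uint64_t pcid, uint64_t load_mask)
{
	return ((upml4_phys | (pcid & TOY_PCID_MASK) | TOY_PCID_USER_PT |
	    TOY_CR3_PCID_SAVE) & load_mask);
}

int
main(void)
{
	uint64_t kcr3, ucr3_keep, ucr3_flush;

	/* 0x123456000/0x123457000: made-up, page-aligned page-table roots. */
	kcr3 = toy_kcr3(0x123456000ULL, 0x042);
	ucr3_keep = toy_ucr3(0x123457000ULL, 0x042, ~0ULL);
	ucr3_flush = toy_ucr3(0x123457000ULL, 0x042, ~TOY_CR3_PCID_SAVE);

	printf("kcr3       = %#018llx\n", (unsigned long long)kcr3);
	printf("ucr3 keep  = %#018llx\n", (unsigned long long)ucr3_keep);
	printf("ucr3 flush = %#018llx\n", (unsigned long long)ucr3_flush);
	return (0);
}

Passing ~TOY_CR3_PCID_SAVE as the load mask models the PCPU(ucr3_load_mask)
case from the comment: the resulting %cr3 load invalidates all user mode
entries tagged with the pmap's user PCID.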