git: 280cfe2264d7 - stable/15 - amd64: fix INVLPGB range invalidation

From: Kyle Evans <kevans_at_FreeBSD.org>
Date: Thu, 23 Apr 2026 13:49:12 UTC
The branch stable/15 has been updated by kevans:

URL: https://cgit.FreeBSD.org/src/commit/?id=280cfe2264d7bf2199e5a41bdcbb9acb49d059c1

commit 280cfe2264d7bf2199e5a41bdcbb9acb49d059c1
Author:     Kyle Evans <kevans@FreeBSD.org>
AuthorDate: 2026-04-20 20:18:17 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2026-04-23 13:48:45 +0000

    amd64: fix INVLPGB range invalidation
    
    AMD64 Architecture Programmer's Manual Volume 3 says the following:
    
    > ECX[15:0] contains a count of the number of sequential pages to
    > invalidate in addition to the original virtual address, starting from
    > the virtual address specified in rAX. A count of 0 invalidates a
    > single page. ECX[31]=0 indicates to increment the virtual address at
    > the 4K boundary. ECX[31]=1 indicates to increment the virtual address
    > at the 2M boundary. The maximum count supported is reported in
    > CPUID function 8000_0008h, EDX[15:0].
    
    ECX[31] is the bit we call INVLPGB_2M_CNT; setting it tells the
    instruction to increment the VA by 2M.
    
    > This instruction invalidates the TLB entry or entries, regardless of
    > the page size (4 Kbytes, 2 Mbytes, 4 Mbytes, or 1 Gbyte). [...]
    
    Combined with this, my interpretation of the current code is: if
    <va> is aligned on a PDE boundary, we'll use INVLPGB_2M_CNT to try to
    invalidate <cnt> PDEs with a single call, but that only works if <va>
    is the start of at least <cnt> 2M pages.  If <va> or any of the
    subsequent PDEs isn't actually a superpage, then we only invalidate
    the *first* 4K page within that PDE before skipping to the next PDE,
    leaving the remaining 4K pages in between as they were.
    
    The implication would seem to be that we would need to inspect the range
    that we're trying to invalidate if we're planning on using
    INVLPGB_2M_CNT at all, so this patch just simplifies it to a series of
    4K invalidations.  My gut feeling is that we likely still come out on
    top vs. the TLB shootdown we're avoiding.
    
    This seems to explain some issues we've seen lately with fdgrowtable()
    and kqueue on recent Zen4/Zen5 EPYC hardware, where we experienced
    corruption that we couldn't explain.
    
    PR:             293382
    Reviewed by:    alc, kib, markj
    
    (cherry picked from commit 1b8e5c02f5c07521129e06ff8ab7c660238fd75c)
---
 sys/amd64/amd64/mp_machdep.c | 25 ++++++-------------------
 1 file changed, 6 insertions(+), 19 deletions(-)

diff --git a/sys/amd64/amd64/mp_machdep.c b/sys/amd64/amd64/mp_machdep.c
index 91f725c93158..1de6fe9227c7 100644
--- a/sys/amd64/amd64/mp_machdep.c
+++ b/sys/amd64/amd64/mp_machdep.c
@@ -726,25 +726,12 @@ smp_masked_invlpg_range(vm_offset_t addr1, vm_offset_t addr2, pmap_t pmap,
 		addr2 = round_page(addr2);
 		total = atop(addr2 - addr1);
 		for (va = addr1; total > 0;) {
-			if ((va & PDRMASK) != 0 || total < NPDEPG) {
-				cnt = atop(NBPDR - (va & PDRMASK));
-				if (cnt > total)
-					cnt = total;
-				if (cnt > invlpgb_maxcnt + 1)
-					cnt = invlpgb_maxcnt + 1;
-				invlpgb(INVLPGB_GLOB | INVLPGB_VA | va, 0,
-				    cnt - 1);
-				va += ptoa(cnt);
-				total -= cnt;
-			} else {
-				cnt = total / NPTEPG;
-				if (cnt > invlpgb_maxcnt + 1)
-					cnt = invlpgb_maxcnt + 1;
-				invlpgb(INVLPGB_GLOB | INVLPGB_VA | va, 0,
-				    INVLPGB_2M_CNT | (cnt - 1));
-				va += cnt << PDRSHIFT;
-				total -= cnt * NPTEPG;
-			}
+			cnt = MIN(total, invlpgb_maxcnt + 1);
+			/* 4K increments because these may not be superpages. */
+			invlpgb(INVLPGB_GLOB | INVLPGB_VA | va, 0,
+			    cnt - 1);
+			va += ptoa(cnt);
+			total -= cnt;
 		}
 		tlbsync();
 		sched_unpin();