[Bug 261338] [PATCH] kernel panic "bad pte" on heavy CPU load on 12.2 and 12.3 (i386)

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 19 Jan 2022 16:15:23 UTC

            Bug ID: 261338
           Summary: [PATCH] kernel panic "bad pte" on heavy CPU load on
                    12.2 and 12.3 (i386)
           Product: Base System
           Version: 12.3-RELEASE
          Hardware: i386
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: threads
          Assignee: threads@FreeBSD.org
          Reporter: thedix@yandex.ru

Created attachment 231160
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=231160&action=edit
Panic screenshot

After updating to 12.2p12 and 12.3p1, I noticed kernel panics under heavy
multi-core CPU load.
One example of such a load is building the kernel with multiple parallel jobs.

Affected systems:
- 12.2p12 i386
- 12.3p1 i386

12.X amd64 is not affected; 13.0 is not affected at all.

Tested hardware:
- Virtual machine 8 vCPU 4 GB vRAM under VMWare ESXi 6.7
- HP MicroServer Gen8 Intel Xeon E3-1265Lv2 16 GB RAM
- PC Intel Core i5-4690 16 GB RAM

Steps to reproduce:
# cd /usr/src
# make -s -j`sysctl -n hw.ncpu` KERNCONF=GENERIC buildkernel

After some time the system hangs with a panic like:
TPTE at 0x2857f14  IS ZERO @ VA 247c5000
panic: bad pte
cpuid = 7
time = 1642334372
KDB: stack backtrace:
#0 0x10438ee at kdb_backtrace+0x4e
#1 0xffdb68 at vpanic+0x118
#2 0xffda44 at panic+0x14
#3 0x155b6d5 at pmap_remove_pages+0x5a5
#4 0x12fceb4 at vmspace_exit+0x94
#5 0xfbe0f3 at exit1+0x593
#6 0xfbdb52 at sys_sys_exit+0x12
#7 0x1561b79 at syscall+0x3e9
#8 0xffc033e7 at PTDpde+0x43ef

Additional stack info:
#0  0x00ffd9f6 in doadump () at /usr/src/sys/kern/kern_shutdown.c:370
370             savectx(&dumppcb);
(kgdb) #0  0x00ffd9f6 in doadump () at /usr/src/sys/kern/kern_shutdown.c:370
#1  0x00ffd831 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:452
#2  0x00ffdbbf in vpanic (fmt=0x15d448a "bad pte", ap=0x1ff80a10 "")
    at /usr/src/sys/kern/kern_shutdown.c:881
#3  0x00ffda44 in panic (fmt=0x15d448a "bad pte")
    at /usr/src/sys/kern/kern_shutdown.c:808
#4  0x0155b6d5 in pmap_remove_pages (pmap=0x22a0354c)
    at /usr/src/sys/i386/i386/pmap.c:4845
#5  0x012fceb4 in vmspace_exit (td=0x1bb57380) at /usr/src/sys/vm/vm_map.c:411
#6  0x00fbe0f3 in exit1 (td=0x1bb57380, rval=0, signo=0)
    at /usr/src/sys/kern/kern_exit.c:399
#7  0x00fbdb52 in sys_sys_exit (td=0x1bb57380, uap=0x1bb57604)
    at /usr/src/sys/kern/kern_exit.c:176
#8  0x01561b79 in syscall (frame=0x1ff80ba8)
    at src/sys/i386/i386/../../kern/subr_syscall.c:144
#9  0xffc033e7 in ?? ()
#10 0x00000033 in ?? ()

I investigated the kernel code and found that the problem was introduced by
recent changes to SMP handling in mp_x86.c.

The problem is in the function smp_targeted_tlb_shootdown():
-       sched_pin();
+       KASSERT(curthread->td_pinned > 0, ("curthread not pinned"));
Under some circumstances the calling thread is not pinned when this function is
entered, which later leads to the "bad pte" panic.
I recompiled the GENERIC kernel with the INVARIANTS option, added the function
name to the assertion text for additional information, and got an immediate
panic during boot (see the attached image panic_not_pinned.png).

So the fix is to revert this line:
-       KASSERT(curthread->td_pinned > 0, ("curthread not pinned"));
+       sched_pin();

I have attached the patch mp_x86.c.patch, which fixes the problem.
After rebuilding the kernel with this patch, I no longer see panics on either
12.2 or 12.3 during subsequent kernel builds.
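
To illustrate the difference between the two variants of the line, here is a
minimal user-space sketch. It is not the kernel code: struct mock_thread, the
mock_* helpers and the two shootdown_* functions are hypothetical stand-ins,
and the matching unpin at the end of the self-pinning variant is an assumption
that the diff above does not show. It only models the idea that an
assertion-only check relies on every caller having pinned the thread, while a
nested sched_pin() is safe either way.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct thread. */
struct mock_thread {
    int td_pinned;              /* pin nesting depth, modeled on td_pinned */
};

/* Models sched_pin(): pins nest, so pinning an already pinned thread is fine. */
static void mock_sched_pin(struct mock_thread *td)
{
    td->td_pinned++;
}

/* Models sched_unpin(): drops one level of pinning. */
static void mock_sched_unpin(struct mock_thread *td)
{
    assert(td->td_pinned > 0);
    td->td_pinned--;
}

/*
 * Variant A (assertion only): returns false where the KASSERT would fire.
 * Without INVARIANTS the check is compiled out and the real function would
 * simply continue unpinned.
 */
static bool shootdown_assert_only(const struct mock_thread *td)
{
    return td->td_pinned > 0;
}

/*
 * Variant B (self-pinning): pins on entry, so the body always runs pinned
 * regardless of what the caller did. The unpin on exit is an assumption.
 */
static bool shootdown_self_pin(struct mock_thread *td)
{
    bool pinned;

    mock_sched_pin(td);
    pinned = td->td_pinned > 0; /* the shootdown work would happen here */
    mock_sched_unpin(td);
    return pinned;
}
```

In this model, a caller that forgot to pin makes variant A fail its check,
while variant B still runs pinned and restores the caller's pin depth on exit.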
