kern/148698: Panic at boot due to integer overflow computing top->cg_mask in smp_topo_none() when mp_ncpus == 32

Joe Landers jlanders at vmware.com
Fri Jul 16 23:10:08 UTC 2010


>Number:         148698
>Category:       kern
>Synopsis:       Panic at boot due to integer overflow computing top->cg_mask in smp_topo_none() when mp_ncpus == 32
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Jul 16 23:10:07 UTC 2010
>Closed-Date:
>Last-Modified:
>Originator:     Joe Landers
>Release:        FreeBSD 8.0 (amd64)
>Organization:
VMware, Inc
>Environment:
Any system with 32 CPUs.
>Description:
Attempting to boot FreeBSD 8.0 (amd64) on a machine with 32 CPUs causes the kernel to panic. panic() calls boot(), and boot() calls thread_lock(curthread); however, curthread->td_lock is NULL at that point, so the system takes repeated page faults and keeps printing them to the screen.

FreeBSD/SMP: Multiprocessor System Detected: 32 CPUs
FreeBSD/SMP: 32 package(s) x 1 core(s)
 cpu0 (BSP): APIC ID: 0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  2
 cpu3 (AP): APIC ID:  3
 cpu4 (AP): APIC ID:  4
 cpu5 (AP): APIC ID:  5
 cpu6 (AP): APIC ID:  6
 cpu7 (AP): APIC ID:  7
 cpu8 (AP): APIC ID:  8
 cpu9 (AP): APIC ID:  9
 cpu10 (AP): APIC ID: 10
 cpu11 (AP): APIC ID: 11
 cpu12 (AP): APIC ID: 12
 cpu13 (AP): APIC ID: 13
 cpu14 (AP): APIC ID: 14
 cpu15 (AP): APIC ID: 15
 cpu16 (AP): APIC ID: 16
 cpu17 (AP): APIC ID: 17
 cpu18 (AP): APIC ID: 18
 cpu19 (AP): APIC ID: 19
 cpu20 (AP): APIC ID: 20
 cpu21 (AP): APIC ID: 21
 cpu22 (AP): APIC ID: 22
 cpu23 (AP): APIC ID: 23
 cpu24 (AP): APIC ID: 24
 cpu25 (AP): APIC ID: 25
 cpu26 (AP): APIC ID: 26
 cpu27 (AP): APIC ID: 27
 cpu28 (AP): APIC ID: 28
 cpu29 (AP): APIC ID: 29
 cpu30 (AP): APIC ID: 30
 cpu31 (AP): APIC ID: 31
panic: Built bad topology at 0xffffffff8081fc20.  CPU mask 0x0 != 0xFFFFFFFF
cpuid = 0
kernel trap 12 with interrupts disabled


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x18
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80322e28
stack pointer           = 0x28:0xffffffff809b0b20
frame pointer           = 0x28:0xffffffff809b0b60
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 0 (swapper)
trap number             = 12
panic: page fault
cpuid = 0
kernel trap 12 with interrupts disabled

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x18
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80322e28
stack pointer           = 0x28:0xffffffff809b0790
frame pointer           = 0x28:0xffffffff809b07d0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 0 (swapper)
trap number             = 12
panic: page fault
cpuid = 0
kernel trap 12 with interrupts disabled


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x18
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80322e28
stack pointer           = 0x28:0xffffffff809b0400
frame pointer           = 0x28:0xffffffff809b0440
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 0 (swapper)
trap number             = 12
panic: page fault
cpuid = 0
kernel trap 12 with interrupts disabled

.. <repeats> ...

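When the hardware has 32 CPUs, the computation of top->cg_mask in smp_topo_none() overflows: 1 << mp_ncpus shifts a 32-bit int by its full width, and the compiled code (the shl %cl,%eax shown below) leaves the mask at zero instead of 0xFFFFFFFF. smp_topo() then finds that top->cg_mask does not match all_cpus and calls panic() with the topology mismatch message.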

struct cpu_group *
smp_topo_none(void)
{
 struct cpu_group *top;

 top = &group[0];
 top->cg_parent = NULL;
 top->cg_child = NULL;
 top->cg_mask = (1 << mp_ncpus) - 1; <<<<<<<<<<<<<<<<
 top->cg_count = mp_ncpus;
 top->cg_children = 0;
 top->cg_level = CG_SHARE_NONE;
 top->cg_flags = 0;
 
 return (top);
}

0xffffffff805b9743 <smp_topo_none+51>: shl %cl,%eax <<<<<<<<
0xffffffff805b9745 <smp_topo_none+53>: mov %cl,6632169(%rip) # 0xffffffff80c0ca34 <group+20>
0xffffffff805b974b <smp_topo_none+59>: movb $0x0,6632165(%rip) # 0xffffffff80c0ca37 <group+23>
0xffffffff805b9752 <smp_topo_none+66>: sub $0x1,%eax
0xffffffff805b9755 <smp_topo_none+69>: mov %eax,6632149(%rip) # 0xffffffff80c0ca30 <group+16>
0xffffffff805b975b <smp_topo_none+75>: leaveq
0xffffffff805b975c <smp_topo_none+76>: mov $0xffffffff80c0ca20,%rax
0xffffffff805b9763 <smp_topo_none+83>: retq
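
The arithmetic can be reproduced in isolation. The following is a minimal sketch, not the kernel's code: mask32_t is a hypothetical stand-in for the 32-bit mask type seen in the gdb output, x86_shl32() models the fact that a 32-bit shl uses only the low 5 bits of the count in %cl, and mask_safe() is just one possible overflow-free way to build the mask, offered as an assumption rather than as the fix adopted by the project.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the 32-bit CPU mask type in the report. */
typedef uint32_t mask32_t;

/*
 * Model of "shl %cl,%eax": for a 32-bit operand the CPU uses only the
 * low 5 bits of the count, so a count of 32 behaves like a count of 0.
 */
static mask32_t
x86_shl32(mask32_t value, unsigned int count)
{
	return (value << (count & 31));
}

/* What the compiled code effectively computes for (1 << ncpus) - 1. */
static mask32_t
mask_as_compiled(unsigned int ncpus)
{
	return (x86_shl32(1, ncpus) - 1);
}

/* One overflow-safe way to set the low ncpus bits (1 <= ncpus <= 32). */
static mask32_t
mask_safe(unsigned int ncpus)
{
	assert(ncpus >= 1 && ncpus <= 32);
	return ((mask32_t)0xFFFFFFFFu >> (32 - ncpus));
}
```

With these helpers, mask_as_compiled(32) yields 0, matching the "CPU mask 0x0" in the panic message, while mask_safe(32) yields 0xFFFFFFFF, the value held in all_cpus. For 31 CPUs or fewer both functions agree, which is consistent with the panic appearing only at exactly 32 CPUs.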

static int
start_all_aps(void)
{
..
  all_cpus |= (1 << cpu); /* record AP in CPU map */
 }
..
}

struct cpu_group *
smp_topo(void)
{
 /*
  * Verify the returned topology.
  */
 if (top->cg_count != mp_ncpus)
  panic("Built bad topology at %p. CPU count %d != %d",
      top, top->cg_count, mp_ncpus);
 if (top->cg_mask != all_cpus) <<<<<<
  panic("Built bad topology at %p. CPU mask 0x%X != 0x%X",
      top, top->cg_mask, all_cpus);
 return (top);
}

(gdb) x/xw 0xc00000 + (0xffffffff80c0ca20 & 0x1fffff) + 0x10
0xc0ca30: 0x00000000 // top->cg_mask
(gdb) x/xw 0xc00000 + (0xffffffff80c0ca20 & 0x1fffff) + 0x14
0xc0ca34: 0x00000020 // top->cg_count

(gdb) p &all_cpus
$16 = (cpumask_t *) 0xffffffff80c0c9b8
(gdb) x/xw 0xc00000 + (0xffffffff80c0c9b8 & 0x1fffff)
0xc0c9b8: 0xffffffff // all_cpus

(gdb) p &mp_ncpus
$15 = (int *) 0xffffffff80c0c9b0
(gdb) x/xw 0xc00000 + (0xffffffff80c0c9b0 & 0x1fffff)
0xc0c9b0: 0x00000020 // mp_ncpus
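
The x/xw commands above all use the same address arithmetic: keep the low 21 bits of the kernel virtual address, add the base 0xc00000, then add the field offset. The sketch below only reproduces that arithmetic with the constants taken from the commands as given; the report does not say what the 0xc00000 base represents, and the names DEBUG_BASE, REGION_MASK and debug_addr() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Constants copied from the gdb commands in the report. */
#define DEBUG_BASE	UINT64_C(0xc00000)	/* base added in each x/xw command */
#define REGION_MASK	UINT64_C(0x1fffff)	/* low 21 bits of the address */

/* Address examined for a kernel virtual address plus a field offset. */
static uint64_t
debug_addr(uint64_t kva, uint64_t field_offset)
{
	return (DEBUG_BASE + (kva & REGION_MASK) + field_offset);
}

/* Check the sketch against the addresses printed in the report. */
static void
check_report_values(void)
{
	/* group[0] at 0xffffffff80c0ca20: cg_mask at +0x10, cg_count at +0x14 */
	assert(debug_addr(UINT64_C(0xffffffff80c0ca20), 0x10) == UINT64_C(0xc0ca30));
	assert(debug_addr(UINT64_C(0xffffffff80c0ca20), 0x14) == UINT64_C(0xc0ca34));
	/* all_cpus and mp_ncpus */
	assert(debug_addr(UINT64_C(0xffffffff80c0c9b8), 0) == UINT64_C(0xc0c9b8));
	assert(debug_addr(UINT64_C(0xffffffff80c0c9b0), 0) == UINT64_C(0xc0c9b0));
}
```

The offsets 0x10 and 0x14 agree with the stores to group+16 and group+20 in the smp_topo_none() disassembly.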



0x121db30: 0xffffff00bfed7000 0x0000000000000104
0x121db40: 0xffffffff80951948 0xffffffff80be7600
0x121db50: 0x0000000000000000 0x0000000000000000
0x121db60: 0xffffffff8121dbb0 0xffffffff8057f41f // boot+47


0xffffffff8057f418 <boot+40>: xor %esi,%esi
0xffffffff8057f41a <boot+42>: callq 0xffffffff80571290 <_thread_lock_flags>
0xffffffff8057f41f <boot+47>: mov %gs:0x0,%rdi

0x121db70: 0xffffffff80951948 0xffffffff80be7600
0x121db80: 0x0000000000000000 0x0000000000000104
0x121db90: 0xffffffff80951948 0xffffffff80be7600
0x121dba0: 0x0000000000000000 0x0000000000000000
0x121dbb0: 0xffffffff8121dcb0 0xffffffff8057fcfc // panic+332
0x121dbc0: 0x0000003000000020 0xffffffff8121dcc0
0x121dbd0: 0xffffffff8121dbe0 0x00000000bfed9560
0x121dbe0: 0xffffffff8121dc70 0xffffffff80c0ca20
0x121dbf0: 0x0000000000000000 0x00000000ffffffff
0x121dc00: 0x0000000000000000 0x0000000000000001
0x121dc10: 0xffffff00bfed94e0 0x0000000200000202
0x121dc20: 0xffffff00bfed94e8 0x0000010280b60920
0x121dc30: 0x0000000000000000 0xffffffff80b60920
0x121dc40: 0xffffffff8121dc80 0xffffffff8056db9b
0x121dc50: 0x00000000023a3c80 0xffffff00bfed94c0
0x121dc60: 0x00000000023a3c80 0xffffff00bfed94c0
0x121dc70: 0xffffff00023b0d00 0xffffffff8098dfe8
0x121dc80: 0xffffffff80bf1240 0x0000000000000000
0x121dc90: 0x0000000000000000 0xffffffff8098dfe8
0x121dca0: 0xffffffff80bf1240 0x0000000000000000
0x121dcb0: 0xffffffff8121dcc0 0xffffffff805b9bc7 // smp_topo+263

0xffffffff80951948 <basefix.2920+424>: "Built bad topology at %p. CPU mask 0x%X != 0x%X"

0xffffffff805b9bb9 <smp_topo+249>: mov $0xffffffff80951948,%rdi
0xffffffff805b9bc0 <smp_topo+256>: xor %eax,%eax
0xffffffff805b9bc2 <smp_topo+258>: callq 0xffffffff8057fbb0 <panic>


>How-To-Repeat:
Our engineering team originally came across this issue while testing FreeBSD 8.0 (amd64) with a large number of virtual CPUs as a guest on VMware ESX.

Subsequently, engineering reproduced the issue on physical hardware by booting the FreeBSD 8.0 (amd64) install disk on an IBM x3950 (4 node) Athena M2. The physical system panics identically.
>Fix:


>Release-Note:
>Audit-Trail:
>Unformatted:

