kern/148698: Panic at boot due to integer overflow computing
top->cg_mask in smp_topo_none() when mp_ncpus == 32
Joe Landers
jlanders at vmware.com
Fri Jul 16 23:10:08 UTC 2010
>Number: 148698
>Category: kern
>Synopsis: Panic at boot due to integer overflow computing top->cg_mask in smp_topo_none() when mp_ncpus == 32
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Fri Jul 16 23:10:07 UTC 2010
>Closed-Date:
>Last-Modified:
>Originator: Joe Landers
>Release: FreeBSD 8.0 (amd64)
>Organization:
VMware, Inc
>Environment:
Any system with 32 CPUs.
>Description:
Attempting to boot FreeBSD 8.0 (amd64) on a machine with 32 CPUs causes the kernel to panic. panic() calls boot(), and boot() calls thread_lock(curthread). However, curthread->td_lock is NULL at that point, so the system takes repeated page faults and keeps printing the trap output below to the screen.
FreeBSD/SMP: Multiprocessor System Detected: 32 CPUs
FreeBSD/SMP: 32 package(s) x 1 core(s)
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1
cpu2 (AP): APIC ID: 2
cpu3 (AP): APIC ID: 3
cpu4 (AP): APIC ID: 4
cpu5 (AP): APIC ID: 5
cpu6 (AP): APIC ID: 6
cpu7 (AP): APIC ID: 7
cpu8 (AP): APIC ID: 8
cpu9 (AP): APIC ID: 9
cpu10 (AP): APIC ID: 10
cpu11 (AP): APIC ID: 11
cpu12 (AP): APIC ID: 12
cpu13 (AP): APIC ID: 13
cpu14 (AP): APIC ID: 14
cpu15 (AP): APIC ID: 15
cpu16 (AP): APIC ID: 16
cpu17 (AP): APIC ID: 17
cpu18 (AP): APIC ID: 18
cpu19 (AP): APIC ID: 19
cpu20 (AP): APIC ID: 20
cpu21 (AP): APIC ID: 21
cpu22 (AP): APIC ID: 22
cpu23 (AP): APIC ID: 23
cpu24 (AP): APIC ID: 24
cpu25 (AP): APIC ID: 25
cpu26 (AP): APIC ID: 26
cpu27 (AP): APIC ID: 27
cpu28 (AP): APIC ID: 28
cpu29 (AP): APIC ID: 29
cpu30 (AP): APIC ID: 30
cpu31 (AP): APIC ID: 31
panic: Built bad topology at 0xffffffff8081fc20. CPU mask 0x0 != 0xFFFFFFFF
cpuid = 0
kernel trap 12 with interrupts disabled
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x18
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80322e28
stack pointer = 0x28:0xffffffff809b0b20
frame pointer = 0x28:0xffffffff809b0b60
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 0 (swapper)
trap number = 12
panic: page fault
cpuid = 0
kernel trap 12 with interrupts disabled
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x18
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80322e28
stack pointer = 0x28:0xffffffff809b0790
frame pointer = 0x28:0xffffffff809b07d0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 0 (swapper)
trap number = 12
panic: page fault
cpuid = 0
kernel trap 12 with interrupts disabled
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x18
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80322e28
stack pointer = 0x28:0xffffffff809b0400
frame pointer = 0x28:0xffffffff809b0440
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 0 (swapper)
trap number = 12
panic: page fault
cpuid = 0
kernel trap 12 with interrupts disabled
.. <repeats> ...
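When the hardware has 32 CPUs, the computation of top->cg_mask in smp_topo_none() overflows and becomes zero. The expression (1 << mp_ncpus) shifts a 32-bit int by its full width; the x86 shl instruction uses only the low 5 bits of the count for a 32-bit operand, so the shift yields 1 and the subtraction yields 0. This causes smp_topo() to call panic() with the message about the topology mismatch.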
struct cpu_group *
smp_topo_none(void)
{
	struct cpu_group *top;

	top = &group[0];
	top->cg_parent = NULL;
	top->cg_child = NULL;
	top->cg_mask = (1 << mp_ncpus) - 1; <<<<<<<<<<<<<<<<
	top->cg_count = mp_ncpus;
	top->cg_children = 0;
	top->cg_level = CG_SHARE_NONE;
	top->cg_flags = 0;
	return (top);
}
0xffffffff805b9743 <smp_topo_none+51>: shl %cl,%eax <<<<<<<<
0xffffffff805b9745 <smp_topo_none+53>: mov %cl,6632169(%rip) # 0xffffffff80c0ca34 <group+20>
0xffffffff805b974b <smp_topo_none+59>: movb $0x0,6632165(%rip) # 0xffffffff80c0ca37 <group+23>
0xffffffff805b9752 <smp_topo_none+66>: sub $0x1,%eax
0xffffffff805b9755 <smp_topo_none+69>: mov %eax,6632149(%rip) # 0xffffffff80c0ca30 <group+16>
0xffffffff805b975b <smp_topo_none+75>: leaveq
0xffffffff805b975c <smp_topo_none+76>: mov $0xffffffff80c0ca20,%rax
0xffffffff805b9763 <smp_topo_none+83>: retq
static int
start_all_aps(void)
{
	..
		all_cpus |= (1 << cpu); /* record AP in CPU map */
	}
	..
}
struct cpu_group *
smp_topo(void)
{
	/*
	 * Verify the returned topology.
	 */
	if (top->cg_count != mp_ncpus)
		panic("Built bad topology at %p. CPU count %d != %d",
		    top, top->cg_count, mp_ncpus);
	if (top->cg_mask != all_cpus) <<<<<<
		panic("Built bad topology at %p. CPU mask 0x%X != 0x%X",
		    top, top->cg_mask, all_cpus);
	return (top);
}
(gdb) x/xw 0xc00000 + (0xffffffff80c0ca20 & 0x1fffff) + 0x10
0xc0ca30: 0x00000000 // top->cg_mask
(gdb) x/xw 0xc00000 + (0xffffffff80c0ca20 & 0x1fffff) + 0x14
0xc0ca34: 0x00000020 // top->cg_count
(gdb) p &all_cpus
$16 = (cpumask_t *) 0xffffffff80c0c9b8
(gdb) x/xw 0xc00000 + (0xffffffff80c0c9b8 & 0x1fffff)
0xc0c9b8: 0xffffffff // all_cpus
(gdb) p &mp_ncpus
$15 = (int *) 0xffffffff80c0c9b0
(gdb) x/xw 0xc00000 + (0xffffffff80c0c9b0 & 0x1fffff)
0xc0c9b0: 0x00000020 // mp_ncpus
0x121db30: 0xffffff00bfed7000 0x0000000000000104
0x121db40: 0xffffffff80951948 0xffffffff80be7600
0x121db50: 0x0000000000000000 0x0000000000000000
0x121db60: 0xffffffff8121dbb0 0xffffffff8057f41f // boot+47
0xffffffff8057f418 <boot+40>: xor %esi,%esi
0xffffffff8057f41a <boot+42>: callq 0xffffffff80571290 <_thread_lock_flags>
0xffffffff8057f41f <boot+47>: mov %gs:0x0,%rdi
0x121db70: 0xffffffff80951948 0xffffffff80be7600
0x121db80: 0x0000000000000000 0x0000000000000104
0x121db90: 0xffffffff80951948 0xffffffff80be7600
0x121dba0: 0x0000000000000000 0x0000000000000000
0x121dbb0: 0xffffffff8121dcb0 0xffffffff8057fcfc // panic+332
0x121dbc0: 0x0000003000000020 0xffffffff8121dcc0
0x121dbd0: 0xffffffff8121dbe0 0x00000000bfed9560
0x121dbe0: 0xffffffff8121dc70 0xffffffff80c0ca20
0x121dbf0: 0x0000000000000000 0x00000000ffffffff
0x121dc00: 0x0000000000000000 0x0000000000000001
0x121dc10: 0xffffff00bfed94e0 0x0000000200000202
0x121dc20: 0xffffff00bfed94e8 0x0000010280b60920
0x121dc30: 0x0000000000000000 0xffffffff80b60920
0x121dc40: 0xffffffff8121dc80 0xffffffff8056db9b
0x121dc50: 0x00000000023a3c80 0xffffff00bfed94c0
0x121dc60: 0x00000000023a3c80 0xffffff00bfed94c0
0x121dc70: 0xffffff00023b0d00 0xffffffff8098dfe8
0x121dc80: 0xffffffff80bf1240 0x0000000000000000
0x121dc90: 0x0000000000000000 0xffffffff8098dfe8
0x121dca0: 0xffffffff80bf1240 0x0000000000000000
0x121dcb0: 0xffffffff8121dcc0 0xffffffff805b9bc7 // smp_topo+263
0xffffffff80951948 <basefix.2920+424>: "Built bad topology at %p. CPU mask 0x%X != 0x%X"
0xffffffff805b9bb9 <smp_topo+249>: mov $0xffffffff80951948,%rdi
0xffffffff805b9bc0 <smp_topo+256>: xor %eax,%eax
0xffffffff805b9bc2 <smp_topo+258>: callq 0xffffffff8057fbb0 <panic>
>How-To-Repeat:
Our engineering team originally came across this issue while testing FreeBSD-8.0 (amd64) with a large number of virtual CPUs, running as a guest on VMware ESX.
Subsequently, engineering was able to reproduce the issue on physical hardware by booting the FreeBSD-8.0 (amd64) install disk on an IBM x3950 (4 node) Athena M2. The physical system panics in the same way.
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted: