Panics on IBM Bladecenter HS20/amd64 blades

Daniel Bond db at danielbond.org
Fri Oct 6 09:06:02 UTC 2006


Hi, 

FreeBSD has been running rock solid on the older i386 HS20s, but the newer ones
with an amd64 configuration keep panicking, and I can't quite figure out why.

Any help tracking this issue down would be greatly appreciated.
The panics happen at random, on average once every 2 days, though sometimes
with just 20 minutes between them. They always occur in the tcpserver process,
which suggests that this is a network-related issue(?).

Another problem is that the system can't reboot by itself, because there is
no keyboard controller, leaving the filesystems dirty (there is a
BROKEN_KEYBOARD_RESET option on i386, but not on amd64), so I have to reboot
the machine via the BladeCenter management interface to get it up again.

If there is anything I can do to provide more useful output, please let me know.

Trace:
------------------------------------------

mxtwo# kgdb kernel.debug /var/crash/vmcore.3

Unread portion of the kernel message buffer:
kernel trap 12 with interrupts disabled


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x18c
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0xffffffff802cf867
stack pointer           = 0x10:0xffffffffb3ff38b0
frame pointer           = 0x10:0x4
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 1363 (tcpserver)
trap number             = 12
panic: page fault
cpuid = 0
Uptime: 6m22s
Dumping 2047 MB (2 chunks)
  chunk 0: 1MB (154 pages) ... ok
  chunk 1: 2047MB (523966 pages) 2031 2015 1999 1983 1967 1951 1935 1919 1903
1887 1871 1855 1839 1823 1807 1791 1775 1759 1743 1727 1711 1695 1679 1663
1647 1631 1615 1599 1583 1567 1551 1535 1519 1503 1487 1471 1455 1439 1423
1407 1391 1375 1359 1343 1327 1311 1295 1279 1263 1247 1231 1215 1199 1183
1167 1151 1135 1119 1103 1087 1071 1055 1039 1023 1007 991 975 959 943 927 911
895 879 863 847 831 815 799 783 767 751 735 719 703 687 671 655 639 623 607
591 575 559 543 527 511 495 479 463 447 431 415 399 383 367 351 335 319 303
287 271 255 239 223 207 191 175 159 143 127 111 95 79 63 47 31 15

#0  doadump () at pcpu.h:172
172             __asm __volatile("movq %%gs:0,%0" : "=r" (td));
(kgdb) list *0xffffffff802cf867
0xffffffff802cf867 is in _mtx_lock_sleep (/usr/src/sys/kern/kern_mutex.c:544).
539                      * If the current owner of the lock is executing on another
540                      * CPU, spin instead of blocking.
541                      */
542                     owner = (struct thread *)(v & MTX_FLAGMASK);
543     #ifdef ADAPTIVE_GIANT
544                     if (TD_IS_RUNNING(owner)) {
545     #else
546                     if (m != &Giant && TD_IS_RUNNING(owner)) {
547     #endif
548                             turnstile_release(&m->mtx_object);
(kgdb) bt
#0  doadump () at pcpu.h:172
#1  0x0000000000000004 in ?? ()
#2  0xffffffff802d9bf7 in boot (howto=260) at
/usr/src/sys/kern/kern_shutdown.c:409
#3  0xffffffff802da291 in panic (fmt=0xffffff005b3fa4c0 "@Ó¯[") at
/usr/src/sys/kern/kern_shutdown.c:565
#4  0xffffffff80488bff in trap_fatal (frame=0xffffff005b3fa4c0,
eva=18446742975736173376) at /usr/src/sys/amd64/amd64/trap.c:660
#5  0xffffffff80489126 in trap (frame=
      {tf_rdi = 56, tf_rsi = -1097980730176, tf_rdx = 6, tf_rcx = 0, tf_r8 =
0, tf_r9 = 0, tf_rax = 1, tf_rbx = -1098015721464, tf_rbp = 4, tf_r10 =
-2037788432, tf_r11 = -1097980730176, tf_r12 = -1097980730176, tf_r13 =
-1097438414848, tf_r14 = 0, tf_r15 = 1, tf_trapno = 12, tf_addr = 396,
tf_flags = -2143116959, tf_err = 0, tf_rip = -2144536473, tf_cs = 8, tf_rflags
= 65538, tf_rsp = -1275119424, tf_ss = 16}) at
/usr/src/sys/amd64/amd64/trap.c:238
#6  0xffffffff8047449b in calltrap () at
/usr/src/sys/amd64/amd64/exception.S:168
#7  0xffffffff802cf867 in _mtx_lock_sleep (m=0xffffff005929b808,
tid=18446742975728821440, opts=6, file=0x0, line=0)
    at /usr/src/sys/kern/kern_mutex.c:542
#8  0xffffffff803826bd in ip_ctloutput (so=0x38, sopt=0xffffffffb3ff3b30) at
/usr/src/sys/netinet/ip_output.c:1193
#9  0xffffffff80393bd5 in tcp_ctloutput (so=0xffffff005a83b738,
sopt=0xffffffffb3ff3b30) at /usr/src/sys/netinet/tcp_usrreq.c:1038
#10 0xffffffff80322068 in sosetopt (so=0xffffff005a83b738,
sopt=0xffffffffb3ff3b30) at /usr/src/sys/kern/uipc_socket.c:1563
#11 0xffffffff80328536 in kern_setsockopt (td=0xffffff005b3fa4c0,
s=1619162408, level=56, name=0, val=0x0, valseg=UIO_USERSPACE, 
    valsize=2257178864) at /usr/src/sys/kern/uipc_syscalls.c:1351
#12 0xffffffff803285ae in setsockopt (td=0x38, uap=0xffffff005b3fa4c0) at
/usr/src/sys/kern/uipc_syscalls.c:1307
#13 0xffffffff80489a51 in syscall (frame=
      {tf_rdi = 0, tf_rsi = 0, tf_rdx = 1, tf_rcx = 0, tf_r8 = 0, tf_r9 =
140737488349992, tf_rax = 105, tf_rbx = 0, tf_rbp = 0, tf_r10 = 0, tf_r11 =
514, tf_r12 = 3, tf_r13 = 140737488350320, tf_r14 = 0, tf_r15 = 0, tf_trapno =
12, tf_addr = 5285992, tf_flags = 12, tf_err = 2, tf_rip = 34368089164, tf_cs
= 43, tf_rflags = 582, tf_rsp = 140737488350040, tf_ss = 35})
    at /usr/src/sys/amd64/amd64/trap.c:792
#14 0xffffffff80474638 in Xfast_syscall () at
/usr/src/sys/amd64/amd64/exception.S:270
#15 0x00000008007f6c4c in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) up 5
#5  0xffffffff80489126 in trap (frame=
      {tf_rdi = 56, tf_rsi = -1097980730176, tf_rdx = 6, tf_rcx = 0, tf_r8 =
0, tf_r9 = 0, tf_rax = 1, tf_rbx = -1098015721464, tf_rbp = 4, tf_r10 =
-2037788432, tf_r11 = -1097980730176, tf_r12 = -1097980730176, tf_r13 =
-1097438414848, tf_r14 = 0, tf_r15 = 1, tf_trapno = 12, tf_addr = 396,
tf_flags = -2143116959, tf_err = 0, tf_rip = -2144536473, tf_cs = 8, tf_rflags
= 65538, tf_rsp = -1275119424, tf_ss = 16}) at
/usr/src/sys/amd64/amd64/trap.c:238
238                             trap_fatal(&frame, frame.tf_addr);
(kgdb) up
#6  0xffffffff8047449b in calltrap () at
/usr/src/sys/amd64/amd64/exception.S:168
168             call    trap
Current language:  auto; currently asm
(kgdb) up
#7  0xffffffff802cf867 in _mtx_lock_sleep (m=0xffffff005929b808,
tid=18446742975728821440, opts=6, file=0x0, line=0)
    at /usr/src/sys/kern/kern_mutex.c:542
542                     owner = (struct thread *)(v & MTX_FLAGMASK);
Current language:  auto; currently c
(kgdb) list
537     #if defined(SMP) && !defined(NO_ADAPTIVE_MUTEXES)
538                     /*
539                      * If the current owner of the lock is executing on another
540                      * CPU, spin instead of blocking.
541                      */
542                     owner = (struct thread *)(v & MTX_FLAGMASK);
543     #ifdef ADAPTIVE_GIANT
544                     if (TD_IS_RUNNING(owner)) {
545     #else
546                     if (m != &Giant && TD_IS_RUNNING(owner)) {
(kgdb) up
#8  0xffffffff803826bd in ip_ctloutput (so=0x38, sopt=0xffffffffb3ff3b30) at
/usr/src/sys/netinet/ip_output.c:1193
1193                            INP_LOCK(inp);
(kgdb) list
1188                                                m->m_len);
1189                            if (error) {
1190                                    m_free(m);
1191                                    break;
1192                            }
1193                            INP_LOCK(inp);
1194                            error = ip_pcbopts(inp, sopt->sopt_name, m);
1195                            INP_UNLOCK(inp);
1196                            return (error);
1197                    }
(kgdb) print so
$1 = (struct socket *) 0x38
(kgdb) print sopt
$2 = (struct sockopt *) 0xffffffffb3ff3b30
(kgdb) up
#9  0xffffffff80393bd5 in tcp_ctloutput (so=0xffffff005a83b738,
sopt=0xffffffffb3ff3b30) at /usr/src/sys/netinet/tcp_usrreq.c:1038
1038                    error = ip_ctloutput(so, sopt);
(kgdb) list
1033    #ifdef INET6
1034                    if (INP_CHECK_SOCKAF(so, AF_INET6))
1035                            error = ip6_ctloutput(so, sopt);
1036                    else
1037    #endif /* INET6 */
1038                    error = ip_ctloutput(so, sopt);
1039                    return (error);
1040            }
1041            tp = intotcpcb(inp);
(kgdb) up
#10 0xffffffff80322068 in sosetopt (so=0xffffff005a83b738,
sopt=0xffffffffb3ff3b30) at /usr/src/sys/kern/uipc_socket.c:1563
1563                            return ((*so->so_proto->pr_ctloutput)
(kgdb) print so->so_proto->pr_ctloutput
$3 = (pr_ctloutput_t *) 0xffffffff80393ae0 <tcp_ctloutput>
(kgdb) list *0xffffffff80393ae0
0xffffffff80393ae0 is in tcp_ctloutput
(/usr/src/sys/netinet/tcp_usrreq.c:1016).
1011     */
1012    int
1013    tcp_ctloutput(so, sopt)
1014            struct socket *so;
1015            struct sockopt *sopt;
1016    {
1017            int     error, opt, optval;
1018            struct  inpcb *inp;
1019            struct  tcpcb *tp;
1020            struct  tcp_info ti;
(kgdb) up
#11 0xffffffff80328536 in kern_setsockopt (td=0xffffff005b3fa4c0,
s=1619162408, level=56, name=0, val=0x0, valseg=UIO_USERSPACE, 
    valsize=2257178864) at /usr/src/sys/kern/uipc_syscalls.c:1351
1351                    error = sosetopt(so, &sopt);
(kgdb) list
1346
1347            NET_LOCK_GIANT();
1348            error = getsock(td->td_proc->p_fd, s, &fp);
1349            if (error == 0) {
1350                    so = fp->f_data;
1351                    error = sosetopt(so, &sopt);
1352                    fdrop(fp, td);
1353            }
1354            NET_UNLOCK_GIANT();
1355            return(error);
(kgdb) up
#12 0xffffffff803285ae in setsockopt (td=0x38, uap=0xffffff005b3fa4c0) at
/usr/src/sys/kern/uipc_syscalls.c:1307
1307            return (kern_setsockopt(td, uap->s, uap->level, uap->name,
(kgdb) up
#13 0xffffffff80489a51 in syscall (frame=
      {tf_rdi = 0, tf_rsi = 0, tf_rdx = 1, tf_rcx = 0, tf_r8 = 0, tf_r9 =
140737488349992, tf_rax = 105, tf_rbx = 0, tf_rbp = 0, tf_r10 = 0, tf_r11 =
514, tf_r12 = 3, tf_r13 = 140737488350320, tf_r14 = 0, tf_r15 = 0, tf_trapno =
12, tf_addr = 5285992, tf_flags = 12, tf_err = 2, tf_rip = 34368089164, tf_cs
= 43, tf_rflags = 582, tf_rsp = 140737488350040, tf_ss = 35})
    at /usr/src/sys/amd64/amd64/trap.c:792
792                             error = (*callp->sy_call)(td, argp);
(kgdb) list
787                     if ((callp->sy_narg & SYF_MPSAFE) == 0) {
788                             mtx_lock(&Giant);
789                             error = (*callp->sy_call)(td, argp);
790                             mtx_unlock(&Giant);
791                     } else
792                             error = (*callp->sy_call)(td, argp);
793             }
794
795             switch (error) {
796             case 0:
#14 0xffffffff80474638 in Xfast_syscall () at
/usr/src/sys/amd64/amd64/exception.S:270
270             call    syscall
Current language:  auto; currently asm
(kgdb) list
265             movq    %r12,TF_R12(%rsp)       /* C preserved */
266             movq    %r13,TF_R13(%rsp)       /* C preserved */
267             movq    %r14,TF_R14(%rsp)       /* C preserved */
268             movq    %r15,TF_R15(%rsp)       /* C preserved */
269             FAKE_MCOUNT(TF_RIP(%rsp))
270             call    syscall
271             movq    PCPU(CURPCB),%rax
272             testq   $PCB_FULLCTX,PCB_FLAGS(%rax)
273             jne     3f
274     1:      /* Check for and handle AST's on return to userland */
(kgdb) up
#15 0x00000008007f6c4c in ?? ()
(kgdb) up
Initial frame selected; you cannot go up.
(kgdb) list
275             cli
276             movq    PCPU(CURTHREAD),%rax
277             testl   $TDF_ASTPENDING | TDF_NEEDRESCHED,TD_FLAGS(%rax)
278             je      2f
279             sti
280             movq    %rsp, %rdi
281             call    ast
282             jmp     1b
283     2:      /* restore preserved registers */
284             MEXITCOUNT


DMESG:
------------------------------------------
Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 6.1-STABLE #0: Tue Oct  3 08:33:25 CEST 2006
    root at mxtwo.nsn.no:/usr/obj/usr/src/sys/BladeSMP
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(TM) CPU 2.80GHz (2800.11-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0xf41  Stepping = 1
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,
HTT,TM,PBE>
  Features2=0x641d<SSE3,RSVD2,MON,DS_CPL,CNTX-ID,CX16,<b14>>
  AMD Features=0x20000800<SYSCALL,LM>
  Logical CPUs per core: 2
real memory  = 2147213312 (2047 MB)
avail memory = 2064486400 (1968 MB)
kbd0 at kbdmux0
cpu0 on motherboard
pcib0: <Host to PCI bridge> pcibus 0 on motherboard
pci0: <PCI bus> on pcib0
pci0: <unknown> at device 0.1 (no driver attached)
pcib1: <PCI-PCI bridge> at device 3.0 on pci0
pci4: <PCI bus> on pcib1
pcib2: <PCI-PCI bridge> at device 0.0 on pci4
pci6: <PCI bus> on pcib2
pcib3: <PCI-PCI bridge> at device 0.2 on pci4
pci5: <PCI bus> on pcib3
bge0: <Broadcom BCM5704 B0, ASIC rev. 0x2100> mem 0xdcff0000-0xdcffffff irq 7
at device 1.0 on pci5
bge0: Ethernet address: 00:14:5e:3c:94:b6
bge1: <Broadcom BCM5704 B0, ASIC rev. 0x2100> mem 0xdcfe0000-0xdcfeffff irq 5
at device 1.1 on pci5
bge1: Ethernet address: 00:14:5e:3c:94:b7
pci0: <base peripheral> at device 8.0 (no driver attached)
pcib4: <PCI-PCI bridge> at device 28.0 on pci0
pci2: <PCI bus> on pcib4
mpt0: <LSILogic 1030 Ultra4 Adapter> port 0x4000-0x40ff mem
0xdeff0000-0xdeffffff,0xdefe0000-0xdefeffff irq 10 at device 1.0 on pci2
mpt0: [GIANT-LOCKED]
mpt0: MPI Version=1.2.15.0
mpt0: Capabilities: ( RAID-1E RAID-1 SAFTE )
mpt0: 1 Active Volume (1 Max)
mpt0: 2 Hidden Drive Members (6 Max)
uhci0: <UHCI (generic) USB controller> port 0x2200-0x221f irq 10 at device
29.0 on pci0
uhci0: [GIANT-LOCKED]
usb0: <UHCI (generic) USB controller> on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: <UHCI (generic) USB controller> port 0x2600-0x261f irq 5 at device 29.1
on pci0
uhci1: [GIANT-LOCKED]
usb1: <UHCI (generic) USB controller> on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
pci0: <base peripheral> at device 29.4 (no driver attached)
pci0: <base peripheral, interrupt controller> at device 29.5 (no driver
attached)
pcib5: <PCI-PCI bridge> at device 30.0 on pci0
pci1: <PCI bus> on pcib5
pci1: <display, VGA> at device 1.0 (no driver attached)
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel 6300ESB UDMA100 controller> port
0x1f0-0x1f7,0x3f6,0x170-0x177,0x376 at device 31.1 on pci0
ata0: <ATA channel 0> on atapci0
ata1: <ATA channel 1> on atapci0
pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
orm0: <ISA Option ROM> at iomem 0xc0000-0xc8fff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
uhub2: Cypress Semiconductor 4 Port Hub, class 9/0, rev 1.10/0.01, addr 2
uhub2: 4 ports with 4 removable, bus powered
ukbd0: IBM PPC I/F, rev 1.10/0.01, addr 3, iclass 3/1
kbd1 at ukbd0
ums0: IBM PPC I/F, rev 1.10/0.01, addr 3, iclass 3/1
ums0: X report 0x0002 not supported
device_attach: ums0 attach returned 6
ukbd1: IBM HIDK/M, rev 1.10/0.01, addr 4, iclass 3/1
kbd2 at ukbd1
ums0: IBM HIDK/M, rev 1.10/0.01, addr 4, iclass 3/1
ums0: 3 buttons and Z dir.
Timecounter "TSC" frequency 2800109935 Hz quality 800
Timecounters tick every 1.000 msec
IP Filter: v4.1.8 initialized.  Default = pass all, Logging = enabled
Waiting 5 seconds for SCSI devices to settle
mpt0:vol0(mpt0:0:0): Settings ( Hot-Plug-Spares )
mpt0:vol0(mpt0:0:0): Using Spare Pool: 0
mpt0:vol0(mpt0:0:0): 2 Members:
      (mpt0:0:0): Primary
      (mpt0:0:1): Secondary
mpt0:vol0(mpt0:0:0): RAID-1 - Optimal
mpt0:vol0(mpt0:0:0): Status ( Enabled )
(mpt0:vol0:0): Physical (mpt0:0:0), Pass-thru (mpt0:1:0)
(mpt0:vol0:0): Online
(mpt0:vol0:1): Physical (mpt0:0:1), Pass-thru (mpt0:1:1)
(mpt0:vol0:1): Online
pass1 at mpt0 bus 1 target 0 lun 0
pass1: <IBM-ESXS ST973401LC    FN B41D> Fixed unknown SCSI-4 device 
pass1: 320.000MB/s transfers (160.000MHz, offset 63, 16bit), Tagged Queueing
Enabled
pass2 at mpt0 bus 1 target 1 lun 0
pass2: <IBM-ESXS ST973401LC    FN B41D> Fixed unknown SCSI-4 device 
pass2: 320.000MB/s transfers (160.000MHz, offset 63, 16bit), Tagged Queueing
Enabled
da0 at mpt0 bus 0 target 0 lun 0
da0: <LSILOGIC 1030 IM       IM 1000> Fixed Direct Access SCSI-2 device 
da0: 320.000MB/s transfers (160.000MHz, offset 63, 16bit), Tagged Queueing
Enabled
da0: 69878MB (143110144 512 byte sectors: 255H 63S/T 8908C)
Trying to mount root from ufs:/dev/da0s1a
WARNING: / was not properly dismounted
WARNING: /usr was not properly dismounted
bge1: link state changed to UP



-- 
Med vennlig hilsen / Best regards,

------------------------------------------

  Daniel Bond         
  PGP: C822C4BD        
  
------------------------------------------

