em? watchdog timeout 7-stable

Fri May 15 08:25:03 UTC 2009

Following up to myself, I experienced a watchdog timeout followed by
a lockup again early this morning.  Strangely, rather than happening at
a time of heavy activity, it seems to have occurred when there was
very little activity.

I was running 'systat' in a window when the watchdog timeout occurred
and the network disappeared, and it showed:

=== begin systat output ===

    2 users    Load  0.46  1.36  1.32                  May 15 05:29

Mem:KB    REAL            VIRTUAL                       VN PAGER   SWAP PAGER
        Tot   Share      Tot    Share    Free           in   out     in   out
Act   50544    5484   471736     8504 1789768  count
All  151492    8748 13158556    21348          pages
Proc:                                                            Interrupts
  r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        cow   16006 total
  1         162       360    4  132    6 1126             zfod        sio0 irq4
                                                          ozfod       fdc0 irq6
12.9%Sys  12.5%Intr  0.0%User  0.0%Nice 74.7%Idle        %ozfod       ata0 irq14
|    |    |    |    |    |    |    |    |    |    |       daefr     6 skc0 em0 1
======+++++++                                             prcfr       twa0 irq18
                                        39 dtbuf          totfr       em1 irq24
Namei     Name-cache   Dir-cache    100000 desvn          react  2000 cpu0: time
   Calls    hits   %    hits   %     84258 numvn          pdwak  2000 cpu3: time
                                     25000 frevn          pdpgs  2000 cpu1: time
                                                          intrn  2000 cpu2: time
Disks   da0   da1 pass0 pass1                      541836 wire   2000 cpu7: time
KB/t   0.00  0.00  0.00  0.00                       51628 act    2000 cpu6: time
tps       0     0     0     0                    13865008 inact  2000 cpu4: time
MB/s   0.00  0.00  0.00  0.00                      458052 cache  2000 cpu5: time
%busy     0     0     0     0                     1331716 free

=== end systat output ===

This time, I was able to break into the debugger from my console, with
the following result:

=== begin kdb output ===

KDB: enter: Line break on console
[thread pid 17 tid 100009 ]
Stopped at      kdb_enter_why+0x3d:     movq    $0,0x5d70d8(%rip)
db> panic
panic: from debugger
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
panic() at panic+0x182
db_panic() at db_panic+0x17
db_command() at db_command+0x1ef
db_command_loop() at db_command_loop+0x50
db_trap() at db_trap+0x89
kdb_trap() at kdb_trap+0x95
trap() at trap+0x264
calltrap() at calltrap+0x8
--- trap 0x3, rip = 0xffffffff804d07cd, rsp = 0xfffffffe800819d0, rbp = 0xfffffffe800819f0 ---
kdb_enter_why() at kdb_enter_why+0x3d
siointr1() at siointr1+0x2c5
siointr() at siointr+0x58
intr_execute_handlers() at intr_execute_handlers+0x8b
Xapic_isr1() at Xapic_isr1+0x7f
--- interrupt, rip = 0xffffffff80727c36, rsp = 0xfffffffe80081b90, rbp = 0xfffffffe80081ba0 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6
acpi_cpu_idle() at acpi_cpu_idle+0x19c
sched_idletd() at sched_idletd+0x46
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xfffffffe80081d30, rbp = 0 ---
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
mi_switch() at mi_switch+0x2a8
sched_bind() at sched_bind+0x58
boot() at boot+0x3f
panic() at panic+0x16c
db_panic() at db_panic+0x17
db_command() at db_command+0x1ef
db_command_loop() at db_command_loop+0x50
db_trap() at db_trap+0x89
kdb_trap() at kdb_trap+0x95
trap() at trap+0x264
calltrap() at calltrap+0x8
--- trap 0x3, rip = 0xffffffff804d07cd, rsp = 0xfffffffe800819d0, rbp = 0xfffffffe800819f0 ---
kdb_enter_why() at kdb_enter_why+0x3d
siointr1() at siointr1+0x2c5
siointr() at siointr+0x58
intr_execute_handlers() at intr_execute_handlers+0x8b
Xapic_isr1() at Xapic_isr1+0x7f
--- interrupt, rip = 0xffffffff80727c36, rsp = 0xfffffffe80081b90, rbp = 0xfffffffe80081ba0 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6
acpi_cpu_idle() at acpi_cpu_idle+0x19c
sched_idletd() at sched_idletd+0x46
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xfffffffe80081d30, rbp = 0 ---
db> bt
Tracing pid 17 tid 100009 td 0xffffff00013f3a50
kdb_enter_why() at kdb_enter_why+0x3d
siointr1() at siointr1+0x2c5
siointr() at siointr+0x58
intr_execute_handlers() at intr_execute_handlers+0x8b
Xapic_isr1() at Xapic_isr1+0x7f
--- interrupt, rip = 0xffffffff80727c36, rsp = 0xfffffffe80081b90, rbp = 0xfffffffe80081ba0 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6
acpi_cpu_idle() at acpi_cpu_idle+0x19c
sched_idletd() at sched_idletd+0x46
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xfffffffe80081d30, rbp = 0 ---

=== end kdb output ===

Kernel/world are 7-STABLE amd64, in sync, from sources csup'ed Thursday, 14 May 2009.

Other information re em1 on this machine:

# pciconf -lvb
em1 at pci0:7:1:0: class=0x020000 card=0x10028086 chip=0x10118086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82545EM Gigabit Ethernet Controller (Fiber)'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 64, base 0xda300000, size 131072, enabled
    bar   [20] = type I/O Port, range 32, base 0x5000, size 64, enabled

# vmstat -i
interrupt                          total       rate
irq4: sio0                          1479          0
irq6: fdc0                            10          0
irq14: ata0                           58          0
irq16: skc0 em0                   758850         85
irq18: twa0                      2085338        234
irq24: em1                             1          0
cpu0: timer                     17806226       1999
cpu3: timer                     17798161       1998
cpu2: timer                     17798127       1998
cpu1: timer                     17798043       1998
cpu5: timer                     17798058       1998
cpu6: timer                     17798161       1998
cpu4: timer                     17798160       1998
cpu7: timer                     17798160       1998
Total                          145238832      16311
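
To see whether any of these counters stall or jump around the time of a
timeout, the `vmstat -i` output could be snapshotted periodically and the
totals compared.  Below is a minimal sketch in Python, assuming the
three-column layout shown above; the names parse_vmstat_i and counter_delta
are made up for illustration, and the sample is a subset of the rows above.
(With POLLING enabled on em1, a near-zero count for irq24 is presumably
expected.)

```python
import re

# A few rows copied from the `vmstat -i` output above.
SAMPLE = """\
interrupt                          total       rate
irq16: skc0 em0                   758850         85
irq18: twa0                      2085338        234
irq24: em1                             1          0
cpu0: timer                     17806226       1999
Total                          145238832      16311
"""

# A data row is: a name (may contain spaces), then two integer columns.
ROW = re.compile(r"^(.*\S)\s+(\d+)\s+(\d+)\s*$")


def parse_vmstat_i(text):
    """Return {interrupt name: (total, rate)}, skipping the header and Total rows."""
    counters = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m and m.group(1) != "Total":
            counters[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return counters


def counter_delta(before, after):
    """Return how much each total grew between two parsed snapshots."""
    return {name: after[name][0] - before.get(name, (0, 0))[0] for name in after}


counters = parse_vmstat_i(SAMPLE)
print(counters["irq24: em1"])
```

Feeding two snapshots taken a minute apart to counter_delta would show
which interrupt sources are still advancing when the network goes away.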

# ifconfig em1
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=db<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,POLLING,VLAN_HWCSUM>
        ether 00:07:e9:1a:ae:dc
        inet 192.168.1.62 netmask 0xfffff800 broadcast 192.168.7.255
        media: Ethernet autoselect (1000baseLX <full-duplex>)
        status: active
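
As a sanity check that the address configuration itself is consistent, the
hex netmask and broadcast address reported by ifconfig can be recomputed.
This is a small sketch using Python's standard ipaddress module; the helper
name describe is hypothetical.

```python
import ipaddress


def describe(addr, hex_mask):
    """Return (network, broadcast address) for an address and an ifconfig-style hex netmask."""
    mask = ipaddress.IPv4Address(int(hex_mask, 16))
    # strict=False lets us pass a host address rather than the network address.
    net = ipaddress.ip_network(f"{addr}/{mask}", strict=False)
    return net, net.broadcast_address


net, bcast = describe("192.168.1.62", "0xfffff800")
print(net, bcast)  # 192.168.0.0/21 192.168.7.255
```

The result matches the broadcast address in the ifconfig output above, so
the netmask and broadcast settings at least agree with each other.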

Any ideas?

On Wed, May 13, 2009 at 06:44:38PM +0200, Greg Byshenk wrote:
> On Wed, May 13, 2009 at 06:42:07PM +0200, Greg Byshenk wrote:
> 
> > As a followup to my own previous message, I continue to have annoying 
> > problems with "em?: watchdog timeout" on one of my machines (now running
> > 7.2-STABLE as of 2009-05-08).
> > 
> > I have discontinued using the on-board (em, copper) NICs, and replaced
> > the original fibre NIC with a newer model, but the problem persists.
> > I've also set
> > 
> >    hw.pci.enable_msix=0
> >    hw.pci.enable_msi=0
> >    hw.em.rxd=1024
> >    hw.em.txd=1024
> >    net.inet.tcp.tso=0
> > 
> > ...as suggested in some discussions of this problem, and set the em1
> > interface to 'polling', all to no avail.  Frequently, though irregularly
> > (once or twice a day), the console begins to display
> > 
> >    em1: watchdog timeout -- resetting
> >    em1: watchdog timeout -- resetting
> >    em1: watchdog timeout -- resetting
> > 
> > the network is down, and the machine locks up.
> > 
> > [Note: I am getting 'em1' now instead of 'em0' as previously, but this
> > is due to changing all of the nics, which led to a different numbering;
> > the timeout is still occurring on the (main) interface, the fibre 
> > gigabit connection.]
> > 
> > What is particularly perverse (IMO) is that, since changing the NIC to
> > the newer model (and updating the kernel), I can no longer break to the
> > debugger when the lockup occurs (there is no response to the break) --
> > but I _can_ shut the machine down cleanly via hardware (a touch of the
> > power switch sends 'shutdown', and the machine shuts down cleanly --
> > after killing off processes waiting on network i/o).
> > 
> > The machine is running nfs and samba (3.2.10, from ports), and pretty
> > much nothing else.
> > 
> > 
> > Anyone have any ideas about this...?  I'm going mad with this.

-- 
greg byshenk  -  gbyshenk at byshenk.net  -  Leiden, NL