em? watchdog timeout 7-stable
Greg Byshenk
freebsd at byshenk.net
Fri May 15 08:25:03 UTC 2009
Following up to myself, I experienced a watchdog timeout followed by
lockup again early this morning. Strangely, rather than happening at
a time of heavy activity, it seems to have occurred when there was
very little activity.
I was running 'systat' in a window when the watchdog timeout occurred
and the network disappeared, and it showed:
=== begin systat output ===
2 users Load 0.46 1.36 1.32 May 15 05:29
Mem:KB REAL VIRTUAL VN PAGER SWAP PAGER
Tot Share Tot Share Free in out in out
Act 50544 5484 471736 8504 1789768 count
All 151492 8748 13158556 21348 pages
Proc: Interrupts
r p d s w Csw Trp Sys Int Sof Flt cow 16006 total
1 162 360 4 132 6 1126 zfod sio0 irq4
ozfod fdc0 irq6
12.9%Sys 12.5%Intr 0.0%User 0.0%Nice 74.7%Idle %ozfod ata0 irq14
| | | | | | | | | | | daefr 6 skc0 em0 1
======+++++++ prcfr twa0 irq18
39 dtbuf totfr em1 irq24
Namei Name-cache Dir-cache 100000 desvn react 2000 cpu0: time
Calls hits % hits % 84258 numvn pdwak 2000 cpu3: time
25000 frevn pdpgs 2000 cpu1: time
intrn 2000 cpu2: time
Disks da0 da1 pass0 pass1 541836 wire 2000 cpu7: time
KB/t 0.00 0.00 0.00 0.00 51628 act 2000 cpu6: time
tps 0 0 0 0 13865008 inact 2000 cpu4: time
MB/s 0.00 0.00 0.00 0.00 458052 cache 2000 cpu5: time
%busy 0 0 0 0 1331716 free
=== end systat output ===
This time, I was able to break into the debugger from my console, with
the following result:
=== begin kdb output ===
KDB: enter: Line break on console
[thread pid 17 tid 100009 ]
Stopped at kdb_enter_why+0x3d: movq $0,0x5d70d8(%rip)
db> panic
panic: from debugger
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
panic() at panic+0x182
db_panic() at db_panic+0x17
db_command() at db_command+0x1ef
db_command_loop() at db_command_loop+0x50
db_trap() at db_trap+0x89
kdb_trap() at kdb_trap+0x95
trap() at trap+0x264
calltrap() at calltrap+0x8
--- trap 0x3, rip = 0xffffffff804d07cd, rsp = 0xfffffffe800819d0, rbp = 0xfffffffe800819f0 ---
kdb_enter_why() at kdb_enter_why+0x3d
siointr1() at siointr1+0x2c5
siointr() at siointr+0x58
intr_execute_handlers() at intr_execute_handlers+0x8b
Xapic_isr1() at Xapic_isr1+0x7f
--- interrupt, rip = 0xffffffff80727c36, rsp = 0xfffffffe80081b90, rbp = 0xfffffffe80081ba0 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6
acpi_cpu_idle() at acpi_cpu_idle+0x19c
sched_idletd() at sched_idletd+0x46
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xfffffffe80081d30, rbp = 0 ---
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
mi_switch() at mi_switch+0x2a8
sched_bind() at sched_bind+0x58
boot() at boot+0x3f
panic() at panic+0x16c
db_panic() at db_panic+0x17
db_command() at db_command+0x1ef
db_command_loop() at db_command_loop+0x50
db_trap() at db_trap+0x89
kdb_trap() at kdb_trap+0x95
trap() at trap+0x264
calltrap() at calltrap+0x8
--- trap 0x3, rip = 0xffffffff804d07cd, rsp = 0xfffffffe800819d0, rbp = 0xfffffffe800819f0 ---
kdb_enter_why() at kdb_enter_why+0x3d
siointr1() at siointr1+0x2c5
siointr() at siointr+0x58
intr_execute_handlers() at intr_execute_handlers+0x8b
Xapic_isr1() at Xapic_isr1+0x7f
--- interrupt, rip = 0xffffffff80727c36, rsp = 0xfffffffe80081b90, rbp = 0xfffffffe80081ba0 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6
acpi_cpu_idle() at acpi_cpu_idle+0x19c
sched_idletd() at sched_idletd+0x46
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xfffffffe80081d30, rbp = 0 ---
db> bt
Tracing pid 17 tid 100009 td 0xffffff00013f3a50
kdb_enter_why() at kdb_enter_why+0x3d
siointr1() at siointr1+0x2c5
siointr() at siointr+0x58
intr_execute_handlers() at intr_execute_handlers+0x8b
Xapic_isr1() at Xapic_isr1+0x7f
--- interrupt, rip = 0xffffffff80727c36, rsp = 0xfffffffe80081b90, rbp = 0xfffffffe80081ba0 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6
acpi_cpu_idle() at acpi_cpu_idle+0x19c
sched_idletd() at sched_idletd+0x46
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xfffffffe80081d30, rbp = 0 ---
=== end kdb output ===
Kernel/world are 7-STABLE amd64, in sync, from sources csup'ed Thursday, 14 May 2009.
Other information re em1 on this machine:
# pciconf -lvb
em1 at pci0:7:1:0: class=0x020000 card=0x10028086 chip=0x10118086 rev=0x01 hdr=0x00
vendor = 'Intel Corporation'
device = '82545EM Gigabit Ethernet Controller (Fiber)'
class = network
subclass = ethernet
bar [10] = type Memory, range 64, base 0xda300000, size 131072, enabled
bar [20] = type I/O Port, range 32, base 0x5000, size 64, enabled
# vmstat -i
interrupt total rate
irq4: sio0 1479 0
irq6: fdc0 10 0
irq14: ata0 58 0
irq16: skc0 em0 758850 85
irq18: twa0 2085338 234
irq24: em1 1 0
cpu0: timer 17806226 1999
cpu3: timer 17798161 1998
cpu2: timer 17798127 1998
cpu1: timer 17798043 1998
cpu5: timer 17798058 1998
cpu6: timer 17798161 1998
cpu4: timer 17798160 1998
cpu7: timer 17798160 1998
Total 145238832 16311
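As a rough cross-check of the table above, here is a minimal Python sketch. It assumes the "rate" column is simply total divided by seconds since boot, and it estimates those seconds from the cpu0 timer row. The row data is copied from the output above; the variable names are only illustrative. The single interrupt on irq24 would be consistent with em1 running in polling mode.

```python
# Rows copied from the "vmstat -i" output above (timer rows other than cpu0 omitted).
raw = """\
irq4: sio0 1479 0
irq6: fdc0 10 0
irq14: ata0 58 0
irq16: skc0 em0 758850 85
irq18: twa0 2085338 234
irq24: em1 1 0
cpu0: timer 17806226 1999
"""

# Split each row from the right: the last two fields are total and rate.
interrupts = {}
for line in raw.splitlines():
    name, total, rate = line.rsplit(None, 2)
    interrupts[name] = (int(total), int(rate))

# Estimate seconds since boot from the cpu0 timer row (total / rate).
# The rate is a rounded integer, so this is only approximate.
timer_total, timer_rate = interrupts["cpu0: timer"]
uptime_s = timer_total / timer_rate

# Recompute each device's average rate from the estimated uptime.
for name, (total, rate) in interrupts.items():
    print(f"{name:18s} total={total:>9d} reported={rate:>5d} "
          f"recomputed={total / uptime_s:8.1f}")

em1_total = interrupts["irq24: em1"][0]
print(f"estimated uptime: {uptime_s / 3600:.2f} h, em1 interrupts: {em1_total}")
```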
# ifconfig em1
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=db<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,POLLING,VLAN_HWCSUM>
ether 00:07:e9:1a:ae:dc
inet 192.168.1.62 netmask 0xfffff800 broadcast 192.168.7.255
media: Ethernet autoselect (1000baseLX <full-duplex>)
status: active
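For anyone checking the addressing, the hex netmask and broadcast address above can be verified with a short Python sketch using the standard ipaddress module. The values are copied from the ifconfig output; the variable names are only illustrative.

```python
import ipaddress

# Values copied from the ifconfig output above.
addr = "192.168.1.62"
netmask_hex = 0xfffff800
reported_broadcast = "192.168.7.255"

# Convert the hex netmask to dotted-quad form, then derive the network.
netmask = ipaddress.IPv4Address(netmask_hex)
iface = ipaddress.IPv4Interface(f"{addr}/{netmask}")

print(netmask)                          # 255.255.248.0
print(iface.network)                    # 192.168.0.0/21
print(iface.network.broadcast_address)  # 192.168.7.255

# The broadcast address reported by ifconfig matches the computed one.
assert str(iface.network.broadcast_address) == reported_broadcast
```

So the interface sits in a /21 and the reported broadcast address agrees with the netmask; nothing in the addressing itself looks wrong.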
Any ideas?
On Wed, May 13, 2009 at 06:44:38PM +0200, Greg Byshenk wrote:
> On Wed, May 13, 2009 at 06:42:07PM +0200, Greg Byshenk wrote:
>
> > As a followup to my own previous message, I continue to have annoying
> > problems with "em?: watchdog timeout" on one of my machines (now running
> > 7.2-STABLE as of 2009-05-08).
> >
> > I have discontinued using the on-board (em, copper) NICs, and replaced
> > the original fibre NIC with a newer model, but the problem persists.
> > I've also set
> >
> > hw.pci.enable_msix=0
> > hw.pci.enable_msi=0
> > hw.em.rxd=1024
> > hw.em.txd=1024
> > net.inet.tcp.tso=0
> >
> > ...as suggested in some discussions of this problem, and set the em1
> > interface to 'polling', all to no avail. Frequently, though irregularly
> > (once or twice a day), the console begins to display
> >
> > em1: watchdog timeout -- resetting
> > em1: watchdog timeout -- resetting
> > em1: watchdog timeout -- resetting
> >
> > the network is down, and the machine locks up.
> >
> > [Note: I am getting 'em1' now instead of 'em0' as previously, but this
> > is due to changing all of the nics, which led to a different numbering;
> > the timeout is still occurring on the (main) interface, the fibre
> > gigabit connection.]
> >
> > What is particularly perverse (IMO) is that, since changing the NIC to
> > the newer model (and updating the kernel), I can no longer break to the
> > debugger when the lockup occurs (there is no response to the break) --
> > but I _can_ shut the machine down cleanly via hardware (a touch of the
> > power switch sends 'shutdown', and the machine shuts down cleanly --
> > after killing off processes waiting on network i/o).
> >
> > The machine is running nfs and samba (3.2.10, from ports), and pretty
> > much nothing else.
> >
> >
> > Anyone have any ideas about this...? I'm going mad with this.
--
greg byshenk - gbyshenk at byshenk.net - Leiden, NL