Synopsis: process swi1: net, taskq em0 and dummynet gives 100% CPU usage
Sergey Pronin
sepron at gmail.com
Mon Mar 16 08:41:01 PDT 2009
Related to: http://lists.freebsd.org/pipermail/freebsd-net/2009-February/021120.html
Regardless of the conditions (no heavy load, not much traffic passing through, not many ng nodes), the server stops working properly. The failure shows up in one of three patterns (a sketch of how they are observed follows list C):
A:
1) swi1: net is at 100% CPU usage.
2) The server does not respond to ICMP echo requests.
3) ssh does not work either.
4) The mpd process shows an "ngsock" state in top.
5) Rebooting the server helps.
B:
1) The em0 taskq is at 100% CPU usage.
2) There are watchdog timeouts in /var/log/messages.
3) The server does not respond to ICMP echo requests.
4) ssh does not work either.
5) The mpd process shows an "ngsock" state in top.
6) Rebooting the server helps.
7) swi1: net stays at 0%.
C:
1) The dummynet process is at 100% CPU usage.
2) The server does not respond to ICMP echo requests.
3) ssh does not work either.
4) The mpd process shows an "ngsock" state in top.
5) Rebooting the server helps.
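In each of the three cases the busy thread shows up in top. Assuming a stock FreeBSD 7.x userland, the observation comes down to roughly:

top -SH      # -S shows system (kernel) processes, -H shows threads; the offender (swi1: net, em0 taskq or dummynet) sits at ~100% WCPU
vmstat -i    # per-device interrupt rates, to rule out a plain interrupt storm on em0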
I have a few servers:
Boards: Intel S3200SH with a Q8200 or E8600 CPU
NICs: 82566DM-2 or 82571EB (em driver)
OSes: FreeBSD 7.0-RELEASE-p10, FreeBSD 7.0-RELEASE-p9, FreeBSD 6.4-RELEASE-p3
Software: mpd 4.4.1 (PPPoE), ipfw with dummynet shaping, pf (NAT only)
Only the em0 card is in use, with about 550 VLANs on it.
About 2000 ng nodes are created.
About 500-700 simultaneous PPPoE sessions at peak hours.
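For reference, each of those VLANs is an ordinary vlan(4) interface cloned on top of em0; a minimal sketch of one of them (the VLAN ID is made up) would be:

ifconfig vlan101 create vlan 101 vlandev em0
ifconfig vlan101 up

mpd then attaches its PPPoE hooks to the ng_ether node of each such interface.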
kernel:
device bpf # Berkeley packet filter
device pf
options IPFIREWALL
options IPFIREWALL_VERBOSE
options IPFIREWALL_FORWARD
options IPFIREWALL_VERBOSE_LIMIT=1000
options IPFIREWALL_DEFAULT_TO_ACCEPT
options IPDIVERT # divert sockets
options DUMMYNET # ipfw pipes/queues for shaping
options DEVICE_POLLING # interface polling
options HZ=2000 # higher tick rate for polling/dummynet granularity
options NETGRAPH
options NETGRAPH_ETHER
options NETGRAPH_IFACE
options NETGRAPH_SOCKET
options NETGRAPH_PPP
options NETGRAPH_TCPMSS # TCP MSS adjustment node
options NETGRAPH_TEE
options NETGRAPH_VJC # Van Jacobson header compression node
options NETGRAPH_PPPOE
On some servers, netgraph is loaded as modules instead of being compiled in, and the polling option is commented out.
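For those servers, the loader.conf entries corresponding to the options above (mpd can also load most of these modules on demand) would look roughly like:

netgraph_load="YES"
ng_ether_load="YES"
ng_iface_load="YES"
ng_socket_load="YES"
ng_ppp_load="YES"
ng_pppoe_load="YES"
ng_tcpmss_load="YES"
ng_tee_load="YES"
ng_vjc_load="YES"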
sysctl.conf:
net.inet.ip.intr_queue_maxlen=1000
net.inet.tcp.blackhole=2
net.inet.udp.blackhole=1
net.inet.ip.dummynet.hash_size=1024   # hash table size for dynamic pipes/queues
net.inet.ip.dummynet.io_fast=1        # pass packets straight through when the pipe is not backlogged
net.inet.ip.fw.one_pass=1             # packets leave ipfw after the pipe instead of re-entering the ruleset
net.inet.ip.fastforwarding=1          # fast IP forwarding path
net.isr.direct=0                      # queue packets to the netisr thread (swi1: net) instead of dispatching directly
#net.inet.ip.portrange.randomized=0
net.inet.tcp.syncookies=1
kern.ipc.maxsockbuf=1048576
net.graph.maxdgram=524288
net.graph.recvspace=524288
net.inet.ip.portrange.first=1024
net.inet.ip.portrange.last=65535
dev.em.0.rx_int_delay=160         # interrupt moderation (receive)
dev.em.0.rx_abs_int_delay=160
dev.em.0.tx_int_delay=160         # interrupt moderation (transmit)
dev.em.0.tx_abs_int_delay=160
dev.em.0.rx_processing_limit=200  # max packets handled per receive interrupt/poll pass
loader.conf:
autoboot_delay="2"
kern.ipc.maxpipekva=10000000
net.graph.maxalloc=2048 # limit on queued netgraph items
hw.em.rxd="512"         # receive descriptors per ring
hw.em.txd="1024"        # transmit descriptors per ring
About 30 ipfw rules and 2 rules for shaping:
00300 pipe tablearg ip from any to table(4) out via ng*
00301 pipe tablearg ip from table(5) to any in via ng*
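With pipe tablearg, the pipe number for each packet is taken from the value stored in the table, so every subscriber entry carries its own pipe. A minimal sketch of the matching table and pipe setup (addresses, pipe numbers and bandwidths are made up):

ipfw table 4 add 10.0.0.15/32 110   # value 110 = download pipe for this subscriber
ipfw table 5 add 10.0.0.15/32 111   # value 111 = upload pipe
ipfw pipe 110 config bw 2048Kbit/s
ipfw pipe 111 config bw 512Kbit/s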
I have tested different network cards with different chipsets.
With and without lagg0.
I had the same problems with FreeBSD 7.1-RELEASE-p1/p2.
I also tried starting the servers without the em tuning in loader.conf and sysctl.conf.
Server uptime before the problem hits varies from one week to two months.
I have two other servers with the same hardware but without dummynet, netgraph, or mpd. They run only quagga + BGP, with the same chipsets, on FreeBSD 7.0-RELEASE-p10. No problems at all.
IMHO the problem is somewhere in netgraph: something is causing an infinite loop.
Any ideas?