Packet loss every 30.999 seconds
brde at optusnet.com.au
Mon Dec 17 21:43:52 PST 2007
On Mon, 17 Dec 2007, David G Lawrence wrote:
>> While trying to diagnose a packet loss problem in a RELENG_6 snapshot
>> November 8, 2007 it looks like I've stumbled across a broken driver or
>> kernel routine which stops interrupt processing long enough to severely
>> degrade network performance every 30.99 seconds.
I see the same behaviour under a heavily modified version of FreeBSD-5.2
(except the period was 2 ms longer and the latency was 7 ms instead
of 11 ms when numvnodes was at a certain value). Now with numvnodes =
17500, the latency is 3 ms.
> I noticed this as well some time ago. The problem has to do with the
> processing (syncing) of vnodes. When the total number of allocated vnodes
> in the system grows to tens of thousands, the ~31 second periodic sync
> process takes a long time to run. Try this patch and let people know if
> it helps your problem. It will periodically wait for one tick (1ms) every
> 500 vnodes of processing, which will allow other things to run.
However, the syncer should be running at a relatively low priority and not
cause packet loss. I don't see any packet loss even in ~5.2 where the
network stack (but not drivers) is still Giant-locked.
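The batching idea described in the quoted patch (pause for one tick after
every 500 vnodes) can be illustrated in user space. This is only a sketch
of the general technique, not the actual kernel patch; the function name
sync_in_batches and all of its parameters are hypothetical.

```python
import time


def sync_in_batches(items, process, batch=500, pause_s=0.001, sleep=time.sleep):
    """Process every item, pausing briefly after each full batch.

    All names here are illustrative; this is not the kernel patch.
    Returns the number of pauses taken.
    """
    pauses = 0
    for count, item in enumerate(items, start=1):
        process(item)
        if count % batch == 0:
            # Give other work a chance to run, analogous to waiting one tick.
            sleep(pause_s)
            pauses += 1
    return pauses
```

With 17500 items and a batch size of 500, this takes 35 pauses, so the
whole pass is stretched by at least 35 ms in exchange for never holding
the CPU for more than one batch at a time.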
Other too-high latencies showed up:
- syscons LED setting and vt switching gives a latency of 5.5 msec because
syscons still uses busy-waiting for setting LEDs :-(. Oops, I do see
packet loss -- this causes it under ~5.2 but not under -current. For
the bge and/or em drivers, the packet loss shows up in netstat output
as a few hundred errors for every LED setting on the receiving machine,
while receiving tiny packets at the maximum possible rate of 640 kpps.
sysctl is completely Giant-locked and so are upper layers of the
network stack. The bge hardware rx ring size is 256 in -current and
512 in ~5.2. At 640 kpps, 512 packets take 800 us so bge wants to
call the upper layers with a latency of far below 800 us. I
don't know exactly where the upper layers block on Giant.
- a user CPU hog process gives a latency of over 200 ms every half a
second or so when the hog starts up, and 300-400 ms after the
hog has been running for some time. Two user CPU hog processes
double the latency. Reducing kern.sched.quantum from 100 ms to 10
ms and/or renicing the hogs don't seem to affect this. Running the
hogs at idle priority fixes this. This won't affect packet loss,
but it might affect user network processes -- they might need to
run at real time priority to get low enough latency. They might need
to do this anyway -- a scheduling quantum of 100 ms should give a
latency of 100 ms per CPU hog quite often, though not usually since
the hogs should never be preferred to a higher-priority process.
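The ring-size arithmetic in the first item above (512 packets at 640 kpps
take 800 us) can be checked with a few lines. This is just the division
spelled out; the function name ring_fill_time_us is made up for the
example.

```python
def ring_fill_time_us(ring_slots, packets_per_second):
    """Microseconds for packets arriving at a steady rate to fill an empty rx ring."""
    # Multiply first so the common cases divide exactly.
    return ring_slots * 1_000_000 / packets_per_second


print(ring_fill_time_us(512, 640_000))  # 800.0 (ring size in ~5.2)
print(ring_fill_time_us(256, 640_000))  # 400.0 (ring size in -current)
```

So the 256-entry ring in -current leaves only about half the time budget
of the 512-entry ring in ~5.2 before the ring fills at that packet rate.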
Previously I've used a less specialized clock-watching program to
determine the syscall latency. It showed similar problems for CPU
hogs. I just remembered that I found the fix for these under ~5.2 --
remove a local hack that sacrifices latency for reduced context
switches between user threads. -current with SCHED_4BSD does this
non-hackishly, but seems to have a bug somewhere that gives a latency
that is large enough to be noticeable in interactive programs.
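For anyone who wants to reproduce this kind of measurement, a
clock-watching loop can be sketched as below. This is not the program
referred to above, only a minimal illustration of the approach: read a
monotonic clock in a tight loop and remember the largest gap between two
consecutive readings. The name max_clock_gap_ns and the injectable clock
parameter are assumptions made for the example.

```python
import time


def max_clock_gap_ns(samples, clock=time.monotonic_ns):
    """Read the clock repeatedly and return the largest gap between readings.

    A large gap means the loop was not running for that long, for example
    because it was preempted or blocked.
    """
    worst = 0
    prev = clock()
    for _ in range(samples):
        now = clock()
        worst = max(worst, now - prev)
        prev = now
    return worst


if __name__ == "__main__":
    print(max_clock_gap_ns(1_000_000), "ns worst gap")
```

Running it alongside a CPU hog should show whether the worst gap grows
toward the scheduling quantum.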