Packet loss every 30.999 seconds

Mark Fullmer maf at
Sun Dec 16 21:21:53 PST 2007

While trying to diagnose a packet loss problem in a RELENG_6 snapshot  
November 8, 2007 it looks like I've stumbled across a broken driver or
kernel routine which stops interrupt processing long enough to severly
degrade network performance every 30.99 seconds.

Packets appear to make it as far as ether_input() then get lost.

Test setup:

A - ethernet_switch - B

A sends UDP packets to B through an ethernet switch.  The interface  
packet count and output packet count on the switch match what A
is sending and B should be receiving.  A UDP receiver running on B
sees windows of packet loss with a period of 30.99 seconds.  The lost
packets are counted based on an incrementing sequence number.  On an
isolated network the <Link#1> Ipkts counter on B matches what A is
sending, but the packets never show up in any of the IP/UDP counters
or the program trying to receive them.

This behavior can be seen with both em and fxp interfaces.  Problem  
is it
only occurs after the receiving host has been up about a day.  Reboot,
problem clears.  GENERIC kernel, nothing more than default daemons
running.  Behavior seen on three different motherboards so far.

It also appears this is not just lost network interrupts.  Whatever
is spinning in the kernel also impacts syscall latency.  An easy way to
replicate what I'm seeing is to run gettimeofday() in a tight loop
and note when the real time syscall delay exceeds some value (which
is dependent on processor speed).  As an example on an 3.20GHz CPU a
small program will output when the syscall latency is > 5000 usecs.
Note the periodic behavior at 30.99 seconds.  These big jumps in
latency correspond to when packets are being dropped.

usecs (epoch)    latency diff
1197861705805078 478199 0
1197861721012298 25926 15207220
1197861726332036 11729 5319738
1197861757331549 11691 30999513
1197861788331266 11878 30999717
1197861819330647 11708 30999381
1197861850330192 11698 30999545
1197861881329733 11667 30999541
1197861900018297 6516 18688564
1197861912329282 11684 12310985
1197861943328849 11699 30999567
1197861974328413 11692 30999564
1197862005328228 11916 30999815
1197862036327598 11684 30999370
1197862067327229 11680 30999631
1197862098326860 11667 30999631
1197862129326559 11704 30999699
1197862160326377 11844 30999818
1197862191325890 11674 30999513

(output from packet loss tester)

window_start/window_end is packet counter
time_start/time_end is absolute time in usecs.
window_diff is # of packets missing

The test is run at about 15.5Kpps / 132Mbits/second, certainly a lot  
less than this hardware is capable of running BSD4.X.

:missing window_start=311510,  
time_end=1197861726332011, window_diff=128, time_diff=3
:missing window_start=794482,  
time_end=1197861757331509, window_diff=127, time_diff=4
:missing window_start=1277313,  
time_end=1197861788331249, window_diff=131, time_diff=4
:missing window_start=1760104,  
time_end=1197861819330629, window_diff=128, time_diff=4
:missing window_start=2242789,  
time_end=1197861850330174, window_diff=127, time_diff=4
:missing window_start=2725818,  
time_end=1197861881329715, window_diff=128, time_diff=3
:missing window_start=3208594,  
time_end=1197861912329264, window_diff=128, time_diff=3
:missing window_start=3691395,  
time_end=1197861943328805, window_diff=127, time_diff=3
:missing window_start=4173793,  
time_end=1197861974328373, window_diff=128, time_diff=4
:missing window_start=4656236,  
time_end=1197862005328179, window_diff=131, time_diff=3
:missing window_start=5139197,  
time_end=1197862036327580, window_diff=128, time_diff=4
:missing window_start=5621958,  
time_end=1197862067327211, window_diff=127, time_diff=3
:missing window_start=6104597,  
time_end=1197862098326843, window_diff=128, time_diff=4
:missing window_start=6587241,  
time_end=1197862129326534, window_diff=128, time_diff=20
:missing window_start=7070051,  
time_end=1197862160326371, window_diff=132, time_diff=3
:missing window_start=7552828,  
time_end=1197862191325876, window_diff=126, time_diff=3
:missing window_start=8035434,  
time_end=1197862222325576, window_diff=126, time_diff=4

I'm building a more up to date copy of RELENG_6 to make sure I'm not  
chasing something that's
been fixed.  As a side note this appears to also be happening on a  
RELENG_6 build dated  Mar 11 2007.

Included is the gettimeofday() looper.  Run as ./a.out 1 5000, where  
will depend on your system speed.  This probably won't provide any
meaningful results on a loaded system.  E-mail me off list for a copy  
of the
packet tester or more diagnostics.

#include <sys/time.h>
#include <sys/types.h>
#include <errno.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <err.h>

main(int argc, char **argv)
   struct timeval tv;
   struct timezone tz;
   u_int64_t time_now, time_last, time_mark;
   int quiet, max;

   if (argc != 3)
     errx(1, "Usage: %s <quiet> <max>", argv[0]);


   gettimeofday(&tv, &tz);
   time_last = (u_int64_t)tv.tv_sec * 1000000LL + (u_int64_t)tv.tv_usec;
   time_mark = 0LL;

   for (;;) {

     gettimeofday(&tv, &tz);
     time_now = (u_int64_t)tv.tv_sec * 1000000LL + (u_int64_t) 

     if (!quiet) {
       printf("%llu %llu %llu\n", time_now, time_now-time_last,  
       time_mark = time_now;
     } else {
       if ((time_now-time_last) > max) {
         if (time_mark == 0)
           time_mark = time_now;
         printf("%llu %llu %llu\n", time_now, time_now-time_last,  
         time_mark = time_now;
     time_last = time_now;


} /* main */

More information about the freebsd-stable mailing list