Packet loss every 30.999 seconds

Mon Dec 17 09:57:21 PST 2007

Back to back test with no ethernet switch between two em interfaces,
same result.  The receiving side has been up > 1 day and exhibits
the problem.  These are also two different servers.  The small
gettimeofday() syscall tester also shows the same ~30
second pattern of high latency between syscalls.
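
The tester itself is not included in this message. As a rough sketch of the idea only (the names sample_clock and find_stalls and the threshold value are hypothetical, and Python's time.time() is a wall-clock read that is not necessarily implemented with gettimeofday()), a tight loop of clock reads can be scanned for unusually large gaps between consecutive reads:

```python
import time


def sample_clock(n):
    """Read the wall clock n times back to back and return the timestamps."""
    return [time.time() for _ in range(n)]


def find_stalls(samples, threshold):
    """Return (timestamp, gap) pairs where two consecutive clock reads
    are more than `threshold` seconds apart."""
    stalls = []
    for prev, cur in zip(samples, samples[1:]):
        gap = cur - prev
        if gap > threshold:
            stalls.append((cur, gap))
    return stalls
```

Running find_stalls(sample_clock(1000000), 0.001) repeatedly for a few minutes and printing the results would show whether the large gaps line up on a ~30 second schedule.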

Receiver test application reports 3699 missed packets

Sender netstat -i:

(before test)
em1    1500 <Link#2>      00:04:23:cf:51:b7       20     0 15975785     0     0
em1    1500 10.1/24       10.1.0.2                37     - 15975801     -     -

(after test)
em1    1500 <Link#2>      00:04:23:cf:51:b7       22     0 25975822     0     0
em1    1500 10.1/24       10.1.0.2                39     - 25975838     -     -

total IP packets sent during test = end - start
25975838-15975801 = 10000037 (expected: 10,000,000 packet test + overhead)

Receiver netstat -i:

(before test)
em1    1500 <Link#2>      00:04:23:c4:cc:89 15975785     0       21     0     0
em1    1500 10.1/24       10.1.0.1          15969626     -       19     -     -

(after test)
em1    1500 <Link#2>      00:04:23:c4:cc:89 25975822     0       23     0     0
em1    1500 10.1/24       10.1.0.1          25965964     -       21     -     -

total ethernet frames received during test = end - start
25975822-15975785 = 10000037 (as expected)

total IP packets processed during test = end - start
25965964-15969626 = 9996338 (expecting 10000037)

Missed packets = expected - received
10000037-9996338 = 3699

netstat -i accounts for the 3699 missed packets also reported by the
application
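
For anyone who wants to re-check the arithmetic, here is a small sketch that recomputes the deltas from the counters above. It assumes the usual netstat -i column order (Ipkts, Ierrs, Opkts, Oerrs, Coll), which the pasted output does not show as a header; the variable names are made up for illustration.

```python
# Counter values copied from the netstat -i output above, as (before, after).
sender_ip_out = (15975801, 25975838)     # sender, 10.1/24 row, Opkts (assumed column)
receiver_link_in = (15975785, 25975822)  # receiver, <Link#2> row, Ipkts (assumed column)
receiver_ip_in = (15969626, 25965964)    # receiver, 10.1/24 row, Ipkts (assumed column)


def delta(before_after):
    """Return how much a counter grew during the test."""
    before, after = before_after
    return after - before


sent = delta(sender_ip_out)
frames_received = delta(receiver_link_in)
ip_processed = delta(receiver_ip_in)
missed = frames_received - ip_processed

print(sent, frames_received, ip_processed, missed)
# prints: 10000037 10000037 9996338 3699
```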

Looking closer at the tester output again shows the periodic
~30 second windows of packet loss.
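
The tester output is not reproduced here. As a sketch only, assuming the receiver can log a timestamp (in seconds) each time it notices a missing packet, the spacing between loss windows could be measured like this. The function names and the quiet_gap parameter are hypothetical.

```python
def burst_starts(loss_times, quiet_gap=1.0):
    """Group loss timestamps into bursts and return each burst's start time.

    A new burst begins whenever more than `quiet_gap` seconds pass
    without any loss being recorded.
    """
    starts = []
    last = None
    for t in sorted(loss_times):
        if last is None or t - last > quiet_gap:
            starts.append(t)
        last = t
    return starts


def burst_intervals(loss_times, quiet_gap=1.0):
    """Return the time between the starts of consecutive loss bursts."""
    starts = burst_starts(loss_times, quiet_gap)
    return [b - a for a, b in zip(starts, starts[1:])]
```

If the loss is truly periodic, every value returned by burst_intervals() should sit close to the same number.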

There's a second problem here in that packets are just disappearing
before they make it to ip_input(), or there's a dropped packets
counter I've not found yet.

I can provide remote access to anyone who wants to take a look; this
is very easy to duplicate.  The ~1 day of uptime required before the
behavior surfaces is not making this easy to isolate.

--
mark

On Dec 17, 2007, at 12:43 AM, Jeremy Chadwick wrote:

> On Mon, Dec 17, 2007 at 12:21:43AM -0500, Mark Fullmer wrote:
>> While trying to diagnose a packet loss problem in a RELENG_6 snapshot
>> dated November 8, 2007 it looks like I've stumbled across a broken
>> driver or kernel routine which stops interrupt processing long enough
>> to severely degrade network performance every 30.99 seconds.
>>
>> Packets appear to make it as far as ether_input() then get lost.
>
> Are you sure this isn't being caused by something the switch is doing,
> such as MAC/ARP cache clearing or LACP?  I'm just speculating, but it
> would be worthwhile to remove the switch from the picture (crossover
> cable to the rescue).
>
> I know that at least in the case of fxp(4) and em(4), Jack Vogel does
> some thorough testing of throughput using a professional/high-end
> packet generator (some piece of hardware, I forget the name...)
>
> -- 
> | Jeremy Chadwick                                    jdc at parodius.com |
> | Parodius Networking                           http://www.parodius.com/ |
> | UNIX Systems Administrator                      Mountain View, CA, USA |
> | Making life hard for others since 1977.                  PGP: 4BD6C0CB |