quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?)

Luigi Rizzo rizzo at iet.unipi.it
Thu Dec 8 15:18:46 UTC 2011


On Fri, Dec 09, 2011 at 12:11:50AM +1100, Lawrence Stewart wrote:
> On 12/08/11 05:08, Luigi Rizzo wrote:
...
> >I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which
> >seems slightly faster than HEAD) using MTU=1500 and various
> >combinations of card capabilities (hwcsum,tso,lro), different window
> >sizes and interrupt mitigation configurations.
> >
> >default latency is 16us, l=0 means no interrupt mitigation.
> >lro is the software implementation of lro (tcp_lro.c)
> >hwlro is the hardware one (on 82599). Using a window of 100 Kbytes
> >seems to give the best results.
> >
> >Summary:
> 
> [snip]
> 
> >- enabling software lro on the transmit side actually slows
> >   down the throughput (4-5Gbit/s instead of 8.0).
> >   I am not sure why (perhaps acks are delayed too much) ?
> >   Adding a couple of lines in tcp_lro to reject
> >   pure acks seems to have much better effect.
> >
> >The tcp_lro patch below might actually be useful also for
> >other cards.
> >
> >--- tcp_lro.c   (revision 228284)
> >+++ tcp_lro.c   (working copy)
> >@@ -245,6 +250,8 @@
> >
> >         ip_len = ntohs(ip->ip_len);
> >         tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
> >+       if (tcp_data_len == 0)
> >+               return -1;      /* not on ack */
> >
> >
> >         /*
> 
> There is a bug with our LRO implementation (first noticed by Jeff 
> Roberson) that I started fixing some time back but dropped the ball on. 
> The crux of the problem is that we currently only send an ACK for the 
> entire LRO chunk instead of all the segments contained therein. Given 
> that most stacks rely on the ACK clock to keep things ticking over, the 
> current behaviour kills performance. It may well be the cause of the 
> performance loss you have observed.

I should clarify.
First of all, I tested two different LRO implementations: our
"Software LRO" (tcp_lro.c), and the "Hardware LRO" implemented by
the 82599 (called RSC, or Receive Side Coalescing, in the 82599
data sheets). Jack Vogel and Navdeep Parhar (both in Cc) can
probably comment on the logic of both.

In my tests, either SW or HW LRO on the receive side HELPED A LOT,
not just in terms of raw throughput but also in terms of system
load on the receiver. On the receive side, LRO packs multiple data
segments into a single one before passing it up the stack.

As you mentioned, this also reduces the number of ACKs generated,
but not dramatically (the coalescing is bounded by the number of
segments received within one interrupt mitigation interval).
In my tests the number of read() calls on the receiver was reduced
by roughly a factor of 3 compared to the !LRO case, meaning 4-5
segments merged per LRO event. Navdeep reported similar figures
for cxgbe.

Using Hardware LRO on the transmit side had no ill effect.
Since it is done in hardware, I have no insight into how it is
implemented.

Using Software LRO on the transmit side did give a significant
throughput reduction. I can't pinpoint the exact cause: it may be
that, between the receiver seeing fewer segments and the ACKs it
generates being collapsed, the sender starves; or it may simply be
the extra delay in passing the ACKs up that limits performance.
Either way, since the HW LRO did a fine job, I tried to figure out
whether avoiding LRO on pure ACKs could help, and the two-line
patch above did help.

Note that my patch is just a proof of concept, and may cause
reordering if a data segment is followed by a pure ACK. This can
be fixed easily by handling a pure ACK the same way tcp_lro_rx()
already handles an out-of-sequence packet: flush the matching LRO
entry (if any) and then let the ACK go up the stack unmerged.
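
For concreteness, here is a minimal sketch of that fix, written
against the 8.x-era tcp_lro.c layout (SLIST of active entries,
tcp_lro_flush()). It is only a fragment meant to stand where the
check in the patch above sits; lro_matches() and the variable names
are illustrative, not actual FreeBSD code:

/*
 * Sketch only: on a pure ACK, flush any active LRO entry for the
 * same connection before rejecting the packet, so the coalesced
 * data is delivered ahead of the ACK and no reordering is
 * introduced.  lro_matches() stands in for a 4-tuple comparison
 * against the entry (hypothetical helper).
 */
ip_len = ntohs(ip->ip_len);
tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
if (tcp_data_len == 0) {
        SLIST_FOREACH(le, &lc->lro_active, next) {
                if (lro_matches(le, ip, tcp)) {
                        SLIST_REMOVE(&lc->lro_active, le,
                            lro_entry, next);
                        tcp_lro_flush(lc, le);
                        break;
                }
        }
        return (-1);    /* driver passes the ACK up unmerged */
}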

>                                     WIP patch is at:
> http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch
> 
> Jeff tested the WIP patch and it *doesn't* fix the issue. I don't have 
> LRO capable hardware setup locally to figure out what I've missed. Most 
> of the machines in my lab are running em(4) NICs which don't support 
> LRO, but I'll see if I can find something which does and perhaps 
> resurrect this patch.

a few comments:
1. I don't think it makes sense to send multiple ACKs for a
   coalesced segment (and the 82599 does not seem to do that).
   First of all, the ACKs would go out with minimal spacing (ideally
   less than 100ns), so chances are the remote end would see them
   as a single burst anyway. Secondly, the remote end can easily
   tell that a single ACK covers multiple MSS worth of data and
   behave as if an equivalent number of ACKs had arrived (see the
   byte-counting sketch after this list).

2. I am a big fan of LRO (and similar solutions), because it can save
   a lot of repeated work when passing packets up the stack, and the
   mechanism becomes more and more effective as the system load
   increases, which is a wonderful property in terms of system
   stability.

   For this reason, I think it would be useful to add support for
   software LRO in the generic code (sys/net/if.c) so that drivers
   can use the software implementation even without hardware support
   (see the driver-side sketch after this list).

3. Similar to LRO, it would make sense to implement a "software TSO"
   mechanism where the TCP sender pushes a large segment down to
   ether_output(), and code in if_ethersubr.c does the segmentation
   and checksum computation. This would save multiple traversals of
   the various stack layers, which currently recompute essentially
   the same information for every segment (a rough sketch of the
   idea follows the list).
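
To illustrate the second half of point 1, here is a minimal sketch
(plain C, not FreeBSD stack code) of RFC 3465-style byte counting:
cwnd grows from the number of newly acked bytes rather than from
the number of ACK packets, so one ACK covering several MSS has the
same effect as the equivalent ACK train. Function and parameter
names are made up for the example.

#include <stdint.h>

/*
 * Appropriate Byte Counting (RFC 3465) sketch: grow cwnd from the
 * bytes covered by an ACK, so the single ACK produced behind an
 * LRO/RSC receiver counts like the per-segment ACKs it replaces.
 * cwnd, ssthresh and bytes_acked are all in bytes.
 */
static uint32_t
cwnd_after_ack(uint32_t cwnd, uint32_t ssthresh,
    uint32_t bytes_acked, uint32_t mss)
{
        if (cwnd < ssthresh) {
                /* Slow start, with the 2*MSS per-ACK cap (L=2). */
                cwnd += bytes_acked < 2 * mss ? bytes_acked : 2 * mss;
        } else {
                /*
                 * Congestion avoidance: about one MSS per cwnd worth
                 * of acked data.  A real stack accumulates the
                 * remainder instead of discarding it.
                 */
                cwnd += (uint32_t)(((uint64_t)mss * bytes_acked) / cwnd);
        }
        return (cwnd);
}

As for point 2, this is roughly the boilerplate every driver carries
today to use the software LRO (modeled on the ixgbe pattern and the
8.x-era API; locking and error handling omitted, and the rxr/le
names are just illustrative). Moving it into generic code would give
it to every driver for free.

/* RX loop: try to coalesce each received mbuf, else pass it up. */
if ((ifp->if_capenable & IFCAP_LRO) != 0 &&
    tcp_lro_rx(&rxr->lro, m, 0) == 0)
        continue;               /* queued into an LRO entry */
(*ifp->if_input)(ifp, m);

/* End of the RX batch: flush whatever was coalesced. */
while ((le = SLIST_FIRST(&rxr->lro.lro_active)) != NULL) {
        SLIST_REMOVE_HEAD(&rxr->lro.lro_active, next);
        tcp_lro_flush(&rxr->lro, le);
}

And for point 3, a rough sketch of what the segmentation loop could
look like near the bottom of the output path. This is NOT existing
code: fixup_headers() is a hypothetical helper that would prepend a
copy of the Ethernet/IP/TCP headers, set ip_len, advance th_seq by
the payload offset, keep FIN/PSH only on the last segment, and
recompute the checksums; hdr_len, pktlen and mss are assumed to be
known at this point.

for (off = hdr_len; off < pktlen; off += mss) {
        seglen = min(mss, pktlen - off);
        /* Copy one MSS worth of payload into a new chain. */
        n = m_copym(m, off, seglen, M_DONTWAIT);
        if (n == NULL)
                break;          /* out of mbufs, drop the rest */
        n = fixup_headers(m, n, off - hdr_len, seglen);
        if (n == NULL)
                break;
        if ((*ifp->if_transmit)(ifp, n) != 0)
                break;
}
m_freem(m);     /* the original oversized segment is done */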

cheers
luigi

