quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP?)
Luigi Rizzo
rizzo at iet.unipi.it
Wed Dec 7 17:51:59 UTC 2011
On Wed, Dec 07, 2011 at 11:59:43AM +0100, Andre Oppermann wrote:
> On 06.12.2011 22:06, Luigi Rizzo wrote:
...
> >Even in my experiments there is a lot of instability in the results.
> >I don't know exactly where the problem is, but the high number of
> >read syscalls, and the huge impact of setting interrupt_rate=0
> >(the default is 16us on the ixgbe), make me think that there is
> >something in the protocol stack that needs investigation.
> >
> >Of course we don't want to optimize specifically for the one-flow-at-10G
> >case, but devising something that makes the system less sensitive
> >to short timing variations, and that can pass interrupt mitigation
> >delays up the stack, would help.
>
> I'm not sure the variance is only coming from the network card and
> driver side of things. The TCP processing and interactions with the
> scheduler and locking probably play a big role as well. There have
> been many changes to TCP recently, and maybe an inefficiency that
> affects high-speed single-session throughput has crept in. That's
> difficult to debug though.
I ran a bunch of tests on the ixgbe (82599) under RELENG_8 (which
seems slightly faster than HEAD), with MTU=1500 and various
combinations of card capabilities (hwcsum, tso, lro), window sizes,
and interrupt mitigation settings.
The default mitigation delay is 16us; l=0 means no interrupt
mitigation. "lro" is the software implementation of LRO (tcp_lro.c);
"hwlro" is the hardware one (on the 82599). Using a window of
100 Kbytes seems to give the best results.
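To see why the 16us mitigation delay matters, here is a
back-of-the-envelope computation of how many full-size frames
arrive per interrupt at line rate (a minimal sketch; the 1538-byte
on-wire frame size for a 1500-byte MTU is my assumption, not a
figure from the tests):

/*
 * Frames per interrupt at 10 Gbit/s with the default 16us
 * interrupt mitigation on the ixgbe.
 */
#include <stdio.h>

int
main(void)
{
	const double line_rate = 10e9;		/* bits/s */
	const double frame_bits = 1538 * 8;	/* full frame on the wire */
	const double mitigation = 16e-6;	/* seconds between interrupts */
	double fps = line_rate / frame_bits;

	printf("%.0f frames/s, ~%.1f frames per interrupt\n",
	    fps, fps * mitigation);
	return (0);
}

This prints about 812745 frames/s and ~13 frames per interrupt, so
every interrupt hands the stack a batch of packets; whether that
batch is coalesced (LRO) or processed packet by packet makes a
large difference in per-packet overhead.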
Summary:
- with default interrupt mitigation, the fastest configuration
  is with checksums enabled on both sender and receiver and LRO
  enabled on the receiver. This gets about 8.0 Gbit/s.
- LRO is especially effective because it coalesces data packets,
  passing the mitigation batching up the stack and removing
  duplicate work in the IP and TCP layers (see the sketch after
  this list).
- disabling LRO on the receiver brings performance down to
  6.5 Gbit/s and increases CPU load (also in userspace).
- disabling checksums on the sender reduces transmit speed to 5.5 Gbit/s.
- with checksums disabled on both sides (and no LRO on the
  receiver), throughput drops to 4.8 Gbit/s.
- I could not test the receive side with checksums disabled but
  LRO enabled.
- with default interrupt mitigation, enabling both HWCSUM and TSO
  on the sender is really disruptive: depending on the LRO settings
  on the receiver I get 1.5 to 3.2 Gbit/s, with huge variance.
- using both hwcsum and tso works fine if interrupt mitigation
  is disabled (reaching a peak of 9.4 Gbit/s).
- enabling software LRO on the transmit side actually slows
  throughput down (4-5 Gbit/s instead of 8.0). I am not sure
  why (perhaps ACKs are delayed too much?). Adding a couple of
  lines to tcp_lro to reject pure ACKs seems to work much better.
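For reference, the core of what software LRO does per batch of
received packets looks roughly like this (a minimal sketch, not the
real tcp_lro.c code; all types and names here are illustrative):

#include <stdint.h>
#include <string.h>

struct flow_key { uint32_t saddr, daddr; uint16_t sport, dport; };

struct lro_entry {
	struct flow_key	key;
	uint32_t	next_seq;	/* sequence number expected next */
	uint32_t	data_len;	/* payload bytes accumulated */
	int		active;
};

/*
 * Try to merge one segment into the current aggregation; return 0
 * on success, -1 if the packet must take the normal input path.
 */
static int
lro_rx(struct lro_entry *le, const struct flow_key *key,
    uint32_t seq, uint32_t len)
{
	if (len == 0)
		return (-1);	/* pure ACK: hand it up immediately */
	if (!le->active) {	/* start a new aggregation */
		le->key = *key;
		le->next_seq = seq + len;
		le->data_len = len;
		le->active = 1;
		return (0);
	}
	if (memcmp(&le->key, key, sizeof(*key)) != 0 ||
	    seq != le->next_seq)
		return (-1);	/* different flow or out of order */
	le->next_seq += len;	/* in order: just extend the segment */
	le->data_len += len;
	return (0);
}

The expensive per-packet work (IP/TCP input, socket wakeups) then
runs once per merged segment instead of once per wire packet. The
pure-ACK check at the top is the same idea as the patch below: on
the transmit side you want ACKs delivered immediately, not held
back by the aggregation.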
The tcp_lro patch below might actually be useful for other
cards as well.
--- tcp_lro.c	(revision 228284)
+++ tcp_lro.c	(working copy)
@@ -245,6 +250,8 @@
 	ip_len = ntohs(ip->ip_len);
 	tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
+	if (tcp_data_len == 0)
+		return -1;	/* do not LRO pure ACKs */
 	/*
cheers
luigi