it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)

Fri Aug 16 08:54:43 UTC 2013

On 8/14/13 6:21 PM, Luigi Rizzo wrote:
> On Wed, Aug 14, 2013 at 05:23:02PM +1000, Lawrence Stewart wrote:
>> On 08/14/13 16:33, Julian Elischer wrote:
>>> On 8/14/13 11:39 AM, Lawrence Stewart wrote:
>>>> On 08/14/13 03:29, Julian Elischer wrote:
>>>>> I have been tracking down a performance embarrassment on Amazon EC2 and
>>>>> have found it, I think.
>>>> Let us please avoid conflating performance with throughput. The
>>>> behaviour you go on to describe as a performance embarrassment is
>>>> actually a throughput difference, and the FreeBSD behaviour you're
>>>> describing is essentially sacrificing throughput and CPU cycles for
>>>> lower latency. That may not be a trade-off you like, but it is an
>>>> important factor in this discussion.
> ...
>> Sure, there's nothing wrong with holding throughput up as a key
>> performance metric for your use case.
>>
>> I'm just trying to pre-empt a discussion that focuses on one metric and
>> fails to consider the bigger picture.
> ...
>>> I could see no latency reversion.
>> You wouldn't because it would be practically invisible in the sorts of
>> tests/measurements you're doing. Our good friends over at HRT on the
>> other hand would be far more likely to care about latency on the order
>> of microseconds. Again, the use case matters a lot.
> ...
>>> so, does "Software LRO" mean that LRO on the NIC should be ON or OFF to
>>> see this?
>> I think (check the driver code in question as I'm not sure) that if you
>> "ifconfig <if> lro" and the driver has hardware support or has been made
>> aware of our software implementation, it should DTRT.
> The "lower throughput than Linux" that Julian was seeing is either
> because of a slow (CPU-bound) sender or slow receiver. Given that
> the FreeBSD tx path is quite expensive (redoing route and arp lookups
> on every packet, etc.) I highly suspect the sender side is at fault.

If we send bigger packets then we do fewer lookups, do we not?
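
To put rough numbers on that, here is a minimal sketch. It assumes one
route/arp lookup per trip down the transmit path, which is how I read the
description quoted above; tx_calls() and the sizes below are made up for
illustration and are not taken from the stack.

```c
#include <stddef.h>

/*
 * Hypothetical helper: how many trips down the transmit path (and so how
 * many route/arp lookups, assuming one lookup per trip) are needed to
 * send `bytes` of payload when each trip carries at most `seg` bytes.
 */
size_t tx_calls(size_t bytes, size_t seg)
{
    if (seg == 0)
        return 0;                       /* avoid dividing by zero */
    return (bytes + seg - 1) / seg;     /* ceiling division */
}
```

For 1 MiB of payload, tx_calls(1048576, 1448) gives 725 trips with
MSS-sized sends, while tx_calls(1048576, 65535) gives 17 if each trip
hands down a TSO-sized burst.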

>
> Ack coalescing, LRO, GRO are limited to the set of packets that you
> receive in the same batch, which in turn is upper bounded by the
> interrupt moderation delay. Apart from simple benchmarks with only
> a few flows, it is very unlikely that ack/lro/gro can coalesce more
> than a few segments for the same flow.
>
> 	But the real fix is in tcp_output.
>
> In fact, it has never been the case that an ack (single or coalesced)
> triggers an immediate transmission in the output path.  We had this
> in the past (Silly Window Syndrome) and there is code that avoids
> sending less than 1-mtu under appropriate conditions (there is more
> data to push out anyways, no NODELAY, there are outstanding acks,
> the window can open further).  In all these cases there is no
> reasonable way to experience the difference in terms of latency.
>
> If one really cares, e.g. the High Speed Trading example, this is
> a non issue because any reasonable person would run with TCP_NODELAY
> (and possibly disable interrupt moderation), and optimize for latency
> even on a per flow basis.
>
> In terms of coding effort, I suspect that by replacing the 1-mtu
> limit (t_maxseg I believe is the variable that we use in the SWS
> avoidance code) with 1-max-tso-segment we can probably achieve good
> results with little programming effort.
>
> Then the problem remains that we should keep a copy of route and
> arp information in the socket instead of redoing the lookups on
> every single transmission, as they consume some 25% of the time of
> a sendto(), and probably even more when it comes to large tcp
> segments, sendfile() and the like.
>
> 	cheers
> 	luigi
>
>
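
For reference, the threshold change suggested above can be sketched roughly
as follows. This is not the tcp_output code: struct tx_state,
should_send_now() and the field names are hypothetical, and the hold
conditions are just the ones listed in the quoted mail (more data to push,
no NODELAY, outstanding acks so the window can open further).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one connection's send state. */
struct tx_state {
    size_t queued;    /* bytes waiting in the send buffer */
    size_t window;    /* bytes the window currently allows us to send */
    size_t unacked;   /* bytes in flight; their acks can open the window */
    bool   nodelay;   /* TCP_NODELAY is set on the socket */
};

/*
 * Return true if we should transmit now, false if we should wait for a
 * bigger burst. `threshold` is one MSS in the classic SWS-avoidance check;
 * the suggestion is to pass the maximum TSO segment size instead.
 */
bool should_send_now(const struct tx_state *s, size_t threshold)
{
    size_t len = s->queued < s->window ? s->queued : s->window;

    if (len == 0)
        return false;            /* nothing sendable right now */
    if (len >= threshold)
        return true;             /* a full-sized burst is available */

    /* Hold only when every condition from the quoted list is met. */
    if (s->queued > len && !s->nodelay && s->unacked > 0)
        return false;
    return true;
}
```

With 100000 bytes queued, a 3000-byte window and data in flight, a
threshold of 1448 sends immediately, while a threshold of 65535 waits for
more acks to arrive so that a larger burst can go out in one call.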