TSO and FreeBSD vs Linux

Julian Elischer julian at freebsd.org
Wed Aug 14 06:33:15 UTC 2013


On 8/14/13 11:39 AM, Lawrence Stewart wrote:
> On 08/14/13 03:29, Julian Elischer wrote:
>> I have been tracking down a performance embarrassment on Amazon EC2 and
>> have found it, I think.
> Let us please avoid conflating performance with throughput. The
> behaviour you go on to describe as a performance embarrassment is
> actually a throughput difference, and the FreeBSD behaviour you're
> describing is essentially sacrificing throughput and CPU cycles for
> lower latency. That may not be a trade-off you like, but it is an
> important factor in this discussion.
It was an embarrassment in that in one class of test we performed very 
poorly.
It was not a disaster or a show-stopper, but for our product it is a 
critical number.
It is a throughput difference, as you say, but that is a very important 
part of performance...
The latency of Linux didn't seem to be any worse than FreeBSD's;
just the throughput was a lot higher in the same scenario.
>
> Don't fall into the trap of labelling Linux's propensity for maximising
> throughput as superior to an alternative approach which strikes a
> different balance. It all depends on the use case.
Well, the Linux balance seems to be "better all around" at the 
moment, so that is
embarrassing. :-) I could see no latency regression.

>
>> Our OS cousins over at Linux land have implemented some interesting
>> behaviour when TSO is in use.
>>
>> They seem to aggregate ACKS when there is a lot of traffic so that they
>> can create the
>> largest possible TSO packet. We on the other hand respond to each and
>> every returning ACK, as it arrives and thus generally fall into the
>> behaviour of sending a bunch of small packets, the size of each ack.
> There's a thing controlled by ethtool called GRO (generic receive
> offload) which appears to be enabled by default on at least Ubuntu and I
> guess other Linuxes too. It's responsible for aggregating ACKs and data
> to batch them up the stack if the driver doesn't provide a hardware
> offload implementation. Try rerunning your experiments with the ACK
> batching disabled on the Linux host to get an additional comparison point.
I will try that as soon as I get back to the machines in question.
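For reference, toggling GRO from userspace looks roughly like this (a sketch only; the interface name eth0 is an assumption, check yours with `ip link`):

```shell
# Show current offload settings, including generic-receive-offload:
ethtool -k eth0

# Disable GRO for the comparison run...
sudo ethtool -K eth0 gro off

# ...and re-enable it afterwards:
sudo ethtool -K eth0 gro on
```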
>> for two examples look at:
>>
>>
>> http://www.freebsd.org/~julian/LvsF-tcp-start.tiff
>> and
>> http://www.freebsd.org/~julian/LvsF-tcp.tiff
>>
>> in each case, we can see FreeBSD on the left and Linux on the right.
>>
>> The first case shows the case as the sessions start, and the second case
>> shows
>> some distance later (when the sequence numbers wrap around.. no particular
>> reason to use that, it was just fun to see).
>> In both cases you can see that each Linux packet (white)(once they have got
>> going) is responding to multiple bumps in the send window sequence
>> number (green and yellow lines) (representing the arrival of several ACKs)
>> while FreeBSD produces a whole bunch of smaller packets, slavishly
>> following exactly the size of each incoming ACK. This gives us quite a
>> performance debt.
> Again, please s/performance/what-you-really-mean/ here.
OK. In my tests this makes FreeBSD data transfers much slower, by as 
much as 60%.
>
>> Notice that this behaviour in Linux seems to be modal.. it seems to
>> 'switch on' a little bit
>> into the 'starting' trace.
>>
>> In addition, you can see also that Linux gets going faster even in the
>> beginning where
>> TSO isn't in play, by sending a lot more packets up-front. (of course
>> the wisdom of this
>> can be argued).
> They switched to using an initial window of 10 segments some time ago.
> FreeBSD starts with 3 or more recently, 10 if you're running recent
> 9-STABLE or 10-CURRENT.
I tried setting initial values as shown:
   net.inet.tcp.local_slowstart_flightsize: 10
   net.inet.tcp.slowstart_flightsize: 10
It didn't seem to make much difference, but I will redo the test.
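For the record, the values were set like this (a sketch; these sysctls exist on FreeBSD 8/9, while recent 9-STABLE and 10-CURRENT already default the initial window to 10 segments as Lawrence notes):

```shell
# Set the slow-start initial flight size to 10 segments:
sysctl net.inet.tcp.slowstart_flightsize=10
sysctl net.inet.tcp.local_slowstart_flightsize=10

# To make the change persistent, the same lines (without "sysctl ")
# can be added to /etc/sysctl.conf.
```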

>
>> Has anyone done any work on aggregating ACKs, or delaying responding to
>> them?
> As noted by Navdeep, we already have the code to aggregate ACKs in our
> software LRO implementation. The bigger problem is that appropriate byte
> counting places a default 2*MSS limit on the amount of ACKed data the
> window can grow by i.e. if an ACK for 64k of data comes up the stack,
> we'll grow the window by 2 segments worth of data in response. That
> needs to be addressed - we could send the ACK count up with the
> aggregated single ACK or just ignore abc_l_var when LRO is in use for a
> connection.
So, does "software LRO" mean that LRO on the NIC should be ON or OFF 
to see this?
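The ABC limitation Lawrence describes above can be sketched numerically (illustrative only, not kernel code; MSS of 1448 bytes and the FreeBSD default abc_l_var of 2 are assumptions):

```python
# Sketch of how Appropriate Byte Counting (RFC 3465) caps slow-start
# congestion-window growth when ACKs are aggregated by LRO.

MSS = 1448            # assumed segment size in bytes
ABC_L_VAR = 2         # FreeBSD default net.inet.tcp.abc_l_var

def cwnd_growth(acked_bytes, abc_l_var=ABC_L_VAR, mss=MSS):
    """Window growth credited for one ACK: min(bytes ACKed, abc_l_var * MSS)."""
    return min(acked_bytes, abc_l_var * mss)

# Roughly 64 kB acknowledged as 44 separate one-MSS ACKs:
# every byte is credited.
separate = sum(cwnd_growth(MSS) for _ in range(44))

# The same data acknowledged by a single LRO-aggregated ACK:
# growth is clamped to abc_l_var * MSS, so most of the ACKed
# bytes earn no window growth at all.
aggregated = cwnd_growth(44 * MSS)

print(separate, aggregated)
```

This is why sending the ACK count up with the aggregated ACK, or ignoring abc_l_var for LRO'd connections, matters: otherwise aggregation starves window growth.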


>
> Cheers,
> Lawrence
>
>



More information about the freebsd-net mailing list