TCP Rx window auto sizing relies on TCP timestamp option?

Vlad Zolotarov vladz at cloudius-systems.com
Tue Aug 12 11:55:15 UTC 2014


On Aug 11, 2014 8:06 PM, "John-Mark Gurney" <jmg at funkthat.com> wrote:
 >
 > Vlad Zolotarov wrote this message on Mon, Aug 11, 2014 at 15:16 +0300:
 > > Hi, I have the most strange question about the TCP Rx window auto sizing
 > > implementation in a FreeBSD networking stack.
 > > When I looked at the FreeBSD code (hash
 > > 9abce0e567c9a5a0520cdd94d5c633c7baf9a184) I noticed that
 > > the mentioned above feature will not be "enabled" if there isn't a TCP
 > > timestamp option present in the current TCP session:
 > >
 > > See sys/netinet/tcp_input.c: line 1813 in tcp_do_segment() function:
 > >
 > >                       if (V_tcp_do_autorcvbuf &&
 > >                           to.to_tsecr &&  <-------- this is what I'm talking about
 > >                           (so->so_rcv.sb_flags & SB_AUTOSIZE))
 > >
 > > So, if I read the code correctly, if there isn't a TS option (negotiated
 > > and thus present in every received packet) the receive socket buffer
 > > won't grow thus preventing the growth of the Rx window.
 > > If that's the case this is very strange since TS option is not promised
 > > and even more - in many cases it won't be present.
 > > For example in Linux this feature is disabled by default (controlled by
 > > /proc/sys/net/ipv4/tcp_timestamps).
 > > This is how I actually noticed the problem in the first place: I ran an
 > > iperf test where Linux was the initiator and transmitter (iperf -c) and a
 > > FreeBSD box was the receiver (iperf -s), and I noticed that the Rx window
 > > wasn't opening up because the Linux box hadn't negotiated the TS option
 > > in the SYN.
 > > As a result, the throughput numbers were significantly lower compared to a
 > > Linux-to-Linux setup (Linux uses a Dynamic Right-Sizing (DRS) algorithm,
 > > http://public.lanl.gov/radiant/pubs.html#DRS, which doesn't rely on TS).
 > >
 > > Could anybody comment on this, pls.?
 > > Did I miss anything?
 > > Is it true that FreeBSD assumes that TS option is always present and if
 > > not how can I cause an Rx Window to open up when TS option hasn't been
 > > negotiated?
 >
 > This means the receive buffer won't grow beyond the default of 64k...
 > But, as the comment says:
 >                  * On the receive side the socket buffer memory is only rarely
 >                  * used to any significant extent.  This allows us to be much
 >
 > The receive buffer will only get used if the application takes too long
 > to read its buffer, or it isn't currently waiting... If that's the
 > case, then the application should be fixed to be able to process the
 > data as quickly as it comes in...

You are right about the Rx buffer, and as a result the Rx window will not
grow beyond this value either.

See the following lines:

tcp_output.c: tcp_output():

line 509:

	recwin = sbspace(&so->so_rcv);


line 1034:

	/*
	 * According to RFC1323 the window field in a SYN (i.e., a <SYN>
	 * or <SYN,ACK>) segment itself is never scaled.  The <SYN,ACK>
	 * case is handled in syncache.
	 */
	if (flags & TH_SYN)
		th->th_win = htons((u_short)
				(min(sbspace(&so->so_rcv), TCP_MAXWIN)));
	else
		th->th_win = htons((u_short)(recwin >> tp->rcv_scale));


As a result, the transmitter's Tx window will not grow beyond 64K either,
and 64K is a single full LSO/LRO frame.
This limits the transmitter to a single 64K LSO frame per RTT: the
receiver only "sees" new bytes once the HW delivers them, which happens
after all 64KB (a full LRO aggregation) have been received, and only then
does it send an ACK.

Now consider a 0.2ms RTT, like I have on my setup with 40Gbps
ConnectX 3 NICs connected back to back.
In this case the best throughput you'll ever get with a 64K window is
8*64K/0.2ms ~ 2.5Gbps, which is about 1/16 of the line rate, and you need
at least a 64K*16 ~ 1MB window to reach the line rate. The higher the RTT,
the larger the window needed. And this assumes the application drains the
socket buffer as soon as data arrives, which of course may never be the
case.
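
To make the arithmetic above concrete, here is a minimal sketch. The
function names are made up for illustration and are not taken from the
FreeBSD sources. It assumes at most one window of data can be in flight
per RTT.

```c
#include <stdint.h>

/*
 * Upper bound on throughput, in bits per second, when at most
 * window_bytes can be in flight during one round trip.
 */
double max_throughput_bps(uint64_t window_bytes, double rtt_sec)
{
	return (double)window_bytes * 8.0 / rtt_sec;
}

/*
 * Bandwidth-delay product: the window, in bytes, needed to keep a
 * link of link_bps bits per second busy for one round trip.
 */
double required_window_bytes(double link_bps, double rtt_sec)
{
	return link_bps * rtt_sec / 8.0;
}
```

With 64K taken as 65536 bytes and a 0.2ms RTT, max_throughput_bps() gives
roughly 2.6e9, and required_window_bytes(40e9, 0.0002) gives 1e6, i.e.
about 1MB.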

I suppose use cases like the one above were exactly the motivation for the
Window Scaling option in RFC 1323.
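
As a rough illustration of how window scaling lifts the 64K limit, here
is a sketch of picking a shift for a given buffer size and computing the
16-bit value that goes on the wire. The helper names and the clamping are
assumptions made for this example, not the FreeBSD implementation. The
only facts relied on are that the TCP window field is 16 bits and that
RFC 1323 caps the shift at 14.

```c
#include <stdint.h>

#define MAX_UNSCALED_WIN 65535u	/* largest value of the 16-bit window field */
#define MAX_WIN_SHIFT    14u	/* RFC 1323 upper limit on the shift count */

/* Smallest shift that lets a 16-bit window field describe bufsize bytes. */
unsigned window_shift_for(uint32_t bufsize)
{
	unsigned shift = 0;

	while (shift < MAX_WIN_SHIFT &&
	    ((uint32_t)MAX_UNSCALED_WIN << shift) < bufsize)
		shift++;
	return shift;
}

/* Value to place in the window field of a non-SYN segment. */
uint16_t advertised_window(uint32_t free_space, unsigned shift)
{
	uint32_t w = free_space >> shift;

	return (uint16_t)(w > MAX_UNSCALED_WIN ? MAX_UNSCALED_WIN : w);
}
```

The shift is fixed at connection setup, so a larger shift only helps if
the receive buffer behind it actually grows.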

 >
 > So, I don't see much of an issue w/ the code you pointed out, yes,
 > the receive buffer won't grow,

 > but there are options that you can set
 > (sysctl net.inet.tcp.recvspace) and SO_RCVBUF in the application that
 > will address it otherwise...

Exactly! If there is no TS it won't, and FreeBSD will not be able to
utilize the network link.
Frankly, I don't understand your advice: are you suggesting that each and
every application go and manually configure its receive socket buffer
size? Or that we increase the initial socket buffer size globally, which
is even worse?! And which value should we choose? As you can see above,
the proper value depends on the RTT, and the RTT may change while the
application runs due to a routing change. I doubt your suggestion is
feasible.
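
For reference, the per-application workaround mentioned in the quoted text
(SO_RCVBUF) would look roughly like the sketch below. set_rcvbuf() and
get_rcvbuf() are hypothetical helper names. The kernel may cap or adjust
the requested size according to system limits, and the call would
normally be made before connect() or listen() so that the window scale
negotiated in the SYN can cover the larger buffer.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Ask for a receive buffer of the given size; returns 0 on success. */
int set_rcvbuf(int fd, int bytes)
{
	return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}

/* Read back the size the kernel actually applied, or -1 on error. */
int get_rcvbuf(int fd)
{
	int bytes = 0;
	socklen_t len = sizeof(bytes);

	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, &len) != 0)
		return -1;
	return bytes;
}
```

The size passed in has to be guessed up front, which is exactly the
RTT-dependence problem described above.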

So my first question stands: doesn't the FreeBSD community think it would
be beneficial for FreeBSD to use DRS (or a similar algorithm) when no TS
option has been negotiated?
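
To show the kind of logic meant here, below is a very simplified sketch
of the DRS idea: count the bytes received during roughly one RTT and make
sure the buffer is at least twice that, so the sender always has room to
grow. This is an illustration only, not the Linux or LANL
implementation. The struct, the function name and the way the RTT
estimate is supplied (rtt_est_us, measured by the receiver without
timestamps) are all assumptions.

```c
#include <stdint.h>

struct drs_state {
	uint64_t round_start_us;	/* when the current measurement round began */
	uint32_t bytes_in_round;	/* payload bytes received in this round */
	uint32_t rcvbuf;		/* current receive buffer size, bytes */
	uint32_t rcvbuf_max;		/* upper limit for the buffer, bytes */
};

/*
 * Call for every received data segment of len bytes.
 * Returns the (possibly grown) receive buffer size.
 */
uint32_t drs_on_receive(struct drs_state *s, uint64_t now_us, uint32_t len,
    uint64_t rtt_est_us)
{
	uint64_t want;

	s->bytes_in_round += len;
	if (now_us - s->round_start_us < rtt_est_us)
		return s->rcvbuf;	/* round not finished yet */

	/*
	 * The sender delivered bytes_in_round in about one RTT, so its
	 * window is at least that large.  Offer twice as much, but never
	 * shrink the buffer and never exceed the configured maximum.
	 */
	want = 2ULL * s->bytes_in_round;
	if (want > s->rcvbuf_max)
		want = s->rcvbuf_max;
	if (want > s->rcvbuf)
		s->rcvbuf = (uint32_t)want;

	s->round_start_us = now_us;
	s->bytes_in_round = 0;
	return s->rcvbuf;
}
```

Starting from a 64K buffer with a 200us RTT estimate, receiving 120000
bytes within one round would grow the buffer to 240000 bytes.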

thanks,
vlad

 >
 > Obviously setting the default too large will just waste memory...
 >
 > --
 >   John-Mark Gurney                              Voice: +1 415 225 5579
 >
 >      "All that I will do, has been done, All that I have, has not."


