question about trimming data "len" conditions in TSO in tcp_output.c

Cui, Cheng Cheng.Cui at netapp.com
Fri Nov 13 23:02:06 UTC 2015


On 11/6/15, 5:39 PM, "Hans Petter Selasky" <hps at selasky.org> wrote:


>On 11/06/15 22:56, Cui, Cheng wrote:
>> On Nov 6, 2015, at 4:16 PM, Hans Petter Selasky <hps at selasky.org> wrote:
>>
>>> On 11/06/15 21:46, Cui, Cheng wrote:
>>>> Hello Hans,
>>>>
>>>> Sorry if my previous email did not reach you because of a bad
>>>> subject.
>>>>
>>>> This is Cheng Cui. I am reading the CURRENT FreeBSD code in
>>>> tcp_output.c, and I have a question regarding your change in revision
>>>> 271946:
>>>>
>>>> https://svnweb.freebsd.org/base/head/sys/netinet/tcp_output.c?r1=271946&r2=271945&pathrev=271946
>>>>
>>>> trim data "len" under TSO:
>>>>
>>>>                         /*
>>>>                          * Prevent the last segment from being
>>>>                          * fractional unless the send sockbuf can be
>>>>                          * emptied:
>>>>                          */
>>>>                         max_len = (tp->t_maxopd - optlen);
>>>>                         if ((off + len) < sbavail(&so->so_snd)) {   <==
>>>>                                 moff = len % max_len;
>>>>                                 if (moff != 0) {
>>>>                                         len -= moff;
>>>>                                         sendalot = 1;
>>>>                                 }
>>>>                         }
>>>>
>>>> Is there a specific reason that it should skip trimming the data
>>>> "len" under the condition "(off + len) == sbavail(&so->so_snd)" in
>>>> TSO? I am wondering if we can trim the data "len" directly, without
>>>> checking the "(off + len)" condition.
>>>
>>> Hi Cheng,
>>>
>>> I believe the reason is to avoid looping one more time outputting a
>>> single packet containing the remainder of the available data, with
>>> regard to max_len.
>
> How did you envision the removal of this check would influence the
> generated packet sequence?
>>>
>>> --HPS
>>>
>> Hi Hans,
>>
>> I may be wrong, but my assumption is that the remainder of the available
>> data may be larger than a single packet.
>>
>> Suppose max_len==1500, sb_acc==3001, off==2, and (off+len)==3001. In
>> this case, the current code will not trim "len" and lets it go directly
>> to the NIC. I think this skips Nagle's algorithm: since len==2999, the
>> last packet is 1499 bytes, which is supposed to be held until all
>> outstanding data are ACKed, but it has been sent out.
>
>Hi Cheng,
>
>That is correct. Nagle's algorithm is not active when "(off+len) ==
>sb_acc". Anyhow, the check for "(off+len) == sb_acc" does not go away.
>It has to be put before "sendalot = 1" to avoid sending the so-called
>"small packet" in the next iteration. Possibly you will need to add a
>check for TCP nodelay being active, which disables Nagle's algorithm.
>Have you done any tests removing this check?
>Have you done any tests removing this check?
>
>--HPS
Hi Hans,

Sorry for the delay in continuing this discussion. I did some tests and
collected some trace files using iperf and tcpdump.

Well, I did not find anything wrong with Nagle's algorithm. But I found
that the remainder chunk of data can be larger than a single packet, which
pushes the NIC to send an extra fractional packet when the send buffer
size meets a certain condition.
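
To make the arithmetic concrete, here is a minimal user-space sketch of
the trim logic. This is my own stand-alone model, not the kernel code;
the function tso_trim() and its parameters are hypothetical names. It
runs the numbers from my earlier example:

#include <stdio.h>

/*
 * Stand-alone model of the TSO length trim in tcp_output().
 * When trim_always is 0, the trim is skipped whenever this send
 * would empty the send socket buffer -- the current behavior.
 */
static long
tso_trim(long len, long off, long sb_cc, long max_len, int trim_always)
{
        long moff;

        if (trim_always || (off + len) < sb_cc) {
                moff = len % max_len;
                if (moff != 0)
                        len -= moff;    /* round down to a multiple of max_len */
        }
        return (len);
}

int
main(void)
{
        long max_len = 1500, sb_cc = 3001, off = 2;
        long len = sb_cc - off; /* 2999: this send would drain the sockbuf */

        /* Current code: the "(off + len) < sb_cc" test fails, len stays
         * 2999, and the NIC splits the TSO chunk into 1500 + 1499. */
        printf("current:  len = %ld\n", tso_trim(len, off, sb_cc, max_len, 0));

        /* Condition removed: len is trimmed to 1500, and the 1499-byte
         * remainder goes out in the next iteration via sendalot. */
        printf("modified: len = %ld\n", tso_trim(len, off, sb_cc, max_len, 1));
        return (0);
}

In this one-shot example the bytes on the wire come out the same either
way; the difference I measured below shows up with back-to-back writes,
where I believe the untrimmed fractional segment leaves later TSO chunks
misaligned with max_len.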

Here is my test. The iperf command I chose pushes 5793-byte writes into a
7240-byte send buffer, via the "-l" and "-w" options (the requested
5793-byte window is rounded up to 7240 bytes, as the iperf output below
shows). I tested this TCP connection's performance on a pair of FreeBSD
10.2 nodes (s1 and r1) with a switch in between. Both nodes have TSO and
delayed ACK enabled.

root at s1:~ # ping -c 3 r1
PING r1-link1 (10.1.2.3): 56 data bytes
64 bytes from 10.1.2.3: icmp_seq=0 ttl=64 time=0.154 ms
64 bytes from 10.1.2.3: icmp_seq=1 ttl=64 time=0.144 ms
64 bytes from 10.1.2.3: icmp_seq=2 ttl=64 time=0.142 ms

--- r1-link1 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.142/0.147/0.154/0.005 ms

root at r1:~ # ping -c 3 s1
PING s1-link1 (10.1.2.2): 56 data bytes
64 bytes from 10.1.2.2: icmp_seq=0 ttl=64 time=0.163 ms
64 bytes from 10.1.2.2: icmp_seq=1 ttl=64 time=0.145 ms
64 bytes from 10.1.2.2: icmp_seq=2 ttl=64 time=0.143 ms

--- s1-link1 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.143/0.150/0.163/0.009 ms

iperf -s  <== iperf command at receiver
iperf -c 10.1.2.3 -l 5793 -w 5793 -n 10M -m -f B  <== iperf command at sender

------------------------------------------------------------
Client connecting to 10.1.2.3, TCP port 5001
TCP window size: 7240 Byte (WARNING: requested 5793 Byte)
------------------------------------------------------------
[  3] local 10.1.2.2 port 16338 connected with 10.1.2.3 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 0.5 sec  10491123 Bytes  22615589 Bytes/sec
[  3] MSS size 1448 bytes (MTU 1500 bytes, ethernet)

I sent 10 MBytes of data and collected the packet traces from both nodes
with tcpdump. I ran the test twice to confirm the result is reproducible.

In the trace files from both nodes before my code change, I see a lot of
fractional packets. See the attached trace files in
"before_code_change.zip".

Then I made my code change in the 10.2 source by commenting out the data
trim condition below:

                        /*
                         * Prevent the last segment from being
                         * fractional unless the send sockbuf can be
                         * emptied:
                         */
                        max_len = (tp->t_maxopd - optlen);
//                      if ((off + len) < so->so_snd.sb_cc) {
                                moff = len % max_len;
                                if (moff != 0) {
                                        len -= moff;
                                        sendalot = 1;
                                }
//                      }


I then ran the same iperf test and gathered trace files. This time I did
not find many fractional packets. See the attached trace files in
"after_code_change.zip".

Comparing the receiver traces, the receiver got the same 7251 packets in
both runs after the change, instead of the 9060 packets it got before the
change. That is a saving of about 20% on the wire ((9060 - 7251) / 9060
is about 20%).

Comparing the sender traces, the sender's TSO path handled 2185 and 1839
packets in the two runs, instead of 4498 and 4473 packets before the
change. That is a saving of more than 50% in the handling of TSO chunks
((4498 - 2185) / 4498 is about 51%, and (4473 - 1839) / 4473 is about
59%).

There may be other conditions I did not cover, but I think the current
data trim in TSO can be improved by removing the above condition.
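
For completeness, here is a rough sketch of the alternative Hans suggested
earlier in the thread: keep the trim unconditional, but decide separately
whether to loop again for the fractional tail, letting it wait when
Nagle's algorithm is active. This is my untested interpretation against
the 10.2 source, not a proposed patch; the TF_NODELAY test is the extra
check Hans mentioned:

                        max_len = (tp->t_maxopd - optlen);
                        moff = len % max_len;
                        if (moff != 0) {
                                len -= moff;
                                /*
                                 * Loop again for the remainder only if
                                 * more data sits behind it in the sockbuf,
                                 * or if TF_NODELAY lets the small packet
                                 * go out immediately; otherwise leave the
                                 * tail for a later tcp_output() call, as
                                 * Nagle's algorithm would expect.
                                 */
                                if ((off + len + moff) < so->so_snd.sb_cc ||
                                    (tp->t_flags & TF_NODELAY) != 0)
                                        sendalot = 1;
                        }

I have not tested this variant; the numbers above are all from the simple
commented-out change.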

Trace files before/after code change are attached.







