recwin change
Scheffenegger, Richard
Richard.Scheffenegger at netapp.com
Tue Apr 21 19:48:50 UTC 2020
Apparently that code is not the problem; the right-shifting (without rounding) later on, when assigning th_win, can be (in particular with older Linux versions).
E.g. assume rcv_scale = 6 (64-byte granularity), ack = 1000, and recwin = 65536 bytes.
Recwin was advertised as 64 kB (a th_win value of 1024). After a single byte is sent, and not retrieved from the receive socket buffer, the downscaled window value signaled becomes 65535 >> 6 = 1023.
Before, the sender thought the right edge of the receive window was 1000 + (1024 << 6) = 1000 + 65536 = 66536. After that single byte, the new right edge is calculated as 1001 + (1023 << 6) = 1001 + 65472 = 66473.
For a BSD sender, this is not an issue, but older Linux senders would sometimes arrive at interesting seq# when they eventually sent out window updates / new data, which can lead to deadlock situations.
We found (and here I initially misunderstood our engineering, thinking this could only happen when we are sending zero or close-to-zero windows) that rounding up the scaled-down window value, that is, whenever it is fractionally larger than a full multiple of 1 << rcv_scale, works around these broken clients.
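The arithmetic above can be sketched in a few lines of C (an illustration only, not the actual kernel code; right_edge is a made-up helper name standing in for the inline shift):

```c
#include <stdint.h>

/*
 * Sketch of the scaled-window arithmetic: the peer computes the right
 * edge of the receive window as ack + (th_win << rcv_scale).  Since
 * th_win = recwin >> rcv_scale truncates, the right edge can move
 * backwards after a small segment is received and left in the buffer.
 */
static uint32_t
right_edge(uint32_t ack, uint32_t recwin, unsigned rcv_scale, int round_up)
{
	uint32_t th_win;

	if (round_up)	/* round up to the next multiple of 1 << rcv_scale */
		th_win = (recwin + (1U << rcv_scale) - 1) >> rcv_scale;
	else		/* plain truncating shift */
		th_win = recwin >> rcv_scale;
	return (ack + (th_win << rcv_scale));
}
```

With rcv_scale = 6: right_edge(1000, 65536, 6, 0) gives 66536; after one byte is buffered, right_edge(1001, 65535, 6, 0) gives 66473, i.e. the edge retreated by 63 bytes, while the rounded-up variant right_edge(1001, 65535, 6, 1) gives 66537, so the edge never shrinks (at the cost of overstating the free buffer space by at most (1 << rcv_scale) - 1 bytes).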
In the broken case, a trace could look like this:
580174 5.327895000 .223 .24 TCP 70 968→2049 [ACK] Seq=1595000295 Ack=3293041980 Win=24576 Len=0 TSval=3196234470 TSecr=2795460094
580175 5.327897000 .223 .24 TCP 70 968→2049 [ACK] Seq=1595000271 Ack=3293054636 Win=24576 Len=0 TSval=3196234470 TSecr=2795460094
...
580686 5.531477000 .223 .24 TCP 70 968→2049 [ACK] Seq=1595000247 Ack=3296187708 Win=24576 Len=0 TSval=3196234674 TSecr=2795460299
Fields extracted below (tshark): -e frame.number -e ip.src -e tcp.srcport -e tcp.seq -e tcp.ack -e tcp.window_size -e tcp.nxtseq
579940 .24 2049 2204388169 2413854113 11188 2204397089
579941 .223 968 2414003217 2203237489 24576 2414012165
579942 .24 2049 2204397089 2413854113 11188 2204406009 << ACKed by frame 580174
579943 .223 968 2414021113 2203237489 24576 2414030061
579944 .223 968 2414039009 2203237489 24576 2414047957
579945 .24 2049 2204406009 2413854113 11188 2204414929
579946 .24 2049 2204414929 2413872009 11048 2204418665 << ACked by frame 580175
...
580174 .223 968 2415286177 2204406009 24576
580175 .223 968 2415286153 2204418665 24576
(The client reneging its seq# is RHEL 7.2.)
The problem is that these ACKs with reneged seq# are not fully processed; if they are window updates (indicating an open window again in the reverse direction; remember, this is transactional NFS), they get dropped and not processed.
Now, the Linux-side bug behind this seq# reneging with scaled windows has been fixed for some time, but old clients are still around…
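The test that drops those window updates is the classic RFC 793 acceptance check (sketched here outside the kernel; SEQ_LT/SEQ_LEQ are the usual modulo-2^32 sequence comparisons). An ACK whose seq# has gone backwards, like frame 580175 in the trace above, fails both arms, so the window update it carries is ignored:

```c
#include <stdint.h>

#define SEQ_LT(a, b)	((int32_t)((a) - (b)) < 0)
#define SEQ_LEQ(a, b)	((int32_t)((a) - (b)) <= 0)

/*
 * RFC 793-style window-update acceptance: update only if the segment
 * carries a newer seq#, or the same seq# with an equal-or-newer ack.
 * snd_wl1/snd_wl2 are the seq#/ack# of the last accepted update.
 */
static int
accept_window_update(uint32_t snd_wl1, uint32_t snd_wl2,
    uint32_t seg_seq, uint32_t seg_ack)
{
	return (SEQ_LT(snd_wl1, seg_seq) ||
	    (snd_wl1 == seg_seq && SEQ_LEQ(snd_wl2, seg_ack)));
}
```

For the trace above: with snd_wl1 = 1595000295 (from frame 580174), the reneged seg_seq = 1595000271 of frame 580175 fails SEQ_LT and is not equal, so its window update is discarded.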
Richard Scheffenegger
Consulting Solution Architect
NAS & Networking
NetApp
+43 1 3676 811 3157 Direct Phone
+43 664 8866 1857 Mobile Phone
Richard.Scheffenegger at netapp.com<mailto:Richard.Scheffenegger at netapp.com>
https://ts.la/richard49892
From: Jonathan Looney <jtl at netflix.com>
Sent: Dienstag, 21. April 2020 21:09
To: Scheffenegger, Richard <Richard.Scheffenegger at netapp.com>
Cc: transport at freebsd.org; Michael Tuexen <tuexen at freebsd.org>; Randall Stewart <rrs at netflix.com>; Lawrence Stewart <lstewart at netflix.com>; rgrimes at freebsd.org; Cui, Cheng <Cheng.Cui at netapp.com>
Subject: Re: recwin change
This looks to me like it is working normally. The system is refusing to renege on advertised window.
See the code from tcp_output() below:
	/*
	 * Calculate receive window.  Don't shrink window,
	 * but avoid silly window syndrome.
	 * If a RST segment is sent, advertise a window of zero.
	 */
	if (flags & TH_RST) {
		recwin = 0;
	} else {
		if (recwin < (so->so_rcv.sb_hiwat / 4) &&
		    recwin < tp->t_maxseg)
			recwin = 0;
		if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt) &&
		    recwin < (tp->rcv_adv - tp->rcv_nxt))
			recwin = (tp->rcv_adv - tp->rcv_nxt);
	}
Or have I misunderstood what you have identified as the bug?
Jonathan
On Tue, Apr 21, 2020 at 10:37 AM Scheffenegger, Richard <Richard.Scheffenegger at netapp.com<mailto:Richard.Scheffenegger at netapp.com>> wrote:
Ah, sorry, I must have misread earlier.
Here is my packetdrill test script, where I wanted to look into premature shrinking of the right edge of the receive window when scaling is in effect. When I try to clamp down the receive window really low, it is set to no less than 64 kB:
[root at freebsd ~]# cat newreno-shrinking-window.pkt
// A simple server-side test that sends exactly an initial window (IW10)
// worth of packets.
--tolerance_usecs=500000
// Flush Hostcache
//0.0 `kldload cc_cubic`
0.0 `sysctl net.inet.tcp.cc.algorithm=newreno`
0.1 `sysctl net.inet.tcp.initcwnd_segments=10`
0.2 `sysctl net.inet.tcp.hostcache.purgenow=1`
0.3 `sysctl net.inet.tcp.rfc3465=0`
//0.3 `sync` // in case of crash
// Create a listening TCP socket.
0.50 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.005 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.005 setsockopt(3, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
+0.005 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1048576], 4) = 0
+0.005 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [70000], 4) = 0
+0.005 bind(3, ..., ...) = 0
+0.005 listen(3, 1) = 0
// Establish a TCP connection with ECN to explicitly track CWR
// Set WindowScale to multiplicative factor of 1kB to allow huge increase
+0.035 < S 0:0(0) win 65535 <mss 1460, sackOK, wscale 10, eol, nop, nop>
+0.000 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,wscale 6,sackOK,eol,eol>
+0.000 < . 1:1(0) ack 1 win 65535
+0.000 accept(3, ..., ...) = 4
+0.005 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1048576], 4) = 0
+0.005 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [10000], 4) = 0
//+0 > . 1:1(0) ack 1
// Filling up the receive buffer
+0 < . 1:1461(1460) ack 1 win 65535
+0 < . 1461:2921(1460) ack 1 win 65535
+0 > . 1:1(0) ack 2921 win 978 // 62592
+0.005 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [10000], 4) = 0
+0.005 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [10000], 4) = 0
+0 < . 2921:4381(1460) ack 1 win 65535
+0 < . 4381:5841(1460) ack 1 win 65535
+0 > . 1:1(0) ack 5841 win 932 // 59648
+0 < . 5841:5999(158) ack 1 win 65535
+0 < P. 5999:6000(1) ack 1 win 65535
+0 > . 1:1(0) ack 6000 win 930 // 59520
From: Jonathan Looney <jtl at netflix.com<mailto:jtl at netflix.com>>
Sent: Dienstag, 21. April 2020 16:27
To: Scheffenegger, Richard <Richard.Scheffenegger at netapp.com<mailto:Richard.Scheffenegger at netapp.com>>
Cc: transport at freebsd.org<mailto:transport at freebsd.org>; Michael Tuexen <tuexen at freebsd.org<mailto:tuexen at freebsd.org>>; Randall Stewart <rrs at netflix.com<mailto:rrs at netflix.com>>; Lawrence Stewart <lstewart at netflix.com<mailto:lstewart at netflix.com>>; rgrimes at freebsd.org<mailto:rgrimes at freebsd.org>; Cui, Cheng <Cheng.Cui at netapp.com<mailto:Cheng.Cui at netapp.com>>
Subject: Re: recwin change
On Tue, Apr 21, 2020 at 9:59 AM Scheffenegger, Richard <Richard.Scheffenegger at netapp.com<mailto:Richard.Scheffenegger at netapp.com>> wrote:
Hi Jonathan,
In your larger patch to fix up long int to int32_t,
https://reviews.freebsd.org/rS306769#change-l6GoMSS8L7SS
you seem to have slipped in a functional change for the receive window:
- recwin = sbspace(&so->so_rcv);
+ recwin = lmin(lmax(sbspace(&so->so_rcv), 0),
+ (long)TCP_MAXWIN << tp->rcv_scale);
While https://reviews.freebsd.org/D7073
makes it clear that the lmax(sbspace(&so->so_rcv), 0) is there to prevent any potential negative value from being signaled as a very large receive window,
that change also signals at least TCP_MAXWIN, even when the socket receive buffer may be much smaller.
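For reference, a minimal sketch of what that lmin/lmax combination computes (TCP_MAXWIN is 65535 as in netinet/tcp.h; clamp_recwin is a hypothetical stand-in for the inline expression, not kernel code):

```c
#define TCP_MAXWIN	65535	/* largest unscaled TCP window, per tcp.h */

/*
 * Sketch of the clamp in the patch above: the result is bounded into
 * [0, TCP_MAXWIN << rcv_scale].  lmax() guards against a negative
 * sbspace() result; lmin() imposes the upper bound.
 */
static long
clamp_recwin(long space, int rcv_scale)
{
	long hi = (long)TCP_MAXWIN << rcv_scale;
	long lo = (space > 0) ? space : 0;	/* lmax(space, 0) */

	return ((lo < hi) ? lo : hi);		/* lmin(lo, hi)  */
}
```

E.g. with rcv_scale = 6, a negative sbspace() clamps to 0, a small buffer (say 59000 bytes) passes through unchanged, and only values above 65535 << 6 = 4194240 are capped.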
I don't think I understand what you are suggesting. Can you give an example where this may occur?
And the typecast long was missed in your fix-up to get rid of all longs in the tcp stack 😉.
Actually, that was purposeful. Because this is being sent through a function which expects a long, this ensures the value will be treated as a long. It is probably unnecessary, but it shouldn't be harmful.
Jonathan