recwin change
Scheffenegger, Richard
Richard.Scheffenegger at netapp.com
Tue Apr 21 19:48:50 UTC 2020
Apparently that code is not the problem; the right-shifting (without rounding) later on, when assigning th_win, can be (in particular with older Linux versions).
E.g. assume rcv_scale = 6 (64-byte granularity), ack = 1000, and recwin = 65536 bytes.
Recwin was advertised as 64 kB (a th_win value of 1024). After a single byte is sent, and not retrieved from the receive socket buffer, the downscaled window value signaled becomes 65535 >> 6 = 1023.
Before, the sender thought the right edge of the receive window was 1000 + (1024 << 6) = 1000 + 65536 = 66536. After that single byte, the new right edge is calculated as 1001 + (1023 << 6) = 1001 + 65472 = 66473.
For a BSD sender, this is not an issue, but older Linux senders would sometimes arrive at interesting seq# when they eventually sent out window updates / new data, which can lead to deadlock situations.
We found (and here I initially misunderstood our engineering, thinking this could only happen when we are sending zero or close-to-zero windows) that rounding up the scaled-down window value, that is, whenever it is fractionally larger than a full multiple of 1 << rcv_scale, works around these broken clients.
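The arithmetic above can be sketched in a few lines of C (an illustration only, not the actual kernel code; right_edge is a made-up helper name standing in for the inline shift):

```c
#include <stdint.h>

/*
 * Sketch of the scaled-window arithmetic: the peer computes the right
 * edge of the receive window as ack + (th_win << rcv_scale).  Since
 * th_win = recwin >> rcv_scale truncates, the right edge can move
 * backwards after a small segment is received and left in the buffer.
 */
static uint32_t
right_edge(uint32_t ack, uint32_t recwin, unsigned rcv_scale, int round_up)
{
	uint32_t th_win;

	if (round_up)	/* round up to the next multiple of 1 << rcv_scale */
		th_win = (recwin + (1U << rcv_scale) - 1) >> rcv_scale;
	else		/* plain truncating shift */
		th_win = recwin >> rcv_scale;
	return (ack + (th_win << rcv_scale));
}
```

With rcv_scale = 6: right_edge(1000, 65536, 6, 0) gives 66536; after one byte is buffered, right_edge(1001, 65535, 6, 0) gives 66473, i.e. the edge retreated by 63 bytes, while the rounded-up variant right_edge(1001, 65535, 6, 1) gives 66537, so the edge never shrinks (at the cost of overstating the free buffer space by at most (1 << rcv_scale) - 1 bytes).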
In the broken case, a trace could look like this:
580174 5.327895000 .223 .24 TCP 70 968→2049 [ACK] Seq=1595000295 Ack=3293041980 Win=24576 Len=0 TSval=3196234470 TSecr=2795460094
580175 5.327897000 .223 .24 TCP 70 968→2049 [ACK] Seq=1595000271 Ack=3293054636 Win=24576 Len=0 TSval=3196234470 TSecr=2795460094
...
580686 5.531477000 .223 .24 TCP 70 968→2049 [ACK] Seq=1595000247 Ack=3296187708 Win=24576 Len=0 TSval=3196234674 TSecr=2795460299
Fields extracted below (tshark): -e frame.number -e ip.src -e tcp.srcport -e tcp.seq -e tcp.ack -e tcp.window_size -e tcp.nxtseq
579940 .24 2049 2204388169 2413854113 11188 2204397089
579941 .223 968 2414003217 2203237489 24576 2414012165
579942 .24 2049 2204397089 2413854113 11188 2204406009 << ACKed by frame 580174
579943 .223 968 2414021113 2203237489 24576 2414030061
579944 .223 968 2414039009 2203237489 24576 2414047957
579945 .24 2049 2204406009 2413854113 11188 2204414929
579946 .24 2049 2204414929 2413872009 11048 2204418665 << ACked by frame 580175
...
580174 .223 968 2415286177 2204406009 24576
580175 .223 968 2415286153 2204418665 24576
(The client reneging its seq# is RHEL 7.2.)
The problem is that these ACKs with reneged seq# are not fully processed; if they are window updates (indicating an open window again in the reverse direction; remember, this is transactional NFS), they get dropped and not processed.
Now, the Linux-side bug behind this seq# reneging with scaled windows has been fixed for some time, but old clients are still around…
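The test that drops those window updates is the classic RFC 793 acceptance check (sketched here outside the kernel; SEQ_LT/SEQ_LEQ are the usual modulo-2^32 sequence comparisons). An ACK whose seq# has gone backwards, like frame 580175 in the trace above, fails both arms, so the window update it carries is ignored:

```c
#include <stdint.h>

#define SEQ_LT(a, b)	((int32_t)((a) - (b)) < 0)
#define SEQ_LEQ(a, b)	((int32_t)((a) - (b)) <= 0)

/*
 * RFC 793-style window-update acceptance: update only if the segment
 * carries a newer seq#, or the same seq# with an equal-or-newer ack.
 * snd_wl1/snd_wl2 are the seq#/ack# of the last accepted update.
 */
static int
accept_window_update(uint32_t snd_wl1, uint32_t snd_wl2,
    uint32_t seg_seq, uint32_t seg_ack)
{
	return (SEQ_LT(snd_wl1, seg_seq) ||
	    (snd_wl1 == seg_seq && SEQ_LEQ(snd_wl2, seg_ack)));
}
```

For the trace above: with snd_wl1 = 1595000295 (from frame 580174), the reneged seg_seq = 1595000271 of frame 580175 fails SEQ_LT and is not equal, so its window update is discarded.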
Richard Scheffenegger
Consulting Solution Architect
NAS & Networking
NetApp
+43 1 3676 811 3157 Direct Phone
+43 664 8866 1857 Mobile Phone
Richard.Scheffenegger at netapp.com<mailto:Richard.Scheffenegger at netapp.com>
https://ts.la/richard49892
From: Jonathan Looney <jtl at netflix.com>
Sent: Dienstag, 21. April 2020 21:09
To: Scheffenegger, Richard <Richard.Scheffenegger at netapp.com>
Cc: transport at freebsd.org; Michael Tuexen <tuexen at freebsd.org>; Randall Stewart <rrs at netflix.com>; Lawrence Stewart <lstewart at netflix.com>; rgrimes at freebsd.org; Cui, Cheng <Cheng.Cui at netapp.com>
Subject: Re: recwin change
This looks to me like it is working normally. The system is refusing to renege on advertised window.
See the code from tcp_output() below:
	/*
	 * Calculate receive window.  Don't shrink window,
	 * but avoid silly window syndrome.
	 * If a RST segment is sent, advertise a window of zero.
	 */
	if (flags & TH_RST) {
		recwin = 0;
	} else {
		if (recwin < (so->so_rcv.sb_hiwat / 4) &&
		    recwin < tp->t_maxseg)
			recwin = 0;
		if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt) &&
		    recwin < (tp->rcv_adv - tp->rcv_nxt))
			recwin = (tp->rcv_adv - tp->rcv_nxt);
	}
Or have I misunderstood what you have identified as the bug?
Jonathan
On Tue, Apr 21, 2020 at 10:37 AM Scheffenegger, Richard <Richard.Scheffenegger at netapp.com<mailto:Richard.Scheffenegger at netapp.com>> wrote:
Ah, sorry, I must have misread earlier.
Here is my packetdrill test script, where I wanted to look into premature shrinking of the right edge of the receive window when scaling is in effect. When I try to clamp down the receive window really low, it is set to no less than 64 kB:
[root at freebsd ~]# cat newreno-shrinking-window.pkt
// A simple server-side test that sends exactly an initial window (IW10)
// worth of packets.
--tolerance_usecs=500000
// Flush Hostcache
//0.0 `kldload cc_cubic`
0.0 `sysctl net.inet.tcp.cc.algorithm=newreno`
0.1 `sysctl net.inet.tcp.initcwnd_segments=10`
0.2 `sysctl net.inet.tcp.hostcache.purgenow=1`
0.3 `sysctl net.inet.tcp.rfc3465=0`
//0.3 `sync` // in case of crash
// Create a listening TCP socket.
0.50 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.005 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.005 setsockopt(3, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
+0.005 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1048576], 4) = 0
+0.005 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [70000], 4) = 0
+0.005 bind(3, ..., ...) = 0
+0.005 listen(3, 1) = 0
// Establish a TCP connection with ECN to explicitly track CWR
// Set WindowScale to multiplicative factor of 1kB to allow huge increase
+0.035 < S 0:0(0) win 65535 <mss 1460, sackOK, wscale 10, eol, nop, nop>
+0.000 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,wscale 6,sackOK,eol,eol>
+0.000 < . 1:1(0) ack 1 win 65535
+0.000 accept(3, ..., ...) = 4
+0.005 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1048576], 4) = 0
+0.005 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [10000], 4) = 0
//+0 > . 1:1(0) ack 1
// Filling up the receive buffer
+0 < . 1:1461(1460) ack 1 win 65535
+0 < . 1461:2921(1460) ack 1 win 65535
+0 > . 1:1(0) ack 2921 win 978 // 62592
+0.005 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [10000], 4) = 0
+0.005 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [10000], 4) = 0
+0 < . 2921:4381(1460) ack 1 win 65535
+0 < . 4381:5841(1460) ack 1 win 65535
+0 > . 1:1(0) ack 5841 win 932 // 59648
+0 < . 5841:5999(158) ack 1 win 65535
+0 < P. 5999:6000(1) ack 1 win 65535
+0 > . 1:1(0) ack 6000 win 930 // 59520
From: Jonathan Looney <jtl at netflix.com<mailto:jtl at netflix.com>>
Sent: Dienstag, 21. April 2020 16:27
To: Scheffenegger, Richard <Richard.Scheffenegger at netapp.com<mailto:Richard.Scheffenegger at netapp.com>>
Cc: transport at freebsd.org<mailto:transport at freebsd.org>; Michael Tuexen <tuexen at freebsd.org<mailto:tuexen at freebsd.org>>; Randall Stewart <rrs at netflix.com<mailto:rrs at netflix.com>>; Lawrence Stewart <lstewart at netflix.com<mailto:lstewart at netflix.com>>; rgrimes at freebsd.org<mailto:rgrimes at freebsd.org>; Cui, Cheng <Cheng.Cui at netapp.com<mailto:Cheng.Cui at netapp.com>>
Subject: Re: recwin change
On Tue, Apr 21, 2020 at 9:59 AM Scheffenegger, Richard <Richard.Scheffenegger at netapp.com<mailto:Richard.Scheffenegger at netapp.com>> wrote:
Hi Jonathan,
In your larger patch to fix up long int to int32_t,
https://reviews.freebsd.org/rS306769#change-l6GoMSS8L7SS
you seem to have slipped in a functional change for the receive window:
- recwin = sbspace(&so->so_rcv);
+ recwin = lmin(lmax(sbspace(&so->so_rcv), 0),
+ (long)TCP_MAXWIN << tp->rcv_scale);
While https://reviews.freebsd.org/D7073
makes it clear that the lmax(sbspace(&so->so_rcv), 0) is there to prevent any potential negative value from being signaled as a very large receive window,
that change also signals at least TCP_MAXWIN, even when the socket receive buffer may be much smaller.
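For reference, a minimal sketch of what that lmin/lmax combination computes (TCP_MAXWIN is 65535 as in netinet/tcp.h; clamp_recwin is a hypothetical stand-in for the inline expression, not kernel code):

```c
#define TCP_MAXWIN	65535	/* largest unscaled TCP window, per tcp.h */

/*
 * Sketch of the clamp in the patch above: the result is bounded into
 * [0, TCP_MAXWIN << rcv_scale].  lmax() guards against a negative
 * sbspace() result; lmin() imposes the upper bound.
 */
static long
clamp_recwin(long space, int rcv_scale)
{
	long hi = (long)TCP_MAXWIN << rcv_scale;
	long lo = (space > 0) ? space : 0;	/* lmax(space, 0) */

	return ((lo < hi) ? lo : hi);		/* lmin(lo, hi)  */
}
```

E.g. with rcv_scale = 6, a negative sbspace() clamps to 0, a small buffer (say 59000 bytes) passes through unchanged, and only values above 65535 << 6 = 4194240 are capped.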
I don't think I understand what you are suggesting. Can you give an example where this may occur?
And the typecast long was missed in your fix-up to get rid of all longs in the tcp stack 😉.
Actually, that was purposeful. Because this is being sent through a function which expects a long, this ensures the value will be treated as a long. It is probably unnecessary, but it shouldn't be harmful.
Jonathan