kern/154006: tcp "window probe" bug on 64bit

Sat Jan 15 05:40:07 UTC 2011

The following reply was made to PR kern/154006; it has been noted by GNATS.

From: Bruce Evans <brde at optusnet.com.au>
To: Stefan `Sec` Zehl <sec at 42.org>
Cc: FreeBSD-gnats-submit at FreeBSD.org, freebsd-bugs at FreeBSD.org
Subject: Re: kern/154006: tcp "window probe" bug on 64bit
Date: Sat, 15 Jan 2011 16:31:10 +1100 (EST)

 On Sat, 15 Jan 2011, Stefan `Sec` Zehl wrote:

 >> Description:
 >
 > On amd64 the PERSIST timer does not get started (and consequently executed)
 > for tcp connections stalled on a 0-size receive window. This means that no
 > single-byte probe packet is sent, so connections might hang indefinitely.
 >
 > This is due to a missing (long) conversion in tcp_output.c around line 562
 > where "adv" is calculated.
 >
 > After this patch, amd64 behaves the same way as i386 again.

 >> Fix:
 >
 > --- src/sys/netinet/tcp_output.c	2010-09-20 17:49:17.000000000 +0200
 > +++ src/sys/netinet/tcp_output.c	2011-01-14 19:30:46.000000000 +0100
 > @@ -571,7 +559,7 @@
 > 		 * TCP_MAXWIN << tp->rcv_scale.
 > 		 */
 > 		long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
 > -			(tp->rcv_adv - tp->rcv_nxt);
 > +			(long) (tp->rcv_adv - tp->rcv_nxt);
 >
 > 		if (adv >= (long) (2 * tp->t_maxseg))
 > 			goto send;
 >

 Many other type errors are visible in this patch:
 - min() takes 'unsigned int' args, but is passed 'signed long' args:
    - recwin has type long.  This is smaller (same size but smaller max)
      than 'unsigned int' on 32-bit arches, and larger on 64-bit arches
    - TCP_MAXWIN has type int (except on 16-bit arches, which are not
      supported and are no longer permitted by POSIX).  Then we explicitly
      make its type incompatible with min() by casting to long.  The 16-bit
      arches don't matter, except they are responsible for many of the type
      errors here.  recwin is long and TCP_MAXWIN is cast to long since plain
      int was not long enough on 16-bit arches.
    Hopefully both of min()'s parameters are non-negative and <= UINT_MAX.
    Then nothing bad happens when min() converts them to u_int.  The result
    of min() has type u_int.
 - rcv_adv has type tcp_seq.  Seems correct
 - tcp_seq has type u_int32_t.  Seems correct, except for its old spelling.
    The spelling is not so old that it is u_long (to support the 16-bit arches),
    but it hasn't caught up with C99 yet.
 - rcv_nxt has type u_int32_t.  Seems logically incorrect -- should be tcp_seq.
 - (tp->rcv_adv - tp->rcv_nxt) has type [ the default promotion of { tcp_seq,
    u_int32_t } ].  This is u_int on all supported arches.  Apparently, the
    value of this should always be positive, since the cast doesn't change
    this on 64-bit arches.  However, the cast might break this on 32-bit
    arches (it breaks the value whenever it exceeds 0x80000000, if that can
    happen, since longs are smaller than u_int's on 32-bit arches).
 - the type of the expression for the rvalue is [ the default promotion of
    { u_int, u_int } ] in the old version, and the same with the last u_int
    replaced by long in the patched version.  It is most natural to subtract
    u_int's here, like the old version did -- everything in sight is (except
    for all the type errors) a sequence number or a difference of sequence
    numbers; the differences are always taken mod 2**32 and are non-negative,
    but we must be careful if the difference should really be negative.  The
    SEQ_LT() family of macros can be used to determine if differences should
    be negative (this family is further towards losing 16-bitness -- it casts
    to int instead of to long).  Unfortunately there is no SEQ_DIFF() macro
    to simplify easy cases of taking differences.  I think there are scattered
    casts for this as here.

 So casting to long is not good.  It gives another type error to analyse,
 and works accidentally.
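
 To make the min() conversions in the list above concrete, here is a minimal
 userland sketch.  u_min() is only a stand-in for the kernel's min(),
 clamp_window() is a hypothetical helper, and int64_t models a 64-bit long.
 It assumes a 32-bit unsigned int.

```c
#include <stdint.h>

/* Userland stand-in for the kernel's min(): both arguments are u_int. */
static unsigned int
u_min(unsigned int a, unsigned int b)
{
	return (a < b ? a : b);
}

/*
 * Model of min(recwin, (long)TCP_MAXWIN << scale) with a 64-bit long.
 * int64_t stands in for long so the result does not depend on the host.
 * Hypothetical helper, for illustration only.
 */
static unsigned int
clamp_window(int64_t recwin, int scale)
{
	int64_t maxwin = (int64_t)65535 << scale;	/* TCP_MAXWIN is 65535 */

	/* Both arguments are converted to unsigned int (mod 2**32) here. */
	return (u_min((unsigned int)recwin, (unsigned int)maxwin));
}
```

 In this model clamp_window(0x100000010, 14) returns 16 because the high
 bits are dropped, and a recwin of -1 becomes a huge unsigned value that is
 then clamped to 65535.  Nothing bad happens only while both arguments stay
 within 0..UINT_MAX.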

 Further analysis: without the patch:

  		long adv = x - y;

 where x has type u_int and y has type u_int.  The difference always has
 type u_int; if x is sequentially less than y, then the difference should
 be negative, but its type forces it to be positive.  We should use
 SEQ_FOO() if this is possible, or we can use delicate conversions if we
 do only 2 pages of analysis per line to justify the delicacies (not too
 bad if there is a macro for this).

 - On 32-bit arches, long is smaller than u_int, so the assignment overflows
    if the difference should have been negative.  The result is
    implementation-defined, but on normal 2's complement arches, it is benign
    and fixes up the sign error.

 - On 64-bit arches, long is larger than u_int, so the difference remains
    nonnegative when it should have been negative, and is normally huge
    (something like 0U - 1U = 0xFFFFFFFF).  The huge value is near UINT_MAX.
    LONG_MAX is much larger, so the assignment doesn't overflow and the
    value remains near UINT_MAX.
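
 The two cases above can be reproduced on any host with a small sketch that
 uses fixed-width types in place of long.  The function names are
 hypothetical; int32_t and int64_t model long on 32-bit and 64-bit arches,
 and the 32-bit result assumes the usual 2's complement wraparound.

```c
#include <stdint.h>

/* "long adv = x - y" as seen where long is 32 bits wide. */
static int32_t
adv_long32(uint32_t x, uint32_t y)
{
	uint32_t diff = x - y;	/* taken mod 2**32, never negative */

	/* Out-of-range conversion: wraps on 2's complement arches. */
	return ((int32_t)diff);
}

/* The same statement where long is 64 bits wide. */
static int64_t
adv_long64(uint32_t x, uint32_t y)
{
	uint32_t diff = x - y;	/* still mod 2**32 */

	/* The value fits, so it stays huge and positive. */
	return ((int64_t)diff);
}
```

 With x = 0 and y = 1, adv_long32() gives -1 while adv_long64() gives
 4294967295 (0xFFFFFFFF).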

 With the patch:

  		long adv = x - (long)y;

 where x has type u_int and (long)y has type long:

 - On 32-bit arches, long is smaller than u_int, so (long)y may overflow;
    the result of the conversion is implementation-defined but happens to be
    benign.  Then
    the binary promotions apply.  Although I have been describing long as
    being smaller than u_int on 32-bit arches, in the C type system it is
    logically larger, so the binary promotions promote x to long too, and
    leave (long)y unchanged.  "Promotion" of x is really demotion, so it
    may overflow benignly just like for y.  I think the difference doesn't
    overflow, and even if it does then the result is the same as before,
    since everything will be done in 32-bit registers using the same code
    as before.

 - On 64-bit arches: long is larger than u_int, so (long)y doesn't change
    the value of y.  The binary promotions then promote x to long without
    changing its value, and don't change (long)y's type or value.  Both
    terms remain nonnegative.  (long)y can still be garbage -- something
    like 0xFFFFFFFF when it should be -1.  I think this causes problems,
    but much smaller than before.  Oops, the above may be wrong about y possibly
    wanting to be negative.  Things are not quite as complicated if this
    sequence cannot occur:
    - if this can occur, then (x - (long)y) is a large negative number when
      it should be a small positive number (not much larger than x).  This
      doesn't seem to be what causes the main problem.
    - the main problem is just when x < y.  Then (x - y) gives a huge
      unsigned int value (which bogusly assigning to a long doesn't fix
      up for the 64-bit case).  But (x - (long)y) gives a negative value
      when x < y, without additional type errors or overflows on either
      32-bit or 64-bit arches provided x and y are not very large.
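
 A minimal model of the patched expression on a 64-bit arch, again with
 int64_t standing in for long and a hypothetical function name, shows both
 sub-cases:

```c
#include <stdint.h>

/* "long adv = x - (long)y" where long is 64 bits wide. */
static int64_t
adv_patched_long64(uint32_t x, uint32_t y)
{
	/* (long)y keeps y's value; x is then converted to long as well. */
	return ((int64_t)x - (int64_t)y);
}
```

 Here x = 0, y = 1 gives -1 as wanted.  If y itself holds 0xFFFFFFFF where
 -1 was meant, x = 10 gives -4294967285 instead of a small positive number.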

 Better fixes:

 (A) explicitly convert to int instead of implicitly converting to long:

  		long adv = (int)
  		    (min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
  		    (tp->rcv_adv - tp->rcv_nxt));

 or more complete fixes for type errors (beware of things needing to remain
 bogusly long):

  		/* Also change recwin to int32_t. */
  		int adv = imin(recwin, TCP_MAXWIN << tp->rcv_scale) -
  		    (int)(tp->rcv_adv - tp->rcv_nxt);

 This doesn't fix some style bugs:
 - nested declaration.
 - initialization in declaration
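
 As a rough userland model of the all-int variant above: i_min() stands in
 for the kernel's imin(), adv_int() is a hypothetical wrapper, and the
 tcpcb fields are passed in as plain arguments.  It assumes a 32-bit int
 and 2's complement wraparound for the (int) cast.

```c
#include <stdint.h>

/* Userland stand-in for the kernel's imin(). */
static int
i_min(int a, int b)
{
	return (a < b ? a : b);
}

/* Hypothetical wrapper: every operand is an int before subtracting. */
static int
adv_int(int recwin, int scale, uint32_t rcv_adv, uint32_t rcv_nxt)
{
	int win, advertised;

	win = i_min(recwin, 65535 << scale);	/* TCP_MAXWIN << rcv_scale */
	advertised = (int)(rcv_adv - rcv_nxt);	/* mod 2**32, then signed */
	return (win - advertised);
}
```

 With recwin = 0 and rcv_adv 1000 ahead of rcv_nxt this returns -1000
 regardless of the width of long, since no long is involved.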

 tcp code already uses scattered conversions like this a bit too much.  E.g.,
 in tcp_input.c, there is one imax() very like the above imin().  This seems
 to be the only one involving the window, however; it initializes `win'
 which already has type int, but some other window variables have type
 u_int...

 Later code in tcp_output uses bogus casts to long and larger code instead:

 % 	if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
 % 		recwin = (long)(tp->rcv_adv - tp->rcv_nxt);
 % 	if (recwin > (long)TCP_MAXWIN << tp->rcv_scale)
 % 		recwin = (long)TCP_MAXWIN << tp->rcv_scale;
 % 	...
 % 	if (recwin > 0 && SEQ_GT(tp->rcv_nxt + recwin, tp->rcv_adv))
 % 		tp->rcv_adv = tp->rcv_nxt + recwin;

 Note that the first statement avoids using the technically incorrect
 SEQ_FOO() although its internals are better (cast to int instead of
 long).  It uses casts essentially like yours.  Then further analysis
 is simpler because everything is converted to long.  The second statement
 is similar to the first half of the broken expression.  Large code using
 if's and else's and tests (x >= y) before subtracting y from x is much
 easier to get right than 1 complicated 1-statement expression like the
 broken one.  It takes these (x >= y) tests to make code with mixed types
 obviously correct.  But I prefer small fast code with ints for everything,
 since type analysis is too hard.
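
 A sketch of the "test (x >= y) before subtracting" style, using a
 hypothetical helper.  It compares plain values rather than sequence
 numbers mod 2**32, and returns int64_t so that the result is the same
 whatever the width of long.

```c
#include <stdint.h>

/* Hypothetical helper: signed difference of two unsigned values. */
static int64_t
diff_checked(uint32_t x, uint32_t y)
{
	/* Each subtraction is only done in the direction that cannot wrap. */
	if (x >= y)
		return ((int64_t)(x - y));
	return (-(int64_t)(y - x));
}
```

 This is larger than a one-line expression, but each branch is obviously
 correct without any analysis of conversions.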

 (B) Use SEQ_FOO().  This can be used for the difference of the sequence
 numbers, but using it on the final difference is not quite right since
 neither x nor y is a sequence number.  In practice SEQ_LT(x, y) will work.
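
 For reference, a userland sketch of how the SEQ_LT() family compares
 values mod 2**32.  The macro bodies follow the cast-to-int form described
 above; the typedef and the wrapper functions are assumptions added so
 that the snippet stands alone.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;	/* assumed here; normally from the TCP headers */

#define	SEQ_LT(a, b)	((int)((a) - (b)) < 0)
#define	SEQ_GT(a, b)	((int)((a) - (b)) > 0)

/* Hypothetical wrappers so that the arguments are always tcp_seq. */
static int
seq_lt(tcp_seq a, tcp_seq b)
{
	return (SEQ_LT(a, b));
}

static int
seq_gt(tcp_seq a, tcp_seq b)
{
	return (SEQ_GT(a, b));
}
```

 seq_lt(0xFFFFFFF0, 0x10) is true: across the wrap, 0xFFFFFFF0 is
 sequentially 32 below 0x10, although it is numerically far larger.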

 (C) Put (A) or (B) in a macro.  It can depend on benign overflow, or test
 values if necessary.  All this macro is about is subtracting 2 sequence
 values, or possibly differences of and bounds of sequence values, with
 a result that is negative iff that is needed, and a type that is signed
 iff a negative value makes sense or can be handled by the caller (int
 should do for the signed cases, else the type should remain tcp_seq or
 its promotion).  Using ints for tcp_seq is technically invalid since
 they overflow at value INT_MAX.
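
 One possible shape for such a macro, relying on benign overflow.
 SEQ_DIFF() is hypothetical (no such macro exists in the tree), and the
 typedef and wrapper are only there to make the sketch self-contained.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;	/* assumed here; normally from the TCP headers */

/* Hypothetical: signed distance from b to a, taken mod 2**32. */
#define	SEQ_DIFF(a, b)	((int)((tcp_seq)(a) - (tcp_seq)(b)))

/* Hypothetical wrapper so that the macro can be called with typed args. */
static int
seq_diff(tcp_seq a, tcp_seq b)
{
	return (SEQ_DIFF(a, b));
}
```

 seq_diff(0, 1) is -1 and seq_diff(0x10, 0xFFFFFFF0) is 32 on arches with
 32-bit 2's complement ints.  The result is only meaningful while the true
 distance is below 2**31.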

 Bruce