SCTP huge connect delays (at amd64) and yet another question

Thu Dec 5 10:32:05 UTC 2013

On Dec 5, 2013, at 9:41 AM, Valentin Nechayev <netch at netch.kiev.ua> wrote:

> Hi,
> 
> I've got some test results which are surprising and I would get
> a clarification.
> 
> A simple connection is created between two one-to-one SCTP sockets
> (AF_INET, SOCK_STREAM, IPPROTO_SCTP) at loopback (127.0.0.1). The
> server side sends 6 3-byte messages to client side and optionally
> designates writing shutdown. Client receives all them and measures
> a time before each receiving.
> Code is showed at the end of this message.
> Tested systems are:
> * FreeBSD 9.2-release/amd64
> * FreeBSD 9.1-release/amd64
> * FreeBSD 9.1-release/i386
> * Linux OpenSuSE 12.2, kernel 3.4.63-2.44-default, x86_64
> * Linux RHEL 6.3, kernel 2.6.32-279.22.1.38.0.el6.x86_64
> 
> The first discrepancy found is specific for FreeBSD on amd64 and not
> for i386 version; it's that connection setup lasts 2-4 seconds (!!)
>  Tcpdump shows indication that could be parsed as message miss:
Hi Valentin,

could you send me the .pcap file instead of the tcpdump output.
I would like to see the addresses listed in the INIT and INIT-ACK.

You can send that file to tuexen at freebsd.org.
> 
> tcpdump: listening on lo0, link-type NULL (BSD loopback), capture size 65535 byt
> es
> 08:18:34.639422 IP (tos 0x0, ttl 64, id 65094, offset 0, flags [none], proto SCT
> P (132), length 188, bad cksum 0 (->f274)!)
>    10.0.0.2.50025 > 127.0.0.1.2500: sctp
I'm wondering why 10.0.0.2 is the source address and not 127.0.0.1
>        1) [INIT] [init tag: 3943463987] [rwnd: 1864135] [OS: 10] [MIS: 2048] [i
> nit TSN: 3475830004]
> 08:18:34.639450 IP (tos 0x0, ttl 64, id 42621, offset 0, flags [none], proto SCT
> P (132), length 524, bad cksum 0 (->48ee)!)
>    127.0.0.1.2500 > 10.0.0.2.50025: sctp
>        1) [INIT ACK] [init tag: 59811639] [rwnd: 1864135] [OS: 10] [MIS: 2048]
> [init TSN: 466863335]
> 08:18:34.639467 IP (tos 0x0, ttl 64, id 52783, offset 0, flags [none], proto SCT
> P (132), length 424, bad cksum 0 (->21a0)!)
>    10.0.0.2.50025 > 127.0.0.1.2500: sctp
>        1) [COOKIE ECHO]
> 08:18:35.639618 IP (tos 0x0, ttl 64, id 12109, offset 0, flags [DF], proto SCTP
> (132), length 424, bad cksum 0 (->8082)!)
>    10.0.0.2.50025 > 127.0.0.1.2500: sctp
>        1) [COOKIE ECHO]
> 08:18:36.692628 IP (tos 0x0, ttl 64, id 48682, offset 0, flags [DF], proto SCTP
> (132), length 76, bad cksum 0 (->7e01)!)
>    127.0.0.1.2500 > 127.0.0.1.50025: sctp
The retransmission goes from 127.0.0.1. Hmm. Not sure why.
>        1) [HB REQ]
> 08:18:36.692668 IP (tos 0x0, ttl 64, id 10809, offset 0, flags [DF], proto SCTP (132), length 76, bad cksum 0 (->86f2)!)
>    10.0.0.2.50025 > 127.0.0.1.2500: sctp
>        1) [HB ACK] 
> 08:18:36.692707 IP (tos 0x2,ECT(0), ttl 64, id 16588, offset 0, flags [DF], proto SCTP (132), length 52, bad cksum 0 (->fb75)!)
>    127.0.0.1.2500 > 127.0.0.1.50025: sctp
>        1) [DATA] (B)(E) [TSN: 466863335] [SID: 0] [SSEQ 0] [PPID 0x0] [Payload:
>        0x0000:  6162 63                                  abc
> [...]
> 
> At 08:18:34.639467, cookie echo was sent but likely ignored. One
> second later it was resent. Then, yet another strange timeout was
> invented before HB REQ.
> 
> Test series show this can spend more than 4 seconds, average value
> is about 3 seconds. Two 20-times run summary times are 58 to 63
> seconds, so, I've got 2.9...3.15 average connect time.
> 
> Neither Linux nor 32-bit FreeBSD shows this.
FreeBSD should neither... Do you see this on FreeBSD 9.2 amd64?
> 
> The second discrepancy is well known case of so-called "Nagle"
> algorithm adapted for SCTP but details are confusing. If
> SCTP_NODELAY isn't turned on on server side, tcpdump shows that the
> second packet is sent from sender side without delay, but receiver's
> SACK is delayed for 200 ms by default. These results are identical for
> FreeBSD (32 bit) and Linux, but not amd64 FreeBSD (see below). But
> why? A common sense suggests that, if client receives all immediately,
> and server has already prepared its data, no additional delay shall be
> invented. In analogue to TCP, I would expect that, until acknoledge
> for "abc" is got, "def" isn't sent, but then the latter is sent
> immediately.
> 
> 09:28:11.374335 IP (tos 0x2,ECT(0), ttl 64, id 24204, offset 0, flags [DF], prot
> o SCTP (132), length 52, bad cksum 0 (->ddb5)!)
>    127.0.0.1.2500 > 127.0.0.1.41007: sctp
>        1) [DATA] (B)(E) [TSN: 183313025] [SID: 0] [SSEQ 0] [PPID 0x0] [Payload:
>        0x0000:  6162 63                                  abc
> 09:28:11.374349 IP (tos 0x0, ttl 64, id 522, offset 0, flags [none], proto SCTP 
> (132), length 48, bad cksum 0 (->7a3e)!)
>    127.0.0.1.41007 > 127.0.0.1.2500: sctp
>        1) [SACK] [cum ack 183313025] [a_rwnd 1863876] [#gap acks 0] [#dup tsns 
> 0] 
> 09:28:11.374368 IP (tos 0x2,ECT(0), ttl 64, id 64629, offset 0, flags [DF], prot
> o SCTP (132), length 52, bad cksum 0 (->3fcc)!)
>    127.0.0.1.2500 > 127.0.0.1.41007: sctp
>        1) [DATA] (B)(E) [TSN: 183313026] [SID: 0] [SSEQ 1] [PPID 0x0] [Payload:
>        0x0000:  6465 66                                  def
> 09:28:11.573780 IP (tos 0x0, ttl 64, id 12179, offset 0, flags [none], proto SCT
> P (132), length 48, bad cksum 0 (->4cb5)!)
>    127.0.0.1.41007 > 127.0.0.1.2500: sctp
>        1) [SACK] [cum ack 183313026] [a_rwnd 1864135] [#gap acks 0] [#dup tsns 
> 0] 
> 
Please note, that the first SACK is returned without the 200ms delay. This is
required by the RFC and the above trace seems to show that.
> But, if server shuts its writing side down ("s" in argv[]), this
> laziness disappears. Again, the logic is too opaque and confusing.
What do you mean by this?
> 
> 64-bit (amd64) FreeBSD shows another behavior (both 9.1 and 9.2): in
> addition to setup delay (see above), the delay between 2nd and 3rd
> received packet (case SCTP_NODELAY isn't activated) could be longer
> than minimally needed one and spreads between a few hundreds of
> microseconds up to full 0.2 second delay shown on other platforms.
> In average, 1/8 of runs show this delay:
> 
> $ fgrep ghi ll | sort -rn -k2,2 -t= | uniq -c
>   1 got: ghi (with MSG_EOR) tdiff=200835
>   1 got: ghi (with MSG_EOR) tdiff=200829
>   1 got: ghi (with MSG_EOR) tdiff=200826
>   1 got: ghi (with MSG_EOR) tdiff=200822
>   1 got: ghi (with MSG_EOR) tdiff=200819
>   1 got: ghi (with MSG_EOR) tdiff=200800
>   1 got: ghi (with MSG_EOR) tdiff=200792
>   1 got: ghi (with MSG_EOR) tdiff=199885
>   1 got: ghi (with MSG_EOR) tdiff=163816
>   1 got: ghi (with MSG_EOR) tdiff=55849
>   1 got: ghi (with MSG_EOR) tdiff=1825
>  21 got: ghi (with MSG_EOR) tdiff=2
>  38 got: ghi (with MSG_EOR) tdiff=1
> 
> It's definitely better than delay each run, as on other platforms
> (but the initial delay annoys roughly).
Without SCTP_NODELAY bundling can happen or not, it depends on timing.
It would be great, if you can provide a .pcap file for a transfer you
think shows some buggy behaviour. Then we can figure out what is going on.
> 
> The testing code:
> ===
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <sys/poll.h>
> #include <sys/time.h>
> #include <netinet/in.h>
> #include <netinet/sctp.h>
> #include <arpa/inet.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <string.h>
> #include <err.h>
> 
> #define PORT 2500
> 
> int main(int argc, char *argv[])
> {
>  int s_li, s_ac, s_cl;
>  struct sockaddr_in sia;
>  struct iovec iov[1];
>  struct msghdr msg;
>  socklen_t slen;
>  struct timeval tv0, tv1;
>  int tdiff;
>  int i;
> 
>  s_li = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
>  if (s_li < 0)
>    err(1, "socket");
>  memset(&sia, 0, sizeof(sia));
>  sia.sin_family = AF_INET;
>  sia.sin_addr.s_addr = htonl(0x7F000001);
>  sia.sin_port = htons(PORT);
>  if (bind(s_li, (struct sockaddr*)&sia, sizeof(sia)) < 0)
>    err(1, "bind");
>  if (listen(s_li, 1) < 0)
>    err(1, "listen");
>  s_cl = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
>  if (s_cl < 0)
>    err(1, "socket");
>  if (connect(s_cl, (struct sockaddr*)&sia, sizeof(sia)) < 0)
>    err(1, "connect");
>  slen = sizeof(sia);
>  s_ac = accept(s_li, (struct sockaddr*) &sia, &slen);
>  if (s_ac < 0)
>    err(1, "accept");
>  for (i = 1; i < argc; ++i) {
>    if (!strcmp(argv[i], "nn")) {
>      const int one = 1;
>      if (setsockopt(s_ac, IPPROTO_SCTP, SCTP_NODELAY, &one, sizeof(one)) < 0)
>        warn("setsockopt(SCTP_NODELAY)");
>    }
>  }
>  if (send(s_ac, "abc", 3, 0) != 3)
>    err(1, "send");
>  if (send(s_ac, "def", 3, MSG_EOR) != 3)
MSG_EOR is nothing you provide at a send() call. The flag is only
returned by the recvmsg() call.
>    err(1, "send");
>  if (send(s_ac, "ghi", 3, 0) != 3)
>    err(1, "send");
>  if (send(s_ac, "jkl", 3, MSG_EOR) != 3)
>    err(1, "send");
>  if (send(s_ac, "mno", 3, 0) != 3)
>    err(1, "send");
>  if (send(s_ac, "pqr", 3, MSG_EOR) != 3)
>    err(1, "send");
>  for (i = 1; i < argc; ++i) {
>    if (!strcmp(argv[i], "s"))
>      shutdown(s_ac, SHUT_WR);
>  }
>  for(;;) {
>    char buf[1024];
>    memset(&msg, 0, sizeof(msg));
>    iov[0].iov_base = buf; iov[0].iov_len = sizeof(buf) - 1;
>    msg.msg_iov = iov; msg.msg_iovlen = 1;
>    gettimeofday(&tv0, NULL);
>    ssize_t got = recvmsg(s_cl, &msg, 0);
>    gettimeofday(&tv1, NULL);
>    tdiff = (int)tv1.tv_usec - (int)tv0.tv_usec;
>    if (tdiff < 0)
>      tdiff += 1000000;
>    if (got == 0)
>      break;
>    if (got == -1) {
>      perror("recvmsg");
>      break;
>    }
>    buf[got] = 0;
>    printf("got: %s (%s MSG_EOR) tdiff=%d\n",
>        buf,
>        (msg.msg_flags & MSG_EOR) ? "with" : "without",
>        tdiff);
>    if (!strncmp(buf, "pqr", 3))
>      break;
>  }
>  return 0;
> }
OK. Here is what I would expect on the wire:

Without SCTP_NODELAY:

> INIT
< INIT_ACK
> COOKIE_ECHO
< COOKIE_ACK
< DATA(abc)
> SACK
< DATA(def);DATA(ghi);DATA(jkl);DATA(mno);DATA(pqr)
> SACK
> SHUTDOWN
< SHUTDOWN_ACK
> SHUTDOWN_COMPLETE

There should be no substantial delay between any messages above.

With SCTP_NODELAY
> INIT
< INIT_ACK
> COOKIE_ECHO
< COOKIE_ACK
< DATA(abc)
< DATA(def)
< DATA(ghi)
< DATA(mno)
< DATA(pqr)
> SHUTDOWN
< SHUTDOWN_ACK
> SHUTDOWN_COMPLETE

There will be three SACK somewhere between the DATA chunks depending on
the timing.

There should be no substantial delay between any messages above.

I think if you see anything else, there is a bug. So do you see a different
behavior on FreeBSD 9.2 (i386/amd64)? If yes, can you provide a .pcap file?

Here is what I see on a 9.2 amd64 system:

tuexen at bsd9:~ % uname -a
FreeBSD bsd9.fh-muenster.de 9.2-RELEASE FreeBSD 9.2-RELEASE #0 r255898: Thu Sep 26 22:50:31 UTC 2013     root at bake.isc.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
tuexen at bsd9:~ % ./valentin 
got: abc (with MSG_EOR) tdiff=3
got: def (with MSG_EOR) tdiff=1
got: ghi (with MSG_EOR) tdiff=1
got: jkl (with MSG_EOR) tdiff=1
got: mno (with MSG_EOR) tdiff=1
got: pqr (with MSG_EOR) tdiff=0
tuexen at bsd9:~ % ./valentin nn
got: abc (with MSG_EOR) tdiff=4
got: def (with MSG_EOR) tdiff=2
got: ghi (with MSG_EOR) tdiff=1
got: jkl (with MSG_EOR) tdiff=1
got: mno (with MSG_EOR) tdiff=1
got: pqr (with MSG_EOR) tdiff=1

Do you have any special routing setup?

Best regards
Michael
> // vim:ts=2:sts=2:sw=2:et:si:
> ===
> 
> 
> -netch-
> _______________________________________________
> freebsd-net at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe at freebsd.org"
>