SCTP huge connect delays (at amd64) and yet another question

Michael Tuexen Michael.Tuexen at lurchi.franken.de
Thu Dec 5 10:57:45 UTC 2013


More thinking and testing.

Without SCTP_NODELAY the following can also happen:

  > INIT
  < INIT-ACK
  < COOKIE-ECHO
  > COOKIE-ACK
  < DATA(abc)
  > SACK
  < DATA(def) possibly more...
200 ms delay
  > SACK
  < all remaining DATA chunks
  > SHUTDOWN
  < SHUTDOWN-ACK
  > SHUTDOWN-COMPLETE

Timing comes into the game. The question is if all send() calls have been completed
before the first SACK is received. Not sure this depends in i386 vs. amd64, but
timing is important. On a Raspberry Pi I saw in a reproducable way

  > INIT
  < INIT-ACK
  < COOKIE-ECHO
  > COOKIE-ACK
  < DATA(abc)
  > SACK
  < DATA(def)
200 ms delay
  > SACK
  < DATA(ghi);DATA(jkl);DATA(mno);DATA(pqr);
  > SHUTDOWN
  < SHUTDOWN-ACK
  > SHUTDOWN-COMPLETE

Best regards
Michael
On Dec 5, 2013, at 11:32 AM, Michael Tuexen <Michael.Tuexen at lurchi.franken.de> wrote:

> On Dec 5, 2013, at 9:41 AM, Valentin Nechayev <netch at netch.kiev.ua> wrote:
> 
>> Hi,
>> 
>> I've got some test results which are surprising and I would get
>> a clarification.
>> 
>> A simple connection is created between two one-to-one SCTP sockets
>> (AF_INET, SOCK_STREAM, IPPROTO_SCTP) at loopback (127.0.0.1). The
>> server side sends 6 3-byte messages to client side and optionally
>> designates writing shutdown. Client receives all them and measures
>> a time before each receiving.
>> Code is showed at the end of this message.
>> Tested systems are:
>> * FreeBSD 9.2-release/amd64
>> * FreeBSD 9.1-release/amd64
>> * FreeBSD 9.1-release/i386
>> * Linux OpenSuSE 12.2, kernel 3.4.63-2.44-default, x86_64
>> * Linux RHEL 6.3, kernel 2.6.32-279.22.1.38.0.el6.x86_64
>> 
>> The first discrepancy found is specific for FreeBSD on amd64 and not
>> for i386 version; it's that connection setup lasts 2-4 seconds (!!)
>> Tcpdump shows indication that could be parsed as message miss:
> Hi Valentin,
> 
> could you send me the .pcap file instead of the tcpdump output.
> I would like to see the addresses listed in the INIT and INIT-ACK.
> 
> You can send that file to tuexen at freebsd.org.
>> 
>> tcpdump: listening on lo0, link-type NULL (BSD loopback), capture size 65535 byt
>> es
>> 08:18:34.639422 IP (tos 0x0, ttl 64, id 65094, offset 0, flags [none], proto SCT
>> P (132), length 188, bad cksum 0 (->f274)!)
>>   10.0.0.2.50025 > 127.0.0.1.2500: sctp
> I'm wondering why 10.0.0.2 is the source address and not 127.0.0.1
>>       1) [INIT] [init tag: 3943463987] [rwnd: 1864135] [OS: 10] [MIS: 2048] [i
>> nit TSN: 3475830004]
>> 08:18:34.639450 IP (tos 0x0, ttl 64, id 42621, offset 0, flags [none], proto SCT
>> P (132), length 524, bad cksum 0 (->48ee)!)
>>   127.0.0.1.2500 > 10.0.0.2.50025: sctp
>>       1) [INIT ACK] [init tag: 59811639] [rwnd: 1864135] [OS: 10] [MIS: 2048]
>> [init TSN: 466863335]
>> 08:18:34.639467 IP (tos 0x0, ttl 64, id 52783, offset 0, flags [none], proto SCT
>> P (132), length 424, bad cksum 0 (->21a0)!)
>>   10.0.0.2.50025 > 127.0.0.1.2500: sctp
>>       1) [COOKIE ECHO]
>> 08:18:35.639618 IP (tos 0x0, ttl 64, id 12109, offset 0, flags [DF], proto SCTP
>> (132), length 424, bad cksum 0 (->8082)!)
>>   10.0.0.2.50025 > 127.0.0.1.2500: sctp
>>       1) [COOKIE ECHO]
>> 08:18:36.692628 IP (tos 0x0, ttl 64, id 48682, offset 0, flags [DF], proto SCTP
>> (132), length 76, bad cksum 0 (->7e01)!)
>>   127.0.0.1.2500 > 127.0.0.1.50025: sctp
> The retransmission goes from 127.0.0.1. Hmm. Not sure why.
>>       1) [HB REQ]
>> 08:18:36.692668 IP (tos 0x0, ttl 64, id 10809, offset 0, flags [DF], proto SCTP (132), length 76, bad cksum 0 (->86f2)!)
>>   10.0.0.2.50025 > 127.0.0.1.2500: sctp
>>       1) [HB ACK] 
>> 08:18:36.692707 IP (tos 0x2,ECT(0), ttl 64, id 16588, offset 0, flags [DF], proto SCTP (132), length 52, bad cksum 0 (->fb75)!)
>>   127.0.0.1.2500 > 127.0.0.1.50025: sctp
>>       1) [DATA] (B)(E) [TSN: 466863335] [SID: 0] [SSEQ 0] [PPID 0x0] [Payload:
>>       0x0000:  6162 63                                  abc
>> [...]
>> 
>> At 08:18:34.639467, cookie echo was sent but likely ignored. One
>> second later it was resent. Then, yet another strange timeout was
>> invented before HB REQ.
>> 
>> Test series show this can spend more than 4 seconds, average value
>> is about 3 seconds. Two 20-times run summary times are 58 to 63
>> seconds, so, I've got 2.9...3.15 average connect time.
>> 
>> Neither Linux nor 32-bit FreeBSD shows this.
> FreeBSD should neither... Do you see this on FreeBSD 9.2 amd64?
>> 
>> The second discrepancy is well known case of so-called "Nagle"
>> algorithm adapted for SCTP but details are confusing. If
>> SCTP_NODELAY isn't turned on on server side, tcpdump shows that the
>> second packet is sent from sender side without delay, but receiver's
>> SACK is delayed for 200 ms by default. These results are identical for
>> FreeBSD (32 bit) and Linux, but not amd64 FreeBSD (see below). But
>> why? A common sense suggests that, if client receives all immediately,
>> and server has already prepared its data, no additional delay shall be
>> invented. In analogue to TCP, I would expect that, until acknoledge
>> for "abc" is got, "def" isn't sent, but then the latter is sent
>> immediately.
>> 
>> 09:28:11.374335 IP (tos 0x2,ECT(0), ttl 64, id 24204, offset 0, flags [DF], prot
>> o SCTP (132), length 52, bad cksum 0 (->ddb5)!)
>>   127.0.0.1.2500 > 127.0.0.1.41007: sctp
>>       1) [DATA] (B)(E) [TSN: 183313025] [SID: 0] [SSEQ 0] [PPID 0x0] [Payload:
>>       0x0000:  6162 63                                  abc
>> 09:28:11.374349 IP (tos 0x0, ttl 64, id 522, offset 0, flags [none], proto SCTP 
>> (132), length 48, bad cksum 0 (->7a3e)!)
>>   127.0.0.1.41007 > 127.0.0.1.2500: sctp
>>       1) [SACK] [cum ack 183313025] [a_rwnd 1863876] [#gap acks 0] [#dup tsns 
>> 0] 
>> 09:28:11.374368 IP (tos 0x2,ECT(0), ttl 64, id 64629, offset 0, flags [DF], prot
>> o SCTP (132), length 52, bad cksum 0 (->3fcc)!)
>>   127.0.0.1.2500 > 127.0.0.1.41007: sctp
>>       1) [DATA] (B)(E) [TSN: 183313026] [SID: 0] [SSEQ 1] [PPID 0x0] [Payload:
>>       0x0000:  6465 66                                  def
>> 09:28:11.573780 IP (tos 0x0, ttl 64, id 12179, offset 0, flags [none], proto SCT
>> P (132), length 48, bad cksum 0 (->4cb5)!)
>>   127.0.0.1.41007 > 127.0.0.1.2500: sctp
>>       1) [SACK] [cum ack 183313026] [a_rwnd 1864135] [#gap acks 0] [#dup tsns 
>> 0] 
>> 
> Please note, that the first SACK is returned without the 200ms delay. This is
> required by the RFC and the above trace seems to show that.
>> But, if server shuts its writing side down ("s" in argv[]), this
>> laziness disappears. Again, the logic is too opaque and confusing.
> What do you mean by this?
>> 
>> 64-bit (amd64) FreeBSD shows another behavior (both 9.1 and 9.2): in
>> addition to setup delay (see above), the delay between 2nd and 3rd
>> received packet (case SCTP_NODELAY isn't activated) could be longer
>> than minimally needed one and spreads between a few hundreds of
>> microseconds up to full 0.2 second delay shown on other platforms.
>> In average, 1/8 of runs show this delay:
>> 
>> $ fgrep ghi ll | sort -rn -k2,2 -t= | uniq -c
>>  1 got: ghi (with MSG_EOR) tdiff=200835
>>  1 got: ghi (with MSG_EOR) tdiff=200829
>>  1 got: ghi (with MSG_EOR) tdiff=200826
>>  1 got: ghi (with MSG_EOR) tdiff=200822
>>  1 got: ghi (with MSG_EOR) tdiff=200819
>>  1 got: ghi (with MSG_EOR) tdiff=200800
>>  1 got: ghi (with MSG_EOR) tdiff=200792
>>  1 got: ghi (with MSG_EOR) tdiff=199885
>>  1 got: ghi (with MSG_EOR) tdiff=163816
>>  1 got: ghi (with MSG_EOR) tdiff=55849
>>  1 got: ghi (with MSG_EOR) tdiff=1825
>> 21 got: ghi (with MSG_EOR) tdiff=2
>> 38 got: ghi (with MSG_EOR) tdiff=1
>> 
>> It's definitely better than delay each run, as on other platforms
>> (but the initial delay annoys roughly).
> Without SCTP_NODELAY bundling can happen or not, it depends on timing.
> It would be great, if you can provide a .pcap file for a transfer you
> think shows some buggy behaviour. Then we can figure out what is going on.
>> 
>> The testing code:
>> ===
>> #include <sys/types.h>
>> #include <sys/socket.h>
>> #include <sys/poll.h>
>> #include <sys/time.h>
>> #include <netinet/in.h>
>> #include <netinet/sctp.h>
>> #include <arpa/inet.h>
>> #include <fcntl.h>
>> #include <unistd.h>
>> #include <stdio.h>
>> #include <string.h>
>> #include <err.h>
>> 
>> #define PORT 2500
>> 
>> int main(int argc, char *argv[])
>> {
>> int s_li, s_ac, s_cl;
>> struct sockaddr_in sia;
>> struct iovec iov[1];
>> struct msghdr msg;
>> socklen_t slen;
>> struct timeval tv0, tv1;
>> int tdiff;
>> int i;
>> 
>> s_li = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
>> if (s_li < 0)
>>   err(1, "socket");
>> memset(&sia, 0, sizeof(sia));
>> sia.sin_family = AF_INET;
>> sia.sin_addr.s_addr = htonl(0x7F000001);
>> sia.sin_port = htons(PORT);
>> if (bind(s_li, (struct sockaddr*)&sia, sizeof(sia)) < 0)
>>   err(1, "bind");
>> if (listen(s_li, 1) < 0)
>>   err(1, "listen");
>> s_cl = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
>> if (s_cl < 0)
>>   err(1, "socket");
>> if (connect(s_cl, (struct sockaddr*)&sia, sizeof(sia)) < 0)
>>   err(1, "connect");
>> slen = sizeof(sia);
>> s_ac = accept(s_li, (struct sockaddr*) &sia, &slen);
>> if (s_ac < 0)
>>   err(1, "accept");
>> for (i = 1; i < argc; ++i) {
>>   if (!strcmp(argv[i], "nn")) {
>>     const int one = 1;
>>     if (setsockopt(s_ac, IPPROTO_SCTP, SCTP_NODELAY, &one, sizeof(one)) < 0)
>>       warn("setsockopt(SCTP_NODELAY)");
>>   }
>> }
>> if (send(s_ac, "abc", 3, 0) != 3)
>>   err(1, "send");
>> if (send(s_ac, "def", 3, MSG_EOR) != 3)
> MSG_EOR is nothing you provide at a send() call. The flag is only
> returned by the recvmsg() call.
>>   err(1, "send");
>> if (send(s_ac, "ghi", 3, 0) != 3)
>>   err(1, "send");
>> if (send(s_ac, "jkl", 3, MSG_EOR) != 3)
>>   err(1, "send");
>> if (send(s_ac, "mno", 3, 0) != 3)
>>   err(1, "send");
>> if (send(s_ac, "pqr", 3, MSG_EOR) != 3)
>>   err(1, "send");
>> for (i = 1; i < argc; ++i) {
>>   if (!strcmp(argv[i], "s"))
>>     shutdown(s_ac, SHUT_WR);
>> }
>> for(;;) {
>>   char buf[1024];
>>   memset(&msg, 0, sizeof(msg));
>>   iov[0].iov_base = buf; iov[0].iov_len = sizeof(buf) - 1;
>>   msg.msg_iov = iov; msg.msg_iovlen = 1;
>>   gettimeofday(&tv0, NULL);
>>   ssize_t got = recvmsg(s_cl, &msg, 0);
>>   gettimeofday(&tv1, NULL);
>>   tdiff = (int)tv1.tv_usec - (int)tv0.tv_usec;
>>   if (tdiff < 0)
>>     tdiff += 1000000;
>>   if (got == 0)
>>     break;
>>   if (got == -1) {
>>     perror("recvmsg");
>>     break;
>>   }
>>   buf[got] = 0;
>>   printf("got: %s (%s MSG_EOR) tdiff=%d\n",
>>       buf,
>>       (msg.msg_flags & MSG_EOR) ? "with" : "without",
>>       tdiff);
>>   if (!strncmp(buf, "pqr", 3))
>>     break;
>> }
>> return 0;
>> }
> OK. Here is what I would expect on the wire:
> 
> Without SCTP_NODELAY:
> 
>> INIT
> < INIT_ACK
>> COOKIE_ECHO
> < COOKIE_ACK
> < DATA(abc)
>> SACK
> < DATA(def);DATA(ghi);DATA(jkl);DATA(mno);DATA(pqr)
>> SACK
>> SHUTDOWN
> < SHUTDOWN_ACK
>> SHUTDOWN_COMPLETE
> 
> There should be no substantial delay between any messages above.
> 
> With SCTP_NODELAY
>> INIT
> < INIT_ACK
>> COOKIE_ECHO
> < COOKIE_ACK
> < DATA(abc)
> < DATA(def)
> < DATA(ghi)
> < DATA(mno)
> < DATA(pqr)
>> SHUTDOWN
> < SHUTDOWN_ACK
>> SHUTDOWN_COMPLETE
> 
> There will be three SACK somewhere between the DATA chunks depending on
> the timing.
> 
> There should be no substantial delay between any messages above.
> 
> I think if you see anything else, there is a bug. So do you see a different
> behavior on FreeBSD 9.2 (i386/amd64)? If yes, can you provide a .pcap file?
> 
> 
> Here is what I see on a 9.2 amd64 system:
> 
> tuexen at bsd9:~ % uname -a
> FreeBSD bsd9.fh-muenster.de 9.2-RELEASE FreeBSD 9.2-RELEASE #0 r255898: Thu Sep 26 22:50:31 UTC 2013     root at bake.isc.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
> tuexen at bsd9:~ % ./valentin 
> got: abc (with MSG_EOR) tdiff=3
> got: def (with MSG_EOR) tdiff=1
> got: ghi (with MSG_EOR) tdiff=1
> got: jkl (with MSG_EOR) tdiff=1
> got: mno (with MSG_EOR) tdiff=1
> got: pqr (with MSG_EOR) tdiff=0
> tuexen at bsd9:~ % ./valentin nn
> got: abc (with MSG_EOR) tdiff=4
> got: def (with MSG_EOR) tdiff=2
> got: ghi (with MSG_EOR) tdiff=1
> got: jkl (with MSG_EOR) tdiff=1
> got: mno (with MSG_EOR) tdiff=1
> got: pqr (with MSG_EOR) tdiff=1
> 
> Do you have any special routing setup?
> 
> Best regards
> Michael
>> // vim:ts=2:sts=2:sw=2:et:si:
>> ===
>> 
>> 
>> -netch-
>> _______________________________________________
>> freebsd-net at freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to "freebsd-net-unsubscribe at freebsd.org"
>> 
> 
> _______________________________________________
> freebsd-net at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe at freebsd.org"
> 



More information about the freebsd-net mailing list