DCTCP for FreeBSD

Wed Feb 19 09:18:14 UTC 2014

Hi,

Midori Kato has implemented Microsoft's/Stanford's Datacenter TCP (DCTCP) for FreeBSD as part of her MS thesis with me. Find a patch attached.

Also note that we're documenting a specification for DCTCP in an IETF draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp

Microsoft has made a licensing statement (RAND-Z) on the technology to the IETF: https://datatracker.ietf.org/ipr/2319/ (I'm not sure what this means for an eventual inclusion in FreeBSD.)

Roughly, Midori's patch consists of an extension of the modular congestion control framework to expose ECN information to the modules, a module to implement DCTCP, and a few experimental variants. See Midori's explanation:

> [1] A change for the modular congestion control framework (See Section 4.1 if needed)
> DCTCP processes ECN differently from RFC3168. We need three functions to do the following ECN processing:
>  a) The kernel decides whether the ECE flag should be set in the next outgoing TCP segment by snooping reserved bits in the IP and TCP headers. (tcp_input.c)
>  b) The kernel reacts to congestion if the ECE flag is set in an arriving TCP segment. (tcp_input.c)
>  c) After the outgoing TCP segment is generated, the kernel decides whether the ECT bit should be set in the ECN field of the IP header of the outgoing packet. (tcp_output.c)
> The current framework has no housekeeping functions for (a) and (b). Therefore, I add two functions to the modular cc framework: ecnpkt_handler() and ect_handler().
> 
> - ecnpkt_handler() allows the kernel to do the additional ECN processing by snooping the ECN field in the IP and TCP headers. As an option, this function takes a flag that tells whether it is being called in the delayed ACK path. This function returns an integer value. When the return value is set, the kernel is forced to disable delayed ACK.
> - ect_handler() allows the kernel to use a different rule from RFC3168 for ECT marking of the outgoing segment. This function returns an integer value. If the value is set, the ECT bit is set in the outgoing segment.
> 
> 
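To make the shape of the two hooks concrete, here is a minimal userland sketch of how they could be wired up. The struct, the argument lists, and the helper names are illustrative assumptions based only on the description above, not the definitions in the patch; the `always_ect` module is hypothetical.

```c
#include <stddef.h>

/* Hypothetical, simplified view of the two hooks; not the patch's real types. */
struct cc_hooks {
	/* Returns nonzero when the kernel should not delay the ACK. */
	int (*ecnpkt_handler)(void *state, int ip_ce, int tcp_cwr, int in_delack);
	/* Returns nonzero when the outgoing segment should carry ECT. */
	int (*ect_handler)(void *state);
};

/*
 * Output path: ask the module if it provides ect_handler(), otherwise
 * fall back to whatever the RFC3168 rule decided for this segment.
 */
int
should_set_ect(const struct cc_hooks *h, void *state, int rfc3168_says_ect)
{
	if (h != NULL && h->ect_handler != NULL)
		return (h->ect_handler(state) != 0);
	return (rfc3168_says_ect != 0);
}

/* Hypothetical module policy: mark every outgoing segment as ECT. */
int
always_ect(void *state)
{
	(void)state;
	return (1);
}
```

A module that leaves ect_handler unset keeps the RFC3168 behavior, which is the point of making the hook optional in this sketch.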
> [2] Five changes from the original DCTCP algorithm
> To reflect the motivation behind DCTCP, I modified the following processing. The first four modifications are for senders and the last one is for receivers.
> 
> (1) no congestion recovery on receipt of ECE flags (See section 4.2.1 if needed)
> FreeBSD treats ECN as a congestion event, but this does not hold for DCTCP senders. A DCTCP sender uses ECN as a means to estimate the extent of congestion. Therefore, I remove congestion recovery mode for DCTCP senders in every situation.
> 
> (2) selectable initial alpha value (See section 4.2.2 if needed)
> DCTCP defines alpha as a parameter that estimates the depth of congestion. When the alpha value is large, a DCTCP sender shows saw-toothed CWND behavior.
> A problem is that the alpha value is not reliable during the first dozen or so RTTs because there is no way to know the depth of congestion in the network from the beginning. Considering the reliability of alpha, I think users should be able to select the initial alpha per application. When a user chooses DCTCP for latency-sensitive applications, the initial alpha is preferred. Otherwise, my experimental results suggest that DCTCP senders had better set the initial alpha value to zero (See section 7.2 of the attached file).
> The default alpha value is set to zero in my implementation.
> 
> (3) alpha value initialization after an idle period (See section 4.2.3 if needed)
> The length of an idle period is not predictable. Therefore, it is not a good idea for a DCTCP sender to use an outdated alpha after an idle period. A DCTCP sender resets alpha to the initial value after an idle period.
> 
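For readers unfamiliar with alpha, the following sketch shows the estimator and window reduction from the DCTCP algorithm (alpha = (1 - g) * alpha + g * F, cwnd = cwnd * (1 - alpha / 2)), together with the selectable initial value and the reset after idle described in (2) and (3). The fixed-point scale, struct layout, and function names are assumptions made for illustration and do not come from the patch; g = 1/16 is used as the gain.

```c
#include <stdint.h>

/* Hypothetical fixed-point scale: 0..1024 stands for alpha in 0.0..1.0. */
#define ALPHA_MAX	1024u
#define G_SHIFT		4u	/* gain g = 1/16 */

struct dctcp_state {
	uint32_t alpha;		/* current congestion estimate */
	uint32_t alpha_init;	/* value used at start and after idle */
};

/* Once per window: alpha = (1 - g) * alpha + g * F, with F = marked / total. */
void
dctcp_update_alpha(struct dctcp_state *s, uint32_t bytes_marked,
    uint32_t bytes_total)
{
	uint32_t f = 0;

	if (bytes_total > 0)
		f = (uint32_t)(((uint64_t)bytes_marked * ALPHA_MAX) /
		    bytes_total);
	s->alpha = s->alpha - (s->alpha >> G_SHIFT) + (f >> G_SHIFT);
	if (s->alpha > ALPHA_MAX)
		s->alpha = ALPHA_MAX;
}

/* For a window that saw ECE: cwnd = cwnd * (1 - alpha / 2). */
uint32_t
dctcp_reduce_cwnd(const struct dctcp_state *s, uint32_t cwnd)
{
	return ((uint32_t)(cwnd -
	    ((uint64_t)cwnd * s->alpha) / (2 * ALPHA_MAX)));
}

/* After an idle period the stale estimate is discarded. */
void
dctcp_after_idle(struct dctcp_state *s)
{
	s->alpha = s->alpha_init;
}
```

With alpha at its maximum the reduction is one half, as in standard TCP; with alpha starting at zero, early marked windows shrink CWND only slightly until the estimate builds up.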
> The following changes are applied to eliminate a compatibility issue with standard ECN as defined in RFC3168. DCTCP and standard ECN servers have no way to identify which mechanism is working on the peer. Thus, we need to eliminate the worst situation in a network mixing DCTCP senders/receivers and standard ECN senders/receivers.
> (4) using the CWR flag when the ECE flag is seen for an RTT (See section 5.1 if needed)
> This change applies to the situation where a sender uses DCTCP and a receiver uses standard ECN.
> In this situation, I find that a DCTCP sender minimizes CWND. The detailed technical reason is described in section 4.2 of my attached file. Fortunately, the current tcp_input() function already covers this change, so my patch makes no modification for it.
> 
> (5) enabling delayed ACK on receipt of the CWR flag (See section 5.2 if needed)
> This change applies to the situation where a sender uses standard ECN and a receiver uses DCTCP. In this situation, I find that, without this change, a standard ECN sender increases CWND less than expected. The detailed technical reason is described in section 5.2 of my attached file.
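As a rough illustration of the receiver side, the sketch below combines the DCTCP receiver rule (ACK immediately when the CE state changes between segments) with change (5), where a segment carrying CWR leaves delayed ACK enabled. It follows the ecnpkt_handler() description above in returning nonzero to force an immediate ACK, but the struct, the argument list, and the function name are assumptions for illustration, not code from the patch.

```c
/* Hypothetical per-connection receiver state. */
struct dctcp_rcv {
	int ce_prev;	/* CE codepoint was set on the previous segment */
	int send_ece;	/* set ECE on the next outgoing ACK */
};

/*
 * Returns nonzero when the kernel should skip the delayed ACK and
 * acknowledge this segment immediately.
 */
int
dctcp_ecnpkt_sketch(struct dctcp_rcv *r, int ip_ce, int tcp_cwr,
    int in_delack)
{
	int ce = (ip_ce != 0);
	int ack_now = 0;

	/* A change in CE state ends the current run of delayed ACKs. */
	if (in_delack && ce != r->ce_prev)
		ack_now = 1;
	r->ce_prev = ce;
	r->send_ece = ce;

	/* Change (5): a segment carrying CWR keeps delayed ACK enabled. */
	if (in_delack && tcp_cwr)
		ack_now = 0;

	return (ack_now);
}
```

In this sketch the CWR check comes last so that it overrides the CE-change rule, which is one plausible reading of change (5).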

The patch is attached and should apply to a recent -CURRENT. Midori's thesis (which she refers to in the quoted text above) is at https://eggert.org/students/kato-thesis.pdf

Lars

-------------- next part --------------
A non-text attachment was scrubbed...
Name: dctcp.patch
Type: application/octet-stream
Size: 17423 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-net/attachments/20140219/0d01eaa1/attachment.obj>