compressed TIME-WAIT to be decommissioned

From: Gleb Smirnoff <>
Date: Wed, 12 Jan 2022 18:48:59 UTC

[crossposted to current@, but let's keep discussion at net@]

I have already touched the topic with rrs@, jtl@, tuexen@, rscheff@ and
Igor Sysoev (author of nginx).  Now posting for wider discussion.

TLDR: struct tcptw shall be decommissioned

The longer version covers three topics: why does tcptw exist? Why is it no
longer necessary? What would we gain by removing it?

Why does struct tcptw exist?

When a TCP connection goes to TIME-WAIT state, it can only retransmit
the very last ACK, thus it doesn't need all of the control data in the kernel.
However, we are required to keep it in memory for certain amount of time
(2*MSL). So, let's save memory: free the socket, free the tcpcb and
leave only inpcb that will point at small tcptw (much smaller than tcpcb)
that holds enough info to retransmit the last ACK. This was done in
early 2003, see 340c35de6a2.

What was different in 2003 compared to 2022?

* First of all, Internet servers were running i386 with only 2 GB of KVA
  space. They were memory constrained in the first place, not CPU bound
  as they are today.

* Many HTTP connections were made by older browsers, which were not able
  to use persistent HTTP connections.  Those browsers that could would
  recycle connections more often than today.  Default timeouts in Apache
  for persistent connections were short.  So, the ratio of connections
  in TIME-WAIT compared to live connections was much bigger than today.
  Here is sample data from 2008 provided to me by Igor Sysoev:

  tcpcb:        728,   163840,    22938,    72722, 13029632,        0
  tcptw:         88,   163842,    10253,    72949,  2447928,        0

  We see that TIME-WAITs are ~ 50% of live connections.

  Today I see that TIME-WAITs are ~ 1% of connections. My data is biased
  here, since I'm looking at servers that do mostly video streaming. I'd
  be grateful if anybody replies to this email with some other modern data
  on ratio between tcpcb and tcptw allocations.

* Internet bandwidth was lower and thus the average size of an HTTP object
  was much smaller.  That made the average send socket buffer size much
  smaller than today.  Note that TCP socket buffer autosizing came only in
  2009.  This means that today the most significant portion of kernel memory
  consumed by an average TCP connection is the send socket buffer, and
  socket+inpcb+tcpcb is just a fraction of that.  Thus, by swapping tcpcb for
  tcptw we are saving a fraction of a fraction of memory consumed by an
  average connection.

* Who said that 2*MSL (60 seconds) is an adequate time to keep TIME-WAIT?
  In 71d2d5adfe1 I added some stats on usage of tcptw and experimented a bit
  with lowering net.inet.tcp.msl. It appeared that lowering it threefold
  doesn't have a statistically significant effect on TIME-WAIT use stats.
  This means that the already minuscule number of TIME-WAIT connections on a
  modern HTTP server can be lowered by a further factor of 3.  Feel free to
  lower net.inet.tcp.msl and do your own measurements with
  'netstat -sp tcp | grep TIME-WAIT'.  I'd be glad to see your results.
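
The TIME-WAIT ratio quoted for the 2008 sample can be recomputed from the
two rows. This is only a sketch: it assumes the third numeric column of each
row is the number of items currently in use (the layout printed by
'vmstat -z', which the sample appears to follow), and tw_ratio is a
hypothetical helper name, not anything in the tree.

```c
/*
 * Ratio of TIME-WAIT entries to live connections.
 * Inputs are the in-use item counts of the tcptw and tcpcb zones
 * (assumed to be the third numeric column of the sample rows).
 */
static double
tw_ratio(unsigned long tcptw_used, unsigned long tcpcb_used)
{
	/* Avoid dividing by zero on an idle machine. */
	if (tcpcb_used == 0)
		return (0.0);
	return ((double)tcptw_used / (double)tcpcb_used);
}
```

With the sample figures, tw_ratio(10253, 22938) comes out at roughly 0.45,
in line with the ~50% estimate.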

Ok, now what would removal give us?

* One less alloc/free during socket lifetime (immediately).
* Reduced code complexity. inp->inp_ppcb can always be dereferenced as tcpcb.
  Lots of checking for inp->inp_flags & INP_TIMEWAIT goes away (eventually).
* Shrink of struct inpcb. Today inpcb has some TCP-only data, e.g. HPTS.
  Reason for that is obvious - compressed TIME-WAIT. A HPTS-driven connection
  may transition to TIME-WAIT, so we can't use tcpcb. Now we would be able to.
  So, for non-TCP connections the memory footprint shrinks (with following
  changes).
* Embedding inpcb into the protocol's cb. An inpcb becomes one piece of memory
  with tcpcb. One more alloc/free saved during socket lifetime. Reduced code
  complexity, since now inpcb == tcpcb (following changes).

How much memory are we going to lose?

(kgdb) p tcpcb_zone->uz_keg->uk_rsize
$5 = 1064
(kgdb) p tcptw_zone->uz_keg->uk_rsize
$6 = 72
(kgdb) p tcpcbstor->ips_zone->uz_keg->uk_rsize
$8 = 424

After the change a connection in TIME-WAIT would consume 424+1064 bytes
instead of 424+72. Multiply that by the expected number of connections in
TIME-WAIT on your machine.
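
As a rough sketch of that arithmetic, the helpers below take the three zone
item sizes from the kgdb output above and compute the extra memory for a
given number of TIME-WAIT connections. The function and constant names are
made up for illustration, and the sizes are specific to the kernel that was
inspected, so substitute your own.

```c
#include <assert.h>
#include <stdint.h>

/* Zone item sizes in bytes, copied from the kgdb output above. */
enum {
	INPCB_SIZE = 424,	/* tcpcbstor->ips_zone */
	TCPCB_SIZE = 1064,	/* tcpcb_zone */
	TCPTW_SIZE = 72		/* tcptw_zone */
};

/* The comparison only makes sense while tcpcb is the larger structure. */
static_assert(TCPCB_SIZE > TCPTW_SIZE, "tcpcb expected to be larger");

/* Bytes held per TIME-WAIT connection today: inpcb + tcptw. */
static uint64_t
tw_bytes_before(void)
{
	return (INPCB_SIZE + TCPTW_SIZE);
}

/* Bytes held per TIME-WAIT connection after the change: inpcb + tcpcb. */
static uint64_t
tw_bytes_after(void)
{
	return (INPCB_SIZE + TCPCB_SIZE);
}

/* Extra bytes consumed in total for n_timewait TIME-WAIT connections. */
static uint64_t
tw_extra_bytes(uint64_t n_timewait)
{
	return (n_timewait * (tw_bytes_after() - tw_bytes_before()));
}
```

That is 992 extra bytes per connection, so a machine holding, say, 10,000
connections in TIME-WAIT would spend about 9.9 MB more.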

Comments welcome.

Gleb Smirnoff