Date: Wed, 12 Jan 2022 10:48:59 -0800
From: Gleb Smirnoff
To: net@freebsd.org
Cc: current@freebsd.org
Subject: compressed TIME-WAIT to be decomissioned
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current

  Hi!

[crossposted to current@, but let's keep discussion at net@]

I have already touched on the topic with rrs@, jtl@, tuexen@, rscheff@ and
Igor Sysoev (author of nginx).  Now posting for wider discussion.

TLDR: struct tcptw shall be decommissioned.

The longer version covers three topics: why does tcptw exist? Why is it no
longer necessary? What would we get by removing it?

Why does struct tcptw exist?

When a TCP connection goes to the TIME-WAIT state, it can only retransmit
the very last ACK, and thus doesn't need all of its control data in the
kernel.  However, we are required to keep it in memory for a certain amount
of time (2*MSL).  So, let's save memory: free the socket, free the tcpcb and
leave only the inpcb, which will point at a small tcptw (much smaller than
the tcpcb) that holds enough info to retransmit the last ACK.  This was done
in early 2003, see 340c35de6a2.

What was different in 2003 compared to 2022?

* First of all, internet servers were running i386 with only 2 GB of KVA
  space.  They were memory constrained in the first place, not CPU bound
  like they are today.

* Many HTTP connections were made by older browsers, which were not able to
  use persistent HTTP connections.  Those browsers that could would recycle
  connections more often than today.  Default timeouts in Apache for
  persistent connections were short.
  So, the ratio of connections in TIME-WAIT to live connections was much
  bigger than today.  Here is sample data from 2008 provided to me by
  Igor Sysoev:

  ITEM    SIZE    LIMIT    USED    FREE  REQUESTS FAILURES
  tcpcb:  728,  163840,  22938,  72722, 13029632,        0
  tcptw:   88,  163842,  10253,  72949,  2447928,        0

  We see that TIME-WAITs are ~ 50% of live connections.  Today I see that
  TIME-WAITs are ~ 1% of connections.  My data is biased here, since I'm
  looking at servers that do mostly video streaming.  I'd be grateful if
  anybody replies to this email with some other modern data on the ratio
  between tcpcb and tcptw allocations.

* Internet bandwidth was lower and thus the average size of an HTTP object
  was much smaller.  That made the average send socket buffer size much
  smaller than today.  Note that TCP socket buffer autosizing came only in
  2009.  This means that today the most significant portion of kernel
  memory consumed by an average TCP connection is the send socket buffer,
  and socket+inpcb+tcpcb is just a fraction of that.  Thus, by swapping
  tcpcb for tcptw we are saving a fraction of a fraction of the memory
  consumed by an average connection.

* Who said that 2*MSL (60 seconds) is an adequate time to keep TIME-WAIT?
  In 71d2d5adfe1 I added some stats on usage of tcptw and experimented a
  bit with lowering net.inet.tcp.msl.  It appeared that lowering it by a
  factor of three doesn't have a statistically significant effect on the
  TIME-WAIT use stats.  This means that the already minuscule number of
  TIME-WAIT connections on a modern HTTP server can be lowered 3 times
  more.  Feel free to lower net.inet.tcp.msl and do your own measurements
  with 'netstat -sp tcp | grep TIME-WAIT'.  I'd be glad to see your
  results.

Ok, now what would removal give us?

* One less alloc/free during socket lifetime (immediately).

* Reduced code complexity.  inp->inp_ppcb can always be dereferenced as
  tcpcb.  Lots of checking for inp->inp_flags & INP_TIMEWAIT goes away
  (eventually).

* Shrink of struct inpcb.  Today inpcb has some TCP-only data, e.g. HPTS.
  The reason for that is obvious: compressed TIME-WAIT.  An HPTS-driven
  connection may transition to TIME-WAIT, so this data can't live in the
  tcpcb.  Now it would be able to.  So the memory footprint of non-TCP
  connections shrinks (with follow-up changes).

* Embedding inpcb into the protocol's cb.  An inpcb becomes one piece of
  memory with the tcpcb.  One more alloc/free saved during socket lifetime.
  Reduced code complexity, since now inpcb == tcpcb (follow-up changes).

How much memory are we going to lose?

(kgdb) p tcpcb_zone->uz_keg->uk_rsize
$5 = 1064
(kgdb) p tcptw_zone->uz_keg->uk_rsize
$6 = 72
(kgdb) p tcpcbstor->ips_zone->uz_keg->uk_rsize
$8 = 424

After the change, a connection in TIME-WAIT would consume 424+1064 bytes
instead of 424+72.  Multiply that by the expected number of connections in
TIME-WAIT on your machine.

Comments welcome.

-- 
Gleb Smirnoff
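
Worked example for the memory estimate above. This is only a rough sketch: the three sizes are the ones from the kgdb output, while the function names and any TIME-WAIT count passed in are hypothetical and not part of the original measurements.

```c
#include <stdio.h>

/* Per-item sizes in bytes, copied from the kgdb output in the message. */
#define INPCB_SIZE 424UL
#define TCPCB_SIZE 1064UL
#define TCPTW_SIZE 72UL

/*
 * Extra kernel memory, in bytes, used when `ntw` connections in TIME-WAIT
 * keep a full tcpcb instead of a compressed tcptw.  `ntw` is whatever
 * count you observe on your own machine (an input, not a measurement).
 */
static unsigned long
timewait_extra_bytes(unsigned long ntw)
{
	unsigned long with_tcptw = INPCB_SIZE + TCPTW_SIZE;	/* 496 */
	unsigned long with_tcpcb = INPCB_SIZE + TCPCB_SIZE;	/* 1488 */

	return (ntw * (with_tcpcb - with_tcptw));
}

/* Print the estimate for one hypothetical TIME-WAIT count. */
static void
print_timewait_cost(unsigned long ntw)
{
	printf("%lu TIME-WAIT connections: %lu extra bytes\n",
	    ntw, timewait_extra_bytes(ntw));
}
```

The difference is 992 bytes per connection, so a machine holding a hypothetical 10000 connections in TIME-WAIT would spend 9920000 extra bytes, a little under 10 MB.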