Re: Hibernating sockets to support C10M
- In reply to: Mark Delany: "Hibernating sockets to support C10M"
Date: Tue, 11 Nov 2025 19:54:04 UTC
On Tue, 11 Nov 2025 13:46:52 +0000 "Mark Delany" <n6t@oscar.emu.st> wrote:

> This must surely be an old idea so I'm curious as to whether much
> discussion has happened previously and what sort of conclusions they
> came to.
>
> This is mostly thinking about huge numbers of relatively idle TCP
> sockets on servers where clients wish to stay connected for very long
> periods of time waiting for infrequent server pushes. Two examples are
> RSS and DNS Push Notifications (rfc8765). In these cases a client might
> establish a socket and not get server pushed traffic for hours or days.
>
> As I understand it, the main limitation on the number of concurrent TCP
> sockets a system can support is memory. I expect that socket buffers
> are the biggest consumers of kernel memory and while user-space is very
> language and application dependent, it wouldn't surprise me if a
> typical application - even a lean one - requires multiple kB of memory
> for each client connection (and quite possibly other non-memory
> resources).
>
> So the idea is that these idle sockets are "hibernated", which is to
> say that the application and the kernel release most of their memory
> associated with an idle socket.
>
> When the server application makes the decision to hibernate a socket -
> perhaps based on an inactivity timer - it releases as many application
> resources as it can and retains *just* enough state to reconstitute
> those resources at a later time. It then calls the kernel to do the
> same thing - namely release as many kernel resources as possible
> associated with the socket and retain *just* enough state to
> reconstitute the socket at a later time.
>
> Assuming the TCP session is completely idle, the kernel should be able
> to release most of the memory associated with the socket. I don't know
> exactly how much state the kernel needs to reconstitute an idle socket,
> but considering addrs/ports tuple, sequence numbers, socket buffer size
> values, socket options and a few other odds and sods, does the state
> representation of a socket require much more than a couple of hundred
> bytes?

It's already on that order of magnitude:

% vmstat -z | grep socket

shows 944 bytes on FreeBSD 13. However, I'm too lazy to count the
sockbufs and PCBs of an idle socket, so let's call it about 3 kB.
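To put a rough number on the "couple of hundred bytes" question, here is
a purely hypothetical sketch of what an idle connection with an empty
send queue might have to keep across hibernation. Nothing like this
exists in the tree; the field choice and sizes are guesses.

#include <stdint.h>
#include <netinet/in.h>

/*
 * Hypothetical minimal saved state for a hibernated TCP socket: just
 * enough to rebuild the socket, inpcb/tcpcb and sockbufs later.
 */
struct tcp_hibernate_state {
	/* Connection identity (IPv6-sized, also covers mapped IPv4). */
	struct in6_addr	local_addr;
	struct in6_addr	foreign_addr;
	uint16_t	local_port;
	uint16_t	foreign_port;

	/* Sequence space at hibernation time (send queue is empty). */
	uint32_t	snd_una;	/* == snd_nxt == snd_max when idle */
	uint32_t	rcv_nxt;
	uint32_t	ts_recent;	/* last peer timestamp, RFC 7323 */

	/* Window/buffer parameters needed to re-create the sockbufs. */
	uint32_t	snd_wnd;
	uint32_t	rcv_wnd;
	uint32_t	so_snd_hiwat;
	uint32_t	so_rcv_hiwat;
	uint16_t	mss;
	uint8_t		snd_scale;
	uint8_t		rcv_scale;

	/* Negotiated TCP options and SO_* settings, packed as flags. */
	uint32_t	tcp_flags;	/* SACK permitted, timestamps, ... */
	uint32_t	so_options;	/* SO_KEEPALIVE, TCP_NODELAY, ... */
};

That comes to roughly 80 bytes, so the saved state itself is cheap; the
question is what it costs to tear the full socket down to it and back.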
> When the kernel sees incoming traffic for the hibernating socket, it
> reconstitutes the socket by reallocating socket buffers and so on, then
> notifies the application that the socket is readable in the usual way
> via kqueue(), select(), read(), etc. The server application recognizes
> a hibernating socket by fd and reconstitutes the client state prior to
> processing the inbound data.
>
> If the server application wants to send traffic to a hibernating
> socket, it reconstitutes the client state and writes to the socket. On
> seeing the write(), the kernel recognizes a hibernating socket and
> revivifies it from the saved state prior to sending the traffic down
> thru the network stack.

From where will it revivify the state? Swap? In the kernel, handling of
incoming traffic happens in interrupt context, which isn't allowed to
sleep, and paging data in from disk (swapping) would require exactly
that. So it's not an option.

> In the best-case scenario, the kernel requires perhaps 200-ish bytes of
> state memory per hibernating socket and the application may need as
> little as 0-8 bytes of state memory if the fd can be used as an index
> into disk-based state or a pointer array.
>
> What sort of memory savings can hibernating sockets offer?
>
> If we say that an active socket of a lean server application consumes
> 100kB of kernel+user memory and a hibernating socket consumes 0.2kB of
> kernel+user memory then the memory required to support 10M idle sockets
> reduces from 1,000GB to 2GB. More realistically, if we set the number
> of idle sockets to, say, 80%, then the memory reduction is from 1,000GB
> to 201GB which still seems pretty useful.
>
> All that's needed to support this optimization is a hibernate(socketfd)
> syscall and a revivify(socketfd) kernel function triggered by inbound
> traffic and write(). Well, that and a bunch of code, but you get the
> idea.
>
> Thoughts?

From the userland (application) side, swapping already achieves exactly
this, for free and transparently to the application. If ~97 kB per
socket in the application is too much, then rewrite your application to
waste less memory :) nginx is an excellent example.

--
WBR, @nuclight
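For concreteness, a sketch of what the application side of the proposed
interface could look like. hibernate(2) is hypothetical and does not
exist; kqueue registration and the per-client state lookup are elided.
The point is that the event loop itself would not change: the kernel
revivifies the socket on inbound data and the fd simply becomes readable
again.

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <unistd.h>

int hibernate(int fd);		/* proposed syscall, not real */

/* Event loop over sockets already registered with EVFILT_READ on kq. */
void
event_loop(int kq)
{
	struct kevent ev;
	char buf[2048];
	ssize_t n;
	int fd;

	for (;;) {
		if (kevent(kq, NULL, 0, &ev, 1, NULL) != 1)
			break;
		fd = (int)ev.ident;

		/*
		 * The fd became readable, so the kernel has already
		 * revivified it; from here on it is an ordinary socket.
		 */
		n = read(fd, buf, sizeof(buf));
		if (n <= 0) {
			close(fd);
			continue;
		}

		/*
		 * ... look up / rebuild per-client state keyed by fd,
		 * process the data, write any reply ...
		 */

		/*
		 * Connection is expected to go idle again for hours:
		 * hand the kernel-side resources back, keeping only the
		 * minimal saved state (the proposed hibernate(2)).
		 */
		(void)hibernate(fd);
	}
}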