Re: removing support for kernel stack swapping

From: John Baldwin <jhb_at_FreeBSD.org>
Date: Tue, 04 Jun 2024 16:59:24 UTC

On 6/2/24 7:57 PM, Mark Johnston wrote:
> FreeBSD will, when free pages are scarce, try to swap out the kernel
> stacks (typically 16KB per thread) of sleeping user threads.  I'm told
> that this mechanism was first implemented in BSD for the VAX port and
> that stabilizing it was quite an endeavour.
> 
> This feature has wide-ranging implications for code in the kernel.  For
> instance, if a thread allocates a structure on its stack, links it into
> some data structure visible to other threads, and goes to sleep, it must
> use PHOLD to ensure that the stack doesn't get swapped out while
> sleeping.  A missing PHOLD can thus result in a kernel panic, but this
> kind of mistake is very easy to make and hard to catch without thorough
> stress testing.  The kernel stack allocator also requires a fair bit of
> code to implement this feature, and we've had multiple bugs in that
> area, especially in relation to NUMA support.  Moreover, this feature
> will leave threads swapped out after the system has recovered, resulting
> in high scheduling latency once they're ready to run again.
> 
> In a very stressed system, it's possible that we can free up something
> like 1MB of RAM using this mechanism.  I argue that this mechanism is
> not worth it on modern systems: it isn't going to make the difference
> between a graceful recovery from memory pressure and a catatonic state
> which forces a reboot.  The complexity and resulting bugs it induces is
> not worth it.
> 
> At the BSDCan devsummit I proposed removing support for kernel stack
> swapping and got only positive feedback.  Does anyone here have any
> comments or objections?

+1

FWIW, things like epoch and rm(9) locks follow this pattern of linking
on-stack items into lists that other threads can see.
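
To illustrate the pattern, here is a minimal userland sketch using
pthreads.  It is an analogy only, not kernel code: struct tracker,
walker() and sleeper_demo() are hypothetical names, and the real
consumers (e.g. the tracker structures used with epoch and rm(9)) differ
in detail.  The point is just that the walker dereferences memory living
on the sleeping thread's stack, so that stack has to stay resident while
the record is linked.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical record; in the kernel it would live on a kernel stack. */
struct tracker {
	struct tracker *next;
	int visited;		/* set by another thread walking the list */
};

static struct tracker *head;
static pthread_mutex_t list_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t list_cv = PTHREAD_COND_INITIALIZER;

/* Walker: touches every record currently linked into the shared list. */
static void *
walker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&list_mtx);
	while (head == NULL)
		pthread_cond_wait(&list_cv, &list_mtx);
	for (struct tracker *t = head; t != NULL; t = t->next)
		t->visited = 1;	/* dereferences another thread's stack */
	pthread_cond_broadcast(&list_cv);
	pthread_mutex_unlock(&list_mtx);
	return (NULL);
}

/*
 * Sleeper: links a stack-allocated record into the list, then sleeps
 * until the walker has visited it.  Returns 1 on success, -1 if the
 * walker thread could not be created.
 */
static int
sleeper_demo(void)
{
	struct tracker tr = { .next = NULL, .visited = 0 };
	pthread_t w;

	if (pthread_create(&w, NULL, walker, NULL) != 0)
		return (-1);

	pthread_mutex_lock(&list_mtx);
	tr.next = head;
	head = &tr;		/* the stack address is now globally visible */
	pthread_cond_broadcast(&list_cv);
	while (!tr.visited)	/* "sleeping" with the record still linked */
		pthread_cond_wait(&list_cv, &list_mtx);
	head = tr.next;		/* unlink before the frame goes away */
	assert(head != &tr);
	pthread_mutex_unlock(&list_mtx);

	pthread_join(w, NULL);
	return (tr.visited);
}
```

In the kernel, the window between linking and unlinking is where a
swapped-out stack would turn the walker's dereference into a fault.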

In terms of the memory savings, I don't think 1MB (or even a few MB) is
worth the complexity.

I agree that if we want to find ways to free up RAM while under memory
pressure, there are probably other caches we can prune with less
complexity.  (In fact, just keeping the kstacks around might achieve
some of this "naturally", since we would invoke vm_lowmem a bit sooner
and drain the caches hooked up to it.)

In terms of swapping out PCBs, that would have a negative impact on
debugging (e.g. if the PCB is swapped out, you can't look at the
kthread in question in a crash dump or over a remote GDB connection).
The same applies if we were to swap out other parts of the PCB, like
the XSAVE area on x86.  For XSAVE in particular, we should probably
look at using the XSAVE compact format if we are worried about RAM
consumption.
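
As a rough way to see what the compact format could save on a given
machine, here is a userland sketch that reads both sizes from CPUID leaf
0xD.  It is not FreeBSD kernel code: struct xsave_sizes and
query_xsave_sizes() are hypothetical names, and it assumes GCC's or
Clang's <cpuid.h>.  Note that the compacted size reported in sub-leaf 1
covers the states enabled in XCR0 | IA32_XSS, so it is not a strict
like-for-like comparison with the sub-leaf 0 value when supervisor
states are enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

struct xsave_sizes {
	uint32_t standard;	/* bytes, standard layout, XCR0 features */
	uint32_t compacted;	/* bytes, compacted layout, XCR0 | IA32_XSS */
};

/*
 * Fills *out and returns 1 if the CPU and OS report XSAVE with XSAVEC
 * support; otherwise zeroes *out and returns 0 (including on non-x86).
 */
static int
query_xsave_sizes(struct xsave_sizes *out)
{
	assert(out != NULL);
	out->standard = 0;
	out->compacted = 0;
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;
	uint32_t standard;

	/* CPUID.1:ECX bit 27 (OSXSAVE): the OS has enabled XSAVE. */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) ||
	    (ecx & (1u << 27)) == 0)
		return (0);

	/* CPUID.(EAX=0xD,ECX=0):EBX: size for features enabled in XCR0. */
	if (!__get_cpuid_count(0x0d, 0, &eax, &ebx, &ecx, &edx))
		return (0);
	standard = ebx;

	/* CPUID.(EAX=0xD,ECX=1):EAX bit 1 is XSAVEC; EBX is the size. */
	if (!__get_cpuid_count(0x0d, 1, &eax, &ebx, &ecx, &edx) ||
	    (eax & (1u << 1)) == 0 || ebx == 0)
		return (0);

	out->standard = standard;
	out->compacted = ebx;
	return (1);
#else
	return (0);
#endif
}
```

Printing the two fields for the machine in question gives a per-thread
upper bound on what switching formats might save.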

-- 
John Baldwin