Message-ID: <6ddedba5-fc2f-4caa-aab5-bd29ca4fdf0b@FreeBSD.org>
Date: Tue, 4 Jun 2024 09:59:24 -0700
From: John Baldwin
To: Mark Johnston, freebsd-arch@freebsd.org
Subject: Re: removing support for kernel stack swapping
List-Id: Discussion related to FreeBSD architecture
List-Archive: https://lists.freebsd.org/archives/freebsd-arch

On 6/2/24 7:57 PM, Mark Johnston wrote:
> FreeBSD will, when free pages are scarce, try to swap out the kernel
> stacks (typically 16KB per thread) of sleeping user threads.
> I'm told that this mechanism was first implemented in BSD for the VAX
> port and that stabilizing it was quite an endeavour.
>
> This feature has wide-ranging implications for code in the kernel. For
> instance, if a thread allocates a structure on its stack, links it into
> some data structure visible to other threads, and goes to sleep, it must
> use PHOLD to ensure that the stack doesn't get swapped out while
> sleeping. A missing PHOLD can thus result in a kernel panic, but this
> kind of mistake is very easy to make and hard to catch without thorough
> stress testing. The kernel stack allocator also requires a fair bit of
> code to implement this feature, and we've had multiple bugs in that
> area, especially in relation to NUMA support. Moreover, this feature
> will leave threads swapped out after the system has recovered, resulting
> in high scheduling latency once they're ready to run again.
>
> In a very stressed system, it's possible that we can free up something
> like 1MB of RAM using this mechanism. I argue that this mechanism is
> not worth it on modern systems: it isn't going to make the difference
> between a graceful recovery from memory pressure and a catatonic state
> which forces a reboot. The complexity and resulting bugs it induces is
> not worth it.
>
> At the BSDCan devsummit I proposed removing support for kernel stack
> swapping and got only positive feedback. Does anyone here have any
> comments or objections?

+1

FWIW, things like epoch and rm(9) locks follow the pattern of storing
on-stack items in linked lists.

In terms of the memory savings, I don't think 1MB (or even a few MB)
is worth the complexity. I agree that if we want to find ways to free
up RAM while under memory pressure, there are probably other caches we
can prune with less complexity. (And in fact, just keeping the kstacks
around might lead to some of this "naturally", since we would just
invoke vm_lowmem a bit sooner to drain the caches hooked up to it.)
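For anyone unfamiliar with the on-stack pattern mentioned above, here is
a minimal single-threaded userspace sketch of it. This is not kernel
code, and it is not how epoch or rm(9) are implemented. The names
waiter, wait_list, sim_hold and sim_release are made up for
illustration; sim_hold/sim_release merely stand in for the PHOLD/PRELE
pair, and the "waker" is called inline instead of running in another
thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record that lives on the sleeping thread's stack. */
struct waiter {
	struct waiter *next;
	int woken;
};

/* Hypothetical shared list, visible to other threads in a real system. */
static struct waiter *wait_list;

/* Simulated hold count; a stand-in for PHOLD/PRELE, not the real macros. */
static int hold_count;

static void sim_hold(void) { hold_count++; }
static void sim_release(void) { hold_count--; }

/*
 * The waker writes through pointers into other threads' stacks.
 * If such a stack had been swapped out, this store would fault.
 */
static int
wake_all(void)
{
	int n = 0;

	for (struct waiter *w = wait_list; w != NULL; w = w->next) {
		w->woken = 1;
		n++;
	}
	wait_list = NULL;
	return (n);
}

static int
sleep_on_list(void)
{
	struct waiter w = { .next = wait_list, .woken = 0 };

	sim_hold();		/* pin the stack before publishing &w */
	wait_list = &w;		/* on-stack item is now globally reachable */

	/* A real thread would sleep here; simulate the waker running. */
	assert(hold_count > 0);
	wake_all();

	sim_release();		/* safe: w is no longer on the list */
	return (w.woken);
}
```

The point of the sketch is the ordering: the hold has to be taken
before &w becomes reachable through wait_list and dropped only after
the item is unlinked. Forgetting the hold compiles and usually works,
which is why this class of bug is hard to catch.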

In terms of swapping out PCBs, that would have a negative impact on
debugging (e.g. if the PCB is swapped out, you can't look at the
kthread in question in a crash dump or over a remote GDB connection).
The same applies if we were to swap out other parts of the PCB, such as
the XSAVE area on x86. For XSAVE in particular, we should probably look
at using the XSAVE compact format if we are worried about RAM
consumption.

-- 
John Baldwin