From: Warner Losh
Date: Wed, 27 Aug 2025 18:59:21 -0600
Subject: Re: S4 hibernate support for FreeBSD
To: obiwac
Cc: Poul-Henning Kamp, freebsd-arch@freebsd.org, freebsd-current@freebsd.org
References: <202508271938.57RJcX0n009344@critter.freebsd.dk>
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current


On Wed, Aug 27, 2025 at 3:34 PM obiwac <obiwac@freebsd.org> wrote:
> Thanks for your in-depth responses!
>
> > It's not clear to me what the kernel should do if it decides that it can't resume.
>
> We could pass the burden on the user to select a different option at
> the loader. E.g. the kernel could show a persistent error message if
> failing to resume from hibernate, waiting for user input to reboot. I
> don't think the least-surprising option would be to have the loader
> just boot normally if the previous S4 resume failed at least.

The trouble here is that the ability to persist data in the loader is limited. Waiting for the user to reboot would be good, maybe, assuming that the video device is good. And on a lot of laptops, you might not have enough of the kernel going to survive the handoff from the loader to the kernel. All the other systems I've used only try the resume once.
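
To make the resume-once behaviour concrete, here is a minimal Python sketch. It assumes the hibernate image carries a small header with a `valid` flag, and that whoever attempts the resume can clear that one flag on storage before jumping into the image. The header layout and the names `choose_boot_path` and `valid` are hypothetical and do not describe anything in the loader today.

```python
def choose_boot_path(header: dict) -> str:
    """Decide between resuming and a normal boot (hypothetical header layout)."""
    if not header.get("valid", False):
        # No usable image, or a previous resume attempt already consumed it.
        return "normal-boot"
    # Invalidate the image *before* attempting the resume, so that a resume
    # that dies half-way cannot be retried on the next power-on.
    header["valid"] = False
    return "resume"
```

With this ordering, a failed resume leaves the flag cleared and the next boot takes the normal path, which would match the behaviour described above without the loader having to remember anything else.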

> > The Linux approach of having a resume kernel is interesting, and maybe shouldn't be discounted given the kexec work that's lurking in Phabricator.
>
> Open to trying this out, and it might be quicker to get something
> working with kexec than working in the loader. Actually, doing it this
> way would open the doors to having the initial kernel load a small
> graphical environment to prompt for a password for decrypting the
> drive/swap file, rather than having to shove this in loader. This is
> what systemd-ask-password-plymouth does, though I don't know if any
> Linux distros support this on resuming from hibernate. Maybe a little
> heavy though vs putting in loader.

Yea. There's some measure of functionality that would still need to be in the mini-kernel, since the loader would have to know where it could or couldn't put things.

It also occurs to me that, unlike a kernel crash dump, you could easily invalidate all the data that's on a stable backing store, and it will fault in on first touch. You'd still have to save all of the kernel and all its state, but by and large you don't need to save the user space of the running programs at all, since it will all be either in stable files (libc, etc.) or in swap space (which might pose allocation problems if you are writing the kernel resume dump there as well).
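
As a rough illustration of that split, the following Python sketch sorts pages into three buckets. Everything in it is an assumption made for the example: the `Page` model, the `Owner` and `Backing` enums, and the rule that dirty or unbacked user pages get pushed out to swap before the image is written. It is a toy model, not how the VM system represents pages.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Owner(Enum):
    KERNEL = auto()
    USER = auto()


class Backing(Enum):
    FILE = auto()  # file on stable storage, e.g. libc text
    SWAP = auto()  # anonymous memory that already has a copy in swap
    NONE = auto()  # anonymous memory with no copy on stable storage


@dataclass(frozen=True)
class Page:
    owner: Owner
    backing: Backing
    dirty: bool = False


def disposition(page: Page) -> str:
    """Return "save", "pageout" or "drop" for one page (hypothetical policy)."""
    if page.owner is Owner.KERNEL:
        return "save"      # kernel memory always goes into the resume image
    if page.backing is Backing.NONE or page.dirty:
        return "pageout"   # must reach swap first, then it can be invalidated
    return "drop"          # clean and backed: invalidate, fault in on first touch
```

Only the "save" bucket has to fit in the resume image; the "pageout" bucket is where the swap allocation concern mentioned above would show up.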

Another challenge that occurred to me is 'how do I know I can resume?' While the resume functions are mostly synchronous, you have the same timing problem as 'mount root' for all the mount points, it seems. Though it may be enough to just stall until the device reconfigures all I/O requests (or some longish timeout happens).
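
The "stall until the devices are back, or give up after a longish timeout" idea can be sketched as a plain polling loop. This is a Python illustration only: `wait_for_devices`, the `probe` callback, the 30 second default and device names such as `ada0` are all made up for the example, and a kernel would presumably sleep on an event rather than poll.

```python
import time


def wait_for_devices(required, probe, timeout_s=30.0, poll_s=0.1):
    """Poll until probe(name) is true for every required device.

    Returns (True, set()) once everything has shown up, or
    (False, still_missing) if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    missing = set(required)
    while missing:
        # Re-probe only the devices that have not shown up yet.
        missing = {name for name in missing if not probe(name)}
        if not missing:
            break
        if time.monotonic() >= deadline:
            return False, missing
        time.sleep(poll_s)
    return True, set()
```

A caller would pass the providers the saved mount points depend on, for example `wait_for_devices(["ada0", "da1"], probe)`, and treat a `False` result as "cannot resume".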


> phk@, I like the idea of operating at the kernel/userland boundary,
> but this would make resuming from S4 have pretty high latency right?
> How would passing driver state from the previous kernel to the new one
> work in practice without first initializing the driver? We couldn't
> just suspend/resume in this case I guess.

It's an interesting notion, but I think it would need to be a carefully thought out refinement, rather than the first stop. But I'm sure I'm missing a lot.
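
For readers skimming the thread, the resume half of the kernel/userland-boundary proposal quoted at the end of this mail can be written down as an ordered plan. The Python sketch below only mirrors the order of the bullet points in that mail. The image layout (a dict with `global` and `procs` keys) and the step names are invented for illustration, and the last step (sending the EAGAIN'd threads back down) is left out.

```python
def resume_plan(image: dict) -> list:
    """Turn a saved image (hypothetical layout) into an ordered list of steps."""
    pids = [proc["pid"] for proc in image["procs"]]
    # 1. Global kernel state is replayed first (mounts, sysctls, jails, ...).
    steps = [("replay-global", key) for key in image["global"]]
    # 2. Then the saved processes are reloaded ...
    steps += [("reload-process", pid) for pid in pids]
    # 3. ... their per-process kernel state (open files etc.) is replayed ...
    steps += [("replay-process-state", pid) for pid in pids]
    # 4. ... and only then does zzz(2) return and the scheduler take over.
    steps += [("zzz-returns", pid) for pid in pids]
    return steps
```

Calling it with `{"global": ["mounts", "sysctls"], "procs": [{"pid": 1}, {"pid": 2}]}` yields the global replays, then both reloads, then both per-process replays, then both wakeups. The point of the ordering is that nothing is handed to the scheduler before the state it depends on exists again.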

Warner

> On Wed, 27 Aug 2025 at 21:38, Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
> >
> > --------
> > Warner Losh writes:
> >
> > > To the extent you can do it, even to the extent of heroics, you don't want
> > > to destroy and recreate geom_disks.
> > > […]
> > > but once destroyed, the upper layers are orphaned and there's
> > > no way to recreate them.
> >
> > In terms of "getting to S4" I agree 100%, but I don't think
> > the road should end there.
> >
> > It was a design decision that geom treat all arriving disks as "a
> > new disk", because apart from a few tour-de-force academic exercises,
> > all current filesystems assume the existence of a "mount-session"
> > during which they are in supreme control of the content of their
> > underlying block-store, and there is no useful way to determine if the
> > block-store was modified while not under our control.
> >
> > We reasonably expect that nobody messes with our disks while in S3,
> > even though much modern hardware would allow it, and again, that
> > can help us "get to S4".
> >
> >
> > However, in "real S4" filesystems need to learn to suspend, and to
> > resume when geom-tasting offers up a provider which contains their
> > data - even if all other aspects of that provider are different.
> >
> > But...
> >
> > If it were up to me, S4 suspend would operate at the kernel/user-land
> > boundary and not at the kernel/hardware boundary.
> >
> > Ideally we own one side of the kernel/hardware boundary and the
> > other side is well documented.
> >
> > In practice:  Not so much.
> >
> > In comparison we own 100% of both sides of the kernel/user-land
> > boundary - nothing can prevent us from making it work.
> >
> >
> > Suspend:
> >
> > * Send all processes SIGSUSPEND which defaults to calling a new
> >   "zzz(2)" syscall.  Smart procs catch and do something sensible first.
> >
> > * Pause any processes that did not take the hint.
> >
> > * EAGAIN all userland threads in the kernel up to the syscall level.
> >
> > * Save all processes to storage along with their kernel state.
> >
> > * Save global kernel state to storage.
> >
> > * Tell the firmware to go ahead.
> >
> >
> > Resume:
> >
> > * Boot a kernel on some hardware.
> >   Usually the same kernel on the same hardware, but
> >   it doesn't have to be (!)
> >
> > * Instead of /sbin/init execute /sbin/resume, which:
> >
> >     * replays global kernel state
> >
> >     * reloads the saved processes
> >
> >     * replays their individual kernel state (open files etc.)
> >
> > * Mark their zzz(2) as done and hand them to the scheduler.
> >   Smart processes do smart thing when zzz(2) returns.
> >
> > * Send the EAGAIN user threads in syscall level back down.
> >
> >
> > The kernel state to be saved amounts to something like:
> >
> > Per process:
> >
> > * open filedescriptors, including filesystem state
> > * mapped files
> > * POSIX IPC and SHMEM
> > * AF_UNIX sockets (& pipes)
> > * Per process device driver state.
> >
> > Global:
> >
> > * mounts
> > * sysctls
> > * jails
> > * network interface and route config
> > * device driver state, as required.
> >
> > Poul-Henning
> >
> > --
> > Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> > phk@FreeBSD.ORG         | TCP/IP since RFC 956
> > FreeBSD committer       | BSD since 4.3-tahoe
> > Never attribute to malice what can adequately be explained by incompetence.