Re: S4 hibernate support for FreeBSD

From: obiwac <obiwac_at_freebsd.org>
Date: Wed, 27 Aug 2025 21:33:49 UTC

Thanks for your in-depth responses!

> It's not clear to me what the kernel should do if it decides that it can't resume.

We could pass the burden on to the user to select a different option
at the loader. E.g. the kernel could show a persistent error message
if it fails to resume from hibernate, then wait for user input before
rebooting. At the very least, I don't think having the loader just
boot normally after a failed S4 resume is the least-surprising option.
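
To make that concrete, here is a rough sketch of the policy I have in
mind. All the names (BootAction, choose_boot_action, the two flags) are
made up for illustration; nothing like this exists in the loader or
kernel today.

```python
from enum import Enum


class BootAction(Enum):
    BOOT_NORMALLY = "boot normally"
    RESUME = "resume from hibernate image"
    HALT_AND_PROMPT = "show persistent error and wait for user input"


def choose_boot_action(image_present: bool, resume_possible: bool) -> BootAction:
    """Hypothetical policy: never silently fall back to a normal boot
    when a hibernate image exists but cannot be resumed."""
    if not image_present:
        # No previous S4 suspend to resume from.
        return BootAction.BOOT_NORMALLY
    if resume_possible:
        return BootAction.RESUME
    # An image exists but the kernel decided it can't resume: stop and
    # let the user pick a different option at the loader after reboot.
    return BootAction.HALT_AND_PROMPT
```

The point is only the last branch: a failed resume ends in an explicit
prompt rather than a quiet normal boot.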

> The linux approach of having a resume kernel is interesting, and maybe shouldn't be discounted given the kexec work that's lurking in Phabricator.

Open to trying this out, and it might be quicker to get something
working with kexec than working in the loader. Actually, doing it this
way would open the doors to having the initial kernel load a small
graphical environment to prompt for a password for decrypting the
drive/swap file, rather than having to shove this in the loader. This
is what systemd-ask-password-plymouth does, though I don't know if any
Linux distros support this on resuming from hibernate. Maybe a little
heavy, though, compared with putting it in the loader.

phk@, I like the idea of operating at the kernel/userland boundary,
but this would make resuming from S4 have pretty high latency, right?
How would passing driver state from the previous kernel to the new one
work in practice without first initializing the driver? We couldn't
just suspend/resume in this case I guess.


On Wed, 27 Aug 2025 at 21:38, Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
>
> --------
> Warner Losh writes:
>
> > To the extent you can do it, even to the extent of heroics, you don't want
> > to destroy and recreate geom_disks.
> > […]
> > but once destroyed, the upper layers are orphaned and there's
> > no way to recreate them.
>
> In terms of "getting to S4" I agree 100%, but I don't think
> the road should end there.
>
> It was a design decision that geom treats every arriving disk as "a
> new disk", because apart from a few tour-de-force academic exercises,
> all current filesystems assume the existence of a "mount-session"
> during which they are in supreme control of the content of their
> underlying block-store, and there is no useful way to determine if the
> block-store was modified while not under our control.
>
> We reasonably expect that nobody messes with our disks while in S3,
> even though much modern hardware would allow it, and again, that
> can help us "get to S4".
>
>
> However, in "real S4" filesystems need to learn to suspend, and to
> resume when geom-tasting offers up a provider which contains their
> data - even if all other aspects of that provider are different.
>
> But...
>
> If it were up to me, S4 suspend would operate at the kernel/user-land
> boundary and not at the kernel/hardware boundary.
>
> Ideally we own one side of the kernel/hardware boundary and the
> other side is well documented.
>
> In practice:  Not so much.
>
> In comparison we own 100% of both sides of the kernel/user-land
> boundary - nothing can prevent us from making it work.
>
>
> Suspend:
>
> * Send all processes SIGSUSPEND which defaults to calling a new
>   "zzz(2)" syscall.  Smart procs catch and do something sensible first.
>
> * Pause any processes that did not take the hint.
>
> * EAGAIN all userland threads in the kernel up to the syscall level.
>
> * Save all processes to storage along with their kernel state.
>
> * Save global kernel state to storage.
>
> * Tell the firmware to go ahead.
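
As a rough userland illustration of the "smart procs catch and do
something sensible first" step: SIGSUSPEND and zzz(2) are only proposed
in this thread and do not exist, so the sketch below assumes SIGUSR1 as
a stand-in and merely records what a handler might do before it would
call zzz(2). It needs Python 3.8+ for signal.raise_signal().

```python
import signal

# Assumption: SIGUSR1 stands in for the proposed (non-existent) SIGSUSPEND.
SUSPEND_SIGNAL = signal.SIGUSR1

events = []


def on_suspend(signum, frame):
    # A "smart" process would do its own preparation here, e.g. flush
    # caches or close network connections that won't survive the sleep.
    events.append("prepared for suspend")
    # In the proposal, this is where the process would then call zzz(2).


# Catch the signal instead of taking the default action.
signal.signal(SUSPEND_SIGNAL, on_suspend)

# Simulate the kernel sending the suspend signal to this process.
signal.raise_signal(SUSPEND_SIGNAL)
```

A process that installs no handler would get the default action, which
in the proposal is to call zzz(2) directly.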
>
>
> Resume:
>
> * Boot a kernel on some hardware.
>   Usually the same kernel on the same hardware, but
>   it doesn't have to be (!)
>
> * Instead of /sbin/init execute /sbin/resume, which:
>
> * replays global kernel state
>
> * reloads the saved processes
>
> * replays their individual kernel state (open files etc.)
>
> * Mark their zzz(2) as done and hand them to the scheduler.
>   Smart processes do smart things when zzz(2) returns.
>
> * Send the EAGAIN user threads in syscall level back down.
>
>
> The kernel state to be saved amounts to something like:
>
> Per process:
>
> * open file descriptors, including filesystem state
> * mapped files
> * POSIX IPC and SHMEM
> * AF_UNIX sockets (& pipes)
> * Per process device driver state.
>
> Global:
>
> * mounts
> * sysctls
> * jails
> * network interface and route config
> * device driver state, as required.
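
To get a feel for the shape of that saved state, here is a toy model of
the two lists above as records that round-trip through a serialized
form. The class and field names and the use of JSON are assumptions
made purely for illustration; a real on-disk format would look nothing
like this.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProcessState:
    # Mirrors the "Per process" list; every field is a placeholder.
    pid: int
    open_fds: list = field(default_factory=list)
    mapped_files: list = field(default_factory=list)
    ipc_objects: list = field(default_factory=list)   # POSIX IPC and SHMEM
    unix_sockets: list = field(default_factory=list)  # AF_UNIX sockets and pipes
    driver_state: dict = field(default_factory=dict)


@dataclass
class GlobalState:
    # Mirrors the "Global" list.
    mounts: list = field(default_factory=list)
    sysctls: dict = field(default_factory=dict)
    jails: list = field(default_factory=list)
    net_config: dict = field(default_factory=dict)
    driver_state: dict = field(default_factory=dict)


@dataclass
class Snapshot:
    global_state: GlobalState
    processes: list


def save(snapshot: Snapshot) -> str:
    # asdict() recurses into nested dataclasses, including those in lists.
    return json.dumps(asdict(snapshot), sort_keys=True)


def load(blob: str) -> Snapshot:
    raw = json.loads(blob)
    return Snapshot(
        global_state=GlobalState(**raw["global_state"]),
        processes=[ProcessState(**p) for p in raw["processes"]],
    )
```

The resume side would replay GlobalState first and each ProcessState
afterwards, matching the order of the resume steps above.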
>
> Poul-Henning
>
> --
> Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> phk@FreeBSD.ORG         | TCP/IP since RFC 956
> FreeBSD committer       | BSD since 4.3-tahoe
> Never attribute to malice what can adequately be explained by incompetence.