Re: S4 hibernate support for FreeBSD

From: Poul-Henning Kamp <phk_at_phk.freebsd.dk>
Date: Wed, 27 Aug 2025 19:38:33 UTC
--------
Warner Losh writes:

> To the extent you can do it, even to the extent of heroics, you don't want
> to destroy and recreate geom_disks.
> […]
> but once destroyed, the upper layers are orphaned and there's
> no way to recreate them.

In terms of "getting to S4" I agree 100%, but I don't think
the road should end there.

It was a design decision that geom treats every arriving disk as "a
new disk", because apart from a few tour-de-force academic exercises,
all current filesystems assume the existence of a "mount-session"
during which they are in supreme control of the content of their
underlying block-store, and there is no useful way to determine if the
block-store was modified while not under our control.

We reasonably expect that nobody messes with our disks while in S3,
even though much modern hardware would allow it, and again, that
can help us "get to S4".


However, in "real S4" filesystems need to learn to suspend, and to
resume when geom-tasting offers up a provider which contains their
data - even if all other aspects of that provider are different.
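
As a rough illustration of "a provider which contains their data",
here is a minimal sketch in C.  It assumes the filesystem keeps an
identifier on the media itself.  The struct and function names are
hypothetical and are not existing GEOM or filesystem interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: what a suspended filesystem remembers about its store. */
struct suspended_fs {
	uint8_t fs_id[16];	/* identity stored on the media itself */
};

/* Hypothetical: what tasting learns about a newly arrived provider. */
struct tasted_provider {
	const char *name;	/* may differ after resume */
	uint32_t sectorsize;	/* may differ after resume */
	uint64_t mediasize;	/* may differ after resume */
	uint8_t fs_id[16];	/* read from the on-disk metadata */
};

/* Claim the provider if it carries our data, whatever else changed. */
static bool
fs_taste_for_resume(const struct suspended_fs *fs,
    const struct tasted_provider *pp)
{
	assert(fs != NULL && pp != NULL);
	return (memcmp(fs->fs_id, pp->fs_id, sizeof(fs->fs_id)) == 0);
}
```

Only the on-disk identity is compared.  The provider name and
geometry are deliberately ignored, and the sketch says nothing about
whether the content was modified in the meantime.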

But...

If it were up to me, S4 suspend would operate at the kernel/user-land
boundary and not at the kernel/hardware boundary.

Ideally we own one side of the kernel/hardware boundary and the
other side is well documented.

In practice:  Not so much.

In comparison we own 100% of both sides of the kernel/user-land
boundary - nothing can prevent us from making it work.


Suspend:

* Send all processes SIGSUSPEND which defaults to calling a new
  "zzz(2)" syscall.  Smart procs catch and do something sensible first.

* Pause any processes that did not take the hint.

* EAGAIN all userland threads in the kernel up to the syscall level.

* Save all processes to storage along with their kernel state.

* Save global kernel state to storage.

* Tell the firmware to go ahead.
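
The ordering of the suspend steps can be sketched as a toy model in
C.  Nothing here is kernel code: SIGSUSPEND and zzz(2) are only
proposed above, and every type and function name below is a
hypothetical stand-in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How a process reacts to the proposed SIGSUSPEND (all hypothetical). */
enum sig_reaction {
	REACT_DEFAULT,		/* default action: call zzz(2) right away */
	REACT_HANDLER,		/* "smart": run a handler, then call zzz(2) */
	REACT_NONE		/* did not take the hint */
};

enum proc_state { PROC_RUNNING, PROC_IN_ZZZ, PROC_PAUSED };

struct model_proc {
	enum sig_reaction reaction;
	enum proc_state state;
	bool handler_ran;	/* the smart process got to prepare */
	bool in_syscall;	/* a thread is inside the kernel */
	bool restart_syscall;	/* that thread was backed out with EAGAIN */
	bool saved;		/* image and kernel state written out */
};

/* Walk the suspend steps in order; returns the number of saved procs. */
static size_t
model_suspend(struct model_proc *procs, size_t n)
{
	size_t i, nsaved = 0;

	assert(procs != NULL || n == 0);

	/* Signal everybody, then pause whoever did not react. */
	for (i = 0; i < n; i++) {
		switch (procs[i].reaction) {
		case REACT_HANDLER:
			procs[i].handler_ran = true;
			/* FALLTHROUGH */
		case REACT_DEFAULT:
			procs[i].state = PROC_IN_ZZZ;
			break;
		case REACT_NONE:
			procs[i].state = PROC_PAUSED;
			break;
		}
	}
	/* Back threads out of the kernel to the syscall level. */
	for (i = 0; i < n; i++) {
		if (procs[i].in_syscall) {
			procs[i].in_syscall = false;
			procs[i].restart_syscall = true;
		}
	}
	/* Nothing runs any more, so every process can be saved. */
	for (i = 0; i < n; i++) {
		procs[i].saved = true;
		nsaved++;
	}
	/* Global state and the firmware call are outside this model. */
	return (nsaved);
}
```

The point of the model is the order: no process is saved until every
process is either in zzz(2) or paused and no thread is left inside
the kernel.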


Resume:

* Boot a kernel on some hardware.
  Usually the same kernel on the same hardware, but
  it doesn't have to be (!)

* Instead of /sbin/init execute /sbin/resume, which:

* replays global kernel state

* reloads the saved processes

* replays their individual kernel state (open files etc.)

* Mark their zzz(2) as done and hand them to the scheduler.
  Smart processes do smart things when zzz(2) returns.

* Send the EAGAIN user threads in syscall level back down.
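
The resume order can be modelled the same way.  This is a sketch
only: /sbin/resume does not exist, and the structs and the function
below are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of how a process went to sleep. */
struct saved_proc {
	bool was_in_zzz;	/* went to sleep via zzz(2) */
	bool restart_syscall;	/* a thread was backed out with EAGAIN */
};

/* Hypothetical record of what resume has done for a process. */
struct resumed_proc {
	bool loaded;		/* image reloaded */
	bool kstate_replayed;	/* open files etc. recreated */
	bool zzz_returned;	/* zzz(2) marked as done */
	bool syscall_reissued;	/* EAGAIN thread sent back down */
	bool runnable;		/* handed to the scheduler */
};

/* Walk the resume steps in order; returns the number of runnable procs. */
static size_t
model_resume(const struct saved_proc *in, struct resumed_proc *out,
    size_t n, bool *global_replayed)
{
	size_t i, nrunnable = 0;

	assert(global_replayed != NULL);
	assert((in != NULL && out != NULL) || n == 0);

	/* Global kernel state is replayed before any process is reloaded. */
	*global_replayed = true;

	for (i = 0; i < n; i++) {
		out[i].loaded = true;
		out[i].kstate_replayed = true;
		/* zzz(2) only "returns" in processes that called it. */
		out[i].zzz_returned = in[i].was_in_zzz;
		out[i].runnable = true;
		nrunnable++;
	}
	/* Last: threads backed out with EAGAIN re-enter their syscall. */
	for (i = 0; i < n; i++)
		out[i].syscall_reissued = in[i].restart_syscall;
	return (nrunnable);
}
```

A process that was merely paused comes back runnable without a
zzz(2) return, so it never learns that a suspend happened.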


The kernel state to be saved amounts to something like:

Per process:

* open file descriptors, including filesystem state
* mapped files
* POSIX IPC and SHMEM
* AF_UNIX sockets (& pipes)
* Per process device driver state.

Global:

* mounts
* sysctls
* jails
* network interface and route config
* device driver state, as required.
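
One possible way to carry the two lists above is a stream of tagged
records.  The sketch below is an assumption about the shape of such
a format, not a proposal for the actual one; all names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum state_scope { SCOPE_PROCESS, SCOPE_GLOBAL };

/* One tag per item in the lists above (hypothetical names). */
enum state_kind {
	/* per process */
	ST_FILEDESC, ST_MMAP, ST_POSIX_IPC, ST_UNIX_SOCKET, ST_PROC_DEVICE,
	/* global */
	ST_MOUNT, ST_SYSCTL, ST_JAIL, ST_NETCONFIG, ST_DEVICE,
	ST_NKINDS
};

struct state_record {
	enum state_kind kind;
	int pid;	/* owning process, or -1 for global records */
	size_t len;	/* size of the payload that follows on storage */
};

/* Per-process kinds are listed first, global kinds start at ST_MOUNT. */
static enum state_scope
state_scope_of(enum state_kind kind)
{
	return (kind < ST_MOUNT ? SCOPE_PROCESS : SCOPE_GLOBAL);
}

/* A record is well formed if its owner matches its scope. */
static bool
state_record_valid(const struct state_record *r)
{
	assert(r != NULL);
	if (r->kind >= ST_NKINDS)
		return (false);
	if (state_scope_of(r->kind) == SCOPE_GLOBAL)
		return (r->pid == -1);
	return (r->pid >= 0);
}
```

Keeping the scope in the tag lets a resume program replay all global
records before it touches any per-process record.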

Poul-Henning

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.