Re: S4 hibernate support for FreeBSD

From: Warner Losh <imp_at_bsdimp.com>
Date: Wed, 27 Aug 2025 16:02:04 UTC
On Wed, Aug 27, 2025 at 7:20 AM obiwac <obiwac@freebsd.org> wrote:

> Hi all!
>
> The FreeBSD Foundation is beginning work on adding S4 (hibernate) support
> to
> FreeBSD. Currently we have S4BIOS support, but no hibernate support on
> modern
> platforms.
>
> We have started exploring what would be required to bring S4 to FreeBSD and
> kib@ has written up some initial findings, along with some open design
> questions. We'd like to share this early document with the community to
> gather
> feedback or identify any pitfalls, and generally open the discussion around
> hibernate.
>
> You can find the document here (which anyone with the link can add comments
> to):
>
>
> https://docs.google.com/document/d/1L6b-gEUQcbRMfSuKIytMPlsZfa_q6HCZmmYtN4ysg1M
>
> At this stage, we're mostly setting our focus on something similar to
> the approach taken by OpenBSD's hibernate implementation. We're also
> thinking of giving a lot of the responsibility for reloading the
> hibernated kernel to loader(8), as opposed to e.g. Linux, which first
> boots a preliminary kernel which then goes on to load the hibernated
> kernel.
>
> But nothing is set in stone. We're mostly hoping to hear from people with
> prior
> experience in this area, so feedback/comments are welcome!
>
> Feedback received before the end of September will be easiest to
> incorporate.
> Please add comments directly into the document shared above, or respond to
> this
> email.


I added specific comments to the document.

LinuxBoot has had to solve a small subset of this problem, at least the
trampoline into the kernel. We set things up to look just like the start
entry point, and then change the memory map to be what the kernel wants.
For resume, you'd need to do something similar. loader(8) will just run on
boot, but there's some information that we'll need to have in the headers
because the loader is going to have to re-load things... It will need to do
a pass over the saved memory (so a summary map early in the dump would make
the incompatibility determination faster). So the image will need to be
accessible to EFI (don't bother with BIOS boot support; the BIOS loader is
already full). This will require a pass over the disk partitions to read
the headers for the partition(s) we might look at. So if we dump this in
the 'core dump' format that we use for panics, we'll need to tag it so that
the loader doesn't try to resume a core dump from a panic. The loader will
need to pass the EFI memory map through the trampoline. LinuxBoot does this
as another metadata item (just like ACPI), which suggests that we'd want to
build a metadata bundle from the driver that allows us to pass many
different types of data bundles to the kernel. We moved away from bootinfo
a long time ago in favor of metadata, and I think we'll need to do
something akin to it for resuming the kernel; the resume code in the kernel
will need to look at that (I didn't put these details in the doc).
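
To make the metadata idea concrete, here's a rough sketch of what the
kernel side might look like. The preload_search_*() calls are the existing
preload metadata API (it's how the amd64 startup code finds the EFI map
today); MODINFOMD_RESUME_INFO and its value are made up for illustration.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/linker.h>	/* preload_search_by_type(), preload_search_info() */

/*
 * Hypothetical tag for the resume descriptor; the real one would live in
 * machine/metadata.h next to MODINFOMD_EFI_MAP.  The value is made up.
 */
#define MODINFOMD_RESUME_INFO   0x0010

/*
 * Sketch only: find the resume descriptor in the preload metadata the same
 * way the amd64 startup code already finds the EFI memory map.
 */
static caddr_t
resume_find_metadata(void)
{
        caddr_t kmdp;

        kmdp = preload_search_by_type("elf kernel");
        if (kmdp == NULL)
                kmdp = preload_search_by_type("elf64 kernel");
        if (kmdp == NULL)
                return (NULL);
        return (preload_search_info(kmdp,
            MODINFO_METADATA | MODINFOMD_RESUME_INFO));
}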

I'd very strongly suggest you don't reinvent the newbus resume methods. They
are your friends, though they might need augmenting to hint at how we're
resuming (though many drivers already assume that they have to redo
everything). The PCI bus already saves / restores the BARs because those
disappear in D3 state. PCI will likely have to restore bus numbers, and/or
finish jhb's work to grow/shrink them, the bridge windows, etc. Hotplug
PCIe may change between suspend and resume, and this can affect bus
numbering from the firmware. The PCI resume code may need to grow support
for this. You may need to manage links, etc. here, but that's usually
fairly automatic. A lot would depend on what the firmware does with them.
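
As a concrete (if trivial) sketch of what I mean, a driver just fills in
the suspend/resume slots it already has; everything here except the
existing DEVMETHOD / bus_generic_* interface is a placeholder:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/module.h>
#include <sys/bus.h>

/* "foo" is a placeholder, not an existing driver. */
static int
foo_suspend(device_t dev)
{
        /* Quiesce the hardware, then let any children suspend. */
        return (bus_generic_suspend(dev));
}

static int
foo_resume(device_t dev)
{
        /*
         * Reprogram the device from the softc; assume registers (and, for
         * PCI devices, the BARs) were lost while the machine was off.
         */
        return (bus_generic_resume(dev));
}

static device_method_t foo_methods[] = {
        /* probe, attach, detach, etc. elided for brevity */
        DEVMETHOD(device_suspend,       foo_suspend),
        DEVMETHOD(device_resume,        foo_resume),
        DEVMETHOD_END
};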

To the extent you can do it, even to the extent of heroics, you don't want
to destroy and recreate geom_disks. The upper layers just can't cope. It
would be a huge lift to make them cope, if you'd even be able to do it.
There just isn't enough information to support departure / arrival. There's
a GEOM layer thing that does this, but you have to interpose it before the
mount for it to be effective. You can defer destroying the soft state
(geom_disk, etc.) for a time to allow for transient failures today (umass
does this IIRC), but once destroyed, the upper layers are orphaned and
there's no way to recreate them.

Network drivers are generally in good shape, and there are good mechanisms
to reconstruct their state on a resume, though it's by no means perfect.
You'll run into the well-known problems, like "I suspended on network ROMEO
and resumed with only network MUST_DIE available." It's no different from
suspend to memory, though.

Audio is the same, etc. Even USB is the same: it would do the same things
that it does now, but it does some things via hardware state, IIRC, so that
might change.

The loader would need some changes here to reload the state, do the
trampolines, etc. I made a lot of comments on that in the doc. LinuxBoot
has done this sort of thing for a while (the loader hands the memory and a
trampoline to the Linux kernel; Linux reboots by tearing the CPU state down
to 'boot ready' and then jumps to the trampoline). You'd likely have some
lua work to add menu items for this: you'd need to expose this state to
lua, and likely give the user a chance to abort a reload, a chance to poke
around, and maybe even a chance to interact with the loader and then still
choose to resume. You'll need to pass a bundle of data from the boot loader
to the kernel. I'd suggest that we do this via the current metadata
mechanism (or something similar), because bootinfo was a bad idea in the
90s and I'd hate to remake that mistake.

There are some decisions that the kernel can make, and some that the loader
will have to make. It's not clear to me what the kernel should do if it
decides that it can't resume. It's unlikely to be able to write anything to
the resume image early enough to keep the loader from looping. So do we
need the ability to return from this trampoline? Or does the boot loader
need to make a note in a firmware environment variable that 'we tried to
resume' so it doesn't loop if the kernel decides too late that it should
just reboot? Or does the kernel read the tea-leaves, note it can't resume,
and then act like the boot loader: harvest the metadata the resume path
sent to it, maybe augment it, and jump back to its start routine or an
alternative start vector? Given that we write to BSS, I'm guessing that
last path would be a non-starter: resume the current kernel, or reboot with
an indication that we have a corrupt image (I've not looked at ACPI to see
if it supports that directly).
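
For the loader-to-kernel bundle, the existing hook is file_addmetadata(),
which is what the EFI loader's bi_load() path already uses to attach things
like the EFI memory map to the kernel image. A rough sketch, where the tag,
its value, and struct resume_md are all invented for illustration:

#include <stand.h>
#include <sys/param.h>

#include "bootstrap.h"          /* struct preloaded_file, file_addmetadata() */

/* Hypothetical tag and descriptor; the names and the value are made up. */
#define MODINFOMD_RESUME_INFO   0x0010

struct resume_md {
        uint64_t        rmd_image_pa;   /* where the saved image was placed */
        uint64_t        rmd_image_size;
};

/*
 * Attach the resume descriptor to the kernel's preload metadata, the same
 * way the EFI loader already attaches the EFI memory map.
 */
static void
resume_add_metadata(struct preloaded_file *kfp, struct resume_md *rmd)
{
        file_addmetadata(kfp, MODINFOMD_RESUME_INFO, sizeof(*rmd), rmd);
}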

And so let's say we are resuming, we have a watchdog enabled, and the
watchdog fires: what do we do? The loader would need to grow support for
detecting that. Right now we just disable the watchdog and don't set it
again before jumping to the kernel, so maybe this isn't a worry.

Don't bother with boot1.efi support. I plan on removing it soon.

I'd also make it a non-goal to support resume in a chain-booted environment.

I'd also include a system UUID in the resume dump, and refuse, by default,
to use images that have a mismatch. This allows us to suspend to disk, then
remove that disk (say the laptop is really dead) and attach it to another
machine, without that other machine getting confused if it reboots while
the disk is attached (which can be days if we're mining a subset of it to
preserve). The reader code likely won't care, so it could even read from
files in a filesystem (though creating them might be tricky for the kernel,
it might be helpful for testing).
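
A sketch of what that default-refuse check could look like on the kernel
side (the loader could do the same with its own getenv()). The
smbios.system.uuid variable is already set by the loader today; the header
field it's compared against here is hypothetical:

#include <sys/param.h>
#include <sys/systm.h>          /* kern_getenv(), freeenv(), strcasecmp() */

/*
 * Sketch of the default-refuse check.  hdr_uuid would come from a
 * hypothetical field in the hibernate image header; smbios.system.uuid is
 * already set by the loader today.
 */
static bool
resume_uuid_matches(const char *hdr_uuid)
{
        char *sysuuid;
        bool match;

        sysuuid = kern_getenv("smbios.system.uuid");
        if (sysuuid == NULL)
                return (false);         /* nothing to compare against */
        match = (strcasecmp(sysuuid, hdr_uuid) == 0);
        freeenv(sysuuid);
        return (match);
}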

Finally, for the things that don't work now with suspend to memory, I'd
suggest deferring those issues until all the other stuff that does work now
is working with suspend to disk.

I'd avoid hard-coding the decisions in the loader. I'd suggest we'd want
flexibility for a variety of scenarios that you can't anticipate today (not
least, kernel resume bugs that cause a crash without writing some reboot
cause reason). I'm not sure if you'd need to squelch core dumps in the
early stages of this process or not. During early boot we solve this
problem by ignoring dumpdev until devices are back. Since the core dump
process goes through the drivers, we may need to avoid it while suspended.
Even though we poll in this path, we assume that the device has been
initialized and don't check each write.

The Linux approach of having a resume kernel is interesting, and maybe
shouldn't be discounted given the kexec work that's lurking in Phabricator.
Here, the resume kernel knows what memory it can use, does its thing, and
then 'kexecs' back to the old kernel (this is from memory of a very old
conference presentation, I've not verified this: it might actually work
like the crash dump kernel). We also have a crash dump kernel waiting in
the wings to write crash dumps, and that might help us freeze the system,
boot into a full kernel, write out the system memory, and reboot. This
would avoid having to do it in a polling mode and might open up
features...

Finally, there will be issues you don't anticipate that will cause trouble.
This is typical, but I suspect even more so here.

Warner