Re: init / supervisor in jail

From: James Gritton <jamie_at_freebsd.org>
Date: Tue, 11 Nov 2025 18:04:01 UTC

On 2025-11-11 00:10, Konstantin Belousov wrote:
> On Mon, Nov 10, 2025 at 11:16:01AM -0800, James Gritton wrote:
>> On 2025-11-10 04:27, Andriy Gapon wrote:
>> > I played a little bit with OCI containers and podman.
>> > I had a hiccup with one specific container created for Docker / Linux.
>> > Its difference from other containers is that it uses multiple daemons
>> > and a supervisor process to take care of them.  That particular
>> > supervisor is another variation of "advanced init", it's called s6.
>> > Apparently, it is relatively popular for container use (not sure about
>> > host systems).  Probably other alternatives can be / are used for that
>> > purpose as well.
>> >
>> > I think that this is what a supervisor in a container needs:
>> > 1. its PID is 1;
>> > 2. orphaned processes get re-parented to it.
>> >
>> > I think that (1) is not a hard requirement, but it's an easy way to
>> > check if the process would be able to work as init.
>> > Also, some other processes might expect to find init at PID 1, but I am
>> > not sure about that.
>> >
>> > (2) is important for doing the supervising (at least, when
>> > procctl(PROC_REAP*) is not used).
>> >
>> > I think that on Linux they have separate PID namespace per container, so
>> > the first process to run naturally gets PID 1.
>> >
>> > I think that per-container PID namespace may be an overkill.
>> > Maybe there is a way to make PID 1 special without going that way.
>> >
>> > E.g., a jail could record the first process it runs.
>> > We can patch up getpid() to return 1 for that process.
>> > Also, we could patch up the process lookup to return the first process
>> > in the jail for PID 1.
>> >
>> > Re-parenting to the "jail init" sounds harder but should be possible as
>> > well (e.g., using PROC_REAP).
> This is why PROC_REAP was initially implemented: to allow something to
> manage zombies of all its descendants, for surrogate init processes.
> Later it appeared that at least timeout(1) benefits from it as well.

Good, that would make it that much easier to implement.  It wasn't
there when I did this in the early 2000s (I said a decade ago, but
time passes faster than I give it credit for).
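
For anyone following along who hasn't used the reaper facility, here
is a rough userland sketch of requirement (2): a process acquires
reaper status, an intermediate child exits, and the orphaned
grandchild is then reparented to the reaper, which collects its exit
status.  The function names become_reaper() and reap_orphan_demo() are
made up for illustration, and the prctl() branch is only a Linux
stand-in so the sketch builds outside FreeBSD.  This is not what a
jail init would look like, just the bare mechanism.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#ifdef __FreeBSD__
#include <sys/procctl.h>
#else
#include <sys/prctl.h>	/* assumption: Linux stand-in for non-FreeBSD builds */
#endif

/* Make the calling process the reaper for all of its descendants. */
static int
become_reaper(void)
{
#ifdef __FreeBSD__
	return (procctl(P_PID, getpid(), PROC_REAP_ACQUIRE, NULL));
#else
	return (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0));
#endif
}

/*
 * Hypothetical demo: returns 0 if an orphaned grandchild was reparented
 * to us and we could wait for it, -1 otherwise.
 */
int
reap_orphan_demo(void)
{
	int pidpipe[2], gopipe[2], status;
	pid_t child, gc;
	char c;

	if (become_reaper() != 0)
		return (-1);
	if (pipe(pidpipe) != 0 || pipe(gopipe) != 0)
		return (-1);

	child = fork();
	if (child == -1)
		return (-1);
	if (child == 0) {
		/* Intermediate process: spawn the grandchild, then exit. */
		close(pidpipe[0]);
		gc = fork();
		if (gc == 0) {
			/* Grandchild: block until the reaper releases us. */
			close(pidpipe[1]);
			close(gopipe[1]);
			if (read(gopipe[0], &c, 1) < 0)
				_exit(1);
			_exit(7);
		}
		/* Report the grandchild's PID to the reaper. */
		if (write(pidpipe[1], &gc, sizeof(gc)) != (ssize_t)sizeof(gc))
			_exit(1);
		_exit(0);
	}

	/* Reaper: learn the grandchild's PID, then reap the intermediate. */
	close(pidpipe[1]);
	close(gopipe[0]);
	if (read(pidpipe[0], &gc, sizeof(gc)) != (ssize_t)sizeof(gc) || gc <= 0)
		return (-1);
	close(pidpipe[0]);
	if (waitpid(child, &status, 0) != child)
		return (-1);

	/* The grandchild is now an orphan; let it exit and wait for it. */
	close(gopipe[1]);
	if (waitpid(gc, &status, 0) != gc)
		return (-1);
	return (WIFEXITED(status) && WEXITSTATUS(status) == 7 ? 0 : -1);
}
```

Calling reap_orphan_demo() from main() should return 0 when the
orphan lands on the reaper.  Without the become_reaper() call, the
second waitpid() would normally fail with ECHILD, because the orphan
would go to the real init instead.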

> A side note: machinery to reliably signal all specific descendants of
> the reaper is way too complicated.
> 
>> >
>> > Not sure what to do if the "jail init" dies... should all processes in
>> > the jail get killed and the jail should die as well (unless persistent)?
>> >
>> > This proposal sounds like a kludge but it could be a shortcut to support
>> > more Linux containers and to allow similar FreeBSD jails / containers
>> > with alternative init-s / supervisors.
>> 
>> Far from being a kludge, I think it's a feature we need, and one at the
>> top of my list.  Forcing it to look like PID 1 from jailed perspective is
>> definitely doable (and something I'd done outside of the project a decade
>> ago).  In addition to those two requirements, I would add one that answers
>> your last question:
>>
>> 3. signals to init and reboot(2) work as they would on the host side.
>>
>> A jailed reboot would kill all processes and restart rc, and possibly do
>> other kernel-side cleanups yet to be clearly defined.  A jailed halt would
>> remove the jail.  A jailed single-user mode could exist where instead of
>> init spawning a shell, it just sits around while the system has a chance to
>> jexec into it.
>>
>> init handles various signals by rebooting/halting/etc, and it should be able
>> to do that as it does now, by calling reboot(2), directing the kernel to do
>> what it needs to with the jail.  If init goes away, it's probably like a
>> halt and removes the jail.
> 
> I completely disagree with this design, I insist that init(8) should
> stay as full system init, and reboot(2) should be kept as the machine
> reboot.

Why?

With the system call hooks, init(8) was nearly 100% identical.
There was a place or two that needed to be context-aware, which
is very easy to add.  It seems silly to re-implement init with
just a couple of changes.

reboot(2) wouldn't be the first system call to act differently for
jailed access.  Is doing a useful thing for jails worse than just
doing nothing?  I see in this the beauty of a container that moves
that much closer to feeling like a virtual machine, while retaining
its lightweight nature.  The ideal is that jails "just work," and
working at the syscall level is part of that.

> For jail-contained inits, it should be a separate/dedicated implementation
> of init.  It would be aware of its usage model, in particular, it should
> proclaim itself the reaper, it should use reaper signalling facilities
> for killing processes when shutting the container down (not ever tweaking
> the reboot(2)).  It must not have the ugly protection against signals
> delivery we have for real init.

I haven't looked into that protection, so I'm neutral on that for
now.  It makes sense to exempt virtual init from virtual killall for
example, but I wouldn't expect to just not deliver certain signals.
I don't recall how I dealt with that specific issue 20 years ago.

- Jamie