Re: init / supervisor in jail

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Tue, 11 Nov 2025 08:10:15 UTC
On Mon, Nov 10, 2025 at 11:16:01AM -0800, James Gritton wrote:
> On 2025-11-10 04:27, Andriy Gapon wrote:
> > I played a little bit with OCI containers and podman.
> > I had a hiccup with one specific container created for Docker / Linux.
> > Its difference from other containers is that it uses multiple daemons
> > and a supervisor process to take care of them.  That particular
> > supervisor is another variation of "advanced init", it's called s6.
> > Apparently, it is relatively popular for container use (not sure about
> > host systems).  Probably other alternatives can be / are used for that
> > purpose as well.
> > 
> > I think that this is what a supervisor in a container needs:
> > 1. its PID is 1;
> > 2. orphaned processes get re-parented to it.
> > 
> > I think that (1) is not a hard requirement, but it's an easy way to
> > check if the process would be able to work as init.
> > Also, some other processes might expect to find init at PID 1, but I am
> > not sure about that.
> > 
> > (2) is important for doing the supervising (at least, when
> > procctl(PROC_REAP*) is not used) .
> > 
> > I think that on Linux they have separate PID namespace per container, so
> > the first process to run naturally gets PID 1.
> > 
> > I think that per-container PID namespace may be an overkill.
> > Maybe there is a way to make PID 1 special without going that way.
> > 
> > E.g., a jail could record the first process it runs.
> > We can patch up getpid() to return 1 for that process.
> > Also, we could patch up the process lookup to return the first process
> > in the jail for PID 1.
> > 
> > Re-parenting to the "jail init" sounds harder but should be possible as
> > well (e.g., using PROC_REAP).
This is why PROC_REAP was initially implemented: to let a surrogate
init process manage the zombies of all its descendants.
Later it turned out that at least timeout(1) benefits from it as well.

A side note: the machinery to reliably signal all specific descendants
of the reaper is way too complicated.

> > 
> > Not sure what to do if the "jail init" dies... should all processes in
> > the jail get killed and the jail should die as well (unless persistent)?
> > 
> > This proposal sounds like a kludge but it could be a shortcut to support
> > more Linux containers and to allow similar FreeBSD jails / containers
> > with alternative init-s / supervisors.
> 
> Far from being a kludge, I think it's a feature we need, and one at the top
> of my list.  Forcing it to look like PID 1 from jailed perspective is
> definitely doable (and something I'd done outside of the project a decade
> ago).  In addition to those two requirements, I would add one that answers
> your last question:
> 
> 3. signals to init and reboot(2) work as they would on the host side.
> 
> A jailed reboot would kill all processes and restart rc, and possibly do
> other kernel-side cleanups yet to be clearly defined.  A jailed halt would
> remove the jail.  A jailed single-user mode could exist where instead of
> init spawning a shell, it just sits around while the system has a chance to
> jexec into it.
> 
> init handles various signals by rebooting/halting/etc, and it should be able
> to do that as it does now, by calling reboot(2), directing the kernel to do
> what it needs to with the jail.  If init goes away, it's probably like a
> halt and removes the jail.

I completely disagree with this design.  I insist that init(8) should
stay the full system init, and reboot(2) should remain the machine
reboot.

Jail-contained inits should be a separate, dedicated implementation of
init.  It would be aware of its usage model: in particular, it should
proclaim itself the reaper and use the reaper signalling facilities to
kill processes when shutting the container down (never tweaking
reboot(2)).  It must not have the ugly protection against signal
delivery that we have for the real init.
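
To make this concrete, here is a minimal sketch of the core of such a
dedicated init, not a finished implementation.  The function names
(become_reaper, kill_descendants, spawn_orphan, reap_all) are
hypothetical and chosen only for illustration.  The FreeBSD branches use
procctl(2) with PROC_REAP_ACQUIRE and PROC_REAP_KILL; the Linux branch
(prctl(2) with PR_SET_CHILD_SUBREAPER) is an assumption added only so
the sketch also builds there, and it has no counterpart for
PROC_REAP_KILL.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>
#if defined(__FreeBSD__)
#include <sys/procctl.h>
#elif defined(__linux__)
#include <sys/prctl.h>
#endif

/* Proclaim the calling process the reaper of its descendants. */
int
become_reaper(void)
{
#if defined(__FreeBSD__)
	return (procctl(P_PID, getpid(), PROC_REAP_ACQUIRE, NULL));
#elif defined(__linux__)
	/* Assumption: rough Linux analog, only for building elsewhere. */
	return (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0));
#else
	errno = ENOSYS;
	return (-1);
#endif
}

/* Shutdown path: signal every descendant through the reaper facility. */
int
kill_descendants(int sig)
{
#if defined(__FreeBSD__)
	struct procctl_reaper_kill rk;

	memset(&rk, 0, sizeof(rk));
	rk.rk_sig = sig;	/* rk_flags == 0: all current descendants */
	return (procctl(P_PID, getpid(), PROC_REAP_KILL, &rk));
#else
	(void)sig;
	errno = ENOSYS;		/* no equivalent in this sketch */
	return (-1);
#endif
}

/* Demo helper: fork a child that leaves an orphaned grandchild behind. */
int
spawn_orphan(void)
{
	pid_t pid;
	int fds[2];
	char c;

	pid = fork();
	if (pid == -1)
		return (-1);
	if (pid == 0) {
		if (pipe(fds) == -1)
			_exit(1);
		switch (fork()) {
		case -1:
			_exit(1);
		case 0:
			/* Grandchild: wait until the child's exit closes the pipe. */
			close(fds[1]);
			_exit(read(fds[0], &c, 1) == 0 ? 0 : 1);
		default:
			_exit(0);	/* child exits, orphaning the grandchild */
		}
	}
	return (0);
}

/* Collect zombies until no descendants remain; return how many were reaped. */
int
reap_all(void)
{
	int n = 0;

	for (;;) {
		if (waitpid(-1, NULL, 0) > 0)
			n++;
		else if (errno != EINTR)
			break;	/* ECHILD: nothing left to wait for */
	}
	return (n);
}
```

A real jail init would call become_reaper() once at startup, sit in a
loop like reap_all(), and call kill_descendants(SIGTERM) when asked to
shut the container down.  After become_reaper() and spawn_orphan(),
reap_all() collects both the child and the re-parented grandchild.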