Re: init / supervisor in jail

From: Doug Rabson <dfr_at_rabson.org>
Date: Tue, 11 Nov 2025 10:36:08 UTC
On Tue, 11 Nov 2025 at 08:10, Konstantin Belousov <kostikbel@gmail.com>
wrote:

> On Mon, Nov 10, 2025 at 11:16:01AM -0800, James Gritton wrote:
> > On 2025-11-10 04:27, Andriy Gapon wrote:
> > > I played a little bit with OCI containers and podman.
> > > I had a hiccup with one specific container created for Docker / Linux.
> > > Its difference from other containers is that it uses multiple daemons
> > > and a supervisor process to take care of them.  That particular
> > > supervisor is another variation of "advanced init", it's called s6.
> > > Apparently, it is relatively popular for container use (not sure about
> > > host systems).  Probably other alternatives can be / are used for that
> > > purpose as well.
> > >
> > > I think that this is what a supervisor in a container needs:
> > > 1. its PID is 1;
> > > 2. orphaned processes get re-parented to it.
> > >
> > > I think that (1) is not a hard requirement, but it's an easy way to
> > > check if the process would be able to work as init.
> > > Also, some other processes might expect to find init at PID 1, but I am
> > > not sure about that.
> > >
> > > (2) is important for doing the supervising (at least, when
> > > procctl(PROC_REAP*) is not used).
> > >
> > > I think that on Linux they have separate PID namespace per container, so
> > > the first process to run naturally gets PID 1.
> > >
> > > I think that per-container PID namespace may be an overkill.
> > > Maybe there is a way to make PID 1 special without going that way.
> > >
> > > E.g., a jail could record the first process it runs.
> > > We can patch up getpid() to return 1 for that process.
> > > Also, we could patch up the process lookup to return the first process
> > > in the jail for PID 1.
> > >
> > > Re-parenting to the "jail init" sounds harder but should be possible as
> > > well (e.g., using PROC_REAP).
> This is why PROC_REAP was initially implemented: to allow something to
> manage zombies of all its descendants, for surrogate init processes.
> Later it appeared that at least timeout(1) benefits from it as well.
>
> A side note: machinery to reliably signal all specific descendants of
> the reaper is way too complicated.
>
> > >
> > > Not sure what to do if the "jail init" dies... should all processes in
> > > the jail get killed and the jail should die as well (unless persistent)?
> > >
> > > This proposal sounds like a kludge but it could be a shortcut to support
> > > more Linux containers and to allow similar FreeBSD jails / containers
> > > with alternative init-s / supervisors.
> >
> > Far from being a kludge, I think it's a feature we need, and one at the top
> > of my list.  Forcing it to look like PID 1 from jailed perspective is
> > definitely doable (and something I'd done outside of the project a decade
> > ago).  In addition to those two requirements, I would add one that answers
> > your last question:
> >
> > 3. signals to init and reboot(2) work as they would on the host side.
> >
> > A jailed reboot would kill all processes and restart rc, and possibly do
> > other kernel-side cleanups yet to be clearly defined.  A jailed halt would
> > remove the jail.  A jailed single-user mode could exist where instead of
> > init spawning a shell, it just sits around while the system has a chance to
> > jexec into it.
> >
> > init handles various signals by rebooting/halting/etc, and it should be able
> > to do that as it does now, by calling reboot(2), directing the kernel to do
> > what it needs to with the jail.  If init goes away, it's probably like a
> > halt and removes the jail.
>
> I completely disagree with this design, I insist that init(8) should
> stay as full system init, and reboot(2) should be kept as the machine
> reboot.
>
> For jail-contained inits, it should be a separate/dedicated implementation
> of init.  It would be aware of its usage model, in particular, it should
> proclaim itself the reaper, it should use reaper signalling facilities
> for killing processes when shutting the container down (not ever tweaking
> the reboot(2)).  It must not have the ugly protection against signals
> delivery we have for real init.
>

Almost a side note, but we do have catatonit in the ports tree. It uses
PROC_REAP to clean up zombie processes and doesn't need to be PID 1. It
would be nice to have a more full-featured, BSD-licensed jailable init
though.
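
For anyone following along, the reaper approach can be sketched roughly as
below. This is only a minimal illustration, not how catatonit or any existing
init is actually written. The function names become_reaper(), reap_zombies()
and signal_descendants() are made up for this example. The FreeBSD branch
uses procctl(2) with PROC_REAP_ACQUIRE and PROC_REAP_KILL; the Linux branch
(prctl(2) with PR_SET_CHILD_SUBREAPER) is there only so the sketch also
builds outside FreeBSD, and it has no equivalent of PROC_REAP_KILL.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>
#if defined(__FreeBSD__)
#include <sys/procctl.h>
#elif defined(__linux__)
#include <sys/prctl.h>
#endif

/*
 * Ask the kernel to re-parent orphaned descendants to this process
 * instead of to PID 1.  Returns 0 on success, -1 with errno set.
 * (Hypothetical helper name.)
 */
int
become_reaper(void)
{
#if defined(__FreeBSD__)
	return (procctl(P_PID, getpid(), PROC_REAP_ACQUIRE, NULL));
#elif defined(__linux__)
	return (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0));
#else
	errno = ENOSYS;
	return (-1);
#endif
}

/*
 * Collect every child that has already exited, without blocking.
 * Returns the number of zombies reaped.  (Hypothetical helper name.)
 */
int
reap_zombies(void)
{
	int n = 0;

	while (waitpid(-1, NULL, WNOHANG) > 0)
		n++;
	return (n);
}

/*
 * Send `sig` to all descendants of this reaper, e.g. when shutting the
 * container down.  FreeBSD only; elsewhere it fails with ENOSYS.
 * (Hypothetical helper name.)
 */
int
signal_descendants(int sig)
{
#if defined(__FreeBSD__)
	struct procctl_reaper_kill rk;

	memset(&rk, 0, sizeof(rk));
	rk.rk_sig = sig;
	rk.rk_flags = 0;	/* all descendants, not only direct children */
	return (procctl(P_PID, getpid(), PROC_REAP_KILL, &rk));
#else
	(void)sig;
	errno = ENOSYS;
	return (-1);
#endif
}
```

A supervisor would presumably call become_reaper() once at startup, run
reap_zombies() from its SIGCHLD handling path, and use something like
signal_descendants(SIGTERM) on shutdown rather than touching reboot(2).
None of this depends on the process being PID 1.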