Re: init / supervisor in jail
- Reply: Konstantin Belousov : "Re: init / supervisor in jail"
- In reply to: Konstantin Belousov : "Re: init / supervisor in jail"
Date: Tue, 11 Nov 2025 18:04:01 UTC
On 2025-11-11 00:10, Konstantin Belousov wrote:
> On Mon, Nov 10, 2025 at 11:16:01AM -0800, James Gritton wrote:
>> On 2025-11-10 04:27, Andriy Gapon wrote:
>> > I played a little bit with OCI containers and podman.
>> > I had a hiccup with one specific container created for Docker / Linux.
>> > Its difference from other containers is that it uses multiple daemons
>> > and a supervisor process to take care of them. That particular
>> > supervisor is another variation of "advanced init", it's called s6.
>> > Apparently, it is relatively popular for container use (not sure about
>> > host systems). Probably other alternatives can be / are used for that
>> > purpose as well.
>> >
>> > I think that this is what a supervisor in a container needs:
>> > 1. its PID is 1;
>> > 2. orphaned processes get re-parented to it.
>> >
>> > I think that (1) is not a hard requirement, but it's an easy way to
>> > check if the process would be able to work as init.
>> > Also, some other processes might expect to find init at PID 1, but I am
>> > not sure about that.
>> >
>> > (2) is important for doing the supervising (at least, when
>> > procctl(PROC_REAP*) is not used).
>> >
>> > I think that on Linux they have separate PID namespace per container, so
>> > the first process to run naturally gets PID 1.
>> >
>> > I think that per-container PID namespace may be an overkill.
>> > Maybe there is a way to make PID 1 special without going that way.
>> >
>> > E.g., a jail could record the first process it runs.
>> > We can patch up getpid() to return 1 for that process.
>> > Also, we could patch up the process lookup to return the first process
>> > in the jail for PID 1.
>> >
>> > Re-parenting to the "jail init" sounds harder but should be possible as
>> > well (e.g., using PROC_REAP).
> This is why PROC_REAP was initially implemented: to allow something to
> manage zombies of all its descendants, for surrogate init processes.
> Later it appeared that at least timeout(1) benefits from it as well.

Good, that would make it that much easier to implement. It wasn't there
when I did this in the early 2000s (I said a decade ago, but time passes
faster than I give it credit for).

> A side note: machinery to reliably signal all specific descendants of
> the reaper is way too complicated.
>
>> >
>> > Not sure what to do if the "jail init" dies... should all processes in
>> > the jail get killed and the jail should die as well (unless persistent)?
>> >
>> > This proposal sounds like a kludge but it could be a shortcut to support
>> > more Linux containers and to allow similar FreeBSD jails / containers
>> > with alternative init-s / supervisors.
>>
>> Far from being a kludge, I think it's a feature we need, and one at the
>> top of my list. Forcing it to look like PID 1 from the jailed perspective
>> is definitely doable (and something I'd done outside of the project a
>> decade ago). In addition to those two requirements, I would add one that
>> answers your last question:
>>
>> 3. signals to init and reboot(2) work as they would on the host side.
>>
>> A jailed reboot would kill all processes and restart rc, and possibly do
>> other kernel-side cleanups yet to be clearly defined. A jailed halt would
>> remove the jail. A jailed single-user mode could exist where instead of
>> init spawning a shell, it just sits around while the system has a chance
>> to jexec into it.
>>
>> init handles various signals by rebooting/halting/etc, and it should be
>> able to do that as it does now, by calling reboot(2), directing the kernel
>> to do what it needs to with the jail. If init goes away, it's probably
>> like a halt and removes the jail.
>
> I completely disagree with this design, I insist that init(8) should
> stay as full system init, and reboot(2) should be kept as the machine
> reboot.

Why? With the system call hooks, init(8) was nearly 100% identical.
There was a place or two that needed to be context-aware, which is very
easy to add. It seems silly to re-implement init with just a couple of
changes.

reboot(2) wouldn't be the first system call to act differently for
jailed access. Is doing a useful thing for jails worse than just doing
nothing? I see in this the beauty of a container that moves that much
closer to feeling like a virtual machine, while retaining its
lightweight nature. The ideal is that jails "just work," and working at
the syscall level is part of that.

> For jail-contained inits, it should be a separate/dedicated
> implementation of init. It would be aware of its usage model, in
> particular, it should proclaim itself the reaper, it should use reaper
> signalling facilities for killing processes when shutting the container
> down (not ever tweaking the reboot(2)). It must not have the ugly
> protection against signals delivery we have for real init.

I haven't looked into that protection, so I'm neutral on that for now.
It makes sense to exempt virtual init from virtual killall, for example,
but I wouldn't expect to just not deliver certain signals. I don't
recall how I dealt with that specific issue 20 years ago.

- Jamie
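
For readers following the thread, requirement (2) and the "proclaim itself the reaper" suggestion can be illustrated with a small userland sketch. This is not code from anyone in the discussion: the function names become_reaper() and reap_orphan_demo() are made up for illustration, and the Linux branch using prctl(PR_SET_CHILD_SUBREAPER) is included only as an assumption so the sketch also builds outside FreeBSD. It shows a process acquiring reaper status, after which an orphaned grandchild is re-parented to it instead of to the real init.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <unistd.h>
#if defined(__FreeBSD__)
#include <sys/procctl.h>
#elif defined(__linux__)
#include <sys/prctl.h>
#endif

/* Hypothetical helper: make the calling process the reaper of its
 * descendants, so orphans are re-parented to it rather than to PID 1. */
static int become_reaper(void)
{
#if defined(__FreeBSD__)
    return procctl(P_PID, getpid(), PROC_REAP_ACQUIRE, NULL);
#elif defined(__linux__)
    /* Assumption: closest Linux analogue, used only for portability. */
    return prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0);
#else
    errno = ENOSYS;
    return -1;
#endif
}

/* Hypothetical demo: returns the exit code of an orphaned grandchild
 * that this process reaped, or -1 if no orphan was collected. */
int reap_orphan_demo(void)
{
    /* EBUSY on FreeBSD means we already hold reaper status. */
    if (become_reaper() != 0 && errno != EBUSY)
        return -1;

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* Intermediate process: spawn a grandchild, then exit at once
         * so that the grandchild becomes an orphan. */
        pid_t grandchild = fork();
        if (grandchild == 0) {
            sleep(1);       /* outlive the intermediate parent */
            _exit(42);
        }
        _exit(grandchild < 0 ? 1 : 0);
    }

    /* Supervisor loop: collect every descendant until none are left. */
    int orphan_code = -1;
    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, 0);
        if (pid < 0) {
            if (errno == EINTR)
                continue;
            break;          /* ECHILD: nothing left to reap */
        }
        /* Anything other than our direct child was re-parented to us. */
        if (pid != child && WIFEXITED(status))
            orphan_code = WEXITSTATUS(status);
    }
    return orphan_code;
}
```

Calling reap_orphan_demo() from a main() should yield 42 when reaper status was acquired; without it, the grandchild would be inherited by the system init and the loop would end after the direct child alone. Killing all descendants at shutdown (PROC_REAP_KILL on FreeBSD) is deliberately left out of this sketch.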