Re: init / supervisor in jail
- Reply: Konstantin Belousov : "Re: init / supervisor in jail"
- In reply to: Konstantin Belousov : "Re: init / supervisor in jail"
Date: Wed, 12 Nov 2025 01:49:21 UTC
On 2025-11-11 16:41, Konstantin Belousov wrote:
> On Tue, Nov 11, 2025 at 10:04:01AM -0800, James Gritton wrote:
>> On 2025-11-11 00:10, Konstantin Belousov wrote:
>> > On Mon, Nov 10, 2025 at 11:16:01AM -0800, James Gritton wrote:
>> > > On 2025-11-10 04:27, Andriy Gapon wrote:
>> > > > I played a little bit with OCI containers and podman.
>> > > > I had a hiccup with one specific container created for Docker / Linux.
>> > > > Its difference from other containers is that it uses multiple daemons
>> > > > and a supervisor process to take care of them. That particular
>> > > > supervisor is another variation of "advanced init", it's called s6.
>> > > > Apparently, it is relatively popular for container use (not sure about
>> > > > host systems). Probably other alternatives can be / are used for that
>> > > > purpose as well.
>> > > >
>> > > > I think that this is what a supervisor in a container needs:
>> > > > 1. its PID is 1;
>> > > > 2. orphaned processes get re-parented to it.
>> > > >
>> > > > I think that (1) is not a hard requirement, but it's an easy way to
>> > > > check if the process would be able to work as init.
>> > > > Also, some other processes might expect to find init at PID 1, but I am
>> > > > not sure about that.
>> > > >
>> > > > (2) is important for doing the supervising (at least, when
>> > > > procctl(PROC_REAP*) is not used).
>> > > >
>> > > > I think that on Linux they have separate PID namespace per container, so
>> > > > the first process to run naturally gets PID 1.
>> > > >
>> > > > I think that per-container PID namespace may be an overkill.
>> > > > Maybe there is a way to make PID 1 special without going that way.
>> > > >
>> > > > E.g., a jail could record the first process it runs.
>> > > > We can patch up getpid() to return 1 for that process.
>> > > > Also, we could patch up the process lookup to return the first process
>> > > > in the jail for PID 1.
>> > > >
>> > > > Re-parenting to the "jail init" sounds harder but should be possible as
>> > > > well (e.g., using PROC_REAP).
>> > This is why PROC_REAP was initially implemented: to allow something to
>> > manage zombies of all its descendants, for surrogate init processes.
>> > Later it appeared that at least timeout(1) benefits from it as well.
>>
>> Good, that would make it that much easier to implement. It wasn't
>> there when I did this in the early 2000s (I said a decade ago, but
>> time passes faster than I give it credit for).
>>
>> > A side note: machinery to reliably signal all specific descendants of
>> > the reaper is way too complicated.
>> >
>> > > >
>> > > > Not sure what to do if the "jail init" dies... should all processes in
>> > > > the jail get killed and the jail should die as well (unless persistent)?
>> > > >
>> > > > This proposal sounds like a kludge but it could be a shortcut to support
>> > > > more Linux containers and to allow similar FreeBSD jails / containers
>> > > > with alternative init-s / supervisors.
>> > >
>> > > Far from being a kludge, I think it's a feature we need, and one at
>> > > the top of my list. Forcing it to look like PID 1 from jailed
>> > > perspective is definitely doable (and something I'd done outside of
>> > > the project a decade ago). In addition to those two requirements, I
>> > > would add one that answers your last question:
>> > >
>> > > 3. signals to init and reboot(2) work as they would on the host side.
>> > >
>> > > A jailed reboot would kill all processes and restart rc, and
>> > > possibly do other kernel-side cleanups yet to be clearly defined.
>> > > A jailed halt would remove the jail. A jailed single-user mode
>> > > could exist where instead of init spawning a shell, it just sits
>> > > around while the system has a chance to jexec into it.
>> > >
>> > > init handles various signals by rebooting/halting/etc, and it should
>> > > be able to do that as it does now, by calling reboot(2), directing
>> > > the kernel to do what it needs to with the jail. If init goes away,
>> > > it's probably like a halt and removes the jail.
>> >
>> > I completely disagree with this design, I insist that init(8) should
>> > stay as full system init, and reboot(2) should be kept as the machine
>> > reboot.
>>
>> Why?
>>
>> With the system call hooks, init(8) was nearly 100% identical.
>> There was a place or two that needed to be context-aware, which
>> is very easy to add. It seems silly to re-implement init with
>> just a couple of changes.
>>
>> reboot(2) wouldn't be the first system call to act differently for
>> jailed access. Is doing a useful thing for jails worse than just
>> doing nothing? I see in this the beauty of a container that moves
>> that much closer to feeling like a virtual machine, while retaining
>> its lightweight nature. The ideal is that jails "just work," and
>> working at the syscall level is part of that.
> I do not see why reboot(2), which resets the hardware, should be
> overloaded to kill a bunch of processes. In fact, I am not even sure
> why should we put this in kernel, when it can be done in userspace.
>
> If there is some specific feature that is missing in current kernel
> interfaces, we should plug it in a minimal form. I am open to hear
> what is not enough from PROC_REAP.

An example is Linux jails. The goal is to be able to drop such a thing
into place, and have it virtually boot up, and be virtually manageable,
including shutting it down or restarting it, using the tools such a
system has. Linux emulation is already at the kernel level, for good
reason, and it fits with such a purpose to keep the emulation there.
I don't know if we can fully get a Linux jail to work out of the box
with systemd, but would be happy to see it happen.

It's similar even without emulation. The less a jail is required to be
different from a non-jailed system, the easier it is to use. I prefer
to ask "Can I make the kernel work seamlessly with jails?" rather than
"Can a user do this thing without changing the kernel?" The truth is
that many things people use jails for could be done without kernel
support beyond chroot and careful permission management. Why do we
even allow per-jail hostnames (a feature from the very beginning), when
a jailed process could just as easily read its hostname from a file?
Because it's seamless, and it just works.

>> > For jail-contained inits, it should be a separate/dedicated
>> > implementation of init. It would be aware of its usage model, in
>> > particular, it should proclaim itself the reaper, it should use
>> > reaper signalling facilities for killing processes when shutting
>> > the container down (not ever tweaking the reboot(2)). It must not
>> > have the ugly protection against signals delivery we have for real
>> > init.
>>
>> I haven't looked into that protection, so I'm neutral on that for
>> now. It makes sense to exempt virtual init from virtual killall for
>> example, but I wouldn't expect to just not deliver certain signals.
>> I don't recall how I dealt with that specific issue 20 years ago.
>
> Yes real init (pid 1) is exempted from the group killing, and from
> receiving async signals with the default disposition. I do not see
> why should we allow this for non-pid 1 init. Userspace has enough
> controls to catch everything it does not want to fall aside. For
> init(8) it is more a hack than proper code.
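
To make the PROC_REAP side of the discussion concrete, here is a
minimal sketch of a surrogate init acquiring reaper status and then
collecting an orphaned grandchild. The helper names become_reaper()
and count_reaped_descendants() are made up for illustration. The
FreeBSD branch uses procctl(2) with PROC_REAP_ACQUIRE; the Linux branch
(prctl(2) with PR_SET_CHILD_SUBREAPER) is only the closest analogue,
included so the sketch builds outside FreeBSD. It covers requirement
(2) only and does nothing about PID 1 or reboot(2), which are the
contested parts of the thread.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>

#if defined(__FreeBSD__)
#include <sys/procctl.h>
#elif defined(__linux__)
#include <sys/prctl.h>
#endif

/*
 * Hypothetical helper: make the calling process the reaper for all of
 * its descendants. Returns 0 on success, -1 with errno set on failure.
 */
int become_reaper(void)
{
#if defined(__FreeBSD__)
    return procctl(P_PID, getpid(), PROC_REAP_ACQUIRE, NULL);
#elif defined(__linux__)
    /* Linux analogue, assumed here only for portability of the sketch. */
    return prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0);
#else
    errno = ENOSYS;
    return -1;
#endif
}

/*
 * Hypothetical helper: fork a child that forks a grandchild and exits
 * at once, orphaning the grandchild. Then wait for every process this
 * caller is allowed to collect and return how many there were
 * (-1 if the first fork fails).
 */
int count_reaped_descendants(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        pid_t grandchild = fork();
        if (grandchild == 0) {
            /* Grandchild: outlive the parent briefly, then exit. */
            struct timespec delay = { 0, 100 * 1000 * 1000 }; /* 100 ms */
            nanosleep(&delay, NULL);
            _exit(7);
        }
        /* Child: exit without waiting, leaving the grandchild orphaned. */
        _exit(grandchild < 0 ? 1 : 0);
    }

    int reaped = 0;
    for (;;) {
        pid_t pid = wait(NULL);
        if (pid > 0) {
            reaped++;
            continue;
        }
        if (errno == EINTR)
            continue;
        break; /* ECHILD: nothing left to collect */
    }
    return reaped;
}
```

Called in that order from a process with no other children,
count_reaped_descendants() should return 2: the direct child plus the
re-parented grandchild. Without the become_reaper() call, the orphan
is typically handed to the system init (or an outer reaper) instead,
and only the direct child is collected.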