[patch] separate SysV IPC namespace for jail

Sun Jun 7 01:39:38 UTC 2015

On Sun, Jun 07, 2015 at 12:04:17AM +0900, kikuchan wrote:
> Sorry for cross-post to freebsd-stable, but I want to get more
> feedback for my patch.
> (The patch is at http://lists.freebsd.org/pipermail/freebsd-jail/attachments/20150606/7736309b/attachment.bin)
> 
> 
> I believe this patch FIXES current SysV IPC for jail WITHOUT changing
> current kernel architecture.
> (so I hope it will be merged into stable/10)
> 
> Let me explain what happens currently, with and without my patch,
> since it's a little confusing.
> 
> 
> I use SysV IPC shared memory (SYSVSHM) as an example here, because
> it's easy to understand.
> Remember that shmget / shmat / shmdt / shmctl are the SYSVSHM syscalls.
> 
> Every normal process has its own virtual memory space, provided by
> the kernel.
> Virtual memory is backed by pages, which reside in physical memory or
> on swap devices.
> 
> SYSVSHM provides a way for userland processes to share memory
> segments backed by such pages.
> A process can map a segment into its own virtual memory space with
> the shmat syscall.
> Once the segment is mapped, its pages stay accessible until a later
> shmdt syscall or until the process exits.
> 
> Another process can obtain the exact same pages by calling the shmat
> syscall.
> So the permission check in the shmat syscall is very important.
> 
> 
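
For readers following along, a minimal sketch of the sharing described
above, using only the standard shmget / shmat / shmdt / shmctl calls.
The function name shm_roundtrip, the SEG_SIZE constant and the value 42
are made up for illustration; nothing here is taken from the patch.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

#define SEG_SIZE 4096	/* illustrative size, not from the patch */

/*
 * Create a segment, let a child process write into it, and return what
 * the parent reads back through its own mapping. Returns -1 on failure.
 */
int
shm_roundtrip(void)
{
	int id, status, value;
	int *p, *q;
	pid_t pid;

	assert(SEG_SIZE >= sizeof(int));

	/* IPC_PRIVATE always creates a new segment; 0600 = owner rw only. */
	id = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
	if (id == -1)
		return (-1);

	/* Map the segment into this process's address space. */
	p = shmat(id, NULL, 0);
	if (p == (void *)-1) {
		shmctl(id, IPC_RMID, NULL);
		return (-1);
	}
	*p = 0;

	pid = fork();
	if (pid == 0) {
		/*
		 * The child inherits the parent's mapping across fork, but
		 * attaches again by id to show what an unrelated process
		 * would do. shmat checks the segment mode at this point.
		 */
		q = shmat(id, NULL, 0);
		if (q == (void *)-1)
			_exit(1);
		*q = 42;
		shmdt(q);
		_exit(0);
	}

	value = -1;
	if (pid > 0 && waitpid(pid, &status, 0) == pid &&
	    WIFEXITED(status) && WEXITSTATUS(status) == 0)
		value = *p;	/* the child's write is visible here */

	/* Detach and mark the segment for removal. */
	shmdt(p);
	shmctl(id, IPC_RMID, NULL);
	return (value);
}
```

On a system with SysV shared memory enabled, shm_roundtrip() is expected
to return 42, since both mappings refer to the same pages.
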
> > Address space can be shared between multiple jails
> 

This was a typo. Let me quote the fixed version:

"Address space can be shared between multiple PROCESSES, what happens if
such a pair ends up in different jails? Preferably such a scenario would
be prohibited to avoid future accidents."

However, sysvipc namespace sharing is a reasonable feature, especially
with multi-level jails. In the simplest scenario, upon jail creation you
decide whether the jail gets its own namespace or inherits its parent's.

> > What about existing sysvshm mappings when jailing?
> 
> Real (not jailed) environment is treated as a jail with jid=0 in kernel.
> If you create a sysvshm memory segment before entering a jail, the
> segment is simply owned by jid=0.
> 

The point is you get a process with sysvshm segments from 2 different
jails. Looks like solid trouble potential.

> 
> > Extending struct prison with relevant pointers and updating the code to
> 
> You don't need to extend the struct to separate IPC namespaces.
> Here, the word "namespaces" refers to the key (key_t) passed to the
> IPC syscalls.
> 
> Whether the struct should be extended depends on how we want to
> control IPC resources for each jail.
> If you want to control SysV IPC resources per jail by changing sysctl
> parameters from inside the jail, then the answer might be yes.
> But I think per-jail resource control should be done with RACCT, and
> that could be applied to my implementation too.
> 
> 
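
To illustrate what "separating the key namespace" means in the quoted
text, here is a toy userland model, not the patch's code: a segment
table where lookups match on the pair (jid, key) rather than on key
alone. The names struct seg, seg_find and seg_get and the table size
are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>	/* key_t */

#define NSEGS 8		/* illustrative table size */

/* Hypothetical model of one entry in a kernel segment table. */
struct seg {
	int	used;
	int	jid;	/* owning jail; 0 models the host */
	key_t	key;
};

static struct seg segs[NSEGS];

/* Look up (jid, key). The same key owned by another jail does not match. */
int
seg_find(int jid, key_t key)
{
	for (int i = 0; i < NSEGS; i++)
		if (segs[i].used && segs[i].jid == jid && segs[i].key == key)
			return (i);
	return (-1);
}

/*
 * Model of shmget(key, ..., IPC_CREAT): return the caller's existing
 * segment for this key, or allocate a new one owned by the caller's jail.
 */
int
seg_get(int jid, key_t key)
{
	int i;

	assert(jid >= 0);
	if ((i = seg_find(jid, key)) != -1)
		return (i);
	for (i = 0; i < NSEGS; i++) {
		if (!segs[i].used) {
			segs[i] = (struct seg){ .used = 1, .jid = jid, .key = key };
			return (i);
		}
	}
	return (-1);	/* table full */
}
```

In this model, seg_get(0, 0x1234) and seg_get(1, 0x1234) return two
different ids, so equal keys in different jails no longer collide. Note
that the ids themselves still come from one global table here.
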
> The one missing feature is how to export information to userland.
> This should be discussed separately, even if my patch is rejected.
> (If visibility control is needed for ipcs, maybe it should use similar
> technique to ps or netstat?)
> 
> 
> Conclusion:
> I think my patch is better than the current broken state. (SysV IPC +
> jail has been buggy for over 10 years!)
> 

The feature in question is definitely desirable, but your patch is a
hack, with the "hack" part visible to userspace.

As mentioned earlier, there are some things to do before any kind of
jail-aware ipcs lands in the tree. At a minimum this means
singlethreading when jailing, and preventing the jailing of processes
with shared virtual address spaces or with existing sysvshm mappings.
All this is to reduce the number of bugs one would have to deal with.

After the work is completed there is no problem whatsoever with
providing per-jail sysvipcs. This avoids information leaks (no id list
to look at) and conflicts.

Exporting is not a problem either - a dedicated sysctl takes a JID and
dumps that jail's ipcs. It also takes a 'recursive' flag saying whether
ipcs for its child jails should be dumped as well (if different).
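
A rough userland sketch of the selection logic such a sysctl handler
might apply, assuming jails form a tree and every segment records its
owning jid. No such sysctl exists in the tree; struct jail, struct seg
and dump_ipcs are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct jail { int jid; int parent; };	/* parent == -1 for the host */
struct seg  { int id;  int jid; };	/* segment id and owning jail */

/* Return 1 if 'jid' sits somewhere below 'ancestor' in the jail tree. */
static int
is_descendant(const struct jail *jails, size_t nj, int jid, int ancestor)
{
	/* Bound the walk by the table size in case of a malformed table. */
	for (size_t depth = 0; depth <= nj && jid != -1; depth++) {
		int parent = -1;

		for (size_t i = 0; i < nj; i++) {
			if (jails[i].jid == jid) {
				parent = jails[i].parent;
				break;
			}
		}
		if (parent == ancestor)
			return (1);
		jid = parent;
	}
	return (0);
}

/*
 * Copy into 'out' the ids of segments owned by 'jid', plus those owned
 * by its descendants when 'recursive' is set. Returns the number copied.
 */
size_t
dump_ipcs(const struct jail *jails, size_t nj, const struct seg *segs,
    size_t ns, int jid, int recursive, int *out, size_t max)
{
	size_t n = 0;

	assert(jid >= 0);
	for (size_t i = 0; i < ns && n < max; i++) {
		if (segs[i].jid == jid ||
		    (recursive && is_descendant(jails, nj, segs[i].jid, jid)))
			out[n++] = segs[i].id;
	}
	return (n);
}
```

With jails 0 <- 1 <- 2 and one segment per jail, asking for jid 1
without the flag yields only jail 1's segment, while setting the flag
also includes jail 2's.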

-- 
Mateusz Guzik <mjguzik gmail.com>