How to implement jail-aware SysV IPC (with my nasty patch)

Mon Jun 15 18:45:36 UTC 2015

On Mon, 15 Jun 2015 12:49:16 +0200, Mateusz Guzik <mjguzik at gmail.com> wrote:
> On Mon, Jun 15, 2015 at 09:53:53AM +0000, Bjoern A. Zeeb wrote:
>> Hi,
>> 
>> removed hackers, added virtualization.
>> 
>> 
>> > On 12 Jun 2015, at 01:17 , kikuchan at uranus.dti.ne.jp wrote:
>> > 
>> > Hello,
>> > 
>> > I’m (still) trying to figure out how jail-aware SysV IPC mechanism should be.
>> 
>> The best way probably is to finally get the “common” VIMAGE framework into HEAD to allow easy virtualisation of other services.  That work has been sitting in perforce for a few years and simply needs updating for sysctls I think.
>> 
>> Then use that to virtualise things and have a vipc like we have vnets.  The good news is that you have identified most places and have the cleanup functions already so it’d be a matter of transforming your changes (assuming they are correct and working fine; haven’t actually read the patch in detail;-)  to the different infrastructure.  And that’s the easiest part.
>> 
>> 
> 
> I have not looked at vimage too closely, maybe indeed it's the right to
> go. Would definitely be interested in seeing it cleaned up and in
> action.
> 
> In the meantime, as I tried to explain in the previous thread, a
> jail-aware sysvshm poses several questions which need to be
> answered/taken care of before it can hit the tree. I doubt any
> reasonable implementation can magically avoid problems they pose and I
> definitely want to get an analysis how proposed implementation behaves
> (or how it prevents given scenario from occuring).
> 
> Fundamentally the basic question is how does the implementation cope
> with processes having sysvshm mappings obtained from 2 different jails
> (provided they use different sysvshms).
> 
> Preferably the whole business would be /prevented/. Prevention mechanism
> would have to deal with shared address spaces (rfork(2) + RFMEM),
> threads and pre-existing mappings.
> 
> The patch posted here just puts permission checks in several places,
> while leaving the namespace shared, which I find to be a user-visible
> hack with no good justification. There is also no analysis how this
> behaves when presented with aforementioned scenario. Even if it turns
> out the resut is harmless with resulting code, this leaves us with a
> very error-prone scheme.
> 
> There is no technical problem adding a pointer to struct prison and
> dereferencing it instead of current global vars. Adding proper sysctls
> dumping the content for given jail is trivial and so is providing
> resource limits when creating a first-level jail with a separate
> sysvshm. Something which cannot be as easily achieved with the patch in
> question.
> 
> Possible later switch to vimage would be transparent to users.

Dear Mateusz,

I'm sorry if I'm annoying you, but I really want to solve this problems.

> Fundamentally the basic question is how does the implementation cope
> with processes having sysvshm mappings obtained from 2 different jails
> (provided they use different sysvshms).
> 
> Preferably the whole business would be /prevented/. Prevention mechanism
> would have to deal with shared address spaces (rfork(2) + RFMEM),
> threads and pre-existing mappings.
> 
> The patch posted here just puts permission checks in several places,
> while leaving the namespace shared, which I find to be a user-visible
> hack with no good justification. There is also no analysis how this
> behaves when presented with aforementioned scenario. Even if it turns
> out the resut is harmless with resulting code, this leaves us with a
> very error-prone scheme.
> 
> There is no technical problem adding a pointer to struct prison and
> dereferencing it instead of current global vars. Adding proper sysctls
> dumping the content for given jail is trivial and so is providing
> resource limits when creating a first-level jail with a separate
> sysvshm. Something which cannot be as easily achieved with the patch in
> question.

Could you try the latest patch, please?
I justify user-visibility, make it hierarchical jail friendly, and use EINVAL instead of EACCES to conceal information leak.
https://bz-attachments.freebsd.org/attachment.cgi?id=157661 (typo fixed)

I realized my method is a bit better, when I'm trying to port/write the real namespace separation.
Let me explain (again) why I choose this method for sysv ipc, and could you tell me how it should be, please?

struct shmmap_state {
	vm_offset_t va;
	int shmid;
};

In sysv_shm.c, struct shmmap_state, exist per process as p->p_vmspace->vm_shm, is a lookup-table for va -> shm object lookup.
The shmmap_state entry holds a reference (here, shmid) to shm object for further detach, and entries are simply copied on fork.

If you split namespace (includes shmid space) completely, shmid would be no longer a unique identifier for IPC object in kernel.
To make it unique, adding a reference to prison into shmmap_state like this;

struct shmmap_state {
	vm_offset_t va;
	struct prison *prison;
	int shmid;
};

would be bad idea, because after a process calls jail_attach(), the process holds a reference to another (creator) prison, or copy the IPC object completely on every jail_attach() occurs?
How do you deal with hierarchical jail?

How about this;

struct shmmap_state {
	vm_offset_t va;
	struct shmid_kernel *shmseg;
};

looks better, but remember you split shmid space completely by moving global vars to separated namespace for each jail,
the *shmseg points inside of each jail's "struct shmid_kernel *shmsegs" list.
It would have the same problems as the previous one.

To prevent this, make a single shared list of struct shmid_kernel in kernel?
It would be the same method what I used.

Hmm, or, other smarter way exists, perhaps?
What does it look like?

I have no more idea...

My method didn't touch anything about the mapping stuff, thus it behaves exactly the same as current FreeBSD behave on this point.

I'm not sure I could understand properly what the shared address space problem is, (Could someone help me to understand, perhaps in code?)
and, I'm not sure whether the current FreeBSD has the shared address space problem for sysvshm combined with jails.
If it has the problem, unfortunately my patch doesn't provide any solution for that,
but if not, my patch doesn't have the problem either, because I didn't change code structure.

The patch just fixes key_t collision for jails, nothing more.
So, the patch is harmless for non-jail user, and I believe it's useful for jail user using allow.sysvipc=true.

BTW, What do you think about the following design for jail-aware sysvipc?

> - IPC objects created on parent jail, are invisible to children.
> - IPC objects created on neighbor jail, are also invisible each other.
> - IPC objects craeted on child jail, are VISIBLE from parent.
> - IPC key_t spaces are separated between jails. If you see the key_t named object from parent, it's shown as IPC_PRIVATE.

I want to know it because I think the implementation and the behavior design can be discussed independently.

Thank you.

Regards,
Kikuchan