Re: curtain: WIP sandboxing mechanism with pledge()/unveil() support

From: Mathieu <>
Date: Wed, 30 Mar 2022 17:52:09 UTC
On 3/30/22 03:22, Baptiste Daroussin wrote:
> Hello Mathieu,
> First of all, thank you for this amazing work, leveraging the mac framework to
> build curtain is imho an excellent idea, I personnally see a curtain like
> approach as complementary to a capsicum approach rather than an antagonist
> feature, I can see many possible usage of curtains in freebsd in particular in
> the port framework!


One nice thing with the MAC approach is that many of the checks were 
already carefully placed to not deviate too much from expected behavior 
for applications.  At first I tried to do all of the checks in namei() 
and this often made syscalls fail too early with the wrong errnos and 
this confuses some programs.  When I figured out the right place to add 
the checks it was almost always right next to a MAC check.

Also, if I hadn't used a MAC module and added a new layer instead, you 
could say that it would have been the fifth (!!!) general access control 
mechanism in the kernel (after traditional UNIX, Capsicum, jails and 
MAC).  This was starting to feel like a bit much.  So the MAC framework, 
it's not a perfect fit for some of what this module wants to do but it 
could be worse.

Also, if you combine curtains with chroots you kind of get jails. It's 
chroot escape-proof because unveils deny access to the outer directories 
like any other.  Privileges are restricted.  It doesn't have all the 
features of jails but it's very jail-like. Just to say, the MAC 
framework was pretty close to being able to implement jails too.

> To allow to integrate and permit reviews from developers, I think we can/should
> split the review. The first thing will probably me imho to start a review
> process of the sysfilt feature, this is probably the part that will need most of
> the back and forth discussion given the rest is pretty isolated (mac module,
> userland).

Yeah I think it's a good idea.  I could do some minor cleanups and add 
some comments while at it before submitting it.

I'm still working on this, but the majority of the changes should be in 
the userland and the module itself now. And getting some feedback on the 
kernel parts (and eventually knowing that they're in an acceptable 
state) would help.  The module works well enough already that I don't 
think that there's anything critical that would require extra changes to 
the rest of the kernel (unless some problem is found).

I think the kernel changes could be broken up something like this:

1) Sysfils.  Adds annotations to every syscall in syscalls.master 
(Linuxulator could be done later too).  Generalizes the Capsicum flag to 
a sysfils bitmap both for sysents and ucreds.  Changes to the syscall 
entry checks.  Some sysfil checks spread out in the kernel.

2) There are some small modifications that I believe are bug fixes or 
minor improvements or basically have no effect in the current state of 
things (but are important for mac_curtain).

3) The new MAC handlers.  That's a significant part too and it can be 
separate from sysfils.  This would include the new call sites in 
vfs_lookup() (and there's some extra logic too).  New call sites in 
SysV/POSIX IPC modules (and some other changes to path handling).  
There's a function that deals with getdirentries() filtering directly in 
the MAC framework (the MAC hooks only deal with the dirents one by one) 
that maybe doesn't belong there. vn_open() got a new flag.

4) Various modifications that are specifically to support mac_curtain.  
I tried to minimize those but there are some. struct thread/proc need 
new fields (opaque pointers for unveil tracking/caching).  struct file 
needs a few extra integer fields (I really tried to get rid of those but 
couldn't figure out a good way).  I also added "sysctl_shadow" objects 
to kern_sysctl.c. These are handles that can outlast sysctl_oids.  It's 
not elegant, but mac_curtain needs it.  At first I thought that I could 
add long-lived references to sysctl_oids and I designed the curtain data 
structure for that.  Then I realized that sysctl_oids get totally nuked 
when a module unloads.  They get unmapped from memory with the rest of 
the module's data and there's no holding on to them (IIUC).  So I added 
an intermediate data structure between sysctl_oids and curtains.

The mallocs with non-sleepable locks in the SysV IPC module with late 
label initializations are annoying but this could be fixed separately 
(it might need a bit of code restructuring...).  Making the module 
fplookup-friendly would probably need some changes to the MAC framework 
itself, this can be done separately too. There are others, but not 
critical and can be done separately.

> Can you isolate the sysfils code and start a review in phabricator? If you need
> help for this don't hesitate to ask me ;)

Yup, I'm on it.

> Again thanks for the huge work.

No problem! This is gonna be a lot of work to review so thanks to anyone 
involved too.

> Best regards,
> Bapt