Re: curtain: WIP sandboxing mechanism with pledge()/unveil() support

Reply: David Chisnall : "Re: curtain: WIP sandboxing mechanism with pledge()/unveil() support"
In reply to: David Chisnall : "Re: curtain: WIP sandboxing mechanism with pledge()/unveil() support"

From: Mathieu <sigsys_at_gmail.com>
Date: Tue, 29 Mar 2022 17:32:41 UTC

On 3/29/22 04:34, David Chisnall wrote:
> Hi,
>
> Does pledge actually require kernel support?  I'd have thought that it 
> could be implemented on top of Capsicum as a purely userland 
> abstraction (more easily with libc help, but even with an LD_PRELOADed 
> library along the lines of libpreopen).  In Verona, we're able to use 
> Capsicum to run unmodified libraries in a sandbox, for example, 
> including handling raw system calls:
>
> https://github.com/microsoft/verona/tree/master/experiments/process_sandbox 
>
>
> It would be good to understand why this needs more kernel attack surface.
>
> David

If it can work like that then it's pretty cool, and it could be a lot
more secure.  But it's just not the route I took.  Re-implementing so
much kernel functionality in userland seemed like a lot of work, because
I wanted my module to be able to sandbox (almost) everything the OS can
run, including whole process hierarchies that execute other programs and
use process management, shared memory, etc.  That's a lot of little
details to get right...  So I took the same route as jails, other MAC
modules and even Capsicum: access checks in the kernel itself.  And most
of these checks were already in place as MAC hooks.

pledge()/unveil() are usually used by fairly well-disciplined
applications that either don't run other programs at all, or run only
very specific programs that are equally well-disciplined and don't expect
much from their environment (unless you just drop the pledges on
execve()).
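For readers who haven't used the OpenBSD interfaces, here is a minimal
sketch of what such a program's self-restriction might look like,
assuming OpenBSD's pledge(2) and unveil(2) signatures.  The function
name sandbox_for_helper() and its parameters are hypothetical.  Passing
NULL as execpromises leaves that value unchanged, so in a process that
never set it, a program started with execve() runs without pledges;
that is the "drop the pledges on execve()" case.  On non-OpenBSD systems
the sketch simply reports ENOSYS.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: confine the calling process to read-only access
 * under data_dir, execute access to one helper binary, and the
 * "stdio rpath proc exec" promises.  Returns 0 on success, or -1 with
 * errno set on failure. */
int sandbox_for_helper(const char *data_dir, const char *helper)
{
#if defined(__OpenBSD__)
    if (unveil(data_dir, "r") == -1)
        return -1;
    if (unveil(helper, "x") == -1)
        return -1;
    /* Lock the unveil list: no further paths can be added afterwards. */
    if (unveil(NULL, NULL) == -1)
        return -1;
    /* NULL execpromises: leave the exec-time promises unchanged, so if
     * they were never set, the helper runs unpledged after execve(). */
    return pledge("stdio rpath proc exec", NULL);
#else
    /* pledge()/unveil() are not assumed to exist on this system. */
    (void)data_dir;
    (void)helper;
    errno = ENOSYS;
    return -1;
#endif
}
```

A program would call this once during startup, after opening anything
it needs outside data_dir, since later calls can only narrow the
restrictions further.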

Pledged applications usually reduce the kernel attack surface a lot, but 
you don't run arbitrary programs with pledge (and that wasn't one of its 
goals AFAIK).  But that's what I wanted my module to be able to do.  I'd 
say it has become a bit of a weird hybrid between a "container" 
framework and an exploit mitigation framework at this point.  You can 
run a `make buildworld` with it, build/install/run random programs 
isolated in your project directories, sandbox shell/desktop sessions as 
a whole, etc.  Within those sandboxes, nested applications can then do
their own sandboxing on top of it, either with this module (and its
pledge/unveil compat layer) or with Capsicum (and possibly other compat
layers built on top of it).  The "inner" programs can use more
restrictive sandboxes that expose less kernel functionality.  But for
the "outer" programs the whole thing slides more towards being
"containers"/"jails", and that is also where doing it purely in userland
would have been the most complex, I believe.