Re: curtain: WIP sandboxing mechanism with pledge()/unveil() support

From: David Chisnall
Date: Thu, 31 Mar 2022 10:24:43 UTC
On 29/03/2022 18:32, Mathieu wrote:
> On 3/29/22 04:34, David Chisnall wrote:
>> Hi,
>> Does pledge actually require kernel support?  I'd have thought that it 
>> could be implemented on top of Capsicum as a purely userland 
>> abstraction (more easily with libc help, but even with an LD_PRELOADed 
>> library along the lines of libpreopen).  In Verona, we're able to use 
>> Capsicum to run unmodified libraries in a sandbox, for example, 
>> including handling raw system calls:
>> It would be good to understand why this needs more kernel attack surface.
>> David
> If it can work like that then it's pretty cool.  It could be a lot more 
> secure.  But it's just not the way I went with. Re-implementing so much 
> kernel functionality in userland seems like a lot of work. Because I 
> wanted my module to be able to sandbox (almost) everything that the OS 
> can run.  Including whole process hierarchies that execute other 
> programs and use process management and shared memory, etc.  That's a 
> lot of little details to get right...  So I went with the same route 
> that jails, other MAC modules and even Capsicum are implemented: with 
> access checks in the kernel itself.  And most of these checks were 
> already in place with MAC hooks.

My concern with adding it to the kernel is that anything that does 
path-based checks is *incredibly* hard to get right and it will fail 
open.  To date, there are zero examples of path-based sandboxing 
mechanisms deployed in the wild that have not had vulnerabilities 
arising from the nature of the problem.  The filesystem is, inherently, 
concurrent.  A process can mutate the shape of the filesystem graph 
while you are doing path-based checks, and most of the resulting bugs 
involve the handling of '..' in paths.  Jails and Capsicum sidestep this 
in different ways:

Jails effectively punt the problem to the jail orchestration code.  They 
provide very strong restrictions on the paths, with a single root and 
allowing all access within this.  There are a few restrictions on what 
you can do from outside of a jail to avoid allowing the jailed process 
to exploit TOCTOU differences and escape, but fortunately these align 
with the use of jails as isolated containers containing (minimal) base 

Capsicum simply disallows '..' in paths.  If you want to support it in 
user code then you must do path resolution in userspace.  You may still 
have TOCTOU bugs, but they'll all fail closed: you will try to resolve 
the result, discover that you don't have a file descriptor corresponding 
to the path, and fail.
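
As a rough illustration of the userspace resolution described above, here 
is a minimal sketch in Python.  It is not how libpreopen or any existing 
Capsicum library does it, and open_beneath() is a hypothetical name.  It 
walks the path one component at a time with openat-style calls, keeps the 
directory descriptors it has walked on a stack, and resolves '..' by 
popping that stack instead of asking the kernel.  It does not enter 
capability mode, so it runs on any POSIX system that supports dir_fd.

```python
import os

def open_beneath(root_fd, path, flags=os.O_RDONLY):
    """Open `path` relative to root_fd, resolving '..' in userspace.

    Hypothetical helper: fails with PermissionError instead of ever
    resolving to something outside root_fd.
    """
    if path.startswith("/"):
        raise PermissionError("absolute paths are not allowed")
    parts = [p for p in path.split("/") if p not in ("", ".")]
    if not parts:
        raise ValueError("empty path")

    dir_flags = os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW
    stack = [os.dup(root_fd)]  # directory fds we actually walked, root first
    try:
        for name in parts[:-1]:
            if name == "..":
                # '..' means "the directory we came from", never the
                # directory's current parent as the kernel sees it.
                if len(stack) == 1:
                    raise PermissionError("'..' would escape the root")
                os.close(stack.pop())
            else:
                stack.append(os.open(name, dir_flags, dir_fd=stack[-1]))

        last = parts[-1]
        if last == "..":
            if len(stack) == 1:
                raise PermissionError("'..' would escape the root")
            os.close(stack.pop())
            return os.dup(stack[-1])
        # Symlinks are refused rather than followed.
        return os.open(last, flags | os.O_NOFOLLOW, dir_fd=stack[-1])
    finally:
        for fd in stack:
            os.close(fd)
```

If another process renames a directory during the walk, the worst case 
is that a later component is not found under the descriptor already 
held, and the open fails.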

> pledge()/unveil() are usually used for fairly well-disciplined 
> applications that either don't run other programs or run very specific 
> programs that are also well-disciplined and don't expect too much 
> (unless you just drop the pledges on execve()).

The execve hole is the reason that I have little interest in pledge as 
an enforcement mechanism.  If a process can just execve itself to 
escape, then that's a trivial hole to exploit unless you're incredibly 
careful to make sure that the process does not have the ability to 
create or read files with executable privilege on the filesystem.

In contrast, something using Capsicum can create child processes but 
they inherit the same limitations.  A child can also inherit file 
descriptors from the parent, so if it is using something like libpreopen 
then it can 
inherit a large number of file descriptors for any of the files / 
directories that it should be permitted to open.
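
The descriptor inheritance part can be sketched as follows.  This is 
only an illustration in Python: read_via_inherited_fd() and the CHILD 
script are hypothetical, the descriptor number is passed on the command 
line rather than through whatever mechanism libpreopen uses, and nothing 
here enters capability mode.  On FreeBSD the parent would call 
cap_enter() first and the child would inherit that restriction too.

```python
import os
import subprocess
import sys

# Child program: it never sees the directory's path, only a descriptor
# number, and opens files relative to that descriptor.
CHILD = r"""
import os, sys
dir_fd = int(sys.argv[1])
fd = os.open(sys.argv[2], os.O_RDONLY, dir_fd=dir_fd)
sys.stdout.write(os.read(fd, 4096).decode())
"""

def read_via_inherited_fd(directory, name):
    """Open `directory` in the parent, then let a child read `name` from it."""
    dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD, str(dir_fd), name],
            pass_fds=[dir_fd],      # keep this fd open across exec
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout
    finally:
        os.close(dir_fd)
```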

Since rtld was extended to allow direct execution mode, you can launch 
dynamically linked binaries in Capsicum mode.  With the SIGCAP things in, 
it becomes easy to write a signal handler that intercepts blocked system 
calls and handles them (I'm running with this applied and doing exactly 
that), so this can be transparent to any dynamically linked binary.

> Pledged applications usually reduce the kernel attack surface a lot, but 
> you don't run arbitrary programs with pledge (and that wasn't one of its 
> goals AFAIK).  But that's what I wanted my module to be able to do.  I'd 
> say it has become a bit of a weird hybrid between a "container" 
> framework and an exploit mitigation framework at this point.  You can 
> run a `make buildworld` with it, build/install/run random programs 
> isolated in your project directories, sandbox shell/desktop sessions as 
> a whole, etc.  And then within those sandboxes, nested applications can 
> do their own sandboxing on top of it (with this module (and its 
> pledge/unveil compat) or Capsicum (and possibly other compat layers 
> built on top of it)).  The "inner" programs can use more restrictive 
> sandboxes that don't expose as much kernel functionality.  But for the 
> "outer" programs the whole thing slides more towards being 
> "containers"/"jails" (and the more complex it would have been to do 
> purely in userland I believe).

So how do you avoid TOCTOU bugs in your path logic?  I don't disagree 
with the goals, but I worry that you're doing something that is 
intrinsically almost impossible to get right.
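
To make the '..' concern concrete, here is a deliberately naive sketch 
in Python.  It is not taken from curtain or from any real 
implementation, and naive_check_then_open() is a hypothetical name.  The 
check resolves '..' lexically against where the process is believed to 
be, and the use lets the kernel resolve '..' against the directory's 
current parent.  The concurrent_step callback stands in for another 
process acting in the gap between the two, so the race is deterministic 
here.

```python
import os

def naive_check_then_open(root, cwd, cwd_fd, relpath,
                          concurrent_step=lambda: None):
    """Path-based check followed by use; concurrent_step runs in the gap."""
    # Check: resolve '..' lexically against the directory we think we are in.
    resolved = os.path.normpath(os.path.join(cwd, relpath))
    if os.path.commonpath([root, resolved]) != root:
        raise PermissionError(relpath)

    # Stand-in for another process mutating the tree between check and use.
    concurrent_step()

    # Use: the kernel resolves '..' against the directory's *current* parent.
    return os.open(relpath, os.O_RDONLY, dir_fd=cwd_fd)
```

With root at tmp/jail and the working directory at tmp/jail/a/b, the 
path '../../secret.txt' passes the check because it normalises to 
tmp/jail/secret.txt.  If the callback renames tmp/jail/a/b to 
tmp/jail/b, a move that stays entirely inside the sandbox, the same 
path now opens tmp/secret.txt outside it.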