Re: curtain: WIP sandboxing mechanism with pledge()/unveil() support

From: Mathieu <>
Date: Thu, 31 Mar 2022 22:55:36 UTC
On 3/31/22 06:24, David Chisnall wrote:
> On 29/03/2022 18:32, Mathieu wrote:
>> On 3/29/22 04:34, David Chisnall wrote:
>>> Hi,
>>> Does pledge actually require kernel support?  I'd have thought that 
>>> it could be implemented on top of Capsicum as a purely userland 
>>> abstraction (more easily with libc help, but even with an 
>>> LD_PRELOADed library along the lines of libpreopen). In Verona, 
>>> we're able to use Capsicum to run unmodified libraries in a sandbox, 
>>> for example, including handling raw system calls:
>>> It would be good to understand why this needs more kernel attack 
>>> surface.
>>> David
>> If it can work like that then it's pretty cool.  It could be a lot 
>> more secure.  But it's just not the way I went with. Re-implementing 
>> so much kernel functionality in userland seems like a lot of work. 
>> Because I wanted my module to be able to sandbox (almost) everything 
>> that the OS can run.  Including whole process hierarchies that 
>> execute other programs and use process management and shared memory, 
>> etc.  That's a lot of little details to get right...  So I went with 
>> the same route that jails, other MAC modules and even Capsicum are 
>> implemented: with access checks in the kernel itself.  And most of 
>> these checks were already in place with MAC hooks.
> My concern with adding it to the kernel is that anything that does 
> path-based checks is *incredibly* hard to get right and it will fail 
> open.  To date, there are zero examples of path-based sandboxing 
> mechanisms deployed in the wild that have not had vulnerabilities 
> arising from the nature of the problem.  The filesystem is, 
> inherently, concurrent.  A process can mutate the shape of the 
> filesystem graph while you are doing path-based checks, mostly around 
> the handling of '..' in paths.  Jails and Capsicum sidestep this in 
> different ways:
> Jails effectively punt the problem to the jail orchestration code.  
> They provide very strong restrictions on the paths, with a single root 
> and allowing all access within this.  There are a few restrictions on 
> what you can do from outside of a jail to avoid allowing the jailed 
> process to exploit TOCTOU differences and escaping but fortunately 
> these align with the use of jails as isolated containers containing 
> (minimal) base system.
> Capsicum simply disallows '..' in paths.  If you want to support it in 
> user code then you must do path resolution in userspace. You may still 
> have TOCTOU bugs, but they'll all fail closed: you will try to resolve 
> the result, discover that you don't have a file descriptor 
> corresponding to the path, and fail.

Yeah it's by far the hardest part of this module for the reasons you 
describe.  Directories can move while you're holding references to them 
or even while you're traversing them.  I mean, this is hell.  My 
implementation has flaws, but I believe that they have a limited impact, 
but one of them is a big deal and it means you have to be careful with 
what you do from outside the sandbox.  I'll try to explain below for 
anyone interested.  It's similar to the situation with jails.  I still 
hope I can improve it though.

>> pledge()/unveil() are usually used for fairly well-disciplined 
>> applications that either don't run other programs or run very 
>> specific programs that are also well-disciplined and don't expect too 
>> much (unless you just drop the pledges on execve()).
> The execve hole is the reason that I have little interest in pledge as 
> an enforcement mechanism.  If a process can just execve itself to 
> escape, then that's a trivial hole to exploit unless you're incredibly 
> careful to make sure that the process does not have the ability to 
> create or read files with executable privilege on the filesystem.

I see this with some other sandboxing frameworks too, they allow to 
execute other programs with higher permissions.  That's something I 
wanted to avoid completely.  Look how difficult it is to write safe 
setugid binaries.  All programs you'd allow to run would have to be safe 

pledge() can inherit its restrictions though. If you specify 
restrictions on the second argument, the executed process will never be 
able to drop those. Unveils will be inherited as well (but it will not 
be possible to use unveil() again, so this is not fully "nestable" 
sandboxing as I call it).

But pledge() isn't always used like that because a lot of programs just 
won't work with inherited pledge restrictions if they weren't designed 
for it.

That's what I wanted my module to support better.

> In contrast, something using Capsicum can create child processes but 
> they inherit the same limitations.  It can inherit file descriptors 
> from the parent, so if it is using something like libpreopen then it 
> can inherit a large number of file descriptors for any of the files / 
> directories that it should be permitted to open.

With my module, generally the sandbox is inherited.  When you use it 
with curtain(1), that's how it is.  The ability to drop the restrictions 
on-exec was mostly added for pledge() compatibility. And since my module 
is designed for arbitrary nesting, what actually happens when you 
request to drop restrictions on-exec with pledge() is that you return to 
the restrictions that your process inherited at first (which is no 
restrictions only if you started with no restrictions).

> Since rtld was extended to allow direct execution mode, you can launch 
> dynamically linked binaries in Capsicum mode.  With the SIGCAP things 
> in, it becomes easy to write a 
> signal handler that intercepts blocked system calls and handles them 
> (I'm running with this applied and doing exactly that), so this can be 
> transparent to any dynamically linked binary.

Yeah it's interesting.  I see the "process descriptor" thing too that 
could make process management capability-based.  Getting a general 
compat layer built on top of Capsicum could be the best of both worlds.

>> Pledged applications usually reduce the kernel attack surface a lot, 
>> but you don't run arbitrary programs with pledge (and that wasn't one 
>> of its goals AFAIK). But that's what I wanted my module to be able to 
>> do.  I'd say it has become a bit of a weird hybrid between a 
>> "container" framework and an exploit mitigation framework at this 
>> point. You can run a `make buildworld` with it, build/install/run 
>> random programs isolated in your project directories, sandbox 
>> shell/desktop sessions as a whole, etc.  And then within those 
>> sandboxes, nested applications can do their own sandboxing on top of 
>> it (with this module (and its pledge/unveil compat) or Capsicum (and 
>> possibly other compat layers built on top of it)).  The "inner" 
>> programs can use more restrictive sandboxes that don't expose as much 
>> kernel functionality.  But for the "outer" programs the whole thing 
>> slides more towards being "containers"/"jails" (and the more complex 
>> it would have been to do purely in userland I believe).
> So how do you avoid TOCTOU bugs in your path logic?  I don't disagree 
> with the goals, I worry that you're doing something that is 
> intrinsically almost impossible to get right.
> David

You're right, it's not done right.  Not fully so.  The security rests on 
the module never allowing sandboxed processes to alter the directory 
hierarchy in a way that could cause exploitable confusion.  But if the 
alterations come from outside, then it can be unsafe.  This is similar 
to jails but (as you said) it's probably less risky there because of the 
way jails tend to be used.

And I could be wrong about all of this... this is going to need auditing.

But IIUC, the risk mainly comes from moving directories from inside of 
the sandbox to outside of it.

Say you sandbox firefox and it has write access to ~/.firefox and a 
separate $TMPDIR, moving any sub-directory from inside of those to 
somewhere outside is extremely unsafe.  But moving directories between 
~/.firefox and the $TMPDIR, this is fine.  Moving directories from 
outside to inside, also fine (but they better stay in there).  You could 
expect no one to do that, but say firefox has access to ~/Downloads 
though, moving directories out of it is a lot more tempting... So that's 
a big flaw.

Moving directories from a writable sandboxed directory to a read-only 
sandboxed directory is also dangerous for the same reasons.

But all of this has to be done from outside the sandbox.

I'm not super happy with it.  I wish there could be protections for 
that.  But I'm not sure how.  I'd have to keep thinking about it. But 
for now you have to be careful with what you do from outside of the 
sandbox to directories unveiled inside of it.  I think it's tolerable.  
Even if just barely...  In practice I can imagine it causing problems.

The actual implementation is not all that complicated in principle and 
somewhat similar to jails in some ways.  Instead of the process having 
one "jail root directory", it has as many as it has unveils. Each 
curtain unveil entry holds a vnode reference to its directory. And, when 
you unveil a path, ALL parent directories are also added to the curtain 
as unveil entries (with permissions to traverse them).

When path lookup starts, if needed, it'll ascend the directory hierarchy 
to find the unveil entry that "covers" the start directory.  After this, 
checks are strictly done against the curtain, there's no more searching 
for parent directories.  The module just reacts to traversals of 
directories that it knows about.

Whenever you land on a directory during path lookup (including after 
".." lookups), the curtain is searched for the vnode.  The dangerous 
part of it is when you descend into a directory with no corresponding 
curtain unveil entries.  Then it switches into "uncharted" mode where it 
assumes that ALL files can inherit the parent unveil's permissions.  But 
this is only until you hit another vnode that the curtain knows about.  
Then it'll leave "uncharted" mode and inherited permissions will not 
carry upwards during ".." lookups.  Each unveil (and all of their parent 
directories) act like a sort of firewall for the hierarchy confusion 
down below (or at least that's the idea).  And to cause confusion within 
this outer directory "skeleton", you need to have permissions to alter 
it in the first place!

But within "uncharted" directories anything can happen (using the 
permissions of the unveil that covers it).  If you can move a directory 
into the depths of another directory, you can trick curtain into using 
the permissions of the moved directory on the content of the target 
directory (curtain won't know where the directory really is when it 
handles ".." lookups).  But here's the thing: to move that directory 
there, the unveils must give write access to both the source and target 
directories in the first place!  So what was gained?  The hierarchy 
confusion isn't so bad.

That's a lie though.  You can get permissions to modify file attributes 
on a directory for which you only had write permissions this way (those 
can be separate unveil permissions).  That's something I'll try to fix.  
It's just a problem for writable directories that can be moved though, 
separate file attribute change permissions are still useful for /dev 
files for example (sandboxed root processes are correctly prevented from 
changing permissions on /dev/null even if they can write to it).

All of this is kind of hard to reason about...  So I hope I got this 
right (and there could be bugs too obviously).