Re: curtain: WIP sandboxing mechanism with pledge()/unveil() support

From: David Chisnall
Date: Tue, 29 Mar 2022 08:34:08 UTC

Does pledge actually require kernel support?  I'd have thought that it 
could be implemented on top of Capsicum as a purely userland abstraction 
(more easily with libc help, but even with an LD_PRELOADed library along 
the lines of libpreopen).  In Verona, we're able to use Capsicum to run 
unmodified libraries in a sandbox, for example, including handling raw 
system calls:

It would be good to understand why this needs more kernel attack surface.


On 28/03/2022 10:37, Mathieu wrote:
> Hello list.  For a while now I've been working on and off on a
> pledge()/unveil() implementation for FreeBSD.  I also wanted it to be 
> able to sandbox arbitrary programs that might not expect it with no (or 
> very minor) modifications.  So I just kept adding to it until it could 
> do that well enough.  I'm still working on it, and there are some known 
> issues and some things I'm not sure are done correctly, but overall it's 
> in a very functional state now. It can run most utilities and desktop
> apps (though dbus/dconf/etc are trouble), server daemons, buildworld
> and whole shell/desktop sessions sandboxed without modification.
> It can be broken up into 4 parts: 1) A MAC module that implements most of
> the functionality.  2) The userland library, sandboxing utility, configs 
> and tests.  3) Various kernel changes needed to support it (including 
> new MAC handlers and extended syscall filtering).  4) Small 
> changes/fixes to the base userland (things like adding reporting to ps 
> and modifying some utilities to use $TMPDIR so that they can be properly 
> sandboxed).  So 1) and 2) could be in a port.  And I tried to minimize 
> 3) and 4) as much as possible.
> I noted some problems/limitations in the CURTAIN-ISSUES file.  At this 
> point I'm mostly wondering about the general design being acceptable for 
> merging eventually.  Because most of this could be part of a port, but 
> not all of it.  And the way that it deals with filesystem access 
> restrictions in particular is kludgy.  So any feedback/testing welcome.
> It still lacks documentation (in part because I'm not sure of what could 
> still change) so I'm going to give an overview of it here and show some 
> examples and that's going to be the documentation for now.  And I'll 
> describe the kernel changes that it needed.  So that's going to be a bit 
> of a long email.
> What it can do:
> ~~~~~~~~~~~~~~~
> It can restrict syscalls and various abilities (by categories that were 
> based on OpenBSD's pledge promises), ioctls, sysctls, socket 
> options/address families, priv(9) privileges, and filesystem access by 
> path.  It can be used at the same time as jails and Capsicum (their 
> restrictions are also enforced on top of it).
> It can be used in a nested manner.  A program that inherits sandbox 
> restrictions can do its own internal sandboxing or sandbox programs that 
> it runs (which can then do the same, etc).  The permissions of new
> sandboxes are always a subset of the inherited sandbox.
> Certain kernel operations are protected by "barriers" which only allow a 
> sandboxed process to operate on kernel objects that were created by 
> itself or a descendant sandbox.  There are barriers for 
> inspecting/signaling/debugging processes, POSIX/SysV IPC objects, PTYs, 
> etc.  Barriers have their own hierarchy which can diverge from the 
> process hierarchy.
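
The barrier rule described above can be modelled in a few lines. This is only a sketch of the idea, not the mac_curtain code: the `Barrier` class, its `parent` link and the `may_operate` helper are hypothetical names chosen for illustration, and it assumes each kernel object simply records the barrier it was created under.

```python
class Barrier:
    """Hypothetical stand-in for a kernel barrier object with a parent link."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def may_operate(subject, target):
    """True if `target` is `subject` itself or one of its descendants.

    `subject` is the barrier of the calling process; `target` is the barrier
    recorded on the kernel object when it was created.
    """
    node = target
    while node is not None:
        if node is subject:
            return True
        node = node.parent  # walk up towards the root barrier
    return False


outer = Barrier("outer")
inner = Barrier("inner", parent=outer)
sibling = Barrier("sibling", parent=outer)
```

With these three barriers, a process behind `outer` may operate on objects created behind `inner`, but a process behind `inner` may not reach objects belonging to `outer` or `sibling`.
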
> Restrictions can be specified in configuration files and can be 
> associated with named "tags".  Tags are assumed to match application 
> names; they're prefixed with "_" when they don't (just the convention
> I've been using so far).  Enabling a tag may cause other tags to be 
> enabled depending on configurations.  Permissions associated with 
> different tags are merged in a purely additive manner.  Configurations 
> can be spread in multiple files and directories 
> (/usr/local/etc/curtain.{conf,d} can be used for packages, 
> ~/.curtain.{conf,d} for user customizations).  It'll check the .d 
> directories for files named after the enabled tags.
> Usage examples:
> ~~~~~~~~~~~~~~~
> curtain(1) is the wrapper utility to sandbox arbitrary programs. Default 
> permissions are in /etc/defaults/curtain.conf and /etc/curtain.conf.
> Here are a bunch of examples.  A bit random, but they demonstrate a lot of
> the functionality.
> $ curtain id
> Not very exciting, but it works.  The default permissions don't give it 
> access to the user DB so it only shows numeric IDs.  It can be given 
> access with the "_pwddb" tag:
> $ curtain -t _pwddb id
> It's possible to nest sandboxes, but it needs the "curtain" tag because 
> the curtain config files are not unveiled by default (they could be 
> though, maybe they should be...).
> Here, id cannot read the user DB because the outer sandbox doesn't allow 
> it:
> $ curtain -t curtain curtain -t _pwddb id
> But this way it can:
> $ curtain -t curtain -t _pwddb curtain -t _pwddb id
> Starts a sandboxed shell session with access to ~/work in a clean 
> environment:
> $ mkdir -p ~/work && curtain -p ~/work:rwx -S
> You'll probably miss your dotfiles though.  If you browse around you'll 
> see what paths get unveiled by default.
> If you try to list processes:
> $ curtain ps -ax
> You'll just see the ps process itself.  It can be allowed to see 
> processes outside of it like this:
> $ curtain -d ability-pass:ps ps -ax
> But it will not be allowed to signal, reprioritize or debug them (there 
> are other "abilities" for that).  The "-pass" means to allow the ability 
> in a "passthrough" manner (beyond the sandbox's barrier).  Visibility 
> could also be blocked at an outer sandbox's barrier, like so:
> $ curtain -t curtain curtain -d ability-pass:ps ps -ax
> Give read-only access to the current directory and list files:
> $ curtain -p . ls
> If you have $CLICOLOR set, it may look less colorful than usual. 
> curtain(1) is a bit paranoid and will filter out most control characters 
> written to the TTY by default (and set $TERM to "dumb").  They can be 
> let through with -R:
> $ curtain -R -p . ls
> And -T can be used to stop it from doing PTY wrapping altogether and 
> give the program direct access to the TTY (which is less secure, but 
> there are ioctl restrictions).
> Per-path permissions can be specified after a ":".  More specific paths 
> override the permissions of less specific paths.
> $ curtain -p .:rw -p ./secret: -p ./dev:rwx -p ./data:r ...
> Then these paths would have these permissions:
>      ./:rw
>      ./123:rw
>      ./secret:
>      ./dev:rwx
>      ./dev/123:rwx
>      ./data:r
>      ./data/123:r
> As an example of how nested sandboxing is handled, if you were then to 
> do this within this sandbox (don't forget to give it the "curtain" tag):
> $ curtain -p .:r -p ./dev:rx -p ./data:rw ...
> Then the permissions would end up being:
>      ./:r
>      ./123:r
>      ./secret:
>      ./dev:rx
>      ./dev/123:rx
>      ./data:r
>      ./data/123:r
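
The two listings above are consistent with a simple model: within one sandbox the most specific matching path wins, and a nested sandbox keeps only what both layers grant. The sketch below assumes that model; `parts`, `lookup` and `effective` are hypothetical helpers, not libcurtain functions, and the real implementation works on vnodes rather than path strings.

```python
def parts(path):
    """Split a relative path into components, ignoring '.' and empty parts."""
    return tuple(c for c in path.split("/") if c not in ("", "."))


def lookup(rules, path):
    """Permissions of the most specific rule covering `path` (deny if none)."""
    target = parts(path)
    best, best_len = frozenset(), -1
    for prefix, perms in rules.items():
        p = parts(prefix)
        # A rule covers the path when its components are a prefix of it.
        if target[:len(p)] == p and len(p) > best_len:
            best, best_len = frozenset(perms), len(p)
    return best


def effective(outer, inner, path):
    """A nested sandbox keeps only what both layers grant."""
    return lookup(inner, path) & lookup(outer, path)


# The -p options of the outer and inner curtain invocations shown above.
outer = {".": "rw", "./secret": "", "./dev": "rwx", "./data": "r"}
inner = {".": "r", "./dev": "rx", "./data": "rw"}

for path in ["./", "./123", "./secret", "./dev", "./dev/123", "./data", "./data/123"]:
    print(path + ":" + "".join(sorted(effective(outer, inner, path))))
```

Note how `./data` ends up read-only even though the inner sandbox asked for `rw`, because the outer sandbox only granted `r`.
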
> root processes can be sandboxed too.  Some privileges are allowed by 
> default (which is similar to the set of privileges allowed by jails), 
> but most are denied.  As are accesses to most /dev and /etc files.  For 
> example, tcpdump will not be able to use bpf(4):
> # curtain tcpdump
> But there's a tag for that:
> # curtain -t _bpf tcpdump
> Something else that won't work:
> $ curtain node -e 'console.log(2+2)'
> It wants to do a PROT_EXEC mprotect(2) which is not allowed by default. 
> By default, PROT_EXEC is only allowed when mmap(2)'ing files that are 
> unveiled for execution.
> $ curtain -d ability:prot_exec node -e 'console.log(2+2)'
> Just what is allowed by default?  Well it's kind of arbitrary and messy 
> and there are 10 levels of it.
> curtain(1) uses a 10-level "permissions tower" usable with options -0
> to -9 (which enable tags "_level0" to "_level9"). These are mostly just 
> meant to be used as a quick way to try giving programs more or less 
> access from the command-line (ideally a profile should be made to give 
> programs just what they need). The default level currently is 5 (which 
> is fairly permissive compared to most pledge(3)'d applications).  All 
> levels are intended to be securely containable, but each level exposes a 
> greater attack surface than the previous one.  Level 9 is the "please 
> just work" level.  It allows using all ioctls, reading all sysctls
> and making almost all rare syscalls.  Filesystem access is still very
> restricted though so you've still got to figure out what unveils the 
> program needs.
> And there's another dimension to it which is the "unsafety level". 
> Directives in the config files can be suffixed with one or more "!" to 
> indicate that the permissions that it gives are potentially unsafe, 
> depending on circumstances, or could be surprising or undesired.  The 
> directive only applies when curtain(1) is invoked with as many or more 
> "-!" options.  This was more useful at the beginning when many features 
> weren't properly sandboxed yet.  Now it's not used as much.  But I still 
> find it useful.  The way I'm using it is "!" is probably no big deal but 
> you might want to check it if you're paranoid, "!!" has a real risk of 
> allowing escapes in certain plausible scenarios, and "!!!" is very 
> likely insecure unless special precautions are taken.
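
The "as many or more" rule for unsafety markers can be sketched as follows. The directive strings here are made up for illustration and the helper names are hypothetical; the sketch only assumes that the "!" markers appear as a suffix, as described above, and says nothing about the actual config parser.

```python
def split_unsafety(directive):
    """Return (name, n) where n is the number of trailing '!' markers."""
    name = directive.rstrip("!")
    return name, len(directive) - len(name)


def applies(directive, bang_options):
    """A directive applies when invoked with at least as many '-!' options."""
    _, required = split_unsafety(directive)
    return bang_options >= required
```

For instance, a directive carrying "!!" would be skipped when curtain(1) is given a single "-!" and honoured when given two or more.
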
> I'm still not sure what the defaults should be or how they could be 
> better organized.  The "unsafety" is an odd thing to expose to the user 
> and as much as possible I tried to make it unnecessary.
> So anyway, a shorter way to make nodejs work is to use level 6 which 
> allows PROT_EXEC on anonymous memory (and to execute binaries in $TMPDIR 
> too):
> $ curtain -6 node -e 'console.log(2+2)'
> Now with X programs:
> $ curtain -X xlogo
> $ curtain -X xterm
> -X gives "untrusted" X11 access, -Y "trusted" access (like with ssh) and 
> -W is for Wayland.
> There's an example config file with sample application profiles that can 
> be enabled by uncommenting the include line in /etc/curtain.conf (and 
> reading this file is a good way to see how the whole thing works). 
> Profiles can be used with -a/-A.  Both simply enable the tag named after 
> the program.  -A is a shortcut that also enables "unsafety level" 1 
> (most profiles don't actually need it, but some do, so I just use it all 
> the time).
> $ curtain -XA xterm
> $ curtain -XA firefox
> $ curtain -XA chrome
> $ curtain -XA falkon
> $ curtain -XA qbittorrent
> $ curtain -XA hexchat
> $ curtain -XA gimp
> $ curtain -XA audacious
> # curtain -A tcpdump
> Programs started this way still have the default level 5 permissions in 
> addition to their profile permissions.
> Option -k ("kill") enables "strict" mode where the default becomes level 
> 1 and programs are sent SIGKILL when trying to do something forbidden 
> (otherwise they just get EPERM errors).  I made those two things go 
> together because unexpected restrictions can make programs misbehave and 
> this could lead to security issues.  This reduces the attack surface but 
> it also means you've got to figure out the permissions just right or 
> your programs are going to get killed a lot.  Also, trying to access 
> non-unveiled files does not cause a SIGKILL to be sent yet, so missing 
> unveils have the potential to cause insecure misbehavior too.
> See the config files here:
> How well does it generally work?
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Well, there are some problems.
> First of all, "untrusted" X11 access doesn't work all that great. Some 
> programs are just unstable with it.  Firefox used to crash a lot with 
> X11 errors but for some reason it seems to have gotten a lot better 
> recently.  But there might be thick borders around menus, client-side 
> decorated windows won't be movable, the system tray won't work, 
> selection/clipboard will only work in one direction.  And it'll be slower.
> The alternative is to give them "trusted" X11 access but that's very 
> insecure.  And even untrusted access isn't so secure either, untrusted 
> programs are not isolated from one another IIUC.  And who knows what the 
> window manager, panels and others could be doing with the window 
> properties of untrusted clients...  And this exposes the huge complexity 
> of the X11 server.
> Wayland's security is supposed to be much better, but it depends on how 
> the compositors handle security on the extra protocols that they support 
> and IIUC there's not a consensus on how it should be handled yet and 
> most compositors still lack security restrictions (but apparently some 
> people just compile out their support for insecure protocols).
> Programs that have built-in support for privilege-separation and 
> self-sandboxing can solve this by not giving direct access to the 
> display to the sandboxed parts.  And that's something that this 
> implementation means to support (which can be done on top of sandboxing 
> the application as a whole).  But it's not a general solution.
> Also, dbus/dconf/pulseaudio/etc are not dealt with very well yet. 
> They're just ignored really.  And (a bit surprisingly) many programs 
> seem OK with that.  fontconfig will complain a lot but if the font 
> caches are already up to date it doesn't look like it matters (startup 
> will be much slower otherwise).  pulseaudio will just die when firefox 
> tries to start it but then it'll fall back to using OSS directly (sndio
> works too).  Thumbnail caches won't be accessible.  The XDG shared 
> recent documents list won't work. dconf will be completely 
> non-functional and some programs won't be able to save their settings. 
> Etc.  And even when it works, "desktop integration" in general is going 
> to be very degraded.  A program trying to launch the desktop 
> environment's handler program to open a file or URL probably won't work 
> because it'll inherit an overly restrictive sandbox.  I haven't really
> gotten into trying to deal with this better yet.  I see that there are 
> dbus proxy services for sandboxing on Linux.  It would probably need 
> something like that.
> There are some scripts to sandbox programs with separate XDG directories 
> or separate $HOME in /usr/share/examples/curtain/. But I wish doing this 
> wouldn't be necessary...
> For non-desktop programs, it generally just works (if you give them 
> enough permissions).  The main thing causing trouble is usually /tmp.
> About the userland parts:
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> libcurtain is a wrapper around the sandboxing syscall.  It allows
> permissions to be assigned to "slots" which then get merged.  Path
> permissions can override each other (most specific wins) within a slot,
> but across slots they are merged in a non-interfering way (a more
> specific permission never cancels out a less specific permission from a
> different slot).  Permissions from different bracketed sections of
> config files are added to different slots, so they all get merged in
> this way.
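
One way to read the slot rule is: resolve each slot on its own with "most specific wins", then take the union of the results. The sketch below assumes exactly that reading; `parts`, `lookup` and `merged` are hypothetical helpers rather than the libcurtain API, and paths are plain strings for simplicity.

```python
def parts(path):
    """Split a relative path into components, ignoring '.' and empty parts."""
    return tuple(c for c in path.split("/") if c not in ("", "."))


def lookup(rules, path):
    """Within one slot, the most specific rule covering `path` wins."""
    target = parts(path)
    best, best_len = frozenset(), -1
    for prefix, perms in rules.items():
        p = parts(prefix)
        if target[:len(p)] == p and len(p) > best_len:
            best, best_len = frozenset(perms), len(p)
    return best


def merged(slots, path):
    """Across slots, per-slot results are combined additively."""
    result = frozenset()
    for rules in slots:
        result |= lookup(rules, path)
    return result


slot_a = {".": "rw", "./secret": ""}   # hides ./secret within this slot
slot_b = {".": "r"}                    # grants read on everything under .
```

Here `merged([slot_a], "./secret/key")` is empty, but `merged([slot_a, slot_b], "./secret/key")` still contains `r`: the more specific empty rule in `slot_a` does not cancel the broader grant from `slot_b`.
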
> Config files are also handled by libcurtain.  Applications can use 
> libcurtain directly to sandbox themselves using tags, but the API for 
> that is more complex than it should be and I'm probably going to make 
> more changes to it.
> I added a freebsd_simple_sandbox() function directly to libc that tries 
> to load libcurtain and applies a tag.  The idea is to make it as easy as 
> possible to add configurable, opportunistic sandboxing to applications 
> without having to link them to libcurtain.  It can be called multiple 
> times at different stages of initialization of an application, or for 
> different sub-processes, etc.  The application just specifies a tag for 
> each call and the details are in the config files.  Conceivably, there 
> could be different backends implementing the sandboxing.
> libcurtain also contains the pledge()/unveil() implementation.  On 
> OpenBSD, pledge/unveil are available directly in libc (with the 
> declarations in unistd.h), but the portable versions of some OpenBSD 
> programs have problems if pledge/unveil are available on non-OpenBSD 
> platforms because they just don't expect that.  After fixing them, maybe 
> auto-loading wrappers could be added directly to libc too so that they 
> just work without having to deal with libcurtain dependencies.
> About the kernel-side parts:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Most of the implementation is in a separate mac_curtain module, but it 
> also needed some changes spread out in the kernel to support it.  That's 
> what would need to be merged.
> The biggest change is adding "sysfils".  It initially just meant 
> "syscall filters" but now it's more of a general category of things that 
> the kernel can do.  Syscalls can be associated with zero or more 
> required sysfils and some explicit sysfil checks were added in various 
> places in the kernel as needed.  ucreds have a set of allowed sysfils. 
> Sysfils are represented as simple bitmaps and checks are fast.  Capsicum 
> was slightly modified to make use of a sysfil bit to simplify syscall 
> entry checks.
> Sysfils are meant to be part of the internal kernel API, they're not 
> exposed to the userland.  The curtain module exposes intermediate 
> "abilities" instead.
> Some checks that checked for "capability mode" now check for a more 
> general "restricted mode" instead.  A process is considered in 
> restricted mode whenever its ucred is missing any sysfil bit.
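
The bitmap checks described above come down to two mask operations. The sketch below uses made-up sysfil names and bit positions purely for illustration; the real set of sysfils and their layout in the ucred are not shown in this message.

```python
# Hypothetical sysfil bits; the real names and numbering are assumptions.
VFS_READ, VFS_WRITE, NET, PROC = 1, 2, 4, 8
ALL_SYSFILS = VFS_READ | VFS_WRITE | NET | PROC


def syscall_allowed(required, allowed):
    """A syscall passes when every sysfil it requires is in the allowed set."""
    return (required & ~allowed) == 0


def restricted(allowed):
    """A credential is in restricted mode when any sysfil bit is missing."""
    return (ALL_SYSFILS & ~allowed) != 0
```

A syscall that requires zero sysfils always passes, and a credential holding every bit is the only one that is not considered restricted.
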
> MAC handlers were added to let curtain hook into places that didn't have 
> MAC checks.  Some of those new handlers definitely seem out of place.
> The new vnode "walk" functions are more of a low-level mechanism than 
> just a security policy.  And many of the new handlers want to restrict 
> access to certain functionality as a whole (e.g. ioctls, sockopts, 
> procctls, etc) rather than compare labels.  But it seemed like the best 
> place to add them because MAC already did most of what was needed.  So 
> I've been treating the MAC framework like it stands for "Modular Access 
> Checks" or something.
> The curtain permissions are stored in "curtain" objects.  Process ucreds 
> have their labels point to a curtain.  Curtains have pointers to 
> "barrier" objects, which contain the hierarchical linkage needed to 
> restrict access to protected kernel objects. Those kernel objects have 
> their labels point directly to barriers.  Barriers can outlive their 
> curtains.  When a ucred loses its last reference from a process, it is 
> "trimmed" and its label curtain pointer "decays" into a pointer to the 
> curtain's barrier so that the curtain can be freed (because curtains can 
> be a few KBs and they can hold vnode references).  A lot of objects hold 
> references to ucreds, so they could build up a lot without this.
> Processes can sandbox themselves with curtainctl(2).  They have to 
> specify the full set of permissions they want to retain.  The requested 
> permissions are then masked with the current curtain (if any).  This 
> involves dealing with inheritance relationships between permissions (as 
> the new curtain can have permissions more specific than the old and vice 
> versa).
> Kernel-side handling of filesystem path unveiling was the hardest part 
> to deal with (given the "statelessness" of the vnode API) and it kind of 
> is all a big kludge.  I tried to make it as nice as possible and wrapped 
> the whole thing behind a MAC API (it used to be a lot worse than that).
> Each directory "unveil" acts like a sort of chroot barrier but with 
> specific permissions.  There's a per-thread "tracker" with a circular 
> buffer that remembers the permissions for the previous N looked-up 
> vnodes.  N only needs to be 2 as far as I can tell (most syscalls only 
> need 1, but linkat() for example needs 2).  The tracker has weak vnode 
> references and doesn't need to be cleaned up after syscalls.  namei() 
> calls the new MAC handlers to manage the tracker during path lookup. 
> fget*() also adds a tracker entry.  Then the access check MAC handlers 
> can find permissions for the passed vnodes in the tracker.  This only 
> works because almost all of the kernel code that works on vnodes first
> gets a reference from namei()/fget*() and then doesn't call VOP_LOOKUP()
> directly itself.  It's messy but one good thing with it is that it
> usually "fails-secure" if the tracker was mismanaged because it won't 
> find the vnode in it and it defaults to deny.
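
The tracker behaviour described above (a small circular buffer, default deny on a miss) can be illustrated with a toy model. This is a sketch under simplifying assumptions: the `Tracker` class and its methods are hypothetical, vnodes are represented by plain strings, and weak references and locking are ignored.

```python
from collections import deque


class Tracker:
    """Toy per-thread tracker remembering the last N looked-up vnodes."""

    def __init__(self, size=2):
        # A bounded deque drops the oldest entry once it is full.
        self.entries = deque(maxlen=size)

    def record(self, vnode, perms):
        """Called from the lookup path (namei()/fget*() in the description)."""
        self.entries.append((vnode, frozenset(perms)))

    def check(self, vnode, wanted):
        """Called from an access check; unknown vnodes are denied."""
        for seen, perms in reversed(self.entries):
            if seen == vnode:
                return wanted in perms
        return False  # not tracked: fail secure


t = Tracker()
t.record("vnode-a", "r")
t.record("vnode-b", "rw")
```

After a third `record()` call the entry for `vnode-a` is pushed out of the buffer, and any later check on it is denied rather than allowed, which is the fail-secure property mentioned above.
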