Re: curtain: WIP sandboxing mechanism with pledge()/unveil() support

From: Baptiste Daroussin <>
Date: Wed, 30 Mar 2022 07:22:10 UTC
On Mon, Mar 28, 2022 at 05:37:44AM -0400, Mathieu wrote:
> Hello list.  For a while now I've been working on and off on a
> pledge()/unveil() implementation for FreeBSD.  I also wanted it to be able
> to sandbox arbitrary programs that might not expect it with no (or very
> minor) modifications.  So I just kept adding to it until it could do that
> well enough.  I'm still working on it, and there are some known issues and
> some things I'm not sure are done correctly, but overall it's in a very
> functional state now. It can run unmodified most utilities and desktop apps
> (though dbus/dconf/etc are trouble), server daemons, buildworld and whole
> shell/desktop sessions sandboxed.
> It can be broken up in 4 parts: 1) A MAC module that implements most of the
> functionality.  2) The userland library, sandboxing utility, configs and
> tests.  3) Various kernel changes needed to support it (including new MAC
> handlers and extended syscall filtering).  4) Small changes/fixes to the
> base userland (things like adding reporting to ps and modifying some
> utilities to use $TMPDIR so that they can be properly sandboxed).  So 1) and
> 2) could be in a port.  And I tried to minimize 3) and 4) as much as
> possible.
> I noted some problems/limitations in the CURTAIN-ISSUES file.  At this point
> I'm mostly wondering about the general design being acceptable for merging
> eventually.  Because most of this could be part of a port, but not all of
> it.  And the way that it deals with filesystem access restrictions in
> particular is kludgy.  So any feedback/testing welcome.
> It still lacks documentation (in part because I'm not sure of what could
> still change) so I'm going to give an overview of it here and show some
> examples and that's going to be the documentation for now.  And I'll
> describe the kernel changes that it needed.  So that's going to be a bit of
> a long email.
> What it can do:
> ~~~~~~~~~~~~~~~
> It can restrict syscalls and various abilities (by categories that were
> based on OpenBSD's pledge promises), ioctls, sysctls, socket options/address
> families, priv(9) privileges, and filesystem access by path.  It can be used
> at the same time as jails and Capsicum (their restrictions are also enforced
> on top of it).
> It can be used in a nested manner.  A program that inherits sandbox
> restrictions can do its own internal sandboxing or sandbox programs that it
> runs (which can then do the same, etc).  The permissions of new sandboxes are
> always a subset of the inherited sandbox.
> Certain kernel operations are protected by "barriers" which only allow a
> sandboxed process to operate on kernel objects that were created by itself
> or a descendant sandbox.  There are barriers for
> inspecting/signaling/debugging processes, POSIX/SysV IPC objects, PTYs,
> etc.  Barriers have their own hierarchy which can diverge from the process
> hierarchy.
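
A minimal sketch of the barrier rule described above, assuming barriers form a simple parent-linked tree. The `Barrier` class and `may_operate()` are hypothetical names for illustration, not the module's actual data structures:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Barrier:
    """Hypothetical stand-in for a barrier; `parent` is the enclosing barrier."""
    name: str
    parent: Optional["Barrier"] = None


def may_operate(subject: Barrier, owner: Barrier) -> bool:
    """True if an object labeled with `owner` sits at or below `subject`.

    Walks up from the object's barrier; reaching the subject's barrier means
    the object was created by the subject's sandbox or by a descendant.
    """
    node = owner
    while node is not None:
        if node is subject:
            return True
        node = node.parent
    return False


outer = Barrier("outer")
inner = Barrier("inner", parent=outer)
sibling = Barrier("sibling", parent=outer)

print(may_operate(outer, inner))    # outer sandbox can reach into inner
print(may_operate(inner, outer))    # inner cannot reach out
print(may_operate(inner, sibling))  # siblings are isolated from each other
```

Because the check only follows barrier parent links, it stays independent of the process tree, which is consistent with the note that the two hierarchies can diverge.
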
> Restrictions can be specified in configuration files and can be associated
> with named "tags".  Tags are assumed to match application names, they're
> prefixed with "_" when they don't (just the convention I've been using so
> far).  Enabling a tag may cause other tags to be enabled depending on
> configurations.  Permissions associated with different tags are merged in a
> purely additive manner.  Configurations can be spread in multiple files and
> directories (/usr/local/etc/curtain.{conf,d} can be used for packages,
> ~/.curtain.{conf,d} for user customizations).  It'll check the .d
> directories for files named after the enabled tags.
> Usage examples:
> ~~~~~~~~~~~~~~~
> curtain(1) is the wrapper utility to sandbox arbitrary programs. Default
> permissions are in /etc/defaults/curtain.conf and /etc/curtain.conf.
> Here are a bunch of examples.  A bit random, but they demonstrate a lot of the
> functionality.
> $ curtain id
> Not very exciting, but it works.  The default permissions don't give it
> access to the user DB so it only shows numeric IDs.  It can be given access
> with the "_pwddb" tag:
> $ curtain -t _pwddb id
> It's possible to nest sandboxes, but it needs the "curtain" tag because the
> curtain config files are not unveiled by default (they could be though,
> maybe they should be...).
> Here, id cannot read the user DB because the outer sandbox doesn't allow it:
> $ curtain -t curtain curtain -t _pwddb id
> But this way it can:
> $ curtain -t curtain -t _pwddb curtain -t _pwddb id
> Starts a sandboxed shell session with access to ~/work in a clean
> environment:
> $ mkdir -p ~/work && curtain -p ~/work:rwx -S
> You'll probably miss your dotfiles though.  If you browse around you'll see
> what paths get unveiled by default.
> If you try to list processes:
> $ curtain ps -ax
> You'll just see the ps process itself.  It can be allowed to see processes
> outside of it like that:
> $ curtain -d ability-pass:ps ps -ax
> But it will not be allowed to signal, reprioritize or debug them (there are
> other "abilities" for that).  The "-pass" means to allow the ability in a
> "passthrough" manner (beyond the sandbox's barrier).  Visibility could also
> be blocked at an outer sandbox's barrier, like so:
> $ curtain -t curtain curtain -d ability-pass:ps ps -ax
> Give read-only access to the current directory and list files:
> $ curtain -p . ls
> If you have $CLICOLOR set, it may look less colorful than usual. curtain(1)
> is a bit paranoid and will filter out most control characters written to the
> TTY by default (and set $TERM to "dumb").  They can be let through with -R:
> $ curtain -R -p . ls
> And -T can be used to stop it from doing PTY wrapping altogether and give
> the program direct access to the TTY (which is less secure, but there are
> ioctl restrictions).
> Per-path permissions can be specified after a ":".  More specific paths
> override the permissions of less specific paths.
> $ curtain -p .:rw -p ./secret: -p ./dev:rwx -p ./data:r ...
> Then those paths would have those permissions:
>     ./:rw
>     ./123:rw
>     ./secret:
>     ./dev:rwx
>     ./dev/123:rwx
>     ./data:r
>     ./data/123:r
> As an example of how nested sandboxing is handled, if you were then to do
> this within this sandbox (don't forget to give it the "curtain" tag):
> $ curtain -p .:r -p ./dev:rx -p ./data:rw ...
> Then the permissions would end up being:
>     ./:r
>     ./123:r
>     ./secret:
>     ./dev:rx
>     ./dev/123:rx
>     ./data:r
>     ./data/123:r
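
The two permission listings above can be reproduced with a small model. This is only a sketch: it assumes "most specific wins" means the longest unveiled path prefix, and that nesting behaves like an intersection with the enclosing sandbox. The helper names are made up for illustration, and only the explicit `path:perms` form is handled:

```python
from pathlib import PurePosixPath


def parse(specs):
    """Turn 'path:perms' strings (the explicit -p form) into a rule table."""
    rules = {}
    for spec in specs:
        path, _, perms = spec.partition(":")
        rules[PurePosixPath(path)] = frozenset(perms)
    return rules


def resolve(rules, path):
    """Most specific wins: use the longest unveiled prefix, else deny."""
    p = PurePosixPath(path)
    for candidate in (p, *p.parents):
        if candidate in rules:
            return rules[candidate]
    return frozenset()


def nested(outer, inner, path):
    """Assumed nesting rule: keep only what the enclosing sandbox also grants."""
    return resolve(inner, path) & resolve(outer, path)


def show(perms):
    """Render a permission set in canonical 'rwx' order."""
    return "".join(c for c in "rwx" if c in perms)


outer = parse([".:rw", "./secret:", "./dev:rwx", "./data:r"])
inner = parse([".:r", "./dev:rx", "./data:rw"])

for path in ["./123", "./secret", "./dev/123", "./data/123"]:
    print(f"{path}: outer={show(resolve(outer, path))!r} "
          f"nested={show(nested(outer, inner, path))!r}")
```

Note how `./data` asks for `rw` in the inner sandbox but ends up with `r`, and how `./secret` stays empty even though the inner sandbox never mentions it.
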
> root processes can be sandboxed too.  Some privileges are allowed by default
> (which is similar to the set of privileges allowed by jails), but most are
> denied.  As are accesses to most /dev and /etc files.  For example, tcpdump
> will not be able to use bpf(4):
> # curtain tcpdump
> But there's a tag for that:
> # curtain -t _bpf tcpdump
> Something else that won't work:
> $ curtain node -e 'console.log(2+2)'
> It wants to do a PROT_EXEC mprotect(2) which is not allowed by default.  By
> default, PROT_EXEC is only allowed when mmap(2)'ing files that are unveiled
> for execution.
> $ curtain -d ability:prot_exec node -e 'console.log(2+2)'
> Just what is allowed by default?  Well it's kind of arbitrary and messy and
> there are 10 levels of it.
> curtain(1) uses a 10-level "permissions tower" usable with options -0 to -9
> (which enable tags "_level0" to "_level9"). These are mostly just meant to
> be used as a quick way to try giving programs more or less access from the
> command-line (ideally a profile should be made to give programs just what
> they need). The default level currently is 5 (which is fairly permissive
> compared to most pledge(3)'d applications).  All levels are intended to be
> securely containable, but each level exposes a greater attack surface than
> the previous one.  Level 9 is the "please just work" level.  It allows the
> use of all ioctls, reading all sysctls, and almost all rare syscalls.
> Filesystem access is still very restricted though so you've still got to
> figure out what unveils the program needs.
> And there's another dimension to it which is the "unsafety level". 
> Directives in the config files can be suffixed with one or more "!" to
> indicate that the permissions that it gives are potentially unsafe,
> depending on circumstances, or could be surprising or undesired.  The
> directive only applies when curtain(1) is invoked with as many or more "-!"
> options.  This was more useful at the beginning when many features weren't
> properly sandboxed yet.  Now it's not used as much.  But I still find it
> useful.  The way I'm using it is "!" is probably no big deal but you might
> want to check it if you're paranoid, "!!" has a real risk of allowing
> escapes in certain plausible scenarios, and "!!!" is very likely insecure
> unless special precautions are taken.
> I'm still not sure what the defaults should be or how they could be better
> organized.  The "unsafety" is an odd thing to expose to the user and as much
> as possible I tried to make it unnecessary.
> So anyway, a shorter way to make nodejs work is to use level 6 which allows
> PROT_EXEC on anonymous memory (and to execute binaries in $TMPDIR too):
> $ curtain -6 node -e 'console.log(2+2)'
> Now with X programs:
> $ curtain -X xlogo
> $ curtain -X xterm
> -X gives "untrusted" X11 access, -Y "trusted" access (like with ssh) and -W
> is for Wayland.
> There's an example config file with sample application profiles that can be
> enabled by uncommenting the include line in /etc/curtain.conf (and reading
> this file is a good way to see how the whole thing works).  Profiles can be
> used with -a/-A.  Both simply enable the tag named after the program.  -A is
> a shortcut that also enables "unsafety level" 1 (most profiles don't
> actually need it, but some do, so I just use it all the time).
> $ curtain -XA xterm
> $ curtain -XA firefox
> $ curtain -XA chrome
> $ curtain -XA falkon
> $ curtain -XA qbittorrent
> $ curtain -XA hexchat
> $ curtain -XA gimp
> $ curtain -XA audacious
> # curtain -A tcpdump
> Programs started this way still have the default level 5 permissions in
> addition to their profile permissions.
> Option -k ("kill") enables "strict" mode where the default becomes level 1
> and programs are sent SIGKILL when trying to do something forbidden
> (otherwise they just get EPERM errors).  I made those two things go together
> because unexpected restrictions can make programs misbehave and this could
> lead to security issues.  This reduces the attack surface but it also means
> you've got to figure out the permissions just right or your programs are
> going to get killed a lot.  Also, trying to access non-unveiled files does
> not cause a SIGKILL to be sent yet, so missing unveils have the potential to
> cause insecure misbehavior too.
> See the config files here:
> How well does it generally work?
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Well, there are some problems.
> First of all, "untrusted" X11 access doesn't work all that great. Some
> programs are just unstable with it.  Firefox used to crash a lot with X11
> errors but for some reason it seems to have gotten a lot better recently. 
> But there might be thick borders around menus, client-side decorated windows
> won't be movable, the system tray won't work, selection/clipboard will only
> work in one direction.  And it'll be slower.  The alternative is to give them
> "trusted" X11 access but that's very insecure.  And even untrusted access
> isn't so secure either, untrusted programs are not isolated from one another
> IIUC.  And who knows what the window manager, panels and others could be
> doing with the window properties of untrusted clients...  And this exposes
> the huge complexity of the X11 server.
> Wayland's security is supposed to be much better, but it depends on how the
> compositors handle security on the extra protocols that they support and
> IIUC there's not a consensus on how it should be handled yet and most
> compositors still lack security restrictions (but apparently some people
> just compile out their support for insecure protocols).
> Programs that have built-in support for privilege-separation and
> self-sandboxing can solve this by not giving direct access to the display to
> the sandboxed parts.  And that's something that this implementation means to
> support (which can be done on top of sandboxing the application as a
> whole).  But it's not a general solution.
> Also, dbus/dconf/pulseaudio/etc are not dealt with very well yet. They're
> just ignored really.  And (a bit surprisingly) many programs seem OK with
> that.  fontconfig will complain a lot but if the font caches are already up
> to date it doesn't look like it matters (startup will be much slower
> otherwise).  pulseaudio will just die when firefox tries to start it but
> then it'll fall back to using OSS directly (sndio works too).  Thumbnail
> caches won't be accessible.  The XDG shared recent documents list won't
> work. dconf will be completely non-functional and some programs won't be
> able to save their settings.  Etc.  And even when it works, "desktop
> integration" in general is going to be very degraded.  A program trying to
> launch the desktop environment's handler program to open a file or URL
> probably won't work because it'll inherit a too restrictive sandbox.  I
> haven't really gotten into trying to deal with this better yet.  I see that
> there are dbus proxy services for sandboxing on Linux.  It would probably
> need something like that.
> There are some scripts to sandbox programs with separate XDG directories or
> separate $HOME in /usr/share/examples/curtain/. But I wish doing this
> wouldn't be necessary...
> For non-desktop programs, it generally just works (if you give them enough
> permissions).  The main thing causing trouble is usually /tmp.
> About the userland parts:
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> libcurtain is a wrapper around the sandboxing syscall.  It allows permissions
> to be assigned to "slots" which then get merged.  Path permissions can
> override each other (most specific wins) within a slot, but across slots they
> are merged in a non-interfering way (a more specific permission never cancels
> out less specific permissions from a different slot).  Permissions from
> different bracketed sections of config files are added to different slots,
> so they all get merged in this way.
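
To make the slot semantics concrete, here is a rough model, assuming that "merged in a non-interfering way" amounts to resolving each slot independently and taking the union. The function names and the example paths are hypothetical and are not libcurtain's API:

```python
from pathlib import PurePosixPath


def resolve(slot, path):
    """Within one slot, the most specific unveiled prefix wins."""
    p = PurePosixPath(path)
    for candidate in (p, *p.parents):
        if candidate in slot:
            return slot[candidate]
    return frozenset()


def merged(slots, path):
    """Across slots, permissions are only ever added together."""
    result = frozenset()
    for slot in slots:
        result |= resolve(slot, path)
    return result


# Two config sections, each landing in its own slot.
slot_a = {PurePosixPath("/data"): frozenset("rw")}
slot_b = {PurePosixPath("/data/cache"): frozenset("r")}

# The same two rules placed in a single slot instead.
single = {**slot_a, **slot_b}

target = "/data/cache/file"
print(sorted(resolve(single, target)))            # specific rule overrides
print(sorted(merged([slot_a, slot_b], target)))   # specific rule only adds
```

In the single-slot case the narrower `/data/cache` rule replaces the broader one, while in the two-slot case it cannot take away the write permission granted by the other slot.
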
> Config files are also handled by libcurtain.  Applications can use
> libcurtain directly to sandbox themselves using tags, but the API for that
> is more complex than it should be and I'm probably going to make more
> changes to it.
> I added a freebsd_simple_sandbox() function directly to libc that tries to
> load libcurtain and applies a tag.  The idea is to make it as easy as
> possible to add configurable, opportunistic sandboxing to applications
> without having to link them to libcurtain.  It can be called multiple times
> at different stages of initialization of an application, or for different
> sub-processes, etc.  The application just specifies a tag for each call and
> the details are in the config files.  Conceivably, there could be different
> backends implementing the sandboxing.
> libcurtain also contains the pledge()/unveil() implementation.  On OpenBSD,
> pledge/unveil are available directly in libc (with the declarations in
> unistd.h), but the portable versions of some OpenBSD programs have problems
> if pledge/unveil are available on non-OpenBSD platforms because they just
> don't expect that.  After fixing them, maybe auto-loading wrappers could be
> added directly to libc too so that they just work without having to deal
> with libcurtain dependencies.
> About the kernel-side parts:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Most of the implementation is in a separate mac_curtain module, but it also
> needed some changes spread out in the kernel to support it.  That's what
> would need to be merged.
> The biggest change is adding "sysfils".  It initially just meant "syscall
> filters" but now it's more of a general category of things that the kernel
> can do.  Syscalls can be associated with zero or more required sysfils and
> some explicit sysfil checks were added in various places in the kernel as
> needed.  ucreds have a set of allowed sysfils.  Sysfils are represented as
> simple bitmaps and checks are fast.  Capsicum was slightly modified to make
> use of a sysfil bit to simplify syscall entry checks.
> Sysfils are meant to be part of the internal kernel API, they're not exposed
> to the userland.  The curtain module exposes intermediate "abilities"
> instead.
> Some checks that checked for "capability mode" now check for a more general
> "restricted mode" instead.  A process is considered in restricted mode
> whenever its ucred is missing any sysfil bit.
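
The sysfil checks described above boil down to bitmap tests. The following is an illustrative model only: the bit names and the syscall-to-sysfil table are invented for the example, since the real ones are not listed in this mail:

```python
# Hypothetical sysfil bits; the kernel's actual set is not given here.
SYSFIL_VFS_READ = 1 << 0
SYSFIL_VFS_WRITE = 1 << 1
SYSFIL_NET = 1 << 2
SYSFIL_PROC = 1 << 3
ALL_SYSFILS = SYSFIL_VFS_READ | SYSFIL_VFS_WRITE | SYSFIL_NET | SYSFIL_PROC

# Hypothetical table: each syscall requires zero or more sysfils.
REQUIRED = {
    "getpid": 0,
    "read": SYSFIL_VFS_READ,
    "sendfile": SYSFIL_VFS_READ | SYSFIL_NET,
}


def syscall_allowed(allowed: int, name: str) -> bool:
    """A syscall passes only if every required bit is in the ucred's set."""
    required = REQUIRED[name]
    return (required & allowed) == required


def in_restricted_mode(allowed: int) -> bool:
    """Restricted mode: the ucred is missing at least one sysfil bit."""
    return (allowed & ALL_SYSFILS) != ALL_SYSFILS


sandboxed = SYSFIL_VFS_READ            # a ucred that may only read files
print(syscall_allowed(sandboxed, "getpid"))
print(syscall_allowed(sandboxed, "read"))
print(syscall_allowed(sandboxed, "sendfile"))
print(in_restricted_mode(sandboxed), in_restricted_mode(ALL_SYSFILS))
```
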
> MAC handlers were added to let curtain hook into places that didn't have MAC
> checks.  Some of those new handlers definitely seem out of place.  The new
> vnode "walk" functions are more of a low-level mechanism than just a
> security policy.  And many of the new handlers want to restrict access to
> certain functionality as a whole (e.g. ioctls, sockopts, procctls, etc)
> rather than compare labels.  But it seemed like the best place to add them
> because MAC already did most of what was needed.  So I've been treating the
> MAC framework like it stands for "Modular Access Checks" or something.
> The curtain permissions are stored in "curtain" objects.  Process ucreds
> have their labels point to a curtain.  Curtains have pointers to "barrier"
> objects, which contain the hierarchical linkage needed to restrict access to
> protected kernel objects. Those kernel objects have their labels point
> directly to barriers.  Barriers can outlive their curtains.  When a ucred
> loses its last reference from a process, it is "trimmed" and its label
> curtain pointer "decays" into a pointer to the curtain's barrier so that the
> curtain can be freed (because curtains can be a few KBs and they can hold
> vnode references).  A lot of objects hold references to ucreds, so they
> could build up a lot without this.
> Processes can sandbox themselves with curtainctl(2).  They have to specify
> the full set of permissions they want to retain.  The requested permissions
> are then masked with the current curtain (if any).  This involves dealing
> with inheritance relationships between permissions (as the new curtain can
> have permissions more specific than the old and vice versa).
> Kernel-side handling of filesystem path unveiling was the hardest part to
> deal with (given the "statelessness" of the vnode API) and it kind of is all
> a big kludge.  I tried to make it as nice as possible and wrapped the whole
> thing behind a MAC API (it used to be a lot worse than that).
> Each directory "unveil" acts like a sort of chroot barrier but with specific
> permissions.  There's a per-thread "tracker" with a circular buffer that
> remembers the permissions for the previous N looked-up vnodes.  N only needs
> to be 2 as far as I can tell (most syscalls only need 1, but linkat() for
> example needs 2).  The tracker has weak vnode references and doesn't need to
> be cleaned up after syscalls.  namei() calls the new MAC handlers to manage
> the tracker during path lookup.  fget*() also adds a tracker entry.  Then
> the access check MAC handlers can find permissions for the passed vnodes in
> the tracker.  This only works because almost all of the kernel code that
> works on vnodes first gets a reference from namei()/fget*() and then doesn't
> call VOP_LOOKUP() directly itself.  It's messy but one good thing with
> it is that it usually "fails-secure" if the tracker was mismanaged because
> it won't find the vnode in it and it defaults to deny.
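
The tracker idea can be modeled in a few lines. This is a simplified sketch, assuming vnodes can be represented by plain integer IDs and permissions by character sets. The `Tracker` class and `access_check()` are hypothetical names, and the weak-reference handling is left out:

```python
from collections import deque


class Tracker:
    """Per-thread memory of permissions for the last N looked-up vnodes."""

    def __init__(self, size=2):
        # A bounded deque behaves like a circular buffer: the oldest
        # entry is dropped automatically when a new one is appended.
        self._entries = deque(maxlen=size)

    def remember(self, vnode_id, perms):
        """Called from the lookup path (namei()/fget*() in the description)."""
        self._entries.append((vnode_id, frozenset(perms)))

    def lookup(self, vnode_id):
        """Return the recorded permissions, or nothing if the vnode is unknown."""
        for seen, perms in reversed(self._entries):
            if seen == vnode_id:
                return perms
        return frozenset()  # default deny: unknown vnodes get no permissions


def access_check(tracker, vnode_id, wanted):
    """Access-check side: allow only if every wanted permission was recorded."""
    return set(wanted) <= tracker.lookup(vnode_id)


tracker = Tracker(size=2)
tracker.remember(10, "r")
tracker.remember(11, "rw")
print(access_check(tracker, 11, "w"))   # recorded with write permission
print(access_check(tracker, 10, "w"))   # recorded read-only
tracker.remember(12, "r")               # pushes vnode 10 out of the buffer
print(access_check(tracker, 10, "r"))   # no longer tracked, so denied
```

The last check shows the fail-secure property: a vnode that has fallen out of the tracker is simply denied rather than allowed.
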

Hello Mathieu,

First of all, thank you for this amazing work. Leveraging the MAC framework to
build curtain is imho an excellent idea. I personally see a curtain-like
approach as complementary to Capsicum rather than as a competing feature, and
I can see many possible uses for curtains in FreeBSD, in particular in the
ports framework!

To make it easier to integrate and to let developers review it, I think we
can/should split the review. The first thing will probably be, imho, to start
a review of the sysfils feature; this is probably the part that will need most
of the back and forth discussion, given the rest is pretty isolated (mac module,

Can you isolate the sysfils code and start a review in Phabricator? If you need
help with this, don't hesitate to ask me ;)

Again thanks for the huge work.

Best regards,