Re: curtain: WIP sandboxing mechanism with pledge()/unveil() support
Date: Tue, 29 Mar 2022 08:34:08 UTC
Hi,
Does pledge actually require kernel support? I'd have thought that it
could be implemented on top of Capsicum as a purely userland abstraction
(more easily with libc help, but even with an LD_PRELOADed library along
the lines of libpreopen). In Verona, we're able to use Capsicum to run
unmodified libraries in a sandbox, for example, including handling raw
system calls:
https://github.com/microsoft/verona/tree/master/experiments/process_sandbox
It would be good to understand why this needs more kernel attack surface.
David
On 28/03/2022 10:37, Mathieu wrote:
> Hello list.  For a while now I've been working on and off on a
> pledge()/unveil() implementation for FreeBSD. I also wanted it to be
> able to sandbox arbitrary programs that might not expect it with no (or
> very minor) modifications. So I just kept adding to it until it could
> do that well enough. I'm still working on it, and there are some known
> issues and some things I'm not sure are done correctly, but overall it's
> in a very functional state now. It can run unmodified most utilities and
> desktop apps (though dbus/dconf/etc are trouble), server daemons,
> buildworld and whole shell/desktop sessions sandboxed.
>
> https://github.com/Math2/freebsd-pledge
> https://github.com/Math2/freebsd-pledge/blob/main/CURTAIN-README.md
>
> It can be broken up into 4 parts: 1) A MAC module that implements most of
> the functionality. 2) The userland library, sandboxing utility, configs
> and tests. 3) Various kernel changes needed to support it (including
> new MAC handlers and extended syscall filtering). 4) Small
> changes/fixes to the base userland (things like adding reporting to ps
> and modifying some utilities to use $TMPDIR so that they can be properly
> sandboxed). So 1) and 2) could be in a port. And I tried to minimize
> 3) and 4) as much as possible.
>
> I noted some problems/limitations in the CURTAIN-ISSUES file. At this
> point I'm mostly wondering about the general design being acceptable for
> merging eventually. Because most of this could be part of a port, but
> not all of it. And the way that it deals with filesystem access
> restrictions in particular is kludgy. So any feedback/testing welcome.
>
> It still lacks documentation (in part because I'm not sure of what could
> still change) so I'm going to give an overview of it here and show some
> examples and that's going to be the documentation for now. And I'll
> describe the kernel changes that it needed. So that's going to be a bit
> of a long email.
>
> What it can do:
> ~~~~~~~~~~~~~~~
>
> It can restrict syscalls and various abilities (by categories that were
> based on OpenBSD's pledge promises), ioctls, sysctls, socket
> options/address families, priv(9) privileges, and filesystem access by
> path. It can be used at the same time as jails and Capsicum (their
> restrictions are also enforced on top of it).
>
> It can be used in a nested manner. A program that inherits sandbox
> restrictions can do its own internal sandboxing or sandbox programs that
> it runs (which can then do the same, etc).  The permissions of new
> sandboxes are always a subset of the inherited sandbox.
>
> Certain kernel operations are protected by "barriers" which only allow a
> sandboxed process to operate on kernel objects that were created by
> itself or a descendant sandbox. There are barriers for
> inspecting/signaling/debugging processes, POSIX/SysV IPC objects, PTYs,
> etc. Barriers have their own hierarchy which can diverge from the
> process hierarchy.
>
> Restrictions can be specified in configuration files and can be
> associated with named "tags". Tags are assumed to match application
> names; they're prefixed with "_" when they don't (just the convention
> I've been using so far). Enabling a tag may cause other tags to be
> enabled depending on configurations. Permissions associated with
> different tags are merged in a purely additive manner. Configurations
> can be spread across multiple files and directories
> (/usr/local/etc/curtain.{conf,d} can be used for packages,
> ~/.curtain.{conf,d} for user customizations). It'll check the .d
> directories for files named after the enabled tags.
>
> Usage examples:
> ~~~~~~~~~~~~~~~
>
> curtain(1) is the wrapper utility to sandbox arbitrary programs. Default
> permissions are in /etc/defaults/curtain.conf and /etc/curtain.conf.
>
> Here are a bunch of examples.  A bit random, but they demonstrate a lot
> of the functionality.
>
> $ curtain id
>
> Not very exciting, but it works. The default permissions don't give it
> access to the user DB so it only shows numeric IDs. It can be given
> access with the "_pwddb" tag:
>
> $ curtain -t _pwddb id
>
> It's possible to nest sandboxes, but it needs the "curtain" tag because
> the curtain config files are not unveiled by default (they could be
> though, maybe they should be...).
>
> Here, id cannot read the user DB because the outer sandbox doesn't allow
> it:
>
> $ curtain -t curtain curtain -t _pwddb id
>
> But this way it can:
>
> $ curtain -t curtain -t _pwddb curtain -t _pwddb id
>
> Starts a sandboxed shell session with access to ~/work in a clean
> environment:
>
> $ mkdir -p ~/work && curtain -p ~/work:rwx -S
>
> You'll probably miss your dotfiles though. If you browse around you'll
> see what paths get unveiled by default.
>
> If you try to list processes:
>
> $ curtain ps -ax
>
> You'll just see the ps process itself.  It can be allowed to see
> processes outside of it like this:
>
> $ curtain -d ability-pass:ps ps -ax
>
> But it will not be allowed to signal, reprioritize or debug them (there
> are other "abilities" for that). The "-pass" means to allow the ability
> in a "passthrough" manner (beyond the sandbox's barrier). Visibility
> could also be blocked at an outer sandbox's barrier, like so:
>
> $ curtain -t curtain curtain -d ability-pass:ps ps -ax
>
> Give read-only access to the current directory and list files:
>
> $ curtain -p . ls
>
> If you have $CLICOLOR set, it may look less colorful than usual.
> curtain(1) is a bit paranoid and will filter out most control characters
> written to the TTY by default (and set $TERM to "dumb"). They can be
> let through with -R:
>
> $ curtain -R -p . ls
>
> And -T can be used to stop it from doing PTY wrapping altogether and
> give the program direct access to the TTY (which is less secure, but
> there are ioctl restrictions).
>
> Per-path permissions can be specified after a ":". More specific paths
> override the permissions of less specific paths.
>
> $ curtain -p .:rw -p ./secret: -p ./dev:rwx -p ./data:r ...
>
> Then these paths would have these permissions:
> ./:rw
> ./123:rw
> ./secret:
> ./dev:rwx
> ./dev/123:rwx
> ./data:r
> ./data/123:r
>
> As an example of how nested sandboxing is handled, if you were then to
> do this within this sandbox (don't forget to give it the "curtain" tag):
>
> $ curtain -p .:r -p ./dev:rx -p ./data:rw ...
>
> Then the permissions would end up being:
> ./:r
> ./123:r
> ./secret:
> ./dev:rx
> ./dev/123:rx
> ./data:r
> ./data/123:r
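
For illustration, the two listings above can be reproduced with a small
Python model.  This is only a sketch of the rules as described in this
message (the most specific path wins, and a nested sandbox is masked by
the outer one), not curtain's actual code.  The names resolve() and
nested() are made up, and the masking is modelled as a plain set
intersection, which ignores the inheritance handling the kernel does.

```python
from pathlib import PurePosixPath

def resolve(rules, path):
    """Return the permissions of the most specific rule covering path."""
    target = PurePosixPath(path)
    best_len, best = -1, set()
    for prefix, perms in rules.items():
        p = PurePosixPath(prefix)
        covers = target == p or p in target.parents
        if covers and len(p.parts) > best_len:
            best_len, best = len(p.parts), set(perms)
    return best  # empty set when nothing covers the path

def nested(outer, inner, path):
    """Model nesting as the intersection of outer and inner permissions."""
    return resolve(outer, path) & resolve(inner, path)

# The -p options from the two command lines above.
outer = {".": "rw", "./secret": "", "./dev": "rwx", "./data": "r"}
inner = {".": "r", "./dev": "rx", "./data": "rw"}

for path in ["./", "./123", "./secret", "./dev", "./dev/123",
             "./data", "./data/123"]:
    print(path + ":" + "".join(sorted(nested(outer, inner, path))))
```

The inner sandbox asked for rw on ./data, but the outer one only had r,
so the result is r.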
>
> root processes can be sandboxed too. Some privileges are allowed by
> default (which is similar to the set of privileges allowed by jails),
> but most are denied. As are accesses to most /dev and /etc files. For
> example, tcpdump will not be able to use bpf(4):
>
> # curtain tcpdump
>
> But there's a tag for that:
>
> # curtain -t _bpf tcpdump
>
> Something else that won't work:
>
> $ curtain node -e 'console.log(2+2)'
>
> It wants to do a PROT_EXEC mprotect(2) which is not allowed by default.
> By default, PROT_EXEC is only allowed when mmap(2)'ing files that are
> unveiled for execution.
>
> $ curtain -d ability:prot_exec node -e 'console.log(2+2)'
>
> Just what is allowed by default? Well it's kind of arbitrary and messy
> and there are 10 levels of it.
>
> curtain(1) uses a 10-level "permissions tower" usable with options -0
> to -9 (which enable tags "_level0" to "_level9"). These are mostly just
> meant to be used as a quick way to try giving programs more or less
> access from the command-line (ideally a profile should be made to give
> programs just what they need). The default level currently is 5 (which
> is fairly permissive compared to most pledge(3)'d applications). All
> levels are intended to be securely containable, but each level exposes a
> greater attack surface than the previous one.  Level 9 is the "please
> just work" level.  It allows using all ioctls, reading all sysctls,
> and using almost all rare syscalls.  Filesystem access is still very
> restricted though so you've still got to figure out what unveils the
> program needs.
>
> And there's another dimension to it which is the "unsafety level".
> Directives in the config files can be suffixed with one or more "!" to
> indicate that the permissions that it gives are potentially unsafe,
> depending on circumstances, or could be surprising or undesired. The
> directive only applies when curtain(1) is invoked with as many or more
> "-!" options. This was more useful at the beginning when many features
> weren't properly sandboxed yet. Now it's not used as much. But I still
> find it useful. The way I'm using it is "!" is probably no big deal but
> you might want to check it if you're paranoid, "!!" has a real risk of
> allowing escapes in certain plausible scenarios, and "!!!" is very
> likely insecure unless special precautions are taken.
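
The "!" rule described above can be modelled in a few lines of Python.
This is a sketch under assumptions: the directive strings below are
placeholders rather than real curtain.conf syntax, and
directive_applies() is a made-up name.

```python
def required_unsafety(directive):
    """Count the trailing "!" marks on a directive."""
    return len(directive) - len(directive.rstrip("!"))

def directive_applies(directive, bang_options):
    """A directive applies when at least as many "-!" options were given."""
    return bang_options >= required_unsafety(directive)

# Placeholder directives, not taken from the real config files.
for directive in ["example", "example!", "example!!"]:
    print(directive, [directive_applies(directive, n) for n in range(3)])
```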
>
> I'm still not sure what the defaults should be or how they could be
> better organized. The "unsafety" is an odd thing to expose to the user
> and as much as possible I tried to make it unnecessary.
>
> So anyway, a shorter way to make nodejs work is to use level 6 which
> allows PROT_EXEC on anonymous memory (and to execute binaries in $TMPDIR
> too):
>
> $ curtain -6 node -e 'console.log(2+2)'
>
> Now with X programs:
>
> $ curtain -X xlogo
> $ curtain -X xterm
>
> -X gives "untrusted" X11 access, -Y "trusted" access (like with ssh) and
> -W is for Wayland.
>
> There's an example config file with sample application profiles that can
> be enabled by uncommenting the include line in /etc/curtain.conf (and
> reading this file is a good way to see how the whole thing works).
> Profiles can be used with -a/-A. Both simply enable the tag named after
> the program. -A is a shortcut that also enables "unsafety level" 1
> (most profiles don't actually need it, but some do, so I just use it all
> the time).
>
> $ curtain -XA xterm
> $ curtain -XA firefox
> $ curtain -XA chrome
> $ curtain -XA falkon
> $ curtain -XA qbittorrent
> $ curtain -XA hexchat
> $ curtain -XA gimp
> $ curtain -XA audacious
> # curtain -A tcpdump
>
> Programs started this way still have the default level 5 permissions in
> addition to their profile permissions.
>
> Option -k ("kill") enables "strict" mode where the default becomes level
> 1 and programs are sent SIGKILL when trying to do something forbidden
> (otherwise they just get EPERM errors). I made those two things go
> together because unexpected restrictions can make programs misbehave and
> this could lead to security issues. This reduces the attack surface but
> it also means you've got to figure out the permissions just right or
> your programs are going to get killed a lot. Also, trying to access
> non-unveiled files does not cause a SIGKILL to be sent yet, so missing
> unveils have the potential to cause insecure misbehavior too.
>
> See the config files here:
>
> https://github.com/Math2/freebsd-pledge/blob/main/lib/libcurtain/curtain.conf.defaults
>
> https://github.com/Math2/freebsd-pledge/blob/main/lib/libcurtain/curtain.conf.sample
>
>
> How well does it generally work?
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Well, there are some problems.
>
> First of all, "untrusted" X11 access doesn't work all that great. Some
> programs are just unstable with it. Firefox used to crash a lot with
> X11 errors but for some reason it seems to have gotten a lot better
> recently. But there might be thick borders around menus, client-side
> decorated windows won't be movable, the system tray won't work,
> selection/clipboard will only work in one direction.  And it'll be slower.
> The alternative is to give them "trusted" X11 access but that's very
> insecure. And even untrusted access isn't so secure either, untrusted
> programs are not isolated from one another IIUC. And who knows what the
> window manager, panels and others could be doing with the window
> properties of untrusted clients... And this exposes the huge complexity
> of the X11 server.
>
> Wayland's security is supposed to be much better, but it depends on how
> the compositors handle security on the extra protocols that they support
> and IIUC there's not a consensus on how it should be handled yet and
> most compositors still lack security restrictions (but apparently some
> people just compile out their support for insecure protocols).
>
> Programs that have built-in support for privilege-separation and
> self-sandboxing can solve this by not giving direct access to the
> display to the sandboxed parts. And that's something that this
> implementation means to support (which can be done on top of sandboxing
> the application as a whole). But it's not a general solution.
>
> Also, dbus/dconf/pulseaudio/etc are not dealt with very well yet.
> They're just ignored really. And (a bit surprisingly) many programs
> seem OK with that. fontconfig will complain a lot but if the font
> caches are already up to date it doesn't look like it matters (startup
> will be much slower otherwise). pulseaudio will just die when firefox
> tries to start it but then it'll fall back to using OSS directly (sndio
> works too). Thumbnail caches won't be accessible. The XDG shared
> recent documents list won't work. dconf will be completely
> non-functional and some programs won't be able to save their settings.
> Etc. And even when it works, "desktop integration" in general is going
> to be very degraded. A program trying to launch the desktop
> environment's handler program to open a file or URL probably won't work
> because it'll inherit a too restrictive sandbox. I haven't really
> gotten into trying to deal with this better yet. I see that there are
> dbus proxy services for sandboxing on Linux. It would probably need
> something like that.
>
> There are some scripts to sandbox programs with separate XDG directories
> or separate $HOME in /usr/share/examples/curtain/. But I wish doing this
> wouldn't be necessary...
>
> For non-desktop programs, it generally just works (if you give them
> enough permissions). The main thing causing trouble is usually /tmp.
>
> About the userland parts:
> ~~~~~~~~~~~~~~~~~~~~~~~~~
>
> libcurtain is a wrapper around the sandboxing syscall.  It allows
> assigning permissions to "slots" which then get merged.  Path permissions
> can override each other (most specific wins) within a slot, but across
> slots they are merged in a non-interfering way (a more specific
> permission never cancels out less specific permissions from a different
> slot). Permissions from different bracketed sections of config files
> are added to different slots, so they all get merged in this way.
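
A rough Python model of the slot behaviour described above, to show the
difference between overriding within a slot and merging across slots.
It assumes that merging across slots is a plain union of each slot's
result, which is one reading of "purely additive"; the function names
and the example paths are made up.

```python
from pathlib import PurePosixPath

def slot_perms(slot, path):
    """Within one slot, the most specific covering rule wins."""
    target = PurePosixPath(path)
    best_len, best = -1, set()
    for prefix, perms in slot.items():
        p = PurePosixPath(prefix)
        covers = target == p or p in target.parents
        if covers and len(p.parts) > best_len:
            best_len, best = len(p.parts), set(perms)
    return best

def merged_perms(slots, path):
    """Across slots, results are only ever added together."""
    result = set()
    for slot in slots:
        result |= slot_perms(slot, path)
    return result

# Same two rules, first in a single slot, then split over two slots.
one_slot = [{"/home/u": "rw", "/home/u/secret": ""}]
two_slots = [{"/home/u": "rw"}, {"/home/u/secret": ""}]

print(sorted(merged_perms(one_slot, "/home/u/secret/key")))
print(sorted(merged_perms(two_slots, "/home/u/secret/key")))
```

In the single-slot case the empty rule for the more specific path hides
the rw; in the two-slot case it cannot cancel the rw coming from the
other slot.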
>
> Config files are also handled by libcurtain. Applications can use
> libcurtain directly to sandbox themselves using tags, but the API for
> that is more complex than it should be and I'm probably going to make
> more changes to it.
>
> I added a freebsd_simple_sandbox() function directly to libc that tries
> to load libcurtain and applies a tag. The idea is to make it as easy as
> possible to add configurable, opportunistic sandboxing to applications
> without having to link them to libcurtain. It can be called multiple
> times at different stages of initialization of an application, or for
> different sub-processes, etc. The application just specifies a tag for
> each call and the details are in the config files. Conceivably, there
> could be different backends implementing the sandboxing.
>
> libcurtain also contains the pledge()/unveil() implementation. On
> OpenBSD, pledge/unveil are available directly in libc (with the
> declarations in unistd.h), but the portable versions of some OpenBSD
> programs have problems if pledge/unveil are available on non-OpenBSD
> platforms because they just don't expect that. After fixing them, maybe
> auto-loading wrappers could be added directly to libc too so that they
> just work without having to deal with libcurtain dependencies.
>
> About the kernel-side parts:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Most of the implementation is in a separate mac_curtain module, but it
> also needed some changes spread out in the kernel to support it. That's
> what would need to be merged.
>
> The biggest change is adding "sysfils". It initially just meant
> "syscall filters" but now it's more of a general category of things that
> the kernel can do. Syscalls can be associated with zero or more
> required sysfils and some explicit sysfil checks were added in various
> places in the kernel as needed. ucreds have a set of allowed sysfils.
> Sysfils are represented as simple bitmaps and checks are fast. Capsicum
> was slightly modified to make use of a sysfil bit to simplify syscall
> entry checks.
>
> Sysfils are meant to be part of the internal kernel API, they're not
> exposed to the userland. The curtain module exposes intermediate
> "abilities" instead.
>
> Some checks that checked for "capability mode" now check for a more
> general "restricted mode" instead. A process is considered in
> restricted mode whenever its ucred is missing any sysfil bit.
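
The bitmap checks described above amount to something like the
following, sketched in Python rather than kernel C.  The bit names and
the two helper functions are hypothetical; the real sysfil names and
numbering are not given in this message.

```python
# Hypothetical sysfil bits, for illustration only.
SYSFIL_STDIO = 1 << 0
SYSFIL_INET = 1 << 1
SYSFIL_EXEC = 1 << 2
SYSFIL_ALL = SYSFIL_STDIO | SYSFIL_INET | SYSFIL_EXEC

def sysfil_check(required, allowed):
    """True if every required bit is present in the allowed bitmap."""
    return (required & ~allowed) == 0

def in_restricted_mode(allowed):
    """A ucred missing any sysfil bit counts as restricted."""
    return (allowed & SYSFIL_ALL) != SYSFIL_ALL

ucred_sysfils = SYSFIL_STDIO | SYSFIL_EXEC   # no network access
print(sysfil_check(SYSFIL_STDIO, ucred_sysfils))
print(sysfil_check(SYSFIL_STDIO | SYSFIL_INET, ucred_sysfils))
print(in_restricted_mode(ucred_sysfils))
```

A syscall with no required sysfils (required == 0) always passes the
check.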
>
> MAC handlers were added to let curtain hook into places that didn't have
> MAC checks.  Some of those new handlers definitely seem out of place.
> The new vnode "walk" functions are more of a low-level mechanism than
> just a security policy. And many of the new handlers want to restrict
> access to certain functionality as a whole (e.g. ioctls, sockopts,
> procctls, etc) rather than compare labels. But it seemed like the best
> place to add them because MAC already did most of what was needed. So
> I've been treating the MAC framework like it stands for "Modular Access
> Checks" or something.
>
> The curtain permissions are stored in "curtain" objects. Process ucreds
> have their labels point to a curtain. Curtains have pointers to
> "barrier" objects, which contain the hierarchical linkage needed to
> restrict access to protected kernel objects. Those kernel objects have
> their labels point directly to barriers. Barriers can outlive their
> curtains. When a ucred loses its last reference from a process, it is
> "trimmed" and its label curtain pointer "decays" into a pointer to the
> curtain's barrier so that the curtain can be freed (because curtains can
> be a few KBs and they can hold vnode references). A lot of objects hold
> references to ucreds, so they could build up a lot without this.
>
> Processes can sandbox themselves with curtainctl(2). They have to
> specify the full set of permissions they want to retain. The requested
> permissions are then masked with the current curtain (if any). This
> involves dealing with inheritance relationships between permissions (as
> the new curtain can have permissions more specific than the old and vice
> versa).
>
> Kernel-side handling of filesystem path unveiling was the hardest part
> to deal with (given the "statelessness" of the vnode API) and it kind of
> is all a big kludge. I tried to make it as nice as possible and wrapped
> the whole thing behind a MAC API (it used to be a lot worse than that).
>
> Each directory "unveil" acts like a sort of chroot barrier but with
> specific permissions. There's a per-thread "tracker" with a circular
> buffer that remembers the permissions for the previous N looked-up
> vnodes. N only needs to be 2 as far as I can tell (most syscalls only
> need 1, but linkat() for example needs 2). The tracker has weak vnode
> references and doesn't need to be cleaned up after syscalls. namei()
> calls the new MAC handlers to manage the tracker during path lookup.
> fget*() also adds a tracker entry. Then the access check MAC handlers
> can find permissions for the passed vnodes in the tracker.  This only
> works because almost all of the kernel code that works on vnodes first
> gets a reference from namei()/fget*() and then doesn't call VOP_LOOKUP()
> directly itself.  It's messy but one good thing with it is that it
> usually "fails-secure" if the tracker was mismanaged because it won't
> find the vnode in it and it defaults to deny.
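
The tracker idea can be sketched as a tiny ring buffer with a
default-deny lookup.  This is a simplified Python model, not the kernel
data structure: the Tracker class and its method names are made up,
vnodes are stood in for by strings, and the weak-reference aspect is not
modelled.

```python
class Tracker:
    """Remembers permissions for the last N looked-up vnodes."""

    def __init__(self, size=2):
        self.entries = [None] * size
        self.next = 0

    def record(self, vnode, perms):
        """Called during lookup; overwrites the oldest entry."""
        self.entries[self.next] = (vnode, frozenset(perms))
        self.next = (self.next + 1) % len(self.entries)

    def lookup(self, vnode):
        """Called from access checks; unknown vnodes get no permissions."""
        for entry in self.entries:
            if entry is not None and entry[0] == vnode:
                return entry[1]
        return frozenset()  # default deny

tracker = Tracker(size=2)
tracker.record("vnode-a", "rw")
tracker.record("vnode-b", "r")
tracker.record("vnode-c", "rx")   # evicts vnode-a

print(sorted(tracker.lookup("vnode-c")))
print(sorted(tracker.lookup("vnode-a")))
```

A vnode that was never recorded, or that has already been overwritten,
ends up with an empty permission set, which is the "fails-secure"
behaviour described above.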
>
>
>