From: David Chisnall
Date: Tue, 29 Mar 2022 09:34:08 +0100
To: freebsd-hackers@freebsd.org
Subject: Re: curtain: WIP sandboxing mechanism with pledge()/unveil() support
Message-ID: <01320c49-fa7e-99d2-5840-3c61bb8c0d57@FreeBSD.org>
In-Reply-To: <25b5c60f-b9cc-78af-86d7-1cc714232364@gmail.com>
References: <25b5c60f-b9cc-78af-86d7-1cc714232364@gmail.com>
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

Hi,

Does pledge actually require kernel support?  I'd have thought that it
could be implemented on top of Capsicum as a purely userland abstraction
(more easily with libc help, but even with an LD_PRELOADed library along
the lines of libpreopen).

In Verona, we're able to use Capsicum to run unmodified libraries in a
sandbox, for example, including handling raw system calls:

https://github.com/microsoft/verona/tree/master/experiments/process_sandbox

It would be good to understand why this needs more kernel attack surface.

David

On 28/03/2022 10:37, Mathieu wrote:
> Hello list.  For a while now I've been working on and off on a
> pledge()/unveil() implementation for FreeBSD.  I also wanted it to be
> able to sandbox arbitrary programs that might not expect it with no (or
> very minor) modifications.  So I just kept adding to it until it could
> do that well enough.  I'm still working on it, and there are some known
> issues and some things I'm not sure are done correctly, but overall it's
> in a very functional state now.  It can run most utilities and desktop
> apps (though dbus/dconf/etc are trouble), server daemons, buildworld and
> whole shell/desktop sessions unmodified in a sandbox.
>
> https://github.com/Math2/freebsd-pledge
> https://github.com/Math2/freebsd-pledge/blob/main/CURTAIN-README.md
>
> It can be broken up into 4 parts: 1) A MAC module that implements most of
> the functionality.  2) The userland library, sandboxing utility, configs
> and tests.  3) Various kernel changes needed to support it (including
> new MAC handlers and extended syscall filtering).  4) Small
> changes/fixes to the base userland (things like adding reporting to ps
> and modifying some utilities to use $TMPDIR so that they can be properly
> sandboxed).  So 1) and 2) could be in a port.  And I tried to minimize
> 3) and 4) as much as possible.
>
> I noted some problems/limitations in the CURTAIN-ISSUES file.  At this
> point I'm mostly wondering whether the general design would be
> acceptable for merging eventually, because most of this could be part of
> a port, but not all of it.  And the way that it deals with filesystem
> access restrictions in particular is kludgy.  So any feedback/testing is
> welcome.
>
> It still lacks documentation (in part because I'm not sure of what could
> still change) so I'm going to give an overview of it here and show some
> examples and that's going to be the documentation for now.  And I'll
> describe the kernel changes that it needed.  So that's going to be a bit
> of a long email.
>
> What it can do:
> ~~~~~~~~~~~~~~~
>
> It can restrict syscalls and various abilities (by categories that were
> based on OpenBSD's pledge promises), ioctls, sysctls, socket
> options/address families, priv(9) privileges, and filesystem access by
> path.  It can be used at the same time as jails and Capsicum (their
> restrictions are also enforced on top of it).
>
> It can be used in a nested manner.  A program that inherits sandbox
> restrictions can do its own internal sandboxing or sandbox programs that
> it runs (which can then do the same, etc).  The permissions of new
> sandboxes are always a subset of the inherited sandbox.
>
> Certain kernel operations are protected by "barriers" which only allow a
> sandboxed process to operate on kernel objects that were created by
> itself or a descendant sandbox.  There are barriers for
> inspecting/signaling/debugging processes, POSIX/SysV IPC objects, PTYs,
> etc.  Barriers have their own hierarchy which can diverge from the
> process hierarchy.
>
> Restrictions can be specified in configuration files and can be
> associated with named "tags".  Tags are assumed to match application
> names; they're prefixed with "_" when they don't (just the convention
> I've been using so far).
> Enabling a tag may cause other tags to be enabled depending on
> configurations.  Permissions associated with different tags are merged
> in a purely additive manner.  Configurations can be spread across
> multiple files and directories (/usr/local/etc/curtain.{conf,d} can be
> used for packages, ~/.curtain.{conf,d} for user customizations).  It'll
> check the .d directories for files named after the enabled tags.
>
> Usage examples:
> ~~~~~~~~~~~~~~~
>
> curtain(1) is the wrapper utility to sandbox arbitrary programs.  Default
> permissions are in /etc/defaults/curtain.conf and /etc/curtain.conf.
>
> Here are a bunch of examples.  A bit random, but they demonstrate a lot
> of the functionality.
>
> $ curtain id
>
> Not very exciting, but it works.  The default permissions don't give it
> access to the user DB so it only shows numeric IDs.  It can be given
> access with the "_pwddb" tag:
>
> $ curtain -t _pwddb id
>
> It's possible to nest sandboxes, but it needs the "curtain" tag because
> the curtain config files are not unveiled by default (they could be
> though, maybe they should be...).
>
> Here, id cannot read the user DB because the outer sandbox doesn't allow
> it:
>
> $ curtain -t curtain curtain -t _pwddb id
>
> But this way it can:
>
> $ curtain -t curtain -t _pwddb curtain -t _pwddb id
>
> This starts a sandboxed shell session with access to ~/work in a clean
> environment:
>
> $ mkdir -p ~/work && curtain -p ~/work:rwx -S
>
> You'll probably miss your dotfiles though.  If you browse around you'll
> see what paths get unveiled by default.
>
> If you try to list processes:
>
> $ curtain ps -ax
>
> You'll just see the ps process itself.  It can be allowed to see
> processes outside of it like this:
>
> $ curtain -d ability-pass:ps ps -ax
>
> But it will not be allowed to signal, reprioritize or debug them (there
> are other "abilities" for that).  The "-pass" means to allow the ability
> in a "passthrough" manner (beyond the sandbox's barrier).
> Visibility could also be blocked at an outer sandbox's barrier, like so:
>
> $ curtain -t curtain curtain -d ability-pass:ps ps -ax
>
> Give read-only access to the current directory and list files:
>
> $ curtain -p . ls
>
> If you have $CLICOLOR set, it may look less colorful than usual.
> curtain(1) is a bit paranoid and will filter out most control characters
> written to the TTY by default (and set $TERM to "dumb").  They can be
> let through with -R:
>
> $ curtain -R -p . ls
>
> And -T can be used to stop it from doing PTY wrapping altogether and
> give the program direct access to the TTY (which is less secure, but
> there are ioctl restrictions).
>
> Per-path permissions can be specified after a ":".  More specific paths
> override the permissions of less specific paths.
>
> $ curtain -p .:rw -p ./secret: -p ./dev:rwx -p ./data:r ...
>
> Then those paths would have these permissions:
>     ./:rw
>     ./123:rw
>     ./secret:
>     ./dev:rwx
>     ./dev/123:rwx
>     ./data:r
>     ./data/123:r
>
> As an example of how nested sandboxing is handled, if you were then to
> do this within this sandbox (don't forget to give it the "curtain" tag):
>
> $ curtain -p .:r -p ./dev:rx -p ./data:rw ...
>
> Then the permissions would end up being:
>     ./:r
>     ./123:r
>     ./secret:
>     ./dev:rx
>     ./dev/123:rx
>     ./data:r
>     ./data/123:r
>
> root processes can be sandboxed too.  Some privileges are allowed by
> default (a set similar to the privileges allowed by jails), but most are
> denied, as are accesses to most /dev and /etc files.  For example,
> tcpdump will not be able to use bpf(4):
>
> # curtain tcpdump
>
> But there's a tag for that:
>
> # curtain -t _bpf tcpdump
>
> Something else that won't work:
>
> $ curtain node -e 'console.log(2+2)'
>
> It wants to do a PROT_EXEC mprotect(2) which is not allowed by default.
> By default, PROT_EXEC is only allowed when mmap(2)'ing files that are
> unveiled for execution.
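
The per-path rules from the -p examples above (the most specific path
wins, and a nested sandbox can only narrow what it inherited) can be
modeled in a few lines of Python.  This is only a sketch of the
resulting permissions, not of how curtain computes them in the kernel;
resolve(), effective() and the /cwd base directory are names made up for
the illustration.

```python
import posixpath

CWD = "/cwd"  # hypothetical directory that relative paths are resolved against


def _norm(path):
    """Turn a path such as "./dev/123" into a normalized absolute path."""
    return posixpath.normpath(posixpath.join(CWD, path))


def resolve(rules, path):
    """Permissions of the most specific rule that covers `path`.

    `rules` maps a path to a permission string such as "rw".  A path that
    no rule covers gets no permissions at all (default deny).
    """
    path = _norm(path)
    best_perms, best_len = frozenset(), -1
    for rule_path, perms in rules.items():
        rule_path = _norm(rule_path)
        covers = path == rule_path or path.startswith(rule_path.rstrip("/") + "/")
        # Every covering rule is a prefix of `path`, so longer means more specific.
        if covers and len(rule_path) > best_len:
            best_perms, best_len = frozenset(perms), len(rule_path)
    return best_perms


def effective(outer, inner, path):
    """A nested sandbox keeps only what both it and the outer sandbox allow."""
    return resolve(inner, path) & resolve(outer, path)


# The two -p rule sets from the examples: outer sandbox first, nested one second.
outer = {".": "rw", "./secret": "", "./dev": "rwx", "./data": "r"}
inner = {".": "r", "./dev": "rx", "./data": "rw"}

for p in ["./123", "./secret", "./dev/123", "./data/123"]:
    print(p + ":" + "".join(sorted(effective(outer, inner, p))))
```

For the four sample paths this yields r, nothing, rx and r, which matches
the nested listing above: ./data stays read-only even though the inner
sandbox asked for rw, and ./secret stays inaccessible even though the
inner "." rule would cover it.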
>
> It works if the prot_exec ability is granted explicitly:
>
> $ curtain -d ability:prot_exec node -e 'console.log(2+2)'
>
> Just what is allowed by default?  Well it's kind of arbitrary and messy
> and there are 10 levels of it.
>
> curtain(1) uses a 10-level "permissions tower" usable with options -0
> to -9 (which enable tags "_level0" to "_level9").  These are mostly just
> meant to be used as a quick way to try giving programs more or less
> access from the command line (ideally a profile should be made to give
> programs just what they need).  The default level currently is 5 (which
> is fairly permissive compared to most pledge(3)'d applications).  All
> levels are intended to be securely containable, but each level exposes a
> greater attack surface than the previous one.  Level 9 is the "please
> just work" level.  It allows the use of all ioctls, reading all sysctls,
> and almost all rare syscalls.  Filesystem access is still very
> restricted though so you've still got to figure out what unveils the
> program needs.
>
> And there's another dimension to it which is the "unsafety level".
> Directives in the config files can be suffixed with one or more "!" to
> indicate that the permissions that they give are potentially unsafe,
> depending on circumstances, or could be surprising or undesired.  The
> directive only applies when curtain(1) is invoked with as many or more
> "-!" options.  This was more useful at the beginning when many features
> weren't properly sandboxed yet.  Now it's not used as much.  But I still
> find it useful.  The way I'm using it is "!" is probably no big deal but
> you might want to check it if you're paranoid, "!!" has a real risk of
> allowing escapes in certain plausible scenarios, and "!!!" is very
> likely insecure unless special precautions are taken.
>
> I'm still not sure what the defaults should be or how they could be
> better organized.  The "unsafety" is an odd thing to expose to the user
> and as much as possible I tried to make it unnecessary.
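
The "as many or more" rule for unsafety levels is easy to get backwards,
so here is a small Python model of it.  It assumes the "!" markers are a
plain suffix on the directive text; the directive names below are
placeholders invented for the example and are not taken from the real
config files.

```python
def unsafety(directive):
    """Count the trailing "!" markers on a directive."""
    return len(directive) - len(directive.rstrip("!"))


def applies(directive, bang_options):
    """True if the directive is active given the number of "-!" options."""
    return unsafety(directive) <= bang_options


# Placeholder directive names, only here to illustrate the suffix rule.
directives = ["plain", "minor-risk!", "escape-risk!!", "dangerous!!!"]

for count in range(4):
    active = [d for d in directives if applies(d, count)]
    print(count, active)
```

With no "-!" option only the unmarked directive is active, and each
additional "-!" admits the directives of the next unsafety level on top
of the previous ones.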
>
> So anyway, a shorter way to make nodejs work is to use level 6 which
> allows PROT_EXEC on anonymous memory (and to execute binaries in $TMPDIR
> too):
>
> $ curtain -6 node -e 'console.log(2+2)'
>
> Now with X programs:
>
> $ curtain -X xlogo
> $ curtain -X xterm
>
> -X gives "untrusted" X11 access, -Y "trusted" access (like with ssh) and
> -W is for Wayland.
>
> There's an example config file with sample application profiles that can
> be enabled by uncommenting the include line in /etc/curtain.conf (and
> reading this file is a good way to see how the whole thing works).
> Profiles can be used with -a/-A.  Both simply enable the tag named after
> the program.  -A is a shortcut that also enables "unsafety level" 1
> (most profiles don't actually need it, but some do, so I just use it all
> the time).
>
> $ curtain -XA xterm
> $ curtain -XA firefox
> $ curtain -XA chrome
> $ curtain -XA falkon
> $ curtain -XA qbittorrent
> $ curtain -XA hexchat
> $ curtain -XA gimp
> $ curtain -XA audacious
> # curtain -A tcpdump
>
> Programs started this way still have the default level 5 permissions in
> addition to their profile permissions.
>
> Option -k ("kill") enables "strict" mode where the default becomes level
> 1 and programs are sent SIGKILL when trying to do something forbidden
> (otherwise they just get EPERM errors).  I made those two things go
> together because unexpected restrictions can make programs misbehave and
> this could lead to security issues.  This reduces the attack surface but
> it also means you've got to figure out the permissions just right or
> your programs are going to get killed a lot.  Also, trying to access
> non-unveiled files does not cause a SIGKILL to be sent yet, so missing
> unveils have the potential to cause insecure misbehavior too.
>
> See the config files here:
>
> https://github.com/Math2/freebsd-pledge/blob/main/lib/libcurtain/curtain.conf.defaults
>
> https://github.com/Math2/freebsd-pledge/blob/main/lib/libcurtain/curtain.conf.sample
>
>
> How well does it generally work?
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Well, there are some problems.
>
> First of all, "untrusted" X11 access doesn't work all that well.  Some
> programs are just unstable with it.  Firefox used to crash a lot with
> X11 errors but for some reason it seems to have gotten a lot better
> recently.  But there might be thick borders around menus, client-side
> decorated windows won't be movable, the system tray won't work, and
> selection/clipboard will only work in one direction.  And it'll be
> slower.  The alternative is to give them "trusted" X11 access but that's
> very insecure.  And even untrusted access isn't so secure either:
> untrusted programs are not isolated from one another IIUC.  And who
> knows what the window manager, panels and others could be doing with the
> window properties of untrusted clients...  And this exposes the huge
> complexity of the X11 server.
>
> Wayland's security is supposed to be much better, but it depends on how
> the compositors handle security on the extra protocols that they support
> and IIUC there's not a consensus on how it should be handled yet and
> most compositors still lack security restrictions (but apparently some
> people just compile out their support for insecure protocols).
>
> Programs that have built-in support for privilege separation and
> self-sandboxing can solve this by not giving direct access to the
> display to the sandboxed parts.  And that's something that this
> implementation means to support (which can be done on top of sandboxing
> the application as a whole).  But it's not a general solution.
>
> Also, dbus/dconf/pulseaudio/etc are not dealt with very well yet.
> They're just ignored really.  And (a bit surprisingly) many programs
> seem OK with that.
> fontconfig will complain a lot but if the font caches are already up to
> date it doesn't look like it matters (startup will be much slower
> otherwise).  pulseaudio will just die when firefox tries to start it but
> then it'll fall back to using OSS directly (sndio works too).  Thumbnail
> caches won't be accessible.  The XDG shared recent documents list won't
> work.  dconf will be completely non-functional and some programs won't
> be able to save their settings.  Etc.  And even when it works, "desktop
> integration" in general is going to be very degraded.  A program trying
> to launch the desktop environment's handler program to open a file or
> URL probably won't work because it'll inherit a too restrictive sandbox.
> I haven't really gotten into trying to deal with this better yet.  I see
> that there are dbus proxy services for sandboxing on Linux.  It would
> probably need something like that.
>
> There are some scripts to sandbox programs with separate XDG directories
> or separate $HOME in /usr/share/examples/curtain/.  But I wish doing this
> weren't necessary...
>
> For non-desktop programs, it generally just works (if you give them
> enough permissions).  The main thing causing trouble is usually /tmp.
>
> About the userland parts:
> ~~~~~~~~~~~~~~~~~~~~~~~~~
>
> libcurtain is a wrapper around the sandboxing syscall.  It allows
> assigning permissions to "slots" which then get merged.  Path
> permissions can override each other (most specific wins) within a slot,
> but across slots they are merged in a non-interfering way (a more
> specific permission never cancels out less specific permissions from a
> different slot).  Permissions from different bracketed sections of
> config files are added to different slots, so they all get merged in
> this way.
>
> Config files are also handled by libcurtain.
> Applications can use libcurtain directly to sandbox themselves using
> tags, but the API for that is more complex than it should be and I'm
> probably going to make more changes to it.
>
> I added a freebsd_simple_sandbox() function directly to libc that tries
> to load libcurtain and applies a tag.  The idea is to make it as easy as
> possible to add configurable, opportunistic sandboxing to applications
> without having to link them to libcurtain.  It can be called multiple
> times at different stages of initialization of an application, or for
> different sub-processes, etc.  The application just specifies a tag for
> each call and the details are in the config files.  Conceivably, there
> could be different backends implementing the sandboxing.
>
> libcurtain also contains the pledge()/unveil() implementation.  On
> OpenBSD, pledge/unveil are available directly in libc (with the
> declarations in unistd.h), but the portable versions of some OpenBSD
> programs have problems if pledge/unveil are available on non-OpenBSD
> platforms because they just don't expect that.  After fixing them, maybe
> auto-loading wrappers could be added directly to libc too so that they
> just work without having to deal with libcurtain dependencies.
>
> About the kernel-side parts:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Most of the implementation is in a separate mac_curtain module, but it
> also needed some changes spread out in the kernel to support it.  That's
> what would need to be merged.
>
> The biggest change is adding "sysfils".  It initially just meant
> "syscall filters" but now it's more of a general category of things that
> the kernel can do.  Syscalls can be associated with zero or more
> required sysfils and some explicit sysfil checks were added in various
> places in the kernel as needed.  ucreds have a set of allowed sysfils.
> Sysfils are represented as simple bitmaps and checks are fast.
> Capsicum was slightly modified to make use of a sysfil bit to simplify
> syscall entry checks.
>
> Sysfils are meant to be part of the internal kernel API; they're not
> exposed to the userland.  The curtain module exposes intermediate
> "abilities" instead.
>
> Some checks that checked for "capability mode" now check for a more
> general "restricted mode" instead.  A process is considered in
> restricted mode whenever its ucred is missing any sysfil bit.
>
> MAC handlers were added to let curtain hook into places that didn't have
> MAC checks.  Some of those new handlers definitely seem out of place.
> The new vnode "walk" functions are more of a low-level mechanism than
> just a security policy.  And many of the new handlers want to restrict
> access to certain functionality as a whole (e.g. ioctls, sockopts,
> procctls, etc) rather than compare labels.  But it seemed like the best
> place to add them because MAC already did most of what was needed.  So
> I've been treating the MAC framework like it stands for "Modular Access
> Checks" or something.
>
> The curtain permissions are stored in "curtain" objects.  Process ucreds
> have their labels point to a curtain.  Curtains have pointers to
> "barrier" objects, which contain the hierarchical linkage needed to
> restrict access to protected kernel objects.  Those kernel objects have
> their labels point directly to barriers.  Barriers can outlive their
> curtains.  When a ucred loses its last reference from a process, it is
> "trimmed" and its label curtain pointer "decays" into a pointer to the
> curtain's barrier so that the curtain can be freed (because curtains can
> be a few KBs and they can hold vnode references).  A lot of objects hold
> references to ucreds, so they could build up a lot without this.
>
> Processes can sandbox themselves with curtainctl(2).  They have to
> specify the full set of permissions they want to retain.
> The requested permissions are then masked with the current curtain (if
> any).  This involves dealing with inheritance relationships between
> permissions (as the new curtain can have permissions more specific than
> the old and vice versa).
>
> Kernel-side handling of filesystem path unveiling was the hardest part
> to deal with (given the "statelessness" of the vnode API) and it kind of
> is all a big kludge.  I tried to make it as nice as possible and wrapped
> the whole thing behind a MAC API (it used to be a lot worse than that).
>
> Each directory "unveil" acts like a sort of chroot barrier but with
> specific permissions.  There's a per-thread "tracker" with a circular
> buffer that remembers the permissions for the previous N looked-up
> vnodes.  N only needs to be 2 as far as I can tell (most syscalls only
> need 1, but linkat() for example needs 2).  The tracker has weak vnode
> references and doesn't need to be cleaned up after syscalls.  namei()
> calls the new MAC handlers to manage the tracker during path lookup.
> fget*() also adds a tracker entry.  Then the access check MAC handlers
> can find permissions for the passed vnodes in the tracker.  This only
> works because almost all of the kernel code that works on vnodes first
> gets a reference from namei()/fget*() and then doesn't call VOP_LOOKUP()
> directly itself.  It's messy but one good thing with it is that it
> usually "fails secure" if the tracker was mismanaged because it won't
> find the vnode in it and it defaults to deny.
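
The tracker described above can be pictured with a small userland model.
The Python below is only a sketch under stated assumptions: Tracker,
Vnode, record() and check() are invented names, permissions are reduced
to single letters, and the weak vnode references of the real thing are
left out.  It shows the two properties the description relies on, namely
that the last N = 2 lookups are remembered and that an untracked vnode is
denied.

```python
from collections import deque


class Vnode:
    """Stand-in for a kernel vnode; only its identity matters here."""

    def __init__(self, name):
        self.name = name


class Tracker:
    """Toy per-thread tracker that remembers the last `size` lookups."""

    def __init__(self, size=2):
        # A deque with maxlen acts as a circular buffer: appending to a
        # full deque silently drops the oldest entry.
        self._entries = deque(maxlen=size)

    def record(self, vnode, perms):
        """What a lookup (namei()/fget*() in the description) would add."""
        self._entries.append((vnode, frozenset(perms)))

    def check(self, vnode, wanted):
        """Access check: allow only if a remembered entry grants `wanted`."""
        for seen, perms in reversed(self._entries):  # newest entry first
            if seen is vnode:
                return wanted in perms
        return False  # vnode not in the tracker: fail secure, default deny


src, dstdir, other = Vnode("src"), Vnode("dstdir"), Vnode("other")
tracker = Tracker()

# A linkat()-style operation looks up two vnodes before the access checks.
tracker.record(src, "r")
tracker.record(dstdir, "rw")
print(tracker.check(src, "r"), tracker.check(dstdir, "w"))

# A third lookup pushes the oldest entry out, so `src` is denied again.
tracker.record(other, "r")
print(tracker.check(src, "r"))
```

Forgetting to record a lookup in this model can only make check() return
False, which is the "fails secure" behaviour the description mentions.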