From: Mathieu
To: freebsd-hackers@FreeBSD.org
Subject: curtain: WIP sandboxing mechanism with pledge()/unveil() support
Date: Mon, 28 Mar 2022 05:37:44 -0400
Message-ID: <25b5c60f-b9cc-78af-86d7-1cc714232364@gmail.com>
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

Hello list.

For a while now I've been working on and off on a pledge()/unveil()
implementation for FreeBSD.  I also wanted it to be able to sandbox
arbitrary programs that might not expect it with no (or very minor)
modifications.  So I just kept adding to it until it could do that well
enough.  I'm still working on it, and there are some known issues and
some things I'm not sure are done correctly, but overall it's in a very
functional state now.  It can run most utilities and desktop apps
(though dbus/dconf/etc are trouble), server daemons, buildworld and
whole shell/desktop sessions sandboxed, all unmodified.

https://github.com/Math2/freebsd-pledge
https://github.com/Math2/freebsd-pledge/blob/main/CURTAIN-README.md

It can be broken up into 4 parts:

1) A MAC module that implements most of the functionality.

2) The userland library, sandboxing utility, configs and tests.

3) Various kernel changes needed to support it (including new MAC
   handlers and extended syscall filtering).

4) Small changes/fixes to the base userland (things like adding
   reporting to ps and modifying some utilities to use $TMPDIR so that
   they can be properly sandboxed).

So 1) and 2) could be in a port.  And I tried to minimize 3) and 4) as
much as possible.
I noted some problems/limitations in the CURTAIN-ISSUES file.

At this point I'm mostly wondering whether the general design would be
acceptable for merging eventually.  Because most of this could be part
of a port, but not all of it.  And the way that it deals with filesystem
access restrictions in particular is kludgy.  So any feedback/testing is
welcome.

It still lacks documentation (in part because I'm not sure what could
still change) so I'm going to give an overview of it here and show some
examples and that's going to be the documentation for now.  And I'll
describe the kernel changes that it needed.  So that's going to be a bit
of a long email.

What it can do:
~~~~~~~~~~~~~~~

It can restrict syscalls and various abilities (by categories that were
based on OpenBSD's pledge promises), ioctls, sysctls, socket
options/address families, priv(9) privileges, and filesystem access by
path.  It can be used at the same time as jails and Capsicum (their
restrictions are also enforced on top of it).

It can be used in a nested manner.  A program that inherits sandbox
restrictions can do its own internal sandboxing or sandbox programs that
it runs (which can then do the same, etc).  The permissions of new
sandboxes are always a subset of the inherited sandbox.

Certain kernel operations are protected by "barriers" which only allow a
sandboxed process to operate on kernel objects that were created by
itself or a descendant sandbox.  There are barriers for
inspecting/signaling/debugging processes, POSIX/SysV IPC objects, PTYs,
etc.  Barriers have their own hierarchy which can diverge from the
process hierarchy.

Restrictions can be specified in configuration files and can be
associated with named "tags".  Tags are assumed to match application
names; they're prefixed with "_" when they don't (just the convention
I've been using so far).  Enabling a tag may cause other tags to be
enabled depending on configurations.
Permissions associated with different tags are merged in a purely
additive manner.  Configurations can be spread across multiple files and
directories (/usr/local/etc/curtain.{conf,d} can be used for packages,
~/.curtain.{conf,d} for user customizations).  It'll check the .d
directories for files named after the enabled tags.

Usage examples:
~~~~~~~~~~~~~~~

curtain(1) is the wrapper utility to sandbox arbitrary programs.
Default permissions are in /etc/defaults/curtain.conf and
/etc/curtain.conf.

Here are a bunch of examples.  A bit random, but they demonstrate a lot
of the functionality.

$ curtain id

Not very exciting, but it works.  The default permissions don't give it
access to the user DB so it only shows numeric IDs.  It can be given
access with the "_pwddb" tag:

$ curtain -t _pwddb id

It's possible to nest sandboxes, but it needs the "curtain" tag because
the curtain config files are not unveiled by default (they could be
though, maybe they should be...).

Here, id cannot read the user DB because the outer sandbox doesn't allow
it:

$ curtain -t curtain curtain -t _pwddb id

But this way it can:

$ curtain -t curtain -t _pwddb curtain -t _pwddb id

Start a sandboxed shell session with access to ~/work in a clean
environment:

$ mkdir -p ~/work && curtain -p ~/work:rwx -S

You'll probably miss your dotfiles though.  If you browse around you'll
see what paths get unveiled by default.

If you try to list processes:

$ curtain ps -ax

You'll just see the ps process itself.  It can be allowed to see
processes outside of it like this:

$ curtain -d ability-pass:ps ps -ax

But it will not be allowed to signal, reprioritize or debug them (there
are other "abilities" for that).  The "-pass" means to allow the ability
in a "passthrough" manner (beyond the sandbox's barrier).  Visibility
could also be blocked at an outer sandbox's barrier, like so:

$ curtain -t curtain curtain -d ability-pass:ps ps -ax

Give read-only access to the current directory and list files:

$ curtain -p . ls

If you have $CLICOLOR set, it may look less colorful than usual.
curtain(1) is a bit paranoid and will filter out most control characters
written to the TTY by default (and set $TERM to "dumb").  They can be
let through with -R:

$ curtain -R -p . ls

And -T can be used to stop it from doing PTY wrapping altogether and
give the program direct access to the TTY (which is less secure, but
there are ioctl restrictions).

Per-path permissions can be specified after a ":".  More specific paths
override the permissions of less specific paths.

$ curtain -p .:rw -p ./secret: -p ./dev:rwx -p ./data:r ...

Then these paths would have these permissions:

    ./:rw
    ./123:rw
    ./secret:
    ./dev:rwx
    ./dev/123:rwx
    ./data:r
    ./data/123:r

As an example of how nested sandboxing is handled, if you were then to
do this within this sandbox (don't forget to give it the "curtain" tag):

$ curtain -p .:r -p ./dev:rx -p ./data:rw ...

Then the permissions would end up being:

    ./:r
    ./123:r
    ./secret:
    ./dev:rx
    ./dev/123:rx
    ./data:r
    ./data/123:r

root processes can be sandboxed too.  Some privileges are allowed by
default (a set similar to the privileges allowed by jails), but most are
denied.  As are accesses to most /dev and /etc files.  For example,
tcpdump will not be able to use bpf(4):

# curtain tcpdump

But there's a tag for that:

# curtain -t _bpf tcpdump

Something else that won't work:

$ curtain node -e 'console.log(2+2)'

It wants to do a PROT_EXEC mprotect(2) which is not allowed by default.
By default, PROT_EXEC is only allowed when mmap(2)'ing files that are
unveiled for execution.  The ability can be granted explicitly:

$ curtain -d ability:prot_exec node -e 'console.log(2+2)'

Just what is allowed by default?  Well it's kind of arbitrary and messy
and there are 10 levels of it.

curtain(1) uses a 10-level "permissions tower" usable with options -0 to
-9 (which enable tags "_level0" to "_level9").
These are mostly just meant to be used as a quick way to try giving
programs more or less access from the command line (ideally a profile
should be made to give programs just what they need).

The default level currently is 5 (which is fairly permissive compared to
most pledge(3)'d applications).  All levels are intended to be securely
containable, but each level exposes a greater attack surface than the
previous one.  Level 9 is the "please just work" level.  It allows using
all ioctls, reading all sysctls and calling almost all rare syscalls.
Filesystem access is still very restricted though so you've still got to
figure out what unveils the program needs.

And there's another dimension to it which is the "unsafety level".
Directives in the config files can be suffixed with one or more "!" to
indicate that the permissions that they give are potentially unsafe,
depending on circumstances, or could be surprising or undesired.  A
directive only applies when curtain(1) is invoked with at least as many
"-!" options as the directive has "!" suffixes.  This was more useful at
the beginning when many features weren't properly sandboxed yet.  Now
it's not used as much.  But I still find it useful.  The way I'm using
it is: "!" is probably no big deal but you might want to check it if
you're paranoid, "!!" has a real risk of allowing escapes in certain
plausible scenarios, and "!!!" is very likely insecure unless special
precautions are taken.

I'm still not sure what the defaults should be or how they could be
better organized.  The "unsafety" is an odd thing to expose to the user
and as much as possible I tried to make it unnecessary.

So anyway, a shorter way to make nodejs work is to use level 6 which
allows PROT_EXEC on anonymous memory (and executing binaries in $TMPDIR
too):

$ curtain -6 node -e 'console.log(2+2)'

Now with X programs:

$ curtain -X xlogo
$ curtain -X xterm

-X gives "untrusted" X11 access, -Y "trusted" access (like with ssh) and
-W is for Wayland.
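To make the per-path rules from the -p examples earlier in this section
concrete, here is a toy Python model.  It is not taken from the curtain
code: the function names are made up, only the r/w/x letters are
modeled, and nesting is approximated as a plain intersection of what
each sandbox layer grants (the real masking has to deal with inheritance
between permissions).  It does reproduce the two permission tables shown
above.

```python
from pathlib import PurePosixPath


def parse_rules(specs):
    """Turn specs like "./dev:rwx" into {path: set of permission letters}.

    A spec with no ":" is treated as read-only, like `-p .` in the examples.
    """
    rules = {}
    for spec in specs:
        path, sep, perms = spec.partition(":")
        rules[PurePosixPath(path)] = set(perms) if sep else {"r"}
    return rules


def resolve(rules, path):
    """Most specific rule wins: use the longest listed ancestor of `path`."""
    path = PurePosixPath(path)
    best = None
    for prefix in rules:
        if path == prefix or prefix in path.parents:
            if best is None or len(prefix.parts) > len(best.parts):
                best = prefix
    return rules[best] if best is not None else set()


def effective(layers, path):
    """Nested sandboxes can only narrow access: intersect every layer."""
    perms = None
    for rules in layers:
        layer = resolve(rules, path)
        perms = layer if perms is None else perms & layer
    return "".join(p for p in "rwx" if p in (perms or ()))


outer = parse_rules([".:rw", "./secret:", "./dev:rwx", "./data:r"])
inner = parse_rules([".:r", "./dev:rx", "./data:rw"])

for p in [".", "./123", "./secret", "./dev", "./dev/123", "./data", "./data/123"]:
    print(f"{p}:{effective([outer, inner], p)}")
```

Passing only [outer] to effective() gives the first table; passing
[outer, inner] gives the nested one.  Note how ./data stays read-only in
the nested case even though the inner sandbox asked for "rw".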
There's an example config file with sample application profiles that can
be enabled by uncommenting the include line in /etc/curtain.conf (and
reading this file is a good way to see how the whole thing works).
Profiles can be used with -a/-A.  Both simply enable the tag named after
the program.  -A is a shortcut that also enables "unsafety level" 1
(most profiles don't actually need it, but some do, so I just use it all
the time).

$ curtain -XA xterm
$ curtain -XA firefox
$ curtain -XA chrome
$ curtain -XA falkon
$ curtain -XA qbittorrent
$ curtain -XA hexchat
$ curtain -XA gimp
$ curtain -XA audacious
# curtain -A tcpdump

Programs started this way still have the default level 5 permissions in
addition to their profile permissions.

Option -k ("kill") enables "strict" mode where the default becomes level
1 and programs are sent SIGKILL when trying to do something forbidden
(otherwise they just get EPERM errors).  I made those two things go
together because unexpected restrictions can make programs misbehave and
this could lead to security issues.  This reduces the attack surface but
it also means you've got to figure out the permissions just right or
your programs are going to get killed a lot.  Also, trying to access
non-unveiled files does not cause a SIGKILL to be sent yet, so missing
unveils have the potential to cause insecure misbehavior too.

See the config files here:

https://github.com/Math2/freebsd-pledge/blob/main/lib/libcurtain/curtain.conf.defaults
https://github.com/Math2/freebsd-pledge/blob/main/lib/libcurtain/curtain.conf.sample

How well does it generally work?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Well, there are some problems.

First of all, "untrusted" X11 access doesn't work all that great.  Some
programs are just unstable with it.  Firefox used to crash a lot with
X11 errors but for some reason it seems to have gotten a lot better
recently.
But there might be thick borders around menus, client-side decorated
windows won't be movable, the system tray won't work, and
selection/clipboard will only work in one direction.  And it'll be
slower.  The alternative is to give them "trusted" X11 access but that's
very insecure.  And even untrusted access isn't so secure either:
untrusted programs are not isolated from one another IIUC.  And who
knows what the window manager, panels and others could be doing with the
window properties of untrusted clients...  And this exposes the huge
complexity of the X11 server.

Wayland's security is supposed to be much better, but it depends on how
the compositors handle security on the extra protocols that they
support.  IIUC there's no consensus on how it should be handled yet and
most compositors still lack security restrictions (but apparently some
people just compile out their support for insecure protocols).

Programs that have built-in support for privilege separation and
self-sandboxing can solve this by not giving direct access to the
display to the sandboxed parts.  And that's something that this
implementation means to support (which can be done on top of sandboxing
the application as a whole).  But it's not a general solution.

Also, dbus/dconf/pulseaudio/etc are not dealt with very well yet.
They're just ignored really.  And (a bit surprisingly) many programs
seem OK with that.  fontconfig will complain a lot but if the font
caches are already up to date it doesn't look like it matters (startup
will be much slower otherwise).  pulseaudio will just die when firefox
tries to start it but then firefox will fall back to using OSS directly
(sndio works too).  Thumbnail caches won't be accessible.  The XDG
shared recent documents list won't work.  dconf will be completely
non-functional and some programs won't be able to save their settings.
Etc.  And even when it works, "desktop integration" in general is going
to be very degraded.
A program trying to launch the desktop environment's handler program to
open a file or URL probably won't work because it'll inherit a sandbox
that is too restrictive.  I haven't really gotten into trying to deal
with this better yet.  I see that there are dbus proxy services for
sandboxing on Linux.  It would probably need something like that.

There are some scripts to sandbox programs with separate XDG directories
or a separate $HOME in /usr/share/examples/curtain/.  But I wish this
weren't necessary...

For non-desktop programs, it generally just works (if you give them
enough permissions).  The main thing causing trouble is usually /tmp.

About the userland parts:
~~~~~~~~~~~~~~~~~~~~~~~~~

libcurtain is a wrapper around the sandboxing syscall.  It lets you
assign permissions to "slots" which then get merged.  Path permissions
can override each other (most specific wins) within a slot, but across
slots they are merged in a non-interfering way (a more specific
permission never cancels out a less specific permission from a different
slot).  Permissions from different bracketed sections of config files
are added to different slots, so they all get merged in this way.

Config files are also handled by libcurtain.  Applications can use
libcurtain directly to sandbox themselves using tags, but the API for
that is more complex than it should be and I'm probably going to make
more changes to it.

I added a freebsd_simple_sandbox() function directly to libc that tries
to load libcurtain and applies a tag.  The idea is to make it as easy as
possible to add configurable, opportunistic sandboxing to applications
without having to link them to libcurtain.  It can be called multiple
times at different stages of initialization of an application, or for
different sub-processes, etc.  The application just specifies a tag for
each call and the details are in the config files.  Conceivably, there
could be different backends implementing the sandboxing.
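The slot merging rule described above (most specific wins inside a slot,
plain union across slots) can be sketched with a small Python model.
This is only an illustration of the stated rule, not libcurtain's API:
the function names and the /srv paths are hypothetical, and permissions
are reduced to r/w/x letters.

```python
from pathlib import PurePosixPath


def slot_perms(slot, path):
    """Within one slot, the most specific (longest) matching path decides."""
    path = PurePosixPath(path)
    matches = [
        p for p in slot
        if PurePosixPath(p) == path or PurePosixPath(p) in path.parents
    ]
    if not matches:
        return set()  # nothing unveiled in this slot covers the path
    most_specific = max(matches, key=lambda p: len(PurePosixPath(p).parts))
    return set(slot[most_specific])


def merged_perms(slots, path):
    """Across slots, the per-slot answers are unioned (purely additive)."""
    result = set()
    for slot in slots:
        result |= slot_perms(slot, path)
    return result


# Two hypothetical slots, e.g. filled from two config file sections.
slot_a = {"/srv": "rw", "/srv/private": ""}  # hides /srv/private in this slot
slot_b = {"/srv": "r"}                       # knows nothing about /srv/private

print(slot_perms(slot_a, "/srv/private/key"))              # set()
print(merged_perms([slot_a, slot_b], "/srv/private/key"))  # {'r'}
```

The empty override in slot_a only hides /srv/private within slot_a.  The
less specific "r" from slot_b still applies after merging, which is the
"non-interfering" behavior: one section of a config file can't take away
what another section granted.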
libcurtain also contains the pledge()/unveil() implementation.  On
OpenBSD, pledge/unveil are available directly in libc (with the
declarations in unistd.h), but the portable versions of some OpenBSD
programs have problems if pledge/unveil are available on non-OpenBSD
platforms because they just don't expect that.  After fixing them, maybe
auto-loading wrappers could be added directly to libc too so that they
just work without having to deal with libcurtain dependencies.

About the kernel-side parts:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Most of the implementation is in a separate mac_curtain module, but it
also needed some changes spread out in the kernel to support it.  That's
what would need to be merged.

The biggest change is adding "sysfils".  The name initially just meant
"syscall filters" but now it's more of a general category of things that
the kernel can do.  Syscalls can be associated with zero or more
required sysfils and some explicit sysfil checks were added in various
places in the kernel as needed.  ucreds have a set of allowed sysfils.
Sysfils are represented as simple bitmaps and checks are fast.  Capsicum
was slightly modified to make use of a sysfil bit to simplify syscall
entry checks.

Sysfils are meant to be part of the internal kernel API; they're not
exposed to the userland.  The curtain module exposes intermediate
"abilities" instead.

Some checks that checked for "capability mode" now check for a more
general "restricted mode" instead.  A process is considered to be in
restricted mode whenever its ucred is missing any sysfil bit.

MAC handlers were added to let curtain hook into places that didn't have
MAC checks.  Some of those new handlers definitely seem out of place.
The new vnode "walk" functions are more of a low-level mechanism than
just a security policy.  And many of the new handlers want to restrict
access to certain functionality as a whole (e.g. ioctls, sockopts,
procctls, etc) rather than compare labels.
But it seemed like the best place to add them because MAC already did
most of what was needed.  So I've been treating the MAC framework like
it stands for "Modular Access Checks" or something.

The curtain permissions are stored in "curtain" objects.  Process ucreds
have their labels point to a curtain.  Curtains have pointers to
"barrier" objects, which contain the hierarchical linkage needed to
restrict access to protected kernel objects.  Those kernel objects have
their labels point directly to barriers.  Barriers can outlive their
curtains.  When a ucred loses its last reference from a process, it is
"trimmed" and its label curtain pointer "decays" into a pointer to the
curtain's barrier so that the curtain can be freed (because curtains can
be a few KBs and they can hold vnode references).  A lot of objects hold
references to ucreds, so they could build up a lot without this.

Processes can sandbox themselves with curtainctl(2).  They have to
specify the full set of permissions they want to retain.  The requested
permissions are then masked with the current curtain (if any).  This
involves dealing with inheritance relationships between permissions (as
the new curtain can have permissions more specific than the old and vice
versa).

Kernel-side handling of filesystem path unveiling was the hardest part
to deal with (given the "statelessness" of the vnode API) and it's all
kind of a big kludge.  I tried to make it as nice as possible and
wrapped the whole thing behind a MAC API (it used to be a lot worse than
that).

Each directory "unveil" acts like a sort of chroot barrier but with
specific permissions.  There's a per-thread "tracker" with a circular
buffer that remembers the permissions for the previous N looked-up
vnodes.  N only needs to be 2 as far as I can tell (most syscalls only
need 1, but linkat() for example needs 2).  The tracker has weak vnode
references and doesn't need to be cleaned up after syscalls.
namei() calls the new MAC handlers to manage the tracker during path
lookup.  fget*() also adds a tracker entry.  Then the access check MAC
handlers can find permissions for the passed vnodes in the tracker.
This only works because almost all of the kernel code that works on
vnodes first gets a reference from namei()/fget*() and then doesn't call
VOP_LOOKUP() directly itself.  It's messy but one good thing with it is
that it usually "fails-secure" if the tracker was mismanaged because it
won't find the vnode in it and it defaults to deny.
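The tracker idea can be illustrated with a toy Python model.  This is
only a sketch of the behavior described above, not the kernel code: the
class and method names are invented, vnodes are stood in for by plain
strings, and weak references are not modeled at all.

```python
from collections import deque


class Tracker:
    """Toy per-thread tracker: permissions for the last N looked-up vnodes."""

    def __init__(self, size=2):
        # deque(maxlen=N) silently drops the oldest entry once full,
        # which gives the same effect as a small circular buffer.
        self.entries = deque(maxlen=size)

    def record(self, vnode, perms):
        """Lookup side (the role played by namei()/fget*() in the text)."""
        self.entries.append((vnode, frozenset(perms)))

    def check(self, vnode, wanted):
        """Access-check side: allow only if a remembered entry grants it."""
        for seen, perms in reversed(self.entries):  # newest entry first
            if seen == vnode:
                return wanted in perms
        return False  # vnode not in the tracker: fail secure, deny


# A linkat()-like operation needs two vnodes tracked at the same time.
tracker = Tracker(size=2)
tracker.record("vnode:src", "r")
tracker.record("vnode:dstdir", "rw")
print(tracker.check("vnode:src", "r"))     # True
print(tracker.check("vnode:dstdir", "w"))  # True
print(tracker.check("vnode:other", "r"))   # False, never looked up
```

The last call shows the "fails-secure" property: a vnode that didn't go
through the lookup path simply isn't found, so the answer is deny.  The
same happens to an entry that has been pushed out of the buffer by later
lookups.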