Message-ID: <25b5c60f-b9cc-78af-86d7-1cc714232364@gmail.com>
Date: Mon, 28 Mar 2022 05:37:44 -0400
List-Id: Technical discussions relating to FreeBSD <freebsd-hackers.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
List-Help: <mailto:freebsd-hackers+help@freebsd.org>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Subscribe: <mailto:freebsd-hackers+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-hackers+unsubscribe@freebsd.org>
Sender: owner-freebsd-hackers@freebsd.org
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:91.0) Gecko/20100101
 Thunderbird/91.7.0
Content-Language: en-US
From: Mathieu <sigsys@gmail.com>
To: freebsd-hackers@FreeBSD.org
Subject: curtain: WIP sandboxing mechanism with pledge()/unveil() support
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

Hello list.  For a while now I've been working on and off on a 
pledge()/unveil() implementation for FreeBSD.  I also wanted it to be 
able to sandbox arbitrary programs that might not expect it with no (or 
very minor) modifications.  So I just kept adding to it until it could 
do that well enough.  I'm still working on it, and there are some known 
issues and some things I'm not sure are done correctly, but overall it's 
in a very functional state now.  It can run most utilities and desktop 
apps (though dbus/dconf/etc are trouble), server daemons, buildworld and 
whole shell/desktop sessions sandboxed and unmodified.

https://github.com/Math2/freebsd-pledge
https://github.com/Math2/freebsd-pledge/blob/main/CURTAIN-README.md

It can be broken up into 4 parts: 1) A MAC module that implements most of 
the functionality.  2) The userland library, sandboxing utility, configs 
and tests.  3) Various kernel changes needed to support it (including 
new MAC handlers and extended syscall filtering).  4) Small 
changes/fixes to the base userland (things like adding reporting to ps 
and modifying some utilities to use $TMPDIR so that they can be properly 
sandboxed).  So 1) and 2) could be in a port.  And I tried to minimize 
3) and 4) as much as possible.

I noted some problems/limitations in the CURTAIN-ISSUES file.  At this 
point I'm mostly wondering about the general design being acceptable for 
merging eventually.  Because most of this could be part of a port, but 
not all of it.  And the way that it deals with filesystem access 
restrictions in particular is kludgy.  So any feedback/testing welcome.

It still lacks documentation (in part because I'm not sure of what could 
still change) so I'm going to give an overview of it here and show some 
examples and that's going to be the documentation for now.  And I'll 
describe the kernel changes that it needed.  So that's going to be a bit 
of a long email.

What it can do:
~~~~~~~~~~~~~~~

It can restrict syscalls and various abilities (by categories that were 
based on OpenBSD's pledge promises), ioctls, sysctls, socket 
options/address families, priv(9) privileges, and filesystem access by 
path.  It can be used at the same time as jails and Capsicum (their 
restrictions are also enforced on top of it).

It can be used in a nested manner.  A program that inherits sandbox 
restrictions can do its own internal sandboxing or sandbox programs that 
it runs (which can then do the same, etc).  The permissions of new 
sandboxes are always a subset of the inherited sandbox.

Certain kernel operations are protected by "barriers" which only allow a 
sandboxed process to operate on kernel objects that were created by 
itself or a descendant sandbox.  There are barriers for 
inspecting/signaling/debugging processes, POSIX/SysV IPC objects, PTYs, 
etc.  Barriers have their own hierarchy which can diverge from the 
process hierarchy.
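
To make the barrier rule a bit more concrete, here is a rough Python
model of it.  This is only a sketch of the rule as described above, not
the kernel code, and the class and function names are made up for
illustration.  Each kernel object is assumed to be labeled with the
barrier of the sandbox that created it, and a subject may operate on it
only if that barrier is its own or nested inside its own.

```python
class Barrier:
    """Hypothetical stand-in for a kernel barrier object."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # enclosing barrier, None at the top

    def encloses(self, other):
        """True if `other` is this barrier or is nested inside it."""
        while other is not None:
            if other is self:
                return True
            other = other.parent
        return False


def may_operate(subject_barrier, object_barrier):
    """Allow access only to objects created by the subject's sandbox or
    by one of its descendants.  Objects created outside any sandbox are
    modeled with object_barrier=None and are never reachable."""
    return subject_barrier.encloses(object_barrier)


outer = Barrier("outer")
inner = Barrier("inner", parent=outer)
sibling = Barrier("sibling", parent=outer)
```

With these three barriers, the outer sandbox can operate on objects from
"inner" and "sibling", but neither of those can operate on objects from
the other or from "outer".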

Restrictions can be specified in configuration files and can be 
associated with named "tags".  Tags are assumed to match application 
names; they're prefixed with "_" when they don't (just the convention 
I've been using so far).  Enabling a tag may cause other tags to be 
enabled depending on configurations.  Permissions associated with 
different tags are merged in a purely additive manner.  Configurations 
can be spread across multiple files and directories 
(/usr/local/etc/curtain.{conf,d} can be used for packages, 
~/.curtain.{conf,d} for user customizations).  It'll check the .d 
directories for files named after the enabled tags.
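
The way tags pull in other tags and merge additively can be sketched
like this in Python.  It's only a model of the behavior described above:
the dict below is not the real config file syntax, and the "myapp" and
"_helper" tags and the "perm:*" strings are placeholders I made up.

```python
# Hypothetical configuration: tag -> (permissions granted, tags enabled).
CONFIG = {
    "myapp": ({"perm:A"}, ["_helper", "_pwddb"]),
    "_helper": ({"perm:B"}, ["_pwddb"]),
    "_pwddb": ({"perm:C"}, []),
}


def expand_tags(enabled, config):
    """Return the enabled tags plus every tag they transitively enable."""
    seen = set()
    stack = list(enabled)
    while stack:
        tag = stack.pop()
        if tag in seen:
            continue
        seen.add(tag)
        _, implied = config.get(tag, (set(), []))
        stack.extend(implied)
    return seen


def merged_permissions(enabled, config):
    """Union of the permissions of all expanded tags (purely additive)."""
    perms = set()
    for tag in expand_tags(enabled, config):
        perms |= config.get(tag, (set(), []))[0]
    return perms
```

Because the merge is a plain union, enabling an extra tag can only add
permissions, never take any away.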

Usage examples:
~~~~~~~~~~~~~~~

curtain(1) is the wrapper utility to sandbox arbitrary programs. Default 
permissions are in /etc/defaults/curtain.conf and /etc/curtain.conf.

Here are a bunch of examples.  A bit random, but they demonstrate a lot of 
the functionality.

$ curtain id

Not very exciting, but it works.  The default permissions don't give it 
access to the user DB so it only shows numeric IDs.  It can be given 
access with the "_pwddb" tag:

$ curtain -t _pwddb id

It's possible to nest sandboxes, but it needs the "curtain" tag because 
the curtain config files are not unveiled by default (they could be 
though, maybe they should be...).

Here, id cannot read the user DB because the outer sandbox doesn't allow it:

$ curtain -t curtain curtain -t _pwddb id

But this way it can:

$ curtain -t curtain -t _pwddb curtain -t _pwddb id

Start a sandboxed shell session with access to ~/work in a clean 
environment:

$ mkdir -p ~/work && curtain -p ~/work:rwx -S

You'll probably miss your dotfiles though.  If you browse around you'll 
see what paths get unveiled by default.

If you try to list processes:

$ curtain ps -ax

You'll just see the ps process itself.  It can be allowed to see 
processes outside of it like this:

$ curtain -d ability-pass:ps ps -ax

But it will not be allowed to signal, reprioritize or debug them (there 
are other "abilities" for that).  The "-pass" means to allow the ability 
in a "passthrough" manner (beyond the sandbox's barrier).  Visibility 
could also be blocked at an outer sandbox's barrier, like so:

$ curtain -t curtain curtain -d ability-pass:ps ps -ax

Give read-only access to the current directory and list files:

$ curtain -p . ls

If you have $CLICOLOR set, it may look less colorful than usual. 
curtain(1) is a bit paranoid and will filter out most control characters 
written to the TTY by default (and set $TERM to "dumb").  They can be 
let through with -R:

$ curtain -R -p . ls

And -T can be used to stop it from doing PTY wrapping altogether and 
give the program direct access to the TTY (which is less secure, but 
there are ioctl restrictions).

Per-path permissions can be specified after a ":".  More specific paths 
override the permissions of less specific paths.

$ curtain -p .:rw -p ./secret: -p ./dev:rwx -p ./data:r ...

Then the paths would have these permissions:
     ./:rw
     ./123:rw
     ./secret:
     ./dev:rwx
     ./dev/123:rwx
     ./data:r
     ./data/123:r

As an example of how nested sandboxing is handled, if you were then to 
do this within this sandbox (don't forget to give it the "curtain" tag):

$ curtain -p .:r -p ./dev:rx -p ./data:rw ...

Then the permissions would end up being:
     ./:r
     ./123:r
     ./secret:
     ./dev:rx
     ./dev/123:rx
     ./data:r
     ./data/123:r
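
The two listings above can be reproduced with a small Python model.
It assumes just two rules: the most specific unveil covering a path
decides its permissions, and a nested sandbox ends up with the
intersection of what it asked for and what the outer sandbox allows.
The real kernel-side masking has more to deal with than this, and the
function names here are mine.

```python
import posixpath


def components(path):
    """Split a path into components, with "." as the empty tuple."""
    norm = posixpath.normpath(path)
    return () if norm == "." else tuple(norm.split("/"))


def lookup(unveils, path):
    """Permissions of the most specific unveil covering `path`."""
    target = components(path)
    best_len, best = -1, ""
    for upath, perms in unveils.items():
        prefix = components(upath)
        if target[:len(prefix)] == prefix and len(prefix) > best_len:
            best_len, best = len(prefix), perms
    return set(best)  # empty set when nothing covers the path


def nested_lookup(outer, inner, path):
    """A nested sandbox never gets more than the outer one allows."""
    return lookup(outer, path) & lookup(inner, path)


# The -p options from the two curtain invocations above.
OUTER = {".": "rw", "./secret": "", "./dev": "rwx", "./data": "r"}
INNER = {".": "r", "./dev": "rx", "./data": "rw"}
```

For example, nested_lookup(OUTER, INNER, "./data/123") gives just "r":
the inner sandbox asked for "rw" on ./data but the outer one only had
"r" there.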

root processes can be sandboxed too.  Some privileges are allowed by 
default (which is similar to the set of privileges allowed by jails), 
but most are denied.  As are accesses to most /dev and /etc files.  For 
example, tcpdump will not be able to use bpf(4):

# curtain tcpdump

But there's a tag for that:

# curtain -t _bpf tcpdump

Something else that won't work:

$ curtain node -e 'console.log(2+2)'

It wants to do a PROT_EXEC mprotect(2) which is not allowed by default.  
By default, PROT_EXEC is only allowed when mmap(2)'ing files that are 
unveiled for execution.

$ curtain -d ability:prot_exec node -e 'console.log(2+2)'

Just what is allowed by default?  Well it's kind of arbitrary and messy 
and there are 10 levels of it.

curtain(1) uses a 10-level "permissions tower" usable with options -0 
to -9 (which enable tags "_level0" to "_level9"). These are mostly just 
meant to be used as a quick way to try giving programs more or less 
access from the command-line (ideally a profile should be made to give 
programs just what they need). The default level currently is 5 (which 
is fairly permissive compared to most pledge(3)'d applications).  All 
levels are intended to be securely containable, but each level exposes a 
greater attack surface than the previous one.  Level 9 is the "please 
just work" level.  It allows using all ioctls, reading all sysctls and 
using almost all rare syscalls.  Filesystem access is still very 
restricted though so you've still got to figure out what unveils the 
program needs.

And there's another dimension to it which is the "unsafety level".  
Directives in the config files can be suffixed with one or more "!" to 
indicate that the permissions that it gives are potentially unsafe, 
depending on circumstances, or could be surprising or undesired.  The 
directive only applies when curtain(1) is invoked with as many or more 
"-!" options.  This was more useful at the beginning when many features 
weren't properly sandboxed yet.  Now it's not used as much.  But I still 
find it useful.  The way I'm using it is "!" is probably no big deal but 
you might want to check it if you're paranoid, "!!" has a real risk of 
allowing escapes in certain plausible scenarios, and "!!!" is very 
likely insecure unless special precautions are taken.
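
As a sketch of how the "!" suffix filtering behaves, here is a tiny
Python model.  The directive strings are placeholders, not real config
syntax, and the function names are made up.  The only rule modeled is
the one above: a directive with N trailing "!" applies only when
curtain(1) was given at least N "-!" options.

```python
def unsafety(directive):
    """Number of trailing "!" markers on a directive."""
    return len(directive) - len(directive.rstrip("!"))


def active_directives(directives, bang_options):
    """Keep directives whose unsafety is at most the number of -! options,
    with the markers stripped off."""
    return [d.rstrip("!") for d in directives if unsafety(d) <= bang_options]


# Placeholder directives with 0 to 3 unsafety markers.
DIRECTIVES = ["allow-a", "allow-b!", "allow-c!!", "allow-d!!!"]
```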

I'm still not sure what the defaults should be or how they could be 
better organized.  The "unsafety" is an odd thing to expose to the user 
and as much as possible I tried to make it unnecessary.

So anyway, a shorter way to make nodejs work is to use level 6 which 
allows PROT_EXEC on anonymous memory (and to execute binaries in $TMPDIR 
too):

$ curtain -6 node -e 'console.log(2+2)'

Now with X programs:

$ curtain -X xlogo
$ curtain -X xterm

-X gives "untrusted" X11 access, -Y "trusted" access (like with ssh) and 
-W is for Wayland.

There's an example config file with sample application profiles that can 
be enabled by uncommenting the include line in /etc/curtain.conf (and 
reading this file is a good way to see how the whole thing works).  
Profiles can be used with -a/-A.  Both simply enable the tag named after 
the program.  -A is a shortcut that also enables "unsafety level" 1 
(most profiles don't actually need it, but some do, so I just use it all 
the time).

$ curtain -XA xterm
$ curtain -XA firefox
$ curtain -XA chrome
$ curtain -XA falkon
$ curtain -XA qbittorrent
$ curtain -XA hexchat
$ curtain -XA gimp
$ curtain -XA audacious
# curtain -A tcpdump

Programs started this way still have the default level 5 permissions in 
addition to their profile permissions.

Option -k ("kill") enables "strict" mode where the default becomes level 
1 and programs are sent SIGKILL when trying to do something forbidden 
(otherwise they just get EPERM errors).  I made those two things go 
together because unexpected restrictions can make programs misbehave and 
this could lead to security issues.  This reduces the attack surface but 
it also means you've got to figure out the permissions just right or 
your programs are going to get killed a lot.  Also, trying to access 
non-unveiled files does not cause a SIGKILL to be sent yet, so missing 
unveils have the potential to cause insecure misbehavior too.

See the config files here:

https://github.com/Math2/freebsd-pledge/blob/main/lib/libcurtain/curtain.conf.defaults
https://github.com/Math2/freebsd-pledge/blob/main/lib/libcurtain/curtain.conf.sample

How well does it generally work?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Well, there are some problems.

First of all, "untrusted" X11 access doesn't work all that great. Some 
programs are just unstable with it.  Firefox used to crash a lot with 
X11 errors but for some reason it seems to have gotten a lot better 
recently.  But there might be thick borders around menus, client-side 
decorated windows won't be movable, the system tray won't work, 
selection/clipboard will only work in one direction.  And it'll be slower.  
The alternative is to give them "trusted" X11 access but that's very 
insecure.  And even untrusted access isn't so secure either: untrusted 
programs are not isolated from one another IIUC.  And who knows what the 
window manager, panels and others could be doing with the window 
properties of untrusted clients...  And this exposes the huge complexity 
of the X11 server.

Wayland's security is supposed to be much better, but it depends on how 
the compositors handle security on the extra protocols that they support 
and IIUC there's not a consensus on how it should be handled yet and 
most compositors still lack security restrictions (but apparently some 
people just compile out their support for insecure protocols).

Programs that have built-in support for privilege-separation and 
self-sandboxing can solve this by not giving direct access to the 
display to the sandboxed parts.  And that's something that this 
implementation means to support (which can be done on top of sandboxing 
the application as a whole).  But it's not a general solution.

Also, dbus/dconf/pulseaudio/etc are not dealt with very well yet. 
They're just ignored really.  And (a bit surprisingly) many programs 
seem OK with that.  fontconfig will complain a lot but if the font 
caches are already up to date it doesn't look like it matters (startup 
will be much slower otherwise).  pulseaudio will just die when firefox 
tries to start it but then it'll fall back to using OSS directly (sndio 
works too).  Thumbnail caches won't be accessible.  The XDG shared 
recent documents list won't work. dconf will be completely 
non-functional and some programs won't be able to save their settings.  
Etc.  And even when it works, "desktop integration" in general is going 
to be very degraded.  A program trying to launch the desktop 
environment's handler program to open a file or URL probably won't work 
because it'll inherit a too restrictive sandbox.  I haven't really 
gotten into trying to deal with this better yet.  I see that there are 
dbus proxy services for sandboxing on Linux.  It would probably need 
something like that.

There are some scripts to sandbox programs with separate XDG directories 
or separate $HOME in /usr/share/examples/curtain/. But I wish doing this 
weren't necessary...

For non-desktop programs, it generally just works (if you give them 
enough permissions).  The main thing causing trouble is usually /tmp.

About the userland parts:
~~~~~~~~~~~~~~~~~~~~~~~~~

libcurtain is a wrapper around the sandboxing syscall.  It allows 
assigning permissions to "slots" which then get merged.  Path permissions 
can override each other (most specific wins) within a slot, but across 
slots they are merged in a non-interfering way (a more specific 
permission never cancels out less specific permissions from a different 
slot).  Permissions from different bracketed sections of config files 
are added to different slots, so they all get merged in this way.
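
The difference between overriding within a slot and merging across
slots can be shown with a short Python model.  This is my reading of
the behavior described above and not libcurtain's API: slots are plain
dicts here and the function names are invented.

```python
import posixpath


def _parts(path):
    """Split a path into components, with "." as the empty tuple."""
    norm = posixpath.normpath(path)
    return () if norm == "." else tuple(norm.split("/"))


def slot_lookup(slot, path):
    """Within one slot, the most specific matching path wins."""
    target = _parts(path)
    best_len, best = -1, ""
    for upath, perms in slot.items():
        prefix = _parts(upath)
        if target[:len(prefix)] == prefix and len(prefix) > best_len:
            best_len, best = len(prefix), perms
    return set(best)


def merged_lookup(slots, path):
    """Across slots, results are unioned, so a specific entry in one slot
    never cancels a broader entry from another slot."""
    result = set()
    for slot in slots:
        result |= slot_lookup(slot, path)
    return result


ONE_SLOT = {"./data": "rw", "./data/logs": "r"}
TWO_SLOTS = [{"./data": "rw"}, {"./data/logs": "r"}]
```

With ONE_SLOT, ./data/logs/x is read-only because the more specific
entry overrides the broader one.  With TWO_SLOTS the same path stays
"rw", since the "r" entry lives in a different slot and can only add.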

Config files are also handled by libcurtain.  Applications can use 
libcurtain directly to sandbox themselves using tags, but the API for 
that is more complex than it should be and I'm probably going to make 
more changes to it.

I added a freebsd_simple_sandbox() function directly to libc that tries 
to load libcurtain and applies a tag.  The idea is to make it as easy as 
possible to add configurable, opportunistic sandboxing to applications 
without having to link them to libcurtain.  It can be called multiple 
times at different stages of initialization of an application, or for 
different sub-processes, etc.  The application just specifies a tag for 
each call and the details are in the config files.  Conceivably, there 
could be different backends implementing the sandboxing.

libcurtain also contains the pledge()/unveil() implementation.  On 
OpenBSD, pledge/unveil are available directly in libc (with the 
declarations in unistd.h), but the portable versions of some OpenBSD 
programs have problems if pledge/unveil are available on non-OpenBSD 
platforms because they just don't expect that.  After fixing them, maybe 
auto-loading wrappers could be added directly to libc too so that they 
just work without having to deal with libcurtain dependencies.

About the kernel-side parts:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Most of the implementation is in a separate mac_curtain module, but it 
also needed some changes spread out in the kernel to support it.  That's 
what would need to be merged.

The biggest change is adding "sysfils".  It initially just meant 
"syscall filters" but now it's more of a general category of things that 
the kernel can do.  Syscalls can be associated with zero or more 
required sysfils and some explicit sysfil checks were added in various 
places in the kernel as needed.  ucreds have a set of allowed sysfils.  
Sysfils are represented as simple bitmaps and checks are fast.  Capsicum 
was slightly modified to make use of a sysfil bit to simplify syscall 
entry checks.

Sysfils are meant to be part of the internal kernel API, they're not 
exposed to the userland.  The curtain module exposes intermediate 
"abilities" instead.

Some checks that checked for "capability mode" now check for a more 
general "restricted mode" instead.  A process is considered in 
restricted mode whenever its ucred is missing any sysfil bit.
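
Here is a minimal Python model of the sysfil checks, just to show the
idea of the bitmap test and of "restricted mode".  The sysfil names and
the syscall-to-sysfil table are invented for the example and are not the
ones used in the kernel.

```python
# Hypothetical sysfil bits.
SYSFIL_STDIO = 1 << 0
SYSFIL_VFS = 1 << 1
SYSFIL_NET = 1 << 2
ALL_SYSFILS = SYSFIL_STDIO | SYSFIL_VFS | SYSFIL_NET

# Hypothetical table: each syscall requires zero or more sysfils.
REQUIRED = {
    "getpid": 0,
    "read": SYSFIL_STDIO,
    "connect": SYSFIL_STDIO | SYSFIL_NET,
}


def syscall_allowed(name, allowed):
    """Allowed when every required bit is present in the ucred's set."""
    return (REQUIRED[name] & ~allowed) == 0


def in_restricted_mode(allowed):
    """Restricted as soon as any sysfil bit is missing."""
    return (allowed & ALL_SYSFILS) != ALL_SYSFILS
```

A ucred holding only SYSFIL_STDIO | SYSFIL_VFS would be in restricted
mode, could still call read, and would be refused connect.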

MAC handlers were added to let curtain hook into places that didn't have 
MAC checks.  Some of those new handlers definitely seem out of place.  
The new vnode "walk" functions are more of a low-level mechanism than 
just a security policy.  And many of the new handlers want to restrict 
access to certain functionality as a whole (e.g. ioctls, sockopts, 
procctls, etc) rather than compare labels.  But it seemed like the best 
place to add them because MAC already did most of what was needed.  So 
I've been treating the MAC framework like it stands for "Modular Access 
Checks" or something.

The curtain permissions are stored in "curtain" objects.  Process ucreds 
have their labels point to a curtain.  Curtains have pointers to 
"barrier" objects, which contain the hierarchical linkage needed to 
restrict access to protected kernel objects. Those kernel objects have 
their labels point directly to barriers.  Barriers can outlive their 
curtains.  When a ucred loses its last reference from a process, it is 
"trimmed" and its label curtain pointer "decays" into a pointer to the 
curtain's barrier so that the curtain can be freed (because curtains can 
be a few KBs and they can hold vnode references).  A lot of objects hold 
references to ucreds, so they could build up a lot without this.

Processes can sandbox themselves with curtainctl(2).  They have to 
specify the full set of permissions they want to retain.  The requested 
permissions are then masked with the current curtain (if any).  This 
involves dealing with inheritance relationships between permissions (as 
the new curtain can have permissions more specific than the old and vice 
versa).

Kernel-side handling of filesystem path unveiling was the hardest part 
to deal with (given the "statelessness" of the vnode API) and it kind of 
is all a big kludge.  I tried to make it as nice as possible and wrapped 
the whole thing behind a MAC API (it used to be a lot worse than that).

Each directory "unveil" acts like a sort of chroot barrier but with 
specific permissions.  There's a per-thread "tracker" with a circular 
buffer that remembers the permissions for the previous N looked-up 
vnodes.  N only needs to be 2 as far as I can tell (most syscalls only 
need 1, but linkat() for example needs 2).  The tracker has weak vnode 
references and doesn't need to be cleaned up after syscalls.  namei() 
calls the new MAC handlers to manage the tracker during path lookup.  
fget*() also adds a tracker entry.  Then the access check MAC handlers 
can find permissions for the passed vnodes in the tracker.  This only 
works because almost all of the kernel code that works on vnodes first 
gets a reference from namei()/fget*() and then doesn't call VOP_LOOKUP() 
directly itself.  It's messy but one good thing with it is that it 
usually "fails-secure" if the tracker was mismanaged because it won't 
find the vnode in it and it defaults to deny.
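
The tracker idea can be modeled in a few lines of Python.  This is a
simplified sketch and not the kernel data structure: vnodes are just
strings here, the class and method names are made up, and there is no
notion of weak references.  It does show the two properties mentioned
above: only the last N lookups are remembered, and anything not found
is denied.

```python
class Tracker:
    """Per-thread ring buffer of (vnode, permissions) for recent lookups."""

    def __init__(self, size=2):
        self.entries = [None] * size
        self.next = 0  # index of the slot that gets overwritten next

    def record(self, vnode, perms):
        """Called from the lookup path (namei()/fget*() in the text)."""
        self.entries[self.next] = (vnode, frozenset(perms))
        self.next = (self.next + 1) % len(self.entries)

    def check(self, vnode, wanted):
        """Called from an access check: search newest entry first."""
        size = len(self.entries)
        for i in range(1, size + 1):
            entry = self.entries[(self.next - i) % size]
            if entry is not None and entry[0] == vnode:
                return wanted in entry[1]
        return False  # vnode not tracked: default to deny
```

With size 2, a linkat()-like operation can record both of its vnodes and
have both checked afterwards, while a third lookup pushes the oldest
entry out and any later check on it fails.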