Re: capsicum(4): .. and SIGTRAP causing syscall really is in siginfo_t.si_errno?

From: David Chisnall <theraven_at_freebsd.org>
Date: Thu, 13 Apr 2023 07:47:14 UTC
Hi,

I added the siginfo member that passes the system call number (si_syscall).  The problem it solves is the syscall system call, i.e. indirect system calls made through syscall(2). For normal system calls, you can extract the system call number from the register frame, since it will be in rax (on amd64). Unfortunately, for the syscall system call, this value is clobbered and you have no way of usefully recovering it.
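Ed's point further down, that the "syscall 93" in the report is really errno 93, carries over directly to a handler: si_errno holds an error value, and the system call number (when it can be recovered at all) is a separate piece of information. Below is a minimal C sketch of the reporting side. It assumes FreeBSD's numbering as given in this thread (ENOTCAPABLE is 93, ECAPMODE is 94), and the helper names capsicum_errno_name() and format_violation() are hypothetical. A FreeBSD SIGTRAP handler would save si_errno and, on FreeBSD 14, si_syscall, then call the formatter outside the handler, since snprintf(3) is not async-signal-safe.

```c
#include <errno.h>
#include <stdio.h>

/* FreeBSD's values from <sys/errno.h>.  Other systems do not define
 * these names, so the fallbacks exist only to decode numbers reported
 * by a FreeBSD process (on Linux, 93 and 94 mean something else). */
#ifndef ENOTCAPABLE
#define ENOTCAPABLE 93
#endif
#ifndef ECAPMODE
#define ECAPMODE 94
#endif

/* si_errno carries an error value, not a system call number. */
static const char *capsicum_errno_name(int err)
{
    switch (err) {
    case ENOTCAPABLE:
        return "ENOTCAPABLE"; /* fd lacks the required capability right */
    case ECAPMODE:
        return "ECAPMODE";    /* operation not allowed in capability mode */
    default:
        return "unexpected";
    }
}

/* Format the line a SIGTRAP handler might log later.  `sysno` is the
 * system call number if the caller could recover it (for example from
 * si_syscall on FreeBSD 14, or from the register frame), else -1. */
static int format_violation(char *buf, size_t len, int err, int sysno)
{
    if (sysno < 0)
        return snprintf(buf, len,
                        "capsicum(4) violation: %s (errno %d), syscall unknown",
                        capsicum_errno_name(err), err);
    return snprintf(buf, len,
                    "capsicum(4) violation: %s (errno %d), syscall %d",
                    capsicum_errno_name(err), err, sysno);
}
```

With the report from this thread, format_violation(buf, sizeof buf, 93, 4) yields "capsicum(4) violation: ENOTCAPABLE (errno 93), syscall 4".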

You might want to take a look at the Verona Sandbox code for inspiration (it works correctly without si_syscall for all system calls except syscall):

https://github.com/microsoft/verona-sandbox

This was my project that required this functionality, since it needed to intercept system calls and convert them to RPCs. It provides a simple mechanism for loading a .so in an unprivileged child process and handling all system calls that touch a global namespace (open, bind, getaddrinfo) via RPC into the parent, with some easy-to-use abstractions for filesystem and network access. It works on Linux with seccomp-bpf and on FreeBSD with Capsicum. The FreeBSD version was significantly easier to write for a variety of reasons (Linux doesn't support strongly aligned allocation in mmap; Linux can't kill the child process when the parent process exits, only when the parent thread exits; seccomp-bpf policies are amazingly fragile and require an entire library dependency to get right).
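Handling open() and friends in the parent implies handing the resulting descriptors back to the sandboxed child. One common mechanism for that, on both FreeBSD and Linux, is passing descriptors over a connected Unix-domain socket as SCM_RIGHTS ancillary data; the child can keep receiving on such a socket after cap_enter(2). The sketch below is an assumption about how an RPC reply could carry a descriptor, not code from the Verona Sandbox, and send_fd() and recv_fd() are hypothetical names.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send the open descriptor `fd` over the Unix-domain socket `sock`.
 * One byte of ordinary data accompanies it, because ancillary data on
 * a stream socket must travel with at least one byte of real data. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;                 /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct msghdr msg;

    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd(); returns the new fd or -1. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct msghdr msg;
    int fd = -1;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    if (cm == NULL || cm->cmsg_level != SOL_SOCKET ||
        cm->cmsg_type != SCM_RIGHTS || cm->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cm), sizeof(int));
    return fd;
}
```

Each call transfers exactly one descriptor; a real RPC layer would presumably also carry a request identifier and an error value alongside it.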

I have a patch under review that adds SIGCAP as an alternative to SIGTRAP, which avoids painful interactions with the debugger. I'd love to get that merged before 14 but haven't had time to address the last round of review comments. I've been running with it locally for a year or so.

David


> On 12 Apr 2023, at 21:35, Steffen Nurpmeso <steffen@sdaoden.eu> wrote:
> 
> Hello!
> 
> Ah, oh!!
> 
> Ed Maste wrote in
> <CAPyFy2Do80xZmNFdtG=xbRuscKaQQM7rQ5ir5TVZENX3UfyKtg@mail.gmail.com>:
> |On Wed, 12 Apr 2023 at 10:49, Steffen Nurpmeso <steffen@sdaoden.eu> wrote:
> |> I am trying to capsicumize a simple daemon (for learning purposes
> |> as that runs only in the second line behind postfix), and i have
> ...
> |Excellent, always happy to see folks exploring Capsicum.
> |
> |Keep in mind that Capsicum and pledge/unveil are not equivalent, so
> |comparing the ease of applying one or the other isn't particularly
> |meaningful. Achieving similar security properties with pledge/unveil
> |as with Capsicum requires similar effort in decomposing and
> |refactoring existing applications.
> 
> Luckily not this simple thing.  (Together with unveil, pledge seems
> pretty good, despite the many system calls i get, and of course
> the path fixation that does not allow users to add new paths when
> they reload configurations .. the way the program is designed;
> i like that new syslog system call which avoids all the things GNU
> C lib for example does and potentially needs, later maybe more.
> I think capsicum is very, very smart and capable, like CloudABI
> was.  Yet very hard to work with as it needs so many new *at(),
> needs to have hooks to modify descriptors after accept(), and
> openat(), etc.  And needs user-path <> realpath(3) mappings .. the
> way i do it.)
> 
> As i am very new with this -- am i correct in assuming that once
> a capability is set on a directory or listening socket, FDs opened
> / accepted from it inherit the capability of "their parent"?
> 
> |> Anyhow.  Regardless of 13.1-i386 or 12.2-amd64 (despite
> |> no_new_privs) i only see
> |>
> |>   capsicum(4) violation (syscall 93, 4, 5, 0); please report this bug
> |
> |I'm not sure what you mean in the subject with respect to the syscall
> |in siginfo_t.si_errno. It looks like this is ENOTCAPABLE, which means
> 
> This is a misunderstanding!!  I *thought* PROC_TRAPCAP_CTL_ENABLE
> saying "the si_errno member of the siginfo signal handler
> parameter is set to the syscall error value" meant the actual
> "syscall number"!  And since git head now has that
> _capsicum._syscall member i thought *that* would now be an
> explicit thing "to detangle that".
> It really is an error number!
> I did not even think about that!
> So .. the actual syscall number is not available in that siginfo_t
> before FreeBSD 14?  I guess you guys simply write one of those
> dtrace snippets to get over that.
> (You know i did not even think, because the Linux seccomp(2) thing
> i did like that, though there it is SIGSYS and the syscall is in
> si_syscall.  The capsicum(4) and rights(4) etc manuals are
> complete, but for someone without any real foreknowledge they miss
> some small hints, here and there.  Not that Linux does that
> better.  Or OpenBSD, where you need at least one unveil with "some
> meat" in order to apply it, even if you simply want no paths at
> all.  .. I think.)
> 
> |an attempt to perform an operation on an fd that you are not allowed
> |to do - for example, calling write() on an fd which has had
> |cap_rights_limit() applied without CAP_WRITE. errno 94 is ECAPMODE.
> 
> Ah yes!  Not a thought on error values.
> 
> |This could be for example trying to use open() in capability mode,
> |which is just not permitted (openat() is).
> 
> Yes.  I have had real problems with that, or rather that AT_FDCWD is
> not possible.  (And realpath did cause violations, in at least
> 12.2, .. though yesterday evening the program was in terrible
> state on FreeBSD.)
> 
> |>     This takes the usual shortcut of only sandboxing the last input file.
> |>     It's a first cut and this program will be easy to adapt to sandbox all
> |>     files in the future
> |>
> |> from a December 2016 commit message, and i like the word "easy".
> |
> |cap_fileargs() didn't exist in December 2016 and there was not yet a
> |straightforward, performant and desirable way to apply Capsicum to
> |existing applications that operate on a list of files provided on the
> |commandline.
> |
> |For a more recent change that makes use of cap_fileargs a good example
> |commit is:
> |
> |commit 802c2095b5a6dcf0f63c473cbba1e40445e9052a
> |Author: Mark Johnston <markj@FreeBSD.org>
> |Date:   Thu Aug 1 18:57:08 2019 +0000
> |
> |    Capsicumize readelf(1).
> |
> |    Reviewed by:    oshogbo
> |    Sponsored by:   The FreeBSD Foundation
> |    Differential Revision:  https://reviews.freebsd.org/D21108
> 
> I had the impression that casper uses a supervising process.  You
> know, then i thought i better do it myself: this allows the Linux
> seccomp(2) program for the clients and the server to be
> streamlined; not only for this small one, where that bystanding
> process only logs; ie, i simply sliced that into the server, and
> the server then forks again so that logger actually can
> synchronize on the server via SIGCLD (etc etc etc), and thus can
> inherit file locks, naturally, etc etc.
> 
> --End of <CAPyFy2Do80xZmNFdtG=xbRuscKaQQM7rQ5ir5TVZENX3UfyKtg@mail.gmail.com>
> 
> Thank you.
> 
> --steffen
> |
> |Der Kragenbaer,                The moon bear,
> |der holt sich munter           he cheerfully and one by one
> |einen nach dem anderen runter  wa.ks himself off
> |(By Robert Gernhardt)
>