CFT: snmalloc as libc malloc

From: David Chisnall <theraven_at_FreeBSD.org>
Date: Thu, 09 Feb 2023 12:08:49 UTC
Hi,

For the past few years I've been running locally with snmalloc as the malloc 
in libc.  Eventually I'd like to propose this for upstreaming but it 
needs some wider testing first.

For those unfamiliar with snmalloc 
(https://github.com/microsoft/snmalloc), it is an allocator (or, rather, 
a toolkit for building allocators) from my team at Microsoft Research 
designed for both performance and security.  A few highlights:

  - Snmalloc uses a message-passing design, which makes allocating on 
one thread and freeing on another cheap.
  - Very fast allocation performance.
  - Randomisation of relative locations of allocations.
  - Most metadata is stored out-of-band.
  - In-band metadata uses some lightweight encryption to protect against 
corruption.
  - Support for CHERI.
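
To give a rough idea of what "message-passing" means for the first point 
above, here is a minimal sketch in C11.  It is not snmalloc's actual queue 
or API: toy_allocator, remote_free and drain_inbox are hypothetical names, 
and the inbox is a simple atomic stack that the owner empties in one swap.  
The point is only that a free from another thread is a single atomic push, 
and the owning thread reclaims the whole batch later without taking a lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* A freed object is reused as a list node (hypothetical layout). */
struct freed_obj {
    struct freed_obj *next;
};

/* Toy per-thread allocator state; not snmalloc's real structure. */
struct toy_allocator {
    _Atomic(struct freed_obj *) inbox; /* pushed to by any thread */
    struct freed_obj *free_list;       /* touched only by the owner */
};

static void toy_allocator_init(struct toy_allocator *a)
{
    atomic_init(&a->inbox, (struct freed_obj *)NULL);
    a->free_list = NULL;
}

/* Called by a thread that does not own the allocation: post it to the
 * owner's inbox with a single compare-and-swap loop. */
static void remote_free(struct toy_allocator *owner, void *p)
{
    assert(p != NULL);
    struct freed_obj *obj = p;
    struct freed_obj *head =
        atomic_load_explicit(&owner->inbox, memory_order_relaxed);
    do {
        obj->next = head;
    } while (!atomic_compare_exchange_weak_explicit(
        &owner->inbox, &head, obj,
        memory_order_release, memory_order_relaxed));
}

/* Called by the owner: take the whole inbox in one atomic swap and move
 * the objects onto the local free list.  Returns how many were moved. */
static size_t drain_inbox(struct toy_allocator *self)
{
    struct freed_obj *batch = atomic_exchange_explicit(
        &self->inbox, (struct freed_obj *)NULL, memory_order_acquire);
    size_t moved = 0;
    while (batch != NULL) {
        struct freed_obj *next = batch->next;
        batch->next = self->free_list;
        self->free_list = batch;
        batch = next;
        moved++;
    }
    return moved;
}
```

Because the owner always takes the entire inbox at once, this toy version 
avoids the ABA problem that a pop-one-at-a-time stack would have.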

In the (limited!) testing that I've done, it outperforms jemalloc and 
results in a smaller libc binary.

I've also previously managed to use it in the kernel, though that code 
hasn't been tested in a while (last used with FreeBSD 11):

https://github.com/microsoft/snmalloc/blob/main/src/snmalloc/pal/pal_freebsd_kernel.h

It is also used in the Verona process sandboxing work, which makes it 
easy to isolate a library in a Capsicum sandbox:

https://github.com/microsoft/verona/tree/master/experiments/process_sandbox

We test on FreeBSD in CI upstream and the code is actively maintained.
We have implemented compatibility wrappers for all of the jemalloc 
non-standard APIs that FreeBSD's libc exposes.

In particular, snmalloc is designed to make it very cheap to find the 
start and end of an allocation, given a heap pointer.  This means that 
we can insert bounds checks in critical libc functions to prevent heap 
overflow.  The branch does this for memcpy, which an investigation of a 
corpus of security vulnerabilities showed was the root cause of about 
10% of arbitrary-code-execution vulnerabilities.
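
As a rough illustration of the idea (not the code in the branch), here is 
a small C sketch of a bounds-checked memcpy.  The allocator query is faked: 
toy_track and toy_remaining_bytes are hypothetical helpers that know about 
exactly one allocation, standing in for the cheap start/end lookup that the 
real allocator provides.  The sketch also returns false rather than 
aborting, purely so that the behaviour is easy to observe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for allocator metadata: one tracked allocation
 * covering the half-open address range [start, end). */
struct toy_alloc {
    uintptr_t start;
    uintptr_t end;
};

static struct toy_alloc toy_heap;

/* Register the single allocation that the toy "heap" knows about. */
static void toy_track(void *base, size_t size)
{
    assert(base != NULL);
    toy_heap.start = (uintptr_t)base;
    toy_heap.end = toy_heap.start + size;
}

/* Bytes left between p and the end of its allocation.  Pointers outside
 * the tracked allocation (stack, globals) cannot be checked, so they
 * report SIZE_MAX, meaning "no limit known". */
static size_t toy_remaining_bytes(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    if (addr >= toy_heap.start && addr < toy_heap.end)
        return toy_heap.end - addr;
    return SIZE_MAX;
}

/* Copy only if neither argument would run past its heap allocation. */
static bool checked_memcpy(void *dst, const void *src, size_t len)
{
    if (len > toy_remaining_bytes(dst) || len > toy_remaining_bytes(src))
        return false;
    memcpy(dst, src, len);
    return true;
}
```

The check is one lookup and one comparison per pointer, which is only 
viable on a hot function like memcpy if the lookup itself is cheap.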

The bounds checks are controlled via the environment variable 
LIBC_BOUNDS_CHECKS.  Setting this to 0 disables the checks, 1 checks 
destination arguments only, and 2 checks both sources and destinations.  
An ifunc resolver selects the correct memcpy implementation at load time.
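
The selection logic can be sketched in portable C as below.  This is an 
illustration only, not the resolver in the branch: it uses an ordinary 
function pointer where the real thing uses an ifunc, the three memcpy 
variants are stubs, and all of the names are hypothetical.  The email does 
not say what happens when the variable is unset or malformed, so the 
fallback is an explicit parameter (and the CHECK_NONE default used in 
resolve_from_environment is an assumption).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum check_mode { CHECK_NONE = 0, CHECK_DST = 1, CHECK_BOTH = 2 };

typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Map the variable's value to a mode.  Anything other than exactly
 * "0", "1" or "2" yields the caller-supplied fallback. */
static enum check_mode parse_mode(const char *value, enum check_mode fallback)
{
    if (value == NULL || value[0] == '\0' || value[1] != '\0')
        return fallback;
    switch (value[0]) {
    case '0': return CHECK_NONE;
    case '1': return CHECK_DST;
    case '2': return CHECK_BOTH;
    default:  return fallback;
    }
}

/* Stub variants: a real implementation would do its bounds checks
 * before copying.  Here they all just forward to memcpy. */
static void *memcpy_unchecked(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);
}

static void *memcpy_check_dst(void *d, const void *s, size_t n)
{
    /* destination bounds check would go here */
    return memcpy(d, s, n);
}

static void *memcpy_check_both(void *d, const void *s, size_t n)
{
    /* source and destination bounds checks would go here */
    return memcpy(d, s, n);
}

/* Pick the variant once, so callers pay no per-call mode test. */
static memcpy_fn select_memcpy(enum check_mode mode)
{
    switch (mode) {
    case CHECK_NONE: return memcpy_unchecked;
    case CHECK_DST:  return memcpy_check_dst;
    case CHECK_BOTH: return memcpy_check_both;
    }
    assert(!"unknown check mode");
    return memcpy_unchecked;
}

/* Assumption: checks default to off when the variable is absent. */
static memcpy_fn resolve_from_environment(void)
{
    return select_memcpy(
        parse_mode(getenv("LIBC_BOUNDS_CHECKS"), CHECK_NONE));
}
```

Doing the selection once at load time means the mode is fixed for the 
lifetime of the process, so changing the variable later has no effect.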

I did have a version that checked a bunch of other libc functions (e.g. 
sprintf, puts) but it was quite hacky (and the way the ifunc resolvers 
were implemented broke Tcl).

The current branch puts two things behind the MALLOC_PRODUCTION toggle:

  - The additional security checks that detect corruption of malloc state.
  - Pretty-printing errors.

We are currently splitting the former into separate knobs upstream; 
some subset should probably be turned on by default in production.  The 
latter has less of a performance impact than it had and will probably be 
on for all configurations at some point once we've refactored slightly 
to ensure the compiler can tail call the failure function (which moves 
it entirely off the fast path).  With this enabled, you get errors that 
look like this:

Fatal Error!
memcpy with source out of bounds of heap allocation:
         range [0x14823c02440, 0x14823c0246a)
         allocation [0x14823c02440, 0x14823c02450)
range goes beyond allocation by 0x1a bytes

Abort trap (core dumped)

Without it, you just get an illegal instruction trap.
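
For anyone reading such a report, the final line is simply the end of the 
copied range minus the end of the allocation.  The sketch below rebuilds 
the message from those four addresses.  It is only an illustration of the 
arithmetic and layout: format_overrun and overrun_bytes are hypothetical 
helpers, not the functions snmalloc uses, and the addresses are assumed to 
fit in 64 bits.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* How far the copied range [.., range_end) runs past the allocation
 * [.., alloc_end).  Only meaningful when there is an overrun. */
static uint64_t overrun_bytes(uint64_t range_end, uint64_t alloc_end)
{
    assert(range_end > alloc_end);
    return range_end - alloc_end;
}

/* Write a report in the same shape as the one quoted above.  `what` is
 * "source" or "destination".  Returns snprintf's result. */
static int format_overrun(char *out, size_t out_len, const char *what,
                          uint64_t range_start, uint64_t range_end,
                          uint64_t alloc_start, uint64_t alloc_end)
{
    return snprintf(out, out_len,
        "memcpy with %s out of bounds of heap allocation:\n"
        "         range [0x%" PRIx64 ", 0x%" PRIx64 ")\n"
        "         allocation [0x%" PRIx64 ", 0x%" PRIx64 ")\n"
        "range goes beyond allocation by 0x%" PRIx64 " bytes\n",
        what, range_start, range_end, alloc_start, alloc_end,
        overrun_bytes(range_end, alloc_end));
}
```

With the addresses from the report above, the range ends at 0x14823c0246a 
and the allocation at 0x14823c02450, so the overrun is 0x1a (26) bytes: a 
42-byte copy out of a 16-byte allocation.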

There are a few limitations in the current branch:

  - The memcpy integration is broken on non-amd64 platforms (patches 
welcome from people who can test these!).
  - Only memcpy (not, for example, memmove) has bounds checks.
  - The memcpy in rtld is naive, which may impact performance.
  - MALLOC_PRODUCTION conflates too many things.

The branch is here:

https://github.com/davidchisnall/freebsd-src/tree/snmalloc2

It adds snmalloc as a submodule in contrib.  FreeBSD is allergic to 
submodules, so upstreaming will need to replace this with something more 
complicated.  You should be able to cherry-pick the top commit on any 
vaguely-recent -CURRENT.

You should also be able to build the libc from this branch against the 
version that you're running and try it with LD_LIBRARY_PATH.

I'd love to hear feedback on:

  - Performance, especially workloads where snmalloc does badly.
  - RSS usage (again, especially workloads where snmalloc does badly).
  - Anything that breaks.

David