if_wg simd chapoly needs some love

Jason A. Donenfeld Jason at zx2c4.com
Mon Mar 15 04:58:39 UTC 2021


Kyle, MattD, and I have just finished an excursion reworking if_wg
(hopefully in time for 13.0, or maybe 13.1). I was shocked at how rough
the code was that had been committed to the tree – things left
unimplemented, cryptographic mistakes, security flaws, ominous comments.
It was as though the thing was never finished and was committed to the
tree in half-baked form. Oh well. We spent some time getting it into
shape and now we’re in a (somewhat) better situation.

One casualty in the process, however, was SIMD crypto. I removed 40,000
lines of hacked-up, Linux-derived code that wasn’t really wired up correctly
and had a maze of insane Linux ifdefs, and I replaced those with 1,800
lines of boring C code in a file called “crypto.c”, whose safety and
correctness I’m a whole lot more sure of. But the downside is that
there’s no longer an AVX{,2,512} ChaCha20Poly1305 implementation hooked
up to WireGuard. It’s still pretty fast, but it’s definitely not _as fast_.

So this email is a call for help in wiring up some SIMD crypto properly.
There are already a few implementations in the tree – in OpenSSL, in
libsodium, and in BearSSL. The OpenSSL that’s inside /sys seems to have
Andy Polyakov’s super fast implementations [3], which I like. It seems
like just a little bit of plumbing is necessary to do this.

Now, there are two categories of fixups that can be done. The first
category is “two obvious places that will lead to massive measurable
performance increases.” The second category is “other subtle places that
are hard to get right, and won’t really make much of a difference.”

The two obvious places that will lead to massive measurable performance
increases are at [1] and [2]. These calls to
chacha20poly1305_{encrypt,decrypt} are where the actual IP packets get
encrypted and decrypted. If you change those calls to instead invoke
extrafast_chacha20poly1305_{encrypt,decrypt}, and maybe twiddle some FPU
bits on the task where those run, then you’ll have scored a massive
performance gain.
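
To make the shape of that swap concrete, here is a rough userspace sketch in C.
Everything in it is a hypothetical stand-in rather than code from the tree:
fpu_begin()/fpu_end() only count nesting (in the kernel, something like the
fpu_kern_enter(9) family would presumably fill that role), the two "ciphers"
just copy bytes, and the generic fallback path is an assumption about how one
might handle CPUs without the SIMD code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for saving/restoring the task's FPU/SIMD state. */
static int fpu_depth;                 /* > 0 while SIMD registers may be used */
static void fpu_begin(void) { fpu_depth++; }
static void fpu_end(void)   { fpu_depth--; }

/* Records which implementation handled the last packet. */
enum impl { IMPL_NONE, IMPL_GENERIC, IMPL_SIMD };
static enum impl last_impl = IMPL_NONE;

/* Placeholder for the portable C path. NOT real crypto, it only copies. */
static void generic_encrypt(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
    last_impl = IMPL_GENERIC;
}

/* Placeholder for the SIMD path. NOT real crypto, it only copies. */
static void extrafast_encrypt(uint8_t *dst, const uint8_t *src, size_t len)
{
    assert(fpu_depth > 0);            /* SIMD code must run inside the bracket */
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
    last_impl = IMPL_SIMD;
}

/* Call-site wrapper: bracket the SIMD call, otherwise use the generic code. */
static void packet_encrypt(uint8_t *dst, const uint8_t *src, size_t len,
                           bool simd_ok)
{
    if (simd_ok) {
        fpu_begin();
        extrafast_encrypt(dst, src, len);
        fpu_end();
    } else {
        generic_encrypt(dst, src, len);
    }
}
```

The point is only that the FPU bracket sits tightly around the SIMD call and is
always released before returning; where the bracket really belongs in if_wg
(per packet, or once per batch on the worker task) is left open here.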

The other subtle places that are hard to get right, and won’t really
make much of a difference, would still be nice to replace with code
properly in the crypto/ module, rather than in my crypto.c stopgap, but
this work is *much less* urgent, and harder to get right. If you’re also
interested in doing this, please at least shoot me an email or message
on IRC so I can point to various pitfalls and weird things that might
not be obvious when porting that code elsewhere.

However, if you want to work on [1] and [2] today, I don’t think that
will require much coordination or headscratching, and it’d make a big
difference. There are really only a few small requirements for
extrafast_chacha20poly1305_{encrypt,decrypt}: a) they shouldn’t need to
allocate memory; b) they should be synchronous; c) there should be no
overhead associated with changing the {en,de}cryption key used; d) they
should handle 64-bit nonces. These requirements match what you can build
out of the existing OpenSSL code I saw in the tree. It’s also possible
you’ll want to bring in new code from elsewhere. Or maybe you have other
ideas. But anyway, [1] and [2] should be easy functions to swap out.
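
As a sketch of what requirements (a) through (d) might look like in C: the
prototype below is hypothetical (its parameter list is an illustration, not
the signature of anything in the tree), with the key passed per call, no
context object to allocate, and a plain 64-bit nonce. The helper shows one way
to feed such a nonce to a backend that expects the 96-bit nonce of the IETF
construction: WireGuard defines it as 32 zero bits followed by the 64-bit
little-endian counter. Whether the in-tree OpenSSL code wants exactly this
layout is an assumption to verify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KEY_LEN   32   /* ChaCha20 key size in bytes */
#define NONCE_LEN 12   /* IETF ChaCha20-Poly1305 nonce size in bytes */

/*
 * Hypothetical prototype: synchronous, no allocation, key supplied on every
 * call (so switching keys costs nothing), 64-bit nonce. Declared only to
 * show the shape; it is not defined or called in this sketch.
 */
void extrafast_chacha20poly1305_encrypt(uint8_t *dst, const uint8_t *src,
                                        size_t src_len, const uint8_t *ad,
                                        size_t ad_len, uint64_t nonce,
                                        const uint8_t key[KEY_LEN]);

/* Expand a 64-bit counter into a 12-byte nonce: 4 zero bytes, then LE64. */
static void nonce64_to_ietf(uint8_t out[NONCE_LEN], uint64_t counter)
{
    assert(out != NULL);
    for (size_t i = 0; i < 4; i++)
        out[i] = 0;
    for (size_t i = 0; i < 8; i++)
        out[4 + i] = (uint8_t)(counter >> (8 * i));
}
```

Building the nonce on the stack like this keeps the call allocation-free, and
byte-wise shifting avoids depending on the host's endianness.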

If you’re interested and have questions, let me know. I don’t (yet!)
know a whole lot about the various facilities in the FreeBSD tree, but I
know crypto implementation stuff and WireGuard stuff pretty decently, so
I’m happy to jump in where you need.


[1] noise_remote_encrypt:
[2] noise_remote_decrypt:
[3] openssl avx implementations:

More information about the freebsd-hackers mailing list