Can In-Kernel TLS (kTLS) work with any OpenSSL Application?
Mark Johnston
markj at freebsd.org
Wed Jan 27 19:04:41 UTC 2021
On Sat, Jan 23, 2021 at 03:25:59PM +0000, Rick Macklem wrote:
> Ronald Klop wrote:
> >On Wed, 20 Jan 2021 21:21:15 +0100, Neel Chauhan <nc at freebsd.org> wrote:
> >
> >> Hi freebsd-current@,
> >>
> >> I know that In-Kernel TLS was merged into the FreeBSD HEAD tree a while
> >> back.
> >>
> >> With 13.0-RELEASE around the corner, I'm thinking about upgrading my
> >> home server, well if I can accelerate any SSL application.
> >>
> >> I'm asking because I have a home server on a symmetrical Gigabit
> >> connection (Google Fiber/Webpass), and that server runs a Tor relay. If
> >> you're interested in how Tor works, the EFF has a writeup:
> >> https://www.eff.org/pages/what-tor-relay
> >>
> >> But the main point for you all is: more-or-less Tor relays deal with
> >> 1000s of TLS connections going into and out of the server.
> >>
> >> Would In-Kernel TLS help with an application like Tor (or even load
> >> balancers/TLS termination), or is it more for things like web servers
> >> sending static files via sendfile() (e.g. CDN used by Netflix).
> >>
> >> My server could also work with Intel's QuickAssist (since it has an
> >> Intel Xeon "Scalable" CPU). Would QuickAssist SSL be more helpful here?
> There is now qat(4), which KTLS should be able to use, but I do
> not think it has been tested for this. I also have no idea
> if it can be used effectively for userland encryption?
KTLS requires support for separate output buffers and AAD buffers, which
I hadn't implemented in the committed driver. I have a working patch
which adds that, so when that's committed qat(4) could in principle be
used with KTLS. So far I only tested with /dev/crypto and a couple of
debug sysctls used to toggle between the different cryptop buffer
layouts, not with KTLS proper.
qat(4) can be used by userspace via cryptodev(4). This comes with a
fair bit of overhead since it involves a round-trip through the kernel
and some extra copying. AFAIK we don't have any framework for exposing
crypto devices directly to userspace, akin to DPDK's poll mode
drivers or netmap.
I've seen a few questions about the comparative (dis)advantages of QAT
and AES-NI so I'll sidetrack a bit and try to characterize qat(4)'s
performance here based on some microbenchmarking I did this week. This
was all done in the kernel and so might need some qualification if
you're interested in using qat(4) from userspace. Numbers below are
gleaned from an Atom C3558 at 2.2GHz with an integrated QAT device. I
mostly tested AES-CBC-256 and AES-GCM-256.
The high-level tradeoffs are:
- qat(4) introduces a lot of latency. For a single synchronous
operation it can take between 2x and 100x more time than aesni(4) to
complete. aesni takes 1000-2000 cycles to handle a request plus
3-5 cycles per byte depending on the algorithm. qat takes at least
~150,000 cycles between calling crypto_dispatch() and the cryptop
completion callback, plus 5-8 cycles per byte. qat dispatch itself is
quite cheap, typically 1000-2000 cycles depending on the size of the
buffer. Handling a completion interrupt involves a context switch to
the driver ithread but this is also a small cost relative to the
entire operation. So, for anything where latency is crucial QAT is
probably not a great bet.
- qat can save a correspondingly large number of CPU cycles. It takes
qat roughly twice as long as aesni to complete encryption of a 32KB
buffer using AES-CBC-256 (more with GCM), but with qat the CPU is idle
much of the time. Dispatching the request to firmware takes less than
1% of the total time elapsed between request dispatch and completion,
even with small buffers. OTOH with really small buffers aesni can
complete a request in the time that it takes qat just to dispatch the
request to the device, so at best qat will give comparable throughput
and CPU usage and worse latency.
- qat can handle multiple requests in parallel. This can improve
throughput dramatically if the producer can keep qat busy.
Empirically, the maximum throughput improvement is a function of the
request size. For example, counting the number of cycles required to
encrypt 100,000 buffers using AES-GCM-256:
    max # in flight    1        16       64       128
    aesni, 16B         206M     n/a      n/a      n/a
    aesni, 4KB         1.52B    n/a      n/a      n/a
    aesni, 32KB        10.8B    n/a      n/a      n/a
    qat, 16B           17.1B    1.11B    219M     184M
    qat, 4KB           20.9B    1.68B    710M     694M
    qat, 32KB          38.2B    8.37B    4.25B    4.23B
As a side note, OpenCrypto supports async dispatch for software crypto
drivers, in which crypto_dispatch() hands work off to other threads.
IPsec, for example, enables this via the net.inet.ipsec.async_crypto sysctl. Of
course, the maximum parallelism is limited by the number of CPUs in
the system, but this can improve throughput significantly as well if
you're willing to spend the corresponding CPU cycles.
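To put rough numbers on the latency tradeoff, here is a minimal Python
sketch of the linear cost model implied by the figures quoted above
(fixed per-request overhead plus a per-byte cost). The constants are an
assumption: I took midpoints of the quoted ranges and the quoted lower
bound for the qat fixed cost. The function names are hypothetical and
nothing here is measured data.

```python
# Rough linear cost model: cycles = fixed overhead + per-byte cost * size.
# All constants are assumptions derived from the ranges quoted above.
AESNI_FIXED = 1500      # midpoint of 1000-2000 cycles per request
AESNI_PER_BYTE = 4.0    # midpoint of 3-5 cycles per byte
QAT_FIXED = 150_000     # quoted lower bound, dispatch to completion callback
QAT_PER_BYTE = 6.5      # midpoint of 5-8 cycles per byte


def aesni_cycles(nbytes):
    """Estimated cycles for one synchronous aesni request."""
    return AESNI_FIXED + AESNI_PER_BYTE * nbytes


def qat_cycles(nbytes):
    """Estimated cycles from dispatch to completion for one qat request."""
    return QAT_FIXED + QAT_PER_BYTE * nbytes


def latency_ratio(nbytes):
    """How many times longer a single qat request takes than aesni."""
    return qat_cycles(nbytes) / aesni_cycles(nbytes)


for size in (16, 4096, 32768):
    print(f"{size:>6} B: qat/aesni latency ~ {latency_ratio(size):.1f}x")
```

With these assumed constants the ratio falls from close to 100x for
16-byte buffers to under 3x for 32KB buffers, which is in line with the
"between 2x and 100x" range above. The fixed qat overhead dominates
until the buffer gets large.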
To summarize, QAT can be beneficial when some or all of the following
apply:
1. You have large requests. qat can give comparable throughput for
small requests if the producer can exploit parallelism in qat, though
OpenCrypto's backpressure mechanism is really primitive (arguably
non-existent) and performance will tank if things get to a point
where qat can't keep up.
2. You're able to dispatch requests in parallel. But see point 1.
3. CPU cycles are precious and the extra latency is tolerable.
3b. aesni doesn't implement some transform that you care about, but qat
does. Some (most?) Xeons don't implement the SHA extensions for
instance. I don't have a sense for how the plain cryptosoft driver
performs relative to aesni though.
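As a rough illustration of points 1 and 2, the AES-GCM-256 table above
can be turned into per-request costs and a "break-even" queue depth.
This is only a sketch: the numbers are copied from the table, the
helper name break_even_depth is hypothetical, and only the four tested
queue depths are considered.

```python
# Total cycles to encrypt 100,000 buffers with AES-GCM-256, copied from the
# table above. The inner keys for qat are the max number of requests in flight.
REQUESTS = 100_000
AESNI = {"16B": 206e6, "4KB": 1.52e9, "32KB": 10.8e9}
QAT = {
    "16B": {1: 17.1e9, 16: 1.11e9, 64: 219e6, 128: 184e6},
    "4KB": {1: 20.9e9, 16: 1.68e9, 64: 710e6, 128: 694e6},
    "32KB": {1: 38.2e9, 16: 8.37e9, 64: 4.25e9, 128: 4.23e9},
}


def break_even_depth(size):
    """Smallest tested queue depth at which qat needs no more total cycles
    than aesni for this buffer size, or None if it never gets there."""
    for depth in sorted(QAT[size]):
        if QAT[size][depth] <= AESNI[size]:
            return depth
    return None


for size in AESNI:
    best = min(QAT[size].values())
    print(f"{size:>4}: aesni {AESNI[size] / REQUESTS:>9.0f} cycles/req, "
          f"best qat {best / REQUESTS:>9.0f} cycles/req, "
          f"break-even depth {break_even_depth(size)}")
```

Going by the table, qat only catches up with aesni on 16-byte buffers
at 128 requests in flight, while for 32KB buffers 16 in flight is
already enough. That is the sense in which large requests and parallel
dispatch both help.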