Can In-Kernel TLS (kTLS) work with any OpenSSL Application?

Wed Jan 27 19:04:41 UTC 2021

On Sat, Jan 23, 2021 at 03:25:59PM +0000, Rick Macklem wrote:
> Ronald Klop wrote:
> >On Wed, 20 Jan 2021 21:21:15 +0100, Neel Chauhan <nc at freebsd.org> wrote:
> >
> >> Hi freebsd-current@,
> >>
> >> I know that In-Kernel TLS was merged into the FreeBSD HEAD tree a while
> >> back.
> >>
> >> With 13.0-RELEASE around the corner, I'm thinking about upgrading my
> >> home server, well if I can accelerate any SSL application.
> >>
> >> I'm asking because I have a home server on a symmetrical Gigabit
> >> connection (Google Fiber/Webpass), and that server runs a Tor relay. If
> >> you're interested in how Tor works, the EFF has a writeup:
> >> https://www.eff.org/pages/what-tor-relay
> >>
> >> But the main point for you all is: Tor relays more or less deal with
> >> 1000s of TLS connections going into and out of the server.
> >>
> >> Would In-Kernel TLS help with an application like Tor (or even load
> >> balancers/TLS termination), or is it more for things like web servers
> >> sending static files via sendfile() (e.g. CDN used by Netflix).
> >>
> >> My server could also work with Intel's QuickAssist (since it has an
> >> Intel Xeon "Scalable" CPU). Would QuickAssist SSL be more helpful here?
> There is now qat(4), which KTLS should be able to use, but I do
> not think it has been tested for this. I also have no idea
> if it can be used effectively for userland encryption?

KTLS requires support for separate output buffers and AAD buffers, which
I hadn't implemented in the committed driver.  I have a working patch
which adds that, so when that's committed qat(4) could in principle be
used with KTLS.  So far I have only tested it with /dev/crypto and a
couple of debug sysctls used to toggle between the different cryptop
buffer layouts, not with KTLS proper.

qat(4) can be used by userspace via cryptodev(4).  This comes with a
fair bit of overhead since it involves a round-trip through the kernel
and some extra copying.  AFAIK we don't have any framework for exposing
crypto devices directly to userspace, akin to DPDK's polling mode
drivers or netmap.

I've seen a few questions about the comparative (dis)advantages of QAT
and AES-NI so I'll sidetrack a bit and try to characterize qat(4)'s
performance here based on some microbenchmarking I did this week.  This
was all done in the kernel and so might need some qualification if
you're interested in using qat(4) from userspace.  Numbers below are
gleaned from an Atom C3558 at 2.2GHz with an integrated QAT device.  I
mostly tested AES-CBC-256 and AES-GCM-256.

The high-level tradeoffs are:
- qat(4) introduces a lot of latency.  For a single synchronous
  operation it can take between 2x and 100x more time than aesni(4) to
  complete.  aesni takes 1000-2000 cycles to handle a request plus
  3-5 cycles per byte depending on the algorithm.  qat takes at least
  ~150,000 cycles between calling crypto_dispatch() and the cryptop
  completion callback, plus 5-8 cycles per byte.  qat dispatch itself is
  quite cheap, typically 1000-2000 cycles depending on the size of the
  buffer.  Handling a completion interrupt involves a context switch to
  the driver ithread but this is also a small cost relative to the
  entire operation.  So, for anything where latency is crucial QAT is
  probably not a great bet.
- qat can save a correspondingly large number of CPU cycles.  It takes
  qat roughly twice as long as aesni to complete encryption of a 32KB
  buffer using AES-CBC-256 (more with GCM), but with qat the CPU is idle
  much of the time.  Dispatching the request to firmware takes less than
  1% of the total time elapsed between request dispatch and completion,
  even with small buffers.  OTOH with really small buffers aesni can
  complete a request in the time that it takes qat just to dispatch the
  request to the device, so at best qat will give comparable throughput
  and CPU usage and worse latency.
- qat can handle multiple requests in parallel.  This can improve
  throughput dramatically if the producer can keep qat busy.
  Empirically, the maximum throughput improvement is a function of the
  request size.  For example, counting the number of cycles required to
  encrypt 100,000 buffers using AES-GCM-256:

  max # in flight       1        16       64        128

  aesni, 16B           206M     n/a      n/a        n/a
  aesni, 4KB          1.52B     n/a      n/a        n/a
  aesni, 32KB         10.8B     n/a      n/a        n/a
  qat,   16B          17.1B   1.11B     219M       184M
  qat,   4KB          20.9B   1.68B     710M       694M
  qat,   32KB         38.2B   8.37B    4.25B      4.23B

  As a side note, OpenCrypto supports async dispatch for software crypto
  drivers, in which crypto_dispatch() hands work off to other threads.
  IPsec uses this when net.inet.ipsec.async_crypto is set, for example.  Of
  course, the maximum parallelism is limited by the number of CPUs in
  the system, but this can improve throughput significantly as well if
  you're willing to spend the corresponding CPU cycles.
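
To make the latency tradeoff concrete, here is a rough per-request model
built only from the cycle counts quoted above.  The specific midpoints
(1500 cycles and 4 cycles/byte for aesni, 6.5 cycles/byte for qat) are my
own assumptions for illustration, and the qat fixed cost is the quoted
lower bound, so treat the output as a ballpark rather than a measurement.

```python
# Rough per-request latency model built from the figures quoted above.
# The midpoints chosen below are assumptions for illustration only.

AESNI_FIXED = 1500      # cycles per request (quoted range: 1000-2000)
AESNI_PER_BYTE = 4      # cycles per byte (quoted range: 3-5)
QAT_FIXED = 150_000     # cycles per request (quoted as a lower bound)
QAT_PER_BYTE = 6.5      # cycles per byte (quoted range: 5-8)


def aesni_cycles(nbytes):
    """Estimated cycles for one synchronous aesni request."""
    return AESNI_FIXED + AESNI_PER_BYTE * nbytes


def qat_cycles(nbytes):
    """Estimated cycles from dispatch to completion for one qat request."""
    return QAT_FIXED + QAT_PER_BYTE * nbytes


def latency_ratio(nbytes):
    """How many times longer a single qat request takes than aesni."""
    return qat_cycles(nbytes) / aesni_cycles(nbytes)


if __name__ == "__main__":
    for size in (16, 4096, 32768):
        print(f"{size:>6} B: aesni ~{aesni_cycles(size):>7.0f} cycles, "
              f"qat ~{qat_cycles(size):>7.0f} cycles, "
              f"ratio ~{latency_ratio(size):.1f}x")
```

With these assumed midpoints the ratio falls from a little under 100x at
16B to a bit under 3x at 32KB, which is in line with the 2x to 100x range
mentioned in the first bullet.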

To summarize, QAT can be beneficial when some or all of the following
apply:
1. You have large requests.  qat can give comparable throughput for
   small requests if the producer can exploit parallelism in qat, though
   OpenCrypto's backpressure mechanism is really primitive (arguably
   non-existent) and performance will tank if things get to a point
   where qat can't keep up.
2. You're able to dispatch requests in parallel.  But see the caveat
   about backpressure in point 1.
3. CPU cycles are precious and the extra latency is tolerable.
4. aesni doesn't implement some transform that you care about, but qat
   does.  Some (most?) Xeons don't implement the SHA extensions for
   instance.  I don't have a sense for how the plain cryptosoft driver
   performs relative to aesni though.
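
The interaction between request size and parallelism can be read straight
off the AES-GCM-256 table earlier in this message.  The sketch below just
copies those totals and looks for the smallest measured queue depth at
which qat needs fewer cycles than aesni.  The function names are made up
for this example, and only the four measured depths are considered, so
the true crossover may lie somewhere between two columns.

```python
# Total cycles to encrypt 100,000 buffers with AES-GCM-256, copied from the
# table in this message.  aesni was measured with one request in flight.
AESNI = {"16B": 206e6, "4KB": 1.52e9, "32KB": 10.8e9}
QAT = {
    "16B":  {1: 17.1e9, 16: 1.11e9, 64: 219e6, 128: 184e6},
    "4KB":  {1: 20.9e9, 16: 1.68e9, 64: 710e6, 128: 694e6},
    "32KB": {1: 38.2e9, 16: 8.37e9, 64: 4.25e9, 128: 4.23e9},
}


def break_even_depth(size):
    """Smallest measured in-flight depth at which qat needs fewer cycles
    than aesni for this buffer size, or None if it never does."""
    for depth in sorted(QAT[size]):
        if QAT[size][depth] < AESNI[size]:
            return depth
    return None


def best_speedup(size):
    """aesni cycles divided by the best qat result for this buffer size."""
    return AESNI[size] / min(QAT[size].values())


if __name__ == "__main__":
    for size in ("16B", "4KB", "32KB"):
        print(f"{size:>4}: qat wins from depth {break_even_depth(size)}, "
              f"best speedup ~{best_speedup(size):.2f}x")
```

Among the measured depths, qat only pulls ahead at 128 requests in flight
for 16B buffers, at 64 for 4KB, and already at 16 for 32KB, which is why
points 1 and 2 go together.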