Ranting about OCF / crypto(9)

Sun Jan 14 23:59:45 UTC 2018

John Baldwin wrote this message on Wed, Jan 10, 2018 at 16:18 -0800:
> While working on hooking the ccr(4) driver into our in-kernel crypto
> framework (along with some out-of-tree patches to extend OpenSSL's
> /dev/crypto engine to support AES-CTR/XTS/GCM and some further changes to
> do zero-copy), I've run into several bumps / oddities in OCF.  I'm probably
> going to miss several of them, but here's at least a start of a list of
> things.  In some cases I have some suggestions on improvements.
> 
> I will try to start with more broad / higher-level items first before
> diving into minutiae:
> 
> - OCF is overly flexible and overly broad.  Rather than supporting
>   arbitrary stacking of transforms (and arbitrary depths), I think we
>   should probably aim to support more specific workloads and support

Many drivers don't fully support arbitrary stacking...  In fact, they
will reorder to "make sense"...

>   them well.  To my mind the classes of things we should support are
>   probably:
> 
>   - Simple block cipher requests.
>   - Simple "hash a buffer" requests. (Both HMAC and non-HMAC)
>   - IPSec-style requests (combined auth and encryption using
>     "encrypt-then-mac" with an optional AAD region before the
>     ciphertext).  Note that geli requests fall into this type.
>   - TLS-style requests (using TLS's different methods of
>     combining auth and encryption methods when those are
>     separate)
>   - Simple compression / decompression requests.  While this isn't
>     "crypto", per se, I do think it is probably still simpler to
>     manage this via OCF than a completely separate interface.

We need to decide what we are using OCF for.  Currently, due to how
slow most hardware acceleration is, it's IPsec and GELI in the kernel,
and then for embedded systems, OpenSSL for TLS acceleration...

IMO, making it 100% generic is a terrible idea, and we should only
support the above use cases...  Most modern processors are faster
than most hardware acceleration, and I don't even know how many
embedded systems are using OCF from userland, as you have to
configure the system to use crypto...

>   In terms of algorithms, I suspect there are some older algorithms
>   we could drop.  Modern hardware doesn't offload DES for example.
>   Both ccr(4) and aesni(4) only support AES for encryption.  We
>   do need to keep algorithms required for IPSec in the kernel, but
>   we could probably drop some others?

I attempted to do this a few years back, and got significant push back...

Please see the archives...

> - To better support OpenSSL's engine, the /dev/crypto hash interface
>   should not require monolithic buffers, but support requests for
>   large buffers that span multiple requests (so you can do something
>   akin to the 'Init' / 'Update' (N times) / 'Final' model existing
>   software hashing APIs use).  In particular, the bigger win for
>   hashing in hardware is when you can offload the hashing of a large
>   thing rather than small requests.
> 
> - To better support OpenSSL's engine, the /dev/crypto hash interface
>   should support "plain" hash algorithms such as SHA* without an
>   HMAC.  By default OpenSSL's engine interface does the HMAC-specific
>   bits (generating pads, etc.) in software and only defers to the
>   engine for the raw hash (e.g. if you use the HMAC() function from
>   libcrypto it will only ask the engine interface for a raw hash,
>   not for an HMAC hash).

Already does for MD5 and SHA1.  We have not added support for SHA-2 or
SHA-3...

#define CRYPTO_MD5              13
#define CRYPTO_SHA1             14

Yes, I know crypto(7) is lacking documentation for additional modes,
but we didn't have any before I was working on it, so I added what I
could...

> - To better support OpenSSL's engine, the /dev/crypto cipher
>   interface should also support non-monolithic buffers.  The existing
>   engine does this now by copying the last block of the output data
>   out as a saved IV to use for a subsequent request, but it might be
>   nicer to be more formal here and return the IV to userland for
>   non-"final" cipher requests.
> 
> - The interface between the crypto layer and backend drivers should
>   _not_ use integer session IDs.  This is ridiculously dumb and
>   inefficient.  All the drivers have silly algorithms to try to manage
>   growable arrays that can be indexed by the returned session ID.
>   Instead, drivers should be able to return a 'void *' cookie when
>   creating a session and get that cookie pointer as an argument to
>   the 'process' and 'freesession' callbacks.  Imagine if vnodes used
>   an i-node number rather than 'v_data' and you'd have the model OCF
>   uses.  I don't mind if we have a kind of generic 'session' structure
>   that we export to drivers and pass in the callbacks and the drivers
>   get to use a 'foo_data' member of.
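
To make the cookie idea concrete, here is roughly how I read it.  This
is only a sketch: struct ocf_session, os_drvdata and the foo_* callbacks
are names made up for illustration, not existing OCF API, and it uses
userland calloc()/free() so it can be compiled stand-alone.

```c
#include <stdlib.h>

/* Hypothetical generic session handed to drivers; not the current OCF API. */
struct ocf_session {
    int   os_alg;       /* placeholder algorithm id */
    void *os_drvdata;   /* the 'foo_data' member: driver-private cookie */
};

/* Hypothetical private state of an imaginary "foo" driver. */
struct foo_session {
    int fs_requests;    /* number of requests processed so far */
};

/* newsession: allocate driver state and hang it off the session. */
int
foo_newsession(struct ocf_session *ses)
{
    struct foo_session *fs;

    fs = calloc(1, sizeof(*fs));
    if (fs == NULL)
        return (-1);
    ses->os_drvdata = fs;   /* no growable array, no integer id */
    return (0);
}

/* process: the driver gets its state back directly, no table lookup. */
int
foo_process(struct ocf_session *ses)
{
    struct foo_session *fs = ses->os_drvdata;

    return (++fs->fs_requests);
}

/* freesession: release the driver state. */
void
foo_freesession(struct ocf_session *ses)
{
    free(ses->os_drvdata);
    ses->os_drvdata = NULL;
}
```

The framework would own struct ocf_session and pass it to every
callback, so the driver never has to map an id back to its state.
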
> 
> - The interface to describe crypto requests needs to move away from
>   arbitrary linked lists of descriptors.  We should just have a
>   single "session" structure that assumes you have one cipher and
>   one auth with a "mode" member to indicate the particular direction
>   / combination.  Likewise, the description of a request needs to
>   have a similar assumption.  The structures used by the /dev/crypto
>   ioctl's are a bit closer to what I think we should use compared to
>   the linked-list thing we have now.  Related is that we should be
>   able to get rid of having the three separate "algorithms" for GCM
>   hashes.  For AES-GCM one would just say they are using AES-GCM
>   and both the hash/tag and ciphertext would be valid inputs / outputs
>   with a single key.
> 
> - To support non-monolithic buffers from the OpenSSL engine, crypto
>   requests to drivers also have to support non-monolithic buffers.
>   This means having a notion of a buffer that may be at the start,
>   middle, or end of a larger transformation (e.g. for hash only the
>   start gets the IPAD, only the end gets the OPAD and returns a
>   valid hash, etc., whereas for ciphers any non-end requests would
>   return the IV to use for the next request).
> 
>   For drivers that have buffer size limits, it would be nice to expose
>   those limits in the driver capabilities and depend on the upper layer
>   to "split" requests such as happens now for disk drivers.
> 
> - For hashing algorithms we should support a "verify" mode in addition
>   to the current "compute" mode.  The verify mode would accept a block
>   of data to hash along with an expected mac and return a success
>   / failure rather than a computed hash value.  AES-GCM already works
>   this way for decryption, but this would extend that mode for other
>   hash algorithms (e.g. AES-CBC+SHA2-256-HMAC).  Existing crypto
>   co-processors (e.g. ccr(4)) already support these types of requests.

Please make sure that the compare is constant time for any verify modes.
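
By constant time I mean something along these lines.  This is a
stand-alone sketch and ct_tag_differs is a made-up name; in the kernel,
timingsafe_bcmp() should already provide this behavior, so a driver
would normally call that rather than roll its own.

```c
#include <stddef.h>

/*
 * Illustrative constant-time comparison of a computed tag against the
 * expected one.  Returns 0 if the buffers are equal and 1 otherwise.
 * The loop always touches all 'len' bytes and never exits early, so
 * the run time does not reveal where the first mismatch is.
 */
int
ct_tag_differs(const void *a, const void *b, size_t len)
{
    const volatile unsigned char *pa = a;
    const volatile unsigned char *pb = b;
    unsigned char diff = 0;
    size_t i;

    for (i = 0; i < len; i++)
        diff |= pa[i] ^ pb[i];  /* accumulate differences, no branch */
    return (diff != 0);
}
```

A plain memcmp()/bcmp() is not suitable here since it may return at the
first differing byte.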

>   Related is that we need to fix IPSec to treat EBADMSG errors from
>   decryption as auth failure rather than encryption failure (right
>   now AES-GCM auth failures are reported incorrectly in netstat -s
>   due to this).
> 
> - Sessions for a combined cipher + hash should also be tied to a
>   specific way of combining the algorithms.  Right now you can
>   create a session for AES-CBC with a SHA hash and the driver has no
>   way to know if you are going to do encrypt-then-mac or one of the
>   other variants.  We should include this in the session (so a given
>   session can only be used for one type which is normally true anyway),
>   and drivers can then only claim to support combinations they
>   support.
> 
> - The CRD_F_IV_PRESENT flag should be removed and replaced with
>   a CRD_F_IV_INJECT flag which means "inject the IV".  Right now
>   the _lack_ of CRD_F_IV_PRESENT for encryption (but not decryption!)
>   requests means "inject the IV".  It would be clearer to just have
>   a flag that is only set when you want the driver to take the
>   action.
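
As I understand the proposal, the difference comes down to the two
predicates below.  This is only a sketch: CRD_F_IV_INJECT does not exist
today, the iv_inject_* helpers are made up, and the flag values are
chosen for illustration (check cryptodev.h for the real ones).

```c
/* Flag bits for this sketch only; values are illustrative. */
#define CRD_F_ENCRYPT     0x01
#define CRD_F_IV_PRESENT  0x02  /* today's flag */
#define CRD_F_IV_INJECT   0x40  /* proposed flag, hypothetical value */

/* Today: inject when encrypting and IV_PRESENT is *absent*. */
int
iv_inject_current(int flags)
{
    return ((flags & CRD_F_ENCRYPT) != 0 &&
        (flags & CRD_F_IV_PRESENT) == 0);
}

/* Proposed: inject only when the caller explicitly asked for it. */
int
iv_inject_proposed(int flags)
{
    return ((flags & CRD_F_IV_INJECT) != 0);
}
```

With the proposed flag a request with no flags set means "do nothing
special", for both encryption and decryption.
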
> 
> - Speaking of IV handling, drivers have to do some extra handling for
>   IVs including possibly generating them.  I think the idea is that
>   some co-processors might support generating IVs, but most of the
>   drivers I've looked at just end up duplicating the same block of
>   code to call arc4rand() for encryption requests without
>   CRD_F_IV_EXPLICIT.  I don't believe Linux tries to support this and
>   instead always supplies an IV to the driver.  I'd rather we do that
>   and only depend on a flag to indicate where the IV is (crd_iv vs
>   in the buffer).
> 
> - The API for copying data to/from crypto buffers is a bit obtuse and
>   limiting.  Rather than accepting the crypto operation ('crp') as
>   a parameter to describe the crypto buffer, the crypto_copyback()
>   and crypto_copydata() functions accept various members of that
>   structure explicitly (e.g. crp_flags and crp_buf).  However, in my
>   experiments with zero-copy AES-GCM via /dev/crypto and OpenSSL it
>   was convenient to store the AAD in a KVA buffer in the 'crp' and
>   the payload to transform in an array of VM pages.  However, for
>   this model 'crp_buf' is useless.  I ended up adding a wrapper API
>   'crypto_copyto' and 'crypto_copyfrom' which accept a 'crp' directly.
>   Linux's API actually passes something akin to sglist as the
>   description of the buffers in a crypto request.
> 
> - We need to not treat accelerated software (e.g. AES-NI) as a
>   hardware interface.  Right now OCF's model of priorities when
>   trying to choose a backend driver for a session only has two
>   "levels" software vs hardware and aesni(4) (and the ARMv8 variant)
>   are lumped into the hardware bucket so that they have precedence
>   over the "dumb" software implementation.  However, the accelerated
>   software algorithms do need some of the same support features of
>   the "dumb" software implementation (such as being scheduled on a
>   thread pool to use CPU cycles) that are not needed by other "hardware"
>   engines.  OCF needs to understand this distinction.
> 
> - Somewhat related, we should try to use accelerated software when
>   possible (e.g. AES-CBC with SHA doesn't use AES-NI unless the
>   CPU supports accelerated SHA).  Ideally for this case we'd still
>   use AES-NI for the AES portion along with the software SHA
>   implementation (and we'd do it one pass over the data rather than
>   two when possible).

Intel has lots of assembly for combined modes, including a pipelined
mode for AES-CBC+SHA2 that allows 4 streams to be processed in
effectively the same time as one stream...  Being able to make use of
these is cool, but IMO, going w/ AES-GCM or AES-GCM-SIV is a better
solution than trying to shoehorn in old algorithms like this...  If
someone really needs it, they can pay for it, but IMO, let's get the
most bang for the buck...

> - Sometimes a crypto driver might need to defer certain requests to
>   software (e.g. ccr(4) has to do this for some GCM requests).  In
>   addition, there are some other cases when we might want requests
>   from a single session to be sent to different backends (e.g. you
>   may want to use accelerated software for requests below a certain
>   size, and a crypto engine for larger requests.  You might also want
>   to take NUMA into account when choosing which backend crypto engine
>   to dispatch a request to.)  To that end, I think we want to have the
>   ability for a single OCF session to support multiple backend
>   sessions.
> 
>   One use case is that if I as a driver can't handle a request I'd like
>   to be able to fail it with a special error code and have the crypto
>   layer fall back to software for me (and to use accelerated software if
>   possible).  Right now ccr(4) duplicates the "dumb" software for GCM
>   requests it can't handle explicitly.
> 
>   Another use case might be failover if a hardware engine experiences
>   a hardware failure.  In theory it should be possible to fail over
>   to a different driver at that point including resubmitting pending
>   requests that weren't completed, and it should be possible (I think)
>   to manage this in the crypto framework rather than in consumers like
>   IPSec and GELI.
> 
>   Load distribution among backends might be another case to consider
>   (e.g. GELI or ZFS encryption once that lands) if you have long-
>   running sessions that spawn lots of self-contained requests.
> 
>   Note that if we want to spawn additional backend sessions on the fly
>   (e.g. only create a software fallback session on demand if a driver
>   fails a request with the "use software" magic error code), we will
>   have to keep per-session state such as keys around.  We probably
>   already do that now, but this would definitely require doing that.
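
The "use software" error code case could look something like the sketch
below.  Everything here is hypothetical: X_EUSESOFT, struct
multi_session, ms_dispatch and the two demo backends are invented for
illustration, and the demo hardware backend arbitrarily refuses
zero-length requests to stand in for "a request the engine can't
handle".

```c
#include <stddef.h>

/* Hypothetical "please fall back to software" code; not a real errno. */
#define X_EUSESOFT  (-2)

typedef int (*backend_fn)(size_t len);

/* Hypothetical OCF session that fronts more than one backend session. */
struct multi_session {
    backend_fn ms_hw;        /* preferred backend */
    backend_fn ms_sw;        /* software fallback, may be NULL */
    unsigned   ms_fallbacks; /* how often the fallback was used */
};

/* Demo hardware backend: refuses zero-length requests. */
int
demo_hw(size_t len)
{
    return (len == 0 ? X_EUSESOFT : 0);
}

/* Demo software backend: accepts everything. */
int
demo_sw(size_t len)
{
    (void)len;
    return (0);
}

/* Try the preferred backend; retry in software only on X_EUSESOFT. */
int
ms_dispatch(struct multi_session *ms, size_t len)
{
    int error;

    error = ms->ms_hw(len);
    if (error == X_EUSESOFT && ms->ms_sw != NULL) {
        ms->ms_fallbacks++;
        error = ms->ms_sw(len);
    }
    return (error);
}
```

Creating the software session lazily would happen where ms_sw is found
to be NULL, which is where the saved per-session keys come in.
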
> 
> One concern with some of these changes is that there are several drivers
> in the tree for older hardware that I'm not sure is really used anymore.
> That is an impediment to making changes to the crypto <-> driver interface
> if we can't find folks willing to at least test changes to those drivers
> if not maintain them.

I have recently obtained a good amount of this hardware from various
donations...  hifn, via padlock, and possibly ubsec, I'd have to check..

IMO, I'd like to see us deprecate most of these old drivers as they are
mostly too slow, or even if you can find a system to put them in, it's
overall faster/more energy efficient to go w/ newer hardware that is
faster and run pure software...  Again, please read the archives for
more of this discussion...

> This is all I could think of today.  What do other folks think?

I'd like to see a full redesign of the system, but I also don't know
how many other third party utilities depend upon /dev/crypto that are not
in tree...

I'm willing to meet w/ people to discuss/design this...

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."