Summary of discussion of harvester/random locking and performance
optimization
Robert Watson
rwatson at FreeBSD.org
Sat Aug 14 08:28:37 PDT 2004
CC'd current@ because it might be of more general interest; Mark and I did
a phone call a few days ago to discuss how to improve the performance of
entropy harvesting, which occurs along several critical paths for network
performance. This was a followup to previous e-mail and other online
discussion. Some of the ideas here are specific to entropy gathering, but
others apply more generally.
We discussed a few things, including:
- Currently, the harvesting code uses several spin mutexes, including a
per-source fifo mutex and a free list mutex. The result is four mutex
operations per entropy event, two for the free list and two for
the source fifo. I suggested exploring whether coalescing some of these
mutexes would help lower locking overhead without negatively impacting
contention/parallelism. In particular, possibly using per-fifo free
lists, or just using a single spin mutex over all harvesting.
- Currently, we gather all entropy we can find in the harvesting code in
the entropy source thread, subject to fairly high fifo bounds, and then
"let the yarrow thread sort it out". Via e-mail and on the phone, we
discussed ways to expose back pressure to the harvesting code to avoid
paying the cost of harvesting if we have "adequate entropy" or if
there's little demand for randomness. We discussed lockless approaches
for exposing this so that a lockless read could be used by interrupt
handlers, etc, to avoid the cost of locking when entropy wasn't
required. I pointed at my modification to the harvest code to do a
lockless check for a full fifo as an example of safe use of lockless
reads for performance purposes. I don't have a useful opinion on how we
decide if we have "enough" entropy, other than to note that on a busy
system, we're probably awash in it and doing a lot of work for little
incremental benefit, especially on high load network systems.
- We discussed moving from a floating buffer model to a ring buffer that
requires less synchronization, or even permits very weak consistency
synchronization. We didn't go into a lot of detail here, other than to
observe that there are a variety of techniques that could be used to
lower cost, lower synchronization, increase locality, etc.
- Currently, the yarrow thread performs a lot of sequential lock
operations as it iterates over the entropy fifos and releases buffers.
While this happens only intermittently, it happens a fair amount over
time, and the current code layout results in possibly excessive mutex
use. I suggested moving to a model where entropy records are moved "en
masse" to a thread local entropy queue or fifo from the global fifo
between two lock operations to avoid grabbing and releasing the locks
many times during processing. Likewise for the free list. Note that
every spin mutex acquire disables interrupts... Our current queue(9)
macros don't all lend themselves to an O(1) move of the contents of some
list types from one head to another, but even O(n) would probably be an
improvement. I pointed at some similar use of queues in the network code
to avoid thrashing driver locks when processing multiple packets,
amortizing the locking cost over multiple packets.
- We discussed more generally when it is safe to not use locking, such as
when we use pointer/integer size reads or writes and some staleness or
inconsistency is acceptable. We also discussed that this might mean
using a lockless read to decide if something might happen (subject to a
race), and then re-checking once locks are acquired in order to perform
a safe read-modify-write. I.e., a lockless check on fifo size to make
sure there might be room, then acquire the mutex and re-read the value
to make sure there really is room before adding the record. This is an
optimistic concurrency/synchronization approach since it optimizes
based on the assumption we won't need to do the work, or that we don't
need to do the work frequently enough to amortize the additional check
cost given that it's lower than an occasional lock.
- And to summarize a point previously discussed in e-mail: right now mbuf
harvesting in the ethernet input path is incorrectly harvesting from a
free'd mbuf pointer. This means we're harvesting kernel pointers
instead of packet data, that we may harvest from an unmapped page due to
the object being free'd, etc. We discussed in the phone call whether
there was benefit to reaping there and you observed that the timing
information was useful, and that the ethernet header information might
still be useful if correctly reaped. I was unconvinced that the ethernet
header information would be useful, although timing might well be
(subject to cost). It's worth observing, though, that if you've already
gathered entropy information from the interrupt, there is probably
little incremental benefit to gathering in the ethernet input routines,
especially since the time to perform incremental processing on each
additional packet handled by an ethernet interrupt is likely fairly
constant.
- I provided some general guidance on locking: it's expensive. Right now
we're far more concerned about the cost of lock operations than the cost
of contention, and so general guidance has been "err on the side of
fewer locks now, it's easier to diagnose than too many". We have some
reasonable tools for diagnosing lock use -- I pointed at MUTEX_PROFILING
(although observed it's expensive) and KTR. I have information on using
KTR at http://www.watson.org/~robert/freebsd/netperf/ktr/. The
MUTEX_PROFILING man page is pretty decent and the results instructive.
I noted that the reason I had chosen to focus some time on the entropy
gathering code was that it showed up fairly clearly in interrupt
handling traces from KTR, and that KTR was very helpful (if you're
willing to invest the time to use it well) in getting a picture of how
the system is behaving, and that my recent modifications to KTR provide
a lot more useful context information than we used to have.
As a follow-up, I re-read the harvesting code again after our
conversation, in particular to learn about the cost of harvesting time
information. I observed that get_cyclecounter() is very cheap on modern
hardware, but that on older hardware without a TSC, it's extraordinarily
expensive. In particular, the #ifdef code on i386 suggests that i486
systems may not have a TSC, and instead read the system clock (ouch!). We
may want to investigate what approaches we can use to mitigate this,
especially if systems like soekris boxes don't have TSC. Right now, the
API for retrieving cycle counts does not allow the caller to easily
distinguish those two cases, and it may be we need to teach it to do that
so that we can allow the caller to decide it doesn't want to pay the
higher cost.
Finally, a question I raised by e-mail previously, and that I think is
worth thinking about, is how to decide what impact we're having on entropy
quality. In particular, are there scores or other easily understood
numerical or statistical results that can be exposed to a user or
developer to allow them to measure the impact on entropy quality and
availability resulting from specific configuration, optimization, etc.?
This would allow us to more easily say "X is or is not worth the
performance cost", or at least, allow our users to make that choice and
provide more information about when they should make that choice. This
also paves the way for better auto-tuning.
Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
robert at fledge.watson.org Principal Research Scientist, McAfee Research