Summary of discussion of harvester/random locking and performance
optimization
Robert Watson
rwatson at FreeBSD.org
Sat Aug 14 08:28:37 PDT 2004
CC'd current@ because it might be of more general interest; Mark and I did
a phone call a few days ago to discuss how to improve the performance of
entropy harvesting, which occurs along several critical paths for network
performance. This was a followup to previous e-mail and other online
discussion. Some of the ideas here are specific to entropy gathering, but
others apply more generally.
We discussed a few things, including:
- Currently, the harvesting code uses several spin mutexes, including a
per-source fifo mutex and a free list mutex. The result is four mutex
operations per entropy event, two for the free list and two for
the source fifo. I suggested exploring whether coalescing some of these
mutexes would help lower locking overhead without negatively impacting
contention/parallelism. In particular, possibly using per-fifo free
lists, or just using a single spin mutex over all harvesting.
- Currently, we gather all entropy we can find in the harvesting code in
the entropy source thread, subject to fairly high fifo bounds, and then
"let the yarrow thread sort it out". Via e-mail and on the phone, we
discussed ways to expose back pressure to the harvesting code to avoid
paying the cost of harvesting if we have "adequate entropy" or if
there's little demand for randomness. We discussed lockless approaches
for exposing this so that a lockless read could be used by interrupt
handlers, etc, to avoid the cost of locking when entropy wasn't
required. I pointed at my modification to the harvest code to do a
lockless check for a full fifo as an example of safe use of lockless
reads for performance purposes. I don't have a useful opinion on how we
decide if we have "enough" entropy, other than to note that on a busy
system, we're probably awash in it and doing a lot of work for little
incremental benefit, especially on high load network systems.
- We discussed moving from a floating buffer model to a ring buffer that
requires less synchronization, or even permits very weak consistency
synchronization. We didn't go into a lot of detail here, other than to
observe that there are a variety of techniques that could be used to
lower cost, lower synchronization, increase locality, etc.
- Currently, the yarrow thread performs a lot of sequential lock
operations as it iterates over the entropy fifos and releases buffers.
While this happens only intermittently, it happens a fair amount over
time, and the current code layout results in possibly excessive mutex
use. I suggested moving to a model where entropy records are moved "en
masse" to a thread local entropy queue or fifo from the global fifo
between two lock operations to avoid grabbing and releasing the locks
many times during processing. Likewise for the free list. Note that
every spin mutex acquire disables interrupts... Our current queue(9)
macros don't all lend themselves to an O(1) move of the contents of some
list types from one head to another, but even O(n) would probably be an
improvement. I pointed at some similar use of queues in the network code
to avoid thrashing driver locks when processing multiple packets,
amortizing the locking cost over multiple packets.
- We discussed more generally when it is safe to not use locking, such as
when we use pointer/integer size reads or writes and some staleness or
inconsistency is acceptable. We also discussed that this might mean
using a lockless read to decide if something might happen (subject to a
race), and then re-checking once locks are acquired in order to perform
a safe read-modify-write. I.e., a lockless check on fifo size to make
sure there might be room, then acquire the mutex and re-read the value
to make sure there really is room before adding the record. This is an
optimistic concurrency/synchronization approach since it optimizes
based on the assumption we won't need to do the work, or that we don't
need to do the work frequently enough to amortize the additional check
cost given that it's lower than an occasional lock.
- And to summarize a point previously discussed in e-mail: right now mbuf
harvesting in the ethernet input path is incorrectly harvesting from a
free'd mbuf pointer. This means we're harvesting kernel pointers
instead of packet data, that we may harvest from an unmapped page due to
the object being free'd, etc. We discussed in the phone call whether
there was benefit to reaping there and you observed that the timing
information was useful, and that the ethernet header information might
still be useful if correctly reaped. I was unconvinced that the ethernet
header information would be useful, although timing might well be
(subject to cost). It's worth observing, though, that if you've already
gathered entropy information from the interrupt, there is probably
little incremental benefit to gathering in the ethernet input routines,
especially since the time to perform incremental processing on each
additional packet handled by an ethernet interrupt is likely fairly
constant.
- I provided some general guidance on locking: it's expensive. Right now
we're far more concerned about the cost of lock operations than the cost
of contention, and so general guidance has been "err on the side of
fewer locks now, it's easier to diagnose than too many". We have some
reasonable tools for diagnosing lock use -- I pointed at MUTEX_PROFILING
(although observed it's expensive) and KTR. I have information on using
KTR at http://www.watson.org/~robert/freebsd/netperf/ktr/. The
MUTEX_PROFILING man page is pretty decent and the results instructive.
I noted that the reason I had chosen to focus some time on the entropy
gathering code was that it showed up fairly clearly in interrupt
handling traces from KTR, and that KTR was very helpful (if you're
willing to invest the time to use it well) in getting a picture of how
the system is behaving, and that my recent modifications to KTR provide
a lot more useful context information than we used to have.
As a follow-up, I re-read the harvesting code again after our
conversation, in particular to learn about the cost of harvesting time
information. I observed that get_cyclecounter() is very cheap on modern
hardware, but that on older hardware without a TSC, it's extraordinarily
expensive. In particular, the #ifdef code on i386 suggests that i486
systems may not have a TSC, and instead read the system clock (ouch!). We
may want to investigate what approaches we can use to mitigate this,
especially if systems like soekris boxes don't have TSC. Right now, the
API for retrieving cycle counts does not allow the caller to easily
distinguish those two cases, and it may be we need to teach it to do that
so that we can allow the caller to decide it doesn't want to pay the
higher cost.
Finally, a question I raised by e-mail previously, and that I think is
worth thinking about, is how to decide what impact we're having on entropy
quality. In particular, are there scores or other easily understood
numerical or statistical results that can be exposed to a user or
developer to allow them to measure the impact on entropy quality and
availability resulting from specific configuration, optimization, etc.?
This would allow us to more easily say "X is or is not worth the
performance cost", or at least, allow our users to make that choice and
provide more information about when they should make that choice. This
also paves the way for better auto-tuning.
Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
robert at fledge.watson.org Principal Research Scientist, McAfee Research