svn commit: r238907 - projects/calloutng/sys/kern

Mon Oct 29 05:38:44 UTC 2012

On Mon, 29 Oct 2012, Attilio Rao wrote:

> Now that sched_pin()/sched_unpin() are fixed I would like to introduce
> this new patch, making critical_enter()/critical_exit() inline:
> http://www.freebsd.org/~attilio/inline_critical.patch
>
> The concept is pretty simple: simple add/dec for critical_enter, exit
> are inlined, the rest is in an "hard path". Debugging enables the hard
> paths by default (really I think that only KTR may be due here, but I
> thought that in case of INVARIANTS this was also wanted, so I added
> the check also for that case).
>
> flo@ has stress-tested the patch already and he may be starting
> serious benchmarks soon.
> The effectiveness of this patch is to determine and it brings again
> the question: better an hard function or the usage of compiler membars
> to avoid issues? I think that this patch should be in only if we can
> absolutely prove a measurable performance bump, otherwise I will just
> add a note to critical_enter()/critical_exit() explaining why they
> should not be inlined at all.

Inlining of mtx_lock_spin() is bogus unless critical_enter() is inlined.
Similarly for mtx_unlock_spin() and critical_exit().  It saves 1 function
call. but critical_enter() does a function call anyway.  critical_exit*(
also has a branch in branch in it that might cost more than the function
call just for mispredicting it.

My version goes the other way and uninlines mtx_lock_spin() and
mtx_unlock_spin().  Then it inlines (open codes) critical_enter() and
critical_exit() in them.  This saves a lot of text space and thus
hopefully saves time too.  I couldn't find any cases where it either
saved or lost a significant amount of time.  The main cause of time
differences is probably mispredicted branches: with mtx*() inlined,
the branches in it can be predicted separately, so they are more
predictable.  However, when they are separate, there may be enough to
thrash the branch predictor.  It is probably best to only inline mtx*()
for a few heavily-used locks and only inline critical*() if it is either
in one of these or is in a non-inline function.  Otherwise, inlining
mainly gives code bloat that is only harmless if the code is rarely
actually used.  But this is hard to configure.  <sys/mutex.h> is
already a mess to support uninlining for debugging cases.

OTOH, I couldn't get uninlining of mtx_lock() and mtx_unlock() to work.
These are a little smaller than the spinlock versions, mainly since they
don't inline critical*().  The loss from unlining them cannot be
compensated for by uninlining critical*() in them, and I found just 1
workload where uninlining them was slower: udp packets per second bandwidth
tests.

Uninlining gives a very measurable space debloatage jump :-).

Time performance changes for uninlining and inlining are difficult to
measure and more difficult to verify, since they are very
machine-dependent, load-dependent, and cache-dependent.  While I was
measuring it, the packet bandwidth test was varying by up to 30% due
to changes like adding 4 bytes in unused code or data.  I attribute
this to more cache misses, but couldn't find exactly where they were.
(This, was mostly on AthlonXP and Athlon64 systems, where caches are
large but not very associative.  The non-associativity tends to cause
more cache misses in microbenchmarks since although microbenchmarks
give an unreally small data set, it is not small enough to fit perfectly
with low associativity).  The only consistency was that -current was
slower than my version and kept getting slower.  I attribute the
slowness to general kernel bloat.  When the CPU resources are held
fixed, the creeping bloat steals more resources from the fast path.

Notes about this shouldn't say more than that it didn't work when you
tried it.  Doubling of CPU resources in next year's CPU might make
everything fit in the fast path until the creeping bloat catches up.

Bruce