cvs commit: src/sys/i386/i386 pmap.c

Fri Nov 5 12:02:12 GMT 2004

On Fri, 29 Oct 2004, Mike Silbersack wrote:

> I think we really need some sort of light-weight critical_enter that
> simply assures you that you won't get rescheduled to another CPU, but
> gives no guarantees beyond that. 
<snip> 
> Er, wait - I guess I'm forgetting something, there exists the potential 
> for the interrupt that preempted whatever was calling arc4random to also 
> call arc4random, thereby breaking things...

I've been looking at related issues for the last couple of days and must
have missed this thread while at EuroBSDCon.  Alan Cox pointed me at it,
so here I am. :-)

Right now, the cost of acquiring and dropping an uncontended sleep mutex
on a UP kernel is very low -- about 21 cycles on my PIII and 40 on my P4,
including some efficiency problems in my measurement which probably add a
non-trivial overhead.  Compare this with the SMP versions on the PIII (90
cycles) and P4 (260 cycles!).  Critical sections on the SMP PIII are about
the same cost as the SMP mutex, but on the P4 a critical section is less
than half the cost.  Getting to a model where critical sections were as
cheap as UP sleep mutexes, or where we could use a similar combination of
primitives (such as UP mutexes with pinning) would be very useful.
Otherwise, optimizing through use of critical sections will improve SMP
but potentially damage performance on UP.  There's been a fair amount of
discussion of such approaches, including the implementation briefly
present in FreeBSD.  I know John Baldwin and Justin Gibbs both have
theories and plans in this area.

If we do create a UP mutex primitive for use on SMP, I would suggest we
actually expand the contents of the UP mutex structure slightly to include
a cpu number that can be asserted, along with pinning, when an operation
is attempted and INVARIANTS is present.  One of the great strengths of the
mutex/lock model is a strong assertion capability, both for the purposes
of documentation and testing, so we should make sure that carries into any
new synchronization primitives.
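
To make the suggestion concrete, here is a minimal userland sketch of what
such a structure and its assertions might look like.  Everything in it is
hypothetical: struct up_mtx, struct sim_thread, the um_ and td_ field
names, and up_mtx_usable() are invented for illustration, and INVARIANTS
is just a local #define standing in for the kernel option.  The contended
path (a preempted owner on the same CPU) is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

#define INVARIANTS 1		/* stand-in for the kernel option */

/* Hypothetical model of the bits of thread state the checks need. */
struct sim_thread {
	int	td_cpu;		/* CPU the thread is running on */
	int	td_pinned;	/* nonzero while migration is disabled */
};

/* Hypothetical UP mutex carrying the CPU it belongs to. */
struct up_mtx {
	bool	um_locked;
	int	um_cpu;		/* CPU whose data this lock protects */
};

/* True if td may operate on m: pinned, and on the lock's CPU. */
static bool
up_mtx_usable(const struct up_mtx *m, const struct sim_thread *td)
{
	return (td->td_pinned > 0 && td->td_cpu == m->um_cpu);
}

static void
up_mtx_lock(struct up_mtx *m, const struct sim_thread *td)
{
	(void)td;
#ifdef INVARIANTS
	assert(up_mtx_usable(m, td));
#endif
	m->um_locked = true;	/* plain store, no atomic operation */
}

static void
up_mtx_unlock(struct up_mtx *m, const struct sim_thread *td)
{
	(void)td;
#ifdef INVARIANTS
	assert(up_mtx_usable(m, td));
	assert(m->um_locked);
#endif
	m->um_locked = false;
}
```

With INVARIANTS undefined the checks compile away and only the plain
stores remain, which is the point of keeping the cpu number purely as an
assertion aid.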

Small table of synchronization primitives below; in each case, the count
is in cycles and reflects the cost of acquiring and dropping the primitive
(lock+unlock, enter+exit).  The P4 is a 3GHz box, and the PIII is an
800MHz box.  Note that the synchronization primitives requiring atomic
operations are substantially pessimized on the P4 vs the PIII.

A discussion with John Baldwin and Scott Long yesterday revealed that the
UP spin mutex is currently pessimized from a critical section to a
critical section plus mutex internals due to a need for mtx_owned() on
spin locks.  I'm not convinced that explains the entire performance
irregularity I see for P4 spin mutexes on UP, however.  Note that 39 (P4
UP sleep mutex) + 120 (P4 UP critical section) is 159, well short of 274
(P4 UP spin mutex).  Figuring out what's going on there would be a
good idea, although it could well be a property of my measurement
environment.  I'm currently using this to do measurements:

    //depot/user/rwatson/percpu/sys/test/test_synch_timing.c
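
That file is not reproduced here.  Purely as an illustration of the
acquire+release timing loop, a rough userland analog might look like the
sketch below.  It assumes C11 atomics and clock_gettime() rather than the
kernel primitives and cycle counts discussed in this message, and the
names lock_word, now_ns() and time_lock_pair_ns() are made up.  It
reports nanoseconds, not cycles, so its output is not comparable to the
figures quoted here.

```c
#define _POSIX_C_SOURCE 199309L	/* for clock_gettime() */

#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a lock: a single atomic flag. */
static atomic_flag lock_word = ATOMIC_FLAG_INIT;

static uint64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec);
}

/* Average nanoseconds per uncontended acquire+release pair. */
static double
time_lock_pair_ns(unsigned long iterations)
{
	uint64_t start, end;
	unsigned long i;

	assert(iterations > 0);
	start = now_ns();
	for (i = 0; i < iterations; i++) {
		/* Acquire: the loop body never runs when uncontended. */
		while (atomic_flag_test_and_set_explicit(&lock_word,
		    memory_order_acquire))
			;
		/* Release. */
		atomic_flag_clear_explicit(&lock_word,
		    memory_order_release);
	}
	end = now_ns();
	return ((double)(end - start) / (double)iterations);
}
```

Calling time_lock_pair_ns(1000000) a few times and comparing the results
gives a feel for how noisy the environment is; the loop overhead and the
two clock reads are included in the average, so treat the result as an
upper bound.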

In all of the below, the standard deviation is very small if you're
careful about not bumping into hard clock or other interrupts during
testing, especially when it comes to spin mutexes and critical sections. 

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert at fledge.watson.org      Principal Research Scientist, McAfee Research

        sleep mutex     crit section    spin mutex
        UP      SMP     UP      SMP     UP      SMP
PIII    21      90      83      81      112     141
P4      39      260     120     119     274     342
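
For anyone who wants to redo the sleep mutex + critical section
comparison from the text for both machines, the table can be dropped
into a few lines of C.  The struct and function names are invented for
this sketch; the numbers are copied straight from the table.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the table; all costs are in cycles. */
struct synch_cost {
	const char *cpu;
	int	sleep_up, sleep_smp;
	int	crit_up, crit_smp;
	int	spin_up, spin_smp;
};

static const struct synch_cost costs[] = {
	{ "PIII", 21,  90,  83,  81, 112, 141 },
	{ "P4",   39, 260, 120, 119, 274, 342 },
};

/*
 * Cycles of UP spin mutex cost not accounted for by adding the UP
 * sleep mutex and UP critical section costs together.
 */
static int
spin_up_unexplained(const struct synch_cost *c)
{
	assert(c != NULL);
	return (c->spin_up - (c->sleep_up + c->crit_up));
}
```

spin_up_unexplained(&costs[1]) gives 274 - (39 + 120) = 115 cycles for
the P4, against 112 - (21 + 83) = 8 cycles for the PIII.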