4.7 vs 5.2.1 SMP/UP bridging performance

Wed May 5 14:23:44 PDT 2004

Bruce Evans writes:

 > 
 > Athlon XP2600 UP system:  !SMP case: 22 cycles   SMP case: 37 cycles
 > Celeron 366 SMP system:   !SMP case: 35 cycles   SMP case: 48 cycles
 > 
 > The extra cycles for the SMP case are just the extra cost of one lock
 > instruction.  Note that SMP should cost twice as much extra, but the
 > non-SMP atomic_store_rel_int(&slock, 0) is pessimized by using xchgl
 > which always locks the bus.  After fixing this:
 > 
 > Athlon XP2600 UP system:  !SMP case:  6 cycles   SMP case: 37 cycles
 > Celeron 366 SMP system:   !SMP case: 10 cycles   SMP case: 48 cycles
 > 
 > Mutexes take longer than simple locks, but not much longer unless the
 > lock is contested.  In particular, they don't lock the bus any more
 > and the extra cycles for locking dominate (even in the !SMP case due
 > to the pessimization).
 > 
 > So there seems to be something wrong with your benchmark.  Locking the
 > bus for the SMP case always costs about 20+ cycles, but this hasn't
 > changed since RELENG_4 and mutexes can't be made much faster in the
 > uncontested case since their overhead is dominated by the bus lock
 > time.
 > 

Actually, I think his tests are accurate and bus-locked instructions
take an eternity on the P4.  See
http://www.uwsg.iu.edu/hypermail/linux/kernel/0109.3/0687.html

For example, with your test above, I see 212 cycles for the UP case on
a 2.53GHz P4.  Replacing the atomic_store_rel_int(&slock, 0) with a
simple slock = 0; reduces that count to 18 cycles.
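
A rough userland sketch of that comparison, for anyone who wants to
try it outside the kernel.  This is not the test program quoted above:
it assumes C11 atomics and POSIX clock_gettime(), and the names
release_xchg_ns() and release_plain_ns() are made up for illustration.
On x86, atomic_exchange should compile to an xchg with a memory
operand, which is implicitly bus-locked, while the volatile store is a
plain mov.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

static atomic_uint slock;            /* stand-in for the lock word */
static volatile unsigned plain_lock; /* same idea, released with a plain store */

/* Monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Total ns for `iters` take/release pairs, releasing with an exchange. */
uint64_t release_xchg_ns(unsigned long iters)
{
    uint64_t start = now_ns();

    for (unsigned long i = 0; i < iters; i++) {
        atomic_store_explicit(&slock, 1, memory_order_relaxed);
        /* Release via exchange: xchg on x86, always locked. */
        (void)atomic_exchange_explicit(&slock, 0, memory_order_release);
    }
    assert(atomic_load(&slock) == 0); /* lock word must end up released */
    return now_ns() - start;
}

/* Total ns for `iters` take/release pairs, releasing with "slock = 0". */
uint64_t release_plain_ns(unsigned long iters)
{
    uint64_t start = now_ns();

    for (unsigned long i = 0; i < iters; i++) {
        plain_lock = 1;
        plain_lock = 0; /* volatile keeps the compiler from dropping the store */
    }
    assert(plain_lock == 0);
    return now_ns() - start;
}
```

Dividing each total by the iteration count and multiplying by the
clock rate in GHz gives an approximate cycles-per-iteration figure.
The loop overhead is included in both numbers, so only the difference
between them is meaningful.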

If it's really safe to remove the xchg* from non-SMP atomic_store_rel*,
then I think you should do it.  Of course, that still leaves mutexes
as very expensive on SMP (253 cycles on the 2.53GHz P4 from above).
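
To make the suggestion concrete, here is a minimal sketch of what a
UP-only release store could look like.  It is not the FreeBSD
implementation: up_store_rel_int() and up_release_demo() are
hypothetical names, the barrier uses GCC/Clang inline asm syntax, and
it assumes that on a single CPU the only reordering that matters for
the release is the compiler's.  Whether that assumption holds for
every caller (device memory, for instance) is exactly the "is it
really safe" question.

```c
#include <assert.h>

/* Hypothetical UP-only release store; illustrative, not FreeBSD's code. */
static inline void up_store_rel_int(volatile unsigned int *p, unsigned int v)
{
    /*
     * Compiler barrier only: stops the compiler from moving earlier
     * loads and stores below the release.  It emits no instruction,
     * so there is no locked bus cycle.
     */
    __asm__ __volatile__("" ::: "memory");
    *p = v;
}

/* Take a lock word, touch the protected data, release the lock. */
unsigned int up_release_demo(void)
{
    static volatile unsigned int slock;
    static unsigned int protected_counter;

    slock = 1;                   /* "acquire" (uncontested, UP only) */
    protected_counter++;         /* work done under the lock */
    up_store_rel_int(&slock, 0); /* release with a plain store */
    assert(slock == 0);
    return protected_counter;
}
```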

Drew