4.7 vs 5.2.1 SMP/UP bridging performance

Wed May 5 18:55:25 PDT 2004

Gerrit Nagelhout wrote:
> Andrew Gallatin wrote:
> 
>>Bruce Evans writes:
>>
>> > 
>> > Athlon XP2600 UP system:  !SMP case: 22 cycles   SMP case: 37 cycles
>> > Celeron 366 SMP system:              35                    48
>> > 
>> > The extra cycles for the SMP case are just the extra cost of one lock
>> > instruction.  Note that SMP should cost twice as much extra, but the
>> > non-SMP atomic_store_rel_int(&slock, 0) is pessimized by using xchgl
>> > which always locks the bus.  After fixing this:
>> > 
>> > Athlon XP2600 UP system:  !SMP case:  6 cycles   SMP case: 37 cycles
>> > Celeron 366 SMP system:              10                    48
>> > 
>> > Mutexes take longer than simple locks, but not much longer unless the
>> > lock is contested.  In particular, they don't lock the bus any more
>> > and the extra cycles for locking dominate (even in the !SMP case due
>> > to the pessimization).
>> > 
>> > So there seems to be something wrong with your benchmark.  Locking the
>> > bus for the SMP case always costs about 20+ cycles, but this hasn't
>> > changed since RELENG_4 and mutexes can't be made much faster in the
>> > uncontested case since their overhead is dominated by the bus lock
>> > time.
>> > 
>>
>>Actually, I think his tests are accurate and bus locked instructions
>>take an eternity on P4.  See
>>http://www.uwsg.iu.edu/hypermail/linux/kernel/0109.3/0687.html 
>>
>>For example, with your test above, I see 212 cycles for the UP case on
>>a 2.53GHz P4.  Replacing the atomic_store_rel_int(&slock, 0) with a
>>simple slock = 0; reduces that count to 18 cycles.
>>
>>If it's really safe to remove the xchg* from non-SMP atomic_store_rel*,
>>then I think you should do it.  Of course, that still leaves mutexes
>>as very expensive on SMP (253 cycles on the 2.53GHz from above).
>>
>>Drew
>>
> 
> 
> I wonder if there is anything that can be done to make the locking more
> efficient for the Xeon.  Are there any other locking types that could
> be used instead?
> This might also explain why we are seeing much worse system call 
> performance under 4.7 in SMP versus UP.  Here is a table of results
> for some system call tests I ran.  (The numbers are calls/s)

Int 0x80 system calls are known to be extremely expensive on a P4.  I
think that Jeff Roberson measured them as taking 300 cycles on average.
Some work was done on implementing the alternate sysenter/sysexit
method, but I don't think it was ever finished.  I think that it was
shown to have a modest speed improvement, but there was still a lot of
overhead that made it slow on a P4.  Other optimizations are possible,
such as a shared page that lets you avoid system calls like getpid and
gettimeofday, but that opens some security risks that have to be dealt
with.

Scott
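
The comparison quoted above (releasing a lock with xchgl versus a plain
store) can be approximated in userland.  What follows is only a rough
sketch, not the benchmark used in the thread: it assumes C11 atomics as a
stand-in for the kernel's atomic_store_rel_int(), the names release_xchg,
release_store and ns_per_iter are hypothetical, and it reports wall-clock
nanoseconds per iteration rather than cycle counts.  On x86 an atomic
exchange compiles to xchg, which is implicitly bus-locked, while a release
store compiles to an ordinary mov.

```c
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() under strict -std modes */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the simple lock word used in the thread. */
static atomic_int slock;

/* Release via atomic exchange: an implicitly locked xchg on x86. */
static void release_xchg(void)
{
    (void)atomic_exchange_explicit(&slock, 0, memory_order_seq_cst);
}

/* Release via a plain store with release ordering: an ordinary mov on x86. */
static void release_store(void)
{
    atomic_store_explicit(&slock, 0, memory_order_release);
}

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Average cost in nanoseconds of one uncontested "take, then release"
 * pair.  The take is a relaxed store so that only the release path
 * differs between the two variants.  The indirect call adds the same
 * overhead to both, so only the difference between results is meaningful.
 */
static double ns_per_iter(void (*release)(void), long iters)
{
    assert(iters > 0);
    uint64_t start = now_ns();
    for (long i = 0; i < iters; i++) {
        atomic_store_explicit(&slock, 1, memory_order_relaxed);
        release();
    }
    return (double)(now_ns() - start) / (double)iters;
}
```

Calling ns_per_iter(release_xchg, 10000000) and
ns_per_iter(release_store, 10000000) from main() and printing both
results gives a per-iteration cost for each variant.  The absolute
numbers depend on the CPU and compiler flags; only the gap between the
two is comparable to the locked versus unlocked figures in the thread.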