4.7 vs 5.2.1 SMP/UP bridging performance
Scott Long
scottl at freebsd.org
Wed May 5 18:55:25 PDT 2004
Gerrit Nagelhout wrote:
> Andrew Gallatin wrote:
>
>>Bruce Evans writes:
>>
>> >
>> >                            !SMP case   SMP case
>> > Athlon XP2600 UP system:   22 cycles   37 cycles
>> > Celeron 366 SMP system:    35 cycles   48 cycles
>> >
>> > The extra cycles for the SMP case are just the extra cost of one lock
>> > instruction. Note that SMP should cost twice as much extra, but the
>> > non-SMP atomic_store_rel_int(&slock, 0) is pessimized by using xchgl
>> > which always locks the bus. After fixing this:
>> >
>> >                            !SMP case   SMP case
>> > Athlon XP2600 UP system:    6 cycles   37 cycles
>> > Celeron 366 SMP system:    10 cycles   48 cycles
>> >
>> > Mutexes take longer than simple locks, but not much longer unless the
>> > lock is contested. In particular, they don't lock the bus any more
>> > and the extra cycles for locking dominate (even in the !SMP case due
>> > to the pessimization).
>> >
>> > So there seems to be something wrong with your benchmark. Locking the
>> > bus for the SMP case always costs about 20+ cycles, but this hasn't
>> > changed since RELENG_4 and mutexes can't be made much faster in the
>> > uncontested case since their overhead is dominated by the bus lock
>> > time.
>> >
>>
>>Actually, I think his tests are accurate and bus locked instructions
>>take an eternity on P4. See
>>http://www.uwsg.iu.edu/hypermail/linux/kernel/0109.3/0687.html
>>
>>For example, with your test above, I see 212 cycles for the UP case on
>>a 2.53GHz P4. Replacing the atomic_store_rel_int(&slock, 0) with a
>>simple slock = 0; reduces that count to 18 cycles.
>>
>>If it's really safe to remove the xchg* from non-SMP atomic_store_rel*,
>>then I think you should do it. Of course, that still leaves mutexes
>>as very expensive on SMP (253 cycles on the 2.53GHz from above).
>>
>>Drew
>>
>
>
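
For anyone who wants to reproduce the comparison quoted above, here is a
rough userland sketch of the two ways of releasing slock: an atomic
exchange (which compilers typically emit as xchg on x86, implicitly
locked) and a plain store with release ordering (typically an ordinary
mov on x86). It is not the kernel's atomic_store_rel_int() and not the
benchmark used in this thread. The names release_xchg, release_store,
ns_per_iter and the BENCH_MAIN guard are made up for illustration, it
uses C11 atomics, and it reports nanoseconds rather than cycles. The
function-pointer call adds overhead to both cases, so only the
difference between the two numbers is meaningful.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() */
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

/* Stand-in for the slock word from the thread (userland, not kernel). */
static atomic_int slock;

/* Release with an atomic exchange; the old value is discarded. */
static void release_xchg(void)
{
    (void)atomic_exchange(&slock, 0);
}

/* Release with a plain store that only carries release ordering. */
static void release_store(void)
{
    atomic_store_explicit(&slock, 0, memory_order_release);
}

/* Average wall-clock nanoseconds per set-then-release iteration. */
static double ns_per_iter(void (*release)(void), long iters)
{
    struct timespec t0, t1;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        /* Uncontested "acquire" stand-in: just mark the word as held. */
        atomic_store_explicit(&slock, 1, memory_order_relaxed);
        release();
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
        (double)(t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}

#ifdef BENCH_MAIN  /* hypothetical guard: build with -DBENCH_MAIN to run */
int main(void)
{
    const long iters = 10000000;

    printf("xchg release:  %.2f ns/iter\n", ns_per_iter(release_xchg, iters));
    printf("store release: %.2f ns/iter\n", ns_per_iter(release_store, iters));
    return 0;
}
#endif
```

Build with something like "cc -O2 -DBENCH_MAIN" and run it a few times on
an otherwise idle machine. The absolute numbers will vary with CPU and
compiler.
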
> I wonder if there is anything that can be done to make the locking more
> efficient for the Xeon. Are there any other locking types that could
> be used instead?
> This might also explain why we are seeing much worse system call
> performance under 4.7 in SMP versus UP. Here is a table of results
> for some system call tests I ran. (The numbers are calls/s)

Int 0x80 system calls are known to be extremely expensive on a P4. I
think that Jeff Roberson measured them as taking 300 cycles on average.
Some work was done on implementing the alternate sysenter/sysexit
method, but I don't think it was ever finished. I think that it was
shown to give a modest speed improvement, but there was still a lot of
overhead that kept it slow on a P4.

There are other optimizations that can be done, such as a shared page
that lets you avoid system calls like getpid and gettimeofday, but that
opens some security risks that would have to be dealt with.

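A rough way to see the per-call cost from userland is to time a loop of
getpid system calls and compare it with a version that never enters the
kernel. The sketch below is only an illustration: raw_getpid,
cached_getpid, calls_per_sec and the BENCH_MAIN guard are made-up names,
and the cached value is a crude stand-in for a shared page, not how one
would actually be implemented. It ignores fork() and the security
concerns mentioned above. On current systems syscall() will not
necessarily go through int 0x80, so it will not reproduce the P4 numbers.

```c
#define _DEFAULT_SOURCE  /* for syscall() on glibc; harmless elsewhere */
#include <assert.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Keeps the compiler from discarding the calls in the timing loop. */
static volatile long sink;

/* Hypothetical stand-in for a value the kernel would export in a shared page. */
static long cached_pid = -1;

/* Always enters the kernel. */
static long raw_getpid(void)
{
    return (long)syscall(SYS_getpid);
}

/* Enters the kernel once, then answers from userland (wrong after fork()). */
static long cached_getpid(void)
{
    if (cached_pid < 0)
        cached_pid = raw_getpid();
    return cached_pid;
}

/* Calls per second for fn; returns 0 if the timer did not advance. */
static double calls_per_sec(long (*fn)(void), long iters)
{
    struct timespec t0, t1;
    double secs;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        sink = fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    secs = (double)(t1.tv_sec - t0.tv_sec) +
        (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return secs > 0.0 ? (double)iters / secs : 0.0;
}

#ifdef BENCH_MAIN  /* hypothetical guard: build with -DBENCH_MAIN to run */
int main(void)
{
    const long iters = 5000000;

    printf("getpid via syscall: %.0f calls/s\n", calls_per_sec(raw_getpid, iters));
    printf("getpid via cache:   %.0f calls/s\n", calls_per_sec(cached_getpid, iters));
    return 0;
}
#endif
```

The gap between the two lines is roughly the cost of the kernel entry and
exit, which is the part that sysenter/sysexit or a shared page would try
to shrink.
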
Scott