4.7 vs 5.2.1 SMP/UP bridging performance
Bruce Evans
bde at zeta.org.au
Thu May 6 03:19:12 PDT 2004
On Wed, 5 May 2004, Gerrit Nagelhout wrote:
> Andrew Gallatin wrote:
> > If it's really safe to remove the xchg* from non-SMP atomic_store_rel*,
> > then I think you should do it. Of course, that still leaves mutexes
> > as very expensive on SMP (253 cycles on the 2.53GHz from above).
See my other reply [1 memory barrier but not 2 seems to be needed for
each lock/unlock pair in the !SMP case, and the xchgl accidentally (?)
provides it; perhaps [lms]fence would give a faster memory barrier].
More ideas on this:
- compilers should probably now generate memory barrier instructions for
volatile variables (so volatile variables would be even slower :-). I
haven't seen gcc on i386's do this.
- jhb once tried changing mtx_lock_spin(mtx)/mtx_unlock_spin(mtx) to
critical_enter()/critical_exit(). This didn't work because it broke
mtx_assert(). It might also not work because it removes the memory
barrier. critical_enter() only has the very weak memory barrier in
disable_intr() on i386's.
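
To make the barrier question above concrete, here is a minimal sketch in
portable C11 atomics. It is not the FreeBSD mutex code: struct toy_lock and
the toy_* functions are hypothetical names invented for illustration. On x86,
compilers typically implement the exchange with xchg (implicitly locked, so a
full barrier), a release store with a plain mov, and a sequentially consistent
store with xchg or mov plus mfence. The two unlock variants therefore differ
mainly in whether the unlock side pays for a second barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy lock for illustration only; not FreeBSD's struct mtx. */
struct toy_lock {
	atomic_int held;	/* 0 = free, 1 = held */
};

/*
 * Try to take the lock with an atomic exchange.  Returns true if the
 * lock was free.  On x86 this is typically an xchg instruction, which
 * also serves as the one barrier of the lock/unlock pair.
 */
static bool
toy_trylock(struct toy_lock *l)
{
	return (atomic_exchange_explicit(&l->held, 1,
	    memory_order_acquire) == 0);
}

/*
 * Unlock with release ordering only.  On x86 this is typically a plain
 * store; the compiler is still prevented from moving earlier accesses
 * past it.
 */
static void
toy_unlock_rel(struct toy_lock *l)
{
	assert(atomic_load_explicit(&l->held, memory_order_relaxed) == 1);
	atomic_store_explicit(&l->held, 0, memory_order_release);
}

/*
 * Unlock with a sequentially consistent store.  On x86 this typically
 * costs an xchg or a store followed by mfence, i.e. a second barrier.
 */
static void
toy_unlock_seq(struct toy_lock *l)
{
	assert(atomic_load_explicit(&l->held, memory_order_relaxed) == 1);
	atomic_store_explicit(&l->held, 0, memory_order_seq_cst);
}
```

Timing a loop of toy_trylock()/toy_unlock_rel() against one using
toy_unlock_seq() is one way to see what the second barrier costs on a given
CPU. Whether the weaker unlock is sufficient for the kernel's !SMP case is the
open question in the text, and this sketch does not settle it.
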
> I wonder if there is anything that can be done to make the locking more
> efficient for the Xeon. Are there any other locking types that could
> be used instead?
I can't think of anything for the SMP case. See above for the !SMP case.
> This might also explain why we are seeing much worse system call
> performance under 4.7 in SMP versus UP. Here is a table of results
> for some system call tests I ran. (The numbers are calls/s)
>
> 2.8GHz Xeon
>                    UP       SMP
> write          904427    661312
> socket        1327692   1067743
> select         554131    434390
> gettimeofday  1734963    252479
>
> 1.3GHz PIII
>                    UP       SMP
> write          746705    532223
> socket        1179819    977448
> select         727811    556537
> gettimeofday  1849862    186387
It's why the Xeon is relatively slower under -current and SMP. -current
just does more locking and more of other things.
> The really interesting one is gettimeofday. For both the Xeon & PIII,
> the UP is much better than SMP, but the UP for PIII is better than that
> of the Xeon. I may try to get the results for 5.2.1 later. I can
> forward the source code of this program to anyone else who wants to try
> it out.
gettimeofday() is slower for SMP because it uses a different timecounter.
This is a hardware problem -- there is no good timecounter available.
It looks like the TSC timecounter is being used for the UP cases and
either the i8254 or the ACPI-slow timecounter for the SMP cases.
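
The calls/s figures in the quoted table are easier to compare with timecounter
costs once they are converted to time per call. A small sketch of that
arithmetic follows; ns_per_call() is a hypothetical helper name, and the inputs
in the comments are the gettimeofday rows from the table.

```c
#include <assert.h>

/* Convert a rate in calls per second to nanoseconds per call. */
static double
ns_per_call(double calls_per_sec)
{
	assert(calls_per_sec > 0.0);
	return (1e9 / calls_per_sec);
}

/*
 * gettimeofday rows from the quoted table:
 *   Xeon UP:   ns_per_call(1734963) is about  576 ns
 *   Xeon SMP:  ns_per_call(252479)  is about 3961 ns
 *   PIII UP:   ns_per_call(1849862) is about  541 ns
 *   PIII SMP:  ns_per_call(186387)  is about 5365 ns
 */
```

So the UP cases spend well under a microsecond per call, while the SMP cases
spend roughly 4 to 5 microseconds, which is what a much slower timecounter
would produce.
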
Reading the TSC takes about 10-12 cycles on most i386's (probably
many more on P4 ;-). Syscall overhead adds a lot to this, but
gettimeofday() still takes much less than a microsecond. The fastest
I've seen recently is 260nS/578 cycles for clock_gettime() on an
AthlonXP. OTOH, reading the i8254 takes about 4000 nS, so clock_gettime()
takes 4190nS with the i8254 timecounter on the same AthlonXP system that
takes 260nS with the TSC timecounter. This system also has a slow ACPI timer
so clock_gettime() takes 1397nS with the ACPI-fast timecounter and
about 3 times as long with the ACPI-slow timecounter. Recently-fixed
bugs made it often use the ACPI-slow timecounter although the ACPI-fast
timecounter always works.
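
Per-call figures like the ones above can be reproduced with a simple timing
loop. The following is a minimal sketch, not the program used for the numbers
in this thread: clock_gettime_ns_per_call() is a hypothetical name, and
CLOCK_MONOTONIC is an arbitrary choice of clock. It measures whatever path the
C library takes for clock_gettime(), which on some systems avoids a real
system call, so results are not directly comparable across operating systems.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <time.h>

/*
 * Call clock_gettime() `iterations' times and return the average cost
 * of one call in nanoseconds, as measured by the same clock.
 */
static double
clock_gettime_ns_per_call(long iterations)
{
	struct timespec start, end, scratch;
	double elapsed_ns;
	long i;

	assert(iterations > 0);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iterations; i++)
		clock_gettime(CLOCK_MONOTONIC, &scratch);
	clock_gettime(CLOCK_MONOTONIC, &end);

	elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9 +
	    (double)(end.tv_nsec - start.tv_nsec);
	return (elapsed_ns / (double)iterations);
}
```

Calling clock_gettime_ns_per_call(1000000) and printing the result gives one
number per run. Repeating the run after switching the kernel's timecounter
shows how much of the cost is the hardware read rather than syscall overhead.
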
Slow timecounters mainly affect workloads that do too many context
switches or timestamps on tinygrams. Probably for yours but not mine.
I only notice them when I run microbenchmarks. The simplest one that
shows them is "ping -fq localhost". There are normally 7 timestamps
per packet (1 to put in the packet in userland, 2 for bookkeeping in
userland, 2 for pessimization of netisrs in the kernel and 2 for
tripping on our own Giant foot in the kernel). RELENG_4 only has the
userland ones. With reasonable CPUs (1GHz+ or so) and slow timecounters,
making even one of these timestamps takes longer than everything else.
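
A back-of-the-envelope sketch of the per-packet timestamp cost follows.
timestamp_overhead_ns() is a hypothetical helper, and using the AthlonXP
clock_gettime() figures quoted earlier as the cost of every timestamp is an
assumption: the in-kernel timestamps skip the syscall overhead, so they would
be somewhat cheaper than this.

```c
#include <assert.h>

/* Total time spent taking `timestamps' timestamps, in nanoseconds. */
static double
timestamp_overhead_ns(int timestamps, double ns_per_timestamp)
{
	assert(timestamps >= 0);
	assert(ns_per_timestamp >= 0.0);
	return ((double)timestamps * ns_per_timestamp);
}

/*
 * With 7 timestamps per packet:
 *   TSC (260 ns each):        7 *  260 =  1820 ns
 *   ACPI-fast (1397 ns each): 7 * 1397 =  9779 ns
 *   i8254 (4190 ns each):     7 * 4190 = 29330 ns
 * With only the 3 userland timestamps and the i8254:
 *                             3 * 4190 = 12570 ns
 */
```

Under these assumptions a slow timecounter adds tens of microseconds to every
packet, while the TSC adds about two.
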
Bruce