Optimized copy&move (was: Re: [PATCH] Mantaining turnstile aligned to 128 bytes in i386 CPUs)
Bruce Evans
bde at zeta.org.au
Sun Jan 21 04:03:53 UTC 2007
On Sat, 20 Jan 2007, David Malone wrote:
> On Thu, Jan 18, 2007 at 11:16:19AM +1100, Bruce Evans wrote:
>> - the FPU routines are faster on Athlons (XP and 64 at least), but these
>> didn't exist until 2001. The introduction of these CPUs may have
>> been the trigger for turning off the FPU routines in -current in 2001.
>> Until then problems were limited to Pentium-1's since the dynamic
>> configuration prevented the routines being used on all other machines.
>
> I think a very quirky K6-2 machine that I had let us reproduce the
> problem fairly dependably and may have been part of the reason it
> was finally turned off.
I just looked again at your old (2001) mail about this. The userland
benchmark was flawed. It tried 3 methods sequentially without warming
up caches, so all methods did unintended testing of I-cache misses
(including branch target cache misses) and the first method (userland
bzero) warmed up the D-cache for the other 2. The kernel runtime
configuration also fails to either warm or cool the caches initially.
It assumes P1 cache sizes and depends on a 1MB buffer being much larger
than caches. Maybe this was not enough for K6-2. It is certainly not
enough for Athlon64, but I think it would mostly cause false negatives
so I don't understand why it gave a false positive for the K6-2.
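
To illustrate the warm-up point, here is a minimal sketch of a userland
timing harness that runs each method a few times untimed before measuring
it, so that no method pays the I-cache misses or collects the D-cache
warming that belong to another. It is not the 2001 benchmark or the kernel
code; the names time_zero, zero_memset and zero_loop are hypothetical, and
memset() merely stands in for the library bzero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by every zeroing method under test. */
typedef void (*zero_fn)(void *buf, size_t len);

/* Method 1: the library routine (stand-in for userland bzero). */
static void zero_memset(void *buf, size_t len)
{
    memset(buf, 0, len);
}

/* Method 2: a plain byte loop; volatile keeps the stores from being
 * optimized away or replaced by a call to memset. */
static void zero_loop(void *buf, size_t len)
{
    volatile unsigned char *p = buf;
    size_t i;

    for (i = 0; i < len; i++)
        p[i] = 0;
}

/*
 * Run fn `warmup` times untimed so that its code and the buffer are in
 * the same cache state for every method, then time `reps` calls and
 * return the average CPU seconds per call.
 */
static double time_zero(zero_fn fn, void *buf, size_t len,
    int warmup, int reps)
{
    clock_t t0, t1;
    int i;

    assert(reps > 0);
    for (i = 0; i < warmup; i++)
        fn(buf, len);            /* untimed: warms I-cache and D-cache */

    t0 = clock();
    for (i = 0; i < reps; i++)
        fn(buf, len);
    t1 = clock();

    return ((double)(t1 - t0) / CLOCKS_PER_SEC) / reps;
}
```

Calling time_zero(zero_memset, buf, len, 3, 1000) and then
time_zero(zero_loop, buf, len, 3, 1000) gives each method the same starting
conditions. This measures the warm-cache case only; a cold-cache
measurement would instead need the caches flushed before each timed run,
for example by touching a buffer much larger than the largest cache on the
machine under test.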
After fixing the userland benchmark, userland bzero did much better
and your benchmark agreed with mine that FPU methods for bzero are
just pessimizations on A64-AXP. However, the behaviour for bcopy
is quite different on A64-AXP -- even the old FPU methods are small
optimizations in some cases (on A64, about 25% in the fully-L2 cached
case; little difference for other large copies).
Bruce