[patch] Adding optimized kernel copying support - Part III

Bruce Evans bde at zeta.org.au
Wed May 31 16:25:32 PDT 2006

On Wed, 31 May 2006, Attilio Rao wrote:

> 2006/5/31, Suleiman Souhlal <ssouhlal at freebsd.org>:
>> Nice work. Any chance you could also port it to amd64? :-)
> Not in the near future, I think. :P

It is not useful for amd64.  An amd64 has enough instruction bandwidth
to saturate the L1 cache using 64-bit accesses although not using
32-bit accesses.  An amd64 has 64-bit integer registers which can be
accessed without the huge setup overheads and code complications for
MMX/XMM registers.  It already uses 64-bit registers or 64-bit movs
for copying and zeroing of course.  Perhaps it should use prefetches
and nontemporal writes more than it already does, but these don't
require using SSE2 instructions like nontemporal writes do for 32-bit
CPUs.
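
To make the amd64 point concrete, here is a minimal userland sketch, not
the kernel's actual copy routine.  copy64() and HAVE_MOVNTI are
hypothetical names invented for this illustration.  It copies through
64-bit integer registers and optionally uses the _mm_stream_si64
intrinsic, which compiles to movnti and stores from a general-purpose
register, so no MMX/XMM state is touched.  It assumes 8-byte-aligned
buffers and a length that is a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>		/* _mm_stream_si64(), _mm_sfence() */
#define HAVE_MOVNTI 1
#else
#define HAVE_MOVNTI 0		/* other targets: plain stores only */
#endif

/*
 * Hypothetical helper: copy len bytes in 64-bit words.  If nontemporal
 * is nonzero and the target supports it, the stores bypass the cache.
 */
static void
copy64(void *dst, const void *src, size_t len, int nontemporal)
{
	uint64_t *d = dst;
	const uint64_t *s = src;
	size_t i, n = len / sizeof(uint64_t);

	/* Assumptions of this sketch: whole, aligned 64-bit words. */
	assert(len % sizeof(uint64_t) == 0);
	assert((uintptr_t)dst % sizeof(uint64_t) == 0);
	assert((uintptr_t)src % sizeof(uint64_t) == 0);

#if HAVE_MOVNTI
	if (nontemporal) {
		for (i = 0; i < n; i++)
			_mm_stream_si64((long long *)&d[i], (long long)s[i]);
		/* Order the weakly-ordered streaming stores. */
		_mm_sfence();
		return;
	}
#else
	(void)nontemporal;
#endif
	/* Ordinary 64-bit loads and stores through integer registers. */
	for (i = 0; i < n; i++)
		d[i] = s[i];
}
```

Whether the nontemporal path is actually faster depends on the copy size
and on whether the destination is read again soon, so it would need
benchmarking.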

>> Does that mean it won't work with SMP and PREEMPTION?
> Yes it will work (even if I think it needs more testing) but maybe
> would give lesser performances on SMP|PREEMPTION due to too much
> traffic on memory/cache. For this I was planing to use non-temporal
> instructions
> (obviously benchmarks would be very appreciate).

Er, isn't its main point to fix some !SMP assumptions made in the old
copying-through-the-FPU code?  (The old code is messy due to its avoidance
of global changes.  It wants to preserve the FPU state on the stack, but
this doesn't quite work so it does extra things (still mostly locally)
that only work in the !SMP && (!SMPng even with UP) case.  Patching this
approach to work with SMP || SMPng cases would make it messier.)

The new code wouldn't behave much differently under SMP.  It just might
be a smaller optimization because more memory pressure for SMP causes
more cache misses for everything and there are no benefits from copying
through MMX/XMM unless nontemporal writes are used.  All (?) CPUs with
MMX or SSE* can saturate main memory using 32-bit instructions.  On
32-bit CPUs, the benefits of using MMX/XMM come from being able to
saturate the L1 cache on some CPUs (mainly Athlons and not P[2-4]),
and from being able to use nontemporal writes on some CPUs (at least
AthlonXP via SSE extensions and all CPUs with SSE2).
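
For illustration only, the XMM variant of a nontemporal copy might look
like the userland sketch below.  copy_nt128() and HAVE_SSE2 are
hypothetical names, not anything from the patch under discussion.  It
assumes SSE2 is selected at compile time and falls back to memcpy()
otherwise.  It also sidesteps the problem this thread is really about:
in the kernel the FPU/XMM state would have to be saved and restored
around the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>		/* _mm_loadu_si128(), _mm_stream_si128() */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0		/* no SSE2: the whole copy goes to memcpy() */
#endif

/*
 * Hypothetical helper: copy len bytes, using 16-byte nontemporal
 * stores for the part of the destination that is 16-byte aligned.
 */
static void
copy_nt128(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	assert(dst != NULL && src != NULL);
#if HAVE_SSE2
	/* Bytes needed to bring the destination to a 16-byte boundary. */
	size_t head = (size_t)(-(uintptr_t)d & 15);

	if (head > len)
		head = len;
	memcpy(d, s, head);
	d += head;
	s += head;
	len -= head;

	/* Unaligned loads, aligned streaming stores that bypass the cache. */
	for (; len >= 16; d += 16, s += 16, len -= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);
		_mm_stream_si128((__m128i *)d, v);
	}
	/* Order the weakly-ordered streaming stores. */
	_mm_sfence();
#endif
	/* Tail bytes (or everything, without SSE2). */
	memcpy(d, s, len);
}
```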


More information about the freebsd-hackers mailing list