Fwd: 5-STABLE kernel build with icc broken

Bruce Evans bde at zeta.org.au
Fri Apr 1 05:20:34 PST 2005


On Thu, 31 Mar 2005, Matthew Dillon wrote:

I didn't mean to get into the kernel's use of the FPU, but...

>    All I really did was implement a comment that DG had made many years
>    ago in the PCB structure about making the FPU save area a pointer rather
>    than hardwiring it into the PCB.

ISTR writing something like that.  dg committed most of my early work
since I didn't have commit access at the time.

>...
>    The use of the XMM registers is a cpu optimization.  Modern CPUs,
>    especially AMD Athlon and Opterons, are more efficient with 128 bit
>    moves than with 64 bit moves.   I experimented with all sorts of
>    configurations, including the use of special data caching instructions,
>    but they had so many special cases and degenerate conditions that
>    I found that simply using straight XMM instructions, reading as big
>    a glob as possible, then writing the glob, was by far the best solution.

Are you sure about that?  The amd64 optimization manual says (essentially)
that big globs are bad, and my benchmarks confirm this.  The best glob size
is 128 bits according to my benchmarks.  This can be obtained using 2
64-bit reads into 64-bit registers followed by 2 64-bit writes of those
registers, or by a read-write of a single 128-bit register.  The 64-bit
registers can be either MMX or integer registers on 64-bit systems, but
the 128-bit registers must be XMM on all systems.  I get identical speeds
of 12.9GB/sec (+/-0.1GB/sec) on a fairly old and slow Athlon64 system
for copying 16K (fully cached) through MMX and XMM 128 bits at a time
using the following instructions:

 	# MMX:				# XMM
 	movq	(%0),%mm0		movdqa	(%0),%xmm0
 	movq	8(%0),%mm1		movdqa	%xmm0,(%1)
 	movq	%mm0,(%1)		...	# unroll same amount
 	movq	%mm1,8(%1)
 	...	# unroll to copy 64 bytes per iteration
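
Spelled out as a self-contained C function with GCC inline asm, the XMM
variant is roughly as follows.  This is only an illustrative sketch (the
function name and the 64-byte unrolling are mine); it assumes SSE2,
16-byte aligned buffers, a length that is a multiple of 64, and, for
kernel use, that the FPU/XMM context has been dealt with:

#include <stddef.h>

static void
copy_xmm(void *dst, const void *src, size_t len)
{
	const char *s = src;
	char *d = dst;
	size_t n;

	/* Copy 64 bytes per iteration through 4 XMM registers. */
	for (n = 0; n < len; n += 64) {
		__asm__ __volatile__(
		    "movdqa   (%0),%%xmm0\n\t"
		    "movdqa 16(%0),%%xmm1\n\t"
		    "movdqa 32(%0),%%xmm2\n\t"
		    "movdqa 48(%0),%%xmm3\n\t"
		    "movdqa %%xmm0,  (%1)\n\t"
		    "movdqa %%xmm1,16(%1)\n\t"
		    "movdqa %%xmm2,32(%1)\n\t"
		    "movdqa %%xmm3,48(%1)"
		    : : "r" (s + n), "r" (d + n)
		    : "xmm0", "xmm1", "xmm2", "xmm3", "memory");
	}
}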

Unfortunately (since I want to avoid using both MMX and XMM), I haven't
managed to make copying through 64-bit integer registers work as well.
Copying 128 bits at a time using 2 pairs of movq's through integer
registers gives only 7.9GB/sec.  movq through MMX is never that slow.
However, movdqu through xmm is even slower (7.4GB/sec).
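
The 2-pairs-of-movq copy through 64-bit integer registers can be written
roughly as below (amd64 only, since it needs 64-bit general registers;
again just an illustrative sketch, not necessarily the exact code timed,
with the scratch registers left to the compiler):

#include <stddef.h>

static void
copy_int64(void *dst, const void *src, size_t len)
{
	const char *s = src;
	char *d = dst;
	size_t n;
	unsigned long t0, t1;

	/* Copy 16 bytes per iteration through 2 integer registers. */
	for (n = 0; n < len; n += 16) {
		__asm__ __volatile__(
		    "movq  (%2),%0\n\t"
		    "movq 8(%2),%1\n\t"
		    "movq %0,  (%3)\n\t"
		    "movq %1, 8(%3)"
		    : "=&r" (t0), "=&r" (t1)
		    : "r" (s + n), "r" (d + n)
		    : "memory");
	}
}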

The fully cached case is too unrepresentative of normal use, and normal
(partially cached) use is hard to benchmark, so I normally benchmark
the fully uncached case.  For that, movnt* is best for benchmarks but
not for general use, and it hardly matters which registers are used.
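
A movnt* variant for the uncached case looks roughly like the sketch
below (SSE2 movntdq as an example).  The non-temporal stores bypass the
caches, which is part of why this is good for benchmarks but not for
general use, and they are weakly ordered, so an sfence is needed at the
end; the name and unrolling are again illustrative:

#include <stddef.h>

static void
copy_nt(void *dst, const void *src, size_t len)
{
	const char *s = src;
	char *d = dst;
	size_t n;

	/* Cached loads, non-temporal (cache-bypassing) stores. */
	for (n = 0; n < len; n += 64) {
		__asm__ __volatile__(
		    "movdqa    (%0),%%xmm0\n\t"
		    "movdqa  16(%0),%%xmm1\n\t"
		    "movdqa  32(%0),%%xmm2\n\t"
		    "movdqa  48(%0),%%xmm3\n\t"
		    "movntdq %%xmm0,  (%1)\n\t"
		    "movntdq %%xmm1,16(%1)\n\t"
		    "movntdq %%xmm2,32(%1)\n\t"
		    "movntdq %%xmm3,48(%1)"
		    : : "r" (s + n), "r" (d + n)
		    : "xmm0", "xmm1", "xmm2", "xmm3", "memory");
	}
	/* Order the weakly-ordered stores before returning. */
	__asm__ __volatile__("sfence" ::: "memory");
}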

>    The key for fast block copying is to not issue any memory writes other
>    than those related directly to the data being copied.  This avoids
>    unnecessary RAS cycles which would otherwise kill copying performance.
>    In tests I found that copying multi-page blocks in a single loop was
>    far more efficient than copying data page-by-page precisely because
>    page-by-page copying was too complex to be able to avoid extraneous
>    writes to memory unrelated to the target buffer in between each page copy.

By page-by-page, do you mean prefetch a page at a time into the L1
cache?

I've noticed apparent losses from extraneous reads or writes mostly in
benchmarks that do just (very large) writes.  On at least old Celerons
and AthlonXPs, the writes go straight to the L1/L2 caches (unless you
use movntq on AthlonXPs).  The caches are flushed to main memory some
time later, apparently not very well, since some pages take more than
twice as long to write as others (as seen by the writer filling the
caches), and the slow case happens often enough to reduce the average
write speed by up to 50%.  This problem can be reduced by putting
memory bank bits in the page colors, but this is hard to get right
even for the simple, unrepresentative case of large writes.
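
Purely as a hypothetical illustration of the bank-bits-in-page-colors
idea (the bit positions and counts below are made up; the real ones
depend on the memory controller and DIMM interleave), the color
computation would fold in the presumed bank-select bits so that walking
through colors also walks through banks:

#define	PAGE_SHIFT	12	/* 4K pages */
#define	NCOLORS		64	/* hypothetical number of page colors */
#define	BANK_SHIFT	22	/* hypothetical bank-select bit position */
#define	NBANKS		4	/* hypothetical number of banks */

static unsigned
page_color(unsigned long pa)
{
	unsigned cache_color = (unsigned)(pa >> PAGE_SHIFT) % NCOLORS;
	unsigned bank = (unsigned)(pa >> BANK_SHIFT) % NBANKS;

	/* Offset the cache-based color by the bank so that consecutive
	 * colors alternate memory banks as well as cache lines. */
	return ((cache_color + bank * (NCOLORS / NBANKS)) % NCOLORS);
}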

Bruce
