Fwd: 5-STABLE kernel build with icc broken

Matthew Dillon dillon at apollo.backplane.com
Fri Apr 1 10:06:08 PST 2005


:>    The use of the XMM registers is a cpu optimization.  Modern CPUs,
:>    especially AMD Athlon and Opterons, are more efficient with 128 bit
:>    moves than with 64 bit moves.   I experimented with all sorts of
:>    configurations, including the use of special data caching instructions,
:>    but they had so many special cases and degenerate conditions that
:>    I found that simply using straight XMM instructions, reading as big
:>    a glob as possible, then writing the glob, was by far the best solution.
:
:Are you sure about that?  The amd64 optimization manual says (essentially)
:that big globs are bad, and my benchmarks confirm this.  The best glob size
:is 128 bits according to my benchmarks.  This can be obtained using 2
:...
:
:Unfortunately (since I want to avoid using both MMX and XMM), I haven't
:managed to make copying through 64-bit integer registers work as well.
:Copying 128 bits at a time using 2 pairs of movq's through integer
:registers gives only 7.9GB/sec.  movq through MMX is never that slow.
:However, movdqu through xmm is even slower (7.4GB/sec).
:
:The fully cached case is too unrepresentative of normal use, and normal
:(partially cached) use is hard to benchmark, so I normally benchmark
:the fully uncached case.  For that, movnt* is best for benchmarks but
:not for general use, and it hardly matters which registers are used.

    Yah, I'm pretty sure.  I tested the fully cached (L1), partially
    cached (L2), and the fully uncached cases.   I don't have a logic 
    analyzer but what I think is happening is that the cpu's write buffer
    is messing around with the reads and causing extra RAS cycles to occur.
    I also tested using various combinations of movdqa, movntdq, and
    prefetcha.  Carefully arranged non-temporal and/or prefetch instructions
    were much faster for the uncached case, but much, MUCH slower for
    the partially cached (L2) or fully (L1) cached case, making them 
    unsuitable for a generic copy.  I am rather miffed that AMD screwed up
    the non-temporal instructions so badly.

    I also think there might be some odd instruction pipeline effects
    that skew the results when only one or two instructions are between
    the load into an %xmm register and the store from the same register.
    I tried using 2, 4, and 8 XMM registers.  8 XMM registers seemed to
    work the best.
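
    Roughly, in C with SSE2 intrinsics the inner loop looks like the
    sketch below.  This is an illustration only: the function name and
    the "both pointers 16-byte aligned, length a multiple of 128"
    assumptions are mine, and real kernel code also has to save and
    restore the FPU/XMM state around it.

#include <emmintrin.h>  /* SSE2: __m128i, _mm_load_si128, _mm_store_si128 */
#include <stddef.h>

/*
 * Hypothetical user-space illustration: copy len bytes (both pointers
 * 16-byte aligned, len a multiple of 128) by loading a full 128-byte
 * block into eight XMM registers before issuing any stores.
 */
static void
xmm_block_copy(void *dst, const void *src, size_t len)
{
        const __m128i *s = src;
        __m128i *d = dst;

        while (len >= 128) {
                __m128i r0 = _mm_load_si128(s + 0);
                __m128i r1 = _mm_load_si128(s + 1);
                __m128i r2 = _mm_load_si128(s + 2);
                __m128i r3 = _mm_load_si128(s + 3);
                __m128i r4 = _mm_load_si128(s + 4);
                __m128i r5 = _mm_load_si128(s + 5);
                __m128i r6 = _mm_load_si128(s + 6);
                __m128i r7 = _mm_load_si128(s + 7);

                _mm_store_si128(d + 0, r0);
                _mm_store_si128(d + 1, r1);
                _mm_store_si128(d + 2, r2);
                _mm_store_si128(d + 3, r3);
                _mm_store_si128(d + 4, r4);
                _mm_store_si128(d + 5, r5);
                _mm_store_si128(d + 6, r6);
                _mm_store_si128(d + 7, r7);

                s += 8;
                d += 8;
                len -= 128;
        }
}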
  
    Of course, I primarily tested on an Athlon 64 3200+, so YMMV.  (One
    of the first Athlon 64s, so it has a 1MB L2 cache).
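
    For what it's worth, the three cases can be separated with a trivial
    user-space harness along the lines of the sketch below.  This is an
    illustration only, not my actual test code: it times plain memcpy()
    and you would substitute the copy routine under test, sizing the
    buffers to fit in L1, fit in L2, or blow out the caches entirely.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Rough sketch: time repeated copies of a buffer and report GB/sec.
 * NB: a real harness needs to keep the compiler from eliding the copy.
 */
static double
copy_rate(size_t size, int iters)
{
        char *src = malloc(size), *dst = malloc(size);
        struct timespec t0, t1;
        double secs;
        int i;

        memset(src, 1, size);           /* touch the pages */
        memset(dst, 2, size);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < iters; i++)
                memcpy(dst, src, size); /* substitute the routine under test */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        free(src);
        free(dst);
        return ((double)size * iters / secs / 1e9);
}

int
main(void)
{
        printf("L1-sized:  %.1f GB/s\n", copy_rate(16 * 1024, 100000));
        printf("L2-sized:  %.1f GB/s\n", copy_rate(512 * 1024, 10000));
        printf("uncached:  %.1f GB/s\n", copy_rate(64 * 1024 * 1024, 20));
        return (0);
}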
    
:>    The key for fast block copying is to not issue any memory writes other
:>    than those related directly to the data being copied.  This avoids
:>    unnecessary RAS cycles which would otherwise kill copying performance.
:>    In tests I found that copying multi-page blocks in a single loop was
:>    far more efficient than copying data page-by-page, precisely because
:>    page-by-page copying was too complex to be able to avoid extraneous
:>    writes to memory unrelated to the target buffer in between each page copy.
:
:By page-by-page, do you mean prefetch a page at a time into the L1
:cache?

    No, I meant that taking, e.g., a vm_page_t array, mapping it page by
    page, and copying it in 4K chunks seems to be a lot slower than doing
    a linear mapping of the entire vm_page_t array and one big copy.
    Literally the same code, just rearranged a bit.  Just writing to the
    stack in between each page was enough to throw it off.
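
    Schematically the difference is just the two shapes below.  The
    helper names are hypothetical stand-ins for the real KVA mapping
    code -- the real thing is mapping vm_page_t's, not taking flat
    pointers -- but it shows where the per-page bookkeeping lands.

#include <stddef.h>
#include <strings.h>

#define PAGE_SIZE       4096

struct vm_page;                         /* opaque stand-in */

/* Hypothetical stand-ins for the real KVA mapping routines. */
extern void     *map_one_page(struct vm_page *m);
extern void     *map_linear(struct vm_page **m, size_t npages);

/*
 * Page-by-page: the per-page mapping and loop bookkeeping (including
 * stack writes) land in between every 4K worth of copy writes.
 */
static void
copy_by_page(struct vm_page **m, char *dst, size_t npages)
{
        size_t i;

        for (i = 0; i < npages; i++)
                bcopy(map_one_page(m[i]), dst + i * PAGE_SIZE, PAGE_SIZE);
}

/*
 * Linear: map the whole run once, then issue one long copy with no
 * unrelated memory writes in the middle.
 */
static void
copy_linear(struct vm_page **m, char *dst, size_t npages)
{
        bcopy(map_linear(m, npages), dst, npages * PAGE_SIZE);
}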

:I've noticed strange loss (apparently) from extraneous reads or writes
:more for benchmarks that do just (very large) writes.  On at least old
:Celerons and AthlonXPs, the writes go straight to the L1/L2 caches
:(unless you use movntq on AthlonXP's).  The caches are flushed to main
:memory some time later, apparently not very well since some pages take
:more than twice as long to write as others (as seen by the writer
:filling the caches), and the slow case happens enough to affect the
:average write speed by up to 50%.  This problem can be reduced by
:putting memory bank bits in the page colors.  This is hard to get right
:even for the simple unrepresentative case of large writes.
:
:Bruce

    I've seen the same effects and come to the same conclusion.  The
    copy code I eventually settled on was this (taken from my i386/bcopy.s).
    It isn't as fast as using movntdq for the fully uncached case, but it
    performs best in the system as a whole, because in real life the data
    tends to have been accessed by someone and to already be in the cache
    (e.g. the source data tends to be in the cache even if the device
    driver doesn't touch the target data).

    I wish AMD had made movntdq work the same as movdqa for the case where
    the data was already in the cache; then movntdq would have been the
    clear winner.
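
    For comparison, the non-temporal variant is the same loop with the
    stores replaced by movntdq; in intrinsics terms it would look
    something like the sketch below (again an illustration, not the
    kernel code).  The streaming stores bypass the cache, which is
    exactly why they win the fully uncached case and lose the cached
    ones.

#include <emmintrin.h>  /* SSE2: _mm_stream_si128 (movntdq), _mm_sfence */
#include <stddef.h>

/*
 * Hypothetical non-temporal variant: same loads as the movdqa version,
 * but the stores bypass the cache.  Assumes 16-byte alignment and a
 * length that is a multiple of 16.
 */
static void
xmm_stream_copy(void *dst, const void *src, size_t len)
{
        const __m128i *s = src;
        __m128i *d = dst;

        while (len >= 16) {
                _mm_stream_si128(d, _mm_load_si128(s));
                s++;
                d++;
                len -= 16;
        }
        _mm_sfence();           /* order the streaming stores */
}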

    The prefetchnta I have commented out seemed to improve performance,
    but it requires 3dNOW, and I still wanted an MMX copy mode for CPUs
    that have MMX but not 3dNOW.  Prefetching less than 128 bytes ahead
    did not help, and prefetching more than 128 bytes ahead (e.g. 256(%esi))
    seemed to cause extra RAS cycles.  It was unbelievably finicky, not at
    all what I expected.

	[ MMX_SAVE_BLOCK checks the length against 2048 and handles the FPU
	  setup and the kernel FPU lock bit ]

ENTRY(asm_xmm_bcopy)
        MMX_SAVE_BLOCK(asm_generic_bcopy)
        cmpl    %esi,%edi       /* if (edi < esi) fwd copy ok */
        jb      1f
        addl    %ecx,%esi
        cmpl    %esi,%edi       /* if (edi < esi + count) do bkwrds copy */
        jb      10f
        subl    %ecx,%esi
1:
        movl    %esi,%eax       /* skip xmm if the data is not aligned */
        andl    $15,%eax
        jnz     5f
        movl    %edi,%eax
        andl    $15,%eax
        jz      3f
        jmp     5f

        SUPERALIGN_TEXT

2:
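        /*
         * Load a full 128-byte block into all eight XMM registers before
         * issuing any stores, so the reads and writes stay grouped.
         */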
        movdqa  (%esi),%xmm0
        movdqa  16(%esi),%xmm1
        movdqa  32(%esi),%xmm2
        movdqa  48(%esi),%xmm3
        movdqa  64(%esi),%xmm4
        movdqa  80(%esi),%xmm5
        movdqa  96(%esi),%xmm6
        movdqa  112(%esi),%xmm7
        /*prefetchnta 128(%esi) 3dNOW */
        addl    $128,%esi

        /*
         * movdqa or movntdq can be used.
         */
        movdqa  %xmm0,(%edi)
        movdqa  %xmm1,16(%edi)
        movdqa  %xmm2,32(%edi)
        movdqa  %xmm3,48(%edi)
        movdqa  %xmm4,64(%edi)
        movdqa  %xmm5,80(%edi)
        movdqa  %xmm6,96(%edi)
        movdqa  %xmm7,112(%edi)
        addl    $128,%edi
3:
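        /*
         * Loop while at least a full 128-byte block remains; otherwise
         * restore the count and either finish (6f) or fall into the
         * sub-128-byte tail copy (5f).
         */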
        subl    $128,%ecx
        jae     2b
        addl    $128,%ecx
        jz      6f
        jmp     5f

	[ fall through to loop to handle blocks less than 128 bytes ]

        SUPERALIGN_TEXT
4:
        movq    (%esi),%mm0
        movq    8(%esi),%mm1
        movq    16(%esi),%mm2
        movq    24(%esi),%mm3
        ...

10:
	[ backwards copy code ... ]

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>

