svn commit: r303583 - head/sys/amd64/amd64

Sun Jul 31 13:51:38 UTC 2016

On Sun, Jul 31, 2016 at 11:11:25PM +1000, Bruce Evans wrote:

> Misalignment of this loop made it almost twice as slow on old Turion2 with
> slow DDR2 memory.  It made no difference on Haswell.  I added an extra
> movnti, but that makes little or no difference.  2 more movnti's wouldn't
> fit in a 16-byte cache line so are slower unless even more care is taken
> with alignment (or with less care, 4 with misalignment are not less than
> twice as slow as 1 with alignment).
> 
> I thought that alignment and unrolling didn't matter here, because movnti
> has to wait for memory and almost any loop runs fast enough to keep up.
> The timing on my old system is something like: CPUs at 2 GHz; main memory
> at 4 GB/sec; movnti is only 4 bytes wide on i386 (so this problem
> only affects i386, at least with slow memory).  So sustaining 4 GB/sec
> requires 1 G movnti's/sec, so the loop needs to run at 2 cycles/iteration
> to keep up.  But when it is misaligned, it runs at 3-4 cycles/iteration.
> Alignment makes it take about 2, and the extra movnti is for safety and
> to work with faster memory.
> 
> On Haswell with CPUs at 4 GHz, 2 cycles/iteration gives 8 GB/sec on
> i386 and 16 GB/sec on amd64 with wider movnti.  IIRC, 16 GB/sec is about
> the main memory speed, so nothing better is possible, but just 1 extra
> movnti gives more with faster memory.  This is just worse than bzero()

What about a modern system with 120 GB/sec main memory speed?
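
For reference, the budget arithmetic in the quoted message can be sketched
as a small C helper.  This is only a back-of-the-envelope model, not code
from the commit: the function name cycles_per_store and its parameters are
made up for illustration, and it assumes a single core issuing one
non-temporal store at a time has to sustain the whole memory bandwidth.

```c
/*
 * Hypothetical helper (not part of the commit): CPU cycles available per
 * store if a store loop is to keep up with memory.
 *
 *   cpu_hz          - core clock in Hz
 *   mem_bytes_sec   - sustained memory bandwidth in bytes/sec
 *   bytes_per_store - width of one movnti (4 on i386, 8 on amd64)
 *
 * Assumption: one core must saturate the memory bandwidth by itself.
 */
static double
cycles_per_store(double cpu_hz, double mem_bytes_sec, double bytes_per_store)
{
	/* Stores needed per second to fill the given bandwidth. */
	double stores_per_sec = mem_bytes_sec / bytes_per_store;

	/* Cycles the CPU can spend on each of those stores. */
	return (cpu_hz / stores_per_sec);
}
```

With the figures quoted above, cycles_per_store(2e9, 4e9, 4) gives the
2 cycles/iteration budget for the Turion2 on i386, and
cycles_per_store(4e9, 16e9, 8) gives 2 cycles for Haswell on amd64.  Plugging
in 120 GB/sec with an assumed 4 GHz core and 8-byte stores gives roughly 0.27
cycles per store under this model.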