svn commit: r303583 - head/sys/amd64/amd64

Bruce Evans brde at optusnet.com.au
Mon Aug 1 02:35:21 UTC 2016


On Sun, 31 Jul 2016, Konstantin Belousov wrote:

> On Sun, Jul 31, 2016 at 11:11:25PM +1000, Bruce Evans wrote:

I said that I didn't replace (sse2) pagecopy() by bcopy() on amd64 for
Haswell.  Actually I do, for a small improvement on makeworld.  i386
doesn't have (sse*) pagecopy() except in some of my versions, so I
don't need to change anything to get the same improvement on the same
Haswell machine.

>> On Haswell, "rep stos" takes about 25 cycles to start up, and the function
>> call overhead is in the noise.  25 cycles is a lot.  Haswell can move
>> 32 bytes/cycle from L2 to L2, so it misses moving 800 bytes or 1/5 of a
>> page in its startup overhead.  Oops, that is for "rep movs".  "rep stos"
>> is similar.

> The commit message contained a probable explanation of the reason why
> the change demonstrated measurable improvement in non-microbenchmark load.

Pagefaults give some locality, but I think not enough to explain much
of the improvement or the larger negative improvements that I measure.

makeworld isn't a micro-benchmark.  For a tuned ~5.2 world it does
about 32 million pagezero()s.

makeworld does only 2728 pagefaults with warm (VMIO and buffer...)
caches on i386, and 24866 with cold caches.  On amd64, the counts are
about 15% lower.  Page reclaims are about 17 million on i386 and 27
million on amd64.  Either pagefaults each touch a lot of pages (so that
nontemporal stores should help in theory by avoiding busting L1 and
depleting L2 on every pagefault), or there is a lot of pre-zeroing (so
again nontemporal stores should help in theory).
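
Concretely, a nontemporal pagezero looks something like the following.
This is only a userland sketch with SSE2 intrinsics -- the name
nt_pagezero is made up here, and the kernel version is equivalent asm --
but it shows the point: the streaming stores go around the caches.

	#include <emmintrin.h>	/* SSE2: _mm_stream_si128(), _mm_sfence() */

	#define	MY_PAGE_SIZE	4096

	/*
	 * Zero one page with nontemporal stores.  The stores bypass the
	 * caches, so the zeroing doesn't bust L1 or deplete L2; the cost
	 * is that the page isn't cached when it is next touched.  The
	 * page must be 16-byte aligned (page alignment is enough).
	 */
	static void
	nt_pagezero(void *page)
	{
		__m128i zero = _mm_setzero_si128();
		char *p = page;
		char *end = p + MY_PAGE_SIZE;

		for (; p < end; p += 64) {
			_mm_stream_si128((__m128i *)(p +  0), zero);
			_mm_stream_si128((__m128i *)(p + 16), zero);
			_mm_stream_si128((__m128i *)(p + 32), zero);
			_mm_stream_si128((__m128i *)(p + 48), zero);
		}
		_mm_sfence();	/* order the NT stores before the page is used */
	}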

In fact, nontemporal stores help in practice on Turion2.  Haswell has
better caches, and that is probably the main reason that nontemporal
stores are slower there in practice.  Turion2 also benefited from the
old implementation of pagezero in idle.  Clearly, zeroing in idle should
use nontemporal stores.  But when nontemporal stores are much slower,
there are less likely to be enough otherwise-idle cycles to do enough
of them.  Zeroing in idle works poorly now, and is turned off.  On
systems with HTT, idle CPUs aren't created equal and aren't really
idle if using them would steal resources from another CPU.

> That said, the only thing I am answering and asking there is the above
> claim about 25 cycles overhead of rep;stosq on hsw. I am curious how
> the overhead was measured. Note: Agner Fog's tables state that fast mode
> takes <2n uops and has reciprocal throughput of 0.5n worst case and do
> not demonstrate any setup overhead for hsw.

I think the target is 0.25n cycles best case, with n counted in 8-byte
words: 32 bytes/cycle, although the integer moves are only 8 bytes wide.

ISTR that Fog says something about the latency.  He does for older
CPUs.  I've never noticed latency for x86 string instructions being
below about 15 cycles, and the fast string operations have to do more
setup so it would be surprising if they had lower latency.

To measure latency, just time bcopy() and bzero() with different sizes
in a loop and take differences (a sketch of such a timing loop is below,
after the numbers).  Use small sizes to stay in L1 and avoid cache
misses (except for preemption).  I get the following times for amd64 on
Haswell @ 4.080 GHz.  (These times also disprove my claim that bzero()
is just as good as a specialized function -- latency makes it
significantly slower except for unusually large sizes.):

                                  size 4096    size 8192    (speeds in 1e9 B/s)
0.25n throughput:                130.56       130.56
rep movsb alone in a function:    96.5        110.9
45+0.25n:                         96.6        111.0
memcpy (rep movsq in libc):       72.5         92.9
102+0.25n:                        72.7         93.4
rep stosq alone in a function:   105.8        116.7
31+0.25n:                        105.1        116.9
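
In the a+0.25n lines, n is the size in 8-byte words and a is the fitted
startup latency in cycles.  The pure-throughput line is 32 bytes/cycle *
4.080 GHz = 130.56e9 B/s; e.g., 45+0.25n at 4096 bytes is 45 + 0.25*512
= 173 cycles, giving 4096 / (173 / 4.08e9) ~= 96.6e9 B/s.  As a sketch
of the model (not of anything measured):

	#include <stddef.h>

	/*
	 * Speed in bytes/s predicted by the a+0.25n model: a is the
	 * startup latency in cycles, n = bytes/8 is the size in 8-byte
	 * words, hz is the CPU frequency.
	 */
	static double
	model_speed(double a, size_t bytes, double hz)
	{
		double cycles = a + 0.25 * (bytes / 8.0);

		return (bytes / (cycles / hz));
	}

	/* model_speed(45, 4096, 4.08e9) ~= 96.6e9, matching the table. */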

25 is about right for rep stosq inline -- the function call adds about
5 cycles, and that is in the fastest possible case with the call in a
loop.  libc memcpy must be doing something very stupid to take 102
cycles.
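
The timing loop is nothing fancier than the following (a hypothetical
userland reconstruction using memset() and the TSC, not the exact
harness I used; the minimum over many iterations discards preemption,
and the asm barriers keep the compiler from eliding the dead stores):

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>
	#include <x86intrin.h>	/* __rdtsc() */

	#define	barrier()	__asm__ __volatile__("" ::: "memory")

	/*
	 * Minimum TSC ticks to zero `size' bytes over `iters' tries.
	 * (Convert ticks to core cycles if the TSC and core clocks
	 * differ.)
	 */
	static uint64_t
	min_cycles(void *buf, size_t size, int iters)
	{
		uint64_t best = UINT64_MAX;

		for (int i = 0; i < iters; i++) {
			uint64_t t0 = __rdtsc();
			barrier();
			memset(buf, 0, size);
			barrier();
			uint64_t t1 = __rdtsc();
			if (t1 - t0 < best)
				best = t1 - t0;
		}
		return (best);
	}

	int
	main(void)
	{
		static char buf[8192];	/* small enough to stay in L1 */
		uint64_t c4k = min_cycles(buf, 4096, 1000000);
		uint64_t c8k = min_cycles(buf, 8192, 1000000);

		/* time(n) = latency + throughput*n, so the difference */
		/* gives the slope and the residue the startup latency. */
		printf("cycles per extra 4K: %ju, startup latency: %jd\n",
		    (uintmax_t)(c8k - c4k), (intmax_t)(2 * c4k - c8k));
		return (0);
	}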

Note that Haswell can't get very near 0+0.25n, because sizes only
slightly larger than 2*8192 already stop fitting in L1.  Haswell's L1
is too small to get very near to amortising the startup overhead.  The
fastest speed I could find for rep movsb in a function was 115.4 for
size 13K.  Larger sizes are slower because source plus destination no
longer fit in L1 (although 2 * 14K fits in the 32K L1, 14K is still
slower for some reason).

Latency for non-rep string instructions is also interesting.  I think
it is almost as high, making these instructions useless for all purposes
except saving space on all CPUs, and saving time on CPUs almost as old
as the 8088 (on the 8088, instruction fetch was very slow, so it was
faster to use 1-byte instructions where possible).
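
For example (a made-up userland function, not anything in libc or the
kernel):

	#include <stddef.h>

	/*
	 * Zero a buffer with the 1-byte stosb instruction (len must be
	 * nonzero).  The loop body is 3 bytes of code, which is why
	 * this style won on instruction fetch on the 8088; on anything
	 * modern, the per-iteration latency makes it much slower than
	 * an ordinary loop or rep stos.
	 */
	static void
	stosb_zero(void *buf, size_t len)
	{
		__asm__ __volatile__(
		    "cld\n\t"
		    "1:\n\t"
		    "stosb\n\t"
		    "loop 1b"
		    : "+D" (buf), "+c" (len)
		    : "a" (0)
		    : "memory", "cc");
	}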

Bruce

