svn commit: r334419 - head/sys/amd64/amd64

Thu May 31 17:16:08 UTC 2018

On Thu, 31 May 2018, Mateusz Guzik wrote:

> On Thu, May 31, 2018 at 09:19:58PM +1000, Bruce Evans wrote:
>> On Thu, 31 May 2018, Mateusz Guzik wrote:
>>
>>> Log:
>>>  amd64: switch pagecopy from non-temporal stores to rep movsq
>>
>> As for pagezero, this pessimizes for machines with slow movsq and/or
>> caches (mostly older machines).
>
> Can you give examples of such machines? I tested with old yellers like
> Nehalem and Westmere, no loss.

Original Athlon64, and Turion2 on a 2006 laptop.  I already mentioned
Turion64, and my commit to fix the loss of nontemporal pagezero on amd64
gives timing info for both in a mixed-up way (only the Athlon has PC3200).
sse2_pagezero was actually connected at the time, but only to idlezero
and that was removed soon after.  Nontemporal stores are clearly best
for idlezero, but doing anything in idle is not so good since it might
waste power or steal resources from a shared core or increase latency...
It was good on the Turion2 in 2007.  Turion2 doesn't have a shared core
or many Cx states so it uses almost as much power zeroing pages as
idling.

>>>  The copied data is accessed in part soon after and it results with
>>>  additional cache misses during a -j 1 buildkernel WITHOUT_CTF=yes
>>>  KERNFAST=1, as measured with pmc stat.
>>
>> Of course it causes more cache misses later, but for large data going
>> through slow caches is much slower so the cache misses later cost less.
>
> The note was predominantly for people who would want to defend nt stores
> claiming it prevents evicting cached data by data being copied and then
> mostly not accessed.

I read it more carefully and can interpret it to say the opposite of
what you want.  Since a new system gets no benefit in real time, the
only significant differences are probably tiny power savings on new
systems and slower runtimes on older systems.

However, I saw tiny improvements in real time for makeworld with
pagecopy = bcopy on Haswell.  Well below 1%, while improvements for
pagezero = bzero were closer to 1%.  I now have better statistics
generation and analysis and recently spent a lot of time trying to
verify scheduler improvements of about 1%.

>> It is negatively useful to write this in asm.  This is now just memcpy()
>> and the asm version of that is fast enough, though movsq takes too long
>> to start up.  This memcpy() might be inlined and then it would be
>> insignificantly faster than the function call.  __builtin_memcpy() won't
>> actually inline it, since its size is large and compilers know that they
>> don't understand memory.
>
> It is true that currently it can be the current memcpy with almost no loss.
>
> However, even on a kernel with #define memcpy __builtin_memcpy, there
> are plenty of calls with very small sizes. See the list here (taken
> during buildkernel):
>
> https://people.freebsd.org/~mjg/bufsizes.txt
>
> In particular you can find a lot of < 64 entries.

But pagecopy is 4K.  That is still too small to amortize string instruction
overhead for Haswell in the cached case -- see my old mail -- but not much
is to be gained by using a specialized version since the cached case is
very fast.
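
A minimal sketch of the "pagecopy is now just memcpy()" point, assuming a
4K page.  The names pagecopy_sketch and SKETCH_PAGE_SIZE are made up for
illustration and are not the kernel's; the argument order (from, to) only
mirrors how pagecopy is usually called.  Because the size is a compile-time
constant, the compiler is free to inline the copy or to call memcpy().

```c
#include <string.h>

/* Hypothetical constant for illustration: a 4K page. */
#define SKETCH_PAGE_SIZE 4096

/*
 * Hypothetical C stand-in for a page copy: a fixed-size memcpy().
 * Whether this is inlined or turned into a call is up to the compiler.
 */
static inline void
pagecopy_sketch(const void *from, void *to)
{
	memcpy(to, from, SKETCH_PAGE_SIZE);
}
```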

> Spinning up rep stosb for such sizes even with ERMS turns out to be
> pessimal even on Skylake. In other words, the primitive will need to get
> special casing for small-sized callers. Known big-size callers should be
> moved to something else. As such, pointing pagecopy at the primitive is
> imo a bad idea.

That is with most current implementations of ERMS.  I expect the startup
overhead will be small after a couple more generations of CPUs.  Then
optimizations to not use string instructions will be as silly as 30-year-old
optimizations to use them.  Or my 20-year-old optimizations to use
the FPU for bcopy, bzero, copyin and copyout, but not pagezero or pagecopy.
This optimization was good for just 1 generation of CPUs (Pentium1).
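
The special casing for small sizes discussed above could look roughly like
the following sketch.  Everything here is hypothetical: the cutoff of 64 is
only suggested by the "< 64 entries" remark and would have to be measured
per CPU, and memcpy() merely stands in for a rep movs based bulk primitive.
A compiler may also rewrite the byte loop on its own, so this shows the
shape of the dispatch and not the generated code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff; a real value would come from measurement. */
#define SMALL_COPY_MAX 64

enum copy_path { COPY_SMALL, COPY_BULK };

/*
 * Copy len bytes and report which path was taken.  Small copies use
 * a plain loop to avoid string-instruction startup cost; larger ones
 * go to the bulk primitive (memcpy() as a stand-in here).
 */
static enum copy_path
copy_dispatch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if (len < SMALL_COPY_MAX) {
		while (len-- > 0)
			*d++ = *s++;
		return (COPY_SMALL);
	}
	memcpy(dst, src, len);
	return (COPY_BULK);
}
```

If startup overhead does shrink on later CPUs, the whole function collapses
to the bulk call, which is the sense in which such special cases age badly.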

i386 still has a silly 20-year-old i686_pagezero which is still used
on all i386's that don't have SSE2 (not many of these now).  This would
have been good for just 1 or 2 generations of CPUs (PentiumPro and
maybe Celeron) if it were written correctly.  It is intended to avoid
writing zeros to cache lines that are already zero, as was good on
PentiumPro.  But it actually zeros almost everything after finding a
nonzero byte.  Thus it is a pessimization even on PentiumPro unless
many pages passed to it are already all zero.
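
To make the difference concrete, here is a simplified C model of the two
behaviors, counting stores.  It is not the actual i686_pagezero code (that
is asm); the function names and SKETCH_PAGE_WORDS are made up for
illustration, and the second function only models "zeros everything after
the first nonzero word" in its crudest form.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: a 4K page viewed as 32-bit words, as on i386. */
#define SKETCH_PAGE_WORDS 1024

/* Intended behavior: store only to words that are not already zero. */
static size_t
pagezero_skip_zero(uint32_t *page)
{
	size_t i, stores = 0;

	for (i = 0; i < SKETCH_PAGE_WORDS; i++) {
		if (page[i] != 0) {
			page[i] = 0;
			stores++;
		}
	}
	return (stores);
}

/*
 * Simplified model of the behavior described above: scan until the
 * first nonzero word, then zero the rest of the page unconditionally.
 */
static size_t
pagezero_zero_rest(uint32_t *page)
{
	size_t i = 0, stores = 0;

	while (i < SKETCH_PAGE_WORDS && page[i] == 0)
		i++;
	for (; i < SKETCH_PAGE_WORDS; i++) {
		page[i] = 0;
		stores++;
	}
	return (stores);
}
```

With a single nonzero word near the start of the page, the first version
does one store and the second rewrites nearly the whole page, so the second
only wins when the page is already entirely zero.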

Bruce