non-temporal copyin/copyout?

Bruce Evans bde at zeta.org.au
Sat Feb 18 05:29:32 PST 2006


On Fri, 17 Feb 2006, Andrew Gallatin wrote:

> Has anybody considered using non-temporal copies for the in-kernel
> bcopy on amd64?

Yes.  It's probably a small pessimization since large bcopys are (or
should be) rare.  If you really mean copyin/copyout as in the subject
line, then things are less clear.

> A quick test in userspace shows that for large copies, an adapted
> pagecopy (from amd64/amd64/support.S) more than doubles bcopy
> bandwidth from 1.2GB/s to 2.5GB/s on my Athlon64 X2 3800+.

Is this with 5+GHz memory or with slower memory with the source cached?
I've seen 1.7GB/s in non-quick tests in user space with PC3200 memory
overclocked slightly.  This is almost twice as fast as using the best
nontemporal copy method (which gives 0.9GB/s on the same machine).

> I'm bringing this up because I've noticed that FreeBSD 10GbE
> performance is far below Solaris/amd64 and linux/x86_64 when using the
> PCI-e 10GbE adaptor that I'm doing drivers for.  For example, Solaris
> can receive a netperf TCP stream at 9.75Gb/sec while using only 47%
> CPU as measured by vmstat.  (eg, it is using a little less than a
> single core).  In contrast, FreeBSD is limited to 7.7Gb/sec, and uses
> nearly 90% CPU.  When profiling with hwpmc, I see a profile which
> shows up to 70% of the time is spent in copyout.

The problem with always using nontemporal copies is that they might
be much slower if the target is already cached (which would often be
the case, for example, if the same small buffer is used repeatedly),
and they would be slower if the application actually uses the data
soon enough after reading it that it doesn't become uncached (if it
was cached as a side effect of the copy).

I once thought that movnt* doesn't take any advantage of cached data.
Testing on AthlonXP showed that this isn't much of a problem -- repeated
movnt{q,ps}'s to the same small buffer go almost as fast (to within about
10%, with half the extra overhead for prefetchnta) as the best temporal
method, provided the target buffer is read into the cache first (otherwise
temporal copies are limited to the bandwidth of main memory, which is 3
to 5 times slower on my test machines).

However, on my Athlon64 and sledge's Opteron, movnt{q,ps,i} is limited
to the speed of main memory whether or not the target buffer is pre-read.

If it weren't for the Athlon64 behaviour, then using nontemporal copies
for all larger copyin/outs would probably be best.  "Large" wouldn't
need to be very large for the 10% overhead to be a reasonable tradeoff.
If the target were cached then the copy would go 10% slower (than very
fast); if the target weren't cached but the data weren't actually
nontemporal (because the application uses it soon), then the copy would
go as fast as possible and the cost of reading the data into the cache
would be paid later, for a total cost about the same as reading it as
part of the copy.

With the Athlon64 behaviour, I think nontemporal copies should only be
used in cases where it is known that the copies really are nontemporal.
We use them for page copying now because this is (almost) known.  For
copyout(), it would be certainly known only for copies that are so large
that they can't fit in the L2 cache.  copyin() might be different, since
it might often be known that the data will be DMA'ed out by a driver and
need never be cached.

Bruce