non-temporal copyin/copyout?

Sat Feb 18 14:02:51 PST 2006

Bruce Evans writes:
 > On Fri, 17 Feb 2006, Andrew Gallatin wrote:
 > 
 > > Has anybody considered using non-temporal copies for the in-kernel
 > > bcopy on amd64?
 > 
 > Yes.  It's probably a small pessimization since large bcopys are (or
 > should be) rare.  If you really mean copyin/copyout as in the subject
 > line, then things are less clear.

Yes, copyin/copyout is what I really meant.

 > > A quick test in userspace shows that for large copies, an adapted
 > > pagecopy (from amd64/amd64/support.S) more than doubles bcopy
 > > bandwidth from 1.2GB/s to 2.5GB/s on my Athlon64 X2 3800+.
 > 
 > Is this with 5+GHz memory or with slower memory with the source cached?
 > I've seen 1.7GB/s in non-quick tests in user space with PC3200 memory
 > overclocked slightly.  This is almost twice as fast as using the best
 > nontemporal copy method (which gives 0.9GB/s on the same machine).

This is a "DFI Lanparty UTnF4 Ultra-D" with an Nforce 4 chipset and two
256 MB sticks of PC3200 RAM.  The timings I mention above closely
match the lmbench "bcopy" benchmark for large buffers (> L2 cache)
when run on FreeBSD vs when run on Solaris (which uses a non-temporal
bcopy even in userspace).
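
For anyone who wants to reproduce this kind of measurement, here is a
minimal sketch of a userspace timing loop.  It is not the test I ran
and not lmbench; the names copy_fn and copy_bandwidth are made up for
illustration, and it assumes a POSIX system with clock_gettime().

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any routine with memcpy's signature can be plugged in here. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/*
 * Copy `size` bytes `iters` times and return the bandwidth in GB/s
 * (10^9 bytes per second).  Returns -1.0 on allocation failure or if
 * the destination does not match the source afterwards.
 */
static double
copy_bandwidth(copy_fn fn, size_t size, int iters)
{
	struct timespec t0, t1;
	double secs, gbs;
	char *src = malloc(size);
	char *dst = malloc(size);

	if (src == NULL || dst == NULL) {
		free(src);
		free(dst);
		return (-1.0);
	}
	/* Touch both buffers first so page faults are not timed. */
	memset(src, 0xa5, size);
	memset(dst, 0, size);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < iters; i++)
		fn(dst, src, size);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (double)(t1.tv_sec - t0.tv_sec) +
	    (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	gbs = (double)size * (double)iters / secs / 1e9;

	/* Read the destination back so the copies cannot be discarded. */
	if (memcmp(dst, src, size) != 0)
		gbs = -1.0;
	free(src);
	free(dst);
	return (gbs);
}
```

Calling copy_bandwidth(memcpy, size, iters) with a size well above the
L2 cache size, and again with a non-temporal routine, gives the two
figures to compare.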

<....>

 > With the Athlon64 behaviour, I think nontemporal copies should only be
 > used in cases where it is known that the copies really are nontemporal.
 > We use them for page copying now because this is (almost) known.  For
 > copyout(), it would be certainly known only for copies that are so large
 > that they can't fit in the L2 cache.  copyin() might be different, since
 > it might often be known that the data will be DMA'ed out by a driver and
 > need never be cached.

I think you could make arguments for doing a non-temporal copy for
both copyin and copyout when the size exceeds some tunable threshold.
Solaris even uses a fixed threshold, and I believe the threshold is
quite small (128 bytes).  See
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/intel/ia32/ml/copy.s
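
To make the threshold idea concrete, here is a rough userspace sketch
in C using SSE2 intrinsics.  It is not the Solaris code and not a
proposed kernel patch; the names nt_threshold and copy_maybe_nt are
hypothetical, and 128 is just the figure I recalled above.

```c
#include <assert.h>
#include <emmintrin.h>	/* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tunable: copies of at least this many bytes bypass the cache. */
static size_t nt_threshold = 128;

static void
copy_maybe_nt(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t head;

	/* Small copies take the ordinary cached path. */
	if (len < nt_threshold) {
		memcpy(d, s, len);
		return;
	}

	/* Streaming stores need a 16-byte aligned destination. */
	head = (size_t)(-(uintptr_t)d & 15);
	if (head > len)
		head = len;
	memcpy(d, s, head);
	d += head;
	s += head;
	len -= head;

	assert(len < 16 || ((uintptr_t)d & 15) == 0);
	while (len >= 16) {
		/* Unaligned load from the source, non-temporal store to dst. */
		__m128i v = _mm_loadu_si128((const __m128i *)s);
		_mm_stream_si128((__m128i *)d, v);
		d += 16;
		s += 16;
		len -= 16;
	}
	/* Order the streaming stores before any later stores. */
	_mm_sfence();

	/* Remaining tail of fewer than 16 bytes. */
	memcpy(d, s, len);
}
```

A real copyin/copyout would also have to handle faults on the user
address, which this sketch ignores entirely.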

Maybe I'm being naive, but I would assume that most bulk data, whether
copied in or copied out, should never be accessed by the kernel in a
high-performance system.  Most Gigabit or better, and many 100Mb
network drivers do checksum offloading on both send and receive, so
there is no need for the kernel to touch any data which is copied in
or out for network sends or receives.  Further, I can imagine a 
network server (like a userspace nfs server or samba) turning around
and writing data to disk which it received via a socket read without
ever looking at the buffer.

I don't know the storage system as well as the networking system, but
unless a disk driver is using PIO, I don't think the data is ever
touched by the kernel.

This is all academic, as I don't know enough about x86_64 asm to
implement any of this.  But I have an ideal testbed if anybody
would be inclined to implement it.  

Drew