amd64 pmap pagecopy() optimization?

Sun Dec 12 16:52:44 UTC 2010

Hi,

In svn r127653, a micro-optimized pagecopy() implementation was added to
amd64's support.S. That pagecopy() prefetches the entire page first and
then uses a partly-unrolled loop of loads and non-temporal stores. The
commit notes 'it is roughly four times faster than bcopy() for uncached
pages'.
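
For readers without the source at hand, the structure described above can
be sketched roughly as follows in C. This is an illustration only, not the
code from support.S: the function name, the SK_* macros, the 32-bit word
size, the 64-byte line size, the NTA prefetch hint and the four-way unroll
are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_prefetch, _mm_sfence */
#define SK_PREFETCH(p)     _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#define SK_NT_STORE(p, v)  _mm_stream_si32((int *)(p), (int)(v))
#define SK_FENCE()         _mm_sfence()
#else
/* Fallback so the sketch still builds elsewhere: no prefetch, plain stores. */
#define SK_PREFETCH(p)     ((void)(p))
#define SK_NT_STORE(p, v)  (*(p) = (v))
#define SK_FENCE()         ((void)0)
#endif

#define SK_PAGE_SIZE 4096u  /* assumed page size */
#define SK_LINE_SIZE 64u    /* assumed cache line size */

/* Hypothetical name; copies one page using "prefetch all, then copy". */
void sketch_pagecopy(const void *from, void *to)
{
    const uint32_t *src = from;
    uint32_t *dst = to;
    const char *line = from;
    size_t i;

    /* Both pointers must be word-aligned (pages normally are). */
    assert((uintptr_t)from % sizeof(uint32_t) == 0);
    assert((uintptr_t)to % sizeof(uint32_t) == 0);

    /* Phase 1: issue a prefetch for every cache line of the source page. */
    for (i = 0; i < SK_PAGE_SIZE; i += SK_LINE_SIZE)
        SK_PREFETCH(line + i);

    /* Phase 2: partly unrolled loop, four load / NT store pairs per pass. */
    for (i = 0; i < SK_PAGE_SIZE / sizeof(uint32_t); i += 4) {
        SK_NT_STORE(&dst[i + 0], src[i + 0]);
        SK_NT_STORE(&dst[i + 1], src[i + 1]);
        SK_NT_STORE(&dst[i + 2], src[i + 2]);
        SK_NT_STORE(&dst[i + 3], src[i + 3]);
    }

    /* Order the non-temporal stores before any later stores. */
    SK_FENCE();
}
```

On i386 the SSE2 path needs -msse2 (or similar); without it the fallback
macros turn this into an ordinary word copy.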

Just wondering, how was this measured? I ported the routine to i386 and
tried it out in userland, but found it between four and six times slower
than both the BSD and GNU libc bcopy() implementations. I admit I did not
try very hard to measure only uncached pages, though...
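
A userland comparison along these lines could look something like the
sketch below. It is not the harness used for the numbers above or for the
commit message; the names (page_copy_fn, libc_pagecopy, time_page_copies)
are hypothetical, memcpy() stands in for bcopy(), and clock() is used only
because it is standard C. The idea is to copy each page exactly once from a
buffer much larger than the last-level cache, so that most source pages are
probably not cached when they are read.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define BENCH_PAGE_SIZE 4096u  /* assumed page size */

/* Hypothetical signature shared by every routine under test. */
typedef void (*page_copy_fn)(const void *from, void *to);

/* Baseline: the libc copy, wrapped to match page_copy_fn. */
void libc_pagecopy(const void *from, void *to)
{
    memcpy(to, from, BENCH_PAGE_SIZE);
}

/*
 * Copy npages consecutive pages from src to dst, once each, and return
 * the CPU time used in seconds. clock() is coarse, so npages should be
 * large enough for the run to take a measurable amount of time.
 */
double time_page_copies(page_copy_fn fn, const unsigned char *src,
                        unsigned char *dst, size_t npages)
{
    clock_t start, end;
    size_t i;

    assert(fn != NULL);

    start = clock();
    for (i = 0; i < npages; i++)
        fn(src + i * BENCH_PAGE_SIZE, dst + i * BENCH_PAGE_SIZE);
    end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Running the same harness over a buffer that fits in cache, and repeating
the copies, would give the cached case for comparison.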

Also, why prefetch the entire page before the load / NT store loop? If I
read the Intel optimization guide correctly, a loop of
prefetch(n+1) / load / store would be a better choice. (I tried this on
i386 as well; it was a bit faster than the current style, but still
nowhere near bcopy().)
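
To make the prefetch(n+1) / load / store idea concrete, here is a minimal
C sketch of one way to write it. It is an illustration under assumptions,
not the routine that was actually tested: the function name, the SK_*
macros, the 32-bit word size, the 64-byte line size and the NTA hint are
all made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_prefetch, _mm_sfence */
#define SK_PREFETCH(p)     _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#define SK_NT_STORE(p, v)  _mm_stream_si32((int *)(p), (int)(v))
#define SK_FENCE()         _mm_sfence()
#else
/* Fallback so the sketch still builds elsewhere: no prefetch, plain stores. */
#define SK_PREFETCH(p)     ((void)(p))
#define SK_NT_STORE(p, v)  (*(p) = (v))
#define SK_FENCE()         ((void)0)
#endif

#define SK_PAGE_SIZE 4096u  /* assumed page size */
#define SK_LINE_SIZE 64u    /* assumed cache line size */

/* Hypothetical name; prefetches line n+1 while copying line n. */
void sketch_pagecopy_interleaved(const void *from, void *to)
{
    const uint32_t *src = from;
    uint32_t *dst = to;
    const size_t words_per_line = SK_LINE_SIZE / sizeof(uint32_t);
    const size_t words_per_page = SK_PAGE_SIZE / sizeof(uint32_t);
    size_t base, i;

    /* Both pointers must be word-aligned (pages normally are). */
    assert((uintptr_t)from % sizeof(uint32_t) == 0);
    assert((uintptr_t)to % sizeof(uint32_t) == 0);

    for (base = 0; base < words_per_page; base += words_per_line) {
        /* Request the next source line, but never reach past the page. */
        if (base + words_per_line < words_per_page)
            SK_PREFETCH(&src[base + words_per_line]);

        /* Copy the current line with load / non-temporal store pairs. */
        for (i = 0; i < words_per_line; i++)
            SK_NT_STORE(&dst[base + i], src[base + i]);
    }

    /* Order the non-temporal stores before any later stores. */
    SK_FENCE();
}
```

In this variant the first cache line is fetched on demand, and each
prefetch only has the time of one line's worth of copying to complete, so
the best prefetch distance may well be more than one line ahead.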

Thanks!
-- vs