amd64 pmap pagecopy() optimization()?
vsrinivas at dragonflybsd.org
Sun Dec 12 16:52:44 UTC 2010
In svn r127653, a micro-optimized pagecopy() implementation was added to
amd64's support.S. This pagecopy() prefetches the entire page first and
then uses a partly-unrolled loop of loads and non-temporal stores. The
commit notes 'it is roughly four times faster than bcopy() for uncached
Just wondering, how was this measured? I ported the routine to i386 and
tried it out in userland, but found it between four and six times slower
than the BSD and GNU libc bcopy() implementations. I admit I did not try
very hard to measure on only uncached pages, though...
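
For reference, a minimal sketch of the kind of userland timing loop I
mean. Everything here is illustrative: the names (page_copy_fn,
libc_page_copy, ns_per_page) are made up, the 4096-byte page size is
assumed, and memcpy() stands in for bcopy(). It only approximates the
uncached case by touching each page once; the buffers would need to be
much larger than the last-level cache for that to hold.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

#define PAGE_BYTES 4096  /* assumed page size */

/* Any routine that copies exactly one page from src to dst. */
typedef void (*page_copy_fn)(void *dst, const void *src);

/* Baseline: the C library copy, standing in for bcopy(). */
void libc_page_copy(void *dst, const void *src)
{
    memcpy(dst, src, PAGE_BYTES);
}

/*
 * Copy npages pages once each, front to back, and return the average
 * CPU time per page in nanoseconds. Each page is touched only once, so
 * with buffers much larger than the cache most source pages should be
 * cold when they are read (an assumption, not something this checks).
 */
double ns_per_page(page_copy_fn fn, unsigned char *dst,
                   const unsigned char *src, size_t npages)
{
    clock_t start = clock();
    for (size_t i = 0; i < npages; i++)
        fn(dst + i * PAGE_BYTES, src + i * PAGE_BYTES);
    clock_t stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC * 1e9 / (double)npages;
}
```

Passing a ported pagecopy() as fn in place of libc_page_copy would give
a like-for-like comparison; clock() is coarse, so npages has to be large
for the average to mean anything.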
Also, why prefetch the entire page before the load / NT store loop? If I
read the Intel optimization guide correctly, wouldn't a loop of
prefetch(n+1) / load / store be a better call? (I tried this on i386
as well; it was a bit faster than the current style, but still nowhere near
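
To make the two orderings concrete, here is a rough C sketch. It is not
the support.S code. The function names are hypothetical, the 4096-byte
page and 64-byte cache line are assumed, and both pointers are assumed
to be at least 16-byte aligned. It uses GCC/Clang's __builtin_prefetch
with locality 0 and, where SSE2 is available, _mm_stream_si128 for the
non-temporal stores; otherwise it falls back to plain byte stores.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_load_si128, _mm_stream_si128, _mm_sfence */
#endif

#define PAGE_BYTES 4096  /* assumed page size */
#define LINE_BYTES 64    /* assumed cache-line size */

/* Copy one cache line, with non-temporal stores when SSE2 is present. */
static void copy_line(unsigned char *dst, const unsigned char *src)
{
#if defined(__SSE2__)
    for (size_t i = 0; i < LINE_BYTES; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(src + i));
        _mm_stream_si128((__m128i *)(dst + i), v);
    }
#else
    for (size_t i = 0; i < LINE_BYTES; i++)
        dst[i] = src[i];
#endif
}

/* Make the non-temporal stores globally visible before returning. */
static void finish_stores(void)
{
#if defined(__SSE2__)
    _mm_sfence();
#endif
}

/* Current style: prefetch the whole page, then run the copy loop. */
void pagecopy_prefetch_first(void *to, const void *from)
{
    unsigned char *dst = to;
    const unsigned char *src = from;

    for (size_t off = 0; off < PAGE_BYTES; off += LINE_BYTES)
        __builtin_prefetch(src + off, 0, 0);
    for (size_t off = 0; off < PAGE_BYTES; off += LINE_BYTES)
        copy_line(dst + off, src + off);
    finish_stores();
}

/* Alternative: prefetch line n+1 while copying line n. */
void pagecopy_interleaved(void *to, const void *from)
{
    unsigned char *dst = to;
    const unsigned char *src = from;

    for (size_t off = 0; off < PAGE_BYTES; off += LINE_BYTES) {
        if (off + LINE_BYTES < PAGE_BYTES)
            __builtin_prefetch(src + off + LINE_BYTES, 0, 0);
        copy_line(dst + off, src + off);
    }
    finish_stores();
}
```

Both variants produce the same copy; the only difference is when the
prefetches are issued relative to the loads.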