fast bcopy...
Luigi Rizzo
rizzo at iet.unipi.it
Wed May 2 18:06:13 UTC 2012
As part of my netmap investigations, I was looking at how
expensive memory copies are, and here are a couple of findings
(the first one is obvious, the second one less so).
1. Especially on 64-bit machines, always copy multiples of at
least 8 bytes (possibly even larger units). The bcopy code
in amd64 seems to waste an extra 20ns (on a 3.4 GHz machine)
when processing blocks of size 8n + {4,5,6,7}.
The difference is significant; on that machine I measured:
bcopy(src, dst, 1) ~12.9ns (data in L1 cache)
bcopy(src, dst, 3) ~12.9ns (data in L1 cache)
bcopy(src, dst, 4) ~33.4ns (data in L1 cache) <--- NOTE
bcopy(src, dst, 32) ~12.9ns (data in L1 cache)
bcopy(src, dst, 63) ~33.4ns (data in L1 cache) <--- NOTE
bcopy(src, dst, 64) ~12.9ns (data in L1 cache)
Note how the two marked lines are much slower than the others.
The same thing happens with data not in L1:
bcopy(src, dst, 64) ~ 22ns (not in L1)
bcopy(src, dst, 63) ~ 44ns (not in L1)
...
Continuing the tests on larger sizes, for the next item:
bcopy(src, dst,256) ~19.8ns (data in L1 cache)
bcopy(src, dst,512) ~28.8ns (data in L1 cache)
bcopy(src, dst,1K) ~39.6ns (data in L1 cache)
bcopy(src, dst,4K) ~95.2ns (data in L1 cache)
On an older P4 running FreeBSD 4 (32 bit), bcopy seems less
sensitive to odd sizes.
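The post does not say how the numbers above were taken. As a rough sketch of one way to get comparable per-call figures, the helper below times a loop of bcopy() calls with clock_gettime(). The name time_bcopy_ns, the iteration count, and the compiler barrier are assumptions for illustration, not the harness actually used.

```c
#define _DEFAULT_SOURCE         /* glibc: expose bcopy() and clock_gettime() */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>            /* bcopy() */
#include <time.h>

/*
 * Hypothetical helper: average nanoseconds per bcopy() of len bytes.
 * src and dst must each hold at least len bytes and must not overlap.
 */
static double
time_bcopy_ns(const void *src, void *dst, size_t len, unsigned long iters)
{
        struct timespec t0, t1;
        unsigned long i;

        assert(iters > 0);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < iters; i++) {
                bcopy(src, dst, len);
                /* compiler barrier (GCC/Clang) so the copy is not optimized away */
                __asm__ __volatile__("" : : "r"(dst) : "memory");
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / iters;
}
```

Calling it with len = 63 and len = 64 on the same small buffers (so the data stays in L1) is the kind of comparison listed above; the loop overhead is included in the result, so only differences between sizes are meaningful.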
2. Apparently, bcopy is not the fastest way to copy memory.
For small blocks whose size is a multiple of 32-64 bytes, I noticed
that the following is a lot faster (breaking even at about 1 KByte):
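#include <stdint.h>

/* l must be a multiple of 32; src and dst must not overlap */
static inline void
fast_bcopy(void *_src, void *_dst, int l)
{
        uint64_t *src = _src;
        uint64_t *dst = _dst;

        /* 4 x 8 bytes = 32 bytes per iteration */
        for (; l > 0; l-=32) {
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
        }
}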
fast_bcopy(src, dst, 32) ~ 1.8ns (data in L1 cache)
fast_bcopy(src, dst, 64) ~ 2.9ns (data in L1 cache)
fast_bcopy(src, dst,256) ~10.1ns (data in L1 cache)
fast_bcopy(src, dst,512) ~19.5ns (data in L1 cache)
fast_bcopy(src, dst,1K) ~38.4ns (data in L1 cache)
fast_bcopy(src, dst,4K) ~152.0ns (data in L1 cache)
fast_bcopy(src, dst, 32) ~15.3ns (not in L1)
fast_bcopy(src, dst,256) ~38.7ns (not in L1)
...
The old P4/32 bit also exhibits similar results.
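One consequence of the loop stepping 32 bytes at a time is that a length that is not a multiple of 32 gets rounded up: asking for 40 bytes writes 64. The sketch below repeats the loop so it stands alone and counts how many destination bytes were modified. The helper names fast_bcopy_span and count_bytes_touched are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same loop as fast_bcopy() in the post, repeated so the sketch stands alone. */
static inline void
fast_bcopy(void *_src, void *_dst, int l)
{
        uint64_t *src = _src;
        uint64_t *dst = _dst;

        for (; l > 0; l -= 32) {
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
        }
}

/* Hypothetical helper: bytes actually written for a requested length l. */
static inline int
fast_bcopy_span(int l)
{
        return l <= 0 ? 0 : (l + 31) / 32 * 32;
}

/* Hypothetical check: copy l bytes of 0xff over zeros, count changed bytes. */
static int
count_bytes_touched(int l)
{
        uint64_t src[16], dst[16];      /* 128 bytes each, 8-byte aligned */
        int i, touched = 0;

        memset(src, 0xff, sizeof(src));
        memset(dst, 0x00, sizeof(dst));
        assert(fast_bcopy_span(l) <= (int)sizeof(dst));
        fast_bcopy(src, dst, l);
        for (i = 0; i < (int)sizeof(dst); i++)
                touched += ((unsigned char *)dst)[i] != 0;
        return touched;
}
```

Both buffers therefore need room for the rounded-up length, not just the requested one.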
Conclusion: if you have to copy packets you might be better off
padding the length to a multiple of 32, and using the following
function to get the best of both worlds.
Sprinkle some prefetch() for better taste.
#include <stdint.h>
#include <strings.h>    /* bcopy() */

#define likely(x)       __builtin_expect(!!(x), 1)
#define unlikely(x)     __builtin_expect(!!(x), 0)

// XXX only for multiples of 32 bytes, non overlapped.
static inline void
good_bcopy(void *_src, void *_dst, int l)
{
        uint64_t *src = _src;
        uint64_t *dst = _dst;

        /* around 1 KByte the unrolled loop stops winning, so use bcopy */
        if (unlikely(l >= 1024)) {
                bcopy(src, dst, l);
                return;
        }
        /* 4 x 8 bytes = 32 bytes per iteration */
        for (; l > 0; l-=32) {
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
        }
}
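For the padding and prefetch suggestions, here is a minimal sketch assuming GCC or Clang (for __builtin_prefetch). The names pad32 and prefetch_bcopy, and the 64-byte prefetch distance, are guesses for illustration only; whether prefetching helps at all has to be measured on the target machine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round a non-negative length up to a multiple of 32. */
static inline int
pad32(int l)
{
        return (l + 31) & ~31;
}

/*
 * Hypothetical variant of the unrolled loop with a source prefetch.
 * Only for multiples of 32 bytes, non overlapped, 8-byte aligned buffers.
 */
static inline void
prefetch_bcopy(const void *_src, void *_dst, int l)
{
        const uint64_t *src = _src;
        uint64_t *dst = _dst;

        assert(l % 32 == 0);
        for (; l > 0; l -= 32) {
                /* prefetch 64 bytes ahead while that is still inside the buffer */
                if (l > 64)
                        __builtin_prefetch(src + 8);
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
        }
}
```

A caller with a 100-byte packet would pass pad32(100), i.e. 128, which only works if both buffers were allocated with room for the padded length.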
cheers
luigi