fast bcopy...
Steven Atreju
snatreju at googlemail.com
Thu May 3 10:28:52 UTC 2012
K. Macy wrote [2012-05-03 02:58+0200]:
> It's highly chipset and processor dependent what works best.
Yes, of course.
Though i was kinda shocked when i first saw this:
http://marc.info/?l=dragonfly-commits&m=132241713812022&w=2
So we don't use our assembler version for new gccs and HAMMER or
SSE3+ (the choice of these was rather arbitrary, except that they
already existed, which allowed an instant implementation).
> Intel now has non-temporal loads and stores which work much
> better in some cases but provide little benefit in others.
Yes, our 2002 tests have shown that these were *extremely*
dependent upon alignment. (Note: 2002. o-)
Hmm, it doesn't really matter, but i guess this is a good time to
thank the FreeBSD hackers for that FPU stack FILD/FISTP idea!
I'll append the copy-related notes of our doc/memperf.txt.
Thanks,
> -Kip
Steven.
I. x86 (AMD Athlon 1600+, 256MB DDR, 133/133 FSB)
-------------------------------------------------
COPY
....
The basic idea is always the same:
- Branch off to REPZ MOVSB if less than 16 bytes to go.
- Align at least one pointer on a nice boundary (&3 or &7).
(Done by a byte loop; one 4/8 store is more expensive here.)
We always align the _from pointer due to test experience.
- DEPENDENT.
- Do the remaining bytes (3 at most) with unrolled MOVSBs.
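The steps above can be sketched in portable C. This is only an illustration of the control flow, not our assembler code: the name sketch_copy is made up, the DEPENDENT step is shown in its simplest (word-wise, "REPZ MOVSL"-like) form, and memcpy() of 4 bytes stands in for a single MOVSL so that the unaligned store stays well-defined C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a portable stand-in for the assembler routine. */
void *sketch_copy(void *to, const void *from, size_t len)
{
    unsigned char *d = to;
    const unsigned char *s = from;

    assert((d != NULL && s != NULL) || len == 0);

    /* Less than 16 bytes to go: plain byte copy (the REPZ MOVSB branch). */
    if (len < 16) {
        for (; len > 0; --len)
            *d++ = *s++;
        return to;
    }

    /* Align the from pointer on a 4-byte boundary with a byte loop. */
    while (((uintptr_t)s & 3) != 0) {
        *d++ = *s++;
        --len;
    }

    /* DEPENDENT step, simplest variant: copy 4 bytes at a time. */
    while (len >= 4) {
        uint32_t w;
        memcpy(&w, s, sizeof w);    /* load from the aligned from pointer */
        memcpy(d, &w, sizeof w);    /* store; the to pointer may be unaligned */
        s += 4;
        d += 4;
        len -= 4;
    }

    /* At most 3 bytes remain. */
    for (; len > 0; --len)
        *d++ = *s++;
    return to;
}
```

A call like sketch_copy(dst + 1, src + 3, 37) exercises all four steps: the byte loop aligns src + 3, the word loop moves the bulk, and the tail handles the rest.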
DEPENDENT:
- !SF_FPU && !defined(SF_X86_MMX): just a matter of REPZ MOVSL.
- Otherwise we use three different loops over 64, 16 and 8 bytes,
respectively. If more than 4 bytes remain after that we use one
additional MOVSL.
Note that the 8 byte loop is not a loop but executes once only.
The big loop uses pairs of MOVNTQ/MOVQ, MOVQ/MOVQ and FILD/FISTP, if
_SSE, _MMX or _FPU, respectively. The _SSE loop exists in addition and
is only used if the pointer we did not align (the _to pointer) happens
to be aligned as well.
The two smaller ones never use SSE's non-temporal moves; this way we
can simply go ahead no matter whether the to pointer is aligned or not.
Tests demonstrated that non-temporal is no win for them anyway.
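A rough C sketch of the 64/16/8(/4) cascade follows. The names move8 and tiered_copy are invented for illustration, and an 8-byte memcpy() stands in for one MOVQ/MOVQ or FILD/FISTP pair. It also assumes that "more than 4 bytes remain" means "4 or more", since otherwise the tail could be longer than 3 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Move one 8-byte unit; stands in for a MOVQ/MOVQ or FILD/FISTP pair. */
static void move8(unsigned char *d, const unsigned char *s)
{
    uint64_t q;
    memcpy(&q, s, sizeof q);
    memcpy(d, &q, sizeof q);
}

/*
 * Hypothetical helper: copies as much of len as the 64/16/8/4 steps cover
 * and returns that count; the caller finishes the remaining 0..3 bytes.
 */
size_t tiered_copy(unsigned char *d, const unsigned char *s, size_t len)
{
    size_t done = 0;
    size_t k;

    assert((d != NULL && s != NULL) || len == 0);

    /* Big loop: 64 bytes per iteration. */
    while (len - done >= 64) {
        for (k = 0; k < 64; k += 8)
            move8(d + done + k, s + done + k);
        done += 64;
    }

    /* Medium loop: 16 bytes per iteration. */
    while (len - done >= 16) {
        move8(d + done, s + done);
        move8(d + done + 8, s + done + 8);
        done += 16;
    }

    /* The "8 byte loop" runs at most once. */
    if (len - done >= 8) {
        move8(d + done, s + done);
        done += 8;
    }

    /* One additional 4-byte move (MOVSL) if enough is left (assumed >= 4). */
    if (len - done >= 4) {
        uint32_t w;
        memcpy(&w, s + done, sizeof w);
        memcpy(d + done, &w, sizeof w);
        done += 4;
    }
    return done;
}
```

For len = 95, for example, the steps cover 64 + 16 + 8 + 4 = 92 bytes and leave 3 for the byte tail.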
At the end we add an SFENCE (if _SSE) and EMMS (if _MMX) or FEMMS
(if _3DNOW) to serialize the non-temporal moves and clear the MMX state,
respectively. The SFENCE should not be needed, however.
Prefetching is not used (very bad on Athlon (or i don't understand it)).
1. !_MMX && !_FPU
2. _MMX
3. _FPU (thanks to the FreeBSD crew for this idea!)
4. _MMX+_3DNOW+_SSE implementation (all we have).
([*] times in brackets show which time has been measured if the from
pointer alignment loop has a leading '.ALIGN 2' statement; note
especially the value for 4096... note this value in general.)
UNT: unaligned pointers, to pointer alignment goal
UNF: unaligned pointers, from pointer alignment goal
1000 loops; times in (averaged) microseconds
P.S.: 03-04-01: SSE stuff disabled because speed for smaller ranges is
considered more important than speed for large and very large ranges.
(And the difference is small for non-perfect ranges and non-aligned pointers.)
---------------------------------------------------------------------------
|bytes|   1./ UNT/ UNF |   2./ UNT/ UNF |   3./ UNT/ UNF |   4.[*]  / UNF |
|-------------------------------------------------------------------------|
|16   |   34/    /     |   19/    /  37 |   21/    /  37 |   24[ 26]/  37 |
|15   |   40/    /     |   39/    /  35 |   37/    /  35 |   38[ 39]/  35 |
|32   |   36/    /     |   23/    /  30 |   23/    /  30 |   27[ 30]/  33 |
|31   |   43/    /     |   37/    /  28 |   36/    /  28 |   38[ 42]/  31 |
|64   |   45/    /     |   17/    /  38 |   17/    /  36 |   21[ 23]/  39 |
|63   |   50/    /     |   46/    /  35 |   44/    /  34 |   47[ 50]/  37 |
|128  |   59/  70/  74 |   31/    /  45 |   34/    /  47 |   34[ 36]/  50 |
|127  |   67/  82/  62 |   53/    /  45 |   51/    /  44 |   62[ 63]/  50 |
|256  |   89/ 111/ 108 |   52/    /  74 |   53/    /  77 |   50[ 50]/  76 |
|255  |   99/ 123/  96 |   67/    /  73 |   73/    /  75 |   68[ 70]/  74 |
|512  |  151/ 197/ 177 |   95/    / 131 |   98/    / 137 |   84[103]/ 137 |
|511  |  158/ 208/ 166 |  100/    / 132 |  117/    / 134 |   99[112]/ 135 |
|1024 |  274/ 395/ 314 |  179/    / 255 |  211/    / 270 |  166[207]/ 257 |
|1023 |  280/ 408/ 303 |  196/    / 253 |  225/    / 267 |  184[185]/ 253 |
|2048 |  579/ 765/ 966 |  350/    / 485 |  394/    / 511 |  389[388]/ 486 |
|2047 |  585/ 777/ 942 |  368/    / 484 |  410/    / 520 |  323[398]/ 484 |
|4096 | 1009/1385/1140 |  704/    /1036 |  761/    /1040 |  671[583]/1038 |
|4095 | 1027/1386/1130 |  721/    /1034 |  776/    /1037 |  602[604]/1035 |
|-------------------------------------------------------------------------|
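In case someone wants to redo numbers like these: the notes do not say how the timing itself was done, so the following is only one possible harness, not the one used in 2002. The names copy_fn and time_copy are invented; it uses the standard C clock() and returns the CPU time for a whole batch of loops in microseconds. Unaligned runs (UNT/UNF) would be driven by passing pointers such as buf + 1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Any memcpy-compatible routine can be plugged in here. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/*
 * Hypothetical helper: runs `loops` copies of len bytes and returns the
 * CPU time for the whole batch in microseconds. The resolution of clock()
 * is coarse, so several batches should be averaged for small sizes.
 */
double time_copy(copy_fn fn, void *to, const void *from, size_t len,
                 unsigned loops)
{
    clock_t start, end;
    unsigned i;

    assert(fn != NULL && loops > 0);

    start = clock();
    for (i = 0; i < loops; i++)
        fn(to, from, len);
    end = clock();

    return (double)(end - start) * 1e6 / CLOCKS_PER_SEC;
}
```

For example, time_copy(memcpy, dst + 1, src + 1, 4095, 1000) times 1000 unaligned copies of 4095 bytes with the C library's routine as a baseline.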
P.S.: ooops - i've really forgotten that the SSE stuff has been
completely disabled at a later time! I guess we'll have to redo
some testing eventually!