[RFC] Port of NetBSD's optimized amd64 string code

Sat Aug 6 22:13:38 GMT 2005

On Wed, 3 Aug 2005, Xin LI wrote:

> On Mon, Aug 01, 2005 at 06:39:16PM -0700, David O'Brien wrote:
>> On Tue, Aug 02, 2005 at 02:25:18AM +0800, Xin LI wrote:
>>> Here is a patchset that I have produced to make our libc aware of the
>>> NetBSD assembly implementation of the string related operations.
>>
>> What performance benchmarks have these been thru?

I would expect insignificant ones.  alc already optimized most of the
amd64 string functions that are worth optimizing.

> In summary:  index, rindex, memchr, strchr, strlen, strlen, strrchr are
> faster than their C counterparts; ffs, strncmp are about the same, and
> swab is worse.

This is not surprising, since the faster ones mostly use better algorithms
so their fastness has very little to do with them being written in asm,
ffs.S has slightly worse code than the gcc builtin, strncmp.S doesn't do
anything special, and swab.S does almost everything pessimally.

> Note that these tests are done within loops that first copy a string to
> a new place, then perform the operation, then free the string, to defeat
> cache effects.

I usually test like this since it is hard to do better, but this tests for
precisely things that shouldn't happen in real applications.

> BETTER CASES
> 
> [delphij at warrior7] ~/test/index> /usr/bin/time ./index
>         5.81 real         5.79 user         0.00 sys
> ASSEMBLY
>         1.84 real         1.82 user         0.00 sys
> 
> [delphij at warrior7] ~/test/rindex> /usr/bin/time ./rindex
>         6.25 real         6.24 user         0.00 sys
> ASSEMBLY
>         2.17 real         1.84 user         0.01 sys
> 
> [delphij at warrior7] ~/test/memchr> /usr/bin/time ./memchr
>         5.93 real         5.91 user         0.00 sys
> ASSEMBLY
>         1.91 real         1.84 user         0.00 sys
> 
> [delphij at warrior7] ~/test/strchr> /usr/bin/time ./strchr
>         5.93 real         5.91 user         0.00 sys
> ASSEMBLY
>         1.84 real         1.83 user         0.00 sys

memchr.S, strchr.S (and thus index.S), strchchr.S (strchr timings below)
(and thus rindex.S) all use the well known method of comparing with the
magic byte sequences \x80\x80... and \x01\0x01...  It's not surprising
that this is several times faster since it accesses the data 4-8 bytes
at a time (8 bytes for amd64), but IMO this optimization belongs only in
C code (if anywhere) so that it optimizes all arches (if it is an
optimization).  glibc has it in C code.  NetBSD also has it in the i386
string library but not in the generic string library.

> [delphij at warrior7] ~/test/strlen> /usr/bin/time ./strlen
>         7.13 real         7.12 user         0.00 sys
> ASSEMBLY
>         1.44 real         1.43 user         0.00 sys

strlen.S uses the magic byte sequence method in the same way (NetBSD also
does this in the i386 strlen.S...).  The method works even better for
strlen(), giving closer to an 8-fold speedup, since there is less overhead.

Other benchmarks for strlen() show strange behaviour.  (Note that neither
the generic library version nor the MD i386/amd64 version of it is normally
used, since the gcc builtin is normally used.)
- the builtin is slightly faster than the MD i386 version, since they use
   the same algorithm (a scasb loop) and the builtin doesn't have function
   call overhead.
- the simple generic C version is often faster than the "optimized" builtin!
   IIRC, alc noticed this for amd64 and didn't implement an amd64 strlen.S
   using the same simple algorithm as the i386 one because it would be a
   pessimization even if it were actually used.  It seems to be pessimal on
   Athlon XP's too.  For 10^8 strlen()s, with other CFLAGS -O2 -static: on
   an AXP2600 overclocked:

%%%
string length 0:
algorithm -fbuiltin:
         0.30 real         0.29 user         0.00 sys
algorithm -fno-builtin:
         0.33 real         0.32 user         0.00 sys
algorithm -DMISTRLEN:
         0.25 real         0.24 user         0.00 sys
string length 1:
algorithm -fbuiltin:
         0.31 real         0.30 user         0.00 sys
algorithm -fno-builtin:
         0.34 real         0.33 user         0.00 sys
algorithm -DMISTRLEN:
         0.26 real         0.25 user         0.00 sys
string length 10:
algorithm -fbuiltin:
         0.40 real         0.39 user         0.00 sys
algorithm -fno-builtin:
         0.43 real         0.42 user         0.00 sys
algorithm -DMISTRLEN:
         0.40 real         0.39 user         0.00 sys
string length 100:
algorithm -fbuiltin:
         1.32 real         1.30 user         0.00 sys
algorithm -fno-builtin:
         1.35 real         1.33 user         0.00 sys
algorithm -DMISTRLEN:
         1.21 real         1.20 user         0.00 sys
string length 1000:
algorithm -fbuiltin:
        10.46 real        10.42 user         0.00 sys
algorithm -fno-builtin:
        10.50 real        10.46 user         0.00 sys
algorithm -DMISTRLEN:
         9.35 real         9.31 user         0.00 sys
%%%

Note that the MI version is fastest for all string lengths.

The C version was sometimes (in results with -O1 and/or not shown above) 
30% slower than it usually is; this turned out to be due to the loop in
strlen() not fitting in a cache line due to the start of the loop
accidentally beginning near a cache line boundary.

On an A64 3400:

%%%
string length 0:
algorithm -fbuiltin:
         0.34 real         0.32 user         0.00 sys
algorithm -fno-builtin:
         0.24 real         0.23 user         0.00 sys
algorithm -fno-builtin zq.S:
         0.27 real         0.26 user         0.00 sys
string length 1:
algorithm -fbuiltin:
         0.35 real         0.33 user         0.00 sys
algorithm -fno-builtin:
         0.25 real         0.24 user         0.00 sys
algorithm -fno-builtin zq.S:
         0.29 real         0.28 user         0.00 sys
string length 10:
algorithm -fbuiltin:
         0.44 real         0.42 user         0.00 sys
algorithm -fno-builtin:
         0.42 real         0.40 user         0.00 sys
algorithm -fno-builtin zq.S:
         0.30 real         0.29 user         0.00 sys
string length 100:
algorithm -fbuiltin:
         1.34 real         1.32 user         0.00 sys
algorithm -fno-builtin:
         1.32 real         1.30 user         0.00 sys
algorithm -fno-builtin zq.S:
         0.55 real         0.54 user         0.00 sys
string length 1000:
algorithm -fbuiltin:
        10.39 real        10.37 user         0.00 sys
algorithm -fno-builtin:
        10.37 real        10.34 user         0.00 sys
algorithm -fno-builtin zq.S:
         2.25 real         2.23 user         0.00 sys
%%%

zq.S is your proposed strlen.S.

> 
> [delphij at warrior7] ~/test/strrchr> /usr/bin/time ./strrchr
>         9.12 real         9.08 user         0.00 sys
> ASSEMBLY
>         4.71 real         4.69 user         0.00 sys
> 
> WORSE/NO EFFECTS
> 
> [delphij at warrior7] ~/test/ffs> /usr/bin/time ./ffs
>         0.33 real         0.33 user         0.00 sys
> ASSEMBLY
>         0.33 real         0.33 user         0.00 sys

The MI library version is poor for ffs, but the builtin is good.  It is
easy to improve the MI version, at least in benchmarks, using a lookup
table, but efficiency of ffs() is unimportant in most applications so no
one has done it.  From an old benchmark of various methods on an A64:

%%%
UNIFORM/BUILTIN_FFS:         0.27 real         0.26 user         0.00 sys
UNIFORM/CPUFUNC_FFS:         0.27 real         0.26 user         0.00 sys
UNIFORM/LIBMET0_FFS:         0.70 real         0.70 user         0.00 sys
UNIFORM/LIBMETH_FFS:         0.72 real         0.72 user         0.00 sys
UNIFORM/LUTMETH_FFS:         0.25 real         0.24 user         0.00 sys
UNIFORM/CPUFUNC_FLS:         0.33 real         0.33 user         0.00 sys
UNIFORM/ILOGMET_FLS:         0.18 real         0.18 user         0.00 sys
UNIFORM/ILOGBME_FLS:         0.37 real         0.36 user         0.00 sys
UNIFORM/ILOGBM0_FLS:         0.39 real         0.39 user         0.00 sys
UNIFORM/LIBMET0_FLS:         1.67 real         1.67 user         0.00 sys
UNIFORM/LIBMETH_FLS:         1.71 real         1.71 user         0.00 sys
RANDBIT/BUILTIN_FFS:         0.27 real         0.26 user         0.00 sys
RANDBIT/CPUFUNC_FFS:         0.27 real         0.26 user         0.00 sys
RANDBIT/LIBMET0_FFS:         1.43 real         1.43 user         0.00 sys
RANDBIT/LIBMETH_FFS:         1.44 real         1.43 user         0.00 sys
RANDBIT/LUTMETH_FFS:         0.25 real         0.24 user         0.00 sys
RANDBIT/CPUFUNC_FLS:         0.33 real         0.33 user         0.00 sys
RANDBIT/ILOGMET_FLS:         0.17 real         0.17 user         0.00 sys
RANDBIT/ILOGBME_FLS:         0.41 real         0.41 user         0.00 sys
RANDBIT/ILOGBM0_FLS:         0.39 real         0.39 user         0.00 sys
RANDBIT/LIBMET0_FLS:         1.16 real         1.16 user         0.00 sys
RANDBIT/LIBMETH_FLS:         1.16 real         1.16 user         0.00 sys
ALLZERO/BUILTIN_FFS:         0.27 real         0.26 user         0.00 sys
ALLZERO/CPUFUNC_FFS:         0.27 real         0.26 user         0.00 sys
ALLZERO/LIBMET0_FFS:         0.53 real         0.53 user         0.00 sys
ALLZERO/LIBMETH_FFS:         0.55 real         0.55 user         0.00 sys
ALLZERO/LUTMETH_FFS:         0.12 real         0.12 user         0.00 sys
ALLZERO/CPUFUNC_FLS:         0.10 real         0.10 user         0.00 sys
ALLZERO/ILOGMET_FLS:         0.12 real         0.12 user         0.00 sys
ALLZERO/ILOGBME_FLS:         0.12 real         0.12 user         0.00 sys
ALLZERO/ILOGBM0_FLS:         0.10 real         0.10 user         0.00 sys
ALLZERO/LIBMET0_FLS:         0.24 real         0.24 user         0.00 sys
ALLZERO/LIBMETH_FLS:         0.29 real         0.28 user         0.00 sys
ALLONE_/BUILTIN_FFS:         0.27 real         0.26 user         0.00 sys
ALLONE_/CPUFUNC_FFS:         0.27 real         0.26 user         0.00 sys
ALLONE_/LIBMET0_FFS:         0.58 real         0.57 user         0.00 sys
ALLONE_/LIBMETH_FFS:         0.60 real         0.59 user         0.00 sys
ALLONE_/LUTMETH_FFS:         0.25 real         0.24 user         0.00 sys
ALLONE_/CPUFUNC_FLS:         0.37 real         0.37 user         0.00 sys
ALLONE_/ILOGMET_FLS:         0.16 real         0.16 user         0.00 sys
ALLONE_/ILOGBME_FLS:         0.37 real         0.37 user         0.00 sys
ALLONE_/ILOGBM0_FLS:         0.39 real         0.39 user         0.00 sys
ALLONE_/LIBMET0_FLS:         0.25 real         0.24 user         0.00 sys
ALLONE_/LIBMETH_FLS:         0.29 real         0.28 user         0.00 sys
ALLLAST/BUILTIN_FFS:         0.27 real         0.26 user         0.00 sys
ALLLAST/CPUFUNC_FFS:         0.27 real         0.26 user         0.00 sys
ALLLAST/LIBMET0_FFS:         2.37 real         2.36 user         0.00 sys
ALLLAST/LIBMETH_FFS:         2.39 real         2.38 user         0.00 sys
ALLLAST/LUTMETH_FFS:         0.25 real         0.24 user         0.00 sys
ALLLAST/CPUFUNC_FLS:         0.37 real         0.37 user         0.00 sys
ALLLAST/ILOGMET_FLS:         0.12 real         0.12 user         0.00 sys
ALLLAST/ILOGBME_FLS:         0.37 real         0.37 user         0.00 sys
ALLLAST/ILOGBM0_FLS:         0.39 real         0.39 user         0.00 sys
ALLLAST/LIBMET0_FLS:         1.77 real         1.76 user         0.00 sys
ALLLAST/LIBMETH_FLS:         1.81 real         1.81 user         0.00 sys
%%%

Here various distributions of the bits are tested, with various algorithms:

BUILTIN_FFS: whatever gcc gives
CPUFUNC_FFS: as in <machine/cpufunc.h> (this uses the builtin for amd64)
LIBMET0_FFS: LIBMETH_FFS compiled with benchmark's CFLAGS
LIBMETH_FFS: MI library method
LUTMETH_FFS: a lookup table method, orignally by cperciva
CPUFUNC_FLS: whatever gcc gives
ILOGMET_FLS: convert to floating point and extract the exponent directly
ILOGBME_FLS: convert to floating point and use ilogb(3) to extract...
ILOGBM0_FLS: ILOGBM0_FLS compiled with benchmark's CFLAGS
LIBMET0_FLS: MI library method compiled with benchmark's CFLAGS
LIBMETH_FLS: MI library method

> [delphij at warrior7] ~/test/strncmp> /usr/bin/time ./strncmp
>         3.90 real         3.88 user         0.00 sys
>         3.88 real         3.87 user         0.00 sys
> ASSEMBLY
>         3.88 real         3.87 user         0.00 sys
> 
> [delphij at warrior7] ~/test/swab> /usr/bin/time ./swab
>         6.59 real         6.53 user         0.01 sys
> ASSEMBLY
>         8.06 real         8.05 user         0.00 sys
> 
Some comments on the patch.

% Index: ffs.S
% ===================================================================
% RCS file: ffs.S
% diff -N ffs.S
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ ffs.S	1 Aug 2005 17:54:04 -0000
% @@ -0,0 +1,22 @@
% +/*
% + * Written by J.T. Conklin <jtc at NetBSD.org>.
% + * Public domain.
% + * Adapted for NetBSD/x86_64 by Frank van der Linden <fvdl at wasabisystems.com>
% + */
% +
% +#include <machine/asm.h>
% +__FBSDID("$FreeBSD$");
% +
% +#if 0
% +	RCSID("$NetBSD: ffs.S,v 1.2 2003/07/26 19:24:38 salo Exp $")
% +#endif
% +
% +ENTRY(ffs)
% +	bsfl	%edi,%eax
% +	jz	L1	 		/* ZF is set if all bits are 0 */
% +	incl	%eax			/* bits numbered from 1, not 0 */
% +	ret
% +
% +	.align 4
% +L1:	xorl	%eax,%eax		/* clear result */
% +	ret

The amd64 builtin uses cmove to avoid the branch.  This is always better,
but is only a small optimization (see above benchmark).

The i386 ffs() does the test and branch before the bsfl so as to avoid
the bsfl if the arg is 0.  This sigificantly optimimizes the case of a
zero arg at no cost for nonzero args.

The gcc builtin ffs() has evolved in many stages:
- very old versions depended on undefined behaviour.  The i386 kernel ffs()
   was written to avoid this bug and it was easy to optimize it while doing
   this.
- not old versions generated essentially the above code.
- not so old versions (3.3.3) generate "sete; negl; orl" to avoid the
   branch.  This is a less efficient variant of the cmove.

% Index: memchr.S
% ===================================================================
% RCS file: memchr.S
% diff -N memchr.S
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ memchr.S	1 Aug 2005 18:09:44 -0000
% @@ -0,0 +1,112 @@

This is good code (algorithm and style).  I don't mind using this
micro-optimization if someone else writes and maintains it :-).

% ...
% +.Lzero:
% +	xorq	%rax,%rax

Some of the other functions only return %eax, which is sufficient for
an int but is a different style.

% Index: strchr.S
% ===================================================================
% RCS file: strchr.S
% diff -N strchr.S
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ strchr.S	1 Aug 2005 18:11:51 -0000
% @@ -0,0 +1,137 @@

Similarly to memchr.S....

% ...
% +	subq	$7,%rdi
% +	jmp	.Ldone
% +1:	testb	%dl,%dl		/* 2nd byte == 0? */

... but not so good style near the end of it:
- nameless labels are fairly easy to write but hard to read
- no blank line before separate blocks of code
- labels on the same line as statements

% Index: strlen.S
% ===================================================================
% RCS file: strlen.S
% diff -N strlen.S
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ strlen.S	1 Aug 2005 18:12:48 -0000
% @@ -0,0 +1,157 @@

Not as good style as memchr.S.

% +	/*
% +	 * There are many well known branch-free sequences which are used
% +	 * for determining whether a zero-byte is contained within a word.
% +	 * These sequences are generally much more efficent than loading
% +	 * and comparing each byte individually.
% +	 *
% +	 * The expression [1,2]:
% +	 *
% +	 * (1)  ~(((x & 0x7f....7f) + 0x7f....7f) | (x | 0x7f....7f))
% +	 *
% +	 * evaluates to a non-zero value if any of the bytes in the
% +	 * original word is zero.
% ...

This comment has some spelling errors and doesn't belong here.  The algorithm
is mostly general and should be described centrally.  A more complicated
variant of It is used without comment in memchr.S and strchr.S.  Describing
things in terms of PowerPC (?) clz instructions belongs even less in the
i386 asm version than it does in a general comment.

% Index: strncmp.S
% ===================================================================
% RCS file: strncmp.S
% diff -N strncmp.S
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ strncmp.S	1 Aug 2005 18:13:51 -0000
% @@ -0,0 +1,108 @@

Note as good as memchr.S.  It uses an algorithm that apparently gives no
better results than the generic C one, and uses meaningless though named
labels throughout.

% Index: strrchr.S
% ===================================================================
% RCS file: strrchr.S
% diff -N strrchr.S
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ strrchr.S	1 Aug 2005 18:15:07 -0000
% @@ -0,0 +1,127 @@

Similarly to strchr.S. but less worth optimizing.

% Index: swab.S
% ===================================================================
% RCS file: swab.S
% diff -N swab.S
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ swab.S	1 Aug 2005 18:18:17 -0000
% @@ -0,0 +1,47 @@

Shows how not to optimize things.

% +#define LOAD_SWAP_STORE_WORD \
% +	lodsw	; \

The lods instruction should never be used, since it is very slow.

% +	xchgb	%al,%ah ; \

Partial register operations should be avoided.  I don't know if this
is important on amd64's.  On Athlons xchgb of 2 registers is on the
VectorPath and takes 2 cycles, so is probably possible to do 2 byte
loads to the right places (%al and %ah) in the same time that it takes 
to fix up the bytes after slowly loading them as a word, provided there
is no penalty for the partial register operations.  I think the C code
wins by doing something similar.

% +	stosw

stosw Is OK.

% +L2:	shrq	$3,%rdx			# copy remainder 8 words at a time
% +	jz	L4			# while swapping alternate bytes.
% +L3:
% +	LOAD_SWAP_STORE_WORD
% +	LOAD_SWAP_STORE_WORD
% +	LOAD_SWAP_STORE_WORD
% +	LOAD_SWAP_STORE_WORD
% +	LOAD_SWAP_STORE_WORD
% +	LOAD_SWAP_STORE_WORD
% +	LOAD_SWAP_STORE_WORD
% +	LOAD_SWAP_STORE_WORD
% +
% +	decq	%rdx
% +	jnz	L3

To optimize the amd64 version, I would first try loading 8 bytes at a time
and swap them using a minimal combination of bswaps and rotates.

Bruce