svn commit: r213326 - head/lib/libc/i386/string

Fri Oct 1 21:02:04 UTC 2010

On Fri, 1 Oct 2010, Dimitry Andric wrote:

> On 2010-10-01 15:22, Roman Divacky wrote:
>> there's "rep cmps" in bcmp.S and memcmp.S in both amd64/i386
>> 
>> they both have C counterparts, no idea how fast those are (they
>> are going char by char).
>> 
>> does this wisdom apply to those too?
>
> Well, these versions compare by words first, using "repe cmpsl" (for
> i386) and "rep cmpsq" (for amd64).  Only if there any bytes remaining
> after that, they are checked with "repe cmpsb".

I think the speedup is not really because string instructions are very
slow, but because the C strlen() is sophisticated.  The asm versions compare
a byte at a time, while the C version mostly compares sizeof(long) words
at a time.  Only the "rep lodsb" is very slow, and "rep movs*" is quite
fast, at least if it asks to move 8 bytes at a time or the implementation
converts to that.

However, I double that the sophistication of the C version is an
optimization.  It has more overhead for short strings, and most strings are
short.

Many of the asm functions already don't use string instructions, but use
ordinary byte access instructions that would be generated for a C program
written using the same algorithm, except the compiler would add overhead
instructions.

> It is probably possible to optimize these versions for the case the
> source pointers are not aligned.  That will not occur too much,
> probably, and complicate the implementation.

And would be a negative optimization, or irrelevant, since the source
pointers must already be aligned if efficiency is important.

> On the other hand, the C implementations of bcmp and memcmp are
> basically simple "while (*src++ == *dst++);" implementations, so it is
> entirely up to the compiler to do any optimization magic there.

It is only when the asm versions are more sophisticated than that when
the asm versions are worth having.

The ones worth doing in asm for x86 are probably shown by ls -C of
src/libc/amd64/string:

CVS/		bcopy.S		memcpy.S	strcat.S
Makefile.inc	bzero.S		memmove.S	strcmp.S
bcmp.S		memcmp.S	memset.S	strcpy.S

Originally, amd64 didn't have any MD implementations of string functions
in libc, and no one really noticed since even bcopy is only about twice
as slow in C using a naive algorithm and a naive compiler, but eventually
the above were added.  Analysis:

bcmp.S: uses cmpsq.  Probably not called enough to matter.
bcopy.S: uses movsq.  Adequate, though it can be beaten by up to a factor
     of 2 using SSE2 for large copies on some CPUs
bzero.S: like bcopy.S
memcmp.S: like bcmp.S.  Should be called even less than bcmp(), or at least
     halve the number of uses of each, sinc bcmp() is the BSD spelling
memcpy.S: just wraps bcopy.S
memmove.S: just wraps bcopy.S
memset.S: uses stosq, and has too much alignment overhead
strcat.S: sophisticated, probably to a fault.  I think it uses the same
     algorithm as the C version to find the end of the source and target
     strings, with large code to manually inline this for each.  Probably
     not called enough to matter
strcmp.S: sophisticated again.  Uses the same method as the C strlen
     (0x8080 trick) to find the end(s).  The C version doesn't bother with
     this.
strcpy.S: like strcmp.S.  Again, the C version is unsophisticated.

Note that strlen.S is not present here.  This is because the compiler
normally inlines strlen() using "repnz scasb", so the sophisticated C
version and the i386 asm version are not normally used and the missing
amd64 asm version is not normally missed.  "Normally" involves compiling
with -O.  Howver, with -O2 gcc knows that it doesn't understand string
instructions, and it generates a call to strlen() instead of "optimizing"
it to "repnz scasb"!  The "optimization" seems to be independent of
the target -- it happens even on amd64 without -O2.  gcc understands
the string instruction memcpy() better than this -- it used generate
"rep movs*", but now either calls memcpy() or does load-store with
non-string instructions.

ls -C of src/libc/amd64/string shows too many string functions:

Makefile.inc	index.S		memset.S	strcpy.S	wcschr.S
bcmp.S		memchr.S	rindex.S	strlen.S	wcscmp.S
bcopy.S		memcmp.S	strcat.S	strncmp.S	wcslen.S
bzero.S		memcpy.S	strchr.S	strrchr.S	wmemchr.S
ffs.S		memmove.S	strcmp.S	swab.S

bcmp.S: like amd64
bcopy.S: like amd64
bzero.S like amd64
ffs.S: another mistake.  gcc normally inlines this, and cperciva has a
     faster version using a lookup table, and I have a faster version
     using floating point.  The last 2 methods may only be faster for
     32-bit ffs (they don't scale so well to 64 bits), but only the 32
     bit one has an "optimized" asm version
index.S: unsophisticated, but avoids string instructions
memchr.S: like index.S
memcmp.S: like bcmp.S
memcpy.S: just wraps bcopy.S
memmove.S: just wraps bcopy.S
memset.S: link 
rindex.S: like index.S
strcat.S: unsophisticated, but avoids string instructions
strchr.S: even more like index.s
strcmp.S: unsophisticated, but avoids string instructions
strcpy.S: unsophisticated, but avoids string instructions
strlen.S: when it existed: unsophisticated, but avoids string instructions
strncmp.S: like strcmp.S
strrchr.S: like index.S
swab.S: by far the best example of how to optimize using string instructions,
     but only for an 8088.  It uses "lodsw; rorw; stosw;".
wcschr.S: like strchr.S (?)
wcscmp.S: like strcmp.S (?)
wcslen.S: uses a simple loop with 8-fold unrolling and no strong instructions.
    The C version uses a simple loop with no unrolling.  Probably a silly
    optimization -- if the unrolling is good, then put it in the C version
    and let the compiler generate almost the same code as here, or better.
    The best unrolling factor may be MD, but I doubt 8 is best for all x86's.
    The 0x8080 trick doesn't really apply to wide characters, so this function
    (and wider anything) are inherently slower.
wmemchr.S: like memchr.S (?)

Bruce