HEADS DOWN

Bruce Evans bde at zeta.org.au
Sun May 6 00:59:27 UTC 2007


On Sun, 6 May 2007, Andrey Chernov wrote:

> On Sat, May 05, 2007 at 04:48:44PM -0500, Sean C. Farley wrote:
>>  I have the same assembly output.  Inlined __strleneq() ends up being
>>  faster on my system than GCC's strlen() when I changed all calls where
>>  checkEquals equaled false.  I believe you that it should be faster with
>>  GCC's version, but it is not ending up that way on my Athlon XP and
>>  Pentium 4 systems running FreeBSD 6.2.
>>
>>  There is now a sysenv-strlen.c that I tested the timings.c program in
>>  regressions/environment directory.  It keeps showing __strleneq() to be
>>  faster.
>
> I wonder how it possible. Your after "if" variant becomes
> .L13:
>        incl    %eax
>        cmpb    $0, (%eax)
>        jne .L13
> which should be slower in general than gcc ones.

No, it should be faster on most machines.  I just happened to look at
an optimization manual which reminded me that most string instructions
should never be used since they have large setup overheads and most
of them are slower even after setup.  I thought that scasb wasn't so
bad, but the manual went as far as saying that scasb is one of the
string instructions that should never be used.

AthlonXP timing (execution latency) for some (repeated) string
instructions (forward direction) (non-repeated string instructions are
even less useful, and the reverse direction has even higher overheads):

     movs    15 + 4/3 *c    can sometimes beat load-store
     stos    14 + 1   *c    wastes up to 1/2 of store bandwidth
     lods    15 + 2   *c    wastes up to 3/4 of load bandwidth
     scas    15 + 5/2 *c    1.25 times slower than the above simple loop
     cmps    16 + 10/3*c    10/4 times slower than a simple loop

The setup overhead for using string instructions may be much larger than
the 14-16 in the above table.  It is also necessary to set the direction
flag and maybe to shuffle registers so as to use the registers required
by the string instructions.  cld/std is fast enough on Athlons (1/2
cycles) but is slow on Pentium4 and later (IIRC, 43 cycles for std on
one Pentium model).

All simple loops like the above take only 2 cycles (execution latency)
on many modern CPUs including Athlons.  The memory compare takes 2
cycles (1 to load and 1 to compare).  The loop overhead is also 2
cycles, but this executes in parallel.  Athlons have 3 fairly independent
pipelines, so all loops between an empty one and one doing twice as much
non-loop-overhead work as the above execute in 2 cycles provided the work
is sufficiently independent, and it may be possible to implement strlen()
up to 3/2 times faster using a not-so simple loop while still using a
byte-wise algorithm.  Speeding it up like this isn't easy -- I can't
think of anything better to do than cmpb of another byte, but the loop
would need to be unrolled for this to work and then we would have
a 16-bit-word-wise algorithm that has most of the disadvantages and
few of the advantages of a 16/32/64-bit-word-wise algorithm.  The simplest
loop above is likely to be best for average strings since it low
setup overhead and no special cases.

gcc understands very little of this and still generates string
instructions (gcc does know to issue the cld as early as possible so
that its latency probably doesn't matter).  Inlining strlen() using
string instructions may still be both a space and time optimization
relative to the extern version, but cannot compete with the simplest
loop.

THe FreeBSD string library understands only a little of this.  The
i386 part is full of string instructions that may have been good for
original i386's (but which do things like non-rep string instructions
that were probably always pessimizations).  The amd64 part is smarter.
It doesn't have strlen.S or special support for most str and mem
functions, but it does have strcmp.S and of course memcmp.S.  Its
strcmp.S uses a 64-bit word-wise algorithm so it will probably beat
the simple byte-wise algorithm; it probably doesn't need to be written
in asm to do this, but many machine dependencies are involved in knowing
that 64 bits can be used, and are best, and to know and use the best
alignment.

glibc-2.4 has tege's old code for the 32/64-bit word-wise algorithm for
the generic strlen() in C, but the i386 version is "optimized" to use
rep scasb like the gcc builtin.

Of course, optimizing strlen() is unimportant, since even the slowest
method runs at nearly 1GB/S on modern machines and you rarely have more
than a few MB of strings to process.

Bruce


More information about the freebsd-arch mailing list