[PATCH] Maintaining turnstile aligned to 128 bytes in i386 CPUs

Fri Jan 19 07:14:26 UTC 2007

On Fri, 19 Jan 2007, Peter Jeremy wrote:

> On Thu, 2007-Jan-18 18:03:20 +1100, Bruce Evans wrote:
>> On Wed, 17 Jan 2007, Matthew Dillon wrote:
>>>     Alignment is critical.  If the data is not aligned, don't bother.  128
>>>     bits means 16 byte alignment.
>>
>> The above benchmark output is for aligned data :-).  I don't try hard to
>> optimize or benchmark misaligned cases.
>
> How realistic is this?  Has anyone collected statistics on the size and
> alignment of bzero/bcopy calls?  How much of the time is the size known
> at compile time?

I think perfect alignment is very realistic.  If not, it is an application
bug :), just like for misaligned integer accesses on arches that allow
this.  In the kernel, other parts of the kernel are the application and
it is reasonable to require perfect alignment.

I recently did a dynamic search for misaligned (but only 32-bit
non-aligned) bxx's (maybe only bzeros) in low-level network code and
found only a couple.  For the original i586 FPU optimizations, I
gathered statistics for bcopy/bzero.  IIRC, alignment (64-bit?) was
normal, at least for the large copies of interest, and large bcopys
were so uncommon that it was a complete waste of time to optimize them
(at least for my applications).  Large bzeros/copyins/copyouts are
more common.
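
A minimal sketch of how such statistics could be gathered, for
illustration only.  This is not the instrumentation used for the
measurements above; the names counted_bzero, bzero_stats and
LARGE_THRESHOLD are hypothetical, and the 256-byte cutoff for "large"
is an arbitrary assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Arbitrary cutoff for a "large" request, chosen only for illustration. */
#define LARGE_THRESHOLD 256

/* Hypothetical counters; not taken from any FreeBSD source. */
struct bzero_stats {
	unsigned long calls;		/* total calls seen */
	unsigned long misaligned4;	/* start address not 4-byte aligned */
	unsigned long misaligned8;	/* start address not 8-byte aligned */
	unsigned long large;		/* length >= LARGE_THRESHOLD */
};

static struct bzero_stats stats;

/* Record the alignment and size of the request, then zero the buffer. */
static void
counted_bzero(void *buf, size_t len)
{
	uintptr_t addr = (uintptr_t)buf;

	stats.calls++;
	if ((addr & 3) != 0)
		stats.misaligned4++;
	if ((addr & 7) != 0)
		stats.misaligned8++;
	if (len >= LARGE_THRESHOLD)
		stats.large++;
	memset(buf, 0, len);
}
```

Routing calls through a wrapper like this and dumping the counters
after a workload would show how often misaligned or large requests
actually occur.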

FreeBSD has some optimizations in low-level networking code for bcopys
with a small size that is known at compile time (just use gcc's
__builtin_memcpy).  These were lost to -ffreestanding and/or gcc's
aggressive optimization of things like printf using the builtin printf.
(-ffreestanding implies -fno-builtin, and no one cared enough about
the loss to turn builtins back on.  If you turn them back on, then
they should be turned on individually as recommended in gcc.info to
avoid conflicts.  This is easy enough for the memcpy builtin but messy
if you want all the old builtins starting with strlen.)  I looked at
these lost optimizations again while trying to optimize the low-level
networking code for packets-per-second.  The difficulty of implementing
memcpy/bcopy perfectly is shown by gcc's builtin not being very close
to getting it right for small fixed sizes even with -march=...  I lost
interest in this for now when I found that optimizations were impossible
to measure because the packet rate depends mysteriously on the layout
of the text section.  My changes may have given +10%, but unrelated
changes gave +-30%.  The most mysterious one was -20% when a cvs update
added ~500 bytes of object code that is never executed.  Using builtin
memcpy didn't have a noticeable effect here.
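
A sketch of turning on just the memcpy builtin, following the approach
the gcc manual describes for -fno-builtin/-ffreestanding: since there
is no positive -fbuiltin-<function> option, the name is mapped to the
__builtin_ form with a macro.  The struct hw_addr type and
copy_hw_addr function are hypothetical, standing in for any small
object whose size is known at compile time; this is not the actual
FreeBSD networking code.

```c
#include <stddef.h>

/*
 * Under -ffreestanding (which implies -fno-builtin), a plain memcpy()
 * call is not treated as a builtin.  Mapping the name to the
 * __builtin_ form re-enables only this one function.
 */
#define memcpy(d, s, n)	__builtin_memcpy((d), (s), (n))

/* Hypothetical 6-byte address: a small object of fixed size. */
struct hw_addr {
	unsigned char octet[6];
};

static void
copy_hw_addr(struct hw_addr *dst, const struct hw_addr *src)
{
	/* The size is a compile-time constant, so gcc may expand this inline. */
	memcpy(dst, src, sizeof(*dst));
}
```

Whether gcc actually expands the copy inline, and how good the
resulting code is, depends on the compiler version and the -march
setting, so the generated assembly is worth checking.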

Bruce