Re: Checksum/copy (was: Re: cvs commit: src/sys/netinet ip_output.c)

From: Peter Jeremy <>
Date: Fri, 28 Mar 2003 18:35:14 +1100
On Fri, Mar 28, 2003 at 05:04:21PM +1100, Bruce Evans wrote:
>"i686" basically means "second generation Pentium" (PentiumPro/PII/Celeron)
>(later x86's are mostly handled better using CPU features instead of
>a 1-dimensional class number).  Hand-"optimized" bzero's are especially
>pessimal for this class of CPU.

That matches my memory of my test results as well.  The increasing
clock multipliers mean that it doesn't matter how slow "rep stosl" is
in clock cycle terms - main memory is always going to be slower.
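
For reference, the "rep stosl" baseline being discussed can be sketched
as below.  This is only an illustration, not the kernel's actual bzero:
the name sketch_bzero_words() is made up, it assumes GCC-style inline
assembly on x86 with the direction flag clear (as the ABI guarantees on
function entry), and it falls back to a plain loop elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: zero `nwords' 32-bit words starting at `buf'
 * using the string-store instruction, with no unrolling or other
 * hand-"optimisation".
 */
static void
sketch_bzero_words(uint32_t *buf, size_t nwords)
{
	assert(buf != NULL || nwords == 0);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	/* %edi/%rdi = destination, %ecx/%rcx = count, %eax = value stored */
	__asm__ __volatile__("rep stosl"
	    : "+D" (buf), "+c" (nwords)
	    : "a" (0)
	    : "memory");
#else
	size_t i;

	/* Portable fallback so the sketch still builds on non-x86. */
	for (i = 0; i < nwords; i++)
		buf[i] = 0;
#endif
}
```

Once the buffer no longer fits in cache, a loop like this should spend
most of its time waiting on memory rather than on the instruction
itself.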

>Benefits from SSE for bzeroing and bcopying, if any, would probably
>come more from bypassing caches and/or not doing read-before-write
>(SSE instructions give control over this) than from operating on wider
>data.  I'm dubious about practical benefits.  Obviously it is not useful
>to bust the cache when bzeroing 8MB of data, but real programs and OS's
>mostly operate on smaller buffers.  It is negatively useful not to put
>bzero'ed data in the (L[1-2]) cache if the data will be used soon, and
>generally hard to predict if it will be used soon.

Unless Intel have fixed the P4 caches, you definitely don't want to
use the L1 cache for page sized bzero/bcopy.

Avoiding read-before-write should roughly double bzero speed and give
you about 50% speedup on bcopy - this should be worthwhile.  Caching
is more dubious - placing a slow-zeroed page in L1 cache is very
probably a waste of time.  At least part of an on-demand zeroed page
is likely to be used in the near future - but probably not all of it.
Copying is even harder to predict - at least one word of a COW page is
going to be used immediately, but bcopy() won't be able to tell which.
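
The estimates above come from counting memory transfers per cache line
under a simple model: with write-allocate, bzero costs a line fill plus
a later writeback (2 transfers) against 1 for a store that skips the
fill, and bcopy costs 3 (read source, fill destination, write
destination) against 2, i.e. 1.5 times faster.  A minimal sketch of a
page-sized bzero that avoids the fill follows.  It assumes SSE2
intrinsics (movntdq via _mm_stream_si128), 4K pages and a 16-byte
aligned page; sketch_pagezero_nt() and SKETCH_PAGE_SIZE are
illustrative names only, and in the kernel the FPU/SSE state would
presumably also have to be saved and restored around it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define SKETCH_PAGE_SIZE 4096	/* assumption: 4K pages */

/*
 * Hypothetical page-sized bzero using non-temporal stores, which write
 * to memory without first reading the destination line into cache.
 */
static void
sketch_pagezero_nt(void *page)
{
	assert(((uintptr_t)page & 15) == 0);	/* movntdq needs alignment */
#if defined(__SSE2__)
	__m128i zero = _mm_setzero_si128();
	__m128i *p = (__m128i *)page;
	size_t i;

	for (i = 0; i < SKETCH_PAGE_SIZE / sizeof(__m128i); i++)
		_mm_stream_si128(p + i, zero);
	/* Non-temporal stores are weakly ordered; fence before reuse. */
	_mm_sfence();
#else
	/* Fallback where SSE2 is unavailable: ordinary cached stores. */
	memset(page, 0, SKETCH_PAGE_SIZE);
#endif
}
```

Because it only handles whole, aligned pages, there is no alignment
prologue or remaining-bytes epilogue to write.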

I don't know how much control SSE gives you over caching - is it just
cache/no-cache, or can you control L1+L2/L2-only/none?  In the latter
case, having bzero and the bcopy destination use L2 only is probably a
reasonable compromise.  The bcopy source should probably not evict
cache data - if data is cached, use it, otherwise fetch from main
memory and bypass caches.
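
As far as the instruction set goes, SSE offers prefetch hints
(prefetcht0/t1/t2/nta) for loads and non-temporal stores for writes;
which cache levels each hint actually touches is implementation
specific, so the L2-only behaviour wished for above is not something
the code below can promise.  It is a rough sketch of a page copy that
asks for a low-pollution fetch of the source and bypasses the cache for
the destination, assuming SSE2 intrinsics, 4K pages, 64-byte lines and
16-byte aligned pages.  All sketch_ and SKETCH_ names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define SKETCH_PAGE_SIZE 4096	/* assumption: 4K pages */
#define SKETCH_LINE_SIZE 64	/* assumption: cache line size */

/* Hypothetical page copy: NTA prefetch on source, streaming stores out. */
static void
sketch_pagecopy_nt(void *dst, const void *src)
{
	assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)src & 15) == 0);
#if defined(__SSE2__)
	const __m128i *s = (const __m128i *)src;
	__m128i *d = (__m128i *)dst;
	const size_t per_line = SKETCH_LINE_SIZE / sizeof(__m128i);
	size_t i, j;

	for (i = 0; i < SKETCH_PAGE_SIZE / sizeof(__m128i); i += per_line) {
		/* Hint only: fetch the next source line, minimising pollution. */
		_mm_prefetch((const char *)(s + i + per_line), _MM_HINT_NTA);
		for (j = 0; j < per_line; j++)
			_mm_stream_si128(d + i + j, _mm_load_si128(s + i + j));
	}
	/* Make the weakly-ordered stores visible before returning. */
	_mm_sfence();
#else
	/* Fallback where SSE2 is unavailable. */
	memcpy(dst, src, SKETCH_PAGE_SIZE);
#endif
}
```

Whether this beats a plain cached copy would have to be measured per
CPU model; the hints are advisory and a processor may ignore them.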

Then there are other processor families... 

Finally, how many different bcopy/bzero variants do we want?  A
"page-size" variant has the advantage of not having to worry about
alignment or remaining-bytes issues but doubles the number of
bzero/bcopy variants we need to maintain.  Likewise, different
variants optimised for different feature sets of different CPUs
in different families could very rapidly get out of hand.
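
One way to keep the variants manageable is to pick one at boot through
a function pointer, so callers never see the CPU-specific versions.
The sketch below is purely illustrative: the feature bits, the table
and every sketch_ name are invented for the example (they are not the
kernel's real feature flags), and the "SSE" variants are stand-ins
that just call memset().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical feature bits, not real cpu_feature values. */
#define SKETCH_FEAT_SSE		0x1u
#define SKETCH_FEAT_SSE2	0x2u

typedef void sketch_bzero_t(void *buf, size_t len);

/* Stand-in implementations; real ones would differ per CPU class. */
static void sketch_bzero_generic(void *b, size_t n) { memset(b, 0, n); }
static void sketch_bzero_sse(void *b, size_t n) { memset(b, 0, n); }
static void sketch_bzero_sse2(void *b, size_t n) { memset(b, 0, n); }

struct sketch_variant {
	unsigned	required;	/* feature bits this variant needs */
	sketch_bzero_t	*fn;
	const char	*name;
};

/* Ordered from most to least specific; the last entry always matches. */
static const struct sketch_variant sketch_variants[] = {
	{ SKETCH_FEAT_SSE | SKETCH_FEAT_SSE2, sketch_bzero_sse2, "sse2" },
	{ SKETCH_FEAT_SSE, sketch_bzero_sse, "sse" },
	{ 0, sketch_bzero_generic, "generic" },
};

/* Callers go through this pointer only. */
static sketch_bzero_t *sketch_bzero_vec = sketch_bzero_generic;

/* Select the first variant whose requirements are all present. */
static const char *
sketch_bzero_select(unsigned features)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_variants) / sizeof(sketch_variants[0]);
	    i++) {
		const struct sketch_variant *v = &sketch_variants[i];

		if ((features & v->required) == v->required) {
			assert(v->fn != NULL);
			sketch_bzero_vec = v->fn;
			return (v->name);
		}
	}
	return (NULL);	/* not reached: "generic" requires nothing */
}
```

This does not reduce the number of variants that need writing and
testing; it only confines the choice to one table.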

Received on Thu Mar 27 2003 - 23:35:34 UTC