Checksum/copy (was: Re: cvs commit: src/sys/netinet ip_output.c)

Bruce Evans bde at
Thu Mar 27 22:41:47 PST 2003

On Fri, 28 Mar 2003, Greg 'groggy' Lehey wrote:

> On Thursday, 27 March 2003 at 19:07:15 +1100, Bruce Evans wrote:
> > On Wed, 26 Mar 2003, Mike Silbersack wrote:
> >> On my Mobile Celeron, a for (i = 0; i < max; i++) array[i]=0 runs
> >> faster than bzero.  :(
> >
> > Saved data from my benchmarks show that bzero (stosl) was OK on
> > 486's, poor on original Pentiums, OK on K6-1's, best by far on
> > second generation Celerons (ones like PII) and poor on Athlon XP's
> > (but not as relatively bad as on original Pentiums).
> What happened to i686_bzero?  I was sure that years ago one existed,
> but now all machines I use (i686 class) all use generic_bzero.

I nuked it in:

RCS file: /home/ncvs/src/sys/i386/i386/support.s,v
Working file: support.s
head: 1.93
revision 1.40
date: 1996/10/09 18:16:17;  author: bde;  state: Exp;  lines: +291 -60
Removed old, dead i586_bzero() and i686_bzero().  Read-before-write is
usually bad for i586's.  It doubles the memory traffic unless the data
is already cached, and data is (or should be) very rarely cached for
large bzero()s (the system should prefer uncached pages for cleaning),
and the amount of data handled by small bzero()s is relatively small
in the kernel.

"i686" basically means "second generation Pentium" (PentiumPro/PII/Celeron)
(later x86's are mostly handled better using CPU features instead of
a 1-dimensional class number).  Hand-"optimized" bzero's are especially
pessimal for this class of CPU.  The log message is mainly about
PentiumPro's.  Later models aren't as bad.  E.g. on a Celeron 400 MHz
overclocked to 6*75MHz:

[bzero times, 4K buffer]
zero0: 2169427140 B/s (  46095 us) (stosl)
zero1: 1178408485 B/s (  84860 us) (unroll 16)
zero2: 1180481213 B/s (  84711 us) (unroll 16 preallocate)
zero3: 1564647390 B/s (  63912 us) (unroll 32)
zero4: 1287279636 B/s (  77683 us) (unroll 32 preallocate)
zero5: 1482553913 B/s (  67451 us) (unroll 64)
zero6: 1469029028 B/s (  68072 us) (unroll 64 preallocate)
zero7: 1774492387 B/s (  56354 us) (fstl)
zero8:  888397008 B/s ( 112562 us) (movl)
zero9: 1179409162 B/s (  84788 us) (unroll 8)
zeroA: 2125122067 B/s (  47056 us) (generic_bzero)
zeroB: 1575245644 B/s (  63482 us) (i486_bzero)
zeroC:  960381695 B/s ( 104125 us) (i586_bzero)
zeroD: 1289637018 B/s (  77541 us) (i686_pagezero)

[bzero times, 8M buffer]
zero0:  140685510 B/s ( 698750 us) (stosl)
zero1:  141949085 B/s ( 692530 us) (unroll 16)
zero2:  142107500 B/s ( 691758 us) (unroll 16 preallocate)
zero3:  141911380 B/s ( 692714 us) (unroll 32)
zero4:  141969995 B/s ( 692428 us) (unroll 32 preallocate)
zero5:  141955645 B/s ( 692498 us) (unroll 64)
zero6:  141986195 B/s ( 692349 us) (unroll 64 preallocate)
zero7:  141935968 B/s ( 692594 us) (fstl)
zero8:  142159904 B/s ( 691503 us) (movl)
zero9:  142006295 B/s ( 692251 us) (unroll 8)
zeroA:  140841519 B/s ( 697976 us) (generic_bzero)
zeroB:  142013476 B/s ( 692216 us) (i486_bzero)
zeroC:  141868782 B/s ( 692922 us) (i586_bzero)
zeroD:  360165750 B/s ( 272941 us) (i686_pagezero)
zeroE:  140712494 B/s ( 698616 us) (bzero (stosl))

The best hand-"optimized" versions using integer registers are only about
12.5% slower than generic_bzero for buffers that fit in the L1 cache, and
all bzero methods except i686_pagezero() have the same speed for buffers
that don't fit in any cache.  i686_pagezero() is faster if the buffer is
already all zeros and slower otherwise (the above time is for all zeros).
The version of i686_pagezero() in the kernel is especially pessimal (see
another reply in this thread).

I didn't try hard to use MMX registers.  In simple tests, 64-bit memory
accesses provided no benefits at least in the uncached case, which is
probably for the same reason that 64-bit memory accesses via the FPU
provide no benefits.  I believe all writes go through write buffers in
the CPU, and these worked poorly on PentiumPro's and mediocrely on
PII/Celeron.  They work much better on more modern CPUs, as they must
to keep up with increases in memory bandwidth.

Write bandwidth for the PentiumPro family is also limited by
read-before-write.  This more than halves the write bandwidth for large
cache-busting bzero's like the 8MB ones above.  The halving can be
seen in the above benchmarks.  The main memory bandwidth is approx
360MB/sec on this system, and this is achieved by i686_pagezero() since
it just reads the buffer to verify that it is all zeros (optimized
read bandwidth tests that just throw the data away run at the same
speed).  Read-before-write halves the maximum write bandwidth to
180MB/sec.  In practice, the write bandwidth is limited to 140MB/sec
(slower than on Pentium I systems with a main memory bandwidth of
180MB/sec! -- these can get near the max for both read and write).

Benefits from SSE for bzeroing and bcopying, if any, would probably
come more from bypassing caches and/or not doing read-before-write
(SSE instructions give control over this) than from operating on wider
data.  I'm dubious about practical benefits.  Obviously it is not useful
to bust the cache when bzeroing 8MB of data, but real programs and OS's
mostly operate on smaller buffers.  Keeping bzero'ed data out of the
(L[1-2]) caches is actively harmful if the data will be used soon, and
it is generally hard to predict whether it will be.
