Checksum/copy (was: Re: cvs commit: src/sys/netinet ip_output.c)

Bruce Evans bde at zeta.org.au
Sun Mar 30 22:31:55 PST 2003


On Fri, 28 Mar 2003, Peter Jeremy wrote:

> On Fri, Mar 28, 2003 at 05:04:21PM +1100, Bruce Evans wrote:
> >"i686" basically means "second generation Pentium" (PentiumPro/PII/Celeron)
> >(later x86's are mostly handled better using CPU features instead of
> >a 1-dimensional class number).  Hand-"optimized" bzero's are especially
> >pessimal for this class of CPU.
>
> That matches my memory of my test results as well.  The increasing
> clock multipliers mean that it doesn't matter how slow "rep stosl" is
> in clock cycle terms - main memory is always going to be slower.

There are still some surprising differences (see attached timings for
some examples), but I think they are more for how the code affects
caches and write buffers.  The exact behaviour is very machine-dependent
so it is hard to optimize in general-purpose production code.

> >Benefits from SSE for bzeroing and bcopying, if any, would probably
> >come more from bypassing caches and/or not doing read-before-write
> >(SSE instructions give control over this) than from operating on wider
> >data.  I'm dubious about practical benefits.  Obviously it is not useful
> >to bust the cache when bzeroing 8MB of data, but real programs and OS's
> >mostly operate on smaller buffers.  It is negatively useful not to put
> >bzero'ed data in the (L[1-2]) cache if the data will be used soon, and
> >generally hard to predict if it will be used soon.
>
> Unless Intel have fixed the P4 caches, you definitely don't want to
> use the L1 cache for page sized bzero/bcopy.

Athlons have many similarities to Celerons here.

> Avoiding read-before-write should roughly double bzero speed and give
> you about 50% speedup on bcopy - this should be worthwhile.  Caching

It actually gives a 66% speedup for bzero on my AthlonXP.  For some
reason, at least for very large buffers, read accesses through the
cache can use only 1/2 of the memory bandwidth, and write accesses can
use only 1/3 of it (and this is after tuning for bank organization --
I get a 33% speedup for the write benchmark and 0% for real work by
including a bit for the bank number in the page color in a deterministic
way, and almost as much for including the bit in a random way).  Using
SSE instructions (mainly movntps) gives the full bandwidth for at
least bzero for large buffers (3x better), but it reduces bandwidth
for small already cached buffers (more than 3x worse):

%%%
Times on an AthlonXP-1600 overclocked by 146/133, with 1024MB of PC2700
memory and all memory timings tuned as low as possible (CAS2, but 2T cmds):

4K buffer (almost always cached):
zero0: 5885206293 B/s (6959824 us) (stosl)
zero1: 7842053086 B/s (5223122 us) (unroll 16)
zero2: 7049051312 B/s (5810711 us) (unroll 16 preallocate)
zero3: 9377720907 B/s (4367799 us) (unroll 32)
zero4: 7803040290 B/s (5249236 us) (unroll 32 preallocate)
zero5: 9802682719 B/s (4178448 us) (unroll 64)
zero6: 8432350664 B/s (4857483 us) (unroll 64 preallocate)
zero7: 5957318200 B/s (6875577 us) (fstl)
zero8: 3007928933 B/s (13617343 us) (movl)
zero9: 4011348905 B/s (10211029 us) (unroll 8)
zeroA: 5835984056 B/s (7018525 us) (generic_bzero)
zeroB: 8334888325 B/s (4914283 us) (i486_bzero)
zeroC: 2545022700 B/s (16094159 us) (i586_bzero)
zeroD: 7650723550 B/s (5353742 us) (i686_pagezero)
zeroE: 5755535593 B/s (7116627 us) (bzero (stosl))
zeroF: 2282741753 B/s (17943335 us) (movntps)

movntps is the SSE method.  It's significantly slower for this case.

400MB buffer (never cached):
zero0:  714045391 B/s ( 573633 us) (stosl)
zero1:  705180737 B/s ( 580844 us) (unroll 16)
zero2:  670897998 B/s ( 610525 us) (unroll 16 preallocate)
zero3:  690538809 B/s ( 593160 us) (unroll 32)
zero4:  661854647 B/s ( 618867 us) (unroll 32 preallocate)
zero5:  670525682 B/s ( 610864 us) (unroll 64)
zero6:  663334877 B/s ( 617486 us) (unroll 64 preallocate)
zero7:  781025057 B/s ( 524439 us) (fstl)
zero8:  608491547 B/s ( 673140 us) (movl)
zero9:  696489665 B/s ( 588092 us) (unroll 8)
zeroA:  713958268 B/s ( 573703 us) (generic_bzero)
zeroB:  689875870 B/s ( 593730 us) (i486_bzero)
zeroC:  721477338 B/s ( 567724 us) (i586_bzero)
zeroD:  746453616 B/s ( 548728 us) (i686_pagezero)
zeroE:  714016763 B/s ( 573656 us) (bzero (stosl))
zeroF: 2240602162 B/s ( 182808 us) (movntps)

Now movntps is about 3 times faster than everything else.  This is the
first time I've seen a magic number near 2100 for memory named with a
magic number near 2100.  This machine used to use PC2100 with the
same timing, but it developed errors (burnt out?).  Now it has PC2700
memory so it is within spec and can reasonably be expected to run a little
faster than PC2100 should.
%%%

> is more dubious - placing a slow-zeroed page in L1 cache is very
> probably a waste of time.  At least part of an on-demand zeroed page
> is likely to be used in the near future - but probably not all of it.
> Copying is even harder to predict - at least one word of a COW page is
> going to be used immediately, but bcopy() won't be able to tell which
> word.

For makeworld, using movntps in i686_pagezero() gives a whole 14 seconds
(0.7%) improvement:

%%%
Before:

bde-current with ... + KSEIII + idlezero_enable + pmap - even coloring
async mounted /c
my-Makefile
after perl removal and new gcc and ufs2 and aout utilities removal
with 2 fairly new drives
1532 MHz AthlonXP 1600
1024MB
make catches SIGCHLD
i686_bzero not used
--------------------------------------------------------------
>>> elf make world completed on Mon Mar 31 02:10:47 EST 2003
                    (started on Mon Mar 31 01:38:24 EST 2003)
--------------------------------------------------------------
     1943.14 real      1575.25 user       218.88 sys
     40204  maximum resident set size
      2166  average shared memory size
      1988  average unshared data size
       128  average unshared stack size
  13039568  page reclaims
     11639  page faults
         0  swaps
     20008  block input operations
      6265  block output operations
         0  messages sent
         0  messages received
     33037  signals received
    207588  voluntary context switches
    518358  involuntary context switches

After:

bde-current with ... + KSEIII + idlezero_enable + pmap - even coloring
async mounted /c
my-Makefile
after perl removal and new gcc and ufs2 and aout utilities removal
with 2 fairly new drives
1532 MHz AthlonXP 1600
1024MB
make catches SIGCHLD
i686_bzero used and replaced by one that uses SSE (movntps)
--------------------------------------------------------------
>>> elf make world completed on Mon Mar 31 02:46:43 EST 2003
                    (started on Mon Mar 31 02:14:35 EST 2003)
--------------------------------------------------------------
     1929.02 real      1576.67 user       205.30 sys
     40204  maximum resident set size
      2166  average shared memory size
      1990  average unshared data size
       128  average unshared stack size
  13039590  page reclaims
     11645  page faults
         0  swaps
     20014  block input operations
      6416  block output operations
         0  messages sent
         0  messages received
     33037  signals received
    208376  voluntary context switches
    512820  involuntary context switches
%%%

Whether 14 seconds is a lot depends on your viewpoint.  It is a lot
out of the kernel time of 218 seconds considering that only one function
was optimized and some of the optimization doesn't affect the real
time since it is done at idle priority in pagezero.  pagezero's time
was reduced from 57 seconds to 28 seconds.

Code for the above (no warranties; only works for !SMP and I didn't
check that the FP context switching is safe...):

%%%
Index: support.s
===================================================================
RCS file: /home/ncvs/src/sys/i386/i386/support.s,v
retrieving revision 1.93
diff -u -2 -r1.93 support.s
--- support.s	22 Sep 2002 04:45:20 -0000	1.93
+++ support.s	31 Mar 2003 02:37:02 -0000
@@ -66,4 +68,9 @@
 	.space	3
 #endif
+#define	HACKISH_SSE_PAGEZERO
+#ifdef HACKISH_SSE_PAGEZERO
+zero:
+	.long	0, 0, 0, 0
+#endif

 	.text
@@ -333,70 +342,101 @@
 	movl	%edx,%edi
 	xorl	%eax,%eax
-	shrl	$2,%ecx
 	cld
+	shrl	$2,%ecx
 	rep
 	stosl
 	movl	12(%esp),%ecx
 	andl	$3,%ecx
-	jne	1f
-	popl	%edi
-	ret
-
-1:
+	je	1f
 	rep
 	stosb
+1:
 	popl	%edi
 	ret
-#endif /* I586_CPU && defined(DEV_NPX) */
+#endif /* I586_CPU && DEV_NPX */

+#ifdef I686_CPU
 ENTRY(i686_pagezero)
-	pushl	%edi
-	pushl	%ebx
+	movl	4(%esp),%edx
+  	movl	$PAGE_SIZE, %ecx

-	movl	12(%esp), %edi
-	movl	$1024, %ecx
-	cld
+#ifdef HACKISH_SSE_PAGEZERO
+	pushfl
+	cli
+	movl	%cr0,%eax
+	clts
+	subl	$16,%esp
+	movups	%xmm0,(%esp)
+	movups	zero,%xmm0
+	ALIGN_TEXT
+1:
+	movntps	%xmm0,(%edx)
+	movntps	%xmm0,16(%edx)
+	movntps	%xmm0,32(%edx)
+	movntps	%xmm0,48(%edx)
+	addl	$64,%edx
+	subl	$64,%ecx
+	jne	1b
+	movups	(%esp),%xmm0
+	addl	$16,%esp
+	movl	%eax,%cr0
+	popfl
+	ret
+2:
+#endif /* HACKISH_SSE_PAGEZERO */

 	ALIGN_TEXT
 1:
-	xorl	%eax, %eax
-	repe
-	scasl
-	jnz	2f
+	movl	(%edx), %eax
+	orl	4(%edx), %eax
+	orl	8(%edx), %eax
+	orl	12(%edx), %eax
+	orl	16(%edx), %eax
+	orl	20(%edx), %eax
+	orl	24(%edx), %eax
+	orl	28(%edx), %eax
+	jne	2f
+	movl	32(%edx), %eax
+	orl	36(%edx), %eax
+	orl	40(%edx), %eax
+	orl	44(%edx), %eax
+	orl	48(%edx), %eax
+	orl	52(%edx), %eax
+	orl	56(%edx), %eax
+	orl	60(%edx), %eax
+	jne	3f
+
+	addl	$64, %edx
+	subl	$64, %ecx
+	jne	1b

-	popl	%ebx
-	popl	%edi
 	ret

 	ALIGN_TEXT
-
 2:
-	incl	%ecx
-	subl	$4, %edi
-
-	movl	%ecx, %edx
-	cmpl	$16, %ecx
-
-	jge	3f
-
-	movl	%edi, %ebx
-	andl	$0x3f, %ebx
-	shrl	%ebx
-	shrl	%ebx
-	movl	$16, %ecx
-	subl	%ebx, %ecx
-
+	movl	$0, (%edx)
+	movl	$0, 4(%edx)
+	movl	$0, 8(%edx)
+	movl	$0, 12(%edx)
+	movl	$0, 16(%edx)
+	movl	$0, 20(%edx)
+	movl	$0, 24(%edx)
+	movl	$0, 28(%edx)
 3:
-	subl	%ecx, %edx
-	rep
-	stosl
-
-	movl	%edx, %ecx
-	testl	%edx, %edx
-	jnz	1b
+	movl	$0, 32(%edx)
+	movl	$0, 36(%edx)
+	movl	$0, 40(%edx)
+	movl	$0, 44(%edx)
+	movl	$0, 48(%edx)
+	movl	$0, 52(%edx)
+	movl	$0, 56(%edx)
+	movl	$0, 60(%edx)
+
+	addl	$64, %edx
+	subl	$64, %ecx
+	jne	1b

-	popl	%ebx
-	popl	%edi
 	ret
+#endif /* I686_CPU */

 /* fillw(pat, base, cnt) */
%%%
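
For experimenting outside the kernel, the movntps loop in the patch
corresponds roughly to this C sketch with SSE intrinsics (a sketch
under the same assumptions: the buffer is 16-byte aligned and the
length is a multiple of 64; sse_bzero is an illustrative name, and
the FPU context save/restore that the kernel version needs is
omitted):

```c
#include <stddef.h>
#include <xmmintrin.h>	/* SSE: _mm_setzero_ps, _mm_stream_ps, _mm_sfence */

/*
 * Zero a 16-byte-aligned buffer with non-temporal stores.  Each
 * _mm_stream_ps (movntps) pushes 16 bytes toward memory without
 * reading the cache line in first and without displacing cached
 * data, which is where the ~3x win for large uncached buffers
 * comes from.
 */
static void
sse_bzero(void *buf, size_t len)
{
	__m128 zero = _mm_setzero_ps();
	float *p = buf;
	size_t i;

	for (i = 0; i < len / 16; i++, p += 4)
		_mm_stream_ps(p, zero);
	_mm_sfence();		/* order the weakly-ordered stores */
}
```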

> I don't know how much control SSE gives you over caching - is it just
> cache/no-cache, or can you control L1+L2/L2-only/none?  In the latter
> case, telling bzero and bcopy destination to use L2-only is probably a
> reasonable compromise.  The bcopy source should probably not evict
> cache data - if data is cached, use it, otherwise fetch from main
> memory and bypass caches.

There seems to be control in individual instructions for reads, but only
a complete bypass for writes (movntps from an SSE register to memory).
Writing can still be tuned with explicit reads or prefetches after
writes.  I've only looked briefly at 3-year-old Intel manuals.
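
The per-read control shows up in C as the four SSE prefetch hints
(a sketch using the standard intrinsics; exactly which cache levels
each hint maps to is chip-dependent):

```c
#include <xmmintrin.h>

/*
 * Reads can be steered toward particular cache levels with prefetch
 * hints; writes only get the all-or-nothing bypass via movntps
 * (_mm_stream_ps).  These are hints, so they never fault.
 */
static void
prefetch_hints(const char *p)
{
	_mm_prefetch(p, _MM_HINT_T0);	/* fetch into all cache levels */
	_mm_prefetch(p, _MM_HINT_T1);	/* skip L1; L2 and below */
	_mm_prefetch(p, _MM_HINT_T2);	/* outer cache levels only */
	_mm_prefetch(p, _MM_HINT_NTA);	/* minimize cache pollution */
}
```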

> Finally, how many different bcopy/bzero variants do we want?  A

I don't want many :-).

Bruce


More information about the cvs-src mailing list