Re: Checksum/copy

From: Bruce Evans <bde_at_zeta.org.au>
Date: Fri, 28 Mar 2003 18:44:06 +1100 (EST)
On Thu, 27 Mar 2003, Dag-Erling Smørgrav wrote:

> David Malone <dwmalone_at_maths.tcd.ie> writes:
> > On Thu, Mar 27, 2003 at 09:57:35AM +0100, des_at_ofug.org wrote:
> > > Might it be a good idea to have separate b{copy,zero} implementations
> > > for special purposes like pmap_{copy,zero}_page?
> > We do have a i686_pagezero already, which seems to be used in
> > pmap_zero_page - I guess it may not be well tuned to modern processors,
> > as it is almost 5 years old.

Indeed.

> i686_pagezero uses 'rep stosl' after an initial 'rep scasl' to check
> if the page was already zero (which is a pessimization unless we zero
> a lot of pages that are already zeroed).  SSE can do far better than
> that.

Even integer instructions can do significantly better than scasl/stosl
on "686"s (PentiumPro and similar CPUs).  Implementation bugs in
i686_pagezero() include:
- scasl is one of the slowest ways to read memory, at least on old
  Celerons (I believe PPro's have similar timing for string operations).
  It is a bit slower than lodsl, which is about 3.5 times slower than
  a lightly unrolled movl loop for the L1-cached case and about 2 times
  slower for the uncached case.
- the code apparently intends to check 16 words at a time, but due to
  getting a comparison backwards it actually zeros everything else as
  soon as it finds a nonzero word, with extra obfuscations and
  pessimizations when it is within 16 words of the end.
Implementation non-bugs include using stosl to do the zeroing.  Unlike
lodsl and scasl, stosl is actually useful for its intended purpose on
"686"s.

Instead of fixing the comparison and any other logic bugs, I rewrote the
function using orl instead of scasl, and simpler logic (ignore the changes
for the previous function in the same hunk).

%%%
Index: support.s
===================================================================
RCS file: /home/ncvs/src/sys/i386/i386/support.s,v
retrieving revision 1.93
diff -u -2 -r1.93 support.s
--- support.s	22 Sep 2002 04:45:20 -0000	1.93
+++ support.s	22 Sep 2002 09:51:27 -0000
@@ -333,70 +337,58 @@
 	movl	%edx,%edi
 	xorl	%eax,%eax
-	shrl	$2,%ecx
 	cld
+	shrl	$2,%ecx
 	rep
 	stosl
 	movl	12(%esp),%ecx
 	andl	$3,%ecx
-	jne	1f
-	popl	%edi
-	ret
-
-1:
+	je	1f
 	rep
 	stosb
+1:
 	popl	%edi
 	ret
-#endif /* I586_CPU && defined(DEV_NPX) */
+#endif /* I586_CPU && DEV_NPX */

+#ifdef I686_CPU
 ENTRY(i686_pagezero)
-	pushl	%edi
-	pushl	%ebx
-
-	movl	12(%esp), %edi
+	movl	4(%esp), %edx
 	movl	$1024, %ecx
-	cld

 	ALIGN_TEXT
 1:
-	xorl	%eax, %eax
-	repe
-	scasl
-	jnz	2f
+	movl	(%edx), %eax
+	orl	4(%edx), %eax
+	orl	8(%edx), %eax
+	orl	12(%edx), %eax
+	orl	16(%edx), %eax
+	orl	20(%edx), %eax
+	orl	24(%edx), %eax
+	orl	28(%edx), %eax
+	jne	2f
+
+	addl	$32, %edx
+	subl	$32/4, %ecx
+	jne	1b

-	popl	%ebx
-	popl	%edi
 	ret

 	ALIGN_TEXT
-
 2:
-	incl	%ecx
-	subl	$4, %edi
+	movl	$0, (%edx)
+	movl	$0, 4(%edx)
+	movl	$0, 8(%edx)
+	movl	$0, 12(%edx)
+	movl	$0, 16(%edx)
+	movl	$0, 20(%edx)
+	movl	$0, 24(%edx)
+	movl	$0, 28(%edx)
+
+	addl	$32, %edx
+	subl	$32/4, %ecx
+	jne	1b

-	movl	%ecx, %edx
-	cmpl	$16, %ecx
-
-	jge	3f
-
-	movl	%edi, %ebx
-	andl	$0x3f, %ebx
-	shrl	%ebx
-	shrl	%ebx
-	movl	$16, %ecx
-	subl	%ebx, %ecx
-
-3:
-	subl	%ecx, %edx
-	rep
-	stosl
-
-	movl	%edx, %ecx
-	testl	%edx, %edx
-	jnz	1b
-
-	popl	%ebx
-	popl	%edi
 	ret
+#endif /* I686_CPU */

 /* fillw(pat, base, cnt) */
%%%
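
For anyone who doesn't read AT&T assembly, the logic of the new
i686_pagezero() can be sketched in C.  This is only an illustrative model
under stated assumptions: the name pagezero_model(), the two macros and
the returned count of rewritten chunks are made up for the example and
are not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS	1024	/* 4096-byte page / 4-byte words, as in the patch */
#define CHUNK_WORDS	8	/* 32 bytes examined per loop iteration */

/*
 * Hypothetical C model of the rewritten i686_pagezero(); not kernel code.
 * Returns the number of 32-byte chunks that had to be rewritten, which the
 * assembly version does not do (added here only to make the effect visible).
 */
static size_t
pagezero_model(uint32_t *page)
{
	size_t i, j, rewritten = 0;

	for (i = 0; i < PAGE_WORDS; i += CHUNK_WORDS) {
		uint32_t acc = 0;

		/* OR the 8 words together: acc is nonzero iff any word is. */
		for (j = 0; j < CHUNK_WORDS; j++)
			acc |= page[i + j];

		/* Only dirty chunks are stored to; clean ones are just read. */
		if (acc != 0) {
			for (j = 0; j < CHUNK_WORDS; j++)
				page[i + j] = 0;
			rewritten++;
		}
	}
	return (rewritten);
}
```

The point of the structure is that an already-zero chunk costs only reads,
and a dirty chunk costs one pass of reads plus 8 stores, with no special
cases near the end of the page.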

Implementation notes: using orl might not be best (due to pipelining issues).
Using movl instead of stosl might not be best (I used it to simplify the
logic and reduce initialization overheads).

This hasn't been tested recently.  I've had it disabled in pmap.c for
as long as I can remember, to prepare for complete testing (my pmap.c
just uses bzero()).

The importance of optimizing this function can be gauged by the number of
people who have noticed that it never worked right and the number of
commits to make it work right.

Zeroing pages is not completely unimportant, however.  The pagezero task
takes about 5% of the time for a makeworld here.  Most of this time is
"free" here since pagezero can run while the system is waiting for disks,
and I don't run much else while doing makeworld benchmarks.  However, it
is not free time under different/heavier loads.

Bruce
Received on Thu Mar 27 2003 - 23:44:29 UTC