svn commit: r303583 - head/sys/amd64/amd64

Bruce Evans brde at optusnet.com.au
Sun Jul 31 13:11:39 UTC 2016


On Sun, 31 Jul 2016, Mateusz Guzik wrote:

> Log:
>  amd64: implement pagezero using rep stos
>
>  The current implementation uses non-temporal writes. This turns out to
>  be detrimental to performance if the page is used shortly after, which
>  is the typical case with page faults.
>
>  Switch to rep stos.
>
>  Reviewed by:	kib
>  MFC after:	1 week

This is very MD.  Non-temporal writes are much faster on my old Turion2
amd64 system, especially when their pipelining is fixed.  "rep stosb"
is the best method on Haswell, but it is only slightly better so the
default should remain nontemporal writes.  I use sysctls to select
the best function and don't have automatic speed detection.  My old
speed-detection tests for selecting the npx method turned out to be
fragile.

> Modified: head/sys/amd64/amd64/support.S
> ==============================================================================
> --- head/sys/amd64/amd64/support.S	Sun Jul 31 10:37:09 2016	(r303582)
> +++ head/sys/amd64/amd64/support.S	Sun Jul 31 11:34:08 2016	(r303583)
> @@ -64,17 +64,10 @@ END(bzero)
> /* Address: %rdi */
> ENTRY(pagezero)
> 	PUSH_FRAME_POINTER
> -	movq	$-PAGE_SIZE,%rdx
> -	subq	%rdx,%rdi
> +	movq	$PAGE_SIZE/8,%rcx
> 	xorl	%eax,%eax
> -1:
> -	movnti	%rax,(%rdi,%rdx)
> -	movnti	%rax,8(%rdi,%rdx)
> -	movnti	%rax,16(%rdi,%rdx)
> -	movnti	%rax,24(%rdi,%rdx)
> -	addq	$32,%rdx
> -	jne	1b
> -	sfence
> +	rep
> +	stosq
> 	POP_FRAME_POINTER
> 	ret
> END(pagezero)

This shouldn't be a special function.  Just use bzero().  The compiler
might inline bzero() but shouldn't since this is very MD.

On Haswell, "rep stos" takes about 25 cycles to start up, and the function
call overhead is in the noise.  25 cycles is a lot.  Haswell can move
32 bytes/cycle from L2 to L2, so it misses moving 800 bytes or 1/5 of a
page in its startup overhead.  Oops, that is for "rep movs".  "rep stos"
is similar.
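
Spelled out, the startup-overhead arithmetic is just this (a
back-of-the-envelope sketch using the numbers above; the program only
does the multiplication):

#include <stdio.h>

int
main(void)
{
	const int startup_cycles = 25;	/* "rep stos" startup on Haswell */
	const int bytes_per_cycle = 32;	/* sustained store bandwidth */
	const int page_size = 4096;
	int lost;

	lost = startup_cycles * bytes_per_cycle;	/* 800 bytes */
	printf("not stored during startup: %d bytes (%.0f%% of a page)\n",
	    lost, 100.0 * lost / page_size);		/* ~20% = 1/5 */
	return (0);
}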

Here are my patches.

X diff -c2 ./amd64/amd64/pmap.c~ ./amd64/amd64/pmap.c
X *** ./amd64/amd64/pmap.c~	Sat Jun 25 09:07:20 2016
X --- ./amd64/amd64/pmap.c	Sat Jun 25 09:07:09 2016
X ***************
X *** 353,356 ****
X --- 353,364 ----
X       &pg_ps_enabled, 0, "Are large page mappings enabled?");
X 
X + static int pagecopy_memcpy;
X + SYSCTL_INT(_vm_pmap, OID_AUTO, pagecopy_memcpy, CTLFLAG_RW,
X +     &pagecopy_memcpy, 0, "Use memcpy for pagecopy?");
X + 
X + static int pagezero_bzero;
X + SYSCTL_INT(_vm_pmap, OID_AUTO, pagezero_bzero, CTLFLAG_RW,
X +     &pagezero_bzero, 0, "Use bzero for pagezero?");
X + 
X   #define	PAT_INDEX_SIZE	8
X   static int pat_index[PAT_INDEX_SIZE];	/* cache mode to PAT index conversion */

I don't enable pagecopy_memcpy because the nontemporal method is best for
Haswell.

X ***************
X *** 5154,5159 ****
X 
X   /*
X !  *	pmap_zero_page zeros the specified hardware page by mapping
X !  *	the page into KVM and using bzero to clear its contents.
X    */
X   void
X --- 5162,5166 ----
X 
X   /*
X !  * Zero the specified hardware page.
X    */
X   void

Style fix.  The comment is banal.  It gave too many implementation
details about another implementation, none of which apply here.  The
comment should be removed, but I made it merely banal to minimise
diffs.

X ***************
X *** 5162,5173 ****
X   	vm_offset_t va = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m));
X 
X ! 	pagezero((void *)va);
X   }
X 
X   /*
X !  *	pmap_zero_page_area zeros the specified hardware page by mapping 
X !  *	the page into KVM and using bzero to clear its contents.
X !  *
X !  *	off and size may not cover an area beyond a single hardware page.
X    */
X   void
X --- 5169,5181 ----
X   	vm_offset_t va = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m));
X 
X ! 	if (pagezero_bzero)
X ! 		bzero((void *)va, PAGE_SIZE);
X ! 	else
X ! 		pagezero((void *)va);
X   }
X 
X   /*
X !  * Zero an an area within a single hardware page.  off and size must not
X !  * cover an area beyond a single hardware page.
X    */
X   void
X ***************
X *** 5208,5212 ****
X   	vm_offset_t dst = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(mdst));
X 
X ! 	pagecopy((void *)src, (void *)dst);
X   }
X 
X --- 5216,5223 ----
X   	vm_offset_t dst = PHYS_TO_DMAP(VM_PAGE_TO_PHYS(mdst));
X 
X ! 	if (pagecopy_memcpy)
X ! 		memcpy((void *)dst, (void *)src, PAGE_SIZE);
X ! 	else
X ! 		pagecopy((void *)src, (void *)dst);
X   }
X

All the comments were wrong.

X diff -c2 ./amd64/amd64/support.S~ ./amd64/amd64/support.S
X *** ./amd64/amd64/support.S~	Wed Feb 24 22:35:30 2016
X --- ./amd64/amd64/support.S	Mon Mar 28 10:43:37 2016
X ***************
X *** 68,71 ****
X --- 68,73 ----
X   	subq	%rdx,%rdi
X   	xorl	%eax,%eax
X + 	jmp	1f
X + 	.p2align 5,0x90
X   1:
X   	movnti	%rax,(%rdi,%rdx)

The loop was misaligned.  See the i386 version for a comment in the
code and more details.

X diff -c2 ./i386/i386/pmap.c~ ./i386/i386/pmap.c
X *** ./i386/i386/pmap.c~	Tue Apr 19 05:23:58 2016
X --- ./i386/i386/pmap.c	Tue Apr 19 05:24:17 2016
X ***************
X *** 226,229 ****
X --- 226,233 ----
X       &pg_ps_enabled, 0, "Are large page mappings enabled?");
X 
X + static int pagezero_bzero;
X + SYSCTL_INT(_vm_pmap, OID_AUTO, pagezero_bzero, CTLFLAG_RW,
X +     &pagezero_bzero, 0, "Use bzero for pagezero?");
X + 
X   #define	PAT_INDEX_SIZE	8
X   static int pat_index[PAT_INDEX_SIZE];	/* cache mode to PAT index conversion */

I don't bother implementing pagecopy_memcpy on i386.  It has to use
"rep movsl", and that is not so good on older systems with not so
good store buffers and no "fast string functions".  On Haswell, the
"rep stos[bwlq]" variants are equally fast, but no faster than the
nontemporal method.

In other trees, I implement pagecopy_sse for i386.  This gives differences
that are too small to measure (it uses movntps instead of movnti so that
it works on SSE1 and gives 16-byte accesses).
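
A minimal userland sketch of that pagecopy_sse idea, using the SSE1
intrinsics (the sketch is mine; the real version is in assembly, and a
kernel version must also arrange to save and restore the FPU/SSE state):

#include <xmmintrin.h>	/* SSE1: _mm_load_ps, _mm_stream_ps, _mm_sfence */

#define	PAGE_SIZE	4096

static void
pagecopy_sse(const void *src, void *dst)
{
	const float *s = src;
	float *d = dst;
	int i;

	/*
	 * movntps needs only SSE1 and stores 16 bytes at a time;
	 * movnti needs SSE2 and stores only 4 bytes on i386.  Both
	 * pages are page-aligned, so the aligned forms are safe.
	 */
	for (i = 0; i < PAGE_SIZE / 4; i += 4)
		_mm_stream_ps(&d[i], _mm_load_ps(&s[i]));
	_mm_sfence();	/* order the non-temporal stores */
}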

X ***************
X *** 4178,4181 ****
X --- 4182,4189 ----
X   pagezero(void *page)
X   {
X + 	if (pagezero_bzero) {
X + 		bzero(page, PAGE_SIZE);
X + 		return;
X + 	}
X   #if defined(I686_CPU)
X   	if (cpu_class == CPUCLASS_686) {
X diff -c2 ./i386/i386/support.s~ ./i386/i386/support.s
X *** ./i386/i386/support.s~	Sat Jan 23 05:14:15 2016
X --- ./i386/i386/support.s	Mon Mar 28 10:39:57 2016
X ***************
X *** 70,76 ****
X   	addl	$4096,%eax
X   	xor	%ebx,%ebx
X   1:
X   	movnti	%ebx,(%ecx)
X ! 	addl	$4,%ecx
X   	cmpl	%ecx,%eax
X   	jne	1b
X --- 70,83 ----
X   	addl	$4096,%eax
X   	xor	%ebx,%ebx
X + 	jmp	1f
X + 	/*
X + 	 * The loop takes 14 bytes.  Ensure that it doesn't cross a 16-byte
X + 	 * cache line.
X + 	 */
X + 	.p2align 4,0x90
X   1:
X   	movnti	%ebx,(%ecx)
X ! 	movnti	%ebx,4(%ecx)
X ! 	addl	$8,%ecx
X   	cmpl	%ecx,%eax
X   	jne	1b

Misalignment of this loop made it almost twice as slow on old Turion2 with
slow DDR2 memory.  It made no difference on Haswell.  I added an extra
movnti, but that makes little or no difference.  2 more movnti's wouldn't
fit in a 16-byte cache line, so they are slower unless even more care is
taken with alignment (or, with less care, 4 movnti's with misalignment are
at least twice as slow as 1 with alignment).
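
For the curious, the 14 bytes in the patch's comment come from the
usual instruction encodings (my tally, not part of the commit):

	movnti	%ebx,(%ecx)	/* 0f c3 19	 3 bytes */
	movnti	%ebx,4(%ecx)	/* 0f c3 59 04	 4 bytes */
	addl	$8,%ecx		/* 83 c1 08	 3 bytes */
	cmpl	%ecx,%eax	/* 39 c8	 2 bytes */
	jne	1b		/* 75 xx	 2 bytes */
				/* total:	14 bytes */

14 <= 16, so .p2align 4 fits the whole loop in one 16-byte fetch line.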

I thought that alignment and unrolling didn't matter here, because movnti
has to wait for memory and almost any loop runs fast enough to keep up.
The timing on my old system is something like: CPUs at 2 GHz; main memory
at 4 GB/sec; movnti is only 4 bytes wide on i386 (so this problem
only affects i386, at least with slow memory).  So sustaining 4 GB/sec
requires 1 G movnti's/sec, so the loop needs to run at 2 cycles/iteration
to keep up.  But when it is misaligned, it runs at 3-4 cycles/iteration.
Alignment makes it take about 2, and the extra movnti is for safety and
to work with faster memory.
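
The same arithmetic, written out (again only the numbers quoted above):

#include <stdio.h>

int
main(void)
{
	const double cpu_hz = 2e9;	/* Turion2: 2 GHz */
	const double mem_bw = 4e9;	/* main memory: 4 GB/sec */
	const double store_width = 4;	/* movnti width on i386, in bytes */
	double budget;

	/* Cycles available per movnti while keeping memory saturated. */
	budget = cpu_hz / (mem_bw / store_width);
	printf("cycle budget per movnti: %.0f\n", budget);	/* 2 */
	return (0);
}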

On Haswell with CPUs at 4 GHz, 2 cycles/iteration gives 8 GB/sec on
i386 and 16 GB/sec on amd64 with wider movnti.  IIRC, 16 GB/sec is about
the main memory speed so nothing better is possible but just 1 extra
movnti gives more with faster memory.  This is slightly worse than
bzero(), except that when bzero() goes to main memory it is about the
same speed.  bzero() is perhaps 20% faster on average for my makeworld
benchmark.

The difference is too small to really matter: makeworld of an old world
takes 115-150 seconds (depending on tuning) on an i7-4790K (Haswell)
overclocked slightly.  It pagezero()s about 128 GB, and at main memory
bandwidth that takes 8 seconds, but the average is slightly faster, so
the makeworld time is reduced by 1-2 seconds by using bzero() for
pagezero().

On Turion2, pagezero() of 128 GB always takes about 32 seconds.  In
the misaligned version, it took 48 (?) seconds, for 3 cycles/iteration
instead of 2.  The makeworld time is 804-892 seconds, and 16 extra
seconds is very noticeable.  It hit a plateau at 823-840 seconds many
years ago, but came down to 804-816 mainly with the alignment
optimization and delicate/accidental scheduling.

Bruce

