svn commit: r280279 - head/sys/sys

Bruce Evans brde at optusnet.com.au
Tue Mar 31 04:49:34 UTC 2015


On Mon, 30 Mar 2015, Konstantin Belousov wrote:

> On Mon, Mar 30, 2015 at 11:57:08AM -0400, John Baldwin wrote:
>> On Sunday, March 22, 2015 11:32:51 AM Konstantin Belousov wrote:
>>> ...
>>> So anybody has to compile his own kernel to get popcnt optimization ?
>>> We do care about trivial things that improve time.

I'm not sure we do.  The microoptimizations give speed improvements
in the 1-10% range, while general bloat gives speed unimprovements in
the 10-500% range.

> I updated the pmap patch, see the end of the message.

Seems OK, except for verbose names.

>>> BTW, I have the following WIP change, which popcnt xorl is a piece of.
>>> It emulates the ifuncs with some preprocessing mess.  It is much better
>>> than runtime patching, and is a prerequisite to properly support more
>>> things, like SMAP.  I did not published it earlier, since I wanted to
>>> convert TLB flush code to this.
>>
>> This looks fine to me.  It seems to be manually converting certain symbols
>> to use a dynamic lookup that must be explicitly resolved before first
>> use?

It looks a bit overengineered to me.  A bit like my function pointers
for the bcopy() family on i386.  bcopy() is a bulk operation, so in
theory you can do it much faster by selecting the best available
version at runtime.  In practice, the gains were not large and are
too machine-dependent to maintain.  It is even harder to get large
gains and maintain them by selecting individual instructions at runtime.
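A function-pointer scheme of the kind described can be sketched as below.
This is only an illustration of the pattern, not the old i386 code; the
names and the feature test are made up:

```c
#include <stddef.h>
#include <string.h>

/* Two implementations of a bulk copy (both stand-ins here). */
static void
copy_generic(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);	/* portable fallback */
}

static void
copy_fast(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);	/* stand-in for an SSE/rep movs variant */
}

/* Patched once at boot; all callers indirect through it. */
static void (*copy_impl)(void *, const void *, size_t) = copy_generic;

static int
cpu_has_fast_copy(void)
{
	return (1);		/* stand-in for a CPUID feature test */
}

static void
copy_init(void)
{
	if (cpu_has_fast_copy())
		copy_impl = copy_fast;
}
```

The cost is an indirect call on every copy, which is part of why the
gains were small and too machine-dependent to maintain.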

> diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
> index 6a4077c..fcfba56 100644
> --- a/sys/amd64/amd64/pmap.c
> +++ b/sys/amd64/amd64/pmap.c
> @@ -412,7 +416,7 @@ static caddr_t crashdumpmap;
> static void	free_pv_chunk(struct pv_chunk *pc);
> static void	free_pv_entry(pmap_t pmap, pv_entry_t pv);
> static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp);
> -static int	popcnt_pc_map_elem(uint64_t elem);
> +static int	popcnt_pc_map_elem_pq(uint64_t elem);

popcnt_pc_map_elem_pq() is the most verbose name used here.  It is not
necessary to encrypt a function's man page in its name.

> @@ -2980,20 +3002,27 @@ retry:
>
> /*
>  * Returns the number of one bits within the given PV chunk map element.
> + *
> + * The erratas for Intel processors state that "POPCNT Instruction May
> + * Take Longer to Execute Than Expected".  It is believed that the
> + * issue is the spurious dependency on the destination register.
> + * Provide a hint to the register rename logic that the destination
> + * value is overwritten, by clearing it, as suggested in the
> + * optimization manual.  It should be cheap for unaffected processors
> + * as well.
> + *
> + * Reference numbers for erratas are
> + * 4th Gen Core: HSD146
> + * 5th Gen Core: BDM85
>  */
> static int
> -popcnt_pc_map_elem(uint64_t elem)
> +popcnt_pc_map_elem_pq(uint64_t elem)

Here the function has a specialized use, but internally it is just the
popcntq() cpufunc, except that popcntq() doesn't have the xorl workaround
or the above documentation for the workaround.

> {
> -	int count;
> +	u_long result;
>
> -	/*
> -	 * This simple method of counting the one bits performs well because
> -	 * the given element typically contains more zero bits than one bits.
> -	 */
> -	count = 0;
> -	for (; elem != 0; elem &= elem - 1)
> -		count++;
> -	return (count);
> +	__asm __volatile("xorl %k0,%k0;popcntq %1,%0"
> +	    : "=&r" (result) : "rm" (elem));
> +	return (result);
> }

Style:
- there should be a space after ";" in the asm.
- loading 0 should be in the operands.  Then you don't need the "k"
   modifier.  The compiler should be able to load 0 using xorl iff that
   is best for the current arch.  Perhaps a register is already 0, and
   it is best to do a null load so as to use that register.

The patch is missing removal of the popcntq() cpufunc.  This function was
only used here, and is now unused.
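For reference, the now-unused cpufunc wrapper is approximately as
follows (reconstructed from memory, with unsigned long spelled out
where the tree uses u_long; check cpufunc.h for the exact text):

```c
/* Approximate form of the amd64 cpufunc.h popcntq() wrapper. */
static __inline unsigned long
popcntq(unsigned long mask)
{
	unsigned long result;

	__asm __volatile("popcntq %1,%0" : "=r" (result) : "rm" (mask));
	return (result);
}
```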

cpufunc.h is supposed to contain only wrappers for single instructions,
to be combined if necessary to create larger functions.  popcntq()
follows this rule (some other functions are broken).  popcntq() with
the load of 0 wouldn't satisfy it strictly, but there is no other way
to fix it since register allocation is unavailable across inline asms.
Actually, if you move the load to the operands, it would follow the
rule strictly enough -- many of the asms need more than a single
instruction to load things.

Bruce

