svn commit: r280279 - head/sys/sys

Jung-uk Kim jkim at FreeBSD.org
Mon Apr 13 20:04:47 UTC 2015


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On 04/13/2015 13:36, Alan Cox wrote:
> On 03/30/2015 10:50, John Baldwin wrote:
>> On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
>>> On Sat, 21 Mar 2015, John Baldwin wrote:
>>> 
>>>> On 3/21/15 12:35 PM, Konstantin Belousov wrote:
>>>>> On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin
>>>>> wrote:
>>>>>> On 3/20/15 9:02 AM, Konstantin Belousov wrote:
>>>>>>> On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin
>>>>>>> wrote:
>>>>>>>> Author: jhb
>>>>>>>> Date: Fri Mar 20 10:27:06 2015
>>>>>>>> New Revision: 280279
>>>>>>>> URL: https://svnweb.freebsd.org/changeset/base/280279
>>>>>>>> 
>>>>>>>> Log:
>>>>>>>>   Expand the bitcount* API to support 64-bit integers, plain
>>>>>>>>   ints and longs and create a "hidden" API that can be used in
>>>>>>>>   other system headers without adding namespace pollution.
>>>>>>>>   - If the POPCNT instruction is enabled at compile time, use
>>>>>>>>     __builtin_popcount*() to implement __bitcount*(),
>>>>>>>>     otherwise fall back to software implementations.
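
The compile-time selection described in the log can be sketched roughly as follows. This is a minimal illustration, not the code committed in r280279: the name sketch_bitcount64 is hypothetical, and the sketch assumes a GCC-compatible compiler that defines __POPCNT__ when POPCNT is enabled (for example with -mpopcnt).

```c
#include <stdint.h>

/*
 * Hypothetical name; not the actual <sys/types.h> implementation.
 * Returns the number of set bits in a 64-bit value.
 */
static inline unsigned
sketch_bitcount64(uint64_t x)
{
#if defined(__POPCNT__)
	/* POPCNT was enabled at compile time: let the compiler emit it. */
	return ((unsigned)__builtin_popcountll(x));
#else
	/* Software fallback: add up the bits in parallel within the word. */
	x = x - ((x >> 1) & 0x5555555555555555ULL);
	x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	/* Sum the eight per-byte counts into the top byte. */
	return ((unsigned)((x * 0x0101010101010101ULL) >> 56));
#endif
}
```

Because the choice is made by the preprocessor, a binary built with default flags always takes the software path, which is the trade-off discussed later in the thread.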
>>>>>>> Are you aware of the Haswell erratum HSD146?  I see the
>>>>>>> described behaviour on machines back to SandyBridge,
>>>>>>> but not on Nehalems.
>>>>>>> 
>>>>>>> HSD146.  POPCNT Instruction May Take Longer to Execute
>>>>>>> Than Expected
>>>>>>> Problem: POPCNT instruction execution with a 32 or 64 bit
>>>>>>> operand may be delayed until previous non-dependent
>>>>>>> instructions have executed.
>>>>>>> 
>>>>>>> Jilles noted that gcc head and 4.9.2 already provide a
>>>>>>> workaround by xoring the dst register.  I have a
>>>>>>> patch for amd64 pmap, see the end of the message.
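
The "xor the dst register" workaround mentioned above can be illustrated with a small userland sketch. It is an assumption-laden example, not the patch referred to in the message: the function name is hypothetical, the inline asm assumes a GCC-compatible compiler on x86-64, and __builtin_cpu_supports() is used here only so that the sketch does not fault on a CPU without POPCNT.

```c
#include <stdint.h>

/*
 * Hypothetical helper illustrating the false-dependency workaround.
 * Returns the number of set bits in x.
 */
static inline uint64_t
sketch_popcnt64_xor(uint64_t x)
{
	uint64_t n;

#if defined(__x86_64__) && defined(__GNUC__)
	if (__builtin_cpu_supports("popcnt")) {
		/*
		 * Zero the destination first so that POPCNT does not have
		 * to wait for whatever last wrote that register.  The
		 * early-clobber constraint keeps the output register
		 * distinct from any register used by the input operand.
		 */
		__asm__("xorl %k0,%k0\n\tpopcntq %1,%0"
		    : "=&r" (n) : "rm" (x) : "cc");
		return (n);
	}
#endif
	/* Portable fallback: clear the lowest set bit until none remain. */
	for (n = 0; x != 0; n++)
		x &= x - 1;
	return (n);
}
```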
>>>>>> No, I was not aware, but I think it's hard to fix this
>>>>>> anywhere but the compiler.  I set CPUTYPE in src.conf on
>>>>>> my Ivy Bridge desktop and clang uses POPCNT for this
>>>>>> function from ACPI-CA:
>>>>>> 
>>>>>> static UINT8
>>>>>> AcpiRsCountSetBits (
>>>>>>     UINT16                  BitField)
>>>>>> {
>>>>>>     UINT8                   BitsSet;
>>>>>> 
>>>>>> 
>>>>>>     ACPI_FUNCTION_ENTRY ();
>>>>>> 
>>>>>> 
>>>>>>     for (BitsSet = 0; BitField; BitsSet++)
>>>>>>     {
>>>>>>         /* Zero the least significant bit that is set */
>>>>>> 
>>>>>>         BitField &= (UINT16) (BitField - 1);
>>>>>>     }
>>>>>> 
>>>>>>     return (BitsSet);
>>>>>> }
>>>>>> 
>>>>>> (I ran into this accidentally because a kernel built on
>>>>>> my system failed to boot in older qemu because the kernel
>>>>>> panicked with an illegal instruction fault in this
>>>>>> function.)
>>> Does it do the same for the similar home made popcount in
>>> pmap?:
>> Yes:
>> 
>> ffffffff807658d4:       f6 04 25 46 e2 d6 80    testb  $0x80,0xffffffff80d6e246
>> ffffffff807658db:       80
>> ffffffff807658dc:       74 32                   je     ffffffff80765910 <pmap_demote_pde_locked+0x4d0>
>> ffffffff807658de:       48 89 4d b8             mov    %rcx,-0x48(%rbp)
>> ffffffff807658e2:       f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
>> ffffffff807658e8:       48 8b 50 20             mov    0x20(%rax),%rdx
>> ffffffff807658ec:       48 89 55 b0             mov    %rdx,-0x50(%rbp)
>> ffffffff807658f0:       f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
>> ffffffff807658f6:       01 ca                   add    %ecx,%edx
>> ffffffff807658f8:       48 8b 48 28             mov    0x28(%rax),%rcx
>> ffffffff807658fc:       48 89 4d a8             mov    %rcx,-0x58(%rbp)
>> ffffffff80765900:       f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
>> ffffffff80765906:       eb 1b                   jmp    ffffffff80765923 <pmap_demote_pde_locked+0x4e3>
>> ffffffff80765908:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
>> ffffffff8076590f:       00
>> ffffffff80765910:       f3 48 0f b8 c9          popcnt %rcx,%rcx
>> ffffffff80765915:       f3 48 0f b8 50 20       popcnt 0x20(%rax),%rdx
>> ffffffff8076591b:       01 ca                   add    %ecx,%edx
>> ffffffff8076591d:       f3 48 0f b8 48 28       popcnt 0x28(%rax),%rcx
>> ffffffff80765923:       01 d1                   add    %edx,%ecx
>> 
>> It also uses popcnt for this in blist_fill() and
>> blist_meta_fill():
>> 
>>     /* Count the number of blocks we're about to allocate */
>>     bitmap = scan->u.bmu_bitmap & mask;
>>     for (nblks = 0; bitmap != 0; nblks++)
>>         bitmap &= bitmap - 1;
>> 
>>> Always using the new API would lose the micro-optimizations given
>>> by the runtime decision for default CFLAGS (used by
>>> distributions for portability).  To keep them, it seems best to
>>> keep the inline asm but replace popcnt_pc_map_elem(elem) by
>>> __bitcount64(elem).  -mno-popcnt can then be used to work
>>> around slowness in the software (that is actually hardware)
>>> case.
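
The runtime decision described above can be approximated in userland as follows. This is only a sketch under stated assumptions: all names are hypothetical, the three-word bitmap stands in for the pmap chunk map discussed in this thread, and the hardware path uses the GCC/clang target("popcnt") function attribute and __builtin_cpu_supports() rather than the kernel's inline asm and cpu_feature2 test.

```c
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
/* POPCNT is enabled for this one function only, regardless of CFLAGS. */
static __attribute__((target("popcnt"))) unsigned
sketch_hw_bitcount64(uint64_t x)
{
	return ((unsigned)__builtin_popcountll(x));
}
#endif

/*
 * Built with the default flags; stands in for __bitcount64().  Unless
 * POPCNT is enabled globally, the compiler uses a software count here.
 */
static unsigned
sketch_sw_bitcount64(uint64_t x)
{
	return ((unsigned)__builtin_popcountll(x));
}

/* Count the set bits of a three-word bitmap, picking the path at run time. */
static unsigned
sketch_count_map(const uint64_t map[3])
{
	unsigned (*count)(uint64_t) = sketch_sw_bitcount64;

#if defined(__x86_64__) && defined(__GNUC__)
	/* Use the hardware path only when the running CPU reports POPCNT. */
	if (__builtin_cpu_supports("popcnt"))
		count = sketch_hw_bitcount64;
#endif
	return (count(map[0]) + count(map[1]) + count(map[2]));
}
```

A binary built this way keeps working on CPUs without POPCNT while still using the instruction where it is available, which is the property the default-CFLAGS case depends on.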
>> I'm not sure if bitcount64() is strictly better than the loop in
>> this case even though it is O(1) given the claimed nature of the
>> values in the comment.
>> 
> 
> 
> I checked.  Even with zeroes being more common than ones,
> bitcount64() is faster than the simple loop.  Using bitcount64,
> reserve_pv_entries() takes on average 4265 cycles during
> "buildworld" on my test machine.  In contrast, with the simple
> loop, it takes on average 4507 cycles.  Even though bitcount64 is a
> lot larger than the simple loop, we do the 3 bit count operations
> many times in a loop, so the extra i-cache misses are being made up
> for by the repeated execution of the faster code.
> 
> However, in the popcnt case, we are spilling the bit map to memory
> in order to popcnt it.  That's rather silly:
> 
> 3570:       48 8b 48 18             mov    0x18(%rax),%rcx
> 3574:       f6 04 25 00 00 00 00    testb  $0x80,0x0
> 357b:       80
> 357c:       74 42                   je     35c0 <pmap_demote_pde_locked+0x2f0>
> 357e:       48 89 4d b8             mov    %rcx,-0x48(%rbp)
> 3582:       31 c9                   xor    %ecx,%ecx
> 3584:       f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
> 358a:       48 8b 50 20             mov    0x20(%rax),%rdx
> 358e:       48 89 55 b0             mov    %rdx,-0x50(%rbp)
> 3592:       31 d2                   xor    %edx,%edx
> 3594:       f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
> 359a:       01 ca                   add    %ecx,%edx
> 359c:       48 8b 48 28             mov    0x28(%rax),%rcx
> 35a0:       48 89 4d a8             mov    %rcx,-0x58(%rbp)
> 35a4:       31 c9                   xor    %ecx,%ecx
> 35a6:       f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
> 35ac:       01 d1                   add    %edx,%ecx
> 35ae:       e9 12 01 00 00          jmpq   36c5 <pmap_demote_pde_locked+0x3f5>

Please try the attached patch.

Jung-uk Kim
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAEBCAAGBQJVLCFZAAoJEHyflib82/FGOp0H/1+Jr+cKUn/MnV5O5SghPw9f
XzTM4+BV9BcWabLRjFe1LR065SfLDXqKLuU4h5lmVSlXQaxElAXxaMeyO3mrMzR4
Sb1xr0rf+ZfUARJeEJWI65Wpn+gEH+7XxXAIAetYGMwwclBOBgbZIoDXITnCaUFa
/pi3zQIey8EzbvlzhQcffLDV8oF4f8HNEMoSxMRtOiZNNPu/8ECnyGeHZhOd++kh
pwZNsSbcCw3RXMheuErTpKPrJSEXgMNmWG3G00aP7L8IjcObgOqMUQt+8eT8Ge8B
tEv40kgm2G/OG2akONh4/6bX3hyodW3IHcb6AYhqZogiDIqd/eXD4jDup/kkVxU=
=1Ca9
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pmap.diff
Type: text/x-patch
Size: 2181 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/svn-src-all/attachments/20150413/39f35620/attachment.bin>

