Re: git: 32a2fed6e71f - stable/13 - openssl: Fix detection of ARMv7 and ARM64 CPU features

From: Allan Jude <allanjude_at_freebsd.org>
Date: Thu, 25 Nov 2021 15:09:25 UTC

On 11/25/2021 2:38 AM, Helge Oldach wrote:
> Hi,
> 
> Allan Jude wrote on Wed, 24 Nov 2021 19:02:47 +0100 (CET):
>> On 11/24/2021 3:30 AM, Emmanuel Vadot wrote:
>>> On Tue, 23 Nov 2021 20:36:40 +0100 (CET)
>>> freebsd@oldach.net (Helge Oldach) wrote:
>>>
>>>> Allan Jude wrote on Tue, 23 Nov 2021 20:14:53 +0100 (CET):
>>>>> On 11/23/2021 5:00 AM, Helge Oldach wrote:
>>>>>> Allan Jude wrote on Mon, 22 Nov 2021 19:14:13 +0100 (CET):
>>>>>> Hmmm. On a RPi4/8G:
>>>>>>
>>>>>> Before (FreeBSD 13.0-STABLE (GENERIC) #366 stable/13-n248173-d16fbc488e6):
>>>>>> | type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>>>> | aes-256-gcm      35791.98k    38533.57k    39986.77k    41397.59k    39840.43k    39638.36k
>>>>>>
>>>>>> After (FreeBSD 13.0-STABLE (GENERIC) #367 stable/13-n248176-f085bb0e621)
>>>>>>
>>>>>> | type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>>>> | aes-256-gcm      21277.62k    23226.64k    23613.90k    23687.51k    23892.93k    23947.95k
>>>>>>
>>>>>> It seems that AES throughput is actually cut by almost half?
>>>>>
>>>>> Do you know which of the CPU optimizations your RPi4 supports?
>>>>
>>>> Is this what you need?
>>>>
>>>>    Instruction Set Attributes 0 = <CRC32>
>>>
>>>    So there is no AES+PMULL instruction set on RPI4, I guess that openssl
>>> uses them for aes-gcm.
>>>
>>>    I wonder what it used before that gave it this boost.
>>>
>>>    On my rockpro64 I do see the improvement btw :
>>> root@generic:~ # cpuset -l 4,5 openssl speed -evp aes-256-gcm
>>> ...
>>> aes-256-gcm     122861.59k   337938.39k   565408.44k   661223.09k   709175.19k   712327.25k
>>> root@generic:~ # cpuset -l 4,5 env OPENSSL_armcap=0 openssl speed -evp aes-256-gcm
>>> ...
>>> aes-256-gcm      34068.11k    38068.62k    39435.24k    39818.75k    39905.34k    39922.35k
>>>
>>>    Running on the big cores at max freq.
>>>
>>>>    Instruction Set Attributes 1 = <>
>>>>            Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>>>>            Processor Features 1 = <>
>>>>         Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>>>>         Memory Model Features 1 = <8bit VMID>
>>>>         Memory Model Features 2 = <32bit CCIDX,48bit VA>
>>>>                Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>>>>                Debug Features 1 = <>
>>>>            Auxiliary Features 0 = <>
>>>>            Auxiliary Features 1 = <>
>>>> AArch32 Instruction Set Attributes 5 = <CRC32,SEVL>
>>>> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
>>>> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>>>>
>>>>> You can set the environment variable OPENSSL_armcap to override
>>>>> OpenSSL's detection.
>>>>>
>>>>> Try: env OPENSSL_armcap=0 openssl speed -evp aes-256-gcm
>>>>
>>>> On FreeBSD 13.0-STABLE (GENERIC) #367 stable/13-n248176-f085bb0e621 again (i.e. after this commit):
>>>>
>>>> hmo@p48 ~ $ env OPENSSL_armcap=0 openssl speed -evp aes-256-gcm
>>>> Doing aes-256-gcm for 3s on 16 size blocks: 6445704 aes-256-gcm's in 3.08s
>>>> Doing aes-256-gcm for 3s on 64 size blocks: 1861149 aes-256-gcm's in 3.00s
>>>> Doing aes-256-gcm for 3s on 256 size blocks: 479664 aes-256-gcm's in 3.01s
>>>> Doing aes-256-gcm for 3s on 1024 size blocks: 122853 aes-256-gcm's in 3.04s
>>>> Doing aes-256-gcm for 3s on 8192 size blocks: 15181 aes-256-gcm's in 3.00s
>>>> Doing aes-256-gcm for 3s on 16384 size blocks: 7796 aes-256-gcm's in 3.07s
>>>> OpenSSL 1.1.1l-freebsd  24 Aug 2021
>>>> built on: reproducible build, date unspecified
>>>> options:bn(64,64) rc4(int) des(int) aes(partial) idea(int) blowfish(ptr)
>>>> compiler: clang
>>>> The 'numbers' are in 1000s of bytes per second processed.
>>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      33504.57k    39704.51k    40825.01k    41394.83k    41454.25k    41601.52k
>>>> hmo@p48 ~ $ openssl speed -evp aes-256-gcm
>>>> Doing aes-256-gcm for 3s on 16 size blocks: 4066201 aes-256-gcm's in 3.00s
>>>> Doing aes-256-gcm for 3s on 64 size blocks: 1087387 aes-256-gcm's in 3.00s
>>>> Doing aes-256-gcm for 3s on 256 size blocks: 280110 aes-256-gcm's in 3.03s
>>>> Doing aes-256-gcm for 3s on 1024 size blocks: 70412 aes-256-gcm's in 3.04s
>>>> Doing aes-256-gcm for 3s on 8192 size blocks: 8762 aes-256-gcm's in 3.00s
>>>> Doing aes-256-gcm for 3s on 16384 size blocks: 4402 aes-256-gcm's in 3.02s
>>>> OpenSSL 1.1.1l-freebsd  24 Aug 2021
>>>> built on: reproducible build, date unspecified
>>>> options:bn(64,64) rc4(int) des(int) aes(partial) idea(int) blowfish(ptr)
>>>> compiler: clang
>>>> The 'numbers' are in 1000s of bytes per second processed.
>>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      21686.41k    23197.59k    23656.30k    23725.04k    23926.10k    23916.23k
>>>> hmo@p48 ~ $
>>>>
>>>> Kind regards,
>>>> Helge
>>>
>>>
>>
>> So based on results from Manu, and Mark Millard, it seems almost every
>> ARM platform is faster when it takes advantage of the CPU features,
>> except the RPi4(B).
>>
>> As Manu pointed out, it doesn't appear to have the AES+PMULL feature,
>> which means it must be something else that is slowing it down.
>>
>> What might help is to try each feature in turn and figure out which
>> one is causing the slower results.
>>
>> #define HWCAP_FP                0x00000001
>> #define HWCAP_ASIMD             0x00000002
>> #define HWCAP_EVTSTRM           0x00000004
>> #define HWCAP_AES               0x00000008
>> #define HWCAP_PMULL             0x00000010
>> #define HWCAP_SHA1              0x00000020
>> #define HWCAP_SHA2              0x00000040
>> #define HWCAP_CRC32             0x00000080
>>
>> So try:
>> env OPENSSL_armcap=1 openssl speed -evp aes-256-gcm
>> as well as with armcap=2, 3 (both FP and ASIMD), 8 (just AES) etc.
> 
> hmo@p48 ~ $ for f in 0 1 2 3 8 16 32 64 128 ; do echo -n $f:; env OPENSSL_armcap=$f openssl speed -evp aes-256-gcm 2>&1 | tail -1 | cut -wf7; done
> 0:42295.15k
> 1:23891.19k
> 2:42208.57k
> 3:23970.56k
> 8:42354.98k
> 16:42199.06k
> 32:size
> Illegal instruction (core dumped)
> 64:42322.42k
> 128:42275.00k
> hmo@p48 ~ $
> 
> So I guess HWCAP_FP is the culprit? Maybe related to hard/soft floating
> point math which indeed is kind of special on the Pi?
> 
>> For values where the CPU lacks the feature, it will crash with 'Illegal
>> instruction'.
>>
>> Separately, it might also be interesting to see the results of `openssl
>> speed -evp sha256` before/after/with the different OPENSSL_armcap values
> 
> Please let me know in case you still require this.
> 
> Kind regards
> Helge
> 

So yeah, the issue seems to be that the floating point code path is slower
on the RPi4 than the fallback, but openssl now (properly) detects that the
CPU advertises support for it.
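
To put a number on it, here is a rough sketch (plain sh and awk; the helper
names last_col and pct_drop are made up for illustration) that takes the two
summary lines Helge posted for the RPi4 and computes the drop in the 16384
byte column:

```sh
#!/bin/sh
# Extract the last throughput column (16384-byte blocks) from an
# `openssl speed` summary line, dropping the trailing "k".
last_col() {
    printf '%s\n' "$1" | awk '{ v = $NF; sub(/k$/, "", v); print v }'
}

# Percentage drop from a baseline throughput ($1) to a new throughput ($2).
pct_drop() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (1 - after / before) * 100 }'
}

# Summary lines copied from the before/after runs quoted in this thread.
before='aes-256-gcm      35791.98k    38533.57k    39986.77k    41397.59k    39840.43k    39638.36k'
after='aes-256-gcm      21277.62k    23226.64k    23613.90k    23687.51k    23892.93k    23947.95k'

pct_drop "$(last_col "$before")" "$(last_col "$after")"
```

For those figures it prints 39.6, i.e. roughly a 40% drop at the largest
block size.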

As seen elsewhere in the thread, most other ARM platforms get a very 
significant speed boost.
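
If anyone wants to repeat the per-value test on other boards, something
along these lines might be a starting point. It is only a sketch:
try_armcap is a hypothetical helper, not anything shipped with OpenSSL, and
it assumes that the last field of the last line of `openssl speed` output is
the 16384 byte throughput, as in the output quoted in this thread. A crash
such as 'Illegal instruction' is reported as a failure rather than parsed.
It uses awk instead of the FreeBSD-specific `cut -w`.

```sh
#!/bin/sh
# Hypothetical helper: run `openssl speed` with one OPENSSL_armcap value and
# print "<value>: <last column of the summary line>". If openssl exits
# non-zero (for example, killed by SIGILL because the CPU lacks the forced
# feature), print the exit status instead of trying to parse the output.
try_armcap() {
    cap=$1
    if out=$(OPENSSL_armcap="$cap" openssl speed -evp aes-256-gcm 2>/dev/null); then
        result=$(printf '%s\n' "$out" | tail -n 1 | awk '{ print $NF }')
        printf '%s: %s\n' "$cap" "$result"
    else
        status=$?
        printf '%s: failed (exit status %s)\n' "$cap" "$status"
    fi
}
```

Usage would be something like:

    for f in 0 1 2 3 8 16 32 64 128; do try_armcap "$f"; done

Each value takes a while, since openssl speed runs every block size for
about 3 seconds.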

-- 
Allan Jude