Re: git: 32a2fed6e71f - stable/13 - openssl: Fix detection of ARMv7 and ARM64 CPU features

From: Mark Millard via arm <arm_at_freebsd.org>
Date: Wed, 24 Nov 2021 21:19:16 UTC

On 2021-Nov-24, at 01:51, Mark Millard <marklmi@yahoo.com> wrote:

> [Actually, the main [so: 14] equivalent.]
> 
> All Cortex-A72 based . . .
> 
> First, older system versions (before that update)
> then after the update:
> 
> 
> RPi4B 8 GiByte (older FreeBSD first, otherwise new),
> Cortex-A72's:
> 
> # openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      51925.92k    58449.46k    60430.32k    61050.13k    61180.98k    61482.75k
> 
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      28880.07k    30837.33k    31630.29k    31855.62k    31921.54k    32034.53k
> 
> So: slowed down, unlike the other examples below.
> 
> # env OPENSSL_armcap=0 openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      51894.33k    58540.45k    60815.22k    61534.47k    61906.84k    62042.10k
> 
> So: back to the prior speed.
> 
> But all these are based on config.txt containing:
> 
> over_voltage=6
> arm_freq=2000
> sdram_freq_min=3200
> force_turbo=1
> 
> (The RPi4B has a heat-sink and a fan.)
> 
> Note: See later about the RPi4B CPU features.
> 
> 
> MACCHIATObin Double Shot (older first), Cortex-A72's:
> 
> # openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      50808.49k    58466.08k    60769.11k    61444.92k    61767.94k    61707.61k
> 
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm     163579.14k   456319.27k   786544.01k   940234.41k  1003230.55k  1005671.31k
> 
> 
> HoneyComb (older first), Cortex-A72's:
> 
> # openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      57659.60k    64599.05k    67719.81k    68373.74k    68724.24k    68793.80k
> 
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm     177925.57k   502311.65k   866287.95k  1036500.35k  1106598.06k  1106721.91k
> 
> Rock64 (older first), Cortex-A53's:
> 
> # openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      18378.23k    23401.45k    24834.99k    25206.10k    25337.86k    25258.19k
> 
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      52711.29k   163586.49k   318738.69k   420277.93k   461373.44k   463192.06k
> 
> 
> OPi+2E (older first), Cortex-A7's (so armv7):
> 
> # openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm       9343.10k    11156.39k    11827.64k    11995.30k    12025.86k    12031.32k
> 
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      11013.41k    13598.44k    14034.26k    15045.97k    15262.90k    15302.66k
> 
> 
> 
> For reference:
> 
> For the RPi4B examples (2 notes added):
> 
> CPU  0: ARM Cortex-A72 r0p3 affinity:  0
>                   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
> Instruction Set Attributes 0 = <CRC32>
> *** NOTE the lack of ",SHA2,SHA1,AES+PMULL" above ***
> Instruction Set Attributes 1 = <>
>         Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>         Processor Features 1 = <>
>      Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>      Memory Model Features 1 = <8bit VMID>
>      Memory Model Features 2 = <32bit CCIDX,48bit VA>
>             Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>             Debug Features 1 = <>
>         Auxiliary Features 0 = <>
>         Auxiliary Features 1 = <>
> AArch32 Instruction Set Attributes 5 = <CRC32,SEVL>
> *** NOTE the lack of ",SHA2,SHA1,AES+VMULL" above ***
> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
> 
> For the MACCHIATObin Double Shot examples:
> 
> CPU  0: ARM Cortex-A72 r0p1 affinity:  0  0
>                   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
> Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
> Instruction Set Attributes 1 = <>
>         Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>         Processor Features 1 = <>
>      Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>      Memory Model Features 1 = <8bit VMID>
>      Memory Model Features 2 = <32bit CCIDX,48bit VA>
>             Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>             Debug Features 1 = <>
>         Auxiliary Features 0 = <>
>         Auxiliary Features 1 = <>
> AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
> 
> 
> For the HoneyComb examples:
> 
> CPU  0: ARM Cortex-A72 r0p3 affinity:  0  0
>                   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
> Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
> Instruction Set Attributes 1 = <>
>         Processor Features 0 = <GIC,AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>         Processor Features 1 = <>
>      Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>      Memory Model Features 1 = <8bit VMID>
>      Memory Model Features 2 = <32bit CCIDX,48bit VA>
>             Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>             Debug Features 1 = <>
>         Auxiliary Features 0 = <>
>         Auxiliary Features 1 = <>
> AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
> 
> 
> 
> 
> For the Rock64 examples:
> 
> CPU  0: ARM Cortex-A53 r0p4 affinity:  0
>                   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,VIPT ICache,64 byte ERG,64 byte CWG>
> Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
> Instruction Set Attributes 1 = <>
>         Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>         Processor Features 1 = <>
>      Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,1TB PA>
>      Memory Model Features 1 = <8bit VMID>
>      Memory Model Features 2 = <32bit CCIDX,48bit VA>
>             Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>             Debug Features 1 = <>
>         Auxiliary Features 0 = <>
>         Auxiliary Features 1 = <>
> AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
> 
> 
> For the OPi+2E examples:
> 
> CPU: ARM Cortex-A7 r0p5 (ECO: 0x00000000)
> CPU Features:
>  Multiprocessing, Thumb2, Security, Virtualization, Generic Timer, VMSAv7,
>  PXN, LPAE, Coherent Walk
> Optional instructions:
>  SDIV/UDIV, UMULL, SMULL, SIMD(ext)
> LoUU:2 LoC:3 LoUIS:2
> Cache level 1:
> 32KB/64B 4-way data cache WB Read-Alloc Write-Alloc
> 32KB/32B 2-way instruction cache Read-Alloc
> Cache level 2:
> 512KB/64B 8-way unified cache WB Read-Alloc Write-Alloc

Note: as the issue applies to stable/13 and main [so: 14]
(for example), I continue to use the freebsd-arm list
instead of a list that reports commits to stable/* but
not to main.

Relative to:

#define HWCAP_FP                0x00000001
#define HWCAP_ASIMD             0x00000002
#define HWCAP_EVTSTRM           0x00000004
#define HWCAP_AES               0x00000008
#define HWCAP_PMULL             0x00000010
#define HWCAP_SHA1              0x00000020
#define HWCAP_SHA2              0x00000040
#define HWCAP_CRC32             0x00000080

The single-enabled-bit OPENSSL_armcap value that gets the slow
result is:

# env OPENSSL_armcap=1 openssl speed -evp aes-256-gcm
. . .
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-gcm      28427.04k    30712.32k    31446.00k    31683.40k    31829.10k    31839.55k

The values that produced illegal instructions for aes-256-gcm were:

# env OPENSSL_armcap=4 openssl speed -evp aes-256-gcm
Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)

# env OPENSSL_armcap=32 openssl speed -evp aes-256-gcm
Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)

(For sha256, a different set of values produces illegal
instructions; see below.)

Ignoring the bits that produce illegal instructions, HWCAP_FP
combined with any one of the other bits was similarly slow.

Enabling all of the bits that do not produce illegal instructions
was also similarly slow:

# env OPENSSL_armcap=219 openssl speed -evp aes-256-gcm
. . .
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-gcm      28922.63k    30711.51k    31522.15k    31722.15k    31788.97k    31845.03k

Clearing just HWCAP_FP from that mask gave the fast category of
result:

# env OPENSSL_armcap=218 openssl speed -evp aes-256-gcm
. . .
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-gcm      49543.14k    58068.22k    60236.56k    60724.37k    61216.09k    61212.99k
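
For reading the decimal values used above, here is a minimal Python
sketch that splits an OPENSSL_armcap value into the bit names from the
#define list quoted earlier. It assumes, as this message does, that the
mask bits line up with those HWCAP_* values; that mapping is not
confirmed here. decode_armcap is a made-up helper name, not part of
openssl or FreeBSD.

```python
# Bit values copied from the HWCAP_* list quoted in this message.
# Assumption: OPENSSL_armcap bits line up with these values.
HWCAP_BITS = {
    0x00000001: "HWCAP_FP",
    0x00000002: "HWCAP_ASIMD",
    0x00000004: "HWCAP_EVTSTRM",
    0x00000008: "HWCAP_AES",
    0x00000010: "HWCAP_PMULL",
    0x00000020: "HWCAP_SHA1",
    0x00000040: "HWCAP_SHA2",
    0x00000080: "HWCAP_CRC32",
}


def decode_armcap(mask):
    """Return the names of the bits set in mask, lowest bit first."""
    return [name for bit, name in sorted(HWCAP_BITS.items()) if mask & bit]


# The aes-256-gcm values tried in this message.
for value in (1, 4, 32, 219, 218):
    print(value, hex(value), ", ".join(decode_armcap(value)))
```

Under that assumption, 219 (0xdb) is every listed bit except the 4 and
32 ones, and 218 (0xda) is the same mask with the lowest bit cleared.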


As for sha256 . . .

# env OPENSSL_armcap=0 openssl speed -evp sha256
. . .
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256           22434.19k    59895.91k   117258.16k   156264.31k   172624.81k   173848.52k

(I'll not list all the similar performing ones but
will list all illegal-instruction producing ones.)

# env OPENSSL_armcap=4 openssl speed -evp sha256
Doing sha256 for 3s on 16 size blocks: 4082055 sha256's in 2.99s
Doing sha256 for 3s on 64 size blocks: 2752520 sha256's in 3.02s
Doing sha256 for 3s on 256 size blocks: 1372584 sha256's in 3.03s
Doing sha256 for 3s on 1024 size blocks: 470215 sha256's in 3.11s
Doing sha256 for 3s on 8192 size blocks: 64700 sha256's in 3.07s
Doing sha256 for 3s on 16384 size blocks: 31847 sha256's in 3.00s
Illegal instruction (core dumped)
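
Because that run died before printing its summary table, the "Doing"
lines are all there is to compare against. A minimal Python sketch for
turning them into the same kind of figure, assuming the table values are
thousands of bytes processed per second (count * block size / seconds /
1000); throughput_k and the regular expression are illustrative, not
anything openssl provides:

```python
import re

# Matches lines such as:
#   Doing sha256 for 3s on 16 size blocks: 4082055 sha256's in 2.99s
LINE_RE = re.compile(
    r"Doing (\S+) for \d+s on (\d+) size blocks: (\d+) \S+ in ([\d.]+)s"
)


def throughput_k(line):
    """Return (block_size, thousands of bytes per second) or None."""
    m = LINE_RE.search(line)
    if m is None:
        # For example a line that ends in "Illegal instruction".
        return None
    block_size = int(m.group(2))
    count = int(m.group(3))
    seconds = float(m.group(4))
    return block_size, count * block_size / seconds / 1000.0


# The OPENSSL_armcap=4 sha256 lines from this message.
LINES = [
    "Doing sha256 for 3s on 16 size blocks: 4082055 sha256's in 2.99s",
    "Doing sha256 for 3s on 64 size blocks: 2752520 sha256's in 3.02s",
    "Doing sha256 for 3s on 256 size blocks: 1372584 sha256's in 3.03s",
    "Doing sha256 for 3s on 1024 size blocks: 470215 sha256's in 3.11s",
    "Doing sha256 for 3s on 8192 size blocks: 64700 sha256's in 3.07s",
    "Doing sha256 for 3s on 16384 size blocks: 31847 sha256's in 3.00s",
]

for line in LINES:
    size, k = throughput_k(line)
    print(f"{size:>6} bytes: {k:.2f}k")
```

On that assumption the 16-byte line works out to roughly 21844k and the
16384-byte line to roughly 173927k, in the same range as the
OPENSSL_armcap=0 table above.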

# env OPENSSL_armcap=16 openssl speed -evp sha256
Doing sha256 for 3s on 16 size blocks: Illegal instruction (core dumped)

(For aes-256-gcm, 16 worked but 32 did not.)

So: none of the single-enabled-bit cases was significantly
slower.

Likewise, none of the 2-enabled-bits cases (leaving out the
illegal-instruction ones) differed in speed.
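
For anyone wanting to repeat the single-bit and 2-bit sweeps, a minimal
Python sketch that lists the mask values and the matching command
lines. It assumes only the eight bits 0x01 through 0x80 are of
interest and leaves out the values that produced illegal instructions
for sha256 above (4 and 16). masks_with_n_bits is a hypothetical helper
name.

```python
from itertools import combinations

# The eight single-bit values 0x01 .. 0x80 (an assumption about
# which bits are worth sweeping).
ALL_BITS = [1 << n for n in range(8)]


def masks_with_n_bits(n, excluded=()):
    """Return every mask with exactly n bits set, skipping excluded bits."""
    usable = [bit for bit in ALL_BITS if bit not in excluded]
    return [sum(combo) for combo in combinations(usable, n)]


# sha256: 4 and 16 produced illegal instructions, so skip them.
for mask in masks_with_n_bits(2, excluded=(4, 16)):
    print(f"env OPENSSL_armcap={mask} openssl speed -evp sha256")
```

With two of the eight bits excluded this gives 15 two-bit masks; for
aes-256-gcm the excluded pair would be (4, 32) instead.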

For reference (avoiding illegal-instructions):

# env OPENSSL_armcap=235 openssl speed -evp sha256
. . .
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256           23185.66k    62689.73k   125814.72k   167981.88k   187833.65k   188968.95k

So: also similar speed.

Need any other specific bit combinations?

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)