Re: Cores of different performance vs. time spent creating threads: Windows Dev Kit 2023 example [Oddity is back!]

From: Mark Millard <marklmi_at_yahoo.com>
Date: Tue, 16 May 2023 09:03:50 UTC
On May 15, 2023, at 12:14, Mark Millard <marklmi@yahoo.com> wrote:

> On May 9, 2023, at 19:19, Mark Millard <marklmi@yahoo.com> wrote:
> 
>> First some context that reaches an oddity that seems to
>> be involved in the time to create threads . . .
>> 
>> The Windows Dev Kit 2023 (WDK23 abbreviation here) boot reports:
>> 
>> CPUs (cores) 0..3: cortex-a78c (the slower cores)
>> CPUs (cores) 4..7: cortex-x1c  (the faster cores)
>> 
>> Building a kernel with an explicit -mcpu= setting produces
>> the following oddity in cpu numbering when that kernel is
>> used:
>> 
>> -mcpu=cortex-x1c or -mcpu=cortex-a78c:
>>   Benchmarking tracks that number/performance pairing.
>> 
>> -mcpu=cortex-a72:
>>   The slower vs. faster cores have their number blocks swapped.
>> 
>> So, for -mcpu=cortex-a72 , 0..3 are the faster cores.
>> 
>> This sets up for the following . . .
>> 
>> But I also observe (a relative comparison of contexts
>> via some benchmark-like activity):
>> 
>> -mcpu=cortex-x1c or -mcpu=cortex-a78c based kernel:
>>   threads take more time to create
>> 
>> -mcpu=cortex-a72 based kernel:
>>   threads take less time to create
>> 
>> The difference is not trivial for the activity involved
>> for this WDK23 context.
>> 
>> If thread creation is generally biased toward particular
>> core(s), it would appear important that the bias be toward
>> the more performant cores (for what the activity involves).
>> The above suggests that this may not be the case for FreeBSD
>> as is. big.LITTLE (and analogous designs) make this more
>> relevant.
>> 
>> Does this hypothesis about what type of thing is going on
>> fit with how FreeBSD actually works?
>> 
>> As stands, I'm going to experiment with the WDK23 using
>> a cortex-a72 targeted kernel but a cortex-x1c/cortex-a78c
>> targeted world for my general operation of the WDK23.
>> 
>> 
>> Note: While the benchmark results allow seeing in plots
>> what traces back to thread creation time contributions,
>> the benchmark itself does not directly measure that time.
>> It is more like this: the average work rate over a run
>> changes based on the fraction of the time spent in the
>> thread creations for each given problem size. The actual
>> definition of work here involves a mathematical quantity
>> for a mathematical problem (one not limited to computers
>> doing the work).
>> 
>> The benchmark results are more useful for discovering that
>> there is something to potentially investigate than to
>> actually do an investigation with.
>> 
> 
> Never mind:

I was wrong about that . . . it's back.
(See later below.)

> Starting over did not reproduce the oddity. So:
> operator oddity/error, though I've no clue of how
> to reproduce the odd swap of which cpu number ranges
> took more vs. less time for each given size problem.
> (Or any other aspect that might be considered also
> odd, such as specific performance figures.)
> 
> Retry details:
> 
> I booted the WDK23 via UFS media set up for
> cortex-a72, media that I use for UFS activities on
> the HoneyComb (for example). I built the benchmark
> and ran it.
> 
> As stands, I've only done the "cpu lock down" case.
> It produces less messy data by avoiding cpu
> migration once the lockdown completes (singleton
> cpuset for the thread). I'll also run the variant
> that does not have the cpu lock downs (standard
> C++ code without FreeBSD specifics added).

I got the swapped number blocks vs. performance again,
this time not for cortex-a72 tailored FreeBSD but for
cortex-x1c/cortex-a78c +nolse tailored FreeBSD.

Not rebooting for now, the oddity exists for the
benchmark built with each of:

clang 16 plus libc++
g++   13 plus libc++
g++   13 plus libstdc++

As before, top's STATE column shows CPU<n> names that
match the cpuset based cpu ids (bit numbering) that the
benchmark locks down to.

As before, the measured performance for "faster"
is also higher than normal.


As a cross check: Avoiding use of my benchmark program . . .

# cpuset -l0-3 openssl speed
Doing mdc2 for 3s on 16 size blocks: 1705580 mdc2's in 3.10s
. . .
vs.
# cpuset -l4-7 openssl speed
Doing mdc2 for 3s on 16 size blocks: 1079870 mdc2's in 3.03s
. . .

So, openssl speed also shows the oddity: usage of 0-3 is faster
than usage of 4-7. The 1705580 is also somewhat large compared to
a normal "4-7 is faster" context: 1705580/3.10 approx= 550187/sec .
Compare to the similar calculation results below.

For example, after shutting down, powering off, powering on,
booting, and repeating the openssl speed examples:

# cpuset -l0-3 openssl speed
Doing mdc2 for 3s on 16 size blocks: 997679 mdc2's in 3.09s
. . .
# cpuset -l4-7 openssl speed
Doing mdc2 for 3s on 16 size blocks: 1360400 mdc2's in 3.02s
. . .
# cpuset -l0-3 openssl speed
Doing mdc2 for 3s on 16 size blocks: 967253 mdc2's in 3.00s
. . .
# cpuset -l4-7 openssl speed
Doing mdc2 for 3s on 16 size blocks: 1406978 mdc2's in 3.08s
. . .

So (two calculations similar to the earlier one above):

About 550187/sec vs. about 450463/sec and 456811/sec

That is: the odd context is about 1.2 times faster.


I've no clue about the cause or what stage(s) lead
to the odd context happening.

===
Mark Millard
marklmi at yahoo.com