An aarch64 4-core-SBC FreeBSD performance oddity: Rock64 and RPi4B examples

Mark Millard marklmi at yahoo.com
Wed Aug 26 03:56:33 UTC 2020



On 2020-Aug-24, at 23:20, Mark Millard <marklmi at yahoo.com> wrote:

> The point here is more the systematic bias that might not be
> expected and might indicate something is not working as
> intended. (The performance details themselves are not directly
> the point but are the suggestive evidence of some sort of
> bias in the implementation.)
> 
> I have found that, under at least FreeBSD head -r363590, which
> pair of cpus (cores) is used for a 2-thread activity in the
> program I'm using affects the performance measurably and
> systematically (but not necessarily by a large amount).
> 
> Basically, when the cpu pair involves cpu2 it goes slower
> than otherwise, in contexts where memory caches help with
> performance. (Otherwise RAM slowness or thread-creation time
> dominates the measurement involved.) There are 3 cases
> for this:
> 
> A) The cpu pair does not involve cpu2. These perform similar
>   to each other [relative to (B) and (C) below].
> 
> B) cpu2 is in use with one of cpu1 or cpu3. These differ
>   from (A) in performance: slower. But the two (B) cases are
>   similar to each other [relative to (A) and (C)].
> 
> C) cpu2 is in use with cpu0. This is slower than both (A) and (B).
>   This case also seems to have somewhat more variability in the
>   performance compared to (A) and (B).
> 
> The Rock64 and RPi4B have very different memory-subsystem
> performance behavior overall, but the above still applies
> to both as a summary. I've not seen such differences for,
> say, an RPi4B Ubuntu context. I've not tested other example
> contexts.
> 
> I limit the cpus/cores via cpuset use on the command line. I
> can build the program involved either with it locking down
> each thread in the test to a distinct cpu/core within what
> cpuset is told to allow, or without, allowing migration to
> occur between those cpus/cores. (No cpuset use would be
> needed for using all 4 cores on the example SBCs.) The
> effect is measurable both ways.

Based on further experiments in a more general context, I
retract the claim that "locking down each thread in the test
to a distinct cpu/core within what cpuset is told to allow"
is unnecessary for observing (B) or (C): in general, the
lock-down activity is needed to see them.

Runs that still use cpuset from the command line, but do no
locking down to specific cores in the program itself, do not
generally follow the described pattern.
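
For reference, the command-line-only restriction is of the
following form (the cpu list 0,2 is just one example pairing;
any program arguments are omitted):

    cpuset -l 0,2 ./acpphint

With only this, the kernel remains free to migrate the two
threads between cpu0 and cpu2 over the course of the run.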

This leads me to guess that the cpuset_setaffinity used
to do the lock-down contributes to (B) and (C) happening.
The cpuset_setaffinity use looks like:

    // Pin the calling thread to the pre-built single-cpu set
    // for core c (cpuset_setaffinity and the CPU_* constants
    // come from <sys/param.h> plus <sys/cpuset.h>):
    if  (0 != cpuset_setaffinity( CPU_LEVEL_WHICH
                                , CPU_WHICH_TID
                                , id_t{-1} // -1: the current thread
                                , sizeof(cpuset_t)
                                , &cpus_info.singleton_sets.at(c).cpu_set
                                )
        )
        throw std::runtime_error("failed to set cpu");
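
For anyone wanting to experiment with the same technique, below
is a minimal self-contained sketch of this style of pinning
(the helper name, the cpu0/cpu2 pairing, and the placeholder
work are illustrative, not the program's actual structure):

    #include <sys/param.h>   // prerequisite for <sys/cpuset.h>
    #include <sys/cpuset.h>  // cpuset_t, CPU_ZERO, CPU_SET, cpuset_setaffinity
    #include <stdexcept>
    #include <thread>

    // Restrict the calling thread to exactly one cpu/core.
    static void pin_current_thread_to(int cpu)
    {
        cpuset_t one_cpu;
        CPU_ZERO(&one_cpu);     // start from an empty set
        CPU_SET(cpu, &one_cpu); // then allow just the one cpu

        if  (0 != cpuset_setaffinity( CPU_LEVEL_WHICH
                                    , CPU_WHICH_TID
                                    , id_t{-1} // -1: the current thread
                                    , sizeof(cpuset_t)
                                    , &one_cpu
                                    )
            )
            throw std::runtime_error("failed to set cpu");
    }

    int main()
    {
        // Two threads, each locked to a distinct core; cpu0 with
        // cpu2 is the (C) pairing from the earlier classification.
        std::thread t0([]{ pin_current_thread_to(0); /* ... work ... */ });
        std::thread t1([]{ pin_current_thread_to(2); /* ... work ... */ });
        t0.join();
        t1.join();
    }

Such a build can still be combined with a cpuset command line,
which bounds which cpus the pinning is allowed to select.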

It turns out that, for a Cortex-A72 system with two 512 KiByte
L2 caches (one per pair of cores) and an overall 1 MiByte
exclusive L3 cache, I was also able to see an ordering
inside (A) for the cpu-lock-down based testing [as well as
seeing (B) and (C)].

> I test both at the boot -s command prompt and at normal login
> command prompts (which vary the competing activity, including
> RAM cache use). The effect is measurable both ways.
> 
> I have tested two distinct RPi4B's but have access to only
> one Rock64. All 3 systems show the general structure
> reported.
> 
> 
> As for graphs showing examples . . .
> 
> In the graphs of the results, the colored curves (green,
> blue, red) are the cpu-pair curves. I provide dark grey
> curves for the single-threaded and 4-core cases as context
> for comparison. Any 3-thread examples included for
> comparison are in light grey.
> 
> green: cpu pair does not involve cpu2 (fastest)
> red:   cpu pair is cpu0 and cpu2 (slowest)
> blue:  cpu pair involves cpu2 but not cpu0 (between)
> 
> (The single-threaded curve(s) are the most different
> from the others on each SBC, so they stand out.)
> 
> I'll note that, for the multi-threaded curves, being toward
> the left on the x-axis means thread creation is a larger
> fraction of the overall time for that problem size, and that
> limits the y-axis figure for the size. (For multi-threaded
> runs, thread creation is part of what is measured for each
> problem size.)
> 
> x-axis: logarithmic, base 4, for "kernel vectors: total Bytes"
>        (a computer-oriented indication of the size of the problem)
> 
> y-axis: linear for the type of speed figure
> 
> 
> A Rock64 .png image of an example context's graph is at:
> 
> https://github.com/markmi/acpphint/blob/master/acpphint_example_data/Rock64-cpu-pairs-oddity.png
> 
> 
> An RPi4B .png image of an example context's graph is at
> (the y-axis range differs from the Rock64's y-range):
> 
> https://github.com/markmi/acpphint/blob/master/acpphint_example_data/RPi4B-cpu-pairs-oddity.png
> 
> For the RPi4B graph, there is a peak for each color and
> both sides of the peak show the issue, but more so on
> the left side.
> 
> 
> Notes:
> 
> The program is a C++17 variant of some of the old
> HINT benchmarks. For reference for the data types
> involved in the graphed data:
> 
> ull: unsigned long long (64 bits here)
> ul:  unsigned long      (also 64 bits here)
> 
> So variations between the two give some idea of
> the degree of other sources of variability in the
> measurements (ull and ul are essentially equivalent).
> 
> Without the cpu lock-down code being built in, the
> program is portable C++17 code with nothing
> system-specific. Building with the cpu lock-down code
> does add system-specific code (FreeBSD-specific here).
> 
> I build with g++ (even when using the system
> libc++ and such instead of g++'s libraries). This
> is because the resulting program happens to be more
> performant in every case that I've compared. Being
> more performant makes oddities easier to notice
> when checking for them.
> 
> Other than when building for comparisons to Linux,
> which uses g++'s libraries, I use the FreeBSD libc++
> and such because they are more performant at creating
> threads under FreeBSD (for example). Being more
> performant . . .
> 



===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)


