Re: HoneyComb first-boot notes [an L3/L2/L1/RAM performance oddity]

From: Mark Millard via freebsd-arm <freebsd-arm_at_freebsd.org>
Date: Sun, 11 Jul 2021 05:09:50 UTC
On 2021-Jun-24, at 16:25, Mark Millard <marklmi at yahoo.com> wrote:

> On 2021-Jun-24, at 16:00, Mark Millard <marklmi at yahoo.com> wrote:
> 
>> On 2021-Jun-24, at 13:39, Mark Millard <marklmi at yahoo.com> wrote:
>> 
>>> Repeating here what I've reported on the SolidRun Discord:
>>> 
>>> I decided to experiment with monitoring the reported
>>> temperatures as things stand. With the default heat-sink/fan
>>> and the 2 other fans in the case, a buildworld holding a load
>>> average around 16 for some time stayed with tz0 through tz6
>>> reporting between 61.0degC and 66.0degC, with ambient at
>>> about 20degC. (tz7 and tz8 report 0.1degC.) During stages
>>> with lower load averages, the tz0..tz6 temperatures back off
>>> some. So it looks like my default context keeps the system
>>> sufficiently cool for such use.
>>> 
>>> I'll note that the default heat-sink's fan is not operating
>>> at rates loud enough for me to hear it upstairs. I have heard
>>> the noisy mode from there during the early parts of booting
>>> Fedora 34 Server, for example.
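
For reference: the tz0..tz8 figures are ACPI thermal-zone
readings. On this sort of UEFI/ACPI based boot they show up as
sysctls along the lines of (if I remember the naming right):

 sysctl hw.acpi.thermal.tz0.temperature
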
>> 
>> So I updated my stable/13 source, built and installed
>> the update, then did a rm -fr of the build directory
>> tree and started a from-scratch build. The build
>> reported:
>> 
>> SYSTEM_COMPILER: Determined that CC=cc matches the source tree.  Not bootstrapping a cross-compiler.
>> and:
>> SYSTEM_LINKER: Determined that LD=ld matches the source tree.  Not bootstrapping a cross-linker.
>> 
>> as is my standard context for doing such "how long does
>> it take" buildworld buildkernel testing.
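
Roughly, each such from-scratch test is of the form (the paths
here are the defaults, not necessarily mine):

 rm -fr /usr/obj/<build-tree-in-question>
 cd /usr/src && make -j16 buildworld buildkernel
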
>> 
>> On aarch64 I do not build for targeting non-arm architectures.
>> This does save some time on the builds.
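
(If I remember right, the src.conf knob for avoiding the extra
LLVM targets is along the lines of WITHOUT_LLVM_TARGET_ALL= ,
which leaves just the native target's backend enabled.)
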
> 
> I should have mentioned that my builds are tuned for the
> Cortex-A72 via use of -mcpu=cortex-a72. This was also true
> of the kernel and world of the live system that was doing
> the building.
> 
>> The results for the HoneyComb configuration I'm using:
>> 
>> World build completed on Thu Jun 24 15:30:11 PDT 2021
>> World built in 3173 seconds, ncpu: 16, make -j16
>> Kernel build for GENERIC-NODBG-CA72 completed on Thu Jun 24 15:34:45 PDT 2021
>> Kernel(s)  GENERIC-NODBG-CA72 built in 274 seconds, ncpu: 16, make -j16
>> 
>> So World+Kernel took a little under 1 hr to build (-j16).
>> 
>> 
>> 
>> Comparison/contrast to prior aarch64 systems that I've used
>> for buildworld buildkernel . . .
>> 
>> 
>> By contrast, the (now failed) OverDrive 1000's last timing
>> was (building releng/13 instead of stable/13):
>> 
>> World build completed on Tue Apr 27 02:50:52 PDT 2021
>> World built in 12402 seconds, ncpu: 4, make -j4
>> Kernel build for GENERIC-NODBG-CA72 completed on Tue Apr 27 03:08:04 PDT 2021
>> Kernel(s)  GENERIC-NODBG-CA72 built in 1033 seconds, ncpu: 4, make -j4
>> 
>> So World+Kernel took a little under 3.75 hrs to build (-j4).
>> 
>> 
>> The MACCHIATObin Double Shot's last timing was
>> (building a 13-CURRENT):
>> 
>> World build completed on Tue Jan 19 03:44:59 PST 2021
>> World built in 14902 seconds, ncpu: 4, make -j4
>> Kernel build for GENERIC-NODBG completed on Tue Jan 19 04:04:25 PST 2021
>> Kernel(s)  GENERIC-NODBG built in 1166 seconds, ncpu: 4, make -j4
>> 
>> So World+Kernel took a little under 4.5 hrs to build (-j4).
>> 
>> 
>> The RPi4B 8GiByte's last timing was
>> ( arm_freq=2000, sdram_freq_min=3200, force_turbo=1, USB3 SSD
>> building releng/13 ):
>> 
>> World build completed on Tue Apr 20 14:34:38 PDT 2021
>> World built in 22104 seconds, ncpu: 4, make -j4
>> Kernel build for GENERIC-NODBG completed on Tue Apr 20 15:03:24 PDT 2021
>> Kernel(s)  GENERIC-NODBG built in 1726 seconds, ncpu: 4, make -j4
>> 
>> So World+Kernel took somewhat under 6 hrs 40 min to build (-j4).
> 
> The -mcpu=cortex-a72 use note also applies to the OverDrive 1000,
> MACCHIATObin Double Shot, and RPi4B 8 GiByte contexts.
> 
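
For reference, that tuning amounts to something like the
following make.conf material (my actual setup differs in
detail, so treat this as just a sketch):

 CFLAGS+= -mcpu=cortex-a72
 CXXFLAGS+= -mcpu=cortex-a72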

I've run into an issue where what FreeBSD calls cpu 0 has
significantly different (worse) L3/L2/L1/RAM subsystem
performance than all the other cores. The same holds when
cpu 0 is compared/contrasted with all 4 of the MACCHIATObin
Double Shot's cores.

A plot with curves showing the issue is at:

https://github.com/markmi/acpphint/blob/master/acpphint_example_data/HoneyCombFreeBSDcpu0RAMAccessPerformanceIsOdd.png

The dark red curves in the plot are for cpu 0 and show the
general shape expected for such an issue. The lighter
colored curves are the MACCHIATObin curves. The darker ones
are the HoneyComb curves, for which the L3/L2/L1 caching is
relatively effective (other than for cpu 0).

My notes on Discord (so far) are . . .

The curves are from my C++ variant of the old Hierarchical
INTegration benchmark (historically abbreviated HINT). You
can read the approximate size of each level of cache from
the x-axis position where the curve starts dropping faster.
So, right (most obvious) to left (least obvious): L3 8
MiByte, L2 1 MiByte (per core pair, as it turns out), L1 32
KiByte.
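
To illustrate the kind of measurement involved (this is not
acpphint itself, just a minimal stand-alone sketch in the same
spirit): a random-permutation pointer chase, swept over
working-set sizes, shows the same sort of knees near the cache
capacities, via latency per dependent load rather than HINT's
QUIPS figure.

 // Not acpphint: a generic working-set sweep whose output
 // shows knees near L1/L2/L3 capacity, analogous to where
 // the HINT-style curves start dropping faster.
 // Example build: c++ -O2 -mcpu=cortex-a72 -o sweep sweep.cpp
 #include <algorithm>
 #include <chrono>
 #include <cstdio>
 #include <numeric>
 #include <random>
 #include <vector>

 int main() {
     std::mt19937_64 rng(1);
     // Sweep working sets from 16 KiByte to 16 MiByte.
     for (std::size_t bytes = std::size_t{16} << 10;
          bytes <= std::size_t{16} << 20; bytes *= 2) {
         const std::size_t n = bytes / sizeof(std::size_t);
         // Random-permutation chain: each slot holds the index
         // of the next slot, defeating hardware prefetch.
         std::vector<std::size_t> order(n);
         std::iota(order.begin(), order.end(), std::size_t{0});
         std::shuffle(order.begin(), order.end(), rng);
         std::vector<std::size_t> next(n);
         for (std::size_t i = 0; i < n; ++i)
             next[order[i]] = order[(i + 1) % n];

         const std::size_t loads = 20000000;
         std::size_t idx = order[0];
         const auto t0 = std::chrono::steady_clock::now();
         for (std::size_t i = 0; i < loads; ++i)
             idx = next[idx]; // dependent loads: latency bound
         const auto t1 = std::chrono::steady_clock::now();

         const double ns =
             std::chrono::duration<double, std::nano>(t1 - t0)
                 .count() / double(loads);
         // Print idx too so the chase cannot be optimized away.
         std::printf("%8zu KiByte: %7.2f ns/load (%zu)\n",
                     bytes >> 10, ns, idx);
     }
     return 0;
 }

The random permutation is what defeats the prefetchers, so the
drops track cache capacity rather than bandwidth.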

The curves here are for single-threaded benchmark
configurations, with cpuset used to control which CPU is
used. I first noticed the issue via odd performance
variations when multithreading with more cores allowed than
in use (and so with migrations across a variety of cpus
over time).
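
For example, a single run pinned to FreeBSD's cpu 3 looks
like (with the benchmark's own arguments elided):

 cpuset -l 3 ./acpphint ...

(cpuset's -l option takes the list of CPUs that the command
is allowed to run on.)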

I explored all the CPUs (cores), not just the ones plotted.
Only cpu 0 shows the odd-performing memory access structure
in its curve.

FYI: The FreeBSD boot is UEFI/ACPI based for both systems,
not U-Boot based.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)