7950X3D: using 1 hardware thread per core vs. 2 hardware threads per core: a fairly large difference

From: Mark Millard <marklmi_at_yahoo.com>
Date: Fri, 19 Jan 2024 08:22:01 UTC
I do not know how much the below generalizes as I do not
have access to other rather modern FreeBSD amd64 systems
to test, just the 7950X3D system with 192 GiBytes of RAM.

The gist:

https://gist.github.com/markmi/193423c6fd6f534a72725d7d5cd0236a

is an image showing performance curves for a benchmark.
Each curve is for 8 hardware threads in use. The x axis
is for the problem size (Bytes, logarithmic scaling). The
y axis is performance (linear). (It is a mathematical
definition in a mathematical approximation problem that
is handled a specific way in the benchmark.) As the
problem size grows significantly larger than a RAM cache,
the access pattern makes the RAM cache become notably less
effective. The benchmark variant restricts each software
thread to a specific hardware thread (singleton cpuset)
after the thread starts, generally avoiding losing
structural information to thread migration variability
in the structures used.
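
For illustration only, here is a minimal sketch of the
"pin after the thread starts" idea. It is in Python, not the
benchmark's own code (which is not shown in this note), and
the function names are hypothetical. It assumes a platform
where Python exposes os.sched_setaffinity and where pid 0
means "the caller" (true on Linux); elsewhere the helper just
reports that it could not pin.

```python
import os
import threading

def pin_current_thread(cpu):
    """Try to restrict the calling thread to one hardware thread.

    Returns True on success, False if the platform lacks the call
    or rejects the request. Assumption: pid 0 means "the caller".
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, {cpu})   # singleton CPU set
    except OSError:
        return False
    return True

def run_pinned(cpu, work):
    """Start a thread, pin it once it is running, then call work()."""
    outcome = {}

    def body():
        outcome["pinned"] = pin_current_thread(cpu)
        outcome["result"] = work()

    t = threading.Thread(target=body)
    t.start()
    t.join()
    return outcome
```

For example, run_pinned(0, some_function) runs some_function
on a thread that was asked to stay on CPU 0 and reports whether
the pin request succeeded.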

The major performance difference ends up being tied to:

1 hardware thread per core
vs.
2 hardware threads per core
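
As a sketch of what the two layouts could look like as CPU id
lists for 8 software threads: the code below assumes that the
hardware threads of core c are numbered 2*c and 2*c+1. That
numbering is an assumption, not something stated above, and
select_cpus is a hypothetical helper. Check the real topology
(for example via sysctl kern.sched.topology_spec on FreeBSD)
before relying on it.

```python
def select_cpus(n_threads, per_core, smt_width=2):
    """Return CPU ids that use `per_core` hardware threads per core.

    Assumption: core c owns ids c*smt_width .. c*smt_width+smt_width-1.
    """
    if not 1 <= per_core <= smt_width:
        raise ValueError("per_core must be between 1 and smt_width")
    if n_threads % per_core:
        raise ValueError("n_threads must be a multiple of per_core")
    cores = n_threads // per_core
    return [c * smt_width + s
            for c in range(cores)
            for s in range(per_core)]

one_per_core = select_cpus(8, per_core=1)   # 8 cores, 1 thread each
two_per_core = select_cpus(8, per_core=2)   # 4 cores, 2 threads each
```

Both lists have 8 entries; only the number of distinct cores
they touch differs (8 vs. 4).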

A quick textual summary giving a clue is:

1 per core, 8 cores: around 800*(10^6) to 850*(10^6) peak.
2 per core, 4 cores: around 500*(10^6) to 550*(10^6) peak.
                                              (same units)
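
The arithmetic behind that summary, as a small sketch. The
only inputs are the approximate peak ranges quoted above; the
names are hypothetical.

```python
# Peak ranges quoted above, in the benchmark's own units.
ONE_PER_CORE_PEAK = (800e6, 850e6)   # 8 cores, 1 hardware thread each
TWO_PER_CORE_PEAK = (500e6, 550e6)   # 4 cores, 2 hardware threads each

def ratio_bounds(numer, denom):
    """Smallest and largest numer/denom ratio for (low, high) ranges."""
    return numer[0] / denom[1], numer[1] / denom[0]

low, high = ratio_bounds(TWO_PER_CORE_PEAK, ONE_PER_CORE_PEAK)
print(f"2-per-core peak is {low:.3f} to {high:.3f} of the 1-per-core peak")
```

So, with the same 8 hardware threads in use, the 2-per-core
peak is very roughly 60% to 70% of the 1-per-core peak.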

But it is not just the peaks: far more of each curve shows
large differences in the same direction, generally for the
same level of caching. Think of the area under a curve over
a size range as what matters for that size range.
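
One way to make the "area under a curve" comparison concrete
is sketched below. It integrates over log10(size) because the
plot's x axis is logarithmic; whether that is the right
weighting for a given use is an assumption. The function name
is hypothetical and no real benchmark data is included.

```python
import math

def area_over_range(points, lo, hi):
    """Trapezoid-rule area under a performance curve.

    points: (size_bytes, performance) pairs sorted by size.
    Integrates over log10(size); only segments that lie fully
    inside [lo, hi] are counted.
    """
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= lo and x1 <= hi:
            width = math.log10(x1) - math.log10(x0)
            total += 0.5 * (y0 + y1) * width
    return total
```

Computing this for both curves over the same size range gives
one number per curve to compare for that range.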

Each hardware thread does independent processing. (But
the threads' results are combined to get the overall
result for a problem size.) So more RAM cache sharing
and other resource sharing is involved for 2 threads
per core, and the competition for those shared
resources has non-trivial performance consequences.

The far right of each curve [around 150*(10^6)] compared
with the peak of that curve suggests how much the
RAM-caching helps the performance (or how much the
processor waits for RAM when RAM-caching is not very
effective vs. when RAM-caching is more effective).
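
A small sketch of that comparison, using only the approximate
figures quoted in this note (the names are hypothetical):

```python
FAR_RIGHT = 150e6   # approximate value at the largest problem sizes
PEAKS = {
    "1 per core, 8 cores": (800e6, 850e6),
    "2 per core, 4 cores": (500e6, 550e6),
}

def gain_over_far_right(peak_range, floor=FAR_RIGHT):
    """Peak performance as a multiple of the far-right value."""
    return peak_range[0] / floor, peak_range[1] / floor

gains = {name: gain_over_far_right(r) for name, r in PEAKS.items()}
for name, (low, high) in gains.items():
    print(f"{name}: peak is {low:.2f}x to {high:.2f}x the far right")
```

That is roughly 5.3 to 5.7 times the far-right figure for
1 per core and roughly 3.3 to 3.7 times for 2 per core.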

The RAM is DDR5-5200, 2 DIMMs per channel, 2 channels,
48 GiBytes per DIMM.
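
As a cross-check of that configuration, a short sketch. The
capacity follows directly from the figures above. The
bandwidth figure is only a nominal upper bound and assumes
64 data bits (8 bytes) per channel per transfer and that the
memory actually runs at the rated 5200 MT/s, neither of which
is stated in this note.

```python
DIMMS_PER_CHANNEL = 2
CHANNELS = 2
GIB_PER_DIMM = 48
TRANSFERS_PER_SEC = 5200e6   # DDR5-5200 rating
BYTES_PER_TRANSFER = 8       # assumption: 64 data bits per channel

capacity_gib = DIMMS_PER_CHANNEL * CHANNELS * GIB_PER_DIMM
nominal_bytes_per_sec = CHANNELS * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER

print(f"capacity: {capacity_gib} GiBytes")
print(f"nominal peak bandwidth: {nominal_bytes_per_sec / 1e9:.1f} GB/s")
```

The capacity works out to the 192 GiBytes mentioned at the
top of this note.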


Note: The benchmark can also be built to not have the
CPU LockDown used, allowing general migration of
software threads across the hardware threads in
a cpuset. Seeing the CPU LockDown results first can
help interpret the messier with-migration results.

===
Mark Millard
marklmi at yahoo.com