Re: Ryzen 9 7950X3D bulk -a times: adding an example with SMT disabled (so 16 hardware threads, not 32)

From: Mark Millard <marklmi_at_yahoo.com>
Date: Thu, 16 Nov 2023 04:50:21 UTC
On Nov 12, 2023, at 18:00, Mark Millard <marklmi@yahoo.com> wrote:

> On Nov 9, 2023, at 17:26, Mark Millard <marklmi@yahoo.com> wrote:
> 
>> Reading some benchmark results for compilation activity that showed some
>> SMT vs. not examples and also using my C++ variant of the old HINT
>> benchmark, I ended up curious how a non-SMT from scratch bulk -a would
>> end up (ZFS context) compared to my prior SMT-based run.
>> 
>> I use a high load average style of bulk -a activity that has USE_TMPFS=all
>> involved. The system has 96 GiBytes of RAM (total across the 2 DIMMs).
>> The original run, at under 1.5 days, definitely had significant swap space
>> use (RAM+SWAP = 96 GiBytes + 364 GiBytes == 460 GiBytes == 471040 MiBytes).
>> The media was (and is) a PCIe based Optane 905P 1.5T. ZFS on a single
>> partition on the single drive, ZFS used just for bectl reasons, not other
>> typical use-ZFS reasons. I've not controlled the ARC size-range explicitly.
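>> 
>> (As a quick sh check of the RAM+SWAP arithmetic:
>> 
>> # echo $(( (96+364) * 1024 ))
>> 471040
>> )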
>> 
>> So less swap partition use is part of what contributes to the results.
>> 
>> The original bulk -a spent a couple of hours at the end where it was
>> just fetching and building textproc/stardict-quick . I have not cleared
>> out /usr/ports/distfiles or updated anything.
>> 
>> So fetch time is also a difference here.
>> 
>> SMT (32 hardware threads, original bulk -a):
>> 
>> [33:10:00] [32] [04:37:23] Finished emulators/libretro-mame | libretro-mame-20220124_1: Success
>> [35:36:51] [23] [03:44:04] Finished textproc/stardict-quick | stardict-quick-2.4.2_9: Success
>> . . .
>> [main-amd64-bulk_a-default] [2023-11-01_07h14m50s] [committing:] Queued: 34683 Built: 33826 Failed: 179   Skipped: 358   Ignored: 320   Fetched: 0     Tobuild: 0      Time: 35:37:55
>> 
>> Swap-involved MaxObs (Max Observed) figures:
>> 173310Mi MaxObsUsed
>> 256332Mi MaxObs(Act+Lndry+SwapUsed)
>> 265551Mi MaxObs(Act+Wir+Lndry+SwapUsed)
>> (So 265551Mi of 471040Mi RAM+SWAP.)
>> 
>> Just-RAM MaxObs figures:
>> 81066Mi MaxObsActive
>> (Given the complications of getting usefully comparable wired figures for ZFS (ARC): omit.)
>> 94493Mi MaxObs(Act+Wir+Lndry)
>> 
>> Note: MaxObs(A+B+C) <= MaxObs(A)+MaxObs(B)+MaxObs(C)
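>> (The components need not peak at the same time. For example, if
>> A peaks at 10 while B is at 2, and B peaks at 8 while A is at 3,
>> the sums observed at those two times are 12 and 11, yet
>> MaxObs(A)+MaxObs(B) = 18.)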
>> 
>> ALLOW_MAKE_JOBS=yes was used. No explicit restriction on PARALLEL_JOBS
>> or MAKE_JOBS_NUMBER (or analogous). So 32 builders allowed, each allowed
>> 32 make jobs. This explains the high load averages of the bulk -a :
>> 
>> load averages . . . MaxObs: 360.70, 267.63, 210.84
>> (Those need not be all from the same time frame during the bulk -a .)
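>> 
>> For reference, a sketch of the poudriere.conf knobs involved (an
>> illustrative excerpt, not a verbatim copy of my configuration):
>> 
>> # /usr/local/etc/poudriere.conf (excerpt)
>> USE_TMPFS=all        # build trees live in tmpfs, hence the swap use
>> ALLOW_MAKE_JOBS=yes  # allow parallel make jobs within each builder
>> # PARALLEL_JOBS left unset: it defaults to the hardware thread
>> # count (here 32), and with MAKE_JOBS_NUMBER also unrestricted,
>> # up to 32 builders each may run up to 32 make jobs.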
>> 
>> As for the ports vintage:
>> 
>> # ~/fbsd-based-on-what-commit.sh -C /usr/ports/
>> 6ec8e3450b29 (HEAD -> main, freebsd/main, freebsd/HEAD) devel/sdts++: Mark DEPRECATED
>> Author:     Muhammad Moinur Rahman <bofh@FreeBSD.org>
>> Commit:     Muhammad Moinur Rahman <bofh@FreeBSD.org>
>> CommitDate: 2023-10-21 19:01:38 +0000
>> branch: main
>> merge-base: 6ec8e3450b29462a590d09fb0b07ed214d456bd5
>> merge-base: CommitDate: 2023-10-21 19:01:38 +0000
>> n637598 (--first-parent --count for merge-base)
>> 
>> I do have an environment that keeps various LLVM ports from
>> taking as long to build (one way to set this up is sketched
>> after the list):
>> 
>> llvm1[3-7]  : no MLIR, no FLANG
>> llvm1[4-7]  : use BE_NATIVE
>> other llvm* : use defaults (so, no avoidance)
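>> 
>> (One way to express that, as a sketch: per-port OPTIONS overrides
>> in the jail's make.conf, using the usual category_port prefix
>> form. Two of the ports are shown; the others follow the same
>> pattern:
>> 
>> # /usr/local/etc/poudriere.d/make.conf (excerpt)
>> devel_llvm15_UNSET+=MLIR FLANG
>> devel_llvm15_SET+=BE_NATIVE
>> devel_llvm16_UNSET+=MLIR FLANG
>> devel_llvm16_SET+=BE_NATIVE
>> )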
>> 
>> I also prevent the builds from using strip on most of the install
>> materials built (not just toolchain materials).
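>> 
>> (A sketch of one way to get that effect: overriding STRIP to be
>> empty in the jail's make.conf so the ports framework's installs
>> do not pass -s to install. Whether this matches every case here
>> is an assumption:
>> 
>> # /usr/local/etc/poudriere.d/make.conf (excerpt)
>> STRIP=
>> )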
>> 
>> 
>> non-SMT (16 hardware threads):
>> 
>> Note that one builder (math/fricas), the last still present,
>> was stuck and I had to kill processes to make it stop, rather
>> than wait out my large timeout figures. The last builder to
>> finish normally was:
>> 
>> [39:48:10] [09] [00:16:23] Finished devel/gcc-msp430-ti-toolchain | gcc-msp430-ti-toolchain-9.3.1.2.20210722_1: Success
>> 
>> So, trying to place some bounds comparing SMT (32 hw threads)
>> and non-SMT (16 hw threads):
>> 
>> 33:10:00 SMT -> 39:48:10 non-SMT would be over 6.5 hrs longer for non-SMT
>> 35:36:51 SMT -> 39:48:10 non-SMT would be over 4 hrs longer for non-SMT
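>> 
>> (Checking those deltas via sh, in seconds:
>> 
>> # echo $(( (39*3600+48*60+10) - (33*3600+10*60) ))
>> 23890
>> # echo $(( (39*3600+48*60+10) - (35*3600+36*60+51) ))
>> 15079
>> 
>> i.e., 6:38:10 and 4:11:19, respectively.)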
>> 
>> As for SMT vs. non-SMT Maximum Observed figures:
>> 
>> SMT     load averages . . . MaxObs: 360.70, 267.63, 210.84
>> non-SMT load averages . . . MaxObs: 152.89, 100.94,  76.28
>> 
>> Swap-involved MaxObs figures for SMT (32 hw threads) vs not (16):
>> 173310Mi vs.  33003Mi MaxObsUsed
>> 256332Mi vs. 117221Mi MaxObs(Act+Lndry+SwapUsed)
>> 265551Mi vs. 124776Mi MaxObs(Act+Wir+Lndry+SwapUsed)
>> 
>> Just-RAM MaxObs figures for SMT (32 hw threads) vs not (16):
>> 81066Mi vs. 69763Mi MaxObsActive
>> (Given the complications of getting usefully comparable wired figures for ZFS (ARC): omit.)
>> 94493Mi vs. 94303Mi MaxObs(Act+Wir+Lndry)
>> 
> 
> I've added a section for a plot for the 7950X3D to the end of:
> 
> https://github.com/markmi/acpphint/blob/master/Some_acpphint_curves_with_notes.md
> 
> It is from a C++ variant of the old HINT benchmark and shows
> RAM caching consequences for the benchmark. The approximately
> 32 MiByte and approximately 96 MiByte cache sizes for the 2
> CCDs are observable.
> 
> I'll also note that, with the devices present (active and not),
> the fully active 7950X3D system seems to draw 225 Watts .. 235
> Watts at the power cable under FreeBSD. Idle under FreeBSD: more
> like 96 Watts.
> 
> (No video card. 2 forms of Optane 905P 1.5TB, one active. One
> Samsung 960 Pro 2TB, inactive. One Samsung 970 EVO Plus 2TB,
> inactive. 96 GiBytes of RAM total across 2 DIMMs. Fans and
> AIO cooling. Keyboard and mouse USB powered. USB3 Ethernet
> dongle. Monitor connection.)
> 
> 
> ThreadRipper 1950X "bulk -a" test in progress:
> 
> I'm running a from-scratch USE_TMPFS=all "bulk -a" on the
> ThreadRipper 1950X (128 GiBytes of RAM). From what I've seen
> so far, it looks likely to take over 72 hr, so 2x+ as long
> as the 7950X3D. (Samsung 960 Pro 1TB system media and
> Optane 900P 480 GB swap space media in use, 447 GiBytes as I
> remember. The ZFS partition on the 960 Pro has ashift=14.)
> It has a slightly modified copy of the ZFS from the 7950X3D
> as far as starting content goes. It does have openzfs-2.2
> compatibility fully enabled for its pool, including block
> cloning, unlike any other ZFS I have around
> (openzfs-2.1-freebsd).
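> 
> (As a sketch of inspecting and enabling such, assuming a pool
> named zroot; the feature-set files live under
> /usr/share/zfs/compatibility.d/ :
> 
> # zpool get compatibility zroot
> # zpool set compatibility=openzfs-2.2 zroot
> # zpool upgrade zroot
> )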

ThreadRipper 1950X:

. . .
[85:21:50] [27] [02:06:01] Finished databases/mongodb60 | mongodb60-6.0.11: Success
[85:34:00] [28] [03:23:06] Finished biology/ncbi-cxx-toolkit | ncbi-cxx-toolkit-27.0.0_1: Success
[85:46:31] [30] [08:19:30] Finished cad/kicad-library-packages3d | kicad-library-packages3d-7.0.2_2: Success
[87:07:02] [03] [13:00:45] Finished emulators/libretro-mame | libretro-mame-20220124_1: Success

But one port that normally takes little time got stuck (in kqread,
apparently against a <defunct> child process), resulting in (later):

# poudriere status -b
[main-amd64-bulk_a-default] [2023-11-11_17h59m25s] [parallel_build:] Queued: 34683 Built: 33807 Failed: 173   Skipped: 382   Ignored: 320   Fetched: 0     Tobuild: 1      Time: 88:17:59
 ID  TOTAL        ORIGIN   PKGNAME                PHASE PHASE    TMPFS    CPU% MEM%
[05] 17:27:25 ftp/curlie | curlie-1.6.7_15 check-sanity 17:27:15 1.28 GiB          
=>> Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-11-11_17h59m25s

So it looks like:

Ryzen 9 7950X3D     96 GiBytes RAM (5600 MT/s): 33 hr or so.
ThreadRipper 1950X 128 GiBytes RAM (2400 MT/s): 87 hr or so.

For reference (both 32 hardware threads):

Ryzen 9    7950X3D: 265551Mi MaxObs(Act+Wir+Lndry+SwapUsed)
ThreadRipper 1950X: 245564Mi MaxObs(Act+Wir+Lndry+SwapUsed)

(The 96 GiByte vs. 128 GiByte RAM size difference makes other
figures messier to compare.)

I have updated the 7950X3D system's UEFI and am rerunning the
from-scratch bulk -a test in the ZFS context to check on system
stability with the updated firmware.

===
Mark Millard
marklmi at yahoo.com