performance regressions in 15.0

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Sat, 06 Dec 2025 10:50:08 UTC
I was pointed at Phoronix: https://www.phoronix.com/review/freebsd-15-amd-epyc

While I don't treat their results as gospel, a FreeBSD vs FreeBSD test
showing a slowdown most definitely warrants a closer look.

They observed slowdowns when using iperf over localhost and when compiling llvm.

I can confirm both problems and more.

I found the userspace profiling tooling to be broken again, so I did
not investigate much, and I'm not going to dig into it further.

The test box is an AMD EPYC 9454 48-Core Processor, with the two
systems running as 8-core VMs under KVM.

I. iperf

Package is: iperf3-3.19.1

Tested with: iperf3 -s + iperf3 -c localhost

While the rates fluctuate, 14.3 is overall faster:

[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.01   sec  2.70 GBytes  23.1 Gbits/sec
[  5]   1.01-2.07   sec  1.92 GBytes  15.5 Gbits/sec
[  5]   2.07-3.01   sec  1.76 GBytes  16.1 Gbits/sec
[  5]   3.01-4.02   sec  1.86 GBytes  15.9 Gbits/sec
[  5]   4.02-5.01   sec  2.84 GBytes  24.5 Gbits/sec
[  5]   5.01-6.02   sec  2.54 GBytes  21.7 Gbits/sec
[  5]   6.02-7.07   sec  2.18 GBytes  17.8 Gbits/sec
[  5]   7.07-8.02   sec  1.76 GBytes  15.9 Gbits/sec
[  5]   8.02-9.01   sec  1.88 GBytes  16.3 Gbits/sec
[  5]   9.01-10.02  sec  1.90 GBytes  16.2 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.02  sec  21.3 GBytes  18.3 Gbits/sec                  receiver

vs 15.0:
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.01   sec  1.85 GBytes  15.7 Gbits/sec
[  5]   1.01-2.02   sec  3.23 GBytes  27.5 Gbits/sec
[  5]   2.02-3.03   sec  1.84 GBytes  15.7 Gbits/sec
[  5]   3.03-4.01   sec  1.86 GBytes  16.3 Gbits/sec
[  5]   4.01-5.01   sec  1.64 GBytes  14.1 Gbits/sec
[  5]   5.01-6.07   sec  1.87 GBytes  15.1 Gbits/sec
[  5]   6.07-7.01   sec  1.23 GBytes  11.3 Gbits/sec
[  5]   7.01-8.01   sec  1.85 GBytes  15.8 Gbits/sec
[  5]   8.01-9.01   sec  1.42 GBytes  12.2 Gbits/sec
[  5]   9.01-10.01  sec  1.81 GBytes  15.5 Gbits/sec
[  5]  10.01-10.07  sec  99.9 MBytes  14.1 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.07  sec  18.7 GBytes  16.0 Gbits/sec                  receiver

This is reliably repeatable.
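To put a single number on "overall faster", the two receiver summaries (18.3 vs 16.0 Gbits/sec) can be compared directly. The helper below is a minimal C sketch of that arithmetic (the name rel_change_pct is made up for illustration). For the iperf runs it gives roughly -12.6%, and the same formula reproduces the malloc and execve percentages quoted further down.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage change from `before` to `after`. For throughput-style
 * metrics (higher is better) a negative result is a regression. */
static double rel_change_pct(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

For example, printf("%.1f%%\n", rel_change_pct(18.3, 16.0)); prints -12.6%.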

II. compilation speed

This is the real and serious problem. Both versions of the system ship
the same clang version:
FreeBSD clang version 19.1.7 (https://github.com/llvm/llvm-project.git llvmorg-19.1.7-0-gcd708029e0b2)
Target: x86_64-unknown-freebsd14.3
Thread model: posix
InstalledDir: /usr/bin

FreeBSD clang version 19.1.7 (https://github.com/llvm/llvm-project.git llvmorg-19.1.7-0-gcd708029e0b2)
Target: x86_64-unknown-freebsd15.0
Thread model: posix
InstalledDir: /usr/bin

I found that compiling the will-it-scale suite takes about twice the
real time, with the time spent in userspace roughly doubling as well.

will-it-scale needs a bit of massaging to build; the diff is at the end.

Check this out (repeatable):
while true; do gmake -s clean && time gmake -s -j 8; done

14.3:
gmake -s -j 8  8.93s user 2.03s system 769% cpu 1.42s (1.424) total
gmake -s -j 8  9.02s user 2.16s system 757% cpu 1.48s (1.475) total
gmake -s -j 8  9.29s user 1.95s system 774% cpu 1.45s (1.450) total
gmake -s -j 8  8.97s user 2.46s system 770% cpu 1.48s (1.484) total
gmake -s -j 8  9.13s user 2.30s system 773% cpu 1.48s (1.477) total

15.0:
gmake -s -j 8  19.90s user 3.02s system 773% cpu 2.96s (2.963) total
gmake -s -j 8  19.90s user 3.18s system 774% cpu 2.98s (2.979) total
gmake -s -j 8  20.24s user 2.90s system 770% cpu 3.00s (3.005) total
gmake -s -j 8  19.92s user 3.25s system 771% cpu 3.00s (3.003) total
gmake -s -j 8  20.25s user 2.95s system 772% cpu 3.01s (3.006) total

User time *skyrocketed*.

This is not some weird scheduling anomaly either; here is the same
build pinned to a single CPU:
while true; do gmake -s clean && time cpuset -l 1 gmake -s; done

14.3:
cpuset -l 1 gmake -s  8.88s user 1.11s system 99% cpu 10.00s (10.003) total
cpuset -l 1 gmake -s  8.94s user 1.12s system 99% cpu 10.07s (10.067) total
cpuset -l 1 gmake -s  9.00s user 1.06s system 99% cpu 10.07s (10.072) total
cpuset -l 1 gmake -s  8.88s user 1.17s system 99% cpu 10.07s (10.069) total
cpuset -l 1 gmake -s  8.88s user 1.23s system 99% cpu 10.13s (10.127) total

15.0:
cpuset -l 1 gmake -s  21.58s user 2.33s system 99% cpu 23.96s (23.961) total
cpuset -l 1 gmake -s  21.16s user 2.54s system 99% cpu 23.76s (23.759) total
cpuset -l 1 gmake -s  19.90s user 1.90s system 99% cpu 21.85s (21.854) total
cpuset -l 1 gmake -s  19.76s user 1.74s system 99% cpu 21.55s (21.554) total
cpuset -l 1 gmake -s  19.72s user 1.75s system 99% cpu 21.53s (21.526) total
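Averaging the five runs of each configuration makes "about doubles" concrete. Mean user time goes from 9.07s to 20.04s (about 2.21x) with -j 8, and from 8.92s to 20.42s (about 2.29x) when pinned with cpuset -l 1. The -j 8 wall-clock time grows about 2.05x. A comparable user-time ratio with and without parallelism is consistent with the point that this is not a scheduling effect. The C below is only a sketch for redoing that arithmetic; the array, macro and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Samples copied from the gmake runs above, in seconds. */
static const double j8_user_143[]  = { 8.93, 9.02, 9.29, 8.97, 9.13 };
static const double j8_user_150[]  = { 19.90, 19.90, 20.24, 19.92, 20.25 };
static const double j8_real_143[]  = { 1.424, 1.475, 1.450, 1.484, 1.477 };
static const double j8_real_150[]  = { 2.963, 2.979, 3.005, 3.003, 3.006 };
static const double pin_user_143[] = { 8.88, 8.94, 9.00, 8.88, 8.88 };
static const double pin_user_150[] = { 21.58, 21.16, 19.90, 19.76, 19.72 };

#define NSAMPLES(a) (sizeof(a) / sizeof((a)[0]))

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / n;
}

/* How many times larger the 15.0 mean is than the 14.3 mean. */
static double slowdown(const double *v143, size_t n143,
    const double *v150, size_t n150)
{
	return mean(v150, n150) / mean(v143, n143);
}
```

For example, printf("%.2f\n", slowdown(j8_user_143, NSAMPLES(j8_user_143), j8_user_150, NSAMPLES(j8_user_150))); prints 2.21.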

As mentioned earlier, I found userspace profiling to be
non-operational and did not try to fight it.

I did, however, do a few sanity checks, mostly with will-it-scale:
1. The syscall rate is down over 7% (tested with getppid1_processes).
2. malloc also got slower (!). There are two benchmarks: one ends up
issuing syscalls, the other does not.
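The getppid1 test measures the syscall rate by calling getppid() in a tight loop. A stripped-down, single-process approximation might look like the sketch below. It is not the suite's own harness: as the diff at the end shows, will-it-scale test cases increment an iterations counter supplied by the framework. The names getppid_rate and elapsed_since are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>
#include <unistd.h>

/* Seconds elapsed since *start on the monotonic clock. */
static double elapsed_since(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (double)(now.tv_sec - start->tv_sec) +
	    (double)(now.tv_nsec - start->tv_nsec) / 1e9;
}

/* Issue `iterations` getppid() calls back to back and return calls per
 * second. getppid() does almost no work in the kernel, so the rate
 * mostly reflects the cost of entering and leaving a syscall. */
static double getppid_rate(unsigned long iterations)
{
	struct timespec start;

	assert(iterations > 0);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (unsigned long i = 0; i < iterations; i++)
		(void)getppid();
	return (double)iterations / elapsed_since(&start);
}
```

Running it pinned to one CPU on both releases (for example under cpuset -l 1) would keep the comparison in line with the other single-core numbers here.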

Results in ops/s:

malloc1_processes (malloc/free of 128MB):
14.3: 1960769
15.0: 1376087 (-30%)

malloc2_processes (malloc/free of 1kB):
14.3: 156034491
15.0: 51645759 (-67%)
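Both malloc tests boil down to a malloc()/free() loop of a fixed size (128MB and 1kB). The sketch below reproduces that loop with the same empty asm statement the diff at the end adds to tests/malloc1.c and tests/malloc2.c. It is a simplified stand-in for the will-it-scale harness, and malloc_free_rate is a hypothetical name. The __asm__ syntax is a GCC/clang extension.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Run `iterations` malloc(size)/free() pairs and return pairs per second,
 * or -1.0 if an allocation fails. */
static double malloc_free_rate(size_t size, unsigned long iterations)
{
	struct timespec start, end;

	assert(iterations > 0);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (unsigned long i = 0; i < iterations; i++) {
		void *addr = malloc(size);

		if (addr == NULL)
			return -1.0;
		/* Same empty asm as in the will-it-scale diff: it claims to
		 * read addr from memory, which is meant to keep the compiler
		 * from optimizing the malloc/free pair away. */
		__asm__ __volatile__("" :: "m" (addr));
		free(addr);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	return (double)iterations /
	    ((double)(end.tv_sec - start.tv_sec) +
	    (double)(end.tv_nsec - start.tv_nsec) / 1e9);
}
```

malloc_free_rate(1024, n) corresponds to the malloc2 case and malloc_free_rate(128UL * 1024 * 1024, n) to the malloc1 case.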

Apart from that, the kernel is slower overall; for example, negative
path lookups regressed as well (-12%).
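The text does not say which test produced the negative path lookup number. One simple way to exercise that path is to stat() a name that does not exist in a loop, as in this sketch. The function name and the path used below are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <time.h>

/* Repeatedly stat() a path that must not exist and return lookups per
 * second. Returns -1.0 if the path exists or the lookup fails with
 * something other than ENOENT, since then it is not a negative lookup. */
static double neg_lookup_rate(const char *path, unsigned long iterations)
{
	struct timespec start, end;
	struct stat sb;

	assert(iterations > 0);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (unsigned long i = 0; i < iterations; i++) {
		if (stat(path, &sb) == 0 || errno != ENOENT)
			return -1.0;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	return (double)iterations /
	    ((double)(end.tv_sec - start.tv_sec) +
	    (double)(end.tv_nsec - start.tv_nsec) / 1e9);
}
```

For example, neg_lookup_rate("/gr-no-such-dir/no-such-file", 1000000), with a path assumed not to exist on the test box.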

Another issue is the execve rate. To benchmark it, I borrowed the following:
http://apollo.backplane.com/DFlyMisc/doexec.c
cc -O2 doexec.c
cpuset -l 1 ./a.out 1

In ops/s:
14.3: 4905
15.0: 3672 (-25%)
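doexec.c itself is not reproduced here, so the following is not that program. It is a minimal fork/exec/wait loop of the same general kind, and its rate also includes the cost of fork() and waitpid(). The function name exec_rate is hypothetical. It uses execv(), which passes the current environment on to execve().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Fork, execv(argv[0], argv) and reap the child `count` times; return
 * completed execs per second, or -1.0 if any step fails or the child
 * does not exit with status 0. */
static double exec_rate(char *const argv[], int count)
{
	struct timespec start, end;

	assert(count > 0);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < count; i++) {
		int status;
		pid_t pid = fork();

		if (pid == -1)
			return -1.0;
		if (pid == 0) {
			execv(argv[0], argv);
			_exit(127);	/* only reached if execv() failed */
		}
		if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status) ||
		    WEXITSTATUS(status) != 0)
			return -1.0;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	return count / ((double)(end.tv_sec - start.tv_sec) +
	    (double)(end.tv_nsec - start.tv_nsec) / 1e9);
}
```

On FreeBSD, exec_rate((char *[]){ "/usr/bin/true", NULL }, 1000) execs the stock true(1) binary. Running it under cpuset -l 1, like the command above, keeps the measurement on one CPU.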

The compilation slowdown might turn out to be clang-specific. Whatever
it is, I think the total slowdown is serious enough to warrant
investigation and an errata notice. But you do you; I am *not* going
to work on this.

will-it-scale howto:
pkg install gmake hwloc
git clone https://github.com/antonblanchard/will-it-scale

Then apply this:
diff --git a/Makefile b/Makefile
index 8dd0717..d779705 100644
--- a/Makefile
+++ b/Makefile
@@ -1,9 +1,11 @@
-CFLAGS+=-Wall -O2 -g
-LDFLAGS+=-lhwloc
+CFLAGS+=-Wall -O2 -g -I/usr/local/include
+LDFLAGS+=-lhwloc -L/usr/local/lib

 processes := $(patsubst tests/%.c,%_processes,$(wildcard tests/*.c))
 threads := $(patsubst tests/%.c,%_threads,$(wildcard tests/*.c))

+threadspawn1_processes_FLAGS+=-lpthread
+
 all: processes threads

 processes: $(processes)
diff --git a/tests/malloc1.c b/tests/malloc1.c
index 14d4c3b..05737bb 100644
--- a/tests/malloc1.c
+++ b/tests/malloc1.c
@@ -12,6 +12,7 @@ void testcase(unsigned long long *iterations, unsigned long nr)
        while (1) {
                void *addr = malloc(SIZE);
                assert(addr != NULL);
+               asm volatile("" :: "m" (addr));
                free(addr);

                (*iterations)++;
diff --git a/tests/malloc2.c b/tests/malloc2.c
index c24aceb..e769dd3 100644
--- a/tests/malloc2.c
+++ b/tests/malloc2.c
@@ -12,6 +12,7 @@ void testcase(unsigned long long *iterations, unsigned long nr)
        while (1) {
                void *addr = malloc(SIZE);
                assert(addr != NULL);
+               asm volatile("" :: "m" (addr));
                free(addr);

                (*iterations)++;