Re: It's time to kill statistical profiling

From: John Baldwin
Date: Fri, 18 Jun 2021 15:12:53 UTC

On 6/18/21 12:36 AM, Poul-Henning Kamp wrote:
> Warner's work to document the kernel timers in D30802 brought stathz up again.
> To give a representative result, statistical profiling needs to
> sample no less than approx 0.1% of instructions.
> On a VAX that meant running the statistical profiling at O(1kHz).
> On my 4 CPU, two thread, 2GHz laptop that means statistical profiling
> needs to run at O(10 MHz), which is barely doable.
> But it is worse:
> The samples must be unbiased with respect to the system activity,
> which was already a problem on the VAX and which is totally impossible
> on modern hardware, with message based interrupts, deep pipelines
> and telegraphic distance memory[1].
> Therefore statistical profiling is worse than useless: it is downright
> misleading, which is why modern CPUs have hardware performance counters.
> Instead of documenting stathz, I suggest we retire statistical
> profiling and convert the profiled libraries to code-coverage
> profiling (-fprofile-arcs and -ftest-coverage)
> Poul-Henning
> [1] One could *possibly* approach unbiased samples, by locking the
> stathz code path in L1 cache and disabling L1 updates, but then
> the results would be from an entirely different system.

Note that profhz is the only one you could kill.  stathz is used by
statclock to compute rusage and the %CPU for ps(1), as well as the
system-wide (and per-CPU) cp_time stats.

What I would like to do for rusage is to have an option to split
up rux_runtime into separate "raw" iruntime, sruntime, and
uruntime and switch between them on kernel entry/exit similar to
what we do now in mi_switch().  This would remove the need for
iticks/uticks/sticks and the need for calcru() to subdivide the
total runtime and then play games to prevent individual times from
going backwards.  Instead, it would just do a straightforward
conversion of the component <x>runtime to the value getrusage()
wants.  I've just never gotten around to doing that.
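
A rough user-space sketch of that idea, not kernel code: every name
below (run_mode, mode_acct, acct_switch, acct_usec) is made up for
illustration, and the timestamps stand in for whatever raw counter
the kernel would actually read.  The point is only that each mode
change charges the elapsed time to the mode being left, so the
per-mode totals never need to be derived from tick ratios.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical execution modes; illustrative names, not kernel API. */
enum run_mode { MODE_USER, MODE_SYSTEM, MODE_INTR, MODE_COUNT };

/* Hypothetical per-thread accounting state. */
struct mode_acct {
	uint64_t runtime[MODE_COUNT];	/* raw ticks accumulated per mode */
	uint64_t last_switch;		/* timestamp of the last mode change */
	enum run_mode cur;		/* mode currently being charged */
};

static void
acct_init(struct mode_acct *a, uint64_t now, enum run_mode m)
{
	for (int i = 0; i < MODE_COUNT; i++)
		a->runtime[i] = 0;
	a->last_switch = now;
	a->cur = m;
}

/* Charge the elapsed time to the mode being left, then switch modes. */
static void
acct_switch(struct mode_acct *a, uint64_t now, enum run_mode next)
{
	assert(now >= a->last_switch);
	a->runtime[a->cur] += now - a->last_switch;
	a->last_switch = now;
	a->cur = next;
}

/* Straightforward conversion of one component to microseconds. */
static uint64_t
acct_usec(const struct mode_acct *a, enum run_mode m, uint64_t freq_hz)
{
	uint64_t t = a->runtime[m];

	/* Split the division to avoid overflowing t * 1000000. */
	return ((t / freq_hz) * 1000000 + (t % freq_hz) * 1000000 / freq_hz);
}
```

Because each component only ever grows, the reported times cannot go
backwards, which is the property calcru() has to work for today.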

However, even with that, you are still stuck with providing
whatever events the scheduler wants to set %CPU for ps(1).  You
also still need something to provide the kern.cp_time arrays
used for CPU usage.  statclock might still be the simplest way
to provide those.
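
For context, this is roughly how a consumer turns two snapshots of a
cp_time-style array into CPU usage.  It is a sketch under stated
assumptions: five states in the order user, nice, sys, intr, idle,
and a hypothetical helper name (cp_time_permille).  It takes the
arrays as arguments rather than reading any sysctl.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed state count: user, nice, sys, intr, idle. */
#define NSTATES 5

/*
 * Hypothetical helper: given two snapshots of per-state tick counters,
 * store each state's share of the interval in tenths of a percent,
 * rounded to nearest.  Returns 0 on success, or -1 if no ticks elapsed
 * between the snapshots.
 */
static int
cp_time_permille(const uint64_t prev[NSTATES], const uint64_t cur[NSTATES],
    unsigned out[NSTATES])
{
	uint64_t total = 0;

	for (int i = 0; i < NSTATES; i++) {
		assert(cur[i] >= prev[i]);	/* counters only increase */
		total += cur[i] - prev[i];
	}
	if (total == 0)
		return (-1);
	for (int i = 0; i < NSTATES; i++)
		out[i] = (unsigned)(((cur[i] - prev[i]) * 1000 + total / 2) /
		    total);
	return (0);
}
```

Whatever ends up feeding those counters only has to keep them
increasing; the consumer side cares about the deltas, not about how
the ticks were sampled.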

I agree that hwpmc is what one should use for real profiling, but
there's actually not much that you get to axe in the kernel when
removing the kernel-side support for the old profiling.

As Konstantin has noted, we already no longer build or ship
-pg libraries by default.  I'd be fine with removing the build
glue for that outright, or with generalizing it as Konstantin
suggests, though I would probably not even want to keep -pg as
one of the variants for the generalization.  To that end, I
would be fine with just removing all the -pg support and if
someone wants to add a new variant they can deal with making
it more general at that time.  I'd much rather someone spend
time on adding support for PGO and LTO to our build infrastructure
than trying to keep -pg alive.

John Baldwin