Re: It's time to kill statistical profiling

From: Stefan Esser <se_at_freebsd.org>
Date: Sat, 19 Jun 2021 09:12:01 UTC
On 18.06.21 at 17:12, John Baldwin wrote:
> Note that only profhz is what you could kill.  stathz is used for
> statclock to compute rusage and the %CPU for ps(1) as well as the
> cp_time stats for system-wide (and per-CPU) time stats.
> 
> What I would like to do for rusage is to have an option to split
> up rux_runtime into separate "raw" iruntime, sruntime, and
> uruntime and switch between them on kernel entry/exit similar to
> what we do now in mi_switch().  This would remove the need for
> iticks/uticks/sticks and the need for calcru() to try to do
> subdividing and then playing games to prevent individual times
> going backwards.  Instead, it would just do a straightforward
> conversion of the component <x>runtime to the value getrusage()
> wants.  I've just never gotten around to doing that.
> 
> However, even with that, you are still stuck with providing
> whatever events the scheduler wants to set %CPU for ps(1).  You
> also still need something to provide the kern.cp_time arrays
> used for CPU usage.  statclock might still be the simplest way
> to provide those.
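Just to illustrate that split (the field and helper names below
are invented for the sketch, not existing kernel API; only
cpu_ticks()/cpu_tickrate() are the existing cputick interface):

    #include <sys/param.h>
    #include <sys/systm.h>

    /*
     * Sketch: per-context raw cputick counters replacing the
     * single rux_runtime plus the iticks/sticks/uticks fields.
     */
    struct rusage_ext_sketch {
            uint64_t rux_uruntime;  /* raw user-mode cputicks */
            uint64_t rux_sruntime;  /* raw system-mode cputicks */
            uint64_t rux_iruntime;  /* raw interrupt cputicks */
    };

    /*
     * On kernel entry/exit (as mi_switch() does today for the
     * single counter), charge the elapsed ticks to the context
     * being left:
     */
    static __inline void
    rux_charge(uint64_t *ctx_runtime, uint64_t *last_tick)
    {
            uint64_t now = cpu_ticks();

            *ctx_runtime += now - *last_tick;
            *last_tick = now;
    }

    /*
     * calcru() then reduces to per-component conversions, e.g.
     * user time = rux_uruntime / cpu_tickrate(), monotonic by
     * construction.
     */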

If any major changes are considered in this area, I'd really
want to see our CPU statistics become SMT-aware:

If only one thread is executing per core on an SMT-capable CPU,
it consumes 100% of that core's cycles (and top shows 100% for
that process), but the system-wide CPU% is reported as only 50%,
since half of the logical CPUs (the alternate hardware thread of
each core) appear to be idle.

The correct way to deal with SMT would be to assume that a core
executing a single thread does so at 100% of the nominal clock
rate (which is not constant, to make matters worse), while with
two threads we get two "virtual" CPUs each executing at 60% (or
whatever the real factor is) of the nominal clock rate,
delivering 120% combined throughput.
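As a toy model of what I mean (the 1.2 factor is just an assumed
value; the real one is workload dependent):

    /* Toy userland model, not kernel code. */
    #include <stdio.h>

    #define SMT_FACTOR 1.2  /* assumed combined throughput */

    /* Capacity delivered per busy hardware thread of one core. */
    static double
    thread_capacity(int busy_threads)
    {
            return (busy_threads == 2 ? SMT_FACTOR / 2.0 : 1.0);
    }

    int
    main(void)
    {
            printf("1 busy thread:  %.0f%% of nominal core speed\n",
                100.0 * thread_capacity(1));
            printf("2 busy threads: %.0f%% each, %.0f%% combined\n",
                100.0 * thread_capacity(2),
                100.0 * 2 * thread_capacity(2));
            return (0);
    }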

We do collect topology information that describes the system in
sufficient detail, but we do not account for the reduced
throughput of each thread of an SMT pair. And I'm not sure that
the scheduler prefers allocating CPUs in such a way that only
one thread executes on each core until the load exceeds the
number of cores and it becomes preferable to actually run more
than one thread per core.

This also distorts statistics used by the scheduler, which
assumes that all cores run at the same speed all the time. Due
to frequency variations depending on load and other factors this
is not true anyway, but the difference between the single-core
maximum clock rate and the all-cores-loaded clock rate is not
that large. And since the scheduler has to make correct
decisions under high load (low-load situations do not suffer as
much from bad scheduling), it can be assumed that the CPU is
running at its base frequency.
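Under that assumption (helper name invented for the sketch), a
%CPU estimate could be derived from a raw cycle count like this:

    #include <sys/types.h>

    /*
     * Sketch only: cycles consumed versus cycles available at
     * the assumed base frequency over the measurement interval.
     */
    static uint64_t
    pctcpu_from_cycles(uint64_t cycles, uint64_t base_freq_hz,
        uint64_t interval_us)
    {
            uint64_t avail = base_freq_hz / 1000000 * interval_us;

            return (avail != 0 ? 100 * cycles / avail : 0);
    }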

One possibility could be to count actual cycles in the different
execution contexts (user, system, interrupt) and store them
together with information about the relative CPU performance
(whether the thread ran as one of an SMT pair or alone). For a
start it would suffice to assume that a core executing 2 threads
spends half its cycles on each one (i.e. as if it were 2 cores
at half the clock rate, ignoring the actual combined throughput
of 130% that I see on my Ryzen-based system). The fraction of
time during which two threads were assigned to that core needs
to be known, too. Or have one counter that gets updated while
only one thread runs on a core, and 2 more for the two "halves"
of that core while it executes 2 threads.
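Such per-core counters might look like this (names invented for
the sketch):

    #include <sys/types.h>

    enum cyc_ctx { CYC_USER, CYC_SYS, CYC_INTR, CYC_NCTX };

    struct core_cycles {
            /* Core executing a single thread, full nominal speed. */
            uint64_t cc_solo[CYC_NCTX];
            /*
             * Two threads: one counter set per "half", each half
             * assumed to deliver 50% of the core's cycles.
             */
            uint64_t cc_pair[2][CYC_NCTX];
    };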

I am not proposing to modify the cycle accounting in this
direction right now. But any change that is made in this area
could as well take SMT into consideration and at least make it
possible to move in that direction later.

Regards, Stefan