observations on Ryzen 5xxx (Zen 3) processors

From: Andriy Gapon <avg_at_FreeBSD.org>
Date: Wed, 22 Dec 2021 12:42:48 UTC

There have been some reports of strange / unexpected behavior with Ryzen 5xxx
processors.  I think I have seen the 5950X, 5900X and 5800X mentioned; I am not
sure about others.

Since I have a 5800X myself, I looked into a couple of issues that have
straightforward demonstrators.  I would like to share my findings and
observations on those issues.

Issue 1.  High wake-up latency for CPU idle states.

This seems to be related to the so-called CC6 idle state.
The official information on it is very sparse.
The state is not explicitly exposed to the OS, at least not through the ACPI
interfaces that FreeBSD currently supports.

In my tests I see that if all logical processors enter an idle state then an 
external interrupt can be delayed by 500+ us.  Specifically, I observed this 
with an MSI-X interrupt from a discrete network chip.  Interrupts from internal 
components seem to be affected as well, but to a lesser degree.

The deep state in question can be entered regardless of whether C2 (via I/O) is
enabled; C1 (via hlt) is sufficient.  In fact, with machdep.idle=hlt it works
the same.
The state is not entered if at least one logical CPU is not idle.
The state is not entered if machdep.idle=mwait is used.  Apparently, the
processors do not attempt to automatically enter idle modes as deep with mwait
as they do with hlt.
Finally, the state is not entered if the zenstates.py utility is used to disable
the C6 / CC6 state via a publicly undocumented MSR.
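
To compare the idle methods mentioned above, something like the following
could be used to check and switch machdep.idle.  This is only a sketch: the
helper names are made up for illustration, and it assumes that
machdep.idle_available lists the method names separated by commas and/or
spaces.  Changing the sysctl requires root.

```python
import re
import subprocess


def parse_idle_available(text):
    """Split the machdep.idle_available string into method names.

    Assumption: names are separated by commas and/or whitespace.
    """
    return [name for name in re.split(r"[,\s]+", text.strip()) if name]


def sysctl_get(name):
    """Read one sysctl value as a string using sysctl(8)."""
    result = subprocess.run(["sysctl", "-n", name], check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def set_idle_method(method):
    """Switch the idle method (needs root) and return the previous one."""
    available = parse_idle_available(sysctl_get("machdep.idle_available"))
    if method not in available:
        raise ValueError(f"{method!r} is not one of {available}")
    previous = sysctl_get("machdep.idle")
    subprocess.run(["sysctl", f"machdep.idle={method}"], check=True)
    return previous
```

For example, set_idle_method("mwait") before a latency test and restoring the
returned value afterwards would allow comparing mwait against the default.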

For me personally that state does not cause any annoyances, but anyone who
experiences problems related to "stuttering", "jitter" or latency might want to
look into this.

Issue 2.  Uneven performance of CPU intensive tasks, especially with SCHED_ULE, 
when SMT is enabled.

I found out that, at least on my hardware, all even-numbered logical CPUs can
perform much better than odd-numbered logical CPUs.  It seems that hardware
threads within a core are not equal.  Maybe this is related to the ability to
use boosted frequencies, but maybe it is something else, I am not sure.
From a brief look at the ULE code it looks like the selection of a hw thread
within a core is intentionally random when all other things are equal.
I suspect that the hardware + firmware may actually describe that performance 
disparity via ACPI CPPC (_CPC object, etc), but right now we do not support 
querying that or making use of it.


It would be interesting to see if other owners of similar processors can confirm
or provide counter-examples to my observations.

Simple tests for issue 1:
- ping a host attached to the same switch (so, with very low expected latency)
- ping 127.0.0.1
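
The ping tests above could be scripted roughly as follows, so that the
worst-case round-trip time is easy to spot.  This is a minimal sketch, not a
polished tool: the function names are hypothetical, and it assumes ping(8)
prints per-packet lines containing "time=<value> ms".

```python
import re
import statistics
import subprocess

# Matches the per-packet "time=0.045 ms" field in ping(8) output.
RTT_RE = re.compile(r"time[=<]([0-9.]+) ?ms")


def parse_rtts(ping_output):
    """Extract per-packet round-trip times (in ms) from ping output."""
    return [float(m.group(1)) for m in RTT_RE.finditer(ping_output)]


def summarize(rtts):
    """Return count, min, median and max; max exposes delayed wake-ups."""
    return {
        "n": len(rtts),
        "min": min(rtts),
        "median": statistics.median(rtts),
        "max": max(rtts),
    }


def ping_summary(host, count=20):
    """Run ping -c <count> <host> and summarize the observed latencies."""
    result = subprocess.run(["ping", "-c", str(count), host],
                            capture_output=True, text=True)
    return summarize(parse_rtts(result.stdout))
```

Calling ping_summary("127.0.0.1") on an otherwise idle machine, and then again
while one logical CPU is kept busy, should show whether the maximum changes.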

For issue 2: take some CPU intensive single-threaded task and bind it (with 
cpuset -l) to different logical CPUs.  Multiple such tasks can be run 
concurrently on different logical CPUs.
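
The test for issue 2 could be automated along these lines.  Again this is only
a sketch under assumptions: the workload (a Python busy loop in a child
process) and all helper names are placeholders of mine, and any
single-threaded CPU-bound command can be substituted.

```python
import subprocess
import sys
import time

# Hypothetical fixed workload: a CPU-bound loop run in a child process.
WORKLOAD = [sys.executable, "-c", "sum(i * i for i in range(20_000_000))"]


def pinned_command(cpu, command):
    """Prefix a command with cpuset(1) so it runs on one logical CPU only."""
    return ["cpuset", "-l", str(cpu)] + list(command)


def time_on_cpu(cpu, command=WORKLOAD):
    """Wall-clock seconds taken by the workload pinned to the given CPU."""
    start = time.perf_counter()
    subprocess.run(pinned_command(cpu, command), check=True)
    return time.perf_counter() - start


def even_odd_means(timings):
    """Average per-CPU timings separately for even and odd CPU numbers."""
    groups = {"even": [], "odd": []}
    for cpu, seconds in timings.items():
        groups["odd" if cpu % 2 else "even"].append(seconds)
    return {key: sum(vals) / len(vals) for key, vals in groups.items() if vals}
```

Collecting timings as {cpu: time_on_cpu(cpu) for cpu in range(ncpus)} and
passing the result to even_odd_means() gives one number per group; a large gap
between "even" and "odd" would match what I see here.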

References:
- https://forums.freebsd.org/threads/variable-ping-latency-on-ryzen-setup.82791/
- https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=256594
- https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254040
- https://github.com/r4m0n/ZenStates-Linux
- https://github.com/meowthink/ZenStates-FreeBSD --  has a bug
   - https://github.com/avg-I/ZenStates-FreeBSD -- has a fix
- https://www.kernel.org/doc/html/latest/admin-guide/acpi/cppc_sysfs.html
- https://static.linaro.org/connect/lvc21/presentations/lvc21-219.pdf
- https://uefi.org/specs/ACPI/6.4/14_Platform_Communications_Channel/Platform_Comm_Channel.html

-- 
Andriy Gapon