FreeBSD on Ryzen

Thu Mar 23 23:42:08 UTC 2017

On 23 Mar, Stefan Esser wrote:
> Am 22.03.17 um 21:30 schrieb Don Lewis:
>> I put together a Ryzen 1700X machine over the weekend and installed the
>> 12.0-CURRENT r315413 snapshot on it a couple of days ago.  The RAM is
>> DDR4 2400.
>> 
>> First impression is that it's pretty zippy.  Compared to my previous
>> fastest machine:
>>   CPU: AMD FX-8320E Eight-Core Processor (3210.84-MHz K8-class CPU)
>> make -j8 buildworld using tmpfs is a bit more than 2x faster.  Since the
>> Ryzen has SMT, its eight cores look like 16 CPUs to FreeBSD, and I get
>> almost a 2.6x speedup with -j16 as compared to my old machine.
>> 
>> I do see that the reported total CPU time increases quite a bit at -j16
>> (~19900u) as compared to -j8 (~13600u) so it is running into some
>> hardware bottlenecks that are slowing down instruction execution.  It
>> could be the resources shared by both SMT threads that share each core,
> 
> It is the resources shared by the cores. Under full CPU load, SMT makes
> a 3.3 GHz 8 core CPU "simulate" a ~2 GHz 16 core CPU.
> 
> The throughput is (in 1st order) proportional to cores * CPU clock, and
> comes out as
> 
> 	8 * 3.3 = 26.4  vs.  16 * ~2 = ~32  (estimated)
> 
> I'm positively surprised by the observed gain of +30% due to SMT. This

Don't forget that the -j8 case is also paying some penalty for SMT.  We
don't currently recognize that Ryzen uses SMT and we think that there
are 16 independent CPUs.  In a test that I mentioned earlier today,
disabling SMT in the BIOS, so that the chip only looks like it has 8
cores, improved the performance in the -j8 case by 5%.

> seems to match the reported user times:
> 
> 13,600 /  8 = 1,700 seconds user time per physical core (on average)
> 19,900 / 16 = 1,244 seconds per virtual (SMT) core
> 
> vs. an estimate of the throughput with a CPU with SMT but without any
> gain in throughput:
> 
> 27,200 / 16 = 1,700 seconds per virtual core with ineffective SMT
> 
> (i.e. assuming SMT that does not increase effective IPC, resulting
> in identical real time compared to the non-SMT case)
> 
> This result seems to match the increased performance when going from
> -j 8 to -j 16:
> 
> 27,200 / 19,900 = 1.37  ~  2.6 / 2.0 = 1.3
> 
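
The arithmetic quoted above can be re-run with a short Python sketch.
The input figures are the approximate ones from this thread; the
variable and function names are illustrative only, and "ineffective
SMT" is the hypothetical case described above (same wall-clock time as
-j8, but with all 16 logical CPUs accounted as busy).

```python
# Approximate figures reported in the thread.
USER_J8 = 13_600        # user seconds, make -j8 buildworld
USER_J16 = 19_900       # user seconds, make -j16 buildworld
PHYSICAL_CORES = 8
LOGICAL_CPUS = 16       # 8 cores x 2 SMT threads


def per_cpu(user_seconds, cpus):
    """Average user time charged to each CPU that was kept busy."""
    return user_seconds / cpus


per_core_j8 = per_cpu(USER_J8, PHYSICAL_CORES)
per_thread_j16 = per_cpu(USER_J16, LOGICAL_CPUS)

# Hypothetical: SMT gives no extra throughput, so the build takes as
# long as with -j8 while 16 logical CPUs accumulate user time.
ineffective_user = per_core_j8 * LOGICAL_CPUS

# SMT gain estimated two ways: from user time, and from the observed
# speedups over the older machine (2.6x at -j16 vs. 2.0x at -j8).
gain_from_user_time = ineffective_user / USER_J16
gain_from_speedups = 2.6 / 2.0

print(f"per core at -j8:        {per_core_j8:.0f} s")
print(f"per SMT thread at -j16: {per_thread_j16:.0f} s")
print(f"ineffective-SMT total:  {ineffective_user:.0f} s")
print(f"gain (user time):       {gain_from_user_time:.2f}")
print(f"gain (speedups):        {gain_from_speedups:.2f}")
```

Both estimates land in the same neighborhood, roughly a one-third
gain from SMT, which is the point being made in the quoted text.
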
>> or it could be cache or memory bandwidth related.  The Ryzen topology is
>> a bit complicated. There are two groups of four cores, where each group
>> of four cores shares half of the L3 cache, with a slowish interconnect
>> bus between the groups.  This probably causes some NUMA-like issues.  I
>> wonder if the ULE scheduler could be tweaked to handle this better.
> 
> I've been wondering whether it is possible to teach the scheduler about
> the above-mentioned effect, i.e. by distinguishing an SMT core that executes
> only 1 runnable thread from one that executes 2. The latter one should
> be assumed to run at an estimated 60% clock (which makes both threads
> proceed at 120% of the non-SMT speed).
> 
> OTOH, the lower "effective clock rate" should be irrelevant under high
> load (when all cores are executing 2 threads), or under low load, when
> some cores are idle (assuming that the scheduler prefers to assign only
> 1 thread to each core until there are more runnable threads than cores).
> 
> If you assume that user time accounting is a raw measure of instructions
> executed, then assuming a reduced clock rate would lead to "fairer"
> results.

Interesting, though it sounds complicated.
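
To make the "60% effective clock" idea concrete, here is a minimal
Python sketch of the model as described in the quoted text.  It is not
how ULE works today; the function name, the 0.6 factor as a parameter,
and the assumption that the scheduler fills idle cores before doubling
up are all taken from the description above or invented for
illustration.

```python
def relative_throughput(runnable, cores=8, smt_factor=0.6):
    """Aggregate throughput in units of "one thread alone on a core".

    Assumes one thread per core is placed first; only when runnable
    exceeds cores do cores start running two threads, each of which
    then proceeds at smt_factor of the single-thread speed.
    """
    if runnable < 0:
        raise ValueError("runnable must be non-negative")
    if runnable <= cores:
        return float(runnable)
    doubled = min(runnable - cores, cores)   # cores running 2 threads
    single = cores - doubled                 # cores running 1 thread
    return single * 1.0 + doubled * 2 * smt_factor


for n in (4, 8, 12, 16):
    print(n, "runnable threads ->", round(relative_throughput(n), 2))
```

With these assumptions, 16 runnable threads on 8 cores yield about
1.2 times the throughput of 8 threads, which is the "120% of the
non-SMT speed" figure from the quoted text.
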

Under light load, it seems like we would want to assign threads to idle
cores rather than assigning a new thread to a core that already has one
running thread.  If there are no more than four threads on the current
Ryzen chips, we would probably want to run them all on the same CCX to
avoid the Infinity Fabric overhead.  Things get fuzzier with more
than four threads.  Should we try to keep them on the same CCX to avoid
using the Infinity Fabric and pay the SMT overhead, or do the opposite?
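
The trade-off can be laid out with a small Python sketch that compares
the two placement policies for a given thread count on a 2 CCX x 4
core x 2 SMT thread layout.  It only counts how many CCXs are touched
and how many cores end up running two threads; it deliberately assigns
no cost to either, since the relative penalties are exactly what is
unknown here.  All names are hypothetical.

```python
from dataclasses import dataclass

CCX_COUNT = 2
CORES_PER_CCX = 4
THREADS_PER_CORE = 2
TOTAL_CORES = CCX_COUNT * CORES_PER_CCX
TOTAL_CPUS = TOTAL_CORES * THREADS_PER_CORE


@dataclass
class Placement:
    ccx_used: int      # CCXs with at least one thread
    cores_shared: int  # cores running two threads (paying for SMT)


def _check(n):
    if not 0 <= n <= TOTAL_CPUS:
        raise ValueError(f"thread count must be in 0..{TOTAL_CPUS}")


def pack_into_ccx(n):
    """Fill one CCX, SMT siblings included, before using the next."""
    _check(n)
    per_ccx = CORES_PER_CCX * THREADS_PER_CORE
    ccx_used = shared = 0
    remaining = n
    while remaining > 0:
        here = min(remaining, per_ccx)
        ccx_used += 1
        shared += max(0, here - CORES_PER_CCX)
        remaining -= here
    return Placement(ccx_used, shared)


def spread_over_cores(n):
    """One thread per core (CCX 0 first), doubling up only when full."""
    _check(n)
    cores_busy = min(n, TOTAL_CORES)
    ccx_used = -(-cores_busy // CORES_PER_CCX)  # ceiling division
    return Placement(ccx_used, max(0, n - TOTAL_CORES))


for n in (4, 6, 8, 12):
    print(n, pack_into_ccx(n), spread_over_cores(n))
```

For six threads, packing keeps everything on one CCX but leaves two
cores shared, while spreading shares no cores but crosses the
Infinity Fabric.  Which is better depends on costs this sketch does
not model.
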

According to my first edition copy of _The Design and Implementation of
the FreeBSD Operating System_, which covers FreeBSD 5.2, it seems that
in the SMT case, the ULE scheduler prefers to migrate threads to another
CPU in the same processor group.  That would seem to indicate that on
Ryzen it would prefer to keep threads on the same CPU core where they
would compete, rather than spread them out across different cores.  Is
that (still) the case?