FreeBSD on Ryzen

Don Lewis truckman at FreeBSD.org
Wed Mar 22 22:37:02 UTC 2017


On 22 Mar, Freddie Cash wrote:
> On Wed, Mar 22, 2017 at 1:30 PM, Don Lewis <truckman at freebsd.org> wrote:
> 
>> I put together a Ryzen 1700X machine over the weekend and installed the
>> 12.0-CURRENT r315413 snapshot on it a couple of days ago.  The RAM is
>> DDR4 2400.
>>
>> First impression is that it's pretty zippy.  Compared to my previous
>> fastest machine:
>>   CPU: AMD FX-8320E Eight-Core Processor (3210.84-MHz K8-class CPU)
>> make -j8 buildworld using tmpfs is a bit more than 2x faster.  Since the
>> Ryzen has SMT, its eight cores look like 16 CPUs to FreeBSD, and I get
>> almost a 2.6x speedup with -j16 as compared to my old machine.
>>
>> I do see that the reported total CPU time increases quite a bit at -j16
>> (~19900u) as compared to -j8 (~13600u), so it is running into some
>> hardware bottlenecks that are slowing down instruction execution.  It
>> could be the resources shared by both SMT threads that share each core,
>> or it could be cache or memory bandwidth related.  The Ryzen topology is
>> a bit complicated. There are two groups of four cores, where each group
>> of four cores shares half of the L3 cache, with a slowish interconnect
>> bus between the groups.  This probably causes some NUMA-like issues.  I
>> wonder if the ULE scheduler could be tweaked to handle this better.
>>
> 
> The interconnect, aka Infinity Fabric, runs at the speed of the memory
> controller, so if you put faster RAM into the system, the fabric runs
> faster, and inter-CCX latency should drop to match.

Unfortunately ECC RAM seems to max out at DDR4 2400, so I'm already at
the end of that road.
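
For what it's worth, here's the back-of-the-envelope arithmetic behind
the -j8 vs. -j16 numbers quoted above (the inputs are just my rough
figures, so treat it as a sketch, not a measurement):

#include <stdio.h>

int
main(void)
{
    /* Approximate buildworld user CPU time (seconds) on the 1700X. */
    double cpu_j8 = 13600.0, cpu_j16 = 19900.0;
    /* Approximate wall-clock speedups vs. the FX-8320E box. */
    double speedup_j8 = 2.0, speedup_j16 = 2.6;

    /* Extra CPU time burned when both SMT threads of a core are busy. */
    printf("CPU time inflation at -j16: %.2fx\n", cpu_j16 / cpu_j8);
    /* Wall-clock gain from doubling the job count on the same box. */
    printf("-j16 vs -j8 wall-clock gain: %.2fx\n",
        speedup_j16 / speedup_j8);
    return (0);
}

That works out to roughly a 1.3x wall-clock gain for a 1.46x increase in
CPU time, i.e. the second SMT thread on each core adds something like 30%
more throughput while each logical CPU runs noticeably slower, which fits
contention for the shared per-core resources (and possibly cache or
memory bandwidth).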

> There's 2 MB of L3 cache shared between every two cores, but any core can
> access data in the L3 cache of any other core.  Latency for those requests
> depends on whether it's within the same CCX (4-core cluster), or in the
> other CCX (going across the Infinity Fabric).

I missed the extra level of L3 segmentation when I first read this:
<http://www.overclock.net/t/1624566/theories-on-why-the-smt-hurts-the-performance-of-gaming-in-ryzen-and-some-recommendations-for-the-future>
My "slowish" remark was about the speed of Infinity Fabric vs. QPI as
mentioned in this article.

> There's a lot of finicky timing issues with L3 cache accesses, and with
> thread migration (in-CCX vs across the fabric).
> 
> This is a whole other level of NUMA fun.  And it'll get even more fun when
> the server version ships where you have 4 CCXes in a single CPU, with
> multiple sockets on a motherboard, and Infinity Fabric joining everything
> together.  :)

Yeah, given that FreeBSD is pretty weak in terms of NUMA, I wasn't
getting all that excited by the upcoming server stuff.
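
For anyone who wants to poke at this, ULE exports its idea of the CPU
topology through the kern.sched.topology_spec sysctl (sysctl(8) on the
command line prints the same thing).  A minimal C sketch to dump it:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    char *buf;
    size_t len;

    /* First call with a NULL buffer just reports the needed size. */
    if (sysctlbyname("kern.sched.topology_spec", NULL, &len, NULL, 0) == -1)
        err(1, "sysctlbyname");
    if ((buf = malloc(len)) == NULL)
        err(1, "malloc");
    /* Fetch the XML description of the CPU groups the scheduler sees. */
    if (sysctlbyname("kern.sched.topology_spec", buf, &len, NULL, 0) == -1)
        err(1, "sysctlbyname");
    printf("%s\n", buf);
    free(buf);
    return (0);
}

Whether the groups it reports on Ryzen actually line up with the two
CCXes is something I'd want to verify before speculating further about
scheduler tweaks.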

> I feel sorry for the scheduler devs who get to figure all this out.  :D
>  Supposedly, the Linux folks have this mostly figured out in kernel 4.10,
> but I'll wait for the benchmarks to believe it.  There's a bunch up on
> Phoronix ... but, well, it's Phoronix.  :)

I saw it mentioned that the Linux change was to fix a Bulldozer
optimization that de-optimizes performance on Ryzen.

All in all, I'm pretty impressed with the performance improvement,
especially once I noticed the relatively small clock speed difference
between the 1700X (3394 MHz) and my FX-8320E (3211 MHz).  Both have a
95W TDP.
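
Putting a number on "relatively small" (just the reported clock speeds,
nothing rigorous):

#include <stdio.h>

int
main(void)
{
    /* Reported clock speeds in MHz. */
    double ryzen_1700x = 3394.0, fx_8320e = 3211.0;

    /*
     * Clock alone is a bit under a 6% difference, so nearly all of the
     * >2x buildworld gain has to come from IPC and core/SMT scaling.
     */
    printf("clock ratio: %.3fx\n", ryzen_1700x / fx_8320e);
    return (0);
}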

I'm not too thrilled with the 20C offset that AMD adds to Tctl on the
1700X and 1800X (but not the 1700).  It makes the CPU fan sound like a
vacuum cleaner even when the CPU is idle because the motherboard thinks
the CPU is running in the mid-50 degree range.


