FreeBSD on Ryzen

Bruce Evans brde at optusnet.com.au
Fri Mar 24 03:08:24 UTC 2017


On Thu, 23 Mar 2017, Stefan Esser wrote:

> Am 22.03.17 um 21:30 schrieb Don Lewis:
>> I put together a Ryzen 1700X machine over the weekend and installed the
>> 12.0-CURRENT r315413 snapshot on it a couple of days ago.  The RAM is
>> DDR4 2400.
>>
>> First impression is that it's pretty zippy.  Compared to my previous
>> fastest machine:
>>   CPU: AMD FX-8320E Eight-Core Processor (3210.84-MHz K8-class CPU)
>> make -j8 buildworld using tmpfs is a bit more than 2x faster.  Since the
>> Ryzen has SMT, its eight cores look like 16 CPUs to FreeBSD, and I get
>> almost a 2.6x speedup with -j16 as compared to my old machine.
>>
>> I do see that the reported total CPU time increases quite a bit at -j16
>> (~19900u) as compared to -j8 (~13600u) so it is running into some
>> hardware bottlenecks that are slowing down instruction execution.  It
>> could be the resources shared by both SMT threads that share each core,
>
> It is the resources shared by the cores. Under full CPU load, SMT makes
> a 3.3 GHz 8 core CPU "simulate" a ~2 GHz 16 core CPU.

This seems to be normal for all x86.  See below.  However, Don measured
mixed methods by only using -j8.  This gives less SMT use than -j16 or
-j32, but still a lot, since make doesn't really understand scheduling
and doesn't limit the number of active threads to keep them on unshared
cores.  On a 4x2 system, I test this manually with cpuset -l 0,2,4,6.
I also use SCHED_4BSD, which needs more manual scheduling to reduce
sharing of cores.
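
The same restriction can be applied programmatically.  A minimal sketch
using cpuset_setaffinity(2), equivalent in effect to running a command
under cpuset -l 0,2,4,6 (untested; the assumption that SMT siblings are
numbered N and N^1 holds on the systems discussed here):

#include <sys/param.h>
#include <sys/cpuset.h>
#include <err.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	cpuset_t set;

	if (argc < 2)
		errx(1, "usage: %s command [args ...]", argv[0]);
	/* One SMT thread per core: CPUs 0, 2, 4 and 6 on a 4x2 system. */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	CPU_SET(2, &set);
	CPU_SET(4, &set);
	CPU_SET(6, &set);
	/* id -1 means the current process; the mask survives exec. */
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
	    sizeof(set), &set) == -1)
		err(1, "cpuset_setaffinity");
	execvp(argv[1], argv + 1);
	err(1, "execvp");
}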

About 10 years ago, large -j for makeworld seemed to cost a lot due to
lock contention, but I didn't understand the overheads from core sharing
at the time, and perhaps schedulers didn't either (SCHED_4BSD still
doesn't, except for a small hack that I use), so the extra user time may
have always been due to SMT.  Now large -j doesn't cost much for
makeworld.  I still try it (up to about 16 times as many threads as
cores), but it just gives small pessimizations when it is larger than
the number needed to keep all cores usually active.

My slowest makeworld times on Haswell (4.08GHz, no turbo, 4x2) are 24
times faster than the above (572u instead of 13600u), but this is due
to my world being ~5.2 and current worlds having clang and other
slowness.  The time scales almost perfectly inversely with the CPU
clock, and by a factor of about your 3.3/2 with SMT.  E.g., on Haswell
with no SMT (4x1), my makeworld user time reduces to 354u, which
corresponds to a factor of 3.2/2 (572/354 ~= 1.6).  makeworld has a
non-parallelized install section near the end (about 10% of the real
time of 140 seconds on Haswell), so this factor of 3.2/2 is smaller
than the pure SMT factor (I would have expected it to be even smaller).

The SMT scaling for makeworld is similar on Sandybridge.  The SMT
scaling for buildworld is similar on almost all of the FreeBSD
cluster's Xeons.  However, not very long ago the FreeBSD cluster had
more trailing-edge Xeons, which had an SMT scaling factor of 4/2.

The scaling for a simple benchmark that uses only integer resources
for a countdown loop has an SMT scaling factor of only about 4/3.  I
think the scaling has precise factors like 4 and 3 since maximum
throughput is 3 or 4 instructions/cycle, and when this throughput is
achieved it uses all of a critical resource, leaving none to spare for
SMT.
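
The loop body in that benchmark is essentially just a decrement and a
branch.  Something like the following sketch (not my exact benchmark;
the clock_gettime() timing is an assumption):

#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec t0, t1;
	double secs;
	long n;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (n = 1000000000L; n > 0; n--)
		__asm __volatile("");	/* keep the loop; n stays in a register */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.3f s, %.0f iterations/s\n", secs, 1e9 / secs);
	return (0);
}

Run one copy pinned to a single SMT thread (cpuset -l 0), then two
copies pinned to the two threads of one core (cpuset -l 0,1), and
compare the aggregate iterations/s to see the SMT factor.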

> The throughput is (in 1st order) proportional to cores * CPU clock, and
> comes out as
>
> 	8 * 3.3 = 26.4  vs.  16 * ~2 = ~32  (estimated)

This must be very CPU-dependent.  x86 CPUs are still optimized for !SMT
and/or low power, so they don't try for more than 3 or 4
instructions/cycle, since even that is rarely reached by a single
thread, and thus they don't have many spare resources to use for SMT.

> I'm positively surprised by the observed gain of +30% due to SMT. This
> seems to match the reported user times:
>
> 13,600 /  8 = 1,700 seconds user time per physical core (on average)
> 19,900 / 16 = 1,244 seconds per virtual (SMT) core

This is probably from the mixed methods.  I'm surprised that -j8
doesn't keep closer to 16 than to 8 CPUs runnable most of the time, so
that it would use SMT just as much as -j16 most of the time.

> vs. an estimate of the throughput with a CPU with SMT but without any
> gain in throughput:
>
> 27,200 / 16 = 1,700 seconds per virtual core with ineffective SMT
>
> (i.e. assuming SMT that does not increase effective IPC, resulting
> in identical real time compared to the non-SMT case)
>
> This result seems to match the increased performance when going from
> -j 8 to -j 16:
>
> 27,200 / 19,900 = 1.37  ~  2.6 / 2.0
>
>> or it could be cache or memory bandwidth related.  The Ryzen topology is
>> a bit complicated. There are two groups of four cores, where each group
>> of four cores shares half of the L3 cache, with a slowish interconnect
>> bus between the groups.  This probably causes some NUMA-like issues.  I
>> wonder if the ULE scheduler could be tweaked to handle this better.
>
> I've been wondering whether it is possible to teach the scheduler about
> above mentioned effect, i.e. by distinguishing a SMT core that executes
> only 1 runnable thread from one that executes 2. The latter one should
> be assumed to run at an estimated 60% clock (which makes both threads
> proceed at 120% of the non-SMT speed).
>
> OTOH, the lower "effective clock rate" should be irrelevant under high
> load (when all cores are executing 2 threads), or under low load, when
> some cores are idle (assuming that the scheduler prefers to assign only
> 1 thread to each core until there are more runnable threads than cores).
> 
> If you assume that user time accounting is a raw measure of instructions
> executed, then assuming a reduced clock rate would lead to "fairer"
> results.

I thought that schedulers didn't understand SMT at all.  SCHED_4BSD
certainly doesn't.  I use the following hack to reduce sharing in it.
It is almost useless for the reasons that you state:
- low load: makes little difference.  A random choice of CPU from many
   free CPUs has a low chance of contending with an active CPU.
- high load: makes little difference.  There are no spare CPUs, and a
   random choice is less bad than a smart choice since it is hard to do
   better but easy to do worse by making perfectly pessimal choices
   and sticking with them.

X Index: sched_4bsd.c
X ===================================================================
X --- sched_4bsd.c	(revision 315658)
X +++ sched_4bsd.c	(working copy)
X @@ -1237,6 +1261,11 @@
X  }
X  #endif
X 
X +#ifdef SMP
X +static int evenhack;
X +SYSCTL_INT(_kern_sched, OID_AUTO, evenhack, CTLFLAG_RW, &evenhack, 0, "");
X +#endif
X +
X  void
X  sched_add(struct thread *td, int flags)
X  #ifdef SMP
X @@ -1307,6 +1336,23 @@
X  		    td);
X  		cpu = NOCPU;
X  		ts->ts_runq = &runq;
X +if (evenhack == mp_maxid) {
X +		int id;
X +
X +		cpuid = PCPU_GET(cpuid);
X +		if (CPU_ISSET(cpuid ^ 1, &idle_cpus_mask))
X +			goto found;
X +		for (id = 0; id <= mp_maxid; id += 2) {
X +			if (CPU_ISSET(id, &idle_cpus_mask) &&
X +			    CPU_ISSET(id ^ 1, &idle_cpus_mask)) {
X +				cpu = id;
X +				ts->ts_runq = &runq_pcpu[cpu];
X +				single_cpu = 1;
X +				break;
X +			}
X +		}
X +found: ;
X +}
X  	}
X 
X  	if ((td->td_flags & TDF_NOLOAD) == 0)
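
The hack is off by default: it only engages when the sysctl is set to
the system's mp_maxid (e.g., sysctl kern.sched.evenhack=7 on a 4x2
system).  When enabled, it leaves the thread on the global run queue if
the current CPU's SMT sibling (cpuid ^ 1) is idle, and otherwise pins
the thread to the first CPU whose whole core (both siblings) is idle.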

For makeworld, this seems to give improvements in the range of 0.1-0.5%,
but it is hard to be sure since the variance in the real time is about
3% (and that is with some temperature control and closer to 100 than
10 other details to keep the test environment constant).  System time
is ~80 seconds on Haswell, and it is hard to get excited about
improvements of even 1% in it (0.8 seconds divided by 8 cores = 0.1
seconds in real time).  Using SCHED_ULE instead of SCHED_4BSD gives
improvements in the +-5% range (worse on older CPUs).

Bruce

