[Bug 221029] AMD Ryzen: strange compilation failures using poudriere or plain buildkernel/buildworld

bugzilla-noreply at freebsd.org bugzilla-noreply at freebsd.org
Sat Aug 19 06:10:27 UTC 2017


--- Comment #73 from Don Lewis <truckman at FreeBSD.org> ---
When I've examined a ghc core file, gdb thought that rip was pointing at code
and allowed me to disassemble it.  I didn't see anything that looked like it
could cause SIGBUS.

I don't think I've ever had a successful ghc build on my Ryzen machine.

For a while I've been suspicious that the problems are triggered by the
migration of threads between CPU cores.  One thing that made me suspect this is
that most of the early tests that people did like running games and synthetic
tests like prime95 would create a fixed number of threads that probably always
stayed running on the same cores.  Parallel software builds are a lot more
chaotic with lots of processes being created and destroyed, with a lot of
thread migration being necessary to keep the load on all cores roughly equal.

For the last week or so I've been running experiments where I start multiple
parallel buildworlds at the same time but with different MAKEOBJDIRPREFIX
values and different cpuset cpu masks.  I was looking for any evidence that
migration between different threads on the same core, or between different
cores in the same CCX, or migrating between different CCXs would trigger build
failures.  The interesting result is that I observed no failures at all!  One
possibility is that my test script was buggy and was missing build failures. 
Another is that the value that I used for "make -j" vs. the number of logical
cpus in the cpuset was not resulting in much migration.  A third is that the
use of cpuset was inhibiting the ability of the scheduler to migrate
threads to balance the load across all cores.
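
As a rough illustration of the experiment described above (this is not the
actual test script), the following Python sketch builds one command line per
cpuset.  The CPU numbering is an assumption: it supposes an 8-core/16-thread
part where logical CPUs 0 and 1 share a core, CPUs 0-7 sit in one CCX, and
CPUs 8-15 sit in the other.  The object directory paths and the "-j" values
are likewise hypothetical.

```python
import shlex

# Hypothetical cpu groupings; the real topology should be checked on the
# machine under test before trusting these numbers.
EXPERIMENTS = {
    "smt_siblings": [0, 1],   # assumed: two threads of the same core
    "same_ccx": [0, 2],       # assumed: two cores in the same CCX
    "cross_ccx": [0, 8],      # assumed: one core in each CCX
}


def buildworld_command(cpus, objdir, jobs):
    """Return a shell command line running buildworld pinned to `cpus`."""
    cpu_list = ",".join(str(c) for c in cpus)
    argv = [
        "env", f"MAKEOBJDIRPREFIX={objdir}",  # separate object tree per run
        "cpuset", "-l", cpu_list,             # restrict the build to these cpus
        "make", f"-j{jobs}", "buildworld",
    ]
    return " ".join(shlex.quote(a) for a in argv)


def plan(experiments, jobs_per_cpu=2):
    """Build one command per experiment; the -j ratio is an arbitrary choice."""
    return [
        buildworld_command(cpus, f"/usr/obj/{name}", jobs_per_cpu * len(cpus))
        for name, cpus in experiments.items()
    ]


if __name__ == "__main__":
    for line in plan(EXPERIMENTS):
        print(line)
```

Each printed line would be started in the source tree at the same time as the
others.  Varying jobs_per_cpu is one way to probe whether the "make -j" value
relative to the cpuset size affects how much migration takes place.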

I started looking at the scheduler code to see if I could understand what might
be going on, but the code is pretty confusing.  I did stumble across some nice
sysctl tuning knobs that looked like they might be interesting to experiment
with.  The first is kern.sched.balance "Enables the long-term load balancer". 
This is enabled by default and periodically moves threads from the most loaded
CPU to the least loaded CPU.  I disabled this.  The next knob is
kern.sched.steal_idle "Attempts to steal work from other cores before idling". 
I disabled this as well. The last is kern.sched.affinity, "Number of hz ticks
to keep thread affinity for".  I think if the previous two knobs are turned
off, this will only come into play if a thread has been sleeping more than the
specified time.  If so, it probably gets scheduled on the CPU with the least
load when the thread wakes up.  The default value is 1.  I cranked it up to
1000, which should be long enough for any of its state in cache to have been
fully flushed.
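
For reference, the three settings above can be written out as sysctl
commands.  The small Python sketch below only generates the command strings
and converts the affinity value from ticks to seconds.  The hz value of 1000
is an assumption (it can be read from kern.hz on the machine in question),
and the helper names are made up for illustration.

```python
# Values taken from the description above: both balancers disabled and
# the affinity window raised from its default of 1 to 1000 ticks.
SETTINGS = {
    "kern.sched.balance": 0,
    "kern.sched.steal_idle": 0,
    "kern.sched.affinity": 1000,
}

ASSUMED_HZ = 1000  # assumption; check kern.hz on the actual system


def sysctl_commands(settings):
    """Return one 'sysctl name=value' command string per setting."""
    return [f"sysctl {name}={value}" for name, value in settings.items()]


def affinity_seconds(ticks, hz=ASSUMED_HZ):
    """Convert an affinity window in hz ticks to seconds."""
    return ticks / hz


if __name__ == "__main__":
    for cmd in sysctl_commands(SETTINGS):
        print(cmd)
    window = affinity_seconds(SETTINGS["kern.sched.affinity"])
    print(f"affinity window: {window} s at hz={ASSUMED_HZ}")
```

Under that hz assumption, the default affinity of 1 corresponds to about one
millisecond, and 1000 corresponds to about one second.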

After using this big hammer, I started a poudriere run to build my set of ~1700
ports.  The result was interesting.  The only two failures were the typical ghc
SIGBUS failure, and chromium failed to build with the rename problem.  CPU
utilization wasn't great due to some cores running out of work to do, so I
typically saw 5%-10% idle times during the poudriere run.

I think that the affinity knob is probably the key one here.  I'll try cranking
it down to something a bit lower and re-enabling the balancing algorithms to
see what happens.

You are receiving this mail because:
You are the assignee for the bug.
