[Bug 219399] System panics after several hours of 14-threads-compilation orgies using poudriere on AMD Ryzen...

bugzilla-noreply at freebsd.org bugzilla-noreply at freebsd.org
Sun Jul 23 20:52:55 UTC 2017


https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399

--- Comment #89 from Don Lewis <truckman at FreeBSD.org> ---
Created attachment 184641
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=184641&action=edit
patch to move amd64 shared page to a lower address to avoid Ryzen problem with
executing code near user address upper limit

I've been doing a number of experiments with openjdk7 builds to try to better
characterize the Ryzen problem.

First I did a number of openjdk7 builds using cpuset to pin the build to
individual cores.  Using cpuset -l 0 to pin the build to the first thread on
core 0 would consistently cause a silent reboot on the first or second try. 
Pinning the build to any of the other cores allowed me to successfully build
openjdk7.  I ran four builds on each of the other cores to make sure that I
wasn't just getting a successful build by chance.  Surprisingly, pinning the
build to the second thread on core 0 was also successful.  In any case, the
results were consistent with my earlier tests where I disabled SMT and also all
but two cores in the BIOS, since those tests always used the first thread on
core 0.

I tried building openjdk7 on all cores except the first thread of core 0 by
using cpuset -l 1-15 and was also successful.
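
The per-CPU runs described above could be scripted roughly as follows.  This is
only a sketch and not the procedure actually used: BUILD_CMD is a hypothetical
placeholder for whatever starts the openjdk7 build, the helper names are made
up for illustration, and the CPU numbering (0-15) is inferred from the
cpuset -l 1-15 range mentioned above.  By default it only prints the commands
it would run.

```sh
#!/bin/sh
# Sketch: pin a build to one logical CPU at a time with cpuset -l.
# BUILD_CMD is a placeholder (assumption), not the command used in the tests.
BUILD_CMD=${BUILD_CMD:-"echo replace-with-build-command"}
RUNS=${RUNS:-4}          # four builds per CPU, as in the tests above
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands, 0 = execute them

pin_cmd() {
    # $1 = CPU list accepted by cpuset -l, e.g. 0 or 1-15
    echo "cpuset -l $1 $BUILD_CMD"
}

run_pinned() {
    cmd=$(pin_cmd "$1")
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"
    else
        sh -c "$cmd"
    fi
}

sweep() {
    # $1 = first logical CPU, $2 = last logical CPU
    cpu=$1
    while [ "$cpu" -le "$2" ]; do
        run=1
        while [ "$run" -le "$RUNS" ]; do
            run_pinned "$cpu" || echo "cpu $cpu run $run: build failed"
            run=$((run + 1))
        done
        cpu=$((cpu + 1))
    done
}
```

For example, "sweep 1 15" would cover every logical CPU except the first thread
of core 0.  A silent reboot obviously leaves no failure message, so the last
command printed before the reboot is the only record of which CPU was in use.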

Based on that positive result, I tried building my default set of ~1600 ports
with cpuset -l 1-15.  A little over two hours into the build, the llvm40 build
failed with the assertion:
  _arena.c:821: Failed assertion: "nstime_compare(&decay->epoch, &time) <= 0"
causing the ports that depend on it to be skipped, but everything else built
successfully.  When I restarted poudriere, the llvm40 build succeeded, but the
system hung after about an hour while running java as part of the openjdk7
build.

Next I tried building with cpuset -l 2-15.  The only problem that I ran into is
that the gcc build failed with SIGBUS, causing its dependencies to be skipped. 
When I restarted poudriere, gcc5 and the remaining ports built successfully.

I wanted to try to eliminate the possibility of a subtle defect in core 0 as a
potential cause of the problem, so I tried adding
 hint.lapic.0.disabled=1
 hint.lapic.1.disabled=1
to /boot/loader.conf, but FreeBSD does not allow the BSP to be disabled B-(

The other thing that is unique about core 0 on my machine is that it looks like
all of the external interrupts (but not interprocessor interrupts) go there. 
The biggest source of those seemed to be hpet, but I couldn't figure out how to
disable that (other than maybe disabling ACPI totally).  When I tried
hint.hpet.0.clock=0, all of the CPUs got assigned interrupts from another
timer.

The next thing I tried was inspired by the DragonFly patch.  At least some
thread implementations use signals to communicate between threads.  I'm not
familiar with OpenJDK, but it is possible that it is such an implementation, so
it might be a heavy signal user and spend a lot of cycles in the signal
trampoline code.  Our signal trampoline code is in a different location than
the one DragonFly uses, but it is still close to (in the top page of) the top
of user memory.  Even though I got the impression that the DragonFly patch
addresses an issue with SMT, it does involve an interaction between interrupts
and execution of code near the top of user memory.

As an experiment, I patched the kernel to move the location of the shared page
lower by PAGE_SIZE.  I'm not sure if it is necessary, but the patch leaves a
page at the old location that has the same rwx permissions and is zero filled.
I don't know if the bug is triggered by executing code close to the upper
address boundary or close to a permission boundary.  The preliminary results
so far are very promising.
With the patch applied, I am able to successfully build openjdk7 either
unpinned or pinned to the first thread of core 0.
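
To make the address change concrete, here is a sketch of the arithmetic only.
The numbers are assumptions, not values taken from the attached patch: it
assumes the amd64 user address limit is 0x800000000000, that pages are 4 KiB,
and that the shared page originally occupies the last page below that limit.
The shared_page helper is hypothetical.

```sh
#!/bin/sh
# Sketch of the address arithmetic; constants are assumptions (see above).
VM_MAXUSER_ADDRESS=$((0x800000000000))  # assumed amd64 user address limit
PAGE_SIZE=4096                          # assumed 4 KiB page size

shared_page() {
    # $1 = how many pages below the user address limit the page starts
    printf '0x%x\n' $((VM_MAXUSER_ADDRESS - $1 * PAGE_SIZE))
}

echo "old shared page: $(shared_page 1)"   # 0x7ffffffff000
echo "new shared page: $(shared_page 2)"   # 0x7fffffffe000
```

Under those assumptions the signal trampoline ends up one full page further
away from the upper limit of user memory, with the zero-filled page sitting
between it and the limit.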

I just kicked off an unpinned ~1600 port poudriere run.  I should have results
of that late today.

The patch is attached.
