[Bug 219399] System panics after several hours of 14-threads-compilation orgies using poudriere on AMD Ryzen...

bugzilla-noreply at freebsd.org bugzilla-noreply at freebsd.org
Thu Jul 20 00:00:56 UTC 2017


https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399

--- Comment #70 from Don Lewis <truckman at FreeBSD.org> ---
I now think that AGESA 1006 actually didn't fix anything for me.  I must have
gotten lucky with that first poudriere run after the BIOS upgrade.  The next
time I ran poudriere, I got a silent reboot after ~3 hours.  The times to
failure just looked too consistent for me, so I looked at the poudriere build
logs to see what was being built at the time of the crash.  One of them was
openjdk7.  One of the ports that got built when I restarted poudriere to build
the remaining ports that failed after the BIOS upgrade was openoffice, which
uses java, so things started making sense.
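
One way to do that kind of log check is sketched below.  This is only an
illustration, not what I actually ran: it assumes the per-port build logs are
plain *.log files in a single directory, and the function name, the directory
argument, and the five-minute window are all hypothetical.

```python
from pathlib import Path

def logs_touched_before(log_dir, crash_time, window_s=300):
    """Return names of *.log files last modified shortly before crash_time.

    crash_time is a Unix timestamp; window_s is an assumed look-back window.
    A log that was still being written when the machine rebooted should
    have a modification time just before the crash.
    """
    hits = []
    for path in sorted(Path(log_dir).glob("*.log")):
        mtime = path.stat().st_mtime
        if crash_time - window_s <= mtime <= crash_time:
            hits.append(path.name)
    return hits
```

Calling it with the log directory of the interrupted run and the approximate
reboot time gives a list of candidate ports to compare across crashes.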

If I try building openjdk7, I can pretty much consistently trigger a system
reboot, even with SMT off, only two cores enabled in the BIOS, the CPU clock
speed lowered to 3 GHz, and the RAM clock cranked down from 2400 MHz to 1866
MHz.

When I marked openjdk7 BROKEN so that poudriere doesn't build it and skips the
ports that depend on it, the system stayed up and poudriere ran for almost 9
hours, though two ports failed with the jemalloc assertion failure that I
previously mentioned.

I also now think that the Dragonfly patch isn't needed on FreeBSD and
potentially could be harmful.  It is meant to work around what looks like a
Ryzen SMT bug.  The problem appears to be triggered by executing code close to
the top of user address space.  On Dragonfly, the signal trampoline code is
located just above the stack and very close to the top of user address space. 
By adding space to the end of sigtramp.S, the trampoline code is moved to a
lower starting address.  On FreeBSD, the signal trampoline code was moved to a
separate memory page so that the stack could be marked non-executable.  This
page is located at the very top of user address space.  I haven't looked at
what all is in this page, but if the contents are loaded starting at the bottom
of the page, then the start of the signal trampoline is likely to be at a lower
address than on Dragonfly.  If other code is loaded in this page after the
signal trampoline, then adding space at the end could move that code closer to
the danger zone.  In any case, I had been doing much of my testing with SMT
disabled, so I removed this patch from my kernel.
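
The address arithmetic behind that argument can be sketched as follows.  Every
number here is an assumption for illustration: the top of amd64 user address
space is taken to be 0x800000000000, the page size 4096, and the region sizes
and padding are made-up values, not measurements of the real trampoline or of
whatever else lives in that page.

```python
# All constants are illustrative assumptions, not values read from a kernel.
USER_TOP = 0x0000_8000_0000_0000   # assumed top of amd64 user address space
PAGE_SIZE = 4096                   # assumed page size

def layout(page_base, sizes):
    """Place regions back to back from the bottom of the page.

    Returns the start address of each region, in order.
    """
    starts = []
    addr = page_base
    for size in sizes:
        starts.append(addr)
        addr += size
    return starts

def gap_to_top(start, length, top=USER_TOP):
    """Bytes between the end of a code region and the top of user space."""
    return top - (start + length)

page_base = USER_TOP - PAGE_SIZE   # page at the very top of user space
SIGTRAMP = 64    # hypothetical trampoline size in bytes
OTHER = 128      # hypothetical size of code loaded after the trampoline
PAD = 64         # hypothetical padding appended to the trampoline

plain = layout(page_base, [SIGTRAMP, OTHER])
padded = layout(page_base, [SIGTRAMP + PAD, OTHER])

# The trampoline start does not move, but the code after it ends up
# closer to the top of user address space once padding is added.
print(hex(plain[0]), hex(padded[0]))
print(gap_to_top(plain[1], OTHER), gap_to_top(padded[1], OTHER))
```

Under these assumptions the padding gains nothing for the trampoline itself
and only shrinks the margin for anything loaded after it.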

After backing out the Dragonfly patch, marking bootstrap-openjdk as BROKEN to
eliminate any vestige of java, and setting the RAM and CPU clocks back to
auto, I ran poudriere again and the run was mostly successful, though I did see
a lang/go build failure due to a runaway build problem.

I then enabled SMT and core performance boost and ran poudriere again.  I
observed build failures of lang/go, gdb, and cairo.  I didn't see any obvious
problems with the latter two; it looked like something in each just returned
the wrong exit status.  Restarting poudriere successfully built the latter two,
but go failed again.  The go failures appeared to be caused by some sort of
corruption of its malloc state.  Note: go is multi-threaded.

Just for grins, I decided to try building ports in an i386 jail.  I got no
unexpected failures.  The results were the same when I re-enabled the java
ports.  It successfully built 1594 ports in 8 hours 33 minutes.  I was even
able to build lang/ghc on i386.  That one always had segfaults in the bootstrap
compiler for me on amd64.  I have no idea if it uses threads, though.

At least on my hardware there are one or more problems with amd64 code.  It
might just be multi-threaded processes.  The java problem could also be caused
by the hotspot compiler, which may look like self-modifying code.  In any case,
it can cause system hangs or reboots and may also corrupt the state of other
processes.  I finally received the hardware to set up a serial console
yesterday, but I haven't had time to install it yet.  The reboots that I've
seen don't seem to leave any trace in the logs, don't seem to trigger ddb, and
don't leave crash dumps.


