Re: [package - 130arm64-default][lang/gcc12-devel] Failed for gcc12-devel-12.0.1.s20220306_2 in build/runaway

From: Mark Millard <marklmi_at_yahoo.com>
Date: Sat, 02 Apr 2022 18:27:51 UTC
On 2022-Mar-27, at 09:02, Mark Millard <marklmi@yahoo.com> wrote:

. . .
> On 26 Mar 2022, at 15:16, pkg-fallout@freebsd.org <pkg-fallout@FreeBSD.org> wrote:
>> 
>> . . .
>> Log URL:        http://ampere3.nyi.freebsd.org/data/130arm64-default/60ab72786154/logs/gcc12-devel-12.0.1.s20220306_2.log
>> . . .

It turns out that log (and other examples of lang/gcc12-devel runaway
kills) does have a hint about which timeouts to change:

QUOTE
=>> Killing runaway build after 7200 seconds with no output
END QUOTE

The quarterly branch's build for 12.3 also got a kill, with the same message.
See:

https://lists.freebsd.org/archives/freebsd-toolchain/2022-April/000478.html

I'll note that, after the message, the actual kill can come hours later
in the build's activity, depending on the size of the log file: the
log file is evaluated before the kill is done, and the scans involved
(plural!) of huge log files can take that kind of time.

The message is from:

# grep -r "seconds with no output" /usr/local/share/poudriere/ | more
/usr/local/share/poudriere/common.sh:                                   msg "Killing runaway build after ${NOHANG_TIME} seconds with no output"
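
For anyone wanting to check the shipped default and any local override,
something like the following should show both (paths as installed by the
poudriere port; adjust to match your installation):

# grep -n "NOHANG_TIME" /usr/local/share/poudriere/common.sh /usr/local/etc/poudriere.conf.sample /usr/local/etc/poudriere.conf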

So, at least, NOHANG_TIME needs to increase as long as bootstrap-lto-noplugin
is in use. (There may be more settings involved.) Note that NOHANG_TIME is not
specific to the individual port: it applies to every port in the bulk run.

As I remember, the kills tend to happen between 11 and 12 hours into the
aarch64 build, but the successful builds take 20..24 hours. It is not great
evidence, but it might suggest more than doubling NOHANG_TIME (for aarch64
jails?). Looking at it differently, since the port does sometimes build, a
smaller increase might avoid most of the kills that are now happening.
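
As a rough sketch of the kind of change I mean, in
/usr/local/etc/poudriere.conf (the value is purely illustrative, not a
tested recommendation):

# NOHANG_TIME defaults to 7200 seconds (the "no output" limit for a builder);
# more than double it to tolerate the long quiet stretches seen here:
NOHANG_TIME=18000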

For poudriere, there are:

NOHANG_TIME
MAX_EXECUTION_TIME
MAX_EXECUTION_TIME_EXTRACT
MAX_EXECUTION_TIME_INSTALL
MAX_EXECUTION_TIME_PACKAGE
MAX_EXECUTION_TIME_DEINSTALL
QEMU_MAX_EXECUTION_TIME
QEMU_NOHANG_TIME

These are not independent, however. Setting a larger MAX_EXECUTION_TIME*
value can be ineffective with a small NOHANG_TIME if, for example, the
activity that takes the extra time happens not to produce output periodically.
(I've run into that before when I tried a bulk -a with WITH_DEBUG= in
use.)
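
As a concrete (purely illustrative) example of that dependence, a
combination like the following would still kill the gcc12-devel build,
since the no-output limit fires long before the overall cap is reached:

MAX_EXECUTION_TIME=172800   # overall per-build cap raised to 48 hours
NOHANG_TIME=7200            # still the default, so the runaway kill still happens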

One of the issues with poudriere's timeouts is that they do not auto-scale
to match machine performance. Figures chosen for slower environments can be
time/power wasters on faster hardware when a build process really does run
away.

Another point is that there is no scaling based on expected/historical time
frames. So runaway builds of ports that should not take much time end up
running for a long time before being killed, because the limits have to be
large enough to let ports that are expected to take a long time finish
instead of being killed.

Another issue is that, for multiple builders doing a build over the same time
frame, the other builders' activity can lead to longer stretches of "seconds
with no output".

Part of the issue for lang/gcc* is that part of the bootstrap-lto-noplugin
processing does not respect the limits on parallel activity. Having multiple
bootstrap-lto-noplugin stages going because multiple lang/gcc* ports are
building at the same time apparently can lead to very high load averages for
a time. In fact, even just one lang/gcc* doing bootstrap-lto-noplugin can,
for a time, have a load average of something like 1.5 * (# hardware threads),
even when the build was told to use the # of hardware threads as the limit;
on a 32-hardware-thread machine, for example, that would be a load average
around 48. (Cores, in this context.)

When changes are made to how things build, who is supposed to determine how to
adjust poudriere's settings to match on the various build architectures? Is this
something that "exp run" style experiments are appropriate for determining for
each build architecture, or at least for the tier 1 architectures? (Just me
pondering, given that the appropriate *TIME* settings are not obvious.)


===
Mark Millard
marklmi at yahoo.com