Silent hang in buildworld, was Re: Invoking -v for clang during buildworld

Mark Millard marklmi at yahoo.com
Sun Jan 17 20:30:58 UTC 2021



On 2021-Jan-17, at 09:40, bob prohaska <fbsd at www.zefox.net> wrote:

> On Sat, Jan 16, 2021 at 03:04:04PM -0800, Mark Millard wrote:
>> 
>> Other than -j1 style builds (or equivalent), one pretty much
>> always needs to go looking around for a non-panic failure. It
>> is uncommon for all the material to be together in the build
>> log in such contexts.
> 
> Running make cleandir twice and restarting -j4 buildworld brought
> the process full circle: A silent hang, no debugger response, no
> console warnings. That's what sent me down the rabbit hole of make
> without clean, which worked at least once...

Unfortunately, such a hang tends to mean that log files and the
like were not completely written out to media. We do not get to
see evidence from the actual failure time frame, just from
somewhat before it. (Compiler/linker output files can end up
incompletely updated for the same reason.)

So my notes here are unlikely to be tied to any solid evidence;
they are more like alternatives that might be worth exploring,
and they could be far off the mark.

It is not clear if you were using:

LDFLAGS.lld+= -Wl,--threads=1

or some such to limit multi-threaded linking and its memory use.
I'll note that if -j4 gets 4 links running in parallel, it used
to be that each could have something like 5 threads active on a
4-core machine, so 20 or so threads in total. (I've not checked
llvm11's lld behavior; it might avoid such by default.)
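
For illustration, a minimal sketch of where such a line could
live, assuming the usual /etc/src.conf mechanism (the file path
and the single-thread value are examples, not a tested
recommendation):

# /etc/src.conf (illustrative only)
# Bound each lld invocation's memory use by limiting it to one thread:
LDFLAGS.lld+= -Wl,--threads=1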

You have not reported any testing of -j2 or -j3 so far, just
-j4. (Using a smaller -j value is another way of limiting memory
use, power use, temperature, and the like.)
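
For example (a sketch only; the log path is arbitrary), a -j2
test pass might look like:

cd /usr/src
make -j2 buildworld > $HOME/buildworld-j2.log 2>&1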

You have not reported whether your boot complained about the
swap space size, or whether you have adjusted the related
settings to make non-default tradeoffs in swap management for
these specific tests. I recommend not tailoring those settings
and instead using a total swap size somewhat under the point
where the untailored boot starts to complain.
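
For reference, the configured swap and any boot-time complaint
can be checked with standard tools (a sketch; the grep pattern
is only a guess at the warning's wording):

swapinfo -m              # per-device and total swap, in MB
dmesg | grep -i swap     # look for a boot-time swap-size warning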


> The residue of the top screen shows
> 
> last pid: 63377;  load averages:  4.29,  4.18,  4.15                         up 1+07:11:07  04:46:46
> 60 processes:  5 running, 55 sleeping
> CPU: 70.7% user,  0.0% nice, 26.5% system,  2.8% interrupt,  0.0% idle
> Mem: 631M Active, 4932K Inact, 92M Laundry, 166M Wired, 98M Buf, 18M Free
> Swap: 2048M Total, 119M Used, 1928M Free, 5% Inuse, 16K In, 3180K Out
> packet_write_wait: Connection to 50.1.20.26 port 22: Broken pipe
> bob at raspberrypi:~ $ ssh www.zefox.com    RES STATE    C   TIME    WCPU COMMAND
> ssh: connect to host www.zefox.com port 22: Connection timed out86.17% c++
> bob at raspberrypi:~ $ 1  99    0   277M   231M RUN      0   3:26  75.00% c++
> 63245 bob           1  99    0   219M   173M CPU0     0   2:10  73.12% c++
> 62690 bob           1  98    0   354M   234M RUN      3   9:42  47.06% c++
> 63377 bob           1  30    0  5856K  2808K nanslp   0   0:00   3.13% gstat
> 38283 bob           1  24    0  5208K   608K wait     2   2:00   0.61% sh
>  995 bob           1  20    0  6668K  1184K CPU3     3   8:46   0.47% top
>  990 bob           1  20    0    12M  1060K select   2   0:48   0.05% sshd
> ....

This does not look like ld was in use as of the last top
display update. But the interval between display updates is
fairly long relative to the CPU activity, so this is only
suggestive.

> [apologies for typing over the remnants]
> 
> I've put copies of the build and swap logs at
> 
> http://www.zefox.net/~fbsd/rpi2/buildworld/
> 
> The last vmstat entry (10 second repeat time) reports:
> procs     memory       page                      disks     faults       cpu
> r b w     avm     fre  flt  re  pi  po    fr   sr da0 sd0   in   sy   cs us sy id
> 4  0 14  969160   91960   685   2   2   1   707  304   0   0 11418   692  1273 45  5 50
> 
> Does that point to the memory exhaustion suggested earlier in the thread?
> At this point /boot/loader.conf contains vm.pfault_oom_attempts="-1", but 
> that's a relic of long-ago attempts to use USB flash for root and swap.
> Might removing it stimulate more warning messages?
> 

vm.pfault_oom_attempts="-1" should only be used in contexts where
running out of swap will not happen. Otherwise a deadlocked system
can result if it does run out of swap. (Run-out has more senses the
just the swap partition being fully used: other internal resources
for keeping track of the swap can run into its limits.) I've no
evidence that the -1 was actually a problem.
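
For reference, a sketch of the non-(-1) settings in
/boot/loader.conf form (I believe 3 and 10 are the stock
defaults, but treat the values as illustrations rather than
recommendations):

# /boot/loader.conf (illustrative only)
vm.pfault_oom_attempts="3"  # page-in retries before an OOM kill is considered
vm.pfault_oom_wait="10"     # seconds to wait between those retries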

I do not find any 1000+ ms/w or ms/r figures in swapscript.log.
I found 3 examples of a little under 405 (on sdda0*), 3 between
340 and 345 (da0*), 4 in the 200s (da0*), and just under 60 in
the 100s (da0*). It does not look to me like the recorded part
had problems with the long latencies that you used to see.
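
For context, a script of the sort that could have produced such
ms/r and ms/w records might look like the following (purely
illustrative: the actual swapscript contents were not shown in
the thread, and the log path is made up):

#!/bin/sh
# Append a timestamped gstat snapshot every 10 seconds.
while true; do
    date
    gstat -bd    # one batch-mode sample, including BIO_DELETE stats
    sleep 10
done >> $HOME/swapscript.log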

So I've not found any specific evidence about what led to the
hang. My earlier questions and suggestions are thus basically
arbitrary, and I would not know what to do with any answers to
them.

The only fairly solid note is that the hang likely left some
files incompletely updated (logs, compiler output files, etc.).

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)


