Re: llvm10 build failure on Rpi3

From: Mark Millard via freebsd-ports <freebsd-ports_at_freebsd.org>
Date: Thu, 24 Jun 2021 06:02:02 UTC
On 2021-Jun-23, at 21:30, bob prohaska <fbsd at www.zefox.net> wrote:

> On Wed, Jun 23, 2021 at 04:22:35PM -0700, Mark Millard wrote:
>> On 2021-Jun-23, at 15:28, bob prohaska <fbsd at www.zefox.net> wrote:
>> . . .
> 
>> 
> [snipped for brevity]
>> 
>>>> For example, 0xA5u byte values might be the value that newly
>>>> allocated memory is initialized to. Looking . . . man jemalloc
>>>> (the memory allocator implementation used by FreeBSD) reports:
>>>> 
>>>>      opt.junk (const char *) r- [--enable-fill]
>>>>          Junk filling. If set to "alloc", each byte of uninitialized
>>>>          allocated memory will be initialized to 0xa5. If set to "free", all
>>>>          deallocated memory will be initialized to 0x5a. If set to "true",
>>>>          both allocated and deallocated memory will be initialized, and if
>>>>          set to "false", junk filling be disabled entirely. This is intended
>>>>          for debugging and will impact performance negatively. This option
>>>>          is "false" by default unless --enable-debug is specified during
>>>>          configuration, in which case it is "true" by default.
>>>> 
>>>> So, if you have junk filling enabled, I expect that you ran
>>>> into a legitimate defect in the llvm-tblgen in use. Having
>>>> Junk Filling disabled might be a workaround.
>>>> 
>>>> There is /etc/malloc.conf as a way of controlling the behavior:
>>>> 
>>>> ln -s 'junk:false' /usr/local/poudriere/poudriere-system/etc/malloc.conf
>>>> 
>>>> I suggest you retry building after getting the above in place.
>>>> If it does not get the 0xA5A5A5A5u value, that would be
>>>> more evidence of an uninitialized-memory defect in the llvm-tblgen
>>>> involved.
>>>> 
>>> Done and running now. In the interim I tried building llvm10 using
>>> make in /usr/ports, but it failed with another python conflict.
>> 
> The poudriere session just ended, with a somewhat different error:
> 
> In file included from /wrkdirs/usr/ports/devel/llvm10/work/llvm-10.0.1.src/lib/Target/AArch64/AArch64InstructionSelector
> .cpp:312:
> lib/Target/AArch64/AArch64GenGlobalISel.inc:1900:41: error: expected expression
>        /*GIM_CheckRegBankForClass: @0*/, /*MI*/1, /*Op*/2, /*RC*//*AArch64::FPR64RegClassID: @0*/,
>                                        ^
> lib/Target/AArch64/AArch64GenGlobalISel.inc:1900:99: error: expected expression
>        /*GIM_CheckRegBankForClass: @0*/, /*MI*/1, /*Op*/2, /*RC*//*AArch64::FPR64RegClassID: @0*/,
>                                                                                                  ^
> 2 errors generated.
> [ 25% 1396/5364]
> 
> The last line is included as a fiducial indicator.  Two errors instead of
> four, nothing about AMDGPU. 

You have a prior run that also showed only 2 errors:

http://www.zefox.org/~bob/poudriere/data/logs/bulk/main-default/2021-06-21_12h55m51s/logs/errors/llvm10-10.0.1_5.log

has:

lib/Target/AMDGPU/AMDGPUGenGlobalISel.inc:15822:50: error: expected expression
        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/0, /*RC*//*AMDGPU::VGPR_32RegClassID: @2779096485*/,
                                                 ^
lib/Target/AMDGPU/AMDGPUGenGlobalISel.inc:15822:118: error: expected expression
        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/0, /*RC*//*AMDGPU::VGPR_32RegClassID: @2779096485*/,
                                                                                                                     ^
2 errors generated.

And a prior one that shows 6 errors but for AArch64 instead of AMDGPU:

http://www.zefox.org/~bob/poudriere/data/logs/bulk/main-default/2021-06-18_19h00m47s/logs/errors/llvm10-10.0.1_5.log

has:

lib/Target/AArch64/AArch64GenGlobalISel.inc:3760:50: error: expected expression
        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/1, /*Op*/1, /*RC*//*AArch64::FPR64RegClassID: @2779096485*/,
                                                 ^
lib/Target/AArch64/AArch64GenGlobalISel.inc:3760:117: error: expected expression
        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/1, /*Op*/1, /*RC*//*AArch64::FPR64RegClassID: @2779096485*/,
                                                                                                                    ^
lib/Target/AArch64/AArch64GenGlobalISel.inc:5735:50: error: expected expression
        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/1, /*RC*//*AArch64::GPR64RegClassID: @2779096485*/,
                                                 ^
lib/Target/AArch64/AArch64GenGlobalISel.inc:5735:117: error: expected expression
        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/1, /*RC*//*AArch64::GPR64RegClassID: @2779096485*/,
                                                                                                                    ^
lib/Target/AArch64/AArch64GenGlobalISel.inc:22981:50: error: expected expression
        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/1, /*RC*//*AArch64::GPR64spRegClassID: @2779096485*/,
                                                 ^
lib/Target/AArch64/AArch64GenGlobalISel.inc:22981:119: error: expected expression
        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/1, /*RC*//*AArch64::GPR64spRegClassID: @2779096485*/,
                                                                                                                      ^
6 errors generated.
ninja: build stopped: subcommand failed.
*** Error code 1

It appears that the bug does not have reproducible details
but all of the examples that do not have junk:false show
@2779096485 . (And the only junk:false tried so far has @0
instead.)
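
For reference, 2779096485 is the decimal form of 0xA5A5A5A5, that is,
four bytes of the 0xa5 junk-fill value. A minimal sketch for checking
that against log text follows. The helper name at_values and the
regular expression are my own illustration, not anything from
llvm-tblgen or poudriere; the two sample lines are copied from the
logs quoted in this message.

```python
import re

# Matches the "@<number>" placeholders seen in the generated .inc lines.
AT_VALUE = re.compile(r"@(\d+)")

def at_values(log_text):
    """Return the distinct @-values in log_text as 32-bit hex strings."""
    return sorted({f"0x{int(m):08X}" for m in AT_VALUE.findall(log_text)})

# One line from the 2021-06-21 log and one from the junk:false run.
sample = """
/*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/0, /*RC*//*AMDGPU::VGPR_32RegClassID: @2779096485*/,
/*GIM_CheckRegBankForClass: @0*/, /*MI*/1, /*Op*/2, /*RC*//*AArch64::FPR64RegClassID: @0*/,
"""

print(at_values(sample))  # ['0x00000000', '0xA5A5A5A5']
```

Running the same function over a whole saved log would show whether
any value other than these two ever appears.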

Something is providing and/or using uninitialized memory.

There is the possibility that swapping out and back in
sometimes does not provide pages with the intended content.
I state that as an example of why we really can not claim
to know that llvm-tblgen itself is doing something wrong.
I'm not claiming to know what is actually happening. But
such would fit with contexts that have more RAM, and so
avoid much of the paging/swapping, also not seeing the
problem.

But as in some past examples, you may have exposed a
problem with FreeBSD.

>> Interesting. I'm unable to see a:
>> 
>> /usr/local/poudriere/poudriere-system/etc/malloc.conf
>> 
>> via what you have published. But I've no clue if such
>> an odd symbolic link would be expected to show up.

Still true, but . . .

Well, now: http://www.zefox.org/~bob/poudriere/
shows a: junk:false

Note that this is at the same level as poudriere-system/
is shown. You might want to look and see if the file
system shows such a file at that level as well.

This did not show up until after the build attempt had
finished from what I can tell.

> The link seems visible to find and ls: 
> root@www:/usr/local/poudriere # find . -name malloc.conf
> ./poudriere-system/etc/malloc.conf
> root@www:/usr/local/poudriere # more ./poudriere-system/etc/malloc.conf
> ./poudriere-system/etc/malloc.conf: No such file or directory
> root@www:/usr/local/poudriere # ls -l ./poudriere-system/etc/malloc.conf
> lrwxr-xr-x  1 root  wheel  10 Jun 23 14:27 ./poudriere-system/etc/malloc.conf -> junk:false
> root@www:/usr/local/poudriere # 
> 
> The link seems invisible to cat and more, reporting "No such file...."

The link is looking for a file called junk:false in the same
directory. It is not expected to find such a file.
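
In other words, the link is only a carrier for the text junk:false.
As I understand the jemalloc documentation, the allocator reads the
name stored in the /etc/malloc.conf link rather than opening what it
points to, which is why cat and more fail while ls -l shows the
setting. A small sketch of that difference, using a temporary
directory instead of the real poudriere jail path (the link_text
helper name is just for illustration):

```python
import os
import tempfile

def link_text(path):
    """Return the text stored in a symbolic link without following it."""
    return os.readlink(path)

with tempfile.TemporaryDirectory() as d:
    conf = os.path.join(d, "malloc.conf")
    os.symlink("junk:false", conf)             # same shape as the ln -s
    target_text = link_text(conf)              # what ls -l displays
    exists_via_follow = os.path.exists(conf)   # follows the link, like cat
    exists_as_link = os.path.lexists(conf)     # inspects the link itself

print(target_text, exists_via_follow, exists_as_link)
```

A dangling link is therefore the expected state here, not a sign
that the setting was lost.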

> I'm not sure what might be profitably tried next..... Suggestions welcome!

First off, if the point is to get the RPi3B+ going
more than it is to get evidence about the problem,
I'd suggest booting an RPi4B with the same media
(adjusting config.txt as necessary) and trying the
build from that boot. If it builds, the media can
be moved back to the RPi3B+ for other activity.
The failed vs. built status does give some
information about the problem. Built would suggest
that paging/swapping was involved in the problem.
Failed might suggest otherwise. (I do not know
if there would be much paging/swapping, depending on
how much RAM the RPi4B had.)

One experiment would be to use the same boot media on
an RPi4B but that had been told in config.txt to limit
itself to 1 GiByte of RAM --and to also try with all
the RAM being allowed. If the first fails but the
second works, that is probably nice evidence. If both
fail, that also is probably nice evidence. For the other
two combinations, the implications are less clear.

(I'm not claiming that you have such a RPi4B that can
be made available for the duration of such experiments.)

Another direction is messy: testing under stable/13 and/or
releng/13.0 vintages to see if it is somehow specific
to main [so: 14], having an analogous context to what is
known to fail under main (as much as reasonable). The
RPi4B two-RAM-sizes comparison/contrast type of test could
also be used.

There is also just repeating with junk:false a couple of
times to see if there is evidence of variability like
there is without junk:false. Simplest of the
suggested tests, but likely the least informative.

None of this would be likely to get close to a short,
small test that shows the problem. I've no clue how
to target that at this point.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)