Re: llvm10 build failure on Rpi3

From: Mark Millard via freebsd-ports <freebsd-ports_at_freebsd.org>
Date: Thu, 24 Jun 2021 17:41:38 UTC
On 2021-Jun-24, at 09:01, bob prohaska <fbsd at www.zefox.net> wrote:

> [What about trying a new kernel? details at end]
> On Wed, Jun 23, 2021 at 11:02:02PM -0700, Mark Millard wrote:
>> On 2021-Jun-23, at 21:30, bob prohaska <fbsd at www.zefox.net> wrote:
>> 
>>> On Wed, Jun 23, 2021 at 04:22:35PM -0700, Mark Millard wrote:
>>>> On 2021-Jun-23, at 15:28, bob prohaska <fbsd at www.zefox.net> wrote:
>>>> . . .
>>> 
>>>> 
>>> [snipped for brevity]
>>>> 
>>>>>> For example, 0xA5u byte values might be the value that newly
>>>>>> allocated memory is initialized to. Looking . . . man jemalloc
>>>>>> (the memory allocator implementation used by FreeBSD) reports:
>>>>>> 
>>>>>>     opt.junk (const char *) r- [--enable-fill]
>>>>>>         Junk filling. If set to "alloc", each byte of uninitialized
>>>>>>         allocated memory will be initialized to 0xa5. If set to "free", all
>>>>>>         deallocated memory will be initialized to 0x5a. If set to "true",
>>>>>>         both allocated and deallocated memory will be initialized, and if
>>>>>>         set to "false", junk filling be disabled entirely. This is intended
>>>>>>         for debugging and will impact performance negatively. This option
>>>>>>         is "false" by default unless --enable-debug is specified during
>>>>>>         configuration, in which case it is "true" by default.
>>>>>> 
>>>>>> So, if you have junk filling enabled, I expect that you ran
>>>>>> into a legitimate defect in the llvm-tblgen in use. Having
>>>>>> Junk Filling disabled might be a workaround.
>>>>>> 
>>>>>> There is /etc/malloc.conf as a way of controlling the behavior:
>>>>>> 
>>>>>> ln -s 'junk:false' /usr/local/poudriere/poudriere-system/etc/malloc.conf
>>>>>> 
>>>>>> I suggest you retry building after getting the above in place.
>>>>>> If it does not get the 0xA5A5A5A5u value, that would be
>>>>>> more evidence of an uninitialized-memory defect in the llvm-tblgen
>>>>>> involved.
>>>>>> 
>>>>> Done and running now. In the interim I tried building llvm10 using
>>>>> make in /usr/ports, but it failed with another python conflict.
>>>> 
>>> The poudriere session just ended, with a somewhat different error:
>>> 
>>> In file included from /wrkdirs/usr/ports/devel/llvm10/work/llvm-10.0.1.src/lib/Target/AArch64/AArch64InstructionSelector.cpp:312:
>>> lib/Target/AArch64/AArch64GenGlobalISel.inc:1900:41: error: expected expression
>>>       /*GIM_CheckRegBankForClass: @0*/, /*MI*/1, /*Op*/2, /*RC*//*AArch64::FPR64RegClassID: @0*/,
>>>                                       ^
>>> lib/Target/AArch64/AArch64GenGlobalISel.inc:1900:99: error: expected expression
>>>       /*GIM_CheckRegBankForClass: @0*/, /*MI*/1, /*Op*/2, /*RC*//*AArch64::FPR64RegClassID: @0*/,
>>>                                                                                                 ^
>>> 2 errors generated.
>>> [ 25% 1396/5364]
>>> 
>>> The last line is included as a fiducial indicator.  Two errors instead of
>>> four, nothing about AMDGPU. 
>> 
>> You have a prior run that also showed only 2 errors:
>> 
>> http://www.zefox.org/~bob/poudriere/data/logs/bulk/main-default/2021-06-21_12h55m51s/logs/errors/llvm10-10.0.1_5.log
>> 
>> has:
>> 
>> lib/Target/AMDGPU/AMDGPUGenGlobalISel.inc:15822:50: error: expected expression
>>        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/0, /*RC*//*AMDGPU::VGPR_32RegClassID: @2779096485*/,
>>                                                 ^
>> lib/Target/AMDGPU/AMDGPUGenGlobalISel.inc:15822:118: error: expected expression
>>        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/0, /*RC*//*AMDGPU::VGPR_32RegClassID: @2779096485*/,
>>                                                                                                                     ^
>> 2 errors generated.
>> 
>> And a prior one that shows 6 errors but for AArch64 instead of AMDGPU:
>> 
>> http://www.zefox.org/~bob/poudriere/data/logs/bulk/main-default/2021-06-18_19h00m47s/logs/errors/llvm10-10.0.1_5.log
>> 
>> has:
>> 
>> lib/Target/AArch64/AArch64GenGlobalISel.inc:3760:50: error: expected expression
>>        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/1, /*Op*/1, /*RC*//*AArch64::FPR64RegClassID: @2779096485*/,
>>                                                 ^
>> lib/Target/AArch64/AArch64GenGlobalISel.inc:3760:117: error: expected expression
>>        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/1, /*Op*/1, /*RC*//*AArch64::FPR64RegClassID: @2779096485*/,
>>                                                                                                                    ^
>> lib/Target/AArch64/AArch64GenGlobalISel.inc:5735:50: error: expected expression
>>        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/1, /*RC*//*AArch64::GPR64RegClassID: @2779096485*/,
>>                                                 ^
>> lib/Target/AArch64/AArch64GenGlobalISel.inc:5735:117: error: expected expression
>>        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/1, /*RC*//*AArch64::GPR64RegClassID: @2779096485*/,
>>                                                                                                                    ^
>> lib/Target/AArch64/AArch64GenGlobalISel.inc:22981:50: error: expected expression
>>        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/1, /*RC*//*AArch64::GPR64spRegClassID: @2779096485*/,
>>                                                 ^
>> lib/Target/AArch64/AArch64GenGlobalISel.inc:22981:119: error: expected expression
>>        /*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/1, /*RC*//*AArch64::GPR64spRegClassID: @2779096485*/,
>>                                                                                                                      ^
>> 6 errors generated.
>> ninja: build stopped: subcommand failed.
>> *** Error code 1
>> 
>> It appears that the bug does not have reproducible details
>> but all of the examples that do not have junk:false show
>> @2779096485 . (And the only junk:false tried so far has @0
>> instead.)
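
As a quick check of the arithmetic behind that observation, the sketch below converts between the two forms. It only shows that the decimal value in the logs is the 32-bit pattern made of four 0xa5 junk bytes; the variable names are just for illustration.

```shell
# Decimal view of the 32-bit pattern made of four 0xa5 junk-fill bytes.
junk_dec=$(printf '%u' 0xA5A5A5A5)

# Hexadecimal view of the number shown after '@' in the failing logs.
log_hex=$(printf '0x%X' 2779096485)

echo "0xA5A5A5A5 = $junk_dec"
echo "2779096485 = $log_hex"
```

So @2779096485 is what junk-filled memory looks like when read as a 32-bit unsigned value, and the @0 seen with junk:false would fit with memory that was never junk-filled.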
>> 
>> Something is providing and/or using uninitialized memory.
>> 
>> There is the possibility that swapping out and back in is
>> sometimes not providing pages with the intended content.
>> I state that as an example that we really can not claim
>> to know that llvm-tblgen itself is doing something wrong.
>> I'm not claiming to know what is actually happening. But
>> such would fit with contexts that have more RAM, and so
>> avoid much of the paging/swapping, also not seeing the
>> problem.
>> 
>> But as in some past examples, you may have exposed a
>> problem with FreeBSD.
>> 
>>>> Interesting. I'm unable to see a:
>>>> 
>>>> /usr/local/poudriere/poudriere-system/etc/malloc.conf
>>>> 
>>>> via what you have published. But I've no clue if such
>>>> an odd symbolic link would be expected to show up.
>> 
>> Still true, but . . .
>> 
>> Well, now: http://www.zefox.org/~bob/poudriere/
>> shows a: junk:false
>> 
>> Note that this is at the same level as poudriere-system/
>> is shown. You might want to look and see if the file
>> system shows such a file at that level as well.
>> 
>> This did not show up until after the build attempt had
>> finished from what I can tell.
>> 
>>> The link seems visible to find and ls: 
>>> root@www:/usr/local/poudriere # find . -name malloc.conf
>>> ./poudriere-system/etc/malloc.conf
>>> root@www:/usr/local/poudriere # more ./poudriere-system/etc/malloc.conf
>>> ./poudriere-system/etc/malloc.conf: No such file or directory
>>> root@www:/usr/local/poudriere # ls -l ./poudriere-system/etc/malloc.conf
>>> lrwxr-xr-x  1 root  wheel  10 Jun 23 14:27 ./poudriere-system/etc/malloc.conf -> junk:false
>>> root@www:/usr/local/poudriere # 
>>> 
>>> The link seems invisible to cat and more, reporting "No such file...."
>> 
>> The link is looking for a file called junk:false in the same
>> directory. It is not expected to find such a file.
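
A minimal sketch of why cat and more fail while ls and find succeed. It uses a scratch directory from mktemp as a stand-in for the jail's etc directory, so it does not touch any real configuration. The allocator takes its options from the text stored in the link (what readlink reports), not from a file the link points at.

```shell
# Scratch directory standing in for .../poudriere-system/etc (illustration only).
demo_dir=$(mktemp -d)

# Same form as the link created earlier: the "target" is really the option string.
ln -s 'junk:false' "$demo_dir/malloc.conf"

# readlink reports the text stored in the link without following it.
link_text=$(readlink "$demo_dir/malloc.conf")
echo "link text: $link_text"

# Following the link (as cat and more do) fails: no file is named junk:false.
if cat "$demo_dir/malloc.conf" >/dev/null 2>&1; then
    follow_result="target exists"
else
    follow_result="no such file"
fi
echo "following the link: $follow_result"

rm -rf "$demo_dir"
```

So the "No such file or directory" report is the expected result for a correctly made malloc.conf link, not a sign that the setting is missing.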
>> 
>>> I'm not sure what might be profitably tried next..... Suggestions welcome!
>> 
>> First off, if the point is to get the RPi3B+ going
>> more than it is to get evidence about the problem,
>> I'd suggest booting an RPi4B with the same media
>> (adjusting config.txt as necessary) and trying the
>> build from that boot. If it builds, the media can
>> be moved back to the RPi3B+ for other activity.
>> The failed vs. built status does give some
>> information about the problem. Built would suggest
>> that paging/swapping was involved in the problem.
>> Failed might suggest otherwise. (I do not know
>> if there would be much paging/swapping, depending on
>> how much RAM the RPi4B had.)
>> 
>> One experiment would be to use the same boot media on
>> an RPi4B but that had been told in config.txt to limit
>> itself to 1 GiByte of RAM --and to also try with all
>> the RAM being allowed. If the first fails but the
>> second works, that is probably nice evidence. If both
>> fail, that also is probably nice evidence. For the other
>> two combinations it is less clear what any implications
>> would be.
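
A hedged sketch of the config.txt side of that experiment. It assumes the Raspberry Pi firmware's total_mem setting (a value in megabytes) is what would be used to limit the RAM, and that the FreeBSD boot honors the limit the firmware reports; I have not verified either on an RPi4B. CONFIG is a hypothetical variable: it defaults to a scratch file here, and on the real system it would name the config.txt on the FAT boot partition, wherever that is mounted.

```shell
# Hypothetical location: defaults to a scratch file so nothing real is modified.
CONFIG="${CONFIG:-$(mktemp)}"

# Assumed firmware setting: limit the RAM the firmware reports to 1024 MiB.
printf 'total_mem=1024\n' >> "$CONFIG"

# Show the resulting setting line(s) for review before rebooting.
grep -n '^total_mem=' "$CONFIG"
```

Removing the line again (or commenting it out) would give the all-the-RAM case for the second half of the comparison.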
>> 
>> (I'm not claiming that you have such a RPi4B that can
>> be made available for the duration of such experiments.)
>> 
>> Another direction is messy: testing under stable/13 and/or
>> releng/13.0 vintages to see if it is somehow specific
>> to main [so: 14], having an analogous context to what is
>> known to fail under main (as much as reasonable). The
>> RPi4B two-RAM-sizes comparison/contrast type of test could
>> also be used.
>> 
>> There is also just repeating with junk:false a couple of
>> times to see if there is evidence of variability like
>> there is for without junk:false. Simplest of the
>> suggested tests, but likely the least informative.
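
For comparing such repeated runs, a small sketch that tallies the values printed after '@' in a log. summarize_fill is a hypothetical helper name, and the sample log is made up from three of the lines quoted above; a real poudriere error log would be passed in its place.

```shell
# summarize_fill LOGFILE: count each distinct value that follows "RegClassID: @".
summarize_fill() {
    grep -o 'RegClassID: @[0-9][0-9]*' "$1" | sed 's/.*@//' | sort | uniq -c
}

# Sample input built from lines quoted earlier in this thread (illustration only).
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
/*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/0, /*Op*/0, /*RC*//*AMDGPU::VGPR_32RegClassID: @2779096485*/,
/*GIM_CheckRegBankForClass: @2779096485*/, /*MI*/1, /*Op*/1, /*RC*//*AArch64::FPR64RegClassID: @2779096485*/,
/*GIM_CheckRegBankForClass: @0*/, /*MI*/1, /*Op*/2, /*RC*//*AArch64::FPR64RegClassID: @0*/,
EOF

summary=$(summarize_fill "$sample_log")
echo "$summary"
rm -f "$sample_log"
```

In a real log each offending source line is shown once per reported error, so the counts would be about double the number of distinct bad lines. What matters for the comparison is which values show up and whether they vary between runs.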
>> 
>> None of this would be likely to get close to a short,
>> small test that shows the problem. I've no clue how
>> to target that at this point.
>> 
> How about booting an older kernel to see if that makes a difference?

An interesting point that I'd not thought about was that
if paging/swapping (or other I/O) was a source of the
problem, then, not only world, but also kernel code would
have to be tracking the status of /etc/malloc.conf . It
is not obvious to me that the kernel would directly track
that. But if the kernel was not replacing the content of
some pages like it should, it might be that we are just
seeing the world code's prior initialization of the
memory.

> ls -dl /boot/kernel* reports
> drwxr-xr-x  2 root  wheel  13824 Jun 18 18:15 /boot/kernel
> drwxr-xr-x  2 root  wheel  13312 Jan  9 15:57 /boot/kernel.main-c255664-g4d64c7243d26
> drwxr-xr-x  2 root  wheel  13312 Aug 29  2020 /boot/kernel.mmccam
> drwxr-xr-x  2 root  wheel  13824 Jun  9 18:52 /boot/kernel.old
> drwxr-xr-x  2 root  wheel  13312 Aug 27  2020 /boot/kernel.r364346
> drwxr-xr-x  2 root  wheel  13312 Aug 29  2020 /boot/kernel.r364895
> drwxr-xr-x  2 root  wheel  13312 Sep  7  2020 /boot/kernel.r365355
> 
> Most of these are probably too old to work at all, but Jun 9 and Jan 9
> might possibly work, I'd expect kernel.old to work as well. ISTR the
> previous success building chromium was early 2021 or before. 
> 

I'll note that:

QUOTE (from 2021-06-12 01:53:02 +0000 commit)
param.h: Bump __FreeBSD_version to 1400022
Commit e1a907a25cfa changed the internal KAPI between the krpc
and nfsserver.  As such, both modules must be rebuilt from
sources.  Bump __FreeBSD_version to 1400022.
END QUOTE

So: Even going back to June 9 may mess up NFS
use. (I've no clue what services you depend on
or in what contexts.) You might need to keep
NFS from even trying to start at the next boot
before booting into such an older kernel.


Jan 9 predates 14 and 13.0-RELEASE: sys/sys/param.h got
#define __FreeBSD_version 1400000 back on Jan-22.

Running newer worlds on older kernels is not supported.
Generally folks do not track the KBI changes vs. the
consequences of not having the right KBI. This makes
interpreting results difficult even when it appears to
work. There can be mixes like NFS not working but other
things working. There could be corruptions but such
may not be likely. Do you have what you consider
sufficient backups in case things get messed up? (That
might be the status of being okay with starting over
if something really bad happens.)
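
A small sketch for recording how far apart the kernel and world are when trying such a combination. compare_kbi is a hypothetical helper. On FreeBSD, uname -K reports the kernel's __FreeBSD_version and uname -U the userland's; the numbers used below are just the two values mentioned above.

```shell
# compare_kbi KERNEL_VERSION WORLD_VERSION: report the ordering of the two values.
compare_kbi() {
    if [ "$1" -lt "$2" ]; then
        echo "kernel older than world"
    elif [ "$1" -gt "$2" ]; then
        echo "kernel newer than world"
    else
        echo "kernel and world match"
    fi
}

# On the FreeBSD system itself the live values could be used instead:
#   compare_kbi "$(uname -K)" "$(uname -U)"
# Illustration with the Jan-22 value and the 2021-06-12 bump quoted above.
compare_kbi 1400000 1400022
```

A mismatch reported this way does not say what will break, only that the combination is outside what is supported.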

If you try the combination you might want to review
the boot messages for any evidence of problems to
worry about before starting a poudriere run or
otherwise causing the system to be busy (or even,
just leaving it running but basically idle).

If the world/kernel combination happened to work well
for the specific activity, I do think the experiment
could be useful. But, if it were me, I'd not want to
run that way beyond the experiment(s), even if the
specific problem seems to go away.

If anything else odd happens with an old kernel in use,
interpreting the result usefully will be unlikely.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)