Re: armv7 lang/gcc12 "no bootstrap" build via system clang 15.0.7 based poudriere build ends up stuck in a small loop

From: Mark Millard <marklmi_at_yahoo.com>
Date: Tue, 07 Mar 2023 11:43:53 UTC
On Mar 7, 2023, at 03:12, Lorenzo Salvadore <developer@lorenzosalvadore.it> wrote:
> 
> ------- Original Message -------
> On Tuesday, March 7th, 2023 at 11:26 AM, Mark Millard <marklmi@yahoo.com> wrote:
> 
> 
>> 
>> 
>> Below is a small example C source showing the clang 15+ armv7
>> problem that leads to the unbounded looping in later code in
>> the lang/gcc12+ builds: a data structure is mis-initialized,
>> breaking its invariant properties used by the later code
>> structure.
>> 
>> # more partition.c
>> // Minor variation of part of some gcc source code!
>> 
>> // For system-clang 15: cc -g -O2 partition.c ; ./a.out
>> // For devel/llvm16: clang16 -g -O2 partition.c ; ./a.out
>> 
>> #include <stdio.h>
>> 
>> 
>> #define NUM_ELEMENTS 32
>> 
>> struct partition_elem
>> {
>> struct partition_elem* next;
>> int class_element;
>> unsigned class_count;
>> };
>> 
>> typedef struct partition_def
>> {
>> int num_elements;
>> struct partition_elem elements[NUM_ELEMENTS];
>> } *partition;
>> 
>> struct partition_def partition_storage;
>> 
>> partition
>> partition_new (int num_elements)
>> {
>> int e;
>> 
>> if (NUM_ELEMENTS < num_elements) num_elements = NUM_ELEMENTS;
>> 
>> partition part= &partition_storage;
>> part->num_elements = num_elements;
>> 
>> for (e = 0; e < num_elements; ++e)
>> {
>> part->elements[e].class_element = e;
>> 
>> part->elements[e].next = &(part->elements[e]);
>> 
>> part->elements[e].class_count = 1;
>> 
>> }
>> 
>> for (e = 0; e < num_elements; ++e)
>> printf("%d: %p : next?: %p\n",e,(void*)&part->elements[e],(void*)part->elements[e].next);
>> 
>> 
>> return part;
>> }
>> 
>> int main(void)
>> {
>> partition part;
>> part= partition_new(NUM_ELEMENTS);
>> 
>> return !part;
>> }
>> 
>> In the output below, note the blocks of 4 "next"
>> values that do not change. Each should match the
>> earlier hexadecimal value on the same line, pointing
>> back to the same element of the array. 3 of 4 do not.
>> 
>> # cc -g -O2 partition.c
>> # ./a.out
>> 0: 0x40a84 : next?: 0x40a84
>> 1: 0x40a90 : next?: 0x40a84
>> 2: 0x40a9c : next?: 0x40a84
>> 3: 0x40aa8 : next?: 0x40a84
>> 4: 0x40ab4 : next?: 0x40ab4
>> 5: 0x40ac0 : next?: 0x40ab4
>> 6: 0x40acc : next?: 0x40ab4
>> 7: 0x40ad8 : next?: 0x40ab4
>> 8: 0x40ae4 : next?: 0x40ae4
>> 9: 0x40af0 : next?: 0x40ae4
>> 10: 0x40afc : next?: 0x40ae4
>> 11: 0x40b08 : next?: 0x40ae4
>> 12: 0x40b14 : next?: 0x40b14
>> 13: 0x40b20 : next?: 0x40b14
>> 14: 0x40b2c : next?: 0x40b14
>> 15: 0x40b38 : next?: 0x40b14
>> 16: 0x40b44 : next?: 0x40b44
>> 17: 0x40b50 : next?: 0x40b44
>> 18: 0x40b5c : next?: 0x40b44
>> 19: 0x40b68 : next?: 0x40b44
>> 20: 0x40b74 : next?: 0x40b74
>> 21: 0x40b80 : next?: 0x40b74
>> 22: 0x40b8c : next?: 0x40b74
>> 23: 0x40b98 : next?: 0x40b74
>> 24: 0x40ba4 : next?: 0x40ba4
>> 25: 0x40bb0 : next?: 0x40ba4
>> 26: 0x40bbc : next?: 0x40ba4
>> 27: 0x40bc8 : next?: 0x40ba4
>> 28: 0x40bd4 : next?: 0x40bd4
>> 29: 0x40be0 : next?: 0x40bd4
>> 30: 0x40bec : next?: 0x40bd4
>> 31: 0x40bf8 : next?: 0x40bd4
>> 
>> It turns out that -O2 is important: no other
>> optimization level that I tried showed the problem,
>> not even -O3. lang/gcc12+ builds happen to use -O2,
>> at least in my environment.
>> 
>> -g is not required for the problem.
> 
> This last point about optimization is interesting.
> It is just a guess, but maybe when you enable bootstrap
> in lang/gcc12 you build the first compiler without
> optimization, while if you disable it you do use -O2.

The bootstrap sequence does not build a full,
general-purpose C compiler via clang (or whatever),
just something simpler that is enough to build
the next stage. So more than just the
optimization level likely contributes to why
bootstrap builds still work.

> I have taken your example C code and tested it in
> FreeBSD amd64 and in a virtual machine running Linux
> (OpenSuse) amd64: I have got the same failure
> in both cases. I used clang15. So the bug does not
> depend on the OS nor on the architecture.

Thanks for the Linux tests. While I'm not well set up
for building gcc (much less in unusual ways), I do
have enough context/knowledge to test my simple test
on aarch64 Fedora. You saved me the effort. Although
maybe I should check independently, given the below.

But on FreeBSD, for targets other than armv7, I do not see the failure:

aarch64 FreeBSD system-clang 15 worked fine:

# cc      -g -O2 partition.c ; ./a.out
0: 0x230d00 : next?: 0x230d00
1: 0x230d10 : next?: 0x230d10
2: 0x230d20 : next?: 0x230d20
3: 0x230d30 : next?: 0x230d30
4: 0x230d40 : next?: 0x230d40
5: 0x230d50 : next?: 0x230d50
6: 0x230d60 : next?: 0x230d60
7: 0x230d70 : next?: 0x230d70
8: 0x230d80 : next?: 0x230d80
9: 0x230d90 : next?: 0x230d90
10: 0x230da0 : next?: 0x230da0
11: 0x230db0 : next?: 0x230db0
12: 0x230dc0 : next?: 0x230dc0
13: 0x230dd0 : next?: 0x230dd0
14: 0x230de0 : next?: 0x230de0
15: 0x230df0 : next?: 0x230df0
16: 0x230e00 : next?: 0x230e00
17: 0x230e10 : next?: 0x230e10
18: 0x230e20 : next?: 0x230e20
19: 0x230e30 : next?: 0x230e30
20: 0x230e40 : next?: 0x230e40
21: 0x230e50 : next?: 0x230e50
22: 0x230e60 : next?: 0x230e60
23: 0x230e70 : next?: 0x230e70
24: 0x230e80 : next?: 0x230e80
25: 0x230e90 : next?: 0x230e90
26: 0x230ea0 : next?: 0x230ea0
27: 0x230eb0 : next?: 0x230eb0
28: 0x230ec0 : next?: 0x230ec0
29: 0x230ed0 : next?: 0x230ed0
30: 0x230ee0 : next?: 0x230ee0
31: 0x230ef0 : next?: 0x230ef0

amd64 FreeBSD system-clang 15 worked fine:

# cc      -g -O2 partition.c ; ./a.out
0: 0x203ca0 : next?: 0x203ca0
1: 0x203cb0 : next?: 0x203cb0
2: 0x203cc0 : next?: 0x203cc0
3: 0x203cd0 : next?: 0x203cd0
4: 0x203ce0 : next?: 0x203ce0
5: 0x203cf0 : next?: 0x203cf0
6: 0x203d00 : next?: 0x203d00
7: 0x203d10 : next?: 0x203d10
8: 0x203d20 : next?: 0x203d20
9: 0x203d30 : next?: 0x203d30
10: 0x203d40 : next?: 0x203d40
11: 0x203d50 : next?: 0x203d50
12: 0x203d60 : next?: 0x203d60
13: 0x203d70 : next?: 0x203d70
14: 0x203d80 : next?: 0x203d80
15: 0x203d90 : next?: 0x203d90
16: 0x203da0 : next?: 0x203da0
17: 0x203db0 : next?: 0x203db0
18: 0x203dc0 : next?: 0x203dc0
19: 0x203dd0 : next?: 0x203dd0
20: 0x203de0 : next?: 0x203de0
21: 0x203df0 : next?: 0x203df0
22: 0x203e00 : next?: 0x203e00
23: 0x203e10 : next?: 0x203e10
24: 0x203e20 : next?: 0x203e20
25: 0x203e30 : next?: 0x203e30
26: 0x203e40 : next?: 0x203e40
27: 0x203e50 : next?: 0x203e50
28: 0x203e60 : next?: 0x203e60
29: 0x203e70 : next?: 0x203e70
30: 0x203e80 : next?: 0x203e80
31: 0x203e90 : next?: 0x203e90

(The systems were all built from copies of
the same FreeBSD source code.)

> However, my results have a difference from yours:
> in my case tests fail with any level of optimization.

I get the same sort of aarch64 and amd64 results for
the other optimization levels that I tried: no problems.

> At this point, I would say that the issue is in clang.

Yep, but I have no evidence of problems except when
targeting armv7 with -O2, and I have only tested on FreeBSD.

You may have more general information than I do at this
point.

> I think you should file a bug upstream.

I'll leave it up to Brooks if he wants to do the
initial upstream activity. He should be well
recognized and would likely be the one dealing
with the later activity tied to getting a fix in
place for FreeBSD.




===
Mark Millard
marklmi at yahoo.com