Re: pkg server for current/arm64 stopped ? [main-armv7 on ampere2, . . .] [Update to Host OSVERSION 1500018 did not help]

From: Philip Paeps <philip_at_freebsd.org>
Date: Thu, 09 May 2024 00:28:55 UTC
On 2024-05-08 23:53:57 (+0800), Mark Millard wrote:

> On Apr 29, 2024, at 20:16, Mark Millard <marklmi@yahoo.com> wrote:
>
>> On Apr 29, 2024, at 20:11, Mark Millard <marklmi@yahoo.com> wrote:
>>
>>> On Apr 29, 2024, at 19:54, Mark Millard <marklmi@yahoo.com> wrote:
>>>
>>>> On Apr 28, 2024, at 18:06, Philip Paeps <philip@freebsd.org> wrote:
>>>>
>>>>> On 2024-04-18 23:14:22 (+0800), Mark Millard wrote:
>>>>>> On Apr 18, 2024, at 08:02, Mark Millard <marklmi@yahoo.com> wrote:
>>>>>>> void <void_at_f-m.fm> wrote on
>>>>>>> Date: Thu, 18 Apr 2024 14:08:36 UTC :
>>>>>>>
>>>>>>>> Not sure where to post this..
>>>>>>>>
>>>>>>>> The last bulk build for arm64 appears to have happened around
>>>>>>>> mid-March on ampere2. Is it broken?
>>>>>>>
>>>>>>> main-armv7 building is broken and the last completed build
>>>>>>> was the one started on Mon, 19 Feb 2024 12:32:10 GMT. It
>>>>>>> gets stuck making no progress until manually forced to stop,
>>>>>>> which leads to huge elapsed times for the incomplete builds:
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>> My guess is that something in FreeBSD broke after bd45bbe440,
>>>>>>> was broken as of f5f08e41aa, and was still broken at
>>>>>>> 75464941dc.
>>>>>>>
>>>>>>
>>>>>> One thing of possible note:
>>>>>>
>>>>>> Failing . . .
>>>>>>
>>>>>> Host OSVERSION: 1500006
>>>>>> Jail OSVERSION: 1500014
>>>>>
>>>>> I have finished a package builder refresh this morning.  All our 
>>>>> builder hosts (except PowerPC - I don't touch those) are now on 
>>>>> main-n269671-feabaf8d5389 (OSVERSION 1500018).
>>>>>
>>>>> ampere1 successfully finished its 140releng-armv7-quarterly build, 
>>>>> so it looks like the problem with stuck builds was limited to 
>>>>> ampere2 building main-armv7.  I'll keep a close eye on this one 
>>>>> when it starts its next build.
>>>>>
>>>>
>>>> I see that main-armv7 started.
>>>>
>>>> It queued only 31935 instead of the prior 34528 (or more): it is doing an
>>>> incremental build instead of a full build. For example, pkg was not built
>>>> but instead the prior build is in use. Thus bad results from the prior
>>>> build might be involved in this new build.
>>>>
>>>> I'd recommend forcing a full "poudriere bulk -c -a" that does a from-scratch
>>>> build for the purposes of the main-armv7 test.
>>>
>>> Actually, as things are, the test is not going to provide the
>>> information we are after.
>>>
>>> giflib-5.2.2 failed to build, which leads to devel/doxygen being
>>> skipped. devel/doxygen was the first one to hang up in the prior
>>> 2 failing attempts, if I remember right.
>>>
>>> giflib-5.2.2 also causes graphics/graphviz to be skipped.
>>> graphics/graphviz was installed just before the hangup in all of
>>> the example hangups. So the context will not be replicated.
>>>
>>> We need graphics/giflib to build to actually do the test.
>>
>> Looks like:
>>
>> https://cgit.freebsd.org/ports/commit/graphics/giflib?id=5007109903fc271e3ef0ba01d78781c1fed99f3f
>>
>> is the fix for the graphics/giflib build failure.
>
> Well, main-armv7 is building again and things are still
> getting stuck. So much for my idea. For reference I
> list the over 10-hr-so-far ones:
>
> doxygen-1.9.6_1,2   build-depends 13:03:54
> py39-pydot-2.0.0    run-depends   12:24:04
> py39-pygraphviz-1.6 lib-depends   12:10:38
>
> It would likely be appropriate to get a copy of the
> "ps -alxdww" output.
>
> "procstat -k -k" usage and the like on stuck processes
> would probably be appropriate.
>
> Does anyone with appropriate investigative background
> have login access to ampere2 to take a look at what
> is getting stuck?

This is unfortunate.  I'm sure I have the appropriate background, but 
I'm spread very thin!  I'll get as much information as I can about this 
machine while it's stuck, before I bounce it again.
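
For reference, the collection Mark suggests could be scripted roughly as
follows. This is only a sketch: the helper names (diagnostic_commands,
collect) are made up for illustration, the PIDs of the stuck processes
are assumed to have been picked out by hand, and the only commands used
are the two already named in this thread.

```python
import subprocess


def diagnostic_commands(pids):
    """Build the argv lists for the commands suggested in this thread.

    `pids` is a hypothetical, hand-picked list of stuck process IDs.
    """
    cmds = [["ps", "-alxdww"]]
    # One "procstat -k -k <pid>" per stuck process (kernel stacks).
    cmds += [["procstat", "-k", "-k", str(pid)] for pid in pids]
    return cmds


def collect(pids, run=subprocess.run):
    """Run each command and return its stdout keyed by the command line."""
    reports = {}
    for argv in diagnostic_commands(pids):
        result = run(argv, capture_output=True, text=True, check=False)
        reports[" ".join(argv)] = result.stdout
    return reports
```

Calling collect() on the builder before bouncing it would leave a copy
of the output to share on the list.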

I think it may be worth trying to build those ports in isolation on 
ref14-aarch64, to see what they're trying to do.  I'll also set up a 
set of refX-armv7 jails on that machine.

Hopefully we can get to the bottom of this soon.  This is a very tedious 
failure mode.

We could also try to put an older armv7 image on the builder jail on 
ampere2.  Depending on whether we have a sufficiently old image, that 
will either be very straightforward, or a very deep rabbit hole.

Thanks again for keeping an eye on this.  We really should have better 
monitoring for stuck builds than "Mark will tell us". :-)
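
A very rough sketch of what such a check could look like, assuming the
builder status can be reduced to rows of "package phase HH:MM:SS" like
the ones Mark pasted above. The function name stuck_jobs and the 10-hour
default threshold are illustrative assumptions, not anything the
cluster runs today.

```python
import re

# Matches rows like "doxygen-1.9.6_1,2   build-depends 13:03:54".
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(\d+):(\d{2}):(\d{2})$")


def stuck_jobs(status_text, limit_hours=10):
    """Return (package, phase, elapsed_seconds) for jobs over the limit."""
    stuck = []
    for line in status_text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # ignore anything that is not a status row
        pkg, phase, hours, minutes, seconds = m.groups()
        elapsed = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        if elapsed > limit_hours * 3600:
            stuck.append((pkg, phase, elapsed))
    return stuck


sample = """\
doxygen-1.9.6_1,2   build-depends 13:03:54
py39-pydot-2.0.0    run-depends   12:24:04
py39-pygraphviz-1.6 lib-depends   12:10:38
"""

for pkg, phase, elapsed in stuck_jobs(sample):
    print(f"{pkg}: {phase} for over {elapsed // 3600} hours")
```

Anything this flags would still need a human to look at it, but it
would at least raise the alarm without relying on someone watching the
build pages.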

Philip