Re: qemu-user-static aarch64 lockup/race? (was Re: Python failure in poudriere on arm64 (via qemu-user-static cross compiling))

From: Warner Losh <imp_at_bsdimp.com>
Date: Mon, 29 Jan 2024 15:53:54 UTC
On Mon, Jan 29, 2024, 8:48 AM Guido Falsi <mad@madpilot.net> wrote:

> On 29/01/24 09:26, Guido Falsi wrote:
> > On 29/01/24 02:10, Warner Losh wrote:
> >>
> >>
> >> On Sun, Jan 28, 2024 at 4:45 PM Nathan Reilly-list <lists@nreilly.com> wrote:
> >>
> >>
> >>
> >>>     On 29 Jan 2024, at 8:43 am, Guido Falsi <mad@madpilot.net> wrote:
> >>>     On 28/01/24 22:34, Guido Falsi wrote:
> >>>>     On 28/01/24 22:23, Warner Losh wrote:
> >>>>>     On Sun, Jan 28, 2024, 12:38 PM Guido Falsi <mad@madpilot.net> wrote:
> >>>>>
> >>>>>         On 28/01/24 15:15, Guido Falsi wrote:
> >>>>>         [snip]
> >>>>>          > Creating repository in /tmp/packages:   0%
> >>>>>          >
> >>>>>
> >>>>>         BTW, forgot to mention: the last time this worked without issue
> >>>>>         was around 20th December.
> >>>>>
> >>>>>
> >>>>>     I think this is a bsd-user issue. There is a race somewhere in
> >>>>>     that code that causes the hangs. I'd love a reproducible test
> >>>>>     case that is somewhat smaller than python... there are bigger
> >>>>>     races with the newer stuff and I've not had the time to chase it
> >>>>>     there either. 😞
> >>>>     First of all, thanks for your feedback. It is encouraging to have
> >>>>     someone with better knowledge of this confirm that a race
> >>>>     condition is actually a possible cause!
> >>>>     Strange that this was not happening until mid December.
> >>>>     My main and fully reproducible use case is actually mostly with
> >>>>     pkg. At the end of the run poudriere runs `pkg repo` to create the
> >>>>     meta files and sign the repo. It forks itself (ncpus + 2 I guess;
> >>>>     even forcing it to 1 worker I see three processes), and then
> >>>>     locks up, with all the processes stopping using CPU (ps output is
> >>>>     in my message).
> >>>>     I guess this can be reproduced with any poudriere repo with
> >>>>     more than ncpus packages in it. It can also be reproduced
> >>>>     using `poudriere pkgclean -u <etc>`.
> >>>>     If that does not work I'm not sure how to reproduce it in other
> >>>>     ways, but I can try writing some code mocking what pkg seems to
> >>>>     be doing; I'm not an expert at such things, though.
> >>>
> >>>     In case it helps further narrow things down, it looks like the
> >>>     lockup is happening somewhere around here:
> >>>
> >>>
> >>>
> >>>     https://github.com/freebsd/pkg/blob/56fa3f87d9d9644348b89680dfd8af47a860ee82/libpkg/pkg_repo_create.c#L778
> >>>
> >>>     and/or in the pkg_create_repo_worker() function here:
> >>>
> >>>
> >>>
> >>>     https://github.com/freebsd/pkg/blob/56fa3f87d9d9644348b89680dfd8af47a860ee82/libpkg/pkg_repo_create.c#L341
> >>>
> >>>
> >>>     (I'm trying to spare you the time needed to find the actual code
> >>>     being executed, I guess you would have identified this in a few
> >>>     minutes yourself, but I'm trying to make myself useful)
> >>
> >>
> >>     There appears to be a GitHub issue for poudriere about this, but
> >>     it seems to be looking in another direction.
> >>
> >>     https://github.com/freebsd/poudriere/issues/1009
> >>
> >
> > This one looks quite similar.
> >
> > In my case the ports/pkg are aligned between host and jail, in fact I
> > have built them from the exact same git checkout.
> >
> > I noticed pkg head has been converted to using pthreads instead of fork,
> > maybe that could help. I will make time to perform some testing.
>
> Thanks for pointing me here, it looks like this was "it": with that
> issue fixed, poudriere uses the native pkg-static, which sidesteps the
> problem.
>
>
> Unfortunately there ARE qemu races and lockups that prevent the arm64
> pkg-static binary from being correctly emulated by qemu-user-static.
> Such conditions also cause sporadic failures in some ports being built.
>
> I filed a PR with a fix for that issue:
>
> https://github.com/freebsd/poudriere/pull/1115


Ok. This dodges the problem. But it papers over things.

Any chance you could give me the state of pkg before + the package added as
a test case for qemu?

Warner


>
> --
> Guido Falsi <mad@madpilot.net>
>
>