Re: qemu-user-static aarch64 lockup/race? (was Re: Python failure in poudriere on arm64 (via qemu-user-static cross compiling))

From: Guido Falsi <mad_at_madpilot.net>
Date: Mon, 29 Jan 2024 16:02:36 UTC
On 29/01/24 16:53, Warner Losh wrote:
> 
> 
> On Mon, Jan 29, 2024, 8:48 AM Guido Falsi <mad@madpilot.net> wrote:
> 
>     On 29/01/24 09:26, Guido Falsi wrote:
>      > On 29/01/24 02:10, Warner Losh wrote:
>      >>
>      >>
>      >> On Sun, Jan 28, 2024 at 4:45 PM Nathan Reilly-list <lists@nreilly.com> wrote:
>      >>
>      >>
>      >>
>      >>>     On 29 Jan 2024, at 8:43 am, Guido Falsi <mad@madpilot.net> wrote:
>      >>>     On 28/01/24 22:34, Guido Falsi wrote:
>      >>>>     On 28/01/24 22:23, Warner Losh wrote:
>      >>>>>     On Sun, Jan 28, 2024, 12:38 PM Guido Falsi <mad@madpilot.net> wrote:
>      >>>>>
>      >>>>>         On 28/01/24 15:15, Guido Falsi wrote:
>      >>>>>         [snip]
>      >>>>>          > Creating repository in /tmp/packages:   0%
>      >>>>>          >
>      >>>>>
>      >>>>>         BTW, I forgot to mention: the last time this worked without
>      >>>>>         issue was around 20th December.
>      >>>>>
>      >>>>>
>      >>>>>     I think this is a bsd-user issue. There is a race somewhere in
>      >>>>>     that code that causes the hangs. I'd love a reproducible test
>      >>>>>     case that is somewhat smaller than python... there are bigger
>      >>>>>     races with the newer stuff and I've not had the time to chase it
>      >>>>>     there either. 😞
>      >>>>     First of all, thanks for your feedback. It is encouraging to have
>      >>>>     someone with better knowledge of this confirm that a race
>      >>>>     condition is actually a possible cause!
>      >>>>     It is strange that this did not happen until mid December.
>      >>>>     My main and fully reproducible use case is actually mostly with
>      >>>>     pkg.
>      >>>>     At the end of the run poudriere runs `pkg repo` to create the
>      >>>>     meta files and sign the repo. It forks itself (ncpus + 2 I guess;
>      >>>>     even when forcing it to 1 worker I see three processes), and then
>      >>>>     locks up, with all the processes no longer using CPU (ps output is
>      >>>>     in my message).
>      >>>>     I guess this can be reproduced with any poudriere repo with
>      >>>>     more than ncpus packages in it. It can also be reproduced
>      >>>>     using `poudriere pkgclean -u <etc>`.
>      >>>>     If that does not work I'm not sure how to reproduce it in other
>      >>>>     ways, but I can try writing some code mocking what pkg seems to
>      >>>>     be doing; I'm not an expert at such things, though.
>      >>>
>      >>>     In case it helps narrow things down further, it looks like the
>      >>>     lockup is happening somewhere around here:
>      >>>
>      >>>
>      >>>
>      >>>     https://github.com/freebsd/pkg/blob/56fa3f87d9d9644348b89680dfd8af47a860ee82/libpkg/pkg_repo_create.c#L778
>      >>>
>      >>>     and/or in the pkg_create_repo_worker() function here:
>      >>>
>      >>>
>      >>>
>      >>>     https://github.com/freebsd/pkg/blob/56fa3f87d9d9644348b89680dfd8af47a860ee82/libpkg/pkg_repo_create.c#L341
>      >>>
>      >>>
>      >>>     (I'm trying to spare you the time needed to find the actual code
>      >>>     being executed. I guess you would have identified this in a few
>      >>>     minutes yourself, but I'm trying to make myself useful.)
>      >>
>      >>
>      >>     There appears to be a GitHub issue for poudriere about this, but
>      >>     it seems to be looking in another direction.
>      >>
>      >> https://github.com/freebsd/poudriere/issues/1009
>      >>
>      >
>      > This one looks quite similar.
>      >
>      > In my case the ports/pkg are aligned between host and jail; in fact I
>      > have built them from the exact same git checkout.
>      >
>      > I noticed pkg head has been converted to using pthreads instead of fork;
>      > maybe that could help. I will make time to perform some testing.
> 
>     Thanks for pointing me here. It looks like this was "it", in that
>     fixing this issue makes poudriere use the native pkg-static, which
>     sidesteps the lockup.
> 
> 
>     Unfortunately there ARE qemu races and lockups that prevent the arm64
>     pkg-static binary from being correctly emulated by qemu-user-static.
>     Such conditions also cause sporadic failures in some ports being built.
> 
>     I filed a PR with a fix for that issue:
> 
>     https://github.com/freebsd/poudriere/pull/1115
> 
> 
> Ok. This dodges the problem. But it papers over things.

Definitely, but this is also what was happening in the past. Poudriere
stopped using the native (host) pkg-static when the pkg port gained a
PORTREVISION, which made the same-version check fail.
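
As a rough illustration of why a PORTREVISION bump can break an exact
version comparison (this is not poudriere's actual code; the function
names and version strings below are made up, and the "loose" variant is
only one possible way to tolerate the suffix, not necessarily what the PR
does):

```shell
#!/bin/sh
# Sketch only: NOT poudriere's real check. All names and versions here
# are hypothetical.

# Exact string comparison: any suffix on either side makes it fail.
same_version_strict() {
    [ "$1" = "$2" ]
}

# Strip the ports-only suffixes (",PORTEPOCH" and "_PORTREVISION").
strip_port_suffix() {
    v="${1%%,*}"               # drop ",PORTEPOCH" if present
    printf '%s\n' "${v%%_*}"   # drop "_PORTREVISION" if present
}

# Compare only the upstream part of the two versions.
same_version_loose() {
    [ "$(strip_port_suffix "$1")" = "$(strip_port_suffix "$2")" ]
}

host_ver="1.20.9"     # hypothetical: version reported by the host pkg-static
port_ver="1.20.9_1"   # hypothetical: pkg port version after a PORTREVISION bump

if same_version_strict "$host_ver" "$port_ver"; then
    echo "strict: match, native pkg-static would be used"
else
    echo "strict: mismatch, emulated pkg-static would be used"
fi

if same_version_loose "$host_ver" "$port_ver"; then
    echo "loose: match, native pkg-static would be used"
else
    echo "loose: mismatch, emulated pkg-static would be used"
fi
```

With the made-up values above the strict comparison fails while the loose
one succeeds, which matches the behaviour change described here.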

I agree the underlying issue should be fixed.

> 
> Any chance you could give me the state of pkg before + the package added 
> as a test case for qemu?

I'm not sure I understand what you are asking for. Can you elaborate?

What I did was run poudriere, asking it to compile a few packages. The
lockup, when trying to use the target arch pkg-static via qemu-user, is
100% reproducible in my experience. It does not really depend on the
number of packages. I get it by starting with an empty build.

I'm building these packages (and obviously their dependencies):

dns/unbound
net/kea
sysutils/tmux

(I guess building only tmux could suffice)
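
For reference, a build like the one above corresponds to an invocation
along these lines. The jail name and ports tree name are placeholders
(assumptions, not the names used here), and the helper only prints the
command so it can be checked before anything is run:

```shell
#!/bin/sh
# Placeholder names: substitute your own arm64 jail and ports tree.
JAIL="${JAIL:-arm64jail}"
PTREE="${PTREE:-default}"

# The three origins listed above; their dependencies are built implicitly.
ORIGINS="dns/unbound net/kea sysutils/tmux"

# Print the poudriere bulk command instead of executing it.
repro_cmd() {
    printf 'poudriere bulk -j %s -p %s %s\n' "$JAIL" "$PTREE" "$ORIGINS"
}

repro_cmd
```

Once the names are adjusted, the printed line can be pasted into a root
shell or piped to sh.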


You can get poudriere to use the target arch pkg-static by modifying the
ensure_pkg_installed function in /usr/local/share/poudriere/common.sh so
that the check here fails:

https://github.com/freebsd/poudriere/blob/e00503d846dc7a3b661aac84a6657f15e0f4b702/src/share/poudriere/common.sh#L5687


-- 
Guido Falsi <mad@madpilot.net>