A head buildworld race visible in the ci.freebsd.org build history
Li-Wen Hsu
lwhsu at freebsd.org
Tue Aug 7 02:29:52 UTC 2018
On Thu, Jun 21, 2018 at 10:49 PM Mark Millard <marklmi at yahoo.com> wrote:
> Has the range r328278 < PROBLEM_START <= r330304 been narrowed down
> some more?
>
> (I'm just curious where the problem started.)
After several rounds of binary search, I found that it might have
something to do with r329625.
The only connection I can see between this commit and our situation is
that it touched the unmount code, but I cannot confirm that it is
the cause.
It is a bit tricky to reproduce. I will try to keep it concise.
We do builds for head in a jail (11.2-RELEASE) on a -CURRENT host.
The jail is on a dedicated zfs, and a daemon that does jail/zfs
cleanup runs outside of the jail.
In some edge cases, that cleanup daemon tries to destroy the zfs of
the jail in which a build is still running. When that happened on an
earlier -CURRENT, it would just get "cannot unmount
'/jenkins/jails/test-ranlib': Device busy" and nothing serious would
happen. Recently, although it still does not destroy the busy zfs, it
has started causing the build to error out with "ranlib: fatal:
Failed to open 'libXXX.a'".
To reproduce this, create a zfs, use it as the root of a jail, and
run this build script under /usr/src inside the jail:
https://gist.github.com/lwhsu/ae3b8b1f0c856837f93984ab2493f629#file-build-sh
Then run this cleanup script on the host (the zfs path in it needs to
be modified):
https://gist.github.com/lwhsu/ae3b8b1f0c856837f93984ab2493f629#file-clean-test-ranlib-sh
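For readers without the gist at hand, the host side of the race can be
approximated by a loop that keeps asking zfs to destroy the jail's
dataset while the build runs. The sketch below is an illustrative
stand-in, not the contents of the gist: the function names, the retry
count, and the ZFS_CMD override (there only so the loop can be dry-run
with a dummy command) are all assumptions.

```shell
#!/bin/sh
# Illustrative stand-in for a host-side cleanup loop (NOT the gist script).
# ZFS_CMD is an assumed override so the loop can be dry-run without zfs.
ZFS_CMD="${ZFS_CMD:-zfs}"

# Try once to destroy the dataset and report what happened.
# Returns 0 if the destroy succeeded, 1 if zfs refused (e.g. "Device busy").
try_destroy() {
    if out=$("$ZFS_CMD" destroy "$1" 2>&1); then
        echo "destroyed: $1"
    else
        echo "refused: $1: $out"
        return 1
    fi
}

# Retry a fixed number of times with a pause in between, roughly the way
# a periodic cleanup daemon would keep coming back to the same dataset.
cleanup_loop() {
    dataset=$1
    attempts=$2
    pause=$3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        try_destroy "$dataset" && return 0
        i=$((i + 1))
        sleep "$pause"
    done
    return 1
}
```

Something like `cleanup_loop <pool>/jenkins/jails/test-ranlib 100 5`
on the host, with the real dataset name filled in, would then keep
hitting the busy dataset while the build is in progress.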
I use powerpcspe as TARGET_ARCH here because one iteration takes less
time. The problem should not be specific to any architecture.
I am not sure what the next step is. Maybe modify ranlib to log more
about what it sees when it hits "fatal: Failed to open 'libXXX.a'"?
Any good ideas for debugging this?
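One option that avoids patching ranlib itself might be a small wrapper
that runs the real ranlib and, only when it fails, records whether the
archive is still visible at that moment. This is just a sketch:
REAL_RANLIB, RANLIB_LOG and both function names are made up for
illustration, and how best to point the build at such a wrapper is
left open.

```shell
#!/bin/sh
# Hypothetical debugging wrapper around ranlib. REAL_RANLIB and RANLIB_LOG
# are assumed names, not variables known to the FreeBSD build system.
REAL_RANLIB="${REAL_RANLIB:-/usr/bin/ranlib}"
RANLIB_LOG="${RANLIB_LOG:-/tmp/ranlib-debug.log}"

# Print one line saying whether the path exists and is readable right now.
describe_path() {
    if [ ! -e "$1" ]; then
        echo "missing: $1"
    elif [ ! -r "$1" ]; then
        echo "unreadable: $1"
    else
        echo "ok: $1"
    fi
}

# Run the real ranlib; on failure, append the working directory, the time,
# and the state of every *.a argument to the log, then pass the status on.
ranlib_logged() {
    "$REAL_RANLIB" "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        {
            echo "ranlib exit $rc in $(pwd) at $(date -u)"
            for arg in "$@"; do
                case "$arg" in
                    *.a) describe_path "$arg" ;;
                esac
            done
        } >> "$RANLIB_LOG"
    fi
    return "$rc"
}
```

A "missing" entry in the log would suggest the file really vanished
from the build's view (for example around an unmount attempt), while
"ok" would point back at ranlib or at a shorter-lived window.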
Li-wen
--
Li-Wen Hsu <lwhsu at FreeBSD.org>
https://lwhsu.org