Re: i386 on amd64 can fail to return from cond_wait_user, using basically 100% of a FreeBSD cpu
- Reply: Mark Millard : "Re: i386 on amd64 can fail to return from cond_wait_user, using basically 100% of a FreeBSD cpu [IGNORE: wrong thread]"
- In reply to: Konstantin Belousov : "Re: i386 on amd64 can fail to return from cond_wait_user, using basically 100% of a FreeBSD cpu"
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Sat, 05 Jul 2025 17:08:34 UTC
On Jul 5, 2025, at 01:23, Konstantin Belousov <kostikbel@gmail.com> wrote:
> On Fri, Jul 04, 2025 at 11:01:22PM -0700, Mark Millard wrote:
>> Some package builds are failing on the port-packages build cluster
>> machines that do i386 builds during the following code. The analysis
>> is from replication in a personal context (using poudriere bulk with
>> -i), using poudriere-devel instead. I'll note that the personal
>> context is from using PkgBase 14.3-RELEASE in the poudriere jail.
>> I also installed most of the realted *-dbg* PkgBase packages in order
>> to get the nicer backtracing.
>>
>> (gdb) bt
>> #0 _umtx_op_err () at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/arch/i386/i386/_umtx_op_err.S:37
>> #1 0x2499f897 in _thr_umtx_timedwait_uint (mtx=0x249a365c, id=0, clockid=4, abstime=0x0, shared=0) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_umtx.c:233
>> #2 0x24995b26 in _thr_sleep (curthread=0x24d36004, clockid=4, abstime=0x0) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_kern.c:197
>> #3 0x24990beb in cond_wait_user (cvp=0x24dfa8a0, mp=0x24d38d04, abstime=<optimized out>, cancel=<optimized out>) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c:317
>>
>> NOTE: cond_wait_user never returns but #2..#0 repeat (observed by
>> repeated ^c and bt usage).
>>
>> (i386 is the oddball with 32-bit time_t but I do not know
>> if that is involved here.)
>>
>> #4 cond_wait_common (cond=<optimized out>, mutex=<optimized out>, abstime=0x0, cancel=1) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c:377
>> #5 0x24990e8f in __thr_cond_wait (cond=0x23b9b4f4, mutex=0x23b9b4ec) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c:392
>> #6 0x23be1e4b in uv_cond_wait () from /usr/local/lib/libuv.so.1
>> #7 0x024bd497 in node::NodePlatform::DrainTasks(v8::Isolate*) ()
>> #8 0x0232f5b6 in node::SpinEventLoopInternal(node::Environment*) ()
>> #9 0x02485bf0 in node::NodeMainInstance::Run() ()
>> #10 0x023eaba1 in node::Start(int, char**) ()
>> #11 0x24a1da85 in __libc_start1 (argc=5, argv=0xffffda3c, env=0xffffda54, cleanup=0x23b73020 <rtld_nop_exit>, mainX=0x314a720 <main>)
>> at /home/pkgbuild/worktrees/releng/14.3/lib/libc/csu/libc_start1.c:157
>> #12 0x0232d0a8 in _start ()
>>
>> www/librewolf and other firefox related package builds can do
>> this until a 7200 sec timeout by poudriere occurs:
>>
>> =>> Killing runaway build after 7200 seconds with no output
>>
>> I'll note that truss did not generate any output when used to
>> watch the process that was stuck. It appears to be a world-internal
>> problem.
Looks like my truss comment does not have the implications
that I thought.
> Can you provide a minimal stand-alone reproducer in C?
Unsure. My attempt will be mostly exploratory, not being familiar
with the subject area or /usr/local/lib/libuv.so.1's uv_cond_wait
or node's code.
To some extent the below is me looking for an idea of someething
specific to try.
Looking at where /home/pkgbuild/worktrees/releng/14.3/lib/libthr/
is shown as first listed: the call arguments to __the_cond_wait
are shown as:
(gdb) up 5
#5 0x24990e8f in __thr_cond_wait (cond=0x23b9b4f4, mutex=0x23b9b4ec) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c:392
warning: 392 /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c: No such file or directory
(gdb) print *cond
$3 = (pthread_cond_t) 0x24dfa8a0
(gdb) print **cond
$4 = {__has_user_waiters = 1, kcond = {c_has_waiters = 0, c_flags = 0, c_clockid = 4, c_spare = {0}}}
(gdb) print *mutex
$5 = (pthread_mutex_t) 0x24d38d04
(gdb) print **mutex
$6 = {m_lock = {m_owner = 0, m_flags = 0, m_ceilings = {0, 0}, m_rb_lnk = 0, m_pad = 0, m_spare = {0, 0}}, m_flags = 1, m_count = 0, m_spinloops = 0, m_yieldloops = 0, m_ps = 0, m_qe = {
tqe_next = 0x0, tqe_prev = 0x0}, m_pqe = {tqe_next = 0x0, tqe_prev = 0x0}, m_rb_prev = 0x0}
There is not much set to a non-zero value:
**cond.__has_user_waiters==1
**cond.kcond.c_clockid==4
**mutex.m_flags==1
But, for all I know, some of that status could involve changes
made after the call was made.
There is also the question of debugger accuracy, since the
jail contains an optimized PkgBase world build (with debugger
information available), as far as I know. Such are sometimes
messy to interpret.
A more detailed view of causing another hangup after
interrupting the existing hangup with ^C:
. . .
(gdb) finish
Run till exit from #0 0x2499f897 in _thr_umtx_timedwait_uint (mtx=0x249a365c, id=0, clockid=4, abstime=0x0, shared=0) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_umtx.c:233
0x24995b26 in _thr_sleep (curthread=0x24d36004, clockid=4, abstime=0x0) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_kern.c:197
warning: 197 /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_kern.c: No such file or directory
Value returned is $11 = 4
(gdb) finish
Run till exit from #0 0x24995b26 in _thr_sleep (curthread=0x24d36004, clockid=4, abstime=0x0) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_kern.c:197
0x24990beb in cond_wait_user (cvp=0x24dfa8a0, mp=0x24d38d04, abstime=<optimized out>, cancel=<optimized out>) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c:317
warning: 317 /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c: No such file or directory
Value returned is $12 = 4
(gdb) next
319 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c
(gdb)
321 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c
(gdb)
322 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c
(gdb)
325 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c
(gdb)
335 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c
(gdb)
300 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c
(gdb)
301 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c
(gdb)
302 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c
(gdb)
309 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c
(gdb)
286 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c
(gdb)
315 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c
(gdb)
316 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c
(gdb)
317 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c
(gdb)
The code around 317 is:
if (cancel)
_thr_cancel_enter2(curthread, 0);
error = _thr_sleep(curthread, cvp->kcond.c_clockid, abstime);
if (cancel)
_thr_cancel_leave(curthread, 0);
with the _thr_sleep line being 317.
The implicit next from hitting return does not get back
to a prompt, so _thr_sleep is staying on the stack.
top shows:
78445 420 root 141 0 147728Ki 131756Ki CPU30 30 588:37 99.94% /usr/local/bin/node
during this.
The same happens for _umtx_op_err with:
(gdb) br _umtx_op_err
Breakpoint 4 at 0x2498eec4: file /home/pkgbuild/worktrees/releng/14.3/lib/libthr/arch/i386/i386/_umtx_op_err.S, line 37.
(gdb) c
Continuing.
Thread 1 hit Breakpoint 4, _umtx_op_err () at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/arch/i386/i386/_umtx_op_err.S:37
37 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/arch/i386/i386/_umtx_op_err.S
(gdb) next
So it looks like the cpu time reported by top is in the kernel,
despite my earlier attempt to check on that via truss.
But I'm still not noticing a suggestion of anything specific to
try.
===
Mark Millard
marklmi at yahoo.com