Re: Hang ast / pipelk / piperd
- Reply: Jan Mikkelsen : "Re: Hang ast / pipelk / piperd"
- Reply: Mark Johnston : "Re: Hang ast / pipelk / piperd"
- In reply to: Mark Johnston : "Re: Hang ast / pipelk / piperd"
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Wed, 01 Jun 2022 21:27:52 UTC
On 6/1/22 14:16, Mark Johnston wrote:
> On Mon, May 30, 2022 at 09:35:05PM +0200, Paul Floyd wrote:
>>
>>
>> On 5/30/22 14:15, Mark Johnston wrote:
>>
>>> "procstat -kk <valgrind PID>" might help to reveal what's going on,
>>> since it sounds like the hand/livelock is happening somewhere in the
>>> kernel.
>>
>> Not knowing much about the kernel, my guess is that this is related to
>>
>> commit 4808bab7fa6c3ec49b49476b8326d7a0250a03fa
>> Author: Alexander Motin <mav@FreeBSD.org>
>> Date: Tue Sep 21 18:14:22 2021 -0400
>>
>> sched_ule(4): Improve long-term load balancer.
>>
>> and this bit of ast code
>>
>> doreti_ast:
>> /*
>> * Check for ASTs atomically with returning. Disabling CPU
>> * interrupts provides sufficient locking even in the SMP case,
>> * since we will be informed of any new ASTs by an IPI.
>> */
>> cli
>> movq PCPU(CURTHREAD),%rax
>> testl $TDF_ASTPENDING | TDF_NEEDRESCHED,TD_FLAGS(%rax)
>> je doreti_exit
>> sti
>> movq %rsp,%rdi /* pass a pointer to the trapframe */
>> call ast
>> jmp doreti_ast
>>
>>
>> The above commit seems to be migrating loaded threads to another CPU.
>
> How did you infer that? The long-term load balancer should be running
> fairly infrequently.
>
Well one thread is hung in
mi_switch+0xc2 ast+0x1e6 doreti_ast+0x1f
I found the above code referring to doreti_ast with grep and thought
that if the test always fails then it will loop forever. So I looked for
git commits containing TDF_ASTPENDING and TDF_NEEDRESCHED and found a
couple of commits mentioning these flags, scheduling and found the above
and the followup
commit 11f14b33629e552a451fdbfe653ebb0addd27700
Author: Alexander Motin <mav@FreeBSD.org>
Date: Sun Sep 26 12:03:05 2021 -0400
sched_ule(4): Fix hang with steal_thresh < 2.
e745d729be60 caused infinite loop with interrupts disabled in load
stealing code if steal_thresh set below 2. Such configuration should
not generally be used, but appeared some people are using it to
workaround some problems.
and I guessed they might be related.
> As a side note, I think we are missing ktrcsw() calls in some places,
> e.g., in turnstile_wait().
>
>> My test system is a VirtualBox amd64 FreeBSD 13.1 with one CPU running
>> on a 13.0 host.
>>
>> I just tried restarting the VM with 2 CPUs and the testcase seems to be
>> a lot better - it's been running in a loop for 10 minutes whereas
>> previously it would hang at least 1 in 5 times.
>
> Hmm. Could you, please, show the ktrace output with -H -T passed to
> kdump(1), together with fresh "procstat -kk" output?
>
> The fact that the problem apparently only occurs with 1 CPU suggests a
> scheduler bug, indeed.
Here is the fresh procstat output
paulf@freebsd:~ $ procstat -kk 12339
PID TID COMM TDNAME KSTACK
12339 100089 none-amd64-freebsd - mi_switch+0xc2
sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2
umtxq_sleep+0x143 do_wait+0x3e5 __umtx_op_wait+0x53 sys__umtx_op+0x7e
amd64_syscall+0x10c fast_syscall_common+0xf8
12339 100582 none-amd64-freebsd - mi_switch+0xc2
sleepq_catch_signals+0x2e6 sleepq_timedwait_sig+0x12 _sleep+0x1d1
kern_clock_nanosleep+0x1c1 sys_nanosleep+0x3b amd64_syscall+0x10c
fast_syscall_common+0xf8
12339 100583 none-amd64-freebsd - mi_switch+0xc2
ast+0x1e6 doreti_ast+0x1f
12339 100584 none-amd64-freebsd - mi_switch+0xc2
sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2
pipe_read+0x3d6 dofileread+0x81 sys_read+0xbc amd64_syscall+0x10c
fast_syscall_common+0xf8
12339 100585 none-amd64-freebsd - mi_switch+0xc2
intr_event_handle+0x167 intr_execute_handlers+0x4b Xapic_isr1+0xdc
sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2
pipe_read+0x3d6 dofileread+0x81 sys_read+0xbc amd64_syscall+0x10c
fast_syscall_common+0xf8
12339 100586 none-amd64-freebsd - mi_switch+0xc2
sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2
pipe_read+0x3d6 dofileread+0x81 sys_read+0xbc amd64_syscall+0x10c
fast_syscall_common+0xf8
12339 100587 none-amd64-freebsd - mi_switch+0xc2
sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2
pipe_read+0x3d6 dofileread+0x81 sys_read+0xbc amd64_syscall+0x10c
fast_syscall_common+0xf8
12339 100588 none-amd64-freebsd - mi_switch+0xc2
sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2
pipe_read+0x3d6 dofileread+0x81 sys_read+0xbc amd64_syscall+0x10c
fast_syscall_common+0xf8
12339 100589 none-amd64-freebsd - mi_switch+0xc2
sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2
pipe_read+0x3d6 dofileread+0x81 sys_read+0xbc amd64_syscall+0x10c
fast_syscall_common+0xf8
12339 100590 none-amd64-freebsd - mi_switch+0xc2
sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2
pipe_read+0x3d6 dofileread+0x81 sys_read+0xbc amd64_syscall+0x10c
fast_syscall_common+0xf8
12339 100591 none-amd64-freebsd - mi_switch+0xc2
sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2
pipe_read+0x3d6 dofileread+0x81 sys_read+0xbc amd64_syscall+0x10c
fast_syscall_common+0xf8
12339 100592 none-amd64-freebsd - mi_switch+0xc2
sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2
pipe_read+0xb3 dofileread+0x81 sys_read+0xbc amd64_syscall+0x10c
fast_syscall_common+0xf8
12339 100593 none-amd64-freebsd - mi_switch+0xc2
sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2
pipe_read+0xb3 dofileread+0x81 sys_read+0xbc amd64_syscall+0x10c
fast_syscall_common+0xf8
12339 100594 none-amd64-freebsd - mi_switch+0xc2
sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2
pipe_read+0xb3 dofileread+0x81 sys_read+0xbc amd64_syscall+0x10c
fast_syscall_common+0xf8
12339 100595 none-amd64-freebsd - mi_switch+0xc2
sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1f2
pipe_read+0xb3 dofileread+0x81 sys_read+0xbc amd64_syscall+0x10c
fast_syscall_common+0xf8
And the ktrace
12339 100591 none-amd64-freebsd 1654117859.896834 RET nanosleep 0
12339 100591 none-amd64-freebsd 1654117859.896839 CALL
sigprocmask(SIG_SETMASK,0x404563db0,0)
12339 100591 none-amd64-freebsd 1654117859.896840 RET sigprocmask 0
12339 100591 none-amd64-freebsd 1654117859.896842 CALL
thr_self(0x404563ca8)
12339 100591 none-amd64-freebsd 1654117859.896844 RET thr_self 0
12339 100591 none-amd64-freebsd 1654117859.896845 CALL
read(0x6f57,0x404563cfe,0x1)
12339 100591 none-amd64-freebsd 1654117859.896847 CSW stop kernel
"pipelk"
12339 100583 none-amd64-freebsd 1654117859.896849 GIO fd 28503 read
1 byte
"T"
12339 100583 none-amd64-freebsd 1654117859.896852 RET read 1
12339 100583 none-amd64-freebsd 1654117859.896853 CSW stop user "ast"
12339 100590 none-amd64-freebsd 1654117859.896854 CSW resume kernel
"pipelk"
12339 100590 none-amd64-freebsd 1654117859.896854 CSW stop kernel
"piperd"
12339 100589 none-amd64-freebsd 1654117859.896855 CSW resume kernel
"pipelk"
12339 100589 none-amd64-freebsd 1654117859.896855 CSW stop kernel
"piperd"
12339 100588 none-amd64-freebsd 1654117859.896856 CSW resume kernel
"pipelk"
12339 100588 none-amd64-freebsd 1654117859.896856 CSW stop kernel
"piperd"
[REPEATS, TRIMMMED]
12339 100590 none-amd64-freebsd 1654117859.896896 CSW resume kernel
"piperd"
12339 100590 none-amd64-freebsd 1654117859.896897 CSW stop kernel
"piperd"
12339 100089 none-amd64-freebsd Events dropped.
12339 100089 none-amd64-freebsd 1654118042.172196 CALL
thr_self(0x40268d7f8)
12339 100089 none-amd64-freebsd 1654118042.172198 RET thr_self 0
12339 100089 none-amd64-freebsd 1654118042.172202 CALL
thr_self(0x40268d788)
12339 100089 none-amd64-freebsd 1654118042.172204 RET thr_self 0
12339 100089 none-amd64-freebsd 1654118042.172206 CALL
read(0x6f57,0x40268d7de,0x1)
12339 100584 none-amd64-freebsd 1654118042.172217 CSW stop kernel
"piperd"
12339 100089 none-amd64-freebsd 1654118042.172221 CSW stop kernel
"piperd"
12339 100584 none-amd64-freebsd 1654118042.172222 CSW resume kernel
"piperd"
12339 100584 none-amd64-freebsd 1654118042.172222 CSW stop kernel
"piperd"
[REPEATS, TRIMMMED]
12339 100089 none-amd64-freebsd 1654118042.172259 CSW stop kernel
"piperd"
12339 100584 none-amd64-freebsd 1654118042.172260 CSW resume kernel
"piperd"
12339 100089 none-amd64-freebsd Events dropped.
12337 100093 sh 1654117857.635559 CSW stop kernel "wait"
12337 100093 sh 1654118057.965243 CSW resume kernel "wait"
12337 100093 sh 1654118057.965258 RET wait4 12339/0x3033
12337 100093 sh 1654118057.965356 CALL
fstatat(AT_FDCWD,0x7fffffffe0c0,0x7fffffffe4c0,0)
12337 100093 sh 1654118057.965359 NAMI
"/usr/share/nls/C.UTF-8/libc.cat"
12337 100093 sh 1654118057.965363 RET fstatat -1 errno 2 No
such file or directory
12337 100093 sh 1654118057.965367 CALL
fstatat(AT_FDCWD,0x7fffffffe0c0,0x7fffffffe4c0,0)
12337 100093 sh 1654118057.965369 NAMI
"/usr/share/nls/libc/C.UTF-8"
12337 100093 sh 1654118057.965371 RET fstatat -1 errno 2 No
such file or directory
12337 100093 sh 1654118057.965373 CALL
fstatat(AT_FDCWD,0x7fffffffe0c0,0x7fffffffe4c0,0)
12337 100093 sh 1654118057.965375 NAMI
"/usr/local/share/nls/C.UTF-8/libc.cat"
12337 100093 sh 1654118057.965377 RET fstatat -1 errno 2 No
such file or directory
12337 100093 sh 1654118057.965380 CALL
fstatat(AT_FDCWD,0x7fffffffe0c0,0x7fffffffe4c0,0)
12337 100093 sh 1654118057.965381 NAMI
"/usr/local/share/nls/libc/C.UTF-8"
12337 100093 sh 1654118057.965383 RET fstatat -1 errno 2 No
such file or directory
12337 100093 sh 1654118057.965410 CALL write(0x2,0x80184c000,0x7)
12337 100093 sh 1654118057.965417 GIO fd 2 wrote 7 bytes
"Killed
"
A+
Paul