Hang ast / pipelk / piperd

From: Paul Floyd <paulf2718_at_gmail.com>
Date: Fri, 27 May 2022 22:13:52 UTC

I'm debugging two issues with Valgrind on FreeBSD 13.1 and 14, one on 
amd64 and one on i386.

The first testcase, on i386, creates 10 threads that each just call 
pause(). Then there is a fork(); the parent calls pause() and the child 
kills the parent. This hang is reproducible.

The second testcase, on amd64, runs a loop for 7 tests, each one 
creating 2 threads. The thread function writes either to a global 
variable or various types of TLS, using a nanosleep as a way to yield 
between the threads. This hang is intermittent.

The above detail is probably not that relevant.

In both examples Valgrind is hanging with 100% CPU use.

In the ktrace output, at the point where things seem to go wrong, there is:

  9340 none-amd64-freebsd GIO   fd 28503 read 1 byte
       "X"
  9340 none-amd64-freebsd RET   read 1
  9340 none-amd64-freebsd CSW   stop user "ast"
  9340 none-amd64-freebsd CSW   resume kernel "pipelk"
  9340 none-amd64-freebsd CSW   stop kernel "piperd"
  9340 none-amd64-freebsd CSW   resume kernel "pipelk"
  9340 none-amd64-freebsd CSW   stop kernel "piperd"
  ... repeat until killed

That read is a pipe used for the Valgrind scheduler lock. The scheduler
runs single threaded, and the read above means that one thread has
acquired the lock and should be able to run.
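
For readers unfamiliar with the idea, a pipe can act as a lock by
holding a single token byte: reading the byte acquires the lock and
writing it back releases it. The following is only a minimal sketch of
that technique, not Valgrind's actual code; the struct and function
names are hypothetical.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical pipe-based lock: one token byte in the pipe means "free". */
struct pipe_lock {
    int fds[2]; /* fds[0] is the read end, fds[1] is the write end */
};

/* Create the pipe and put the single token in, so the lock starts free. */
int pipe_lock_init(struct pipe_lock *l)
{
    char token = 'X';

    if (pipe(l->fds) != 0)
        return -1;
    return write(l->fds[1], &token, 1) == 1 ? 0 : -1;
}

/* Acquire: block in read() until the token byte is available. */
int pipe_lock_acquire(struct pipe_lock *l)
{
    char token;
    ssize_t n;

    do {
        n = read(l->fds[0], &token, 1);
    } while (n < 0 && errno == EINTR); /* retry if interrupted by a signal */

    return n == 1 ? 0 : -1;
}

/* Release: write the token back so another thread's read() can return. */
int pipe_lock_release(struct pipe_lock *l)
{
    char token = 'X';
    ssize_t n;

    do {
        n = write(l->fds[1], &token, 1);
    } while (n < 0 && errno == EINTR);

    return n == 1 ? 0 : -1;
}
```

In this scheme a successful one-byte read, like the one in the trace
above, is the point at which a thread owns the lock.
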
Instead it looks like there is an ast that gets the kernel stuck in
context switches to pipe read and pipe lock states. kill -9 is the only
way out.

This all worked OK from FreeBSD 11.3 to 13.0.

It's quite difficult to trace this within Valgrind. Both hangs seem
quite sensitive to timing: in both cases adding or changing nanosleep
times seems to make them no longer hang. Adding debug statements to
Valgrind can also change the behaviour (and is also unsafe when not
holding the scheduler lock).

Does this look like a kernel bug?

A+
Paul