[Bug 290843] killpg deadlock against a stopped interrupted fork

From: <bugzilla-noreply_at_freebsd.org>
Date: Thu, 06 Nov 2025 19:09:03 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=290843

            Bug ID: 290843
           Summary: killpg deadlock against a stopped interrupted fork
           Product: Base System
           Version: CURRENT
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: bdrewery@FreeBSD.org
                CC: kib@FreeBSD.org, markj@FreeBSD.org

This is on CURRENT at 55c28005f544282b984ae0e15dacd0c108d8ab12, but I've seen
it for a few years. I had to disable the DEADLKRES option because I hit it so
often in Poudriere tests. I finally found a simple repro today.

Basic summary: `killpg(pgid, SIGSTOP)` against a child that is in the middle of
a fork blocks any further `killpg(pgid)`.

Repro given later. Here's the simplest result:

```
# procstat -t 51155 31783
  PID    TID COMM                TDNAME              CPU  PRI STATE   WCHAN
51155 302071 sh                  -                    -1  115 sleep   killpg r
31783 128716 sh                  -                    -1  115 stop    -

# procstat -kk 51155 31783
  PID    TID COMM                TDNAME              KSTACK
51155 302071 sh                  -                   mi_switch+0x172 sleepq_switch+0x109 _sx_xlock_hard+0x513 _sx_xlock+0xac killpg1+0x138 kern_kill+0x222 amd64_syscall+0x451 fast_syscall_common+0xf8
31783 128716 sh                  -                   mi_switch+0x172 thread_suspend_check+0xbd sig_intr+0x7a fork1+0x448 sys_fork+0x54 amd64_syscall+0x451 fast_syscall_common+0xf8
```
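To spot this state on a busy machine, the stacks above can be filtered mechanically. The following is a minimal sketch, not part of the original report: `killpg_waiters` is a hypothetical helper name, and it assumes the `procstat -kk` layout shown above (PID and TID in the first two columns, stack frames after them, possibly wrapped onto continuation lines).

```sh
#!/bin/sh
# Hypothetical helper: read `procstat -kk` output on stdin and print
# "PID TID" for every thread whose kernel stack shows it waiting on an
# sx lock inside killpg1().
killpg_waiters() {
    awk '
        # Print the record collected so far if it matches the pattern.
        function emit() {
            if (pid != "" && stack ~ /killpg1\+/ && stack ~ /_sx_xlock/)
                print pid, tid
        }
        # A new record starts with a numeric PID followed by a numeric TID.
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
            emit()
            pid = $1; tid = $2; stack = $0
            next
        }
        # Any other line after the first record is treated as a wrapped
        # continuation of the current stack.
        pid != "" { stack = stack " " $0 }
        END { emit() }
    '
}
```

For example, `procstat -kk -a | killpg_waiters` would list candidate threads system-wide; a match only means the thread is waiting in killpg1(), not that it is deadlocked.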

Using `kill -CONT -31783` (a process group target, so it goes through killpg)
blocks on the "killpg racer" lock, while `kill -CONT 31783`, which avoids
killpg, does not block.
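Since signalling by plain PID does not block, one way to resume a stuck group while debugging is to signal each member individually. This is a sketch of a possible workaround, not a fix and not something from the original report: `cont_each_in_pgrp` is a hypothetical name, and the member listing is inherently racy because a forking process can add members after `ps` has run.

```sh
#!/bin/sh
# Hypothetical workaround helper: send SIGCONT to every member of a process
# group by PID, so the process-group (killpg) path is never used.
cont_each_in_pgrp() {
    pgid=$1
    [ -n "$pgid" ] || return 2
    # List "PID PGID" pairs for all processes and keep the matching PIDs.
    for pid in $(ps -A -o pid= -o pgid= |
        awk -v g="$pgid" '$2 == g { print $1 }'); do
        # A member may exit between the listing and the kill; ignore that.
        kill -CONT "$pid" 2>/dev/null || :
    done
}
```

With the PIDs shown above this would be invoked as `cont_each_in_pgrp 31783`, assuming 31783 is also the process group ID, as the `kill -CONT -31783` call implies.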

Repro:

```
# `kill -STOP; kill -TERM; kill -CONT` against a forking job (job control
# enabled).
# foo() is trying to repro a blank $() value, which is not required for the
# repro but brings in enough forking to trigger the problem quickly, so I left
# it in.

sh -c 'trap "kill -9 %1; exit" INT; foo() { unset cmd; cmd=$(/sbin/sysctl -n vm.loadavg|/usr/bin/awk "{print \$2,\$3,\$4}"); case "${cmd:+set}" in set) ;; *) exit 99 ;; esac }; runner() { while foo; do :; done }; launch() { local -; set -m; PS4="child+ " runner & }; set -x; while :; do launch; sleep 0.1; kill -STOP %1; kill -TERM %1; kill -CONT %1; ret=0; wait; if [ $ret -eq 99 ]; then exit 99; fi; done;'
```

It appears https://reviews.freebsd.org/D40493 and
https://reviews.freebsd.org/D41128 may contain relevant discussion and earlier
attempts at a fix.
