svn commit: r329448 - head/sys/kern

Mateusz Guzik mjguzik at gmail.com
Sat Feb 17 17:34:36 UTC 2018


On Sat, Feb 17, 2018 at 5:38 PM, Peter Holm <peter at holm.cc> wrote:

> On Sat, Feb 17, 2018 at 06:26:32PM +0200, Konstantin Belousov wrote:
> > On Sat, Feb 17, 2018 at 05:07:07PM +0100, Mateusz Guzik wrote:
> > > On Sat, Feb 17, 2018 at 01:27:38PM +0200, Konstantin Belousov wrote:
> > > > On Sat, Feb 17, 2018 at 08:48:46AM +0000, Mateusz Guzik wrote:
> > > > > Author: mjg
> > > > > Date: Sat Feb 17 08:48:45 2018
> > > > > New Revision: 329448
> > > > > URL: https://svnweb.freebsd.org/changeset/base/329448
> > > > >
> > > > > Log:
> > > > >   exit: get rid of PROC_SLOCK when checking a process to report
> > > > Was this tested ?
> > > >
> > >
> > > I was trussing multithreaded microbenchmarks, no issues.
> > >
> > > > In particular, are you aware of r309539 ?
> > > >
> > >
> > > So it looks like I misread the code - I have grepped
> > > thread_suspend_switch operating with the proc locked and misread
> > > thread_suspend_one's assert as PROC_LOCK_ASSERT.
> > >
> > > That said, I think this is harmless. Regardless of the lock the
> > > inspecting thread can race and check "too soon". Even for a case where
> > > it decides to report, I don't see anything which would depend on the
> > > suspending thread to finish.
> > It was definitely not harmless when I tried to avoid the spin lock there,
> > but I do not remember exact failure mode.  Most likely, it was a missed
> > report of the traced child indeed, but I am not sure that truss triggered
> > it.  Most likely, Peter Holm was the reporter, since he is listed in
> > the commit.
> >
>
> I ran a truss(1) test on r329456 and it fails. I have not had a
> chance to look closer at this, but this is what I see:
>
> [root at mercat1 /home/pho]# pgrep truss | xargs ps -Hlp
> UID   PID  PPID CPU PRI NI   VSZ  RSS MWCHAN STAT TT     TIME COMMAND
>   0 41149 41118   0  52  0 11532 2588 wait   I     0  0:01.38 truss
> /tmp/ttruss 10
>   0 41151 41149   0  52  0 13156 2300 -      TX    0  0:00.98 /tmp/ttruss
> 10
>   0 41151 41149   0  52  0 13156 2300 -      TX    0  0:00.00 /tmp/ttruss
> 10
> [root at mercat1 /home/pho]# procstat -k 41151
>   PID    TID COMM                TDNAME              KSTACK
> 41151 100211 ttruss              -                   mi_switch
> thread_suspend_switch ptracestop amd64_syscall fast_syscall_common
> 41151 100765 ttruss              -                   mi_switch
> thread_suspend_check ast doreti_ast
> [root at mercat1 /home/pho]#
>
>
Ok, I reproduced the bug with your script. I reverted the change.

The patch I mailed in this thread fixes it for me. Below is a variant
which can be applied on top of fresh head:

https://people.freebsd.org/~mjg/wait6_slock.diff

Now that the bug got reported it is rather obvious:
the suspending thread does lock -> wakeup -> slock -> unlock -> sunlock ->
sleep

Only locking the proc puts as in a spot where we are past the wakeup,
but before it gets the chance to bump the counter.
On the other hand if we slock, we effectively wait for it to bump and go
to sleep after which we see what's going on.

-- 
Mateusz Guzik <mjguzik gmail.com>


More information about the svn-src-all mailing list