[PATCH] Close a race between exit1() and SIGSTOP

John Baldwin jhb at freebsd.org
Fri Sep 7 18:19:24 UTC 2012


On Friday, September 07, 2012 12:52:10 pm Konstantin Belousov wrote:
> On Fri, Sep 07, 2012 at 12:17:47PM -0400, John Baldwin wrote:
> > We ran into a bug at work recently where processes were reporting an
> > exit status from wait(2) of WIFSIGNALED() with WTERMSIG() of SIGSTOP.
> > I narrowed this down to a race in exit1(). Specifically, exit1()
> > sets p->p_xstat to the passed in exit status and sets P_WEXIT. It
> > then drops the PROC_LOCK() to do other work such as freeing file
> > descriptors, etc. Later it reacquires the PROC_LOCK(), marks the
> > process as a zombie, and terminates. During the window where the
> > PROC_LOCK() is not held, if a stop signal arrives (SIGSTOP, SIGTSTP,
> > etc.), then the signal code will overwrite p->p_xstat with the signal
> > that initiated the stop. The end result is that setting from SIGSTOP
> > stays in p->p_xstat and is reported via wait(2).
> >
> > My first attempt at a fix was to simply ignore all signals once
> > P_WEXIT is set by adding a check for P_WEXIT to the existing check for
> > PRS_ZOMBIE. However, I think this is not quite right. For example, if
> > an exiting process gets hung on an interruptible NFS mount while doing
> > fdfree() I think we want a user to be able to kill the process to let
> > it break out of NFS and finish exiting.
> >
> > The second fix I have is to explicitly clear P_STOPPED_SIGNAL to
> > discard any pending SIGSTOP when setting P_WEXIT and to explicitly
> > disallow any stop signals for a process that is currently exiting
> > (that is, P_WEXIT is set). This fixes the race I ran into while still
> > allowing other signals during process exit. The patch to do that is
> > below. Below that is a test program that reproduces the issue.
> >
> > I would really like to get some feedback on which approach is best
> > and if my concerns about allowing other signals during exit1() is a
> > legitimate concern. (For some reason I feel like I've seen an argument
> > for allowing that before.)
> >
> 
> Argument ? Please see r163541. I hope you agree with the author, who
> basically provided the same reasoning about interruptible NFS mounts.

I was using "argument" as in definition 4 here:

http://dictionary.reference.com/browse/argument

rather than definitions 1 or 2.

As I said, I thought I recalled there being a good reason for allowing this. 
:)

> So I agree with r163541, which log message says that I requested it.
> And I do not see any other possible solution except case 2. Shouldn't
> the fix actually prevent any overwrite of p_xstat for P_WEXIT process
> from the signal delivery code ?

The solution does do that AFAICT.  SIGCONT should be a nop since the 
process should never be in a stopped state once P_WEXIT is set, and
the patch explicitly turns SIGTOP into a nop.

I added assertions to make sure that no signal is delivered to a
stopped process while P_WEXIT is set as well as in the other places
that overwrite p_xstat but that I think are unreachable (e.g. in
ptracestop()).

-- 
John Baldwin


More information about the freebsd-arch mailing list