deadlocks with intr NFS mounts and ^Z (or: PCATCH and sleepable locks)

Sat Jun 20 20:33:30 UTC 2009

On Sat, Jun 20, 2009 at 07:15:40PM +0300, Kostik Belousov wrote:
> On Fri, Jun 19, 2009 at 06:23:28PM +0200, Jilles Tjoelker wrote:
> > I have been having trouble with deadlocks with NFS mounts for a while,
> > and I have found at least one way it can deadlock. It seems an issue
> > with the sleep/lock system.

> > NFS sleeps while holding a lockmgr lock, waiting for a reply from the
> > server. When the mount is set intr, this is an interruptible sleep, so
> > that interrupting signals can abort the sleep. However, this also means
> > that SIGSTOP etc will suspend the thread without waking it up first, so
> > it will be suspended with a lock held.

> > If it holds the wrong locks, it is possible that the shell will not be
> > able to run, and the process cannot be continued in the normal manner.

> > Due to some other things I do not understand, it is then possible that
> > the process cannot be continued at all (SIGCONT seems ignored), but in
> > simple cases SIGCONT works, and things go back to normal.

> > In any case, this situation is undesirable, as even 'umount -f' doesn't
> > work while the thread is suspended.

> > Of course, this reasoning applies to any code that goes to sleep
> > interruptibly while holding a lock (sx or lockmgr). Is this supposed to
> > be possible (likely useful)? If so, a third type of sleep would be
> > needed that is interrupted by signals but not suspended? If not,
> > something should check that it doesn't happen and NFS intr mounts may
> > need to check for signals periodically or find a way to avoid sleeping
> > with locks held.

> > The td_locks field is only accessible for the current thread, so it
> > cannot be used to check if suspending is safe.

> > Also, making SIGSTOP and the like interrupt/restart syscalls is not
> > acceptable unless you find some way to do it such that userland won't
> > notice. For example, a read of 10 megabytes from a regular file with
> > that much available must not return less then 10 megabytes.

> Note that NFS does check for the signals during i/o, so you may get
> short reads anyway.

> I do think that the right solution both there and with SINGLE_NO_EXIT
> case for thread_single is to stop at the usermode boundary instead of
> suspending a thread in the interruptible sleep state.

> I set error code returned from interrupted msleep() to ERESTART,
> that seems to be the right thing, at least to restart the i/o that
> transferred no data upon receiving SIGSTOP.

Any such short read on a regular file is wrong. That that badness
already occurs in some cases is not an excuse to make it occur more
often. Particularly because process suspension is expected not to affect
the process and interrupting syscalls would change the behaviour of the
debugged program significantly, while the current interruptions only
occur with signals that likely terminate the process anyway (note that
intr mounts only check for SIGINT, SIGTERM, SIGHUP, SIGKILL, SIGSTOP and
SIGQUIT and appear to mask all others; I don't know why SIGTSTP gets
through -- possibly a thread/process difference).

No matter the SIGSTOP issue, a warning about the interruptions in the
mount_nfs(8) man page may be in order; the current language makes the
impression that intr is only a good thing, which is not the case. This
applies to all NFS versions. A better way to deal with nonresponsive NFS
servers that will not come back would be forced unmount (does it always
work, apart from the case mentioned above? same for the experimental
client?). SIGKILL (but not any other signal, not even SIGSTOP) could
also be allowed on processes blocked on nointr mounts.

Another point (mostly for socket operations and the like) is that the
current causes of interrupted system calls are under control of the
application: if you do not catch any signals, you will only get short
read/writes for reasons related to the underlying object; hence, it is
often not necessary to add (ugly) code to handle it: any unexpected
short read or write is a problem with the underlying object.

Another example which currently works and would be a shame to break:

% /usr/bin/time sleep 10
^Z
zsh: suspended  /usr/bin/time sleep 10
% fg
[1]  + continued  /usr/bin/time sleep 10
       10.00 real         0.00 user         0.00 sys
%

What's more, the fact that this works is thanks to the kernel. sleep(1)
just calls nanosleep(2), and because it doesn't catch any signals, that
suffices.

I do notice this is already broken for debuggers. Attaching gdb or truss
to a running sleep process immediately aborts the nanosleep with EINTR.

-- 
Jilles Tjoelker