Deadlock between nfsd and snapshots.

Tue Aug 22 09:15:45 UTC 2006

On Mon, 21 Aug 2006, Tor Egge wrote:

> I wrote:
>
>> The deadlock indicates that one or more of IN_CHANGE, IN_MODIFIED or
>> IN_UPDATE was set on the inode, indicating a write operation
>> (e.g. VOP_WRITE(), VOP_RENAME(), VOP_CREATE(), VOP_REMOVE(), VOP_LINK(),
>> VOP_SYMLINK(), VOP_SETATTR(), VOP_MKDIR(), VOP_RMDIR(), VOP_MKNOD()) that was
>> not protected by vn_start_write() or vn_start_secondary_write().
>
> The most common "write" operation was probably VOP_GETATTR().

Reading the attributes really is a write operation if it causes marks for
update to be turned into updates.  ufs_inactive() is also a write operation
if it causes this.  I think marking for update shouldn't require much locking.

> ufs_itimes(), called from ufs_getattr(), might set the IN_MODIFIED inode flag
> if IN_ACCESS is set on the inode even if neither IN_CHANGE nor IN_UPDATE is
> set, transitioning the inode flags into a state where ufs_inactive() calls the
> blocking variant of vn_start_secondary_write().

In the implementation of marking access times for update on exec, we try hard
to avoid calling vn_start_write(), etc.  Early implementations did call
vn_start_write(), etc., and had some bugs from this.  The current
implementation is mainly for ffs and is rather fragile: VOP_SETATTR()
depends on callers calling vfs_start_write, but vfs_mark_atime() calls
VOP_SETATTR() without calling vfs_start_write().  The correctness of this
depends on the VA_MARK_ATIME case of VOP_SETATTR() not writing any more
than VOP_READ() would or should.

I think ufs_itimes() shouldn't call vn_start_write() any more than
ufs_getattr() should.  Callers should be aware that GETATTR may write
and thus it seems to be necessary for them to call vn_start_write()
unconditionally.  ufs_itimes() is a utility function that is called
from places other than ufs_getattr().  The other places are ufs_*close()
and ufs_setattr().  These don't cause any additional problems.
VOP_CLOSE() is called from several places which already seem to have
sufficient locking.  VOP_GETATTR() is called from many places that don't
call vn_start_write(), starting with vn_stat().

The updates to timestamps in in ufs_itimes() and ufs_getattr() are still
soft (not even delayed writes, but writes to the vnode that will usually
become delayed writes later).  Perhaps vn_start_write() can be avoided
for them, but since they are logically writes this might be hard to
implement correctly.  ufs_inactive() has to do a (possibly delayed) physical
write to force any updates to disk, so it needs strong locking.

BTW, ufs_itimes() has some possibly related kludges involving changing
from r/w mounts to r/o mounts.  Some of the marks for update aren't
handled quite right and are still present after the change.  Then they
want to be turned into updates on a r/o file system.  This is impossible.
The problem is handled bogusly, essentially by clearing them without doing
the update, in a way that triggers seome of my local debugging code.

> calling ufs_itimes() with only a shared vnode lock might cause unsafe accesses
> to the inode flags.  Setting of IN_ACCESS at the end of ffs_read() and
> ffs_extread() might also be unsafe.  If DIRECTIO is enabled then O_DIRECT reads
> might not even attempt to set the IN_ACCESS flag.

I think setting it should be safe -- see above.

Not setting the IN_ACCESS flag in the early return for the O_DIRECT case
is a different bug.  read() is supposed to set IN_ACCESS on sucessful
completion and returning early for the O_DIRECT case would defeat this.

Bruce