Deadlock between nfsd and snapshots.

Tue Aug 22 13:08:06 UTC 2006

On Tue, Aug 22, 2006 at 07:13:15PM +1000, Bruce Evans wrote:
> On Mon, 21 Aug 2006, Tor Egge wrote:
> 
> >I wrote:
> >
> >>The deadlock indicates that one or more of IN_CHANGE, IN_MODIFIED or
> >>IN_UPDATE was set on the inode, indicating a write operation
> >>(e.g. VOP_WRITE(), VOP_RENAME(), VOP_CREATE(), VOP_REMOVE(), VOP_LINK(),
> >>VOP_SYMLINK(), VOP_SETATTR(), VOP_MKDIR(), VOP_RMDIR(), VOP_MKNOD()) that 
> >>was
> >>not protected by vn_start_write() or vn_start_secondary_write().
> >
> >The most common "write" operation was probably VOP_GETATTR().
> 
> Reading the attributes really is a write operation if it causes marks for
> update to be turned into updates.  ufs_inactive() is also a write operation
> if it causes this.  I think marking for update shouldn't require much 
> locking.
> 
> >ufs_itimes(), called from ufs_getattr(), might set the IN_MODIFIED inode 
> >flag
> >if IN_ACCESS is set on the inode even if neither IN_CHANGE nor IN_UPDATE is
> >set, transitioning the inode flags into a state where ufs_inactive() calls 
> >the
> >blocking variant of vn_start_secondary_write().
> 
> In the implementation of marking access times for update on exec, we try 
> hard
> to avoid calling vn_start_write(), etc.  Early implementations did call
> vn_start_write(), etc., and had some bugs from this.  The current
> implementation is mainly for ffs and is rather fragile: VOP_SETATTR()
> depends on callers calling vfs_start_write, but vfs_mark_atime() calls
> VOP_SETATTR() without calling vfs_start_write().  The correctness of this
> depends on the VA_MARK_ATIME case of VOP_SETATTR() not writing any more
> than VOP_READ() would or should.
> 
> I think ufs_itimes() shouldn't call vn_start_write() any more than
> ufs_getattr() should.  Callers should be aware that GETATTR may write
> and thus it seems to be necessary for them to call vn_start_write()
> unconditionally.  ufs_itimes() is a utility function that is called
> from places other than ufs_getattr().  The other places are ufs_*close()
> and ufs_setattr().  These don't cause any additional problems.
> VOP_CLOSE() is called from several places which already seem to have
> sufficient locking.  VOP_GETATTR() is called from many places that don't
> call vn_start_write(), starting with vn_stat().
> 
> The updates to timestamps in in ufs_itimes() and ufs_getattr() are still
> soft (not even delayed writes, but writes to the vnode that will usually
> become delayed writes later).  Perhaps vn_start_write() can be avoided
> for them, but since they are logically writes this might be hard to
> implement correctly.  ufs_inactive() has to do a (possibly delayed) physical
> write to force any updates to disk, so it needs strong locking.
> 
> BTW, ufs_itimes() has some possibly related kludges involving changing
> from r/w mounts to r/o mounts.  Some of the marks for update aren't
> handled quite right and are still present after the change.  Then they
> want to be turned into updates on a r/o file system.  This is impossible.
> The problem is handled bogusly, essentially by clearing them without doing
> the update, in a way that triggers seome of my local debugging code.

I have a proposal.

1. Remove IN_ACCESS, IN_UPDATE, IN_CHANGE from i_flag. For each flag,
introduce two new i_ fields, e.g., i_access of type timespec, and
i_accessed of boolean type.

2. All places that currently set IN_ACCESS, instead would increment i_accessed
using the atomic ops. ufs_itimes shall update i_access under some mutex if
i_accessed is greater than zero.

3. Check the i_access instead of the IN_ACCESS.

4. ffs_update and ffs_syncvnode shall do the DIP_SET(i_atime) under the
mutex from #2 before the main run and set IN_MODIFIED accordingly if
i_accessed is not 0.

4. ufs_getattr shall retrieve the *time from new i_ fields under the
mutex from #2 if corresponding i_ flag is set.

Basically, I want to set IN_MODIFIED i_flag (induced by IN_ACCESS and
others) only under exclusive vnode lock. Moreover, i_accessed can be
zeroed only under exclusive lock. This way, even shared lock on the
vnode shall be enough to safely update modification times, and the times
are moved to the disk often enough (at least, at the sync of the syncer
vnodes).

Am I missing something obvious there ? I want to hear you opinion before
starting to prototype the changes.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 187 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20060822/59bf8ef0/attachment.pgp