Deadlock between nfsd and snapshots.

Wed Aug 23 14:05:08 UTC 2006

On Tue, 22 Aug 2006, Tor Egge wrote:

>> I have a proposal.

Sorry, I don't like it.

>> 1. Remove IN_ACCESS, IN_UPDATE, IN_CHANGE from i_flag. For each flag,
>> introduce two new i_ fields, e.g., i_access of type timespec, and
>> i_accessed of boolean type.
>
> On amd64, sizeof(struct timespec) is 16 bytes and sizeof(struct boolean_t) is 4
> bytes.  3 * (16 + 4) = 60 bytes extra per inode.  With 100K inodes that becomes
> 6 MB extra memory.
>
> I don't see why all these extra fields are needed.

I think i_access is to avoid changing the dinode, but changes to the dinode
are best avoided by just not changing it (as you give details on below).
IN_ACCESS is not in the dinode so it shouldn't need a new field.

>> 2. All places that currently set IN_ACCESS, instead would increment
>> i_accessed using the atomic ops. ufs_itimes shall update i_access under some
M
>> mutex if i_accessed is greater than zero.
>
> Protecting the existing i_flag and the timestamps with the vnode interlock when
> the current thread only has a shared vnode lock should be sufficient to protect
> against the races, removing the need for #3, #4 and #4 below.

I agree that this should be sufficient.  Don't know if it is.  Actually,
I thought that the vnode lock was more exclusive.  How can a shared lock
work even for a reader if a writer is changing inode contents?

> What's left is avoiding setting IN_MODIFIED when it's unsafe, to protect
> against the deadlock.
> 
>> 3. Check the i_access instead of the IN_ACCESS.
>
>> 4. ffs_update and ffs_syncvnode shall do the DIP_SET(i_atime) under the mutex
>> from #2 before the main run and set IN_MODIFIED accordingly if i_accessed is
>> not 0.

>From a followup:

% ffs_update shall be excluded, only ffs_syncvnode left in the list.
% ffs_syncvnode is enclosed in the vn_start_write braces.

>> 4. ufs_getattr shall retrieve the *time from new i_ fields under the mutex
>> from #2 if corresponding i_ flag is set.
>
>> Basically, I want to set IN_MODIFIED i_flag (induced by IN_ACCESS and others)
>> only under exclusive vnode lock. Moreover, i_accessed can be zeroed only
>> under exclusive lock. This way, even shared lock on the vnode shall be enough
>> to safely update modification times, and the times are moved to the disk
>> often enough (at least, at the sync of the syncer vnodes).
>
> An exclusive vnode lock isn't needed, see above.  Holding an exclusive vnode
> lock does not make it safe to set IN_MODIFIED.

Locking is complicated enough even if you can actually lock things :-(.

> There are some constraints with regards to setting IN_MODIFIED on an inode.
>
> If neither IN_CHANGE nor IN_UPDATE is set then it might be unsafe to set
> IN_MODIFIED since the file system might be suspended or in the process of being
> suspended with the vnode sync loop in ffs_sync() having iterated past the
> vnode.

The case can't happen.  IN_CHANGE is always set if IN_MODIFIED will
(or should be) set later.  There are some buggy cases where the combined
setting of { IN_CHANGE, IN_MODIFIED } is incorrect, but these don't cause
any problems here.  We just have to avoid setting (or having to set)
IN_MODIFED when setting it is not safe.  Hopefully this is only in suspend
mode.

> If the file system is suspended then IN_MODIFIED cannot be set. If IN_MODIFIED,
> IN_CHANGE or IN_UPDATE is set and the file system is suspended then something
> is wrong.

I think ufs_itimes() needs to use a non-blocking vn_start_write() and do
nothing (except perhaps assert that the above harmful IN_* flags are not
set) if it (ufs_itimes()) would set IN_MODIFIED.

> If the file system is in the process of being suspended then IN_MODIFIED can be
> set at the cost of triggering a restart of the vnode sync loop in ffs_sync().

Yes, once IN_MODIFIED is set, it is up to the sync loop or somewhere near it
to ensure that suspend mode is not entered while an IN_MODIFIED flag is set
for _any_ inode, irrespective of why or how IN_MODIFIED was set (whatever
set it should be free to call vn_finish_write()).

> If either IN_MODIFIED, IN_CHANGE or IN_UPDATE is already set then the vnode
> sync loop has not reached the vnode, and a restart isn't needed.
>
> When ufs_itimes() cannot set IN_MODIFIED then it has to either risk losing the
> access time update or use some mechanism to defer it (e.g. set IN_LAZYMOD or
> a new flag and let process_deferred_inactive() set IN_MODIFIED after the file
> system has been resumed).

Yes, IN_LAZYMOD can be used to reduce the problem, and not much would
be lost by throwing away atime changes that occur during a suspend if
there are too many to convert to IN_LAZYMOD.

Note that IN_LAZYMOD is completely (?) unused in -current:
- it us not used for soft updates for historical reasons (initially mostly
   FUD)
- it was only used for special files, but now devfs is used for special files.

I have used IN_LAZYMOD for atime-only updates of all file types in
ufs_itimes() for several years, but haven't tested it much since I mount
almost all file systems with -noatime.  The patch is simple.  IN_LAZYMOD
still gets turned into IN_MODIFIED in ufs_reclaim() so using it only for
atime-only updates wouldn't completely fix the problem.  It's hard for
ufs_reclaim() to do anything except discard IN_LAZYMOD updates now.

Summary of my understanding of this problem:
- no problem with normal "writes" (at least modulo the sync loop checking
   the correct flags and/or bugs related to missing settings of IN_MODIFED).
   We wait until writes complete before entrering suspend mode, and don't
   allow writes in while in suspend mode.
- problem for atime updates caused by reads.  These become writes while in
   suspend mode.  We want reads to work in suspend mode, so we cannout
   disallow reads, and we cannot disallow the implicit writes without
   breaking atime semantics.  This problem can be partly avoided by
   ignoring IN_ACCESS in ufs_itimes() while in suspend mode, or by
   converting it to IN_LAZYMOD.  If the inode gets reclaimed then we
   lose the atime update; otherwise the atime gets updated some time
   after we leave suspend mode, and either way the update doesn't go
   to snapshots, as is necessary for having a coherent snapshot.
- problem for IN_LAZYMOD in ufs_reclaim().  Currently not reached.  A quick
   fix would be to lose whatever updates are being done lazily.
- problem syncing atimes while entering suspend mode.  For writes, hopefully
   we get a consistent snapshot by disallowing new writes while entering sync
   mode (not just when it is entered).  This doesn't work for reads.

Bruce