umount -f implementation

Mon Jun 29 19:06:09 UTC 2009

On Mon, 29 Jun 2009, Rick Macklem wrote:

> On Mon, 29 Jun 2009, Attilio Rao wrote:
>> 
>> While that should be real in principle (immediate shutdown of the fs
>> operation and unmounting of the partition) it is totally impossible to
>> have it completely unsleeping, so it can happen that also umount -f
>> sleeps / delays for some times (example: vflush).
>> Currently, umount -f is one of the most complicated thing to handle in
>> our VFS because it puts as requirement that vnodes can be reclaimed in
>> any moment, adding complexity and possibility for races.
>> 
> Yes, agreed. And I like to leave that stuff to more clever chaps than I:-)
>
>> What's the fix for your problem?
>> 
> Well, when I tested it I found that it got stuck in two places, both
> calls to VFS_SYNC(). The first was a
> 	sync();
> right at the beginning of umount.c.
> - All I did for that one is move it to after the code that handles
>  option processing and change it to
> 	if ((fflag & MNT_FORCE) == 0)
> 		sync();
>  so that it isn't done for the "-f" case. (I believe the sync(); call
>  at the beginning of umount is only a performance optimization, so I
>  don't think not doing it for "-f" should break anything.)

OK.  This sync() is probably actually a performance pessimization, since
it syncs all file systems while the internal sync in umount(2) only
syncs the one being unmounted.

> - the second happened just before the VFS_UNMOUNT() call in the
>  umount(2) system call. The code looks like:
> 	if (((mp->mnt_flag & MNT_RDONLY) ||
> 	     (error = VFS_SYNC(mp, MNT_WAIT)) == 0) || (flags & MNT_FORCE) != 
> 0)
>  - Although it was tempting to reverse the order of VFS_SYNC() and the
>    test for MNT_FORCE, I thought that might have a negative impact on
>    other file systems, since it avoided doing the VFS_SYNC(), so...
>
>  - Instead, I just put a check for MNTK_UNMOUNTF at the beginning of
>    nfs_sync(), so that it returns EBUSY for this case instead of getting
>    stuck trying to flush().

OK.  This sync is probably an optimization for correctness, since it
arranges to do as much as possible without forcing.

I checked ffs_mount() and found 2 large bugs, one related:
- in the only case that tends to cause problems, namely the non-readonly
   case, ffs_unmount() does a suspend which calls VOP_SYNC(..., MNT_SUSPEND),
   but after errors from this sync it checks neither MNT_FORCE nor
   error == ENXIO.  I think the usual effect is the same as if the top-level
   unmount() didn't check MNT_FORCE after suspend failure: in problematic
   cases, we have an unrecoverable write, due to the device going away or
   just an i/o error, and this error has probably already occured (only
   in rare cases will it be triggered by unmount).  Then MNT_FORCE is
   essentially unused, and the ENXIO hack is not reached either, and
   unmount usually fails.
- the UFS_EXTATTR case destroys infrastructure before committing to
   succeeding.  It used to be just broken on failure.  Now it uses a
   hack to recover (call a constructor) on failure, but the recovery
   code is not reached in the usual case of failure -- when the
   suspension fails.

ffs_unmount() still seems to have no support for handling unrecoverable
write errors (short of you converting them to ENXIO by removing the
media).  MNT_FORCE only meant FORCECLOSE for it.  I see that old nfs
was similar, and you are now making MNT_FORCE stronger.  I thought
that umount(8)'s man page documented -f being strongly forceful, but
checking it shows that it only documents a weak force like that of
FORCECLOSE (but not precisely enough).  Perhaps a different flag should
be used for strong forcefulness.  Weak forcefulness is still useful and
used for mount -f -u -- for remount we would never want errors in the
file system itself ignored.  This use also shows that the generic
FORCECLOSE code must not ignore errors.

> Assuming that I'm right w.r.t. the "sync();" at the beginning of umount.c,
> it simply ensures that the umount command thread makes it as far as
> VFS_UNMOUNT()->nfs_unmount(), so that the forced dismount proceeds. It
> kills RPCs in progress before doing the vflush() and, since no new RPCs
> can be done once MNTK_UNMOUNTF is set (it is checked at the beginning of
> a request), the vflush() won't actually flush anything to the server.
>
> As such, "umount -f" is pretty well guaranteed to throw away the dirty
> buffers. I believe this is correct behaviour,

This is how I think ffs_mount() should work too -- It should be
responsible for throwing away the dirty buffers, while nothing else
should discard them.  Now the discarding seems to be done by falling
through to g_vfs_done(), except g_vfs_done() is not reached in most
cases (see above).  I don't like this -- at best we lose the opportunity
to print ffs-specific details about what was discarded.  Falling through
only works for ENXIO anyway -- on other errors we should discard the
unwritable buffers in an fs-specific manner so as to write as many of
the writable buffers as possible.

> but it would mean that a
> user/sysadmin that uses "umount -f" for cases where the server is still
> functioning, but slow, will lose data when they probably don't expect to.

A new flag would help for this.

Bruce