umount -f implementation

Mon Jun 29 14:36:29 UTC 2009

On Mon, 29 Jun 2009, Attilio Rao wrote:

> 2009/6/29 Rick Macklem <rmacklem at uoguelph.ca>:
>> I just noticed that when I do the following:
>> - start a large write to an NFS mounted fs
>> - network partition the server (unplug a net cable)
>> - do a "umount -f <mntpoint>" on the machine
>>
>> that it gets stuck trying to write dirty blocks to the server.
>>
>> I had, in the past, assumed that a "umount -f" of an NFS mount would be
>> used to get rid of an NFS mount on an unresponsive server and that loss
>> of "writes in progress" would be expected to happen.
>>
>> Does that sound correct? (In other words, am I seeing a bug or a feature?)
>
> While that should be the case in principle (immediate shutdown of the fs
> operations and unmounting of the partition), it is impossible to make
> it completely non-sleeping, so umount -f can also sleep / delay for
> some time (example: vflush).
> Currently, umount -f is one of the most complicated things to handle in
> our VFS because it requires that vnodes can be reclaimed at any
> moment, adding complexity and the possibility of races.
>
Yes, agreed. And I like to leave that stuff to more clever chaps than I:-)

> What's the fix for your problem?
>
Well, when I tested it I found that it got stuck in two places, both
calls to VFS_SYNC(). The first was a
 	sync();
right at the beginning of umount.c.
- All I did for that one is move it to after the code that handles
   option processing and change it to
 	if ((fflag & MNT_FORCE) == 0)
 		sync();
   so that it isn't done for the "-f" case. (I believe the sync(); call
   at the beginning of umount is only a performance optimization, so
   skipping it for "-f" shouldn't break anything.)

- The second happened just before the VFS_UNMOUNT() call in the
   unmount(2) system call. The code looks like:
 	if (((mp->mnt_flag & MNT_RDONLY) ||
 	     (error = VFS_SYNC(mp, MNT_WAIT)) == 0) || (flags & MNT_FORCE) != 0)
   - Although it was tempting to reverse the order of VFS_SYNC() and the
     test for MNT_FORCE, I thought that might have a negative impact on
     other file systems, since it would skip their VFS_SYNC() on a forced
     unmount, so...

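   - Instead, I just put a check for MNTK_UNMOUNTF at the beginning of
     nfs_sync(), so that it returns EBUSY for this case instead of getting
     stuck trying to flush dirty buffers. Because the MNT_FORCE test is
     OR'd into the condition above, VFS_UNMOUNT() is still called even
     though VFS_SYNC() failed.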

Assuming that I'm right w.r.t. the "sync();" at the beginning of umount.c,
the patch simply ensures that the umount command thread makes it as far as
VFS_UNMOUNT()->nfs_unmount(), so that the forced dismount proceeds.
nfs_unmount() kills RPCs in progress before doing the vflush() and, since
no new RPCs can be done once MNTK_UNMOUNTF is set (it is checked at the
beginning of each request), the vflush() won't actually flush anything to
the server.
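A hypothetical model of that ordering shows why the vflush() ends up discarding dirty data. None of these sk_* names are real kernel interfaces; EIO as the refusal error and the buffer counters are assumptions made for the sketch, and the flag is kept on the NFS mount struct purely for simplicity.

```c
#include <assert.h>
#include <errno.h>

#define SK_MNTK_UNMOUNTF	0x4	/* hypothetical flag value */

struct sk_nfsmount {
	int	kern_flag;		/* forced-unmount flag lives here (simplified) */
	int	rpcs_in_progress;	/* outstanding RPCs */
	int	dirty_bufs;		/* buffers not yet written to the server */
	int	bufs_written;		/* buffers the server actually received */
	int	bufs_discarded;		/* buffers thrown away */
};

/* Every new request checks the flag first, as described above. */
static int
sk_rpc_write(struct sk_nfsmount *nmp)
{
	if (nmp->kern_flag & SK_MNTK_UNMOUNTF)
		return (EIO);	/* refuse: forced dismount in progress */
	nmp->bufs_written++;
	return (0);
}

/* vflush() stand-in: try to write each dirty buffer, drop it on failure. */
static void
sk_vflush(struct sk_nfsmount *nmp)
{
	while (nmp->dirty_bufs > 0) {
		if (sk_rpc_write(nmp) != 0)
			nmp->bufs_discarded++;
		nmp->dirty_bufs--;
	}
}

/* Forced-unmount order: flag already set, in-flight RPCs killed, vflush(). */
static void
sk_nfs_unmount_forced(struct sk_nfsmount *nmp)
{
	assert(nmp->kern_flag & SK_MNTK_UNMOUNTF);	/* set earlier by dounmount() */
	nmp->rpcs_in_progress = 0;			/* models killing outstanding RPCs */
	sk_vflush(nmp);
}
```

Starting from five dirty buffers with SK_MNTK_UNMOUNTF set, sk_nfs_unmount_forced() leaves bufs_written at 0 and bufs_discarded at 5; calling sk_vflush() with the flag clear writes every buffer instead.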

As such, "umount -f" is pretty well guaranteed to throw away the dirty
buffers. I believe this is correct behaviour, but it would mean that a
user/sysadmin that uses "umount -f" for cases where the server is still
functioning, but slow, will lose data when they probably don't expect to.

Does this help? rick
ps: During simple testing, it has worked ok. It waits about 1 minute for
     the RPC threads to shut down, but the "umount -f" does complete after
     that happens. If the consensus is that patching this is a
     good idea, I'll get some more testing done.