fix for per-mount i/o counting in ffs

Wed May 18 23:03:27 UTC 2016

On Wed, 18 May 2016, Konstantin Belousov wrote:

> On Wed, May 18, 2016 at 10:00:09AM +1000, Bruce Evans wrote:
>> On Wed, 18 May 2016, Konstantin Belousov wrote:
>>> VCHR check ensures that the devvp vnode is not reclaimed. I do not want
>>> to remove the check and rely on the caller of ffs_mountfs() to always do
>>> the right thing for it without unlocking devvp, this is too subtle.
>>
>> Surely the caller must lock devvp?  Otherwise none of the uses of devvp
>> can be trusted, and there are several others.
> It must lock, but the interface of ffs_mountfs() would then require
> that there is no relock between vn_isdisk() check and call.
>
> I think I know how to make a good compromise there.  I converted the
> check for VCHR into the assert.

But it is very clear there is no re-lock, and that there must be no
re-lock to work ("very clear" relative other complications).
ffs_mountfs() is only called once and only exists to make the function
more readable and debuggable (and auto-inlining it breaks debugging).
Its nearby logic is:

 	namei();			// lock vnode
 	vn_isdisk();			// return if not
 	if (MNT_UPDATE)
 		fail_sometimes();	// locking problems -- see below
 	else
 		ffs_mountfs();		// clearly guaranteed still VCHR

I found another locking problem for revoke.  After mounting /i and revoking
its device, mount -u fails.  This is clearly because its rdev has gone
away.  This makes devvp->v_rdev != ump->um_devvp->v_rdev.  The new devvp
has the old rdev and the old devvp has a null rdev.  This is not really
a locking problem, but the correct behaviour.  Most places just don't
check.

>> There is also ump->um_devvvp, but this seems to be unusable since it
>> might go away.
> Go away as in being reclaimed, yes.  The vnode itself is there, since
> we keep a reference.

I think "reclaimed" is the wrong terminology.  The reference prevents
it being reclaimed by vnlrureclaim(), but doesn't prevent it being
revoked (or vgone()d by a forced unmount of the devfs instance that it
is on).  The reference prevents it being reclaimed even if it is
revoked.  When it is revoked, some but apparently not all of the pointers
in it are cleared or become garbage.  None of them should be used, but
some are.  v_rdev is cleared and we are fairly careful not to follow it,
but we depend on it being cleared and not garbage.  Pointers that are
not cleared include v_bufobj (apparently) and GEOM's hooks related to
v_bufobj, and si_mountpt.  si_mountpt is in the cdev and not in the
vnode.

>> So using the devvp->v_rdev instead of the dev variable is not just a
>> style bug.
> Might be.

In some places.  ump->um_devvp->v_rdev gives the old rdev, and
devvp->v_rdev gives the current rdev provided devvp is locked.  These
can be compared to see if the old rdev was revoked.  Otherwise,
devvp->v_rdev is garbage and both ump->um_dev and ump->um_devvp are
close to garbage -- they are both old and the only correct use of
this is to check that they are still current, but then you have the
current devvp (locked) and can use it instead.

>> ...
>> You only needed to move the unlocking to fix.  devvp->v_bufobj.  How does
>> that work?  The write is now locked, but if devvp goes away, then don't
>> we lose its bufobj?
> The buffer queues are flushed, and BO_DEAD flag is set.  But the flag
> does very little.
>
>> How does any use of ump->um_devvp work?

The problems are similar to the ones with ttys that we are still working
on.  When the device is revoked, there may be many i/o's in progess on
it.  We don't want to block waiting for these, but they should be aborted
before doing any more.  But there are enough stale pointers to even allow
new i/o's.  Enough for tar cf of a complete small file system.

>> I tried revoke(2) on the devvp of a mounted file system.  This worked
>> to give v_type = VBAD and v_rdev = NULL, but didn't crash.  ffs_unmount()
>> checked for the bad vnode, unlike most places, and failed to clear
>> si_mountpt.
>>
>> Normal use doesn't have revokes, but if the vnode is reclaimed instead
>> of just becoming bad, then worse things probably happen.  I think vnode
>> cache resizing gives very unstable storage so the pointer becomes very
>> invalid.  But even revoke followed by setting kern.numvnodes to 1 didn't
>> crash (15 vnodes remained).  So devvp must be referenced throughout.
>> It seems to have reference count 2, since umounting reduced kern.numvnodes
>> from 15 to 13.  (It is surprising how much works with kern.maxvnodes=1.
>> I was able to run revoke, sysctl and umount.)  It is still a mystery
>> that the VBAD vnode doesn't crash soon.
>>
> I believe that bo_ops assignment is the reason why UFS mounts survive the
> reclamation of the devvp vnode.  Take a look at the ffs_geom_strategy(),
> which is the place where UFS io is tunneled directly into geom.  It does
> not pass io requests through devfs.  As result, revocation does not
> change much except doing unneccessary buf queue flush.
>
> It might be telling to try the same experiment, as conducted in your
> next message, on msdosfs instead of UFS.

Everything seems to work exactly the same for msdosfs.  I retried:
- mount; tar; revoke; mount; tar  # (2nd mount succeeds due to revoke)
- mount; tar; mount-another-devfs; mount-using-other-devfs; tar # (2nd
   mount succeeds due to separate devvp).
No crashes.  I didn't risk any rw mounts.

> Below is the simplified patch.
>
> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
> index 712fc21..412b000 100644
> --- a/sys/ufs/ffs/ffs_vfsops.c
> +++ b/sys/ufs/ffs/ffs_vfsops.c
> @@ -764,6 +764,7 @@ ffs_mountfs(devvp, mp, td)
> 	cred = td ? td->td_ucred : NOCRED;
> 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
>
> +	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
> 	dev = devvp->v_rdev;
> 	dev_ref(dev);
> 	DROP_GIANT();

Not needed.

> @@ -771,17 +772,17 @@ ffs_mountfs(devvp, mp, td)
> 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
> 	g_topology_unlock();
> 	PICKUP_GIANT();
> -	VOP_UNLOCK(devvp, 0);
> -	if (error)
> +	if (error != 0) {
> +		VOP_UNLOCK(devvp, 0);
> 		goto out;

Needed.

> -	if (devvp->v_rdev->si_iosize_max != 0)
> -		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
> +	}
> +	if (dev->si_iosize_max != 0)
> +		mp->mnt_iosize_max = dev->si_iosize_max;
> 	if (mp->mnt_iosize_max > MAXPHYS)
> 		mp->mnt_iosize_max = MAXPHYS;
> -
> 	devvp->v_bufobj.bo_ops = &ffs_ops;
> -	if (devvp->v_type == VCHR)
> -		devvp->v_rdev->si_mountpt = mp;
> +	dev->si_mountpt = mp;
> +	VOP_UNLOCK(devvp, 0);
>
> 	fs = NULL;
> 	sblockloc = 0;

I would keep the unlock as early as possible.  Just move the initialization
of v_bufobj before it.

BTW, I don't like the fixup for > MAXPHYS.  This is removed from all file
systems in my version.  dev->si_iosize_max should be clamped to MAXPHYS
unless larger sizes work, and if larger sizes work then individual file
systems don't know enough to kill using them.

The check for si_iosize_max != 0 is bogus too, but not removed in my
version.  mp->mnt_iosize_max defaults to DFLTPHYS and the check avoids
changing that, but if si_iosize_max remains at 0 then i/o won't actually
work, and if some bug results in si_iosize_max being initialized later
but early enough for some i/o to work, then the default of DFLTPHYS
still won't work if it is larger than the driver size.  g_dev_taste()
actually defaults si_iosize_max to MAXPHYS and I think GEOM hides the
driver iosize_max from file systems so I think si_iosize_max is actually
always MAXPHYS here.

> @@ -1083,8 +1084,7 @@ ffs_mountfs(devvp, mp, td)
> out:
> 	if (bp)
> 		brelse(bp);
> -	if (devvp->v_type == VCHR && devvp->v_rdev != NULL)
> -		devvp->v_rdev->si_mountpt = NULL;
> +	dev->si_mountpt = NULL;
> 	if (cp != NULL) {
> 		DROP_GIANT();
> 		g_topology_lock();

I think this is still racy, but the race is more harmless than most of
the problems from revokes.  I think the following can happen:
- after we unlock, another mount starts and and clobbers our si_mountpt
   with a nonzero value.  Then this clobbers the other mount's si_mountpt
   with a zero value.  The zero value is relatively harmless.  It takes
   either a revoke, a separate devfs instance, or the old multiple-mount-
   allowing code for another mount to start.
The old code has a smaller race window:
- since the vnode is unlocked, it gives a null pointer panic if the
   v_rdev becomes null after it is tested to be non-null, or if it is
   still non-null then using it may clobber another mount's si_mountpt
   if the other mount set si_mountpt races with us.  It takes a revoke
   to get the null pointer.  Clobbering only takes a separate devfs
   instance or the old multiple-mount code.

> @@ -1287,8 +1287,7 @@ ffs_unmount(mp, mntflags)
> 	g_vfs_close(ump->um_cp);
> 	g_topology_unlock();
> 	PICKUP_GIANT();
> -	if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL)
> -		ump->um_devvp->v_rdev->si_mountpt = NULL;
> +	ump->um_dev->si_mountpt = NULL;
> 	vrele(ump->um_devvp);
> 	dev_rel(ump->um_dev);
> 	mtx_destroy(UFS_MTX(ump));

This has the same problems as cleaning up after an error in mount.

I think the following works to prevent multiple mounts via all of the
known buggy paths: early in every fsmount():

 	dev = devvp->v_rdev;
 	if (dev->si_mountpt != NULL) {
 		cleanup();
 		return (EBUSY);
 	}
 	dev->si_mountpt = mp;

This also prevents other mounts racing with us before we complete.  Too
bad if we fail but the other mount would have succeeded.  In fsunmount(),
move clearing si_mountpt to near the end.  I hope si_mountpt is locked
by the device reference and that this makes si_mountpt robust enough to
use as an exclusive access flag.

GEOM's exclusive access counters somehow don't prevent the multiple
mounts.  I think they are too closely associated with the vnode via
v_bufobj.

Bruce