fix for per-mount i/o counting in ffs

Fri May 20 11:22:11 UTC 2016

On Fri, 20 May 2016, Konstantin Belousov wrote:

> On Fri, May 20, 2016 at 09:27:38AM +1000, Bruce Evans wrote:
>> On Thu, 19 May 2016, Konstantin Belousov wrote:
>>
>>> On Thu, May 19, 2016 at 12:20:19PM +1000, Bruce Evans wrote:
>>>> On Thu, 19 May 2016, Bruce Evans wrote:
>>>>
>>>>> On Thu, 19 May 2016, Bruce Evans wrote:
>>>>>>  ...
>>>>>> I think the following works to prevent multiple mounts via all of the
>>>>>> known buggy paths: early in every fsmount():
>>>>
>>>> Here is a lightly tested version:

I checked some details for r206130 again:
- r206130 claims to allow only 1 mount per device node.  It actually
   allows only 1 mount per vnode.
- the case of separate vnodes seems to actually work almost as intended.
   This is most easily reached using separate devfs mounts.  It gets the
   access counts right (they are combined).  It has a chance of working
   because the separate vnodes provide a place to attach separate bufobjs.
- the common si_mountpt of course can't work for multiple mounts, but
   clobbering it doesn't have to break anything more than the i/o counts.
- I think it was only intended to allow multiple ro mounts.  However,
   1 rw mount is allowed after any number of ro mounts (using separate
   vnodes after r206130).  ro after rw is also allowed, but then the ro
   mount prints the warning that the fs was not properly dismounted.
- I think the behaviour in the previous point is a side affect of allowing
   fsck to write on a device open ro for a ro mount.  fsck reloads for
   just one of the ro mounts.
So multiple mounts are still too dangerous, and we should do finish
r206130.

>> I think your version needs atomic ops for resetting the pointer, and
>> maybe acquire/release too.  It has locking that is very similar to a
>> ...
>> Maybe even weaker locking is enough here, but this is too hard to
>> understand.
> Well, I do not think that barriers would add much there, since we really
> do not care about two almost parallel mounts to fail, and other locking
> provides enough synchronization points.  On the other hand, having
> explicit barriers makes si_mountpt act as the real semaphore.  Unlike
> mutex, it attributes the ownership of the device to a mount point, and
> not to the locking thread.
>
> So I added acq/rel.

Thanks.

>>> ...
>>> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
>>> index 712fc21..21425f5 100644
>>> --- a/sys/ufs/ffs/ffs_vfsops.c
>>> +++ b/sys/ufs/ffs/ffs_vfsops.c
>>> @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td)
>>> 	cred = td ? td->td_ucred : NOCRED;
>>> 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
>>>
>>> +	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
>>> 	dev = devvp->v_rdev;
>>> 	dev_ref(dev);
>>> +	if (!atomic_cmpset_ptr(&dev->si_mountpt, 0, mp)) {
>>> +		dev_rel(dev);
>>> +		VOP_UNLOCK(devvp, 0);
>>> +		return (EBUSY);
>>> +	}
>>
>> This is cleaner and safer than my version.
>>
>>> 	DROP_GIANT();
>>> 	g_topology_lock();
>>> 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
>>
>> g_vfs_open() already sets devvp->v_bufobj.bo_ops to g_vfs_bufops unless
>> it fails.  This clobbered our setting in the buggy multiple-mount case.
>> But with multiple mounts not allowed, this cleans up any garbage in
>> v_bufobj.
> Yes, and this orders things.  g_vfs_open() shoudl have devvp locked,
> both fo bo manipulations and for vnode_create_vobject() call.
> We can only assign to bo_ops after g_vfs_open() was done successfully.

The atomic cmpset now orders things too.  Is that enough?  It ensures
that an old mount cannot be active.  I don't know if v_bufobj is used
for non-mounts.

Except, for zfs there is no g_vfs_open() to order things, and for all
other file systems there is no atomic cmpset yet.

>> g_vfs_open() has 2 failures for non-exclusive access.  It starts by
>> checking v_bufobj.bo_private == devvp (this is after translating its
>> pointers to the ones passed here).  This is avg's fix for the multiple-
>> mounts problem (r206130).  It doesn't work in all cases.  I think this
>> is unecessary now.
> At least it weeds out other devfs mounts.

Yes, we need it until everything is converted.

>> ...
>> I now see another cleanup: don't goto out when g_vfs_open() fails.  This
>> depends on it setting cp to NULL and leaving nothing to clean when it
>> fails.  It has no man page and this detail is documented in its source
>> code.
> Then I would need to add another NULL assignment, VOP_UNLOCK etc.

g_vfs_open() already sets cp to NULL when it fails, and the cleanup
depends on that now, but it is just as good to depend on no cleanup
being needed on failure.  You do need another dev_rel().

I thought about moving the dev_ref() later to simplify the early returns.
I thought that this didn't quite work.  Now I think it does work, for
obvious reasons:
- the device is attached to a vnode, so it is referenced to prevent it
   going away unless the device is revoked.  It seems to be referenced
   at least 3 times in FreeBSD-9.
- the vnode is locked, so the reference count remains > 0 until we unlock.
So we just need a dev_ref() before the unlock in the non-error case, to
keep the device from going away if it is revoked.

> Updated patch to add acq/rel.
>
> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
> index 712fc21..670bb15 100644
> --- a/sys/ufs/ffs/ffs_vfsops.c
> +++ b/sys/ufs/ffs/ffs_vfsops.c
> @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td)
> 	cred = td ? td->td_ucred : NOCRED;
> 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
>
> +	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));

Hrmph.

> 	dev = devvp->v_rdev;
> 	dev_ref(dev);

Move later...

> +	if (atomic_cmpset_acq_ptr(&dev->si_mountpt, 0, mp) != 0) {

I changed the first 0 to NULL, and this works on i386, but now I remember
that i386 has bogus casts which break detection of type mismatches --
the atomic ptr functions take a [u]intptr_t, not a pointer type, so
NULL won't work if it is ((void *)0).  At least amd64 is still missing
this bug.

> +		dev_rel(dev);

...then this dev_rel() is not needed.

> +		VOP_UNLOCK(devvp, 0);
> +		return (EBUSY);
> +	}
> 	DROP_GIANT();
> 	g_topology_lock();
> 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
> 	g_topology_unlock();
> 	PICKUP_GIANT();
> -	VOP_UNLOCK(devvp, 0);
> -	if (error)
> +	if (error != 0) {
> +		VOP_UNLOCK(devvp, 0);
> 		goto out;

This becomes:

 	if (error != 0) {
 		VOP_UNLOCK(devvp, 0);
 		return (EBUSY);
 	}

Then assign v_bufobj.

Then dev_ref(), just in time for unlocking.

Then unlock.

> -	if (devvp->v_rdev->si_iosize_max != 0)
> -		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
> +	}
> +	if (dev->si_iosize_max != 0)
> +		mp->mnt_iosize_max = dev->si_iosize_max;
> 	if (mp->mnt_iosize_max > MAXPHYS)
> 		mp->mnt_iosize_max = MAXPHYS;
> -
> 	devvp->v_bufobj.bo_ops = &ffs_ops;
> -	if (devvp->v_type == VCHR)
> -		devvp->v_rdev->si_mountpt = mp;
> +	VOP_UNLOCK(devvp, 0);

This belongs earlier.

>
> 	fs = NULL;
> 	sblockloc = 0;
> ...

We need this in a central function.  g_vfs_open/close() can do it for
all cases except zfs.  This looks like:

 	DROP_GIANT();
 	g_topology_lock();
 	// atomic_cmpset and its error = EBUSY moved to top of g_vfs_open()
 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
 	g_topology_unlock();
 	PICKUP_GIANT();
 	if (error != 0) {
 		VOP_UNLOCK(devvp, 0);
 		return (error);
 	}
 	devvp->v_bufobj.bo_ops = &ffs_ops;
 	dev_ref(dev);
 	VOP_UNLOCK(devvp, 0);
 	if (dev->si_iosize_max != 0)
 		mp->mnt_iosize_max = dev->si_iosize_max;
 	if (mp->mnt_iosize_max > MAXPHYS)
 		mp->mnt_iosize_max = MAXPHYS;

where 2 of 2  lines with GIANT and 3 of 4 lines with iosize_max remain to
be cleaned up.

Resetting si_mountpt in g_vfs_close() is even simpler.  Oops, it also has
to be reset in g_vfs_open() on a later failure there.

Bruce