device/disk vnode use by filesystems: vfs_bio, geom_vfs

Andriy Gapon avg at icyb.net.ua
Wed Feb 13 05:04:40 PST 2008


I changed the Subject to reflect a more general discussion than the one we
originally had (UDF specifics).
I am also CC'ing arch because I think these issues are core
architectural/design issues.

ufs/ffs is quite a complex beast; I don't pretend to understand its
internal workings well.

on 12/02/2008 17:58 Bruce Evans said the following:
> In fact, ffs already overrides the setting of bo_bsize for the device
> vnode to a different wrong setting -- g_vfs_open() sets the sector size,
> and ffs_mount() changes the setting to fs_bsize, but ffs actually needs
> the setting to be DEV_BSIZE (I think).

The latest ffs code doesn't seem to do that.
It calls g_vfs_open(), which sets bo_bsize on the disk vnode, and that's it.
The code does indeed override bo_ops from what g_vfs_open() set.
What you describe is done in ffs_vget(), which assigns bo_bsize = fs_bsize
to new vnodes for ffs files.
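
For reference, roughly what happens today (paraphrased from memory, not a
verbatim copy of the sources; exact identifiers may differ slightly):

	/* g_vfs_open() (geom_vfs.c), acting on the device vnode: */
	bo = &vp->v_bufobj;
	bo->bo_ops = g_vfs_bufops;	/* geom_vfs buf_ops */
	bo->bo_private = cp;		/* the GEOM consumer */
	bo->bo_bsize = pp->sectorsize;	/* provider's sector size */

	/* ffs_vget(), for each new ffs file vnode: */
	vp->v_bufobj.bo_bsize = fs->fs_bsize;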

> Other bugs from this:
> - ffs_rawread wants the sector size, and it assumes that this is in bo_bsize
>    for the device vnode, but ffs_mount() has changed this to fs_bsize.

In the latest code this is not true: bo_bsize is the underlying GEOM
provider's sector size ("device sector size").

> - multiple r/o mounts are supposed to work, but don't, since there is only
>    one device vnode with a shared bufobj, but the bufobj needs to be
>    per-file-system since all mounts write to it.  Various bad things
>    including panics result.  There is a well-know panic from bo_private
>    becoming garbage on unmount.

I agree.
Here's another example (more familiar to me): many DVD discs have both
CD9660 and UDF fs structures referencing the same data, and mkisofs can
also produce such images with the -udf option.
In theory, nothing should prevent a simultaneous RO mount of such a disc
via both methods. In practice, at least bo_private would be overridden by
the second mount.
In fact, even two simultaneous RO CD9660 mounts of the same device
can/will result in a problem (at least on umount).
I agree that this is indeed an architectural problem. Concurrent
filesystems using the same device should not fight over its vnode/bufobj.
They should each get their own private structures, but I have no idea how.

>    I just noticed more possibilities for
>    panics.  bo_ops points to static storage, so it never becomes complete
>    garbage.  However, at least ffs sets it blindly early on in
>    ffs_mountfs(), before looking at the file system to see if ffs can
>    mount it.  Thus if the file system is already mounted by another
>    ffs, then ffs clobbers the other fs's bo_ops.  The ffs mount will
>    presumably fail, leaving bo_ops clobbered.

I agree.

>    Also, a successful
>    g_vfs_open() has clobbered bo_ops, bo_private and bo_bsize a little
>    earlier.

I agree.

>    g_vfs_open() is fundamental to looking at the file system,
>    since I/O is not set up until it completes.  Clobbering the pointers
>    is most dangerous, but just clobbering bo_bsize breaks blkno
>    calculations for any code that uses bo_bsize.

I don't think there is any current code that uses bo_bsize for this purpose.

> Apart from these bugs, the fix for the blkno calculations for device
> vp's may be as simple as setting bo_bsize to DEV_BSIZE for the device
> vp of all disk file systems (since all disk file systems use expect
> this size).  Currently, ffs seems to be the only file system that
> overrides g_vfs_open()'s default of the sector size.  Nothing in any
> of fs/, gnu/fs/ and contrib/ references bo_bsize.

That is the first possibility, and I actually currently run a system with
such a patch in place and do not see any regressions.
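
A minimal sketch of what I mean (assuming the change lives in ffs_mountfs()
right after the g_vfs_open() call; the exact placement is just for
illustration):

	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
	if (error != 0)
		goto out;
	/* Force DEV_BSIZE units for blkno calculations on the device vnode. */
	devvp->v_bufobj.bo_bsize = DEV_BSIZE;
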
Another possibility, which requires more changes but is (in my opinion)
more "elegant", is to keep bo_bsize as it is and instead always pass
non-adjusted block numbers to bread*/bwrite, modifying g_vfs_strategy to
set bio_offset to blkno * bo_bsize. That way, filesystems would no longer
have to hack block numbers by shifting them by (bshift - DEV_BSHIFT). In
this case the code would look very much the same whether you work via a
device vnode or via a file vnode, because filesystems already pass logical
block numbers when they work via a file vnode.
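
Roughly, g_vfs_strategy() would then look like the sketch below (assuming
callers put the unshifted logical block number in b_blkno; this is not the
existing code, just an illustration of the idea):

void
g_vfs_strategy(struct bufobj *bo, struct buf *bp)
{
	struct g_consumer *cp = bo->bo_private;
	struct bio *bip = g_alloc_bio();

	bip->bio_cmd = bp->b_iocmd;
	/* Scale the logical block number by the bufobj block size. */
	bip->bio_offset = (off_t)bp->b_blkno * bo->bo_bsize;
	bip->bio_length = bp->b_bcount;
	bip->bio_data = bp->b_data;
	bip->bio_done = g_vfs_done;
	bip->bio_caller2 = bp;
	g_io_request(bip, cp);
}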

> 
> Yes.  They can set it more easily as they can tell g_vfs_open() what to
> set it to.  Except, until the bugs are fixed properly, g_vfs_open() can
> more easily do tests to prevent clobbering.  For bo_bsize and bo_ops,
> sharing a common value is safe and safe changes can be detected by
> setting to a special value on last unmount.  For bo_private, a safety
> check would probably disallow multiple mounts (since cp is dynamically
> allocated (?)).

I agree. We should either prohibit concurrent use of a device vnode by
multiple filesystems; or have some reference counting - maybe that would
work for multiple mounts of the same fs type; or design some way to make
that data private to each filesystem.
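
For the first option, the check could be as simple as the sketch below (the
EBUSY return and the exact test are my assumption, not existing geom_vfs
code; it relies on bo_private pointing back at the vnode itself whenever
the bufobj is not claimed by a mount):

int
g_vfs_open(struct vnode *vp, struct g_consumer **cpp, const char *fsname, int wr)
{
	struct bufobj *bo = &vp->v_bufobj;

	/* Refuse to clobber a bufobj already claimed by another mount. */
	if (bo->bo_private != vp)
		return (EBUSY);
	/* ... the rest of g_vfs_open() unchanged ... */
}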

One tough thing here is the lack of encapsulation: g_vfs_open can make its
own changes to v_bufobj, and then filesystem code can modify it further, so
there is no single point of control.
The simplest solution is to disallow this practice altogether.
Another approach that comes to mind is some kind of vnode "cloning", so
that each filesystem gets its own device vnode. This should work for
multiple RO mounts, maybe with some resource usage penalties.

RW mounts must be exclusive of course.


-- 
Andriy Gapon

