Nullfs shared lookup

Wed Sep 5 09:19:03 UTC 2012

I, together with Peter Holm, developed a patch to enable shared lookups
on nullfs mounts when the lower filesystem allows shared lookups. The lack
of shared lookup support in nullfs is quite visible on any VFS-intensive
workload which makes heavy use of path translation. In particular, it was
a complaint at $dayjob which started me thinking about this issue.

There are two problems which prevent direct translation of the shared
lookup bit into the nullfs upper mount bit:

1. When vfs_lookup() calls VOP_LOOKUP() for nullfs, which passes the lookup
operation to the lower fs, the resulting vnode is often only shared-locked.
Then null_nodeget() cannot instantiate a covering vnode for the lower vnode,
since insmntque1() and null_hashins() require an exclusive lock on the
lower vnode.

The solution is straightforward: if the null hash fails to find a
pre-existing nullfs vnode for the lower vnode, the lower vnode lock is
upgraded.

2. (More serious.) Nullfs reclaims its vnodes on deactivation. The reason
is that nullfs is unable to detect reclamation of the lower vnode.
Reclaiming a nullfs vnode at deactivation time prevents its reference
to the lower vnode from becoming stale.

Unfortunately, this means that every lookup on nullfs needs an exclusive
lock to instantiate the upper vnode, which is never cached.

The solution which we propose is to add a VFS notification to the upper
filesystem about reclamation of a vnode in the lower filesystem. Now,
vgone() calls the new VFS op vfs_reclaim_lowervp() with an argument
lowervp, the vnode which is being reclaimed. It is possible to register
several reclamation event listeners, to correctly handle the case of
several nullfs mounts over the same directory.

For a filesystem not having nullfs mounts over it, the overhead added is
a single mount interlock lock/unlock in the vnode reclamation path.
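
The listener scheme can be modelled in user space roughly as follows. This
is a simplified sketch, not the kernel code: struct lower_mount,
register_upper() and notify_reclaim() are hypothetical names, a pthread
mutex stands in for the mount interlock, and the listeners are called with
the mutex held only to keep the example short.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_UPPERS 8

/* Hypothetical stand-in for a vnode of the lower filesystem. */
struct lower_vnode {
	int id;
};

typedef void (*reclaim_cb)(void *upper_mount, struct lower_vnode *lowervp);

/* Hypothetical lower mount holding the list of registered upper mounts. */
struct lower_mount {
	pthread_mutex_t interlock;
	int nuppers;
	struct {
		reclaim_cb cb;
		void *upper_mount;
	} uppers[MAX_UPPERS];
};

static void
lower_mount_init(struct lower_mount *mp)
{
	pthread_mutex_init(&mp->interlock, NULL);
	mp->nuppers = 0;
}

/* Register one upper mount as a listener; returns 0 on success. */
static int
register_upper(struct lower_mount *mp, reclaim_cb cb, void *upper_mount)
{
	int error = -1;

	pthread_mutex_lock(&mp->interlock);
	if (mp->nuppers < MAX_UPPERS) {
		mp->uppers[mp->nuppers].cb = cb;
		mp->uppers[mp->nuppers].upper_mount = upper_mount;
		mp->nuppers++;
		error = 0;
	}
	pthread_mutex_unlock(&mp->interlock);
	return (error);
}

/*
 * Called from the reclamation path of the lower filesystem.  With no
 * listeners the cost is one lock/unlock pair.  Returns the number of
 * listeners notified.
 */
static int
notify_reclaim(struct lower_mount *mp, struct lower_vnode *lowervp)
{
	int i, n;

	pthread_mutex_lock(&mp->interlock);
	n = mp->nuppers;
	for (i = 0; i < n; i++)
		mp->uppers[i].cb(mp->uppers[i].upper_mount, lowervp);
	pthread_mutex_unlock(&mp->interlock);
	return (n);
}

/* Example listener: counts reclamations seen by one upper mount. */
static void
count_reclaim(void *upper_mount, struct lower_vnode *lowervp)
{
	(void)lowervp;
	++*(int *)upper_mount;
}
```

Registering count_reclaim() twice with two different counters models two
nullfs mounts over the same directory: a single notify_reclaim() call
reaches both.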

Benchmarks consisting of up to 1K threads doing parallel stat(2) on the
same file demonstrate almost constant execution time, independent of the
number of running threads. Without the patch, the execution time of a
single-threaded run and of a run with 1024 threads performing the same
total count of stat(2) calls differs by a factor of 6.

A somewhat problematic detail, IMO, is that the nullfs reclamation
procedure calls vput() on the lowervp vnode, temporarily unlocking the
vnode being reclaimed. This seems to be fine for MPSAFE filesystems, but
non-MPSAFE code often puts a partially initialized vnode on some globally
visible list, and can later decide that the half-constructed vnode is not
needed. If a nullfs mount is created above such a filesystem, then other
threads might catch such an improperly initialized vnode. Instead of
trying to overcome this case, e.g. by recursing on the lower vnode lock in
null_reclaim_lowervp(), I decided to rely on the upcoming removal of
support for non-MPSAFE filesystems.

I think that unionfs can also benefit from this mechanism, but I have not
even looked at unionfs.

The patch is available at
http://people.freebsd.org/~kib/misc/nullfs_shared_lookup.1.patch
It survived stress2 torture testing.

Comments?