Namecache improvements (Summer of Code proposal)

Thu Apr 1 18:54:23 UTC 2010

I plan to submit the following idea as a Summer of Code project. I'd
like it to be treated as a research project rather than a
ready-for-production solution. Comments and suggestions are welcome.

Intro.

First of all, there can be several paths to a vnode: hard links, nullfs
mounts, snapshot dirs, etc.

There are several traditional ways of solving the problem of getting a
full path from the cache:

Pass granular updates to the namecache on file system changes, or cache
the full path for a vnode. This approach was taken by Solaris, XNU and
many others. In FreeBSD, the cache for the entire directory is purged on
file remove or rename.

The namecache in DragonFly keeps references to vnodes from the path root
to the given vnode. Linux does a similar thing, but its VFS is very
different. This approach doesn't work well enough for network
filesystems, though; there are tweaks to make NFS usable. AFAIK Linux
also had problems with NFS due to its VFS design (long ago); there is a
d_revalidate call used to verify whether a "cached" entry (dentry) is
still valid. Such problems arise because of the stateless and nameless
operation of NFS.

Random thoughts to make my intent clear.

I do not like the idea of keeping all vnodes from the path root around:
it serves no good purpose and creates another way of having a vnode in a
semi-valid state.

Keeping the cache in sync by updating it from outside of the filesystem
is hard, unreliable and error-prone. Basically, there is no reliable way
to find out whether a path has changed for a network filesystem or a
nullfs mount.

Our VFS keeps names and vnodes separate, and the same happens at the
filesystem level: one inode, one vnode, several names (just hard links).
By design there is no easy way of doing a reverse lookup (getting a name
for an inode/vnode).

Proposal.

Generalize dirhash into a generic directory cache (dircache) updated by
the filesystem itself. It's much like dirhash in UFS or dircache in pefs
(the latter is rather different from dirhash; it's more like a cache for
a network filesystem). A dircache entry ("strong entry") contains an
inode number, a mount struct pointer and internal filesystem data if
needed, and contains no vnode reference in the general case. The
filesystem keeps entries in sync on create/rename/remove/etc.

By 'no vnode reference' I mean that a dircache entry may exist without a
vnode reference attached to it. On VOP_LOOKUP, VOP_CREATE and VFS_VGET
the cache entry is updated with a vnode reference. The vnode reference
is removed on VOP_RECLAIM.
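
To make the shape of a strong entry more concrete, here is a rough
user-space sketch. All names (dc_entry, de_*, dc_attach_vnode,
dc_detach_vnode) are made up for illustration and are not existing
kernel interfaces; struct mount and struct vnode are left opaque, and
locking and reference counting are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mount;   /* opaque stand-in for the kernel's struct mount */
struct vnode;   /* opaque stand-in for the kernel's struct vnode */

/* Hypothetical layout of a strong dircache entry. */
struct dc_entry {
    struct dc_entry *de_parent;  /* entry of the containing directory */
    const char      *de_name;    /* name component */
    uint64_t         de_ino;     /* inode number, nonzero when strong */
    struct mount    *de_mount;   /* owning mount */
    void            *de_fsdata;  /* optional filesystem-private data */
    struct vnode    *de_vnode;   /* NULL while no vnode is attached */
};

/* Would be called from the lookup/create/get-by-inode paths. */
void
dc_attach_vnode(struct dc_entry *de, struct vnode *vp)
{
    assert(de != NULL && vp != NULL);
    de->de_vnode = vp;
}

/* Would be called from the reclaim path; the entry itself survives. */
void
dc_detach_vnode(struct dc_entry *de)
{
    assert(de != NULL);
    de->de_vnode = NULL;
}
```

The point of the sketch is only that detaching the vnode leaves the
name, inode number and mount pointer in place.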

Unify the current namecache and dircache. There are strong and weak
entries in the cache. Strong entries are added by dircache; all entries
added by oldcache (the current cache_* functions) are weak (inode
number = 0). If there's no inode number for an entry, it's weak. A
snapshot directory for a filesystem using dircache is likely to be weak.
Nullfs and NFS entries are all weak.

Only dircache entries may be strong.

There are no weak entries without a vnode reference in the cache at any
time.
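
The strong/weak rule and the invariant for weak entries could be
expressed roughly as below. The struct and function names (nc_entry,
nc_is_strong, nc_invariant_holds) are hypothetical, and the vnode is
reduced to an untyped pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down cache entry. */
struct nc_entry {
    uint64_t ne_ino;    /* 0 means "no inode number", i.e. weak */
    void    *ne_vnode;  /* attached vnode, or NULL */
};

/* An entry is strong exactly when it carries an inode number. */
bool
nc_is_strong(const struct nc_entry *ne)
{
    assert(ne != NULL);
    return ne->ne_ino != 0;
}

/* A weak entry must always reference a vnode; a strong one need not. */
bool
nc_invariant_holds(const struct nc_entry *ne)
{
    return nc_is_strong(ne) || ne->ne_vnode != NULL;
}
```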

Strong nodes are permanent: they remain in the cache as long as there
are references to them.

The cache remains consistent, i.e. if the cache grows large, only leaf
strong nodes that do not reference a vnode are removed. Entries are
removed not one by one but all entries of a directory at once, so that
the cache remains valid. It also means that the entries forming the full
path to the mount point are always in the cache. Traditional weak
entries behave the same way as in the current cache (removed with
cache_purge(vp)).
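
One possible reading of the eviction rule for strong nodes, as a
user-space sketch: the cached children of a directory are dropped only
as a group, and only if every one of them is a leaf with no vnode
attached. The names (dc_node, dc_dir_evictable, dc_dir_evict) are
invented for the example; memory management and locking are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical strong node: first child, next sibling, attached vnode. */
struct dc_node {
    struct dc_node *dn_children;
    struct dc_node *dn_sibling;
    void           *dn_vnode;
};

/* True if all cached children are leaves that reference no vnode. */
bool
dc_dir_evictable(const struct dc_node *dir)
{
    const struct dc_node *c;

    assert(dir != NULL);
    if (dir->dn_children == NULL)
        return false;               /* nothing cached below dir */
    for (c = dir->dn_children; c != NULL; c = c->dn_sibling)
        if (c->dn_children != NULL || c->dn_vnode != NULL)
            return false;
    return true;
}

/* Unlink all children of dir at once; returns how many were dropped. */
size_t
dc_dir_evict(struct dc_node *dir)
{
    struct dc_node *c, *next;
    size_t n = 0;

    if (!dc_dir_evictable(dir))
        return 0;
    for (c = dir->dn_children; c != NULL; c = next) {
        next = c->dn_sibling;
        c->dn_sibling = NULL;
        n++;
    }
    dir->dn_children = NULL;
    return n;
}
```

Since a directory is emptied either completely or not at all, a cached
directory never holds a partial listing, and the chain of entries up to
the mount point is never broken in the middle.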

A filesystem uses either dircache or oldcache, but not both.

Strong nodes form a Directed Acyclic Graph supporting reliable full path
resolution (among strong nodes only). This needs more thinking; the
componentname struct should be extended with support for "namespaces".
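
As an illustration of full path resolution among strong nodes, the
sketch below walks parent links up to the root and builds the path
backwards from the end of a buffer. It assumes the simplest case, where
each entry has exactly one parent entry; the names (dc_pathent,
dc_fullpath) are hypothetical, and "namespaces" are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical strong entry reduced to what path building needs. */
struct dc_pathent {
    const struct dc_pathent *de_parent;  /* NULL at the root */
    const char              *de_name;    /* name within the parent */
};

/*
 * Build the path of de at the tail of buf.  Returns a pointer to the
 * start of the path inside buf, or NULL if buf is too small.
 */
char *
dc_fullpath(const struct dc_pathent *de, char *buf, size_t buflen)
{
    size_t pos, len;

    assert(de != NULL && buf != NULL);
    if (buflen < 2)
        return NULL;
    pos = buflen - 1;
    buf[pos] = '\0';
    if (de->de_parent == NULL) {        /* the root itself */
        buf[--pos] = '/';
        return buf + pos;
    }
    for (; de->de_parent != NULL; de = de->de_parent) {
        len = strlen(de->de_name);
        if (pos < len + 1)              /* no room for "/name" */
            return NULL;
        pos -= len;
        memcpy(buf + pos, de->de_name, len);
        buf[--pos] = '/';
    }
    return buf + pos;
}
```

The walk never has to consult a vnode, which is what makes it reliable
as long as every entry on the way up is strong.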

Resolving weak entries remains the same; there is not much one can do
without knowing that the cache is valid. I.e. rely on the namecache to
get a full path only in cases where it's known to be reliable, and do
not try to make it reliable with various tweaks.

A vnode won't contain v_cache_src/v_cache_dst lists, just a reference to
a cache entry.

Rewriting existing filesystems to support dircache won't be necessary;
oldcache should remain fully functional. Dircache is intended only for
local filesystems, but it can be extended with a validate operation to
become useful for cases like nullfs.

An additional goal is to make it possible to get a vnode for a *strong*
dircache entry by traversing the graph and, if the vnode doesn't exist,
calling VFS_VGET(inode number) for the entry. This would give a
considerable performance improvement by avoiding VOP_LOOKUP calls for
all elements in the path. Access check issues are not hard to solve;
there is VOP_GETACL. I do not mean that a vnode can be created for an
arbitrary dircache entry: the full path of vnodes has to be created at
least once, ACLs cached, etc. But, even ignoring this goal,
filesystem-level data, namely the inode number, is still necessary for
dircache to remain connected with the filesystem.
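
A rough user-space sketch of this lookup path: walk the strong entries
by name and only ask the filesystem for a vnode by inode number when the
final entry has none attached. Everything here is hypothetical (dc_ent,
dc_find_child, dc_resolve, and the fake_vget stand-in for the
filesystem's by-inode vnode fetch), and access checks are not modelled
at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical strong entry with child list and optional vnode. */
struct dc_ent {
    struct dc_ent *de_children;  /* first cached child */
    struct dc_ent *de_sibling;   /* next entry in the same directory */
    const char    *de_name;
    uint64_t       de_ino;       /* nonzero for strong entries */
    void          *de_vnode;     /* NULL when no vnode is attached */
};

/* Stand-in type for "get a vnode by inode number". */
typedef void *(*dc_vget_fn)(uint64_t ino);

struct dc_ent *
dc_find_child(struct dc_ent *dir, const char *name)
{
    struct dc_ent *c;

    for (c = dir->de_children; c != NULL; c = c->de_sibling)
        if (strcmp(c->de_name, name) == 0)
            return c;
    return NULL;
}

/*
 * Resolve n name components below root using the cache only.  Returns
 * NULL on a miss or a weak entry, in which case the caller would fall
 * back to the ordinary per-component lookup.
 */
void *
dc_resolve(struct dc_ent *root, const char *const *names, size_t n,
    dc_vget_fn vget)
{
    struct dc_ent *de = root;
    size_t i;

    assert(root != NULL && vget != NULL);
    for (i = 0; i < n; i++) {
        de = dc_find_child(de, names[i]);
        if (de == NULL || de->de_ino == 0)
            return NULL;
    }
    if (de->de_vnode == NULL)       /* fetch by inode number once */
        de->de_vnode = vget(de->de_ino);
    return de->de_vnode;
}

/* Fake by-inode fetch for demonstration; counts how often it is hit. */
static int fake_vnodes[16];
int fake_vget_calls;

void *
fake_vget(uint64_t ino)
{
    fake_vget_calls++;
    return &fake_vnodes[ino % 16];
}
```

In this sketch intermediate directories never need a vnode of their own,
which is where the saving over per-component lookups would come from.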

In other words, namecache+dircache is a per-filesystem metadata cache
that is also aware of vnodes if there are any attached to entries. The
namecache is no longer a VFS subsystem but moves in between the VFS and
filesystem code. There is internal name handling in filesystems anyway;
in the UFS case there is dirhash, which partially does what dircache
should. The idea is to expose it to upper levels and use it.

Thanks,
Gleb.