Major SMP problems with lstat/namei
John Baldwin
jhb at freebsd.org
Wed Sep 24 16:54:57 UTC 2008
On Wednesday 24 September 2008 12:52:59 am Jeff Wheelhouse wrote:
>
> We have encountered some serious SMP performance/scalability problems
> that we've tracked back to lstat/namei calls. I've written a quick
> benchmark with a pair of tests to simplify/measure the problem. Both
> tests use a tree of directories: the top level directory contains five
> subdirectories a, b, c, d, and e. Each subdirectory contains five
> subdirectories a, b, c, d, and e, and so on.. 1 directory at level
> one, 5 at level two, 25 at level three, 125 at level four, 625 at
> level five, and 3125 at level six.
>
> In the "realpath" test, a random path is constructed at the bottom of
> the tree (e.g. /tmp/lstat/a/b/c/d/e) and realpath() is called on that,
> provoking lstat() calls on the whole tree. This is to simulate a mix
> of high-contention and low-contention lstat() calls.
>
> In the "lstat" test, lstat is called directly on a path at the bottom
> of the tree. Since there are 3125 files, this simulates relatively
> low-contention lstat() calls.
>
> In both cases, the test repeats as many times as possible for 60
> seconds. Each test is run simultaneously by multiple processes, with
> progressively doubling concurrency from 1 to 512.
>
> What I found was that everything is fine at concurrency 2, probably
> indicating that the benchmark pegged on some other resource limit. At
> concurrency 4, realpath drops to 31.8% of concurrency 1. At
> concurrency 8, performance is down to 18.3%. In the interim, CPU load
> goes to 80-90% system CPU. I've confirmed via ktrace and the rusage
> that the CPU usage is all system time, and that lstat() is the *only*
> system call in the test (realpath() is called with an absolute path).
>
> I then reran the 32-process test on 1-7 cores, and found that
> performance peaks at 2 cores and drops sharply from there. eight
> cores runs *fifteen* times slower than two cores.
>
> The test full results are at the bottom of this message.
>
> This is on 6.3-RELEASE-p4 with vfs.lookup_shared=1.
Shared lookups only work on the NFS client in 6.x. I'm about to turn them on
for UFS in HEAD (8.x) and will backport the needed fixes to 7.x after 7.1
(too risky to merge to 7.x this close to a release). So lookup_shared=1
isn't going to really help on 6.x unless you are doing it all over NFS. You
also want to backport my fix to cache_enter() before using lookup_shared at
all:
jhb 2008-08-23 15:13:39 UTC
FreeBSD src repository
Modified files:
sys/kern vfs_cache.c
Log:
SVN rev 182061 on 2008-08-23 15:13:39Z by jhb
Fix a race condition with concurrent LOOKUP namecache operations for a vnode
not in the namecache when shared lookups are enabled (vfs.lookup_shared=1,
it is currently off by default) and the filesystem supports shared lookups
(e.g. NFS client). Specifically, if multiple concurrent LOOKUPs both miss
in the name cache in parallel, each of the lookups may each end up adding an
entry to the namecache resulting in duplicate entries in the namecache
for the same pathname. A subsequent removal of the mapping of that
pathname to that vnode (via remove or rename) would only evict one of the
entries from the name cache. As a result, subseqent lookups for that
pathname would still return the old vnode.
This race was observed with shared lookups over NFS where a file was updated
by writing a new file out to a temporary file name and then renaming that
temporary file to the "real" file to effect atomic updates of a file. Other
processes on the same client that were periodically reading the file would
occasionally receive an ESTALE error from open(2) because the VOP_GETATTR()
in nfs_open() would receive that error when given the stale vnode.
The fix here is to check for duplicates in cache_enter() and just return
if an entry for this same directory and leaf file name for this vnode is
already in the cache. The check for duplicates is done by walking the
per-vnode list of name cache entries. It is expected that this list should
be very small in the common case (usually 0 or 1 entries during a
cache_enter() since most files only have 1 "leaf" name).
Reviewed by: ups, scottl
MFC after: 2 months
Revision Changes Path
1.124 +33 -9 src/sys/kern/vfs_cache.c
If you want to try the UFS stuff on 7, you would need to probably backport at
least the following, maybe more:
jeff 2008-04-11 09:44:25 UTC
FreeBSD src repository
Modified files:
sys/ufs/ufs ufs_lookup.c
Log:
- cache dp->i_offset in the local 'i_offset' variable for use in loop
indexes so directory lookup becomes shared lock safe. In the modifying
cases an exclusive lock is held here so the commit routine may
rely on the state of i_offset.
- Similarly handle i_diroff by fetching at the start and setting only once
the operation is complete. Without the exclusive lock these are only
considered hints.
- Assert that an exclusive lock is held when we're preparing for a commit
routine.
- Honor the lock type request from lookup instead of always using exclusive
locking.
Tested by: pho, kris
Revision Changes Path
1.87 +48 -29 src/sys/ufs/ufs/ufs_lookup.c
jeff 2008-04-22 12:34:16 UTC
FreeBSD src repository
Modified files:
sys/ufs/ufs inode.h ufs_lookup.c
Log:
- Use a local variable for i_ino in ufs_lookup. It is only used to
communicate between two parts of this one function. This was causing
problems with shared lookups as each would trash the ino value in the
inode.
- Remove the unused i_ino field from the inode structure.
Revision Changes Path
1.53 +0 -1 src/sys/ufs/ufs/inode.h
1.88 +10 -13 src/sys/ufs/ufs/ufs_lookup.c
jhb 2008-07-30 21:07:56 UTC
FreeBSD src repository
Modified files:
sys/ufs/ufs ufs_lookup.c
Log:
SVN rev 181018 on 2008-07-30 21:07:56Z by jhb
Whitespace tweak.
Revision Changes Path
1.90 +0 -1 src/sys/ufs/ufs/ufs_lookup.c
jhb 2008-09-16 16:18:36 UTC
FreeBSD src repository
Modified files:
sys/ufs/ufs ufs_lookup.c
Log:
SVN rev 183079 on 2008-09-16 16:18:36Z by jhb
- Only set i_offset in the parent directory's i-node during a lookup for
non-LOOKUP operations.
- Relax a VOP assertion for a DELETE lookup. rename() uses WANTPARENT
instead of LOCKPARENT when looking up the source pathname. ufs_rename()
uses a relookup() to lock the parent directory when it decides to finally
remove the source path. Thus, it is ok for a DELETE with WANTPARENT set
instead of LOCKPARENT to use a shared vnode lock rather than an exclusive
vnode lock.
Reported by: kris (2)
Reviewed by: jeff
Revision Changes Path
1.91 +9 -3 src/sys/ufs/ufs/ufs_lookup.c
jhb 2008-09-16 19:06:44 UTC
FreeBSD src repository
Modified files:
sys/ufs/ufs inode.h ufs_lookup.c
Log:
SVN rev 183093 on 2008-09-16 19:06:44Z by jhb
Retire the 'i_reclen' field from the in-memory i-node. Previously,
during a DELETE lookup operation, lookup would cache the length of the
directory entry to be deleted in 'i_reclen'. Later, the actual VOP to
remove the directory entry (ufs_remove, ufs_rename, etc.) would call
ufs_dirremove() which extended the length of the previous directory
entry to "remove" the deleted entry.
However, we always read the entire block containing the directory
entry when doing the removal, so we always have the directory entry to
be deleted in-memory when doing the update to the directory block.
Also, we already have to figure out where the directory entry that is
being removed is in the block so that we can pass the component name
to the dirhash code to update the dirhash. So, instead of passing
'i_reclen' from ufs_lookup() to the ufs_dirremove() routine, just read
the 'd_reclen' field directly out of the entry being removed when
updating the length of the previous entry in the block.
This avoids a cosmetic issue of writing to 'i_reclen' while holding a
shared vnode lock. It also slightly reduces the amount of side-band
data passed from ufs_lookup() to operations updating a directory via
the directory's i-node.
Reviewed by: jeff
Revision Changes Path
1.54 +0 -1 src/sys/ufs/ufs/inode.h
1.92 +9 -6 src/sys/ufs/ufs/ufs_lookup.c
jeff 2008-04-11 09:48:12 UTC
FreeBSD src repository
Modified files:
sys/ufs/ufs dirhash.h ufs_dirhash.c
Log:
- Use a lockmgr lock rather than a mtx to protect dirhash. This lock
may be held for the duration of the various dirhash operations which
avoids many complex unlock/lock/revalidate sequences.
- Permit shared locks on lookup. To protect the ip->i_dirhash pointer we
use the vnode interlock in the shared case. Callers holding the
exclusive vnode lock can run without fear of concurrent modification to
i_dirhash.
- Hold an exclusive dirhash lock when creating the dirhash structure for
the first time or when re-creating a dirhash structure which has been
recycled.
Tested by: kris, pho
Revision Changes Path
1.6 +2 -1 src/sys/ufs/ufs/dirhash.h
1.24 +289 -227 src/sys/ufs/ufs/ufs_dirhash.c
jhb 2008-09-16 16:23:56 UTC
FreeBSD src repository
Modified files:
sys/ufs/ufs dirhash.h ufs_dirhash.c
Log:
SVN rev 183080 on 2008-09-16 16:23:56Z by jhb
Fix a race with shared lookups on UFS. If the the dirhash code reached the
cap on memory usage, then shared LOOKUP operations could start free'ing
dirhash structures. Without these fixes, concurrent free's on the same
directory could result in one of the threads blocked on a lock in a dirhash
structure free'd by the other thread.
- Replace the lockmgr lock in the dirhash structure with an sx lock.
- Use a reference count managed with ufsdirhash_hold()/drop() to determine
when to free the dirhash structures. The directory i-node holds a
reference while the dirhash is attached to an i-node. Code that wishes
to lock the dirhash while holding a shared vnode lock must first
acquire a private reference to the dirhash while holding the vnode
interlock before acquiring the dirhash sx lock. After acquiring the sx
lock, it drops the private reference after checking to see if the
dirhash is still used by the directory i-node.
Revision Changes Path
1.7 +5 -1 src/sys/ufs/ufs/dirhash.h
1.25 +82 -33 src/sys/ufs/ufs/ufs_dirhash.c
jhb 2008-09-22 20:53:22 UTC
FreeBSD src repository
Modified files:
sys/ufs/ufs ufs_dirhash.c
Log:
SVN rev 183280 on 2008-09-22 20:53:22Z by jhb
Close a race between concurrent calls to ufsdirhash_recycle() and
ufsdirhash_free() introduced in my last commit by removing the dirhash
about to be free'd in ufsdirhash_free() from the global dirhash list
before dropping the sx lock.
Tested by: kris
Revision Changes Path
1.26 +10 -5 src/sys/ufs/ufs/ufs_dirhash.c
There are additional fixes needed to fix races with umount -f, so if you
backport all this stuff, don't use umount -f or you risk panics. :) Also,
you will need to set the flag in the mount flags to enable shared lookups in
the mount VOP in ffs_vfsops.c:
--- //depot/projects/smpng/sys/ufs/ffs/ffs_vfsops.c 2008/08/25 16:33:41
+++ //depot/user/jhb/lock/ufs/ffs/ffs_vfsops.c 2008/08/29 15:04:03
@@ -852,7 +852,7 @@
* Initialize filesystem stat information in mount struct.
*/
MNT_ILOCK(mp);
- mp->mnt_kern_flag |= MNTK_MPSAFE;
+ mp->mnt_kern_flag |= MNTK_MPSAFE | MNTK_LOOKUP_SHARED;
MNT_IUNLOCK(mp);
#ifdef UFS_EXTATTR
#ifdef UFS_EXTATTR_AUTOSTART
For 6.x you could in theory backport all of this as well, but there may be
other fixes needed as well for 6.x. I'm only planning on merging this stuff
back to 7.x myself.
--
John Baldwin
More information about the freebsd-hackers
mailing list