Major SMP problems with lstat/namei

Wed Sep 24 16:54:57 UTC 2008

On Wednesday 24 September 2008 12:52:59 am Jeff Wheelhouse wrote:
> 
> We have encountered some serious SMP performance/scalability problems  
> that we've tracked back to lstat/namei calls.  I've written a quick  
> benchmark with a pair of tests to simplify/measure the problem.  Both  
> tests use a tree of directories: the top level directory contains five  
> subdirectories a, b, c, d, and e.  Each subdirectory contains five  
> subdirectories a, b, c, d, and e, and so on..  1 directory at level  
> one, 5 at level two, 25 at level three, 125 at level four, 625 at  
> level five, and 3125 at level six.
> 
> In the "realpath" test, a random path is constructed at the bottom of  
> the tree (e.g. /tmp/lstat/a/b/c/d/e) and realpath() is called on that,  
> provoking lstat() calls on the whole tree.  This is to simulate a mix  
> of high-contention and low-contention lstat() calls.
> 
> In the "lstat" test, lstat is called directly on a path at the bottom  
> of the tree.  Since there are 3125 files, this simulates relatively  
> low-contention lstat() calls.
> 
> In both cases, the test repeats as many times as possible for 60  
> seconds.  Each test is run simultaneously by multiple processes, with  
> progressively doubling concurrency from 1 to 512.
> 
> What I found was that everything is fine at concurrency 2, probably  
> indicating that the benchmark pegged on some other resource limit.  At  
> concurrency 4, realpath drops to 31.8% of concurrency 1.  At  
> concurrency 8, performance is down to 18.3%.  In the interim, CPU load  
> goes to 80-90% system CPU.  I've confirmed via ktrace and the rusage  
> that the CPU usage is all system time, and that lstat() is the *only*  
> system call in the test (realpath() is called with an absolute path).
> 
> I then reran the 32-process test on 1-7 cores, and found that  
> performance peaks at 2 cores and drops sharply from there.  eight  
> cores runs *fifteen* times slower than two cores.
> 
> The test full results are at the bottom of this message.
> 
> This is on 6.3-RELEASE-p4 with vfs.lookup_shared=1.

Shared lookups only work on the NFS client in 6.x.  I'm about to turn them on 
for UFS in HEAD (8.x) and will backport the needed fixes to 7.x after 7.1 
(too risky to merge to 7.x this close to a release).  So lookup_shared=1 
isn't going to really help on 6.x unless you are doing it all over NFS.  You 
also want to backport my fix to cache_enter() before using lookup_shared at 
all:

jhb         2008-08-23 15:13:39 UTC

  FreeBSD src repository

  Modified files:
    sys/kern             vfs_cache.c 
  Log:
  SVN rev 182061 on 2008-08-23 15:13:39Z by jhb
  
  Fix a race condition with concurrent LOOKUP namecache operations for a vnode
  not in the namecache when shared lookups are enabled (vfs.lookup_shared=1,
  it is currently off by default) and the filesystem supports shared lookups
  (e.g. NFS client).  Specifically, if multiple concurrent LOOKUPs both miss
  in the name cache in parallel, each of the lookups may each end up adding an
  entry to the namecache resulting in duplicate entries in the namecache
  for the same pathname.  A subsequent removal of the mapping of that
  pathname to that vnode (via remove or rename) would only evict one of the
  entries from the name cache.  As a result, subseqent lookups for that
  pathname would still return the old vnode.
  
  This race was observed with shared lookups over NFS where a file was updated
  by writing a new file out to a temporary file name and then renaming that
  temporary file to the "real" file to effect atomic updates of a file.  Other
  processes on the same client that were periodically reading the file would
  occasionally receive an ESTALE error from open(2) because the VOP_GETATTR()
  in nfs_open() would receive that error when given the stale vnode.
  
  The fix here is to check for duplicates in cache_enter() and just return
  if an entry for this same directory and leaf file name for this vnode is
  already in the cache.  The check for duplicates is done by walking the
  per-vnode list of name cache entries.  It is expected that this list should
  be very small in the common case (usually 0 or 1 entries during a
  cache_enter() since most files only have 1 "leaf" name).
  
  Reviewed by:    ups, scottl
  MFC after:      2 months
  
  Revision  Changes    Path
  1.124     +33 -9     src/sys/kern/vfs_cache.c

If you want to try the UFS stuff on 7, you would need to probably backport at 
least the following, maybe more:

jeff        2008-04-11 09:44:25 UTC

  FreeBSD src repository

  Modified files:
    sys/ufs/ufs          ufs_lookup.c 
  Log:
   - cache dp->i_offset in the local 'i_offset' variable for use in loop
     indexes so directory lookup becomes shared lock safe.  In the modifying
     cases an exclusive lock is held here so the commit routine may
     rely on the state of i_offset.
   - Similarly handle i_diroff by fetching at the start and setting only once
     the operation is complete.  Without the exclusive lock these are only
     considered hints.
   - Assert that an exclusive lock is held when we're preparing for a commit
     routine.
   - Honor the lock type request from lookup instead of always using exclusive
     locking.
  
  Tested by:      pho, kris
  
  Revision  Changes    Path
  1.87      +48 -29    src/sys/ufs/ufs/ufs_lookup.c

jeff        2008-04-22 12:34:16 UTC

  FreeBSD src repository

  Modified files:
    sys/ufs/ufs          inode.h ufs_lookup.c 
  Log:
   - Use a local variable for i_ino in ufs_lookup.  It is only used to
     communicate between two parts of this one function.  This was causing
     problems with shared lookups as each would trash the ino value in the
     inode.
   - Remove the unused i_ino field from the inode structure.
  
  Revision  Changes    Path
  1.53      +0 -1      src/sys/ufs/ufs/inode.h
  1.88      +10 -13    src/sys/ufs/ufs/ufs_lookup.c

jhb         2008-07-30 21:07:56 UTC

  FreeBSD src repository

  Modified files:
    sys/ufs/ufs          ufs_lookup.c 
  Log:
  SVN rev 181018 on 2008-07-30 21:07:56Z by jhb
  
  Whitespace tweak.
  
  Revision  Changes    Path
  1.90      +0 -1      src/sys/ufs/ufs/ufs_lookup.c

jhb         2008-09-16 16:18:36 UTC

  FreeBSD src repository

  Modified files:
    sys/ufs/ufs          ufs_lookup.c 
  Log:
  SVN rev 183079 on 2008-09-16 16:18:36Z by jhb
  
  - Only set i_offset in the parent directory's i-node during a lookup for
    non-LOOKUP operations.
  - Relax a VOP assertion for a DELETE lookup.  rename() uses WANTPARENT
    instead of LOCKPARENT when looking up the source pathname.  ufs_rename()
    uses a relookup() to lock the parent directory when it decides to finally
    remove the source path.  Thus, it is ok for a DELETE with WANTPARENT set
    instead of LOCKPARENT to use a shared vnode lock rather than an exclusive
    vnode lock.
  
  Reported by:    kris (2)
  Reviewed by:    jeff
  
  Revision  Changes    Path
  1.91      +9 -3      src/sys/ufs/ufs/ufs_lookup.c

jhb         2008-09-16 19:06:44 UTC

  FreeBSD src repository

  Modified files:
    sys/ufs/ufs          inode.h ufs_lookup.c 
  Log:
  SVN rev 183093 on 2008-09-16 19:06:44Z by jhb
  
  Retire the 'i_reclen' field from the in-memory i-node.  Previously,
  during a DELETE lookup operation, lookup would cache the length of the
  directory entry to be deleted in 'i_reclen'.  Later, the actual VOP to
  remove the directory entry (ufs_remove, ufs_rename, etc.) would call
  ufs_dirremove() which extended the length of the previous directory
  entry to "remove" the deleted entry.
  
  However, we always read the entire block containing the directory
  entry when doing the removal, so we always have the directory entry to
  be deleted in-memory when doing the update to the directory block.
  Also, we already have to figure out where the directory entry that is
  being removed is in the block so that we can pass the component name
  to the dirhash code to update the dirhash.  So, instead of passing
  'i_reclen' from ufs_lookup() to the ufs_dirremove() routine, just read
  the 'd_reclen' field directly out of the entry being removed when
  updating the length of the previous entry in the block.
  
  This avoids a cosmetic issue of writing to 'i_reclen' while holding a
  shared vnode lock.  It also slightly reduces the amount of side-band
  data passed from ufs_lookup() to operations updating a directory via
  the directory's i-node.
  
  Reviewed by:    jeff
  
  Revision  Changes    Path
  1.54      +0 -1      src/sys/ufs/ufs/inode.h
  1.92      +9 -6      src/sys/ufs/ufs/ufs_lookup.c

jeff        2008-04-11 09:48:12 UTC

  FreeBSD src repository

  Modified files:
    sys/ufs/ufs          dirhash.h ufs_dirhash.c 
  Log:
   - Use a lockmgr lock rather than a mtx to protect dirhash.  This lock
     may be held for the duration of the various dirhash operations which
     avoids many complex unlock/lock/revalidate sequences.
   - Permit shared locks on lookup.  To protect the ip->i_dirhash pointer we
     use the vnode interlock in the shared case.  Callers holding the
     exclusive vnode lock can run without fear of concurrent modification to
     i_dirhash.
   - Hold an exclusive dirhash lock when creating the dirhash structure for
     the first time or when re-creating a dirhash structure which has been
     recycled.
  
  Tested by:      kris, pho
  
  Revision  Changes    Path
  1.6       +2 -1      src/sys/ufs/ufs/dirhash.h
  1.24      +289 -227  src/sys/ufs/ufs/ufs_dirhash.c

jhb         2008-09-16 16:23:56 UTC

  FreeBSD src repository

  Modified files:
    sys/ufs/ufs          dirhash.h ufs_dirhash.c 
  Log:
  SVN rev 183080 on 2008-09-16 16:23:56Z by jhb
  
  Fix a race with shared lookups on UFS.  If the the dirhash code reached the
  cap on memory usage, then shared LOOKUP operations could start free'ing
  dirhash structures.  Without these fixes, concurrent free's on the same
  directory could result in one of the threads blocked on a lock in a dirhash
  structure free'd by the other thread.
  - Replace the lockmgr lock in the dirhash structure with an sx lock.
  - Use a reference count managed with ufsdirhash_hold()/drop() to determine
    when to free the dirhash structures.  The directory i-node holds a
    reference while the dirhash is attached to an i-node.  Code that wishes
    to lock the dirhash while holding a shared vnode lock must first
    acquire a private reference to the dirhash while holding the vnode
    interlock before acquiring the dirhash sx lock.  After acquiring the sx
    lock, it drops the private reference after checking to see if the
    dirhash is still used by the directory i-node.
  
  Revision  Changes    Path
  1.7       +5 -1      src/sys/ufs/ufs/dirhash.h
  1.25      +82 -33    src/sys/ufs/ufs/ufs_dirhash.c

jhb         2008-09-22 20:53:22 UTC

  FreeBSD src repository

  Modified files:
    sys/ufs/ufs          ufs_dirhash.c 
  Log:
  SVN rev 183280 on 2008-09-22 20:53:22Z by jhb
  
  Close a race between concurrent calls to ufsdirhash_recycle() and
  ufsdirhash_free() introduced in my last commit by removing the dirhash
  about to be free'd in ufsdirhash_free() from the global dirhash list
  before dropping the sx lock.
  
  Tested by:      kris
  
  Revision  Changes    Path
  1.26      +10 -5     src/sys/ufs/ufs/ufs_dirhash.c

There are additional fixes needed to fix races with umount -f, so if you 
backport all this stuff, don't use umount -f or you risk panics. :)  Also, 
you will need to set the flag in the mount flags to enable shared lookups in 
the mount VOP in ffs_vfsops.c:

--- //depot/projects/smpng/sys/ufs/ffs/ffs_vfsops.c     2008/08/25 16:33:41
+++ //depot/user/jhb/lock/ufs/ffs/ffs_vfsops.c  2008/08/29 15:04:03
@@ -852,7 +852,7 @@
         * Initialize filesystem stat information in mount struct.
         */
        MNT_ILOCK(mp);
-       mp->mnt_kern_flag |= MNTK_MPSAFE;
+       mp->mnt_kern_flag |= MNTK_MPSAFE | MNTK_LOOKUP_SHARED;
        MNT_IUNLOCK(mp);
 #ifdef UFS_EXTATTR
 #ifdef UFS_EXTATTR_AUTOSTART

For 6.x you could in theory backport all of this as well, but there may be 
other fixes needed as well for 6.x.  I'm only planning on merging this stuff 
back to 7.x myself.

-- 
John Baldwin