File trees: the deeper, the weirder

Yar Tikhiy yar at comp.chem.msu.su
Sat Nov 18 18:14:47 UTC 2006


On Sat, Nov 18, 2006 at 01:05:44PM +0200, Kostik Belousov wrote:
> On Sat, Nov 18, 2006 at 12:54:00PM +0300, Yar Tikhiy wrote:
> > On Mon, Oct 30, 2006 at 03:47:37PM +0200, Kostik Belousov wrote:
> > > On Mon, Oct 30, 2006 at 04:05:19PM +0300, Yar Tikhiy wrote:
> > > > On Sun, Oct 29, 2006 at 11:32:58AM -0500, Matt Emmerton wrote:
> > > > > [ Restoring some OP context.]
> > > > > 
> > > > > > On Sun, Oct 29, 2006 at 05:07:16PM +0300, Yar Tikhiy wrote:
> > > > > >
> > > > > > > As for the said program, it keeps its 1 Hz pace, mostly waiting on
> > > > > > > "vlruwk".  It's killable, after a delay.  The system doesn't show ...
> > > > > > >
> > > > > > > Weird, eh?  Any ideas what's going on?
> > > > > >
> > > > > > I would guess that you need a new vnode to create the new file, but no
> > > > > > vnodes are obvious candidates for freeing because they all have a child
> > > > > > directory in use. Is there some sort of vnode clearing that goes on every
> > > > > > second if we are short of vnodes?
> > > > > 
> > > > > See sys/vfs_subr.c, subroutine getnewvnode().  We call msleep() if we're
> > > > > waiting on vnodes to be created (or recycled).  And just look at the 'hz'
> > > > > parameter passed to msleep()!
> > > > > 
> > > > > The calling process's mkdir() will end up waiting in getnewvnode() (in
> > > > > "vlruwk" state) while the vnlru kernel thread does it's thing (which is to
> > > > > recycle vnodes.)
> > > > > 
> > > > > Either the vnlru kernel thread has to work faster, or the caller has to
> > > > > sleep less, in order to avoid this lock-step behaviour.
> > > > 
> > > > I'm afraid that, though your analysis is right, you arrive at the
> > > > wrong conclusions.  The process waits for the whole second in
> > > > getnewvnode() because the vnlru thread cannot free as many vnodes
> > > > as it wants to.  vnlru_proc() will wake up sleepers on vnlruproc_sig
> > > > (i.e., getnewvnode()) only if (numvnodes <= desiredvnodes * 9 / 10).
> > > > Whether this condition is attainable depends on vlrureclaim() (called
> > > > from the vnlru thread) freeing vnodes at a sufficient rate.  Perhaps
> > > > vlrureclaim() just can't keep the pace under these conditions;
> > > > debug.vnlru_nowhere increasing is an indication of that.  Consequently,
> > > > each getnewvnode() call sleeps 1 second, then grabs a vnode beyond
> > > > desiredvnodes.  It's no surprise that the 1-second delays start to
> > > > appear after approx. kern.maxvnodes directories have been created.
> > > 
> > > I think that David is right.  The references _from_ the directory make
> > > it immune to vnode reclamation.  Try this patch.  It is very unfair to lsof.
> > > 
> > > Index: sys/kern/vfs_subr.c
> > > ===================================================================
> > > RCS file: /usr/local/arch/ncvs/src/sys/kern/vfs_subr.c,v
> > > retrieving revision 1.685
> > > diff -u -r1.685 vfs_subr.c
> > > --- sys/kern/vfs_subr.c	2 Oct 2006 07:25:58 -0000	1.685
> > > +++ sys/kern/vfs_subr.c	30 Oct 2006 13:44:59 -0000
> > > @@ -582,7 +582,7 @@
> > >  		 * If it's been deconstructed already, it's still
> > >  		 * referenced, or it exceeds the trigger, skip it.
> > >  		 */
> > > -		if (vp->v_usecount || !LIST_EMPTY(&(vp)->v_cache_src) ||
> > > +		if (vp->v_usecount || /* !LIST_EMPTY(&(vp)->v_cache_src) || */
> > >  		    (vp->v_iflag & VI_DOOMED) != 0 || (vp->v_object != NULL &&
> > >  		    vp->v_object->resident_page_count > trigger)) {
> > >  			VI_UNLOCK(vp);
> > > @@ -607,7 +607,7 @@
> > >  		 * interlock, the other thread will be unable to drop the
> > >  		 * vnode lock before our VOP_LOCK() call fails.
> > >  		 */
> > > -		if (vp->v_usecount || !LIST_EMPTY(&(vp)->v_cache_src) ||
> > > +		if (vp->v_usecount || /* !LIST_EMPTY(&(vp)->v_cache_src) || */
> > >  		    (vp->v_object != NULL && 
> > >  		    vp->v_object->resident_page_count > trigger)) {
> > >  			VOP_UNLOCK(vp, LK_INTERLOCK, td);
> > 
> > By the way, what do you think v_cache_src is for?  The only two
> > places it is used in the kernel are the unused function
> > cache_leaf_test() and this spot in vlrureclaim().  Is its main
> > purpose just to pin directory vnodes that are referenced by the
> > nc_dvp field of some namecache entries?
> 
> I think so, yes.  Right now it mostly gives immunity to the vnodes that
> could be used for getcwd()/lsof path lookups through the namecache.

Another purpose of v_cache_src that I missed is to allow removing
all namecache entries whose nc_dvp points to a particular vnode
when that vnode is recycled, so that we don't end up with stale
nc_dvp's in the namecache.  Perhaps this is the main role
v_cache_src plays.
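
To illustrate, here is a condensed sketch (not the actual
vfs_cache.c code, and with all locking elided) of how a recycle-time
purge can walk v_cache_src to kill every entry that names the dying
vnode as its parent directory:

	/*
	 * Sketch only: drop all namecache entries whose nc_dvp points
	 * at the vnode being recycled.  In the real kernel this work
	 * is done by cache_purge() in sys/kern/vfs_cache.c.
	 */
	static void
	purge_as_parent(struct vnode *vp)
	{
		struct namecache *ncp;

		while ((ncp = LIST_FIRST(&vp->v_cache_src)) != NULL)
			cache_zap(ncp);	/* unlink from all lists and free */
	}

Without that list we'd have to scan the entire namecache on every
recycle to find such entries.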

> Did my change help with your load?

Your hack works, thanks!  Your analysis of the problem proves
correct.  And I'm gaining some understanding of it, too :-)
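
For the archives, the 1 Hz lock-step we were chasing reduces to
roughly the following (a condensed paraphrase of the vfs_subr.c
logic with locking elided, not verbatim code):

	/* getnewvnode(): too many vnodes, ask vnlru for help. */
	if (numvnodes - freevnodes > desiredvnodes) {
		if (vnlruproc_sig == 0) {
			vnlruproc_sig = 1;	/* avoid extra wakeups */
			wakeup(vnlruproc);
		}
		/*
		 * The early wakeup below is the only way out before the
		 * timeout; otherwise we sleep the full second and then
		 * take a vnode beyond desiredvnodes anyway.
		 */
		tsleep(&vnlruproc_sig, PVFS, "vlruwk", hz);
	}

	/* vnlru_proc(): the only place that wakes the sleepers early. */
	if (numvnodes <= desiredvnodes * 9 / 10) {
		vnlruproc_sig = 0;
		wakeup(&vnlruproc_sig);
	}

When vlrureclaim() can't push numvnodes below the 9/10 mark, every
tsleep() above runs to its full hz timeout, hence exactly one mkdir
per second.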

> cache_leaf_test() seems to be the way to go: partition vlru reclaim
> into two stages, the first of which reclaims leaf vnodes (that is,
> vnodes that have no child dirs in the namecache), and the second of
> which fires only if the first stage failed to free anything and simply
> ignores v_cache_src, as in my change.  See the comment for rev. 1.56
> of vfs_cache.c.

Excuse me, but why "vnodes that have no child dirs in the namecache"?
Perhaps they should be vnodes that have no children _at all_ in the
namecache?  That would be better suited to preserving information
for vn_fullpath().  However, I must admit that I don't know how
lsof works because I've never used it.
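
Anyway, to be sure I read your proposal right, here's a rough sketch
of the two-stage reclaim (hypothetical code: the pass argument and
the try_recycle() helper are my inventions, not existing kernel
interfaces):

	/*
	 * Pass 1 spares vnodes that still have namecache children;
	 * pass 2 runs only if pass 1 freed nothing and ignores
	 * v_cache_src entirely, like your hack above.
	 */
	int pass, done = 0;
	struct vnode *vp;

	for (pass = 1; pass <= 2 && done == 0; pass++) {
		TAILQ_FOREACH(vp, &mp->mnt_nvnodelist, v_nmntvnodes) {
			if (vp->v_usecount ||
			    (pass == 1 && !LIST_EMPTY(&vp->v_cache_src)) ||
			    (vp->v_iflag & VI_DOOMED) != 0)
				continue;	/* busy, or still a parent */
			done += try_recycle(vp);
		}
	}

Note that the !LIST_EMPTY() check in pass 1 spares vnodes with _any_
namecache children, which is what I'm suggesting above; your
cache_leaf_test() variant would spare only those with child dirs.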

-- 
Yar

