File trees: the deeper, the weirder

Kostik Belousov kostikbel at gmail.com
Sat Nov 18 11:06:11 UTC 2006


On Sat, Nov 18, 2006 at 12:54:00PM +0300, Yar Tikhiy wrote:
> On Mon, Oct 30, 2006 at 03:47:37PM +0200, Kostik Belousov wrote:
> > On Mon, Oct 30, 2006 at 04:05:19PM +0300, Yar Tikhiy wrote:
> > > On Sun, Oct 29, 2006 at 11:32:58AM -0500, Matt Emmerton wrote:
> > > > [ Restoring some OP context.]
> > > > 
> > > > > On Sun, Oct 29, 2006 at 05:07:16PM +0300, Yar Tikhiy wrote:
> > > > >
> > > > > > As for the said program, it keeps its 1 Hz pace, mostly waiting on
> > > > > > "vlruwk".  It's killable, after a delay.  The system doesn't show ...
> > > > > >
> > > > > > Weird, eh?  Any ideas what's going on?
> > > > >
> > > > > I would guess that you need a new vnode to create the new file, but no
> > > > > vnodes are obvious candidates for freeing because they all have a child
> > > > > directory in use. Is there some sort of vnode clearing that goes on every
> > > > > second if we are short of vnodes?
> > > > 
> > > > See sys/vfs_subr.c, subroutine getnewvnode().  We call msleep() if we're
> > > > waiting on vnodes to be created (or recycled).  And just look at the 'hz'
> > > > parameter passed to msleep()!
> > > > 
> > > > The calling process's mkdir() will end up waiting in getnewvnode() (in
> > > > "vlruwk" state) while the vnlru kernel thread does it's thing (which is to
> > > > recycle vnodes.)
> > > > 
> > > > Either the vnlru kernel thread has to work faster, or the caller has to
> > > > sleep less, in order to avoid this lock-step behaviour.
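
[Illustrative aside: the lock-step described above, sketched as
self-contained userland C.  The names, the numbers, and the nanosleep()
stand-in for msleep(..., "vlruwk", hz) are all made up for illustration;
the real code is getnewvnode() in sys/kern/vfs_subr.c.]

#include <stdio.h>
#include <time.h>

#define HZ_NAP	1			/* msleep()'s 'hz' ticks ~ one second */

static long numvnodes = 110000;		/* already past the limit */
static long desiredvnodes = 100000;	/* ~ kern.maxvnodes */

/* crude stand-in for getnewvnode()'s "vlruwk" wait */
static void
getnewvnode_sim(void)
{

	if (numvnodes > desiredvnodes) {
		struct timespec ts = { .tv_sec = HZ_NAP, .tv_nsec = 0 };
		nanosleep(&ts, NULL);	/* like msleep(..., "vlruwk", hz) */
	}
	numvnodes++;			/* then allocate past the limit anyway */
}

int
main(void)
{

	for (int i = 0; i < 3; i++) {
		getnewvnode_sim();	/* one mkdir(2) ~ one new vnode */
		printf("vnode %d allocated after ~1s nap\n", i);
	}
	return (0);
}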
> > > 
> > > I'm afraid that, though your analysis is right, you arrive at the
> > > wrong conclusions.  The process waits for the whole second in
> > > getnewvnode() because the vnlru thread cannot free as many vnodes as
> > > it wants to.  vnlru_proc() will wake up sleepers on vnlruproc_sig
> > > (i.e., getnewvnode()) only if (numvnodes <= desiredvnodes * 9 / 10).
> > > Whether this condition is attainable depends on vlrureclaim() (called
> > > from the vnlru thread) freeing vnodes at a sufficient rate.  Perhaps
> > > vlrureclaim() just can't keep up the pace under these conditions;
> > > debug.vnlru_nowhere increasing is an indication of that.  Consequently,
> > > each getnewvnode() call sleeps for 1 second, then grabs a vnode beyond
> > > desiredvnodes.  It's no surprise that the 1-second delays start to
> > > appear after approx. kern.maxvnodes directories have been created.
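
[Illustrative aside: the wakeup condition Yar cites, in isolation.
Userland C with made-up numbers, not the kernel source; the real check
sits in vnlru_proc() in sys/kern/vfs_subr.c.]

#include <stdio.h>

/* vnlru_proc() wakes the "vlruwk" sleepers only below 9/10 of the limit */
static int
vnlru_should_wakeup(long numvnodes, long desiredvnodes)
{

	return (numvnodes <= desiredvnodes * 9 / 10);
}

int
main(void)
{
	long desired = 100000;

	/* reclaim stalled just above the threshold: sleepers stay asleep,
	 * so every getnewvnode() call serves its full one-second sleep */
	printf("%d\n", vnlru_should_wakeup(95000, desired));	/* 0 */
	printf("%d\n", vnlru_should_wakeup(90000, desired));	/* 1 */
	return (0);
}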
> > 
> > I think that David is right.  The namecache references _from_ the
> > directory make it immune to vnode reclamation.  Try this patch.  It is
> > very unfair to lsof.
> > 
> > Index: sys/kern/vfs_subr.c
> > ===================================================================
> > RCS file: /usr/local/arch/ncvs/src/sys/kern/vfs_subr.c,v
> > retrieving revision 1.685
> > diff -u -r1.685 vfs_subr.c
> > --- sys/kern/vfs_subr.c	2 Oct 2006 07:25:58 -0000	1.685
> > +++ sys/kern/vfs_subr.c	30 Oct 2006 13:44:59 -0000
> > @@ -582,7 +582,7 @@
> >  		 * If it's been deconstructed already, it's still
> >  		 * referenced, or it exceeds the trigger, skip it.
> >  		 */
> > -		if (vp->v_usecount || !LIST_EMPTY(&(vp)->v_cache_src) ||
> > +		if (vp->v_usecount || /* !LIST_EMPTY(&(vp)->v_cache_src) || */
> >  		    (vp->v_iflag & VI_DOOMED) != 0 || (vp->v_object != NULL &&
> >  		    vp->v_object->resident_page_count > trigger)) {
> >  			VI_UNLOCK(vp);
> > @@ -607,7 +607,7 @@
> >  		 * interlock, the other thread will be unable to drop the
> >  		 * vnode lock before our VOP_LOCK() call fails.
> >  		 */
> > -		if (vp->v_usecount || !LIST_EMPTY(&(vp)->v_cache_src) ||
> > +		if (vp->v_usecount || /* !LIST_EMPTY(&(vp)->v_cache_src) || */
> >  		    (vp->v_object != NULL && 
> >  		    vp->v_object->resident_page_count > trigger)) {
> >  			VOP_UNLOCK(vp, LK_INTERLOCK, td);
> 
> By the way, what do you think v_cache_src is for?  The only two
> places it is used in the kernel are in the unused function
> cache_leaf_test() and this one, in vlrureclaim().  Is its main
> purpose just to keep directory vnodes that are referenced by nc_dvp
> in some namecache entries?

I think so, yes.  At present it mostly grants immunity to the vnodes that
could be needed for getcwd()/lsof path lookups through the namecache.

Did my change help with your load?

cache_leaf_test() seems to be the way to go: partition vlru reclaim into
two stages.  The first stage reclaims only leaf vnodes (that is, vnodes
that have no child directories in the namecache); the second stage fires
only if the first failed to free anything, and it simply ignores
v_cache_src, as my change above does.  A rough sketch follows.  See the
commit log for rev. 1.56 of vfs_cache.c.
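
[Illustrative aside: the two-stage idea above, sketched as self-contained
userland C.  The struct vnode here, reclaimable() and vlrureclaim_sim()
are invented for illustration; the real vlrureclaim() lives in
sys/kern/vfs_subr.c.]

#include <stdbool.h>
#include <stddef.h>

struct vnode {
	int		v_usecount;
	bool		v_cache_src_empty; /* ~ LIST_EMPTY(&vp->v_cache_src) */
	struct vnode	*v_next;
};

static bool
reclaimable(struct vnode *vp, bool ignore_cache_src)
{

	if (vp->v_usecount != 0)
		return (false);		/* still referenced */
	if (!ignore_cache_src && !vp->v_cache_src_empty)
		return (false);		/* non-leaf: spared in stage one */
	return (true);
}

/* returns how many vnodes the two stages would reclaim */
static int
vlrureclaim_sim(struct vnode *head)
{
	struct vnode *vp;
	int done = 0;

	/* stage one: leaves only, v_cache_src respected */
	for (vp = head; vp != NULL; vp = vp->v_next)
		if (reclaimable(vp, false))
			done++;		/* would vgone() the vnode here */

	/* stage two: only if stage one freed nothing; ignore v_cache_src */
	if (done == 0)
		for (vp = head; vp != NULL; vp = vp->v_next)
			if (reclaimable(vp, true))
				done++;
	return (done);
}

int
main(void)
{
	struct vnode leaf = { 0, true, NULL };
	struct vnode dir = { 0, false, &leaf };

	/* stage one frees the leaf, so the directory is spared */
	return (vlrureclaim_sim(&dir) == 1 ? 0 : 1);
}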
