svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm

Peter Wemm peter at wemm.org
Fri Aug 29 20:33:20 UTC 2014


On Friday 29 August 2014 20:51:03 Steven Hartland wrote:
> > On Friday 29 August 2014 11:54:42 Alan Cox wrote:
> snip...
> > > > With the patch we confirmed that both RAM usage and performance
> > > > for those seeing that issue are resolved, with no reported
> > > > regressions.
> > > > 
> > > >> (I should know better than to fire a reply off before full fact
> > > >> checking, but this commit worries me..)
> > > > 
> > > > Not a problem, it's great to know people pay attention to changes,
> > > > and raise their concerns.  Always better to have a discussion about
> > > > potential issues than to wait for a problem to occur.
> > > > 
> > > > Hopefully the above gives you some peace of mind, but if you still
> > > > have any concerns I'm all ears.
> > > 
> > > You didn't really address Peter's initial technical issue.  Peter
> > > correctly observed that cache pages are just another flavor of free
> > > pages.  Whenever the VM system is checking the number of free pages
> > > against any of the thresholds, it always uses the sum of
> > > v_cache_count and v_free_count.  So, to anyone familiar with the VM
> > > system, like Peter, what you've done, which is to derive a threshold
> > > from v_free_target but only compare v_free_count to that threshold,
> > > looks highly suspect.
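
For concreteness, the mismatch Alan describes looks roughly like the
following.  This is an illustrative sketch only; the helper names are
made up here and this is not the committed arc.c code:

/*
 * How the VM system itself compares against its thresholds: cache
 * pages count as free.
 */
static int
vm_style_shortage(u_int threshold)
{

	return (vm_cnt.v_free_count + vm_cnt.v_cache_count < threshold);
}

/*
 * What the committed change effectively does: the threshold is derived
 * from v_free_target, but cache pages are not counted against it.
 */
static int
committed_style_shortage(u_int threshold)
{

	return (vm_cnt.v_free_count < threshold);
}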
> > 
> > I think I'd like to see something like this:
> > 
> > Index: cddl/compat/opensolaris/kern/opensolaris_kmem.c
> > ===================================================================
> > --- cddl/compat/opensolaris/kern/opensolaris_kmem.c (revision 270824)
> > +++ cddl/compat/opensolaris/kern/opensolaris_kmem.c (working copy)
> > @@ -152,7 +152,8 @@
> >  kmem_free_count(void)
> >  {
> > -	return (vm_cnt.v_free_count);
> > +	/* "cache" is just a flavor of free pages in FreeBSD */
> > +	return (vm_cnt.v_free_count + vm_cnt.v_cache_count);
> >  }
> >  
> >  u_int
> 
> This has apparently already been tried and the response from Karl was:
> 
> - No, because memory in "cache" is subject to being either reallocated
> - or freed.  When I was developing this patch that was my first
> - impression as well and how I originally coded it, and it turned out
> - to be wrong.
> -
> - The issue here is that you have two parts of the system contending
> - for RAM -- the VM system generally, and the ARC cache.  If the ARC
> - cache frees space before the VM system activates and starts pruning
> - then you wind up with the ARC pinned at the minimum after some
> - period of time, because it releases "early."
> 
> I've asked him if he would retest just to be sure.
> 
> > The rest of the system looks at the "big picture"; it would be happy
> > to let the "free" pool run quite a way down so long as there are
> > "cache" pages available to satisfy the free space requirements.  This
> > would lead ZFS to mistakenly sacrifice ARC for no reason.  I'm not
> > sure how big a deal this is, but I can't imagine many scenarios where
> > I want ARC to be discarded in order to save some effectively free
> > pages.
> 
> From Karl's response in the original PR (above) it seems like this
> causes unexpected behaviour due to the two systems being separate.
> 
> > > That said, I can easily believe that your patch works better than
> > > the existing code, because it is closer in spirit to my
> > > interpretation of what the Solaris code does.  Specifically, I
> > > believe that the Solaris code starts trimming the ARC before the
> > > Solaris page daemon starts writing dirty pages to secondary
> > > storage.  Now, you've made FreeBSD do the same.  However, you've
> > > expressed it in a way that looks broken.
> > > 
> > > To wrap up, I think that you can easily write this in a way that
> > > simultaneously behaves like Solaris and doesn't look wrong to a VM
> > > expert.
> > > 
> > > > Out of interest would it be possible to update machines in the
> > > > cluster to see how their workload reacts to the change?
> > 
> > I'd like to see the free vs cache thing resolved first but it's
> > going to be tricky to get a comparison.
> 
> Does Karl's explanation above as to why this doesn't work change your
> mind?

Actually no, I would expect the code as committed to *cause* the 
undesirable behavior that Karl described.

i.e.: access a few large files and cause them to reside in cache, say 50GB
or so on a machine with 200GB of RAM.  We now have the state where:

v_cache = 50GB
v_free = 1MB

The rest of the VM system looks at vm_paging_needed(), which effectively
asks: has "v_cache + v_free" dropped below the paging threshold?  Since
there's 50.001GB free, the answer is no.  It'll let v_free run right down
to v_free_min because of the giant pool of v_cache just sitting there,
waiting to be used.

The zfs change, as committed, will ignore all the free memory in the form
of v_cache and will be freaking out about how low v_free is getting,
sacrificing ARC in order to put more memory into the v_free pool.
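
To put rough numbers on it, here is a standalone userland sketch; the
page counts and the few-GB thresholds below are made-up stand-ins for
this example, not values taken from a real machine or from the committed
code:

#include <stdio.h>

int
main(void)
{
	/* The scenario above, expressed in 4K pages. */
	unsigned long long v_cache_count = (50ULL << 30) >> 12; /* ~50GB of cache pages */
	unsigned long long v_free_count  = (1ULL << 20) >> 12;  /* ~1MB of free pages */
	unsigned long long paging_target = (4ULL << 30) >> 12;  /* assumed VM target, ~4GB */
	unsigned long long arc_target    = (3ULL << 30) >> 12;  /* assumed ARC threshold, ~3GB */

	/* VM-style test: cache pages count as free, so no shortage is seen. */
	printf("VM thinks paging is needed:    %s\n",
	    v_free_count + v_cache_count < paging_target ? "yes" : "no");

	/* Committed ZFS-style test: cache pages are ignored, so ARC reclaim fires. */
	printf("ARC-side test sees a shortage: %s\n",
	    v_free_count < arc_target ? "yes" : "no");

	return (0);
}

Same numbers, two different answers; that disagreement is what leaves ARC
shrinking while the cache pool sits idle.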

As long as ARC keeps sacrificing itself this way, the free pages in the
v_cache pool won't get used.  When ARC finally runs out of pages to give
up to v_free, the kernel will start using the free pages from v_cache.
Eventually it'll run down that v_cache free pool, and all the while ARC
will be stuck at its bare minimum.

Meanwhile, the ZFS ARC will be crippled.  This has consequences: ZFS does
RCU-like things from ARC to keep fragmentation under control.  With ARC
crippled, fragmentation will increase because there's less opportunistic
gathering of data from ARC.

Granted, you have to get things freed from active/inactive to the cache
state, but once it's there, depending on the workload, it'll mess with ARC.

-- 
Peter Wemm - peter at wemm.org; peter at FreeBSD.org; peter at yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won’t do…