svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm

Peter Wemm peter at wemm.org
Fri Aug 29 19:27:03 UTC 2014


On Friday 29 August 2014 11:54:42 Alan Cox wrote:
> On 08/29/2014 03:32, Steven Hartland wrote:
> >> On Thursday 28 August 2014 17:30:17 Alan Cox wrote:
> >> > On 08/28/2014 16:15, Matthew D. Fuller wrote:
> >> > > On Thu, Aug 28, 2014 at 10:11:39PM +0100 I heard the voice of
> >> > > 
> >> > > Steven Hartland, and lo! it spake thus:
> >> > >> It's very likely applicable to stable/9, although I've never used 9
> >> > >> myself; we jumped from 9 direct to 10.
> >> > > 
> >> > > This is actually hitting two different issues from the two bugs:
> >> > > 
> >> > > - 191510 is about "ARC isn't greedy enough" on huge-memory
> >> > >   machines, and from the osreldate that bug was filed on 9.2,
> >> > >   so presumably is applicable.
> >> > > 
> >> > > - 187594 is about "ARC is too greedy" (probably mostly on
> >> > >   not-so-huge machines) and starves/drives the rest of the
> >> > >   system into swap.  That I believe came about as a result of
> >> > >   some unrelated change in the 10.x stream that upset the
> >> > >   previous balance between ARC and the rest of the VM, so isn't
> >> > >   a problem on 9.x.
> >> > 
> >> > 10.0 had a bug in the page daemon that was fixed in 10-STABLE about
> >> > three months ago (r265945).  The ARC was not the only thing affected
> >> > by this bug.
> >> 
> >> I'm concerned about potential unintended consequences of this change.
> >> 
> >> Before, arc reclaim was driven by vm_paging_needed(), which was:
> >> 
> >> vm_paging_needed(void)
> >> {
> >> 	return (vm_cnt.v_free_count + vm_cnt.v_cache_count <
> >> 	    vm_pageout_wakeup_thresh);
> >> }
> >> 
> >> Now it's ignoring the v_cache_count and looking exclusively at
> >> v_free_count.  "cache" pages are free pages that just happen to have
> >> known contents.  If I read this change right, zfs arc will now discard
> >> checksummed cache pages to make room for non-checksummed pages:
> > That test is still there so if it needs to it will still trigger.
> > 
> > However that's often at a lower level, as vm_pageout_wakeup_thresh is only
> > 110% of min free, whereas zfs_arc_free_target is based on target free,
> > which is 4 * (min free + reserved).
> > 
> >> +       if (kmem_free_count() < zfs_arc_free_target) {
> >> +               return (1);
> >> +       }
> >> ...
> >> +kmem_free_count(void)
> >> +{
> >> +       return (vm_cnt.v_free_count);
> >> +}
> >> 
> >> This seems like a pretty substantial behavior change.  I'm concerned
> >> that it doesn't appear to count all the forms of "free" pages.
> >> 
> >> I haven't seen the problems with the over-aggressive ARC since the
> >> page daemon bug was fixed.  It's been working fine under pretty
> >> abusive loads in the freebsd cluster after that fix.
> > 
> > Others have also confirmed that even with r265945 they can still trigger
> > the performance issue.
> > 
> > In addition, without it we still have loads of RAM sitting there unused;
> > in my particular experience we have 40GB of 192GB sitting unused, and
> > that was with a stable build from last weekend.
> 
> The Solaris code only imposed this limit on 32-bit machines where the
> available kernel virtual address space may be much less than the
> available physical memory.  Previously, FreeBSD imposed this limit on
> both 32-bit and 64-bit machines.  Now, it imposes it on neither.  Why
> continue to do this differently from Solaris?

Since the question was asked below, we don't have zfs machines in the cluster 
running i386.  We can barely get them to boot as it is due to kva pressure.  
We have to reduce/cap physical memory and change the user/kernel virtual split 
from 3:1 to 2.5:1.5. 
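
For anyone curious, the knobs involved look roughly like this - the values
here are illustrative only, not what we actually ship on any particular box:

    # i386 kernel config: grow KVA from the default 1GB (KVA_PAGES=256,
    # 4MB per unit) to 1.5GB, giving the 2.5:1.5 split mentioned above
    options         KVA_PAGES=384

    # /boot/loader.conf: cap usable physical memory so the kernel's own
    # data structures still fit in the reduced KVA
    hw.physmem="8G"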

We do run zfs on small amd64 machines with 2G of ram, but I can't imagine it 
working on the 10G i386 PAE machines that we have.


> > With the patch we confirmed that both RAM usage and performance for those
> > seeing that issue are resolved, with no reported regressions.
> > 
> >> (I should know better than to fire a reply off before full fact
> >> checking, but this commit worries me..)
> > 
> > Not a problem, it's great to know people pay attention to changes and
> > raise their concerns.  Always better to have a discussion about potential
> > issues than to wait for a problem to occur.
> > 
> > Hopefully the above gives you some peace of mind, but if you still have
> > any concerns I'm all ears.
> 
> You didn't really address Peter's initial technical issue.  Peter
> correctly observed that cache pages are just another flavor of free
> pages.  Whenever the VM system is checking the number of free pages
> against any of the thresholds, it always uses the sum of v_cache_count
> and v_free_count.  So, to anyone familiar with the VM system, like
> Peter, what you've done, which is to derive a threshold from
> v_free_target but only compare v_free_count to that threshold, looks
> highly suspect.

I think I'd like to see something like this:

Index: cddl/compat/opensolaris/kern/opensolaris_kmem.c
===================================================================
--- cddl/compat/opensolaris/kern/opensolaris_kmem.c	(revision 270824)
+++ cddl/compat/opensolaris/kern/opensolaris_kmem.c	(working copy)
@@ -152,7 +152,8 @@
 kmem_free_count(void)
 {
 
-	return (vm_cnt.v_free_count);
+	/* "cache" is just a flavor of free pages in FreeBSD */
+	return (vm_cnt.v_free_count + vm_cnt.v_cache_count);
 }
 
 u_int


The rest of the system looks at the "big picture": it would be happy to let
the "free" pool run quite a way down so long as there are "cache" pages
available to satisfy the free space requirements.  This would lead ZFS to
mistakenly sacrifice ARC for no reason.  I'm not sure how big a deal this is,
but I can't imagine many scenarios where I want ARC to be discarded in order
to save some effectively free pages.
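
To make that concrete, here's a rough sketch of how the check reads once
kmem_free_count() is cache-aware.  The function name below is made up for
illustration; the real test is the one added to the arc reclaim path in the
quoted diff:

/*
 * Sketch only -- not the committed code.  With the tweak above,
 * kmem_free_count() counts cache pages as well as free pages, so the
 * new test compares (free + cache) against zfs_arc_free_target, which
 * is based on the page daemon's target free (roughly 4 * (min free +
 * reserved)) and therefore fires well before vm_pageout_wakeup_thresh
 * (~110% of min free).
 */
static int
arc_memory_pressure(void)
{
	/* New, earlier trigger from r270759, now counting cache pages. */
	if (kmem_free_count() < zfs_arc_free_target)
		return (1);

	/* Original trigger: the page daemon is about to be woken up. */
	if (vm_paging_needed())
		return (1);

	return (0);
}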

> That said, I can easily believe that your patch works better than the
> existing code, because it is closer in spirit to my interpretation of
> what the Solaris code does.  Specifically, I believe that the Solaris
> code starts trimming the ARC before the Solaris page daemon starts
> writing dirty pages to secondary storage.  Now, you've made FreeBSD do
> the same.  However, you've expressed it in a way that looks broken.
> 
> To wrap up, I think that you can easily write this in a way that
> simultaneously behaves like Solaris and doesn't look wrong to a VM expert.
> 
> > Out of interest would it be possible to update machines in the cluster to
> > see how their workload reacts to the change?
> > 
> >    Regards
> >    Steve

I'd like to see the free vs cache thing resolved first but it's going to be 
tricky to get a comparison.

For the first few months of the year, things were really troublesome.  It was 
quite easy to overtax the machines and run them into the ground.

This is not the case now - things are working pretty well under pressure 
(prior to the commit).  It's got to the point that we feel comfortable 
thrashing the machines really hard again.  Getting a comparison when it 
already works well is going to be tricky.

We don't have large memory machines that aren't already tuned with
vfs.zfs.arc_max caps for the sake of tmpfs use.
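
For reference, that's just a loader tunable, e.g. something like this in
/boot/loader.conf (the number is made up, every box is sized differently):

    # leave headroom for tmpfs by capping the ARC
    vfs.zfs.arc_max="64G"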

For context to the wider audience, we do not run -release or -pN in the 
freebsd cluster.  We mostly run -current, and some -stable.  I am well aware
that there is significant discomfort in 10.0-R with zfs, but we already have
the fixes for that.
-- 
Peter Wemm - peter at wemm.org; peter at FreeBSD.org; peter at yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won’t do…