svn commit: r270759 - in head/sys: cddl/compat/opensolaris/kern cddl/compat/opensolaris/sys cddl/contrib/opensolaris/uts/common/fs/zfs vm

Steven Hartland smh at freebsd.org
Sat Aug 30 01:03:50 UTC 2014


----- Original Message ----- 
From: "Peter Wemm" <peter at wemm.org>
> On Friday 29 August 2014 21:42:15 Steven Hartland wrote:
> > ----- Original Message -----
> > From: "Peter Wemm" <peter at wemm.org>
> >
> > > On Friday 29 August 2014 20:51:03 Steven Hartland wrote:
> > snip..
> >
> > > > Does Karl's explanation as to why this doesn't work above change
> > > > your mind?
> > >
> > > Actually no, I would expect the code as committed to *cause* the
> > > undesirable behavior that Karl described.
> > >
> > > ie: access a few large files and cause them to reside in cache. Say
> > > 50GB or so on a 200GB RAM machine.  We now have the state where:
> > >
> > > v_cache = 50GB
> > > v_free = 1MB
> > >
> > > The rest of the vm system looks at vm_paging_needed(), which is: do
> > > we have enough "v_cache + v_free"?  Since there's 50.001GB free, the
> > > answer is no.  It'll let v_free run right down to v_free_min because
> > > of the giant pool of v_cache just sitting there, waiting to be used.
> > >
> > > The zfs change, as committed, will ignore all the free memory in the
> > > form of v_cache, and will be freaking out about how low v_free is
> > > getting and will be sacrificing ARC in order to put more memory into
> > > the v_free pool.
> > >
> > > As long as ARC keeps sacrificing itself this way, the free pages in
> > > the v_cache pool won't get used.  When ARC finally runs out of pages
> > > to give up to v_free, the kernel will start using the free pages from
> > > v_cache.  Eventually it'll run down that v_cache free pool and arc
> > > will be in a bare minimum state while this is happening.
> > >
> > > Meanwhile, ZFS ARC will be crippled.  This has consequences - it does
> > > RCU like things from ARC to keep fragmentation under control.  With
> > > ARC crippled, fragmentation will increase because there's less
> > > opportunistic gathering of data from ARC.
> > >
> > > Granted, you have to get things freed from active/inactive to the
> > > cache state, but once it's there, depending on the workload, it'll
> > > mess with ARC.
> >
> > There's already a vm_paging_needed() check in there below, so this
> > will already be dealt with, will it not?
>
> No.
>
> If you read the code that you changed, you won't get that far.  The
> v_free test comes before vm_paging_needed(), and if the v_free test
> triggers then ARC will return pages and not look at the rest of the
> function.

Sorry, I should have phrased that question better: prior to the change
vm_paging_needed() was the top test (ignoring needfree), but it was still
causing performance issues.

Surely, with all the other return-1 cases triggering at what should be a
higher level of free memory, we should never have seen the performance
issues, yet users were reporting them a lot. So there's still some mystery
surrounding why this was happening.
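
Just to make sure I'm reading your scenario correctly, here's a rough
standalone sketch of the two comparisons with made-up numbers for the
200GB box / 50GB-in-cache case (all the counter values below are
hypothetical, and I'm treating kmem_free_count() as just v_free, as you
describe):

        #include <stdio.h>

        int
        main(void)
        {
                /* Hypothetical counters, in 4K pages, for the case described:
                 * ~200GB of RAM, ~50GB sitting in v_cache, v_free run down low. */
                unsigned long v_free_min    = 100000;           /* ~390MB */
                unsigned long v_free        = 110000;           /* ~430MB */
                unsigned long v_cache       = 13107200;         /* ~50GB  */
                unsigned long v_free_target = 4 * v_free_min + 10000;  /* + assumed v_free_reserved */

                unsigned long zfs_arc_free_target = v_free_target;     /* the default */
                unsigned long wakeup_thresh = (v_free_min / 10) * 11;  /* 110% of v_free_min */

                /* The committed test: only v_free is considered. */
                printf("v_free < zfs_arc_free_target      -> %d\n",
                    v_free < zfs_arc_free_target);              /* 1: ARC gives pages back */

                /* The vm_paging_needed()-style test: v_free + v_cache. */
                printf("v_free + v_cache < wakeup_thresh  -> %d\n",
                    (v_free + v_cache) < wakeup_thresh);        /* 0: never fires here */

                return (0);
        }

With numbers anything like those, the first test is the only one that ever
fires, which matches what you're describing.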

> If this function returns non-zero, ARC is given back:
>
> static int
> arc_reclaim_needed(void)
> {
>         if (kmem_free_count() < zfs_arc_free_target) {
>                 return (1);
>         }
>          /*
>          * Cooperate with pagedaemon when it's time for it to scan
>          * and reclaim some pages.
>          */
>         if (vm_paging_needed()) {
>                 return (1);
>         }
>
> ie: if v_free (ignoring v_cache free pages) gets below the threshold,
> stop everything and discard ARC pages.
>
> The vm_paging_needed() code is a NO-OP at this point.  It can never
> return true.  Consider:
>         vm_cnt.v_free_target = 4 * vm_cnt.v_free_min + vm_cnt.v_free_reserved;
> vs
>         vm_pageout_wakeup_thresh = (vm_cnt.v_free_min / 10) * 11;
>
> zfs_arc_free_target defaults to vm_cnt.v_free_target, which is 400% of
> v_free_min, and compares it against the smaller v_free pool.
>
> vm_paging_needed() compares the total free pool (v_free + v_cache)
> against the smaller wakeup threshold - 110% of v_free_min.
>
> Comparing a larger value against a smaller target than the previous test
> will never succeed unless you manually change the arc_free_target sysctl.

I'm aware of the values involved, and as I said, what you're proposing was
more akin to where I started, but I was informed that it had already been
tested and didn't work well.

So, as I'm sure you'll appreciate, given that information I opted to trust
the real-life tests.

Now it's totally possible there was something in the tests that was
skewing the result, but as they still indicated the performance issue was
there, whereas with the current values it wasn't, I opted for that vs.
what I believed was the more technically correct value.

Now that you've confirmed that what I initially thought should be the
correct values are indeed so, I've asked Karl to retest, so we can confirm
that none of the changes that went into stable/10 after that point have
changed this behaviour.
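
For reference, the relationship between the two thresholds with the stock
formulas is easy to sanity-check; a throw-away snippet like this (the
v_free_min and v_free_reserved values are just illustrative) shows why the
vm_paging_needed() test is effectively dead with the defaults:

        #include <assert.h>

        int
        main(void)
        {
                unsigned long v_free_min      = 100000; /* illustrative, in pages */
                unsigned long v_free_reserved = 10000;  /* illustrative */

                unsigned long v_free_target = 4 * v_free_min + v_free_reserved;  /* 410000 */
                unsigned long wakeup_thresh = (v_free_min / 10) * 11;            /* 110000 */

                /*
                 * zfs_arc_free_target defaults to v_free_target, which sits
                 * well above the pagedaemon wakeup threshold, and it is
                 * compared against the smaller v_free pool, so the first
                 * test always returns 1 long before vm_paging_needed() could.
                 */
                assert(v_free_target > wakeup_thresh);
                return (0);
        }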

Hope that makes sense?

> Also, what about the magic numbers here:
> u_int zfs_arc_free_target = (1 << 19); /* default before pagedaemon init only */

That is just a total fallback case and should never be triggered unless,
as the comment states, the pagedaemon isn't initialised.

> That's half a million pages, or 2GB of physical ram on a 4K page size
> system.  How is this going to work on early boot in the machines in the
> cluster with less than 2GB of ram?

It's there to ensure that ARC doesn't run wild for the few milliseconds /
seconds before the pagedaemon is initialised.

We can change the value, no problem; what would you suggest, 1<<16 aka
256MB?
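
Just to spell out the arithmetic behind those two values (assuming a 4K
page size):

        (1 << 19) pages * 4096 bytes/page = 2147483648 bytes = 2GB   (current fallback)
        (1 << 16) pages * 4096 bytes/page =  268435456 bytes = 256MB (suggested)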

Thanks for all the feedback; it's great to have my understanding of how
things work in this area confirmed by those who know.

Hopefully we'll be able to get to the bottom of this with everyone's help
and get a solid fix into 10.1 for these issues that have plagued 10 :)


