kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix

Tue Mar 18 17:45:59 UTC 2014

on 18/03/2014 19:19 Karl Denninger said the following:
> 
> On 3/18/2014 10:20 AM, Andriy Gapon wrote:
>> The following reply was made to PR kern/187594; it has been noted by GNATS.
>>
>> From: Andriy Gapon <avg at FreeBSD.org>
>> To: bug-followup at FreeBSD.org, karl at fs.denninger.net
>> Cc:
>> Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
>> Date: Tue, 18 Mar 2014 17:15:05 +0200
>>
>>   Karl Denninger <karl at fs.denninger.net> wrote:
>>   > ZFS can be convinced to engage in pathological behavior due to a bad
>>   > low-memory test in arc.c
>>   >
>>   > The offending file is at
>>   > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c; it allegedly
>>   > checks for 25% free memory, and if it is less asks for the cache to shrink.
>>   >
>>   > (snippet from arc.c around line 2494 of arc.c in 10-STABLE; path
>>   > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
>>   >
>>   > #else /* !sun */
>>   > if (kmem_used() > (kmem_size() * 3) / 4)
>>   > return (1);
>>   > #endif /* sun */
>>   >
>>   > Unfortunately these two functions do not return what the authors thought
>>   > they did. It's clear what they're trying to do from the Solaris-specific
>>   > code up above this test.
>>     No, these functions do return what the authors think they do.
>>   The check is for KVA usage (kernel virtual address space), not for
>>   physical memory.
> I understand, but that's nonsensical in the context of the Solaris code. 
> "lotsfree" is *not* a declaration of free kvm space, it's a declaration of when
> the system has "lots" of free *physical* memory.

No, it's not nonsensical.
The replacement for the lotsfree check is vm_paging_needed().
The kmem_* calls are the replacement for the vmem_* calls in the Solaris code.
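
To make the distinction concrete, here is a minimal userland sketch of the two
independent conditions.  It is not the arc.c code: the struct and the function
names are hypothetical, the counters are supplied by hand, and the paging test
is only assumed to approximate vm_paging_needed() as "free pages below the
page daemon's wakeup threshold".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the counters involved; not a kernel structure. */
struct mem_snapshot {
	uint64_t kva_used;      /* stands in for kmem_used() */
	uint64_t kva_size;      /* stands in for kmem_size() */
	uint64_t free_pages;    /* free + cache pages (physical memory) */
	uint64_t wakeup_thresh; /* assumed page daemon wakeup threshold */
};

/* Kernel virtual address space pressure: more than 75% of KVA in use. */
bool
kva_pressure(const struct mem_snapshot *m)
{
	return (m->kva_used > (m->kva_size * 3) / 4);
}

/* Physical memory pressure: the role played by the lotsfree test. */
bool
paging_needed(const struct mem_snapshot *m)
{
	return (m->free_pages < m->wakeup_thresh);
}

/* Either condition alone is enough to ask the ARC to shrink. */
int
reclaim_needed(const struct mem_snapshot *m)
{
	return ((paging_needed(m) || kva_pressure(m)) ? 1 : 0);
}
```

With plenty of KVA but few free pages, reclaim_needed() still returns 1
because of the physical memory test; the KVA test guards a different resource.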

> Further it makes no sense at all to allow the ARC cache to force things into
> virtual (e.g. swap-space backed) memory.

It seems that you don't have a proper understanding of what kernel virtual
memory is.  That makes the conversation harder.

> But that's the behavior that has been
> observed, and it fits with the code as originally written.
> 
>>     > The result is that the cache only shrinks when vm_paging_needed() tests
>>   > true, but by that time the system is in serious memory trouble and by
>>     No, it is not.
>>   The description and numbers here are a little bit outdated but they should give
>>   an idea of how paging works in general:
>>   https://wiki.freebsd.org/AvgPageoutAlgorithm
>>     > triggering only there it actually drives the system further into paging,
>>     How does ARC eviction drive the system further into paging?
> 1. System gets low on physical memory but the ARC cache is looking at available
> kvm (of which there is plenty.)  The ARC cache continues to expand.
> 
> 2. vm_paging_needed() returns true and the system begins to page off to the
> swap.  At the same time the ARC cache is pared down because arc_reclaim_needed
> has returned "1".

Except that ARC is supposed to be evicted before the page daemon does anything.

> 3. As the ARC cache shrinks and paging occurs vm_paging_needed() returns false. 
> Paging out ceases but inactive pages remain on the swap.  They are not recalled
> until and unless they are scheduled to execute.  Arc_reclaim_needed again
> returns "0".
> 
> 4. The hold-down timer expires in the ARC cache code ("arc_grow_retry", declared
> as 60 seconds) and the ARC cache begins to expand again.
> 
> Go back to #2 until the system's performance starts to deteriorate badly enough
> due to the paging that you notice it, which occurs when something that is
> actually consuming CPU time has to be called in from swap.
> 
> This is consistent with what I and others have observed on both 9.2 and 10.0;
> the ARC will expand until it hits the maximum configured even at the expense of
> forcing pages onto the swap.  In this specific machine's case left to defaults
> it will grab nearly all physical memory (over 20GB of 24) and wire it down.

Well, this does not match my experience from before the 10.x days.

> Limiting arc_max to 16GB sorta fixes it.  I say "sorta" because it turns out
> that 16GB is still too much for the workload; it prevents the pathological
> behavior where system "stalls" happen but only in the extreme.  It turns out
> with the patch in my ARC cache stabilizes at about 13.5GB during the busiest
> part of the day, growing to about 16 off-hours.
> 
> One of the problems with just limiting it in /boot/loader.conf is that you have
> to guess and the system doesn't reasonably adapt to changing memory loads.  The
> code is clearly intended to do that but it doesn't end up working that way in
> practice.
>>     > because the pager will not recall pages from the swap until they are next
>>   > executed. This leads the ARC to try to fill in all the available RAM even
>>   > though pages have been pushed off onto swap. Not good.
>>     Unused physical memory is a waste.  It is true that ARC tries to use as
>>   much memory as it is allowed.  The same applies to the page cache (Active,
>>   Inactive).
>>   Memory management is a dynamic system and there are a few competing agents.
>>   
> That's true.  However, what the stock code does is force working set out of
> memory and into the swap.  The ideal situation is one in which there is no free
> memory because cache has sized itself to consume everything *not* necessary for
> the working set of the processes that are running.  Unfortunately we cannot
> determine this presciently because a new process may come along and we do not
> necessarily know for how long a process that is blocked on an event will remain
> blocked (e.g. something waiting on network I/O, etc.)
> 
> However, it is my contention that you do not want to evict a process that is
> scheduled to run (or is going to be) in favor of disk cache because you're
> defeating yourself by doing so.  The point of the disk cache is to avoid going
> to the physical disk for I/O, but if you page something you have ditched a
> physical I/O for data in favor of having to go to physical disk *twice* -- first
> to write the paged-out data to swap, and then to retrieve it when it is to be
> executed.  This also appears to be consistent with what is present for Solaris
> machines.
> 
> From the Sun code:
> 
> #ifdef sun
>         /*
>          * take 'desfree' extra pages, so we reclaim sooner, rather than later
>          */
>         extra = desfree;
>  
>         /*
>          * check that we're out of range of the pageout scanner.  It starts to
>          * schedule paging if freemem is less than lotsfree and needfree.
>          * lotsfree is the high-water mark for pageout, and needfree is the
>          * number of needed free pages.  We add extra pages here to make sure
>          * the scanner doesn't start up while we're freeing memory.
>          */
>         if (freemem < lotsfree + needfree + extra)
>                 return (1);
>  
>         /*
>          * check to make sure that swapfs has enough space so that anon
>          * reservations can still succeed. anon_resvmem() checks that the
>          * availrmem is greater than swapfs_minfree, and the number of reserved
>          * swap pages.  We also add a bit of extra here just to prevent
>          * circumstances from getting really dire.
>          */
>         if (availrmem < swapfs_minfree + swapfs_reserve + extra)
>                 return (1);
> 
> "freemem" is not virtual memory, it's actual memory.  "Lotsfree" is the point
> where the system considers free RAM to be "ample"; "needfree" is the
> "desperation" point and "extra" is the margin (presumably for image activation.)
> 
> The base code on FreeBSD doesn't look at physical memory at all; it looks at kvm
> space instead.

This is an incorrect statement as I explained above.  vm_paging_needed() looks
at physical memory.

>>   It is hard to correctly tune that system using a large hammer such as your
>>   patch.  I believe that with your patch ARC will get shrunk to its minimum size
>>   in due time.  Active + Inactive will grow to use the memory that you are
>>   denying to ARC, driving Free below a threshold, which will reduce ARC.
>>   Repeated enough times this will drive ARC to its minimum.
> I disagree both in design theory and based on the empirical evidence of actual
> operation.
> 
> First, I don't (ever) want to give memory to the ARC cache that otherwise would
> go to "active", because any time I do that I'm going to force two page events,
> which is double the amount of I/O I would take on a cache *miss*, and even with
> the ARC at minimum I get a reasonable hit percentage.  If I therefore prefer ARC
> over "active" pages I am going to take *at least* a 200% penalty on physical I/O
> and if I get an 80% hit ratio with the ARC at a minimum the penalty is closer to
> 800%!
> 
> For inactive pages it's a bit more complicated as those may not be reactivated. 
> However, I am trusting FreeBSD's VM subsystem to demote those that are unlikely
> to be reactivated to the cache bucket and then to "free", where they are able to
> be re-used.  This is consistent with what I actually see on a running system --
> the "inact" bucket is typically fairly large (often on a busy machine close to
> that of "active") but pages demoted to "cache" don't stay there long - they
> either get re-promoted back up or they are freed and go on the free list.
> 
> The only time I see "inact" get out of control is when there's a kernel memory
> leak somewhere (such as what I ran into the other day with the in-kernel NAT
> subsystem on 10-STABLE.)  But that's a bug and if it happens you're going to get
> bit anyway.
> 
> For example right now on one of my very busy systems with 24GB of installed RAM
> and many terabytes of storage across three ZFS pools I'm seeing 17GB wired of
> which 13.5 is ARC cache.  That's the adaptive figure it currently is running at,
> with a maximum of 22.3 and a minimum of 2.79 (8:1 ratio.)  The remainder is
> wired down for other reasons (there's a fairly large Postgres server running on
> that box, among other things, and it has a big shared buffer declaration --
> that's most of the difference.)  Cache hit efficiency is currently 97.8%.
> 
> Active is 2.26G right now, and inactive is 2.09G.  Both are stable. Overnight
> inactive will drop to about 1.1GB while active will not change all that much
> since most of it is postgres and the middleware that talks to it along with apache,
> which leaves most of its processes present even when they go idle.  Peak load
> times are about right now (mid-day), and again when the system is running
> backups nightly.
> 
> Cache is 7448, in other words, insignificant.  Free memory is 2.6G.
> 
> The tunable is set to 10%, which is almost exactly what free memory is.  I find
> that when the system gets under 1G free transient image activation can drive it
> into paging and performance starts to suffer for my particular workload.
> 
>>     Also, there are a few technical problems with the patch:
>>   - you don't need to use the sysctl interface in the kernel, the values you
>>   need are available directly, just take a look at e.g. the implementation of
>>   vm_paging_needed()
> That's easily fixed.  I will look at it.
>>   - similarly, querying vfs.zfs.arc_freepage_percent_target value via
>>   kernel_sysctlbyname is just bogus; you can use percent_target directly
> I did not know if during setup of the OID the value was copied (and thus you had
> to reference it later on) or the entry simply took the pointer and stashed
> that.  Easily corrected.
>>   - you don't need to sum various page counters to get a total count, there is
>>   v_page_count
>>   
> Fair enough as well.
>>   Lastly, can you try to test reverting your patch and instead setting
>>   vm.lowmem_period=0 ?
>>   
> Yes.  By default it's 10; I have not tampered with that default.
> 
> Let me do a bit of work and I'll post back with a revised patch. Perhaps a
> tunable for percentage free + a free reserve that is a "floor"?  The problem
> with that is where to put the defaults.  One option would be to grab total size
> at init time and compute something similar to what "lotsfree" is for Solaris,
> allowing that to be tuned with the percentage if desired.  I selected 25%
> because that's what the original test was expressing and it should be reasonable
> for modest RAM configurations.  It's clearly too high for moderately large (or
> huge) memory machines unless they have a lot of RAM-hungry processes running on
> them.
> 
> The percentage test, however, is an easy knob to twist that is unlikely to
> severely harm you if you dial it too far in either direction; anyone setting it
> to zero obviously knows what they're getting into, and if you crank it too high
> all you end up doing is limiting the ARC to the minimum value.
> 
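
For reference, the "percentage free plus a floor" idea above could be sketched
roughly as follows.  This is a userland illustration only, under stated
assumptions: the function names, the parameters and the choice of taking the
larger of the two limits are all hypothetical and are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of pages that should stay free.
 * The percentage is the tunable; floor_pages is a reserve that
 * applies even when the percentage is set to zero.
 */
uint64_t
free_target_pages(uint64_t total_pages, unsigned percent, uint64_t floor_pages)
{
	uint64_t by_percent = (total_pages * percent) / 100;

	return (by_percent > floor_pages ? by_percent : floor_pages);
}

/* Hypothetical check: ask the ARC to shrink once free memory drops below target. */
bool
arc_should_shrink(uint64_t free_pages, uint64_t total_pages, unsigned percent,
    uint64_t floor_pages)
{
	return (free_pages < free_target_pages(total_pages, percent, floor_pages));
}
```

For a 24GB machine with 4KB pages (6291456 pages), a 10% setting and a 1GB
floor (262144 pages), the target is the percentage figure; with the percentage
at zero the floor alone applies.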

-- 
Andriy Gapon