kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix

Karl Denninger karl at denninger.net
Tue Mar 18 17:19:39 UTC 2014


On 3/18/2014 10:20 AM, Andriy Gapon wrote:
> The following reply was made to PR kern/187594; it has been noted by GNATS.
>
> From: Andriy Gapon <avg at FreeBSD.org>
> To: bug-followup at FreeBSD.org, karl at fs.denninger.net
> Cc:
> Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
> Date: Tue, 18 Mar 2014 17:15:05 +0200
>
>   Karl Denninger <karl at fs.denninger.net> wrote:
>   > ZFS can be convinced to engage in pathological behavior due to a bad
>   > low-memory test in arc.c
>   >
>   > The offending file is at
>   > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c; it allegedly
>   > checks for 25% free memory, and if it is less asks for the cache to shrink.
>   >
>   > (snippet from arc.c around line 2494 of arc.c in 10-STABLE; path
>   > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
>   >
>   > #else /* !sun */
>   > if (kmem_used() > (kmem_size() * 3) / 4)
>   > return (1);
>   > #endif /* sun */
>   >
>   > Unfortunately these two functions do not return what the authors thought
>   > they did. It's clear what they're trying to do from the Solaris-specific
>   > code up above this test.
>   
>   No, these functions do return what the authors think they do.
>   The check is for KVA usage (kernel virtual address space), not for physical memory.
I understand, but that's nonsensical in the context of the Solaris 
code.  "lotsfree" is *not* a declaration of free kvm space, it's a 
declaration of when the system has "lots" of free *physical* memory.

Further, it makes no sense at all to allow the ARC cache to force things 
into virtual (i.e. swap-space-backed) memory.  But that's the behavior 
that has been observed, and it fits with the code as originally written.

>   
>   > The result is that the cache only shrinks when vm_paging_needed() tests
>   > true, but by that time the system is in serious memory trouble and by
>   
>   No, it is not.
>   The description and numbers here are a little bit outdated but they should give
>   an idea of how paging works in general:
>   https://wiki.freebsd.org/AvgPageoutAlgorithm
>   
>   > triggering only there it actually drives the system further into paging,
>   
>   How does ARC eviction drive the system further into paging?
1. System gets low on physical memory but the ARC cache is looking at 
available kvm (of which there is plenty.)  The ARC cache continues to 
expand.

2. vm_paging_needed() returns true and the system begins to page out to 
swap.  At the same time the ARC cache is pared down because 
arc_reclaim_needed() has returned "1".

3. As the ARC cache shrinks and paging occurs, vm_paging_needed() returns 
false.  Paging out ceases but the inactive pages remain on swap.  They 
are not recalled until and unless they are scheduled to execute.  
arc_reclaim_needed() again returns "0".

4. The hold-down timer expires in the ARC cache code ("arc_grow_retry", 
declared as 60 seconds) and the ARC cache begins to expand again.

Go back to #2 until the paging degrades the system's performance badly 
enough that you notice it, which occurs when something that is actually 
consuming CPU time has to be called back in from swap.

This is consistent with what I and others have observed on both 9.2 and 
10.0; the ARC will expand until it hits the configured maximum, even at 
the expense of forcing pages onto swap.  In this specific machine's 
case, left at defaults it will grab nearly all physical memory (over 20GB 
of 24GB) and wire it down.

Limiting arc_max to 16GB sorta fixes it.  I say "sorta" because it turns 
out that 16GB is still too much for the workload; it prevents the 
pathological behavior where system "stalls" happen, but only in the 
extreme.  It turns out that with the patch in, my ARC cache stabilizes at 
about 13.5GB during the busiest part of the day, growing to about 16GB 
off-hours.

One of the problems with just limiting it in /boot/loader.conf is that 
you have to guess, and the system doesn't reasonably adapt to changing 
memory loads.  The code is clearly intended to do that but it doesn't 
end up working that way in practice.
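
For reference, the static cap I'm describing is just the usual loader 
tunable; the value shown here is the 16GB case above:

        # /boot/loader.conf
        vfs.zfs.arc_max="16G"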
>   
>   > because the pager will not recall pages from the swap until they are next
>   > executed. This leads the ARC to try to fill in all the available RAM even
>   > though pages have been pushed off onto swap. Not good.
>   
>   Unused physical memory is a waste.  It is true that ARC tries to use as much of
>   memory as it is allowed.  The same applies to the page cache (Active, Inactive).
>   Memory management is a dynamic system and there are a few competing agents.
>   
That's true.  However, what the stock code does is force the working set 
out of memory and into swap.  The ideal situation is one in which there 
is no free memory because the cache has sized itself to consume everything 
*not* necessary for the working set of the processes that are running.  
Unfortunately we cannot determine this presciently, because a new process 
may come along and we do not necessarily know for how long a process 
that is blocked on an event will remain blocked (e.g. something waiting 
on network I/O.)

However, it is my contention that you do not want to evict a process 
that is scheduled to run (or is going to be) in favor of disk cache, 
because you're defeating yourself by doing so.  The point of the disk 
cache is to avoid going to the physical disk for I/O, but if you page 
something out you have traded one potential physical I/O for having to 
go to physical disk *twice* -- first to write the paged-out data to 
swap, and then to retrieve it when the process is next scheduled to run.  
This also appears to be consistent with what is present for Solaris 
machines.

From the Sun code:

#ifdef sun
         /*
          * take 'desfree' extra pages, so we reclaim sooner, rather than later
          */
         extra = desfree;
  
         /*
          * check that we're out of range of the pageout scanner.  It starts to
          * schedule paging if freemem is less than lotsfree and needfree.
          * lotsfree is the high-water mark for pageout, and needfree is the
          * number of needed free pages.  We add extra pages here to make sure
          * the scanner doesn't start up while we're freeing memory.
          */
         if (freemem < lotsfree + needfree + extra)
                 return (1);
  
         /*
          * check to make sure that swapfs has enough space so that anon
          * reservations can still succeed. anon_resvmem() checks that the
          * availrmem is greater than swapfs_minfree, and the number of reserved
          * swap pages.  We also add a bit of extra here just to prevent
          * circumstances from getting really dire.
          */
         if (availrmem < swapfs_minfree + swapfs_reserve + extra)
                 return (1);

"freemem" is not virtual memory, it's actual memory.  "Lotsfree" is the 
point where the system considers free RAM to be "ample"; "needfree" is 
the "desperation" point and "extra" is the margin (presumably for image 
activation.)

The base code on FreeBSD doesn't look at physical memory at all; it 
looks at kvm space instead.
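
For contrast, here's a rough sketch (not the actual patch, and 
"zfs_arc_free_target" is a made-up name for illustration) of what a 
physical-memory test in the spirit of the Solaris check might look like 
on FreeBSD, keyed to the VM's page counters rather than KVA:

#include <sys/param.h>
#include <sys/vmmeter.h>        /* struct vmmeter cnt: VM page counters */

/* Hypothetical threshold, in pages, below which the ARC should shrink. */
static u_int zfs_arc_free_target;

static int
arc_reclaim_needed_physmem(void)
{
        /*
         * v_free_count and v_cache_count are physical page counts, so
         * this mirrors "freemem < lotsfree + needfree + extra" instead
         * of testing kernel virtual address space.
         */
        if (cnt.v_free_count + cnt.v_cache_count < zfs_arc_free_target)
                return (1);
        return (0);
}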

>   It is hard to correctly tune that system using a large hammer such as your
>   patch.  I believe that with your patch ARC will get shrunk to its minimum size
>   in due time.  Active + Inactive will grow to use the memory that you are denying
>   to ARC, driving Free below a threshold, which will reduce ARC.  Repeated enough
>   times this will drive ARC to its minimum.
I disagree both in design theory and based on the empirical evidence of 
actual operation.

First, I don't (ever) want to give memory to the ARC cache that 
otherwise would go to "active", because any time I do that I'm going to 
force two page events, which is double the amount of I/O I would take on 
a cache *miss*; and even with the ARC at its minimum I get a reasonable 
hit percentage.  If I therefore prefer ARC over "active" pages I am going 
to take *at least* a 200% penalty on physical I/O, and if I get an 80% 
hit ratio with the ARC at its minimum the penalty is closer to 800%!

For inactive pages it's a bit more complicated as those may not be 
reactivated.  However, I am trusting FreeBSD's VM subsystem to demote 
those that are unlikely to be reactivated to the cache bucket and then 
to "free", where they are able to be re-used.  This is consistent with 
what I actually see on a running system -- the "inact" bucket is 
typically fairly large (often on a busy machine close to that of 
"active") but pages demoted to "cache" don't stay there long - they 
either get re-promoted back up or they are freed and go on the free list.

The only time I see "inact" get out of control is when there's a kernel 
memory leak somewhere (such as what I ran into the other day with the 
in-kernel NAT subsystem on 10-STABLE.)  But that's a bug and if it 
happens you're going to get bit anyway.

For example, right now on one of my very busy systems with 24GB of 
installed RAM and many terabytes of storage across three ZFS pools I'm 
seeing 17GB wired, of which 13.5GB is ARC cache.  That's the adaptive 
figure it is currently running at, with a maximum of 22.3GB and a minimum 
of 2.79GB (an 8:1 ratio.)  The remainder is wired down for other reasons 
(there's a fairly large Postgres server running on that box, among other 
things, and it has a big shared buffer declaration -- that's most of the 
difference.)  Cache hit efficiency is currently 97.8%.

Active is 2.26GB right now, and inactive is 2.09GB.  Both are stable. 
Overnight inactive will drop to about 1.1GB while active will not change 
all that much, since most of it is postgres and the middleware that talks 
to it, along with apache, which leaves most of its processes present even 
when they go idle.  Peak load times are about right now (mid-day), and 
again when the system is running backups nightly.

Cache is 7448; in other words, insignificant.  Free memory is 2.6GB.

The tunable is set to 10%, which is almost exactly what free memory is.  
I find that when the system gets under 1GB free, transient image 
activation can drive it into paging and performance starts to suffer for 
my particular workload.
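
That setting is just the patch's sysctl (assuming it is left read-write 
at runtime):

        sysctl vfs.zfs.arc_freepage_percent_target=10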

>   
>   Also, there are a few technical problems with the patch:
>   - you don't need to use sysctl interface in kernel, the values you need are
>   available directly, just take a look at e.g. implementation of vm_paging_needed()
That's easily fixed.  I will look at it.
>   - similarly, querying vfs.zfs.arc_freepage_percent_target value via
>   kernel_sysctlbyname is just bogus; you can use percent_target directly
I did not know whether, during setup of the OID, the value was copied 
(and thus you had to query it through the sysctl interface later on) or 
the entry simply took the pointer and stashed that.  Easily corrected.
>   - you don't need to sum various page counters to get a total count, there is
>   v_page_count
>   
Fair enough as well.
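
To make sure I understand the suggestion, something along these lines 
(sketch only; the sysctl name is the one from my patch, and the rest is 
my reading of the standard SYSCTL_UINT pattern):

#include <sys/param.h>
#include <sys/sysctl.h>
#include <sys/vmmeter.h>

SYSCTL_DECL(_vfs_zfs);

/*
 * The OID registration keeps a pointer to the variable, so sysctl reads
 * and writes go straight to it and the code can simply reference it; no
 * kernel_sysctlbyname() round trip needed.
 */
static u_int zfs_arc_freepage_percent_target = 25;
SYSCTL_UINT(_vfs_zfs, OID_AUTO, arc_freepage_percent_target, CTLFLAG_RW,
    &zfs_arc_freepage_percent_target, 0,
    "Percentage of physical memory to try to keep free");

static uint64_t
arc_free_page_target(void)
{
        /* v_page_count already holds the total; no summing of counters. */
        return ((uint64_t)cnt.v_page_count *
            zfs_arc_freepage_percent_target / 100);
}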
>   Lastly, can you try to test reverting your patch and instead setting
>   vm.lowmem_period=0 ?
>   
Yes.  By default it's 10; I have not tampered with that default.
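
In other words the test run would be the patch reverted plus:

        sysctl vm.lowmem_period=0

at runtime (or the equivalent entry in /etc/sysctl.conf).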

Let me do a bit of work and I'll post back with a revised patch.  Perhaps 
a tunable for the percentage free plus a free reserve that acts as a 
"floor"?  The problem with that is where to put the defaults.  One option 
would be to grab the total memory size at init time and compute something 
similar to what "lotsfree" is on Solaris, allowing that to be tuned with 
the percentage if desired.  I selected 25% because that's what the 
original test was expressing and it should be reasonable for modest RAM 
configurations.  It's clearly too high for moderately large (or huge) 
memory machines unless they have a lot of RAM-hungry processes running 
on them.
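
Roughly the shape I have in mind (placeholder names and the 1/64 
fraction are illustrative only; this builds on the percent tunable from 
the sketch above and is not a finished patch):

static uint64_t zfs_arc_free_reserve;   /* absolute floor, in pages */

static void
arc_free_target_init(void)
{
        /* Run once during ARC setup, like computing lotsfree on Solaris. */
        zfs_arc_free_reserve = (uint64_t)cnt.v_page_count / 64;
}

static int
arc_free_memory_low(void)
{
        uint64_t target;

        /* Percentage of total pages, but never below the reserve floor. */
        target = (uint64_t)cnt.v_page_count *
            zfs_arc_freepage_percent_target / 100;
        if (target < zfs_arc_free_reserve)
                target = zfs_arc_free_reserve;
        return (cnt.v_free_count + cnt.v_cache_count < target);
}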

The percentage test, however, is an easy knob to twist that is unlikely 
to severely harm you if you dial it too far in either direction; anyone 
setting it to zero obviously knows what they're getting into, and if you 
crank it too high all you end up doing is limiting the ARC to the 
minimum value.

-- 
-- Karl
karl at denninger.net

