kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
Karl Denninger
karl at denninger.net
Wed Mar 19 14:18:48 UTC 2014
On 3/18/2014 12:19 PM, Karl Denninger wrote:
>
> On 3/18/2014 10:20 AM, Andriy Gapon wrote:
>> The following reply was made to PR kern/187594; it has been noted by
>> GNATS.
>>
>> From: Andriy Gapon <avg at FreeBSD.org>
>> To: bug-followup at FreeBSD.org, karl at fs.denninger.net
>> Cc:
>> Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
>> Date: Tue, 18 Mar 2014 17:15:05 +0200
>>
>> Karl Denninger <karl at fs.denninger.net> wrote:
>> > ZFS can be convinced to engage in pathological behavior due to a bad
>> > low-memory test in arc.c
>> >
>> > The offending file is at
>> > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c; it
>> > allegedly checks for 25% free memory, and if it is less asks for
>> > the cache to shrink.
>> >
>> > (snippet from around line 2494 of arc.c in 10-STABLE; path
>> > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
>> >
>> > #else /* !sun */
>> > if (kmem_used() > (kmem_size() * 3) / 4)
>> > return (1);
>> > #endif /* sun */
>> >
>> > Unfortunately these two functions do not return what the authors
>> > thought they did. It's clear what they're trying to do from the
>> > Solaris-specific code up above this test.
>> No, these functions do return what the authors think they do.
>> The check is for KVA usage (kernel virtual address space), not for
>> physical memory.
> I understand, but that's nonsensical in the context of the Solaris
> code. "lotsfree" is *not* a declaration of free kvm space, it's a
> declaration of when the system has "lots" of free *physical* memory.
>
> Further it makes no sense at all to allow the ARC cache to force
> things into virtual (e.g. swap-space backed) memory. But that's the
> behavior that has been observed, and it fits with the code as
> originally written.
>
>> > The result is that the cache only shrinks when vm_paging_needed()
>> > tests true, but by that time the system is in serious memory
>> > trouble and by
>> No, it is not.
>> The description and numbers here are a little bit outdated but they
>> should give an idea of how paging works in general:
>> https://wiki.freebsd.org/AvgPageoutAlgorithm
>> > triggering only there it actually drives the system further into
>> > paging,
>> How does ARC eviction drive the system further into paging?
> 1. System gets low on physical memory but the ARC cache is looking at
> available kvm (of which there is plenty.) The ARC cache continues to
> expand.
>
> 2. vm_paging_needed() returns true and the system begins to page off
> to the swap. At the same time the ARC cache is pared down because
> arc_reclaim_needed has returned "1".
>
> 3. As the ARC cache shrinks and paging occurs vm_paging_needed()
> returns false. Paging out ceases but inactive pages remain on the
> swap. They are not recalled until and unless they are scheduled to
> execute. arc_reclaim_needed() again returns "0".
>
> 4. The hold-down timer expires in the ARC cache code
> ("arc_grow_retry", declared as 60 seconds) and the ARC cache begins to
> expand again.
>
> Go back to #2 until the system's performance starts to deteriorate
> badly enough due to the paging that you notice it, which occurs when
> something that is actually consuming CPU time has to be called in from
> swap.
>
> This is consistent with what I and others have observed on both 9.2
> and 10.0; the ARC will expand until it hits the maximum configured
> even at the expense of forcing pages onto the swap. In this specific
> machine's case left to defaults it will grab nearly all physical
> memory (over 20GB of 24) and wire it down.
>
> Limiting arc_max to 16GB sorta fixes it. I say "sorta" because it
> turns out that 16GB is still too much for the workload; the limit
> prevents the pathological system "stalls" only in the extreme case.
> With the patch in, my ARC cache stabilizes at about 13.5GB during the
> busiest part of the day, growing to about 16GB off-hours.
>
> One of the problems with just limiting it in /boot/loader.conf is that
> you have to guess and the system doesn't reasonably adapt to changing
> memory loads. The code is clearly intended to do that but it doesn't
> end up working that way in practice.
>> > because the pager will not recall pages from the swap until they
>> > are next executed. This leads the ARC to try to fill in all the
>> > available RAM even though pages have been pushed off onto swap.
>> > Not good.
>> Unused physical memory is a waste. It is true that ARC tries to use
>> as much of memory as it is allowed. The same applies to the page
>> cache (Active, Inactive).
>> Memory management is a dynamic system and there are a few competing
>> agents.
> That's true. However, what the stock code does is force working set
> out of memory and into the swap. The ideal situation is one in which
> there is no free memory because cache has sized itself to consume
> everything *not* necessary for the working set of the processes that
> are running. Unfortunately we cannot determine this presciently
> because a new process may come along and we do not necessarily know
> for how long a process that is blocked on an event will remain blocked
> (e.g. something waiting on network I/O, etc.)
>
> However, it is my contention that you do not want to evict a process
> that is scheduled to run (or is going to be) in favor of disk cache
> because you're defeating yourself by doing so. The point of the disk
> cache is to avoid going to the physical disk for I/O, but if you page
> something you have ditched a physical I/O for data in favor of having
> to go to physical disk *twice* -- first to write the paged-out data to
> swap, and then to retrieve it when it is to be executed. This also
> appears to be consistent with what is present for Solaris machines.
>
> From the Sun code:
>
> #ifdef sun
>     /*
>      * take 'desfree' extra pages, so we reclaim sooner, rather than
>      * later
>      */
>     extra = desfree;
>
>     /*
>      * check that we're out of range of the pageout scanner. It starts
>      * to schedule paging if freemem is less than lotsfree and needfree.
>      * lotsfree is the high-water mark for pageout, and needfree is the
>      * number of needed free pages. We add extra pages here to make
>      * sure the scanner doesn't start up while we're freeing memory.
>      */
>     if (freemem < lotsfree + needfree + extra)
>         return (1);
>
>     /*
>      * check to make sure that swapfs has enough space so that anon
>      * reservations can still succeed. anon_resvmem() checks that the
>      * availrmem is greater than swapfs_minfree, and the number of
>      * reserved swap pages. We also add a bit of extra here just to
>      * prevent circumstances from getting really dire.
>      */
>     if (availrmem < swapfs_minfree + swapfs_reserve + extra)
>         return (1);
>
> "freemem" is not virtual memory, it's actual memory. "Lotsfree" is
> the point where the system considers free RAM to be "ample";
> "needfree" is the "desperation" point and "extra" is the margin
> (presumably for image activation.)
>
> The base code on FreeBSD doesn't look at physical memory at all; it
> looks at kvm space instead.
>
>> It is hard to correctly tune that system using a large hammer such
>> as your patch. I believe that with your patch ARC will get shrunk
>> to its minimum size in due time. Active + Inactive will grow to use
>> the memory that you are denying to ARC, driving Free below a
>> threshold, which will reduce ARC. Repeated enough times this will
>> drive ARC to its minimum.
> I disagree both in design theory and based on the empirical evidence
> of actual operation.
>
> First, I don't (ever) want to give memory to the ARC cache that
> otherwise would go to "active", because any time I do that I'm going
> to force two page events, which is double the amount of I/O I would
> take on a cache *miss*, and even with the ARC at minimum I get a
> reasonable hit percentage. If I therefore prefer ARC over "active"
> pages I am going to take *at least* a 200% penalty on physical I/O and
> if I get an 80% hit ratio with the ARC at a minimum the penalty is
> closer to 800%!
>
> For inactive pages it's a bit more complicated as those may not be
> reactivated. However, I am trusting FreeBSD's VM subsystem to demote
> those that are unlikely to be reactivated to the cache bucket and then
> to "free", where they are able to be re-used. This is consistent with
> what I actually see on a running system -- the "inact" bucket is
> typically fairly large (often on a busy machine close to that of
> "active") but pages demoted to "cache" don't stay there long - they
> either get re-promoted back up or they are freed and go on the free list.
>
> The only time I see "inact" get out of control is when there's a
> kernel memory leak somewhere (such as what I ran into the other day
> with the in-kernel NAT subsystem on 10-STABLE.) But that's a bug and
> if it happens you're going to get bit anyway.
>
> For example right now on one of my very busy systems with 24GB of
> installed RAM and many terabytes of storage across three ZFS pools I'm
> seeing 17GB wired of which 13.5 is ARC cache. That's the adaptive
> figure it currently is running at, with a maximum of 22.3 and a
> minimum of 2.79 (8:1 ratio.) The remainder is wired down for other
> reasons (there's a fairly large Postgres server running on that box,
> among other things, and it has a big shared buffer declaration --
> that's most of the difference.) Cache hit efficiency is currently 97.8%.
>
> Active is 2.26G right now, and inactive is 2.09G. Both are stable.
> Overnight inactive will drop to about 1.1GB while active will not
> change all that much since most of it is postgres and the middleware
> that talks to it along with apache, which leaves most of its processes
> present even when they go idle. Peak load times are about right now
> (mid-day), and again when the system is running backups nightly.
>
> Cache is 7448, in other words, insignificant. Free memory is 2.6G.
>
> The tunable is set to 10%, which is almost exactly what free memory
> is. I find that when the system gets under 1G free transient image
> activation can drive it into paging and performance starts to suffer
> for my particular workload.
>
>> Also, there are a few technical problems with the patch:
>> - you don't need to use sysctl interface in kernel, the values you
>> need are available directly, just take a look at e.g.
>> implementation of vm_paging_needed()
> That's easily fixed. I will look at it.
>> - similarly, querying vfs.zfs.arc_freepage_percent_target value via
>> kernel_sysctlbyname is just bogus; you can use percent_target directly
> I did not know if during setup of the OID the value was copied (and
> thus you had to reference it later on) or the entry simply took the
> pointer and stashed that. Easily corrected.
>> - you don't need to sum various page counters to get a total count,
>> there is v_page_count
> Fair enough as well.
>> Lastly, can you try to test reverting your patch and instead setting
>> vm.lowmem_period=0 ?
> Yes. By default it's 10; I have not tampered with that default.
>
> Let me do a bit of work and I'll post back with a revised patch.
> Perhaps a tunable for percentage free + a free reserve that is a
> "floor"? The problem with that is where to put the defaults. One
> option would be to grab total size at init time and compute something
> similar to what "lotsfree" is for Solaris, allowing that to be tuned
> with the percentage if desired. I selected 25% because that's what
> the original test was expressing and it should be reasonable for
> modest RAM configurations. It's clearly too high for moderately large
> (or huge) memory machines unless they have a lot of RAM-hungry
> processes running on them.
>
> The percentage test, however, is an easy knob to twist that is
> unlikely to severely harm you if you dial it too far in either
> direction; anyone setting it to zero obviously knows what they're
> getting into, and if you crank it too high all you end up doing is
> limiting the ARC to the minimum value.
>
Responsive to the criticisms, and in an attempt to better track what the
VM system does, I offer this update to the patch. The following changes
have been made:

1. There are now two tunables:

   vfs.zfs.arc_freepages -- the number of free pages below which we
   declare low memory and ask for ARC paring.

   vfs.zfs.arc_freepage_percent -- the additional free RAM to reserve,
   in percent of total, if any (added to freepages).

2. vfs.zfs.arc_freepages, if zero (as is the default at boot), defaults
   to "vm.stats.vm.v_free_target" less 20%. This allows the system to
   get into the page-stealing paradigm before the ARC cache is invaded.
   While I do not run into a situation of unbridled inact page growth
   here, the criticism that the original patch could allow this appears
   to be well-founded. Setting the low memory alert here should prevent
   this, as the system will now allow the ARC to grow to the point that
   page-stealing takes place.

3. The previous option to reserve either a hard amount of RAM or a
   percentage of RAM remains.

4. The defaults should auto-tune for any particular RAM configuration
   to reasonable values that prevent stalls, yet if you have
   circumstances that argue for reserving more memory you may do so.

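
To make the arithmetic behind the two tunables concrete, here is a minimal
user-space sketch of the check described in the list above. It is not the
kernel code from the patch: the helper names default_freepages() and
arc_low_memory() are hypothetical, and the page counts are passed in as
plain arguments rather than read from the VM system.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: the value vfs.zfs.arc_freepages takes when it is
 * left at zero, i.e. v_free_target less 20%, using the same integer
 * arithmetic as the patch.
 */
static unsigned int
default_freepages(unsigned int free_target)
{
	return (free_target - (free_target / 5));
}

/*
 * Hypothetical helper: returns true when free memory has dropped below
 * the reserve, which is the point at which the ARC is asked to shrink.
 * The reserve is freepages plus "percent" percent of all pages.
 */
static bool
arc_low_memory(unsigned int free_pages, unsigned int total_pages,
    unsigned int freepages, unsigned int percent)
{
	assert(total_pages > 0);

	/* Divide first, as the patch does, so the product stays small. */
	unsigned int reserve = freepages + (total_pages / 100) * percent;

	return (free_pages < reserve);
}
```

Assuming 4 KiB pages, a 24GB machine has 6291456 pages, so a setting of
10 for the percentage reserves 629140 pages (about 2.4GB) on top of
whatever arc_freepages holds.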
Updated patch follows:
*** arc.c.original Thu Mar 13 09:18:48 2014
--- arc.c Wed Mar 19 07:44:01 2014
***************
*** 18,23 ****
--- 18,99 ----
*
* CDDL HEADER END
*/
+
+ /* Karl Denninger (karl at denninger.net), 3/18/2014, FreeBSD-specific
+ *
+ * If "NEWRECLAIM" is defined, change the "low memory" warning that causes
+ * the ARC cache to be pared down. The reason for the change is that the
+ * apparent attempted algorithm is to start evicting ARC cache when free
+ * pages fall below 25% of installed RAM. This maps reasonably well to how
+ * Solaris is documented to behave; when "lotsfree" is invaded ZFS is told
+ * to pare down.
+ *
+ * The problem is that on FreeBSD machines the system doesn't appear to be
+ * getting what the authors of the original code thought they were looking at
+ * with its test -- or at least not what Solaris did -- and as a result that
+ * test never triggers. That leaves the only reclaim trigger as the "paging
+ * needed" status flag, and by the time that trips the system is already
+ * in low-memory trouble. This can lead to severe pathological behavior
+ * under the following scenario:
+ * - The system starts to page and ARC is evicted.
+ * - The system stops paging as ARC's eviction drops wired RAM a bit.
+ * - ARC starts increasing its allocation again, and wired memory grows.
+ * - A new image is activated, and the system once again attempts to page.
+ * - ARC starts to be evicted again.
+ * - Back to the second step.
+ *
+ * Note that ZFS's ARC default (unless you override it in /boot/loader.conf)
+ * is to allow the ARC cache to grab nearly all of free RAM, provided nobody
+ * else needs it. That would be ok if we evicted cache when required.
+ *
+ * Unfortunately the system can get into a state where it never
+ * manages to page anything of materiality back in, as if there is active
+ * I/O the ARC will start grabbing space once again as soon as the memory
+ * contention state drops. For this reason the "paging is occurring" flag
+ * should be the **last resort** condition for ARC eviction; you want to
+ * (as Solaris does) start when there is material free RAM left BUT the
+ * vm system thinks it needs to be active to steal pages back in the attempt
+ * to never get into the condition where you're potentially paging off
+ * executables in favor of leaving disk cache allocated.
+ *
+ * To fix this we change how we look at low memory, declaring two new
+ * runtime tunables.
+ *
+ * The new sysctls are:
+ * vfs.zfs.arc_freepages (free pages required to call RAM "sufficient")
+ * vfs.zfs.arc_freepage_percent (additional reservation percentage, default 0)
+ *
+ * If vfs.zfs.arc_freepages is zero, it is initialized from
+ * vm.stats.vm.v_free_target less 20%. Note that vm.stats.vm.v_free_target
+ * is not initialized at boot -- the system has to be running first, so we
+ * cannot initialize this in arc_init. So we check during runtime; this
+ * also allows the user to return to defaults by setting it to zero.
+ *
+ * This should ensure that we allow the VM system to steal pages first,
+ * but pare the cache before we suspend processes attempting to get more
+ * memory, thereby avoiding "stalls." You can set this higher if you wish,
+ * or force a specific percentage reservation as well, but doing so may
+ * cause the cache to pare back while the VM system remains willing to
+ * allow "inactive" pages to accumulate. The challenge is that image
+ * activation can force things into the page space on a repeated basis
+ * if you allow this level to be too small (the above pathological
+ * behavior); the defaults should avoid that behavior but the sysctls
+ * are exposed should your workload require adjustment.
+ *
+ * If we're using this check for low memory we are replacing the previous
+ * ones, including the oddball "random" reclaim that appears to fire far
+ * more often than it should. We still trigger if the system pages.
+ *
+ * If you turn on NEWRECLAIM_DEBUG then the kernel will print on the console
+ * status messages when the reclaim status trips on and off, along with the
+ * page count aggregate that triggered it (and the free space) for each
+ * event.
+ */
+
+ #define NEWRECLAIM
+ #undef NEWRECLAIM_DEBUG
+
+
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2013 by Delphix. All rights reserved.
***************
*** 139,144 ****
--- 215,226 ----
#include <vm/vm_pageout.h>
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ #include <sys/sysctl.h>
+ #endif
+ #endif /* NEWRECLAIM */
+
#ifdef illumos
#ifndef _KERNEL
/* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
***************
*** 203,218 ****
--- 285,320 ----
int zfs_arc_shrink_shift = 0;
int zfs_arc_p_min_shift = 0;
int zfs_disable_dup_eviction = 0;
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ static int freepages = 0; /* This much memory is considered critical */
+ static int percent_target = 0; /* Additionally reserve "X" percent free RAM */
+ #endif /* __FreeBSD__ */
+ #endif /* NEWRECLAIM */
TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ TUNABLE_INT("vfs.zfs.arc_freepages", &freepages);
+ TUNABLE_INT("vfs.zfs.arc_freepage_percent", &percent_target);
+ #endif /* __FreeBSD__ */
+ #endif /* NEWRECLAIM */
+
SYSCTL_DECL(_vfs_zfs);
SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
"Maximum ARC size");
SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
"Minimum ARC size");
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepages, CTLFLAG_RWTUN, &freepages, 0, "ARC Free RAM Pages Required");
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepage_percent, CTLFLAG_RWTUN, &percent_target, 0, "ARC Free RAM Target percentage");
+ #endif /* __FreeBSD__ */
+ #endif /* NEWRECLAIM */
+
/*
* Note that buffers can be in one of 6 states:
* ARC_anon - anonymous (discussed below)
***************
*** 2438,2443 ****
--- 2540,2557 ----
{
#ifdef _KERNEL
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ u_int vmfree = 0;
+ u_int vmtotal = 0;
+ size_t vmsize;
+ #ifdef NEWRECLAIM_DEBUG
+ static int xval = -1;
+ static int oldpercent = 0;
+ static int oldfreepages = 0;
+ #endif /* NEWRECLAIM_DEBUG */
+ #endif /* __FreeBSD__ */
+ #endif /* NEWRECLAIM */
if (needfree)
return (1);
***************
*** 2476,2481 ****
--- 2590,2596 ----
return (1);
#if defined(__i386)
+
/*
* If we're on an i386 platform, it's possible that we'll exhaust the
* kernel heap space before we ever run out of available physical
***************
*** 2492,2502 ****
return (1);
#endif
#else /* !sun */
if (kmem_used() > (kmem_size() * 3) / 4)
return (1);
#endif /* sun */
- #else
if (spa_get_random(100) == 0)
return (1);
#endif
--- 2607,2680 ----
return (1);
#endif
#else /* !sun */
+
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ /*
+ * Implement the new tunable free RAM algorithm. We check the free pages
+ * against the minimum specified target and the percentage that should be
+ * free. If we're low we ask for ARC cache shrinkage. If this is defined
+ * on a FreeBSD system the older checks are not performed.
+ *
+ * Check first to see if we need to init freepages, then test.
+ */
+ if (!freepages) { /* If zero then (re)init */
+ vmsize = sizeof(vmtotal);
+ kernel_sysctlbyname(curthread, "vm.stats.vm.v_free_target", &vmtotal, &vmsize, NULL, 0, NULL, 0);
+ freepages = vmtotal - (vmtotal / 5);
+ #ifdef NEWRECLAIM_DEBUG
+ printf("ZFS ARC: Default vfs.zfs.arc_freepages to [%u] [%u less 20%%]\n", freepages, vmtotal);
+ #endif /* NEWRECLAIM_DEBUG */
+ }
+
+ vmsize = sizeof(vmtotal);
+ kernel_sysctlbyname(curthread, "vm.stats.vm.v_page_count", &vmtotal, &vmsize, NULL, 0, NULL, 0);
+ vmsize = sizeof(vmfree);
+ kernel_sysctlbyname(curthread, "vm.stats.vm.v_free_count", &vmfree, &vmsize, NULL, 0, NULL, 0);
+ #ifdef NEWRECLAIM_DEBUG
+ if (percent_target != oldpercent) {
+ printf("ZFS ARC: Reservation percent change to [%d], [%d] pages, [%d] free\n", percent_target, vmtotal, vmfree);
+ oldpercent = percent_target;
+ }
+ if (freepages != oldfreepages) {
+ printf("ZFS ARC: Low RAM page change to [%d], [%d] pages, [%d] free\n", freepages, vmtotal, vmfree);
+ oldfreepages = freepages;
+ }
+ #endif /* NEWRECLAIM_DEBUG */
+ if (!vmtotal) {
+ vmtotal = 1; /* Protect against divide by zero */
+ /* (should be impossible, but...) */
+ }
+ /*
+ * Now figure out how much free RAM we require to call the ARC cache status
+ * "ok". Add the percentage specified of the total to the base requirement.
+ */
+
+ if (vmfree < freepages + ((vmtotal / 100) * percent_target)) {
+ #ifdef NEWRECLAIM_DEBUG
+ if (xval != 1) {
+ printf("ZFS ARC: RECLAIM total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", vmtotal, vmfree, ((vmfree * 100) / vmtotal), freepages, percent_target);
+ xval = 1;
+ }
+ #endif /* NEWRECLAIM_DEBUG */
+ return(1);
+ } else {
+ #ifdef NEWRECLAIM_DEBUG
+ if (xval != 0) {
+ printf("ZFS ARC: NORMAL total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", vmtotal, vmfree, ((vmfree * 100) / vmtotal), freepages, percent_target);
+ xval = 0;
+ }
+ #endif /* NEWRECLAIM_DEBUG */
+ return(0);
+ }
+
+ #endif /* __FreeBSD__ */
+ #endif /* NEWRECLAIM */
+
if (kmem_used() > (kmem_size() * 3) / 4)
return (1);
#endif /* sun */
if (spa_get_random(100) == 0)
return (1);
#endif
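
One detail in the comparison above may be worth a look: freepages and
percent_target are declared int while vmfree and vmtotal are u_int, so
under C's usual arithmetic conversions a negative tunable is converted to
a very large unsigned value before the test. The following is only a
user-space sketch of one way to guard against that; it is not part of the
posted patch, and clamp_tunable() and reclaim_reserve() are hypothetical
names.

```c
#include <assert.h>

/* Hypothetical helper: force a signed tunable into the range [0, max]. */
static unsigned int
clamp_tunable(int value, unsigned int max)
{
	if (value < 0)
		return (0);
	if ((unsigned int)value > max)
		return (max);
	return ((unsigned int)value);
}

/*
 * Hypothetical helper: compute the free-page reserve from clamped
 * tunables, saturating at total_pages instead of wrapping around.
 */
static unsigned int
reclaim_reserve(int freepages, int percent, unsigned int total_pages)
{
	assert(total_pages > 0);

	unsigned int base = clamp_tunable(freepages, total_pages);
	unsigned int pct = clamp_tunable(percent, 100);
	/* pct <= 100, so this product cannot exceed total_pages. */
	unsigned int extra = (total_pages / 100) * pct;

	if (extra > total_pages - base)
		return (total_pages);
	return (base + extra);
}
```

With the reserve saturated at the total page count the check always asks
for reclaim, which matches the behavior described earlier for a
percentage that is cranked too high: the ARC is simply held at its
minimum.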
--
-- Karl
karl at denninger.net