kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix

Wed Mar 19 14:18:48 UTC 2014

On 3/18/2014 12:19 PM, Karl Denninger wrote:
>
> On 3/18/2014 10:20 AM, Andriy Gapon wrote:
>> The following reply was made to PR kern/187594; it has been noted by 
>> GNATS.
>>
>> From: Andriy Gapon <avg at FreeBSD.org>
>> To: bug-followup at FreeBSD.org, karl at fs.denninger.net
>> Cc:
>> Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
>> Date: Tue, 18 Mar 2014 17:15:05 +0200
>>
>>   Karl Denninger <karl at fs.denninger.net> wrote:
>>   > ZFS can be convinced to engage in pathological behavior due to a bad
>>   > low-memory test in arc.c
>>   >
>>   > The offending file is at
>>   > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c; it allegedly
>>   > checks for 25% free memory, and if it is less asks for the cache to shrink.
>>   >
>>   > (snippet from arc.c around line 2494 of arc.c in 10-STABLE; path
>>   > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
>>   >
>>   > #else /* !sun */
>>   > if (kmem_used() > (kmem_size() * 3) / 4)
>>   > return (1);
>>   > #endif /* sun */
>>   >
>>   > Unfortunately these two functions do not return what the authors thought
>>   > they did. It's clear what they're trying to do from the Solaris-specific
>>   > code up above this test.
>>
>>   No, these functions do return what the authors think they do.
>>   The check is for KVA usage (kernel virtual address space), not for
>>   physical memory.
> I understand, but that's nonsensical in the context of the Solaris 
> code.  "lotsfree" is *not* a declaration of free kvm space, it's a 
> declaration of when the system has "lots" of free *physical* memory.
>
> Further it makes no sense at all to allow the ARC cache to force 
> things into virtual (e.g. swap-space backed) memory.  But that's the 
> behavior that has been observed, and it fits with the code as 
> originally written.
>
>>   > The result is that the cache only shrinks when vm_paging_needed() tests
>>   > true, but by that time the system is in serious memory trouble and by
>>
>>   No, it is not.
>>   The description and numbers here are a little bit outdated but they should give
>>   an idea of how paging works in general:
>>   https://wiki.freebsd.org/AvgPageoutAlgorithm
>>
>>   > triggering only there it actually drives the system further into paging,
>>
>>   How does ARC eviction drive the system further into paging?
> 1. System gets low on physical memory but the ARC cache is looking at 
> available kvm (of which there is plenty.)  The ARC cache continues to 
> expand.
>
> 2. vm_paging_needed() returns true and the system begins to page off 
> to the swap.  At the same time the ARC cache is pared down because 
> arc_reclaim_needed has returned "1".
>
> 3. As the ARC cache shrinks and paging occurs vm_paging_needed() 
> returns false.  Paging out ceases but inactive pages remain on the 
> swap.  They are not recalled until and unless they are scheduled to 
> execute.  Arc_reclaim_needed again returns "0".
>
> 4. The hold-down timer expires in the ARC cache code 
> ("arc_grow_retry", declared as 60 seconds) and the ARC cache begins to 
> expand again.
>
> Go back to #2 until the system's performance starts to deteriorate 
> badly enough due to the paging that you notice it, which occurs when 
> something that is actually consuming CPU time has to be called in from 
> swap.
>
> This is consistent with what I and others have observed on both 9.2 
> and 10.0; the ARC will expand until it hits the maximum configured 
> even at the expense of forcing pages onto the swap.  In this specific 
> machine's case left to defaults it will grab nearly all physical 
> memory (over 20GB of 24) and wire it down.
>
> Limiting arc_max to 16GB sorta fixes it.  I say "sorta" because it
> turns out that 16GB is still too much for the workload; the limit
> prevents only the most extreme cases of the pathological behavior
> where system "stalls" happen.  With the patch in, my ARC cache
> stabilizes at about 13.5GB during the busiest part of the day, growing
> to about 16GB off-hours.
>
> One of the problems with just limiting it in /boot/loader.conf is that 
> you have to guess and the system doesn't reasonably adapt to changing 
> memory loads.  The code is clearly intended to do that but it doesn't 
> end up working that way in practice.
>>   > because the pager will not recall pages from the swap until they are next
>>   > executed. This leads the ARC to try to fill in all the available RAM even
>>   > though pages have been pushed off onto swap. Not good.
>>
>>   Unused physical memory is a waste.  It is true that ARC tries to use as much
>>   memory as it is allowed.  The same applies to the page cache (Active, Inactive).
>>   Memory management is a dynamic system and there are a few competing agents.
> That's true.  However, what the stock code does is force working set 
> out of memory and into the swap.  The ideal situation is one in which 
> there is no free memory because cache has sized itself to consume 
> everything *not* necessary for the working set of the processes that 
> are running.  Unfortunately we cannot determine this presciently 
> because a new process may come along and we do not necessarily know 
> for how long a process that is blocked on an event will remain blocked 
> (e.g. something waiting on network I/O, etc.)
>
> However, it is my contention that you do not want to evict a process 
> that is scheduled to run (or is going to be) in favor of disk cache 
> because you're defeating yourself by doing so.  The point of the disk 
> cache is to avoid going to the physical disk for I/O, but if you page 
> something you have ditched a physical I/O for data in favor of having 
> to go to physical disk *twice* -- first to write the paged-out data to 
> swap, and then to retrieve it when it is to be executed.  This also 
> appears to be consistent with what is present for Solaris machines.
>
> From the Sun code:
>
> #ifdef sun
>         /*
>          * take 'desfree' extra pages, so we reclaim sooner, rather than later
>          */
>         extra = desfree;
>
>         /*
>          * check that we're out of range of the pageout scanner. It starts to
>          * schedule paging if freemem is less than lotsfree and needfree.
>          * lotsfree is the high-water mark for pageout, and needfree is the
>          * number of needed free pages.  We add extra pages here to make sure
>          * the scanner doesn't start up while we're freeing memory.
>          */
>         if (freemem < lotsfree + needfree + extra)
>                 return (1);
>
>         /*
>          * check to make sure that swapfs has enough space so that anon
>          * reservations can still succeed. anon_resvmem() checks that the
>          * availrmem is greater than swapfs_minfree, and the number of reserved
>          * swap pages.  We also add a bit of extra here just to prevent
>          * circumstances from getting really dire.
>          */
>         if (availrmem < swapfs_minfree + swapfs_reserve + extra)
>                 return (1);
>
> "freemem" is not virtual memory, it's actual memory.  "Lotsfree" is 
> the point where the system considers free RAM to be "ample"; 
> "needfree" is the "desperation" point and "extra" is the margin 
> (presumably for image activation.)
>
> The base code on FreeBSD doesn't look at physical memory at all; it 
> looks at kvm space instead.
>
>>   It is hard to correctly tune that system using a large hammer such as your
>>   patch.  I believe that with your patch ARC will get shrunk to its minimum size
>>   in due time.  Active + Inactive will grow to use the memory that you are denying
>>   to ARC, driving Free below a threshold, which will reduce ARC.  Repeated enough
>>   times this will drive ARC to its minimum.
> I disagree both in design theory and based on the empirical evidence 
> of actual operation.
>
> First, I don't (ever) want to give memory to the ARC cache that 
> otherwise would go to "active", because any time I do that I'm going 
> to force two page events, which is double the amount of I/O I would 
> take on a cache *miss*, and even with the ARC at minimum I get a 
> reasonable hit percentage.  If I therefore prefer ARC over "active" 
> pages I am going to take *at least* a 200% penalty on physical I/O and 
> if I get an 80% hit ratio with the ARC at a minimum the penalty is 
> closer to 800%!
>
> For inactive pages it's a bit more complicated as those may not be 
> reactivated.  However, I am trusting FreeBSD's VM subsystem to demote 
> those that are unlikely to be reactivated to the cache bucket and then 
> to "free", where they are able to be re-used. This is consistent with 
> what I actually see on a running system -- the "inact" bucket is 
> typically fairly large (often on a busy machine close to that of 
> "active") but pages demoted to "cache" don't stay there long - they 
> either get re-promoted back up or they are freed and go on the free list.
>
> The only time I see "inact" get out of control is when there's a 
> kernel memory leak somewhere (such as what I ran into the other day 
> with the in-kernel NAT subsystem on 10-STABLE.)  But that's a bug and 
> if it happens you're going to get bit anyway.
>
> For example right now on one of my very busy systems with 24GB of 
> installed RAM and many terabytes of storage across three ZFS pools I'm 
> seeing 17GB wired of which 13.5 is ARC cache.  That's the adaptive 
> figure it currently is running at, with a maximum of 22.3 and a 
> minimum of 2.79 (8:1 ratio.)  The remainder is wired down for other 
> reasons (there's a fairly large Postgres server running on that box, 
> among other things, and it has a big shared buffer declaration -- 
> that's most of the difference.)  Cache hit efficiency is currently 97.8%.
>
> Active is 2.26G right now, and inactive is 2.09G.  Both are stable.
> Overnight inactive will drop to about 1.1GB while active will not
> change all that much since most of it is postgres and the middleware
> that talks to it along with apache, which leaves most of its processes
> present even when they go idle.  Peak load times are about right now
> (mid-day), and again when the system is running backups nightly.
>
> Cache is 7448, in other words, insignificant.  Free memory is 2.6G.
>
> The tunable is set to 10%, which is almost exactly what free memory 
> is.  I find that when the system gets under 1G free transient image 
> activation can drive it into paging and performance starts to suffer 
> for my particular workload.
>
>>   Also, there are a few technical problems with the patch:
>>   - you don't need to use sysctl interface in kernel, the values you need are
>>   available directly, just take a look at e.g. implementation of vm_paging_needed()
> That's easily fixed.  I will look at it.
>>   - similarly, querying vfs.zfs.arc_freepage_percent_target value via
>>   kernel_sysctlbyname is just bogus; you can use percent_target directly
> I did not know if during setup of the OID the value was copied (and 
> thus you had to reference it later on) or the entry simply took the 
> pointer and stashed that.  Easily corrected.
>>   - you don't need to sum various page counters to get a total count, there is
>>   v_page_count
> Fair enough as well.
>>   Lastly, can you try to test reverting your patch and instead setting
>>   vm.lowmem_period=0 ?
> Yes.  By default it's 10; I have not tampered with that default.
>
> Let me do a bit of work and I'll post back with a revised patch. 
> Perhaps a tunable for percentage free + a free reserve that is a 
> "floor"?  The problem with that is where to put the defaults.  One 
> option would be to grab total size at init time and compute something 
> similar to what "lotsfree" is for Solaris, allowing that to be tuned 
> with the percentage if desired.  I selected 25% because that's what 
> the original test was expressing and it should be reasonable for 
> modest RAM configurations.  It's clearly too high for moderately large
> (or huge) memory machines unless they have a lot of RAM-hungry
> processes running on them.
>
> The percentage test, however, is an easy knob to twist that is 
> unlikely to severely harm you if you dial it too far in either 
> direction; anyone setting it to zero obviously knows what they're 
> getting into, and if you crank it too high all you end up doing is 
> limiting the ARC to the minimum value.
>

In response to the criticisms, and in an attempt to better track what the
VM system does, I offer this update to the patch.  The following changes
have been made:

1. There are now two tunables:
vfs.zfs.arc_freepages -- the number of free pages below which we declare 
low memory and ask for ARC paring.
vfs.zfs.arc_freepage_percent -- the additional free RAM to reserve in 
percent of total, if any (added to freepages)

2. vfs.zfs.arc_freepages, if zero (as is the default at boot), defaults 
to "vm.stats.vm.v_free_target" less 20%.  This allows the system to get 
into the page-stealing paradigm before the ARC cache is invaded.  While 
I do not run into a situation of unbridled inact page growth here the 
criticism that the original patch could allow this appears to be 
well-founded.  Setting the low memory alert here should prevent this, as 
the system will now allow the ARC to grow to the point that 
page-stealing takes place.

3. The previous option to reserve either a hard amount of RAM or a 
percentage of RAM remains.

4. The defaults should auto-tune for any particular RAM configuration to 
reasonable values that prevent stalls, yet if you have circumstances 
that argue for reserving more memory you may do so.
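
As a rough illustration of how the two tunables combine, here is a minimal
userland sketch of the threshold arithmetic.  The helper names
(arc_default_freepages, arc_reclaim_threshold, arc_low_memory) are made up
for this example and do not appear in the patch, and the page counts are
passed in as plain arguments rather than read from the VM system.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): default reserve used when
 * vfs.zfs.arc_freepages is left at zero, i.e. v_free_target less 20%.
 */
static uint32_t
arc_default_freepages(uint32_t v_free_target)
{
	return (v_free_target - (v_free_target / 5));
}

/*
 * Hypothetical helper: free-page count below which the ARC is asked to
 * shrink.  Uses the same integer arithmetic as the patch:
 * freepages + (total / 100) * percent.
 */
static uint32_t
arc_reclaim_threshold(uint32_t freepages, uint32_t v_page_count,
    uint32_t percent)
{
	return (freepages + ((v_page_count / 100) * percent));
}

/* Returns 1 when free pages are below the threshold, 0 otherwise. */
static int
arc_low_memory(uint32_t v_free_count, uint32_t freepages,
    uint32_t v_page_count, uint32_t percent)
{
	return (v_free_count <
	    arc_reclaim_threshold(freepages, v_page_count, percent));
}
```

With an illustrative v_free_target of 100000 pages the default reserve works
out to 80000 pages.  On a machine with 6000000 pages and
vfs.zfs.arc_freepage_percent set to 10, the threshold becomes
80000 + 600000 = 680000 free pages; these numbers are invented for the
example and are not measurements.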

Updated patch follows:

*** arc.c.original	Thu Mar 13 09:18:48 2014
--- arc.c	Wed Mar 19 07:44:01 2014
***************
*** 18,23 ****
--- 18,99 ----
    *
    * CDDL HEADER END
    */
+
+ /* Karl Denninger (karl at denninger.net), 3/18/2014, FreeBSD-specific
+  *
+  * If "NEWRECLAIM" is defined, change the "low memory" warning that causes
+  * the ARC cache to be pared down.  The reason for the change is that the
+  * apparent attempted algorithm is to start evicting ARC cache when free
+  * pages fall below 25% of installed RAM.  This maps reasonably well to how
+  * Solaris is documented to behave; when "lotsfree" is invaded ZFS is told
+  * to pare down.
+  *
+  * The problem is that on FreeBSD machines the system doesn't appear to be
+  * getting what the authors of the original code thought they were looking at
+  * with its test -- or at least not what Solaris did -- and as a result that
+  * test never triggers.  That leaves the only reclaim trigger as the "paging
+  * needed" status flag, and by the time that trips the system is already
+  * in low-memory trouble.  This can lead to severe pathological behavior
+  * under the following scenario:
+  * - The system starts to page and ARC is evicted.
+  * - The system stops paging as ARC's eviction drops wired RAM a bit.
+  * - ARC starts increasing its allocation again, and wired memory grows.
+  * - A new image is activated, and the system once again attempts to page.
+  * - ARC starts to be evicted again.
+  * - Back to the second step.
+  *
+  * Note that ZFS's ARC default (unless you override it in /boot/loader.conf)
+  * is to allow the ARC cache to grab nearly all of free RAM, provided nobody
+  * else needs it.  That would be ok if we evicted cache when required.
+  *
+  * Unfortunately the system can get into a state where it never
+  * manages to page anything of materiality back in, as if there is active
+  * I/O the ARC will start grabbing space once again as soon as the memory
+  * contention state drops.  For this reason the "paging is occurring" flag
+  * should be the **last resort** condition for ARC eviction; you want to
+  * (as Solaris does) start when there is material free RAM left BUT the
+  * vm system thinks it needs to be active to steal pages back in the attempt
+  * to never get into the condition where you're potentially paging off
+  * executables in favor of leaving disk cache allocated.
+  *
+  * To fix this we change how we look at low memory, declaring two new
+  * runtime tunables.
+  *
+  * The new sysctls are:
+  * vfs.zfs.arc_freepages (free pages required to call RAM "sufficient")
+  * vfs.zfs.arc_freepage_percent (additional reservation percentage, default 0)
+  *
+  * vfs.zfs.arc_freepages is initialized from vm.stats.vm.v_free_target,
+  * less 20% if we find that it is zero.  Note that vm.stats.vm.v_free_target
+  * is not initialized at boot -- the system has to be running first, so we
+  * cannot initialize this in arc_init.  So we check during runtime; this
+  * also allows the user to return to defaults by setting it to zero.
+  *
+  * This should insure that we allow the VM system to steal pages first,
+  * but pare the cache before we suspend processes attempting to get more
+  * memory, thereby avoiding "stalls."  You can set this higher if you wish,
+  * or force a specific percentage reservation as well, but doing so may
+  * cause the cache to pare back while the VM system remains willing to
+  * allow "inactive" pages to accumulate.  The challenge is that image
+  * activation can force things into the page space on a repeated basis
+  * if you allow this level to be too small (the above pathological
+  * behavior); the defaults should avoid that behavior but the sysctls
+  * are exposed should your workload require adjustment.
+  *
+  * If we're using this check for low memory we are replacing the previous
+  * ones, including the oddball "random" reclaim that appears to fire far
+  * more often than it should.  We still trigger if the system pages.
+  *
+  * If you turn on NEWRECLAIM_DEBUG then the kernel will print on the console
+  * status messages when the reclaim status trips on and off, along with the
+  * page count aggregate that triggered it (and the free space) for each
+  * event.
+  */
+
+ #define	NEWRECLAIM
+ #undef	NEWRECLAIM_DEBUG
+
+
   /*
    * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
    * Copyright (c) 2013 by Delphix. All rights reserved.
***************
*** 139,144 ****
--- 215,226 ----

   #include <vm/vm_pageout.h>

+ #ifdef	NEWRECLAIM
+ #ifdef	__FreeBSD__
+ #include <sys/sysctl.h>
+ #endif
+ #endif	/* NEWRECLAIM */
+
   #ifdef illumos
   #ifndef _KERNEL
   /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
***************
*** 203,218 ****
--- 285,320 ----
   int zfs_arc_shrink_shift = 0;
   int zfs_arc_p_min_shift = 0;
   int zfs_disable_dup_eviction = 0;
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ static	int freepages = 0;	/* This much memory is considered critical */
+ static	int percent_target = 0;	/* Additionally reserve "X" percent free RAM */
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */

   TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
   TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
   TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ TUNABLE_INT("vfs.zfs.arc_freepages", &freepages);
+ TUNABLE_INT("vfs.zfs.arc_freepage_percent", &percent_target);
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   SYSCTL_DECL(_vfs_zfs);
   SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
       "Maximum ARC size");
   SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
       "Minimum ARC size");

+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepages, CTLFLAG_RWTUN, &freepages, 0, "ARC Free RAM Pages Required");
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepage_percent, CTLFLAG_RWTUN, &percent_target, 0, "ARC Free RAM Target percentage");
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   /*
    * Note that buffers can be in one of 6 states:
    *	ARC_anon	- anonymous (discussed below)
***************
*** 2438,2443 ****
--- 2540,2557 ----
   {

   #ifdef _KERNEL
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ 	u_int	vmfree = 0;
+ 	u_int	vmtotal = 0;
+ 	size_t	vmsize;
+ #ifdef	NEWRECLAIM_DEBUG
+ 	static	int	xval = -1;
+ 	static	int	oldpercent = 0;
+ 	static	int	oldfreepages = 0;
+ #endif	/* NEWRECLAIM_DEBUG */
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */

   	if (needfree)
   		return (1);
***************
*** 2476,2481 ****
--- 2590,2596 ----
   		return (1);

   #if defined(__i386)
+
   	/*
   	 * If we're on an i386 platform, it's possible that we'll exhaust the
   	 * kernel heap space before we ever run out of available physical
***************
*** 2492,2502 ****
   		return (1);
   #endif
   #else	/* !sun */
   	if (kmem_used() > (kmem_size() * 3) / 4)
   		return (1);
   #endif	/* sun */

- #else
   	if (spa_get_random(100) == 0)
   		return (1);
   #endif
--- 2607,2680 ----
   		return (1);
   #endif
   #else	/* !sun */
+
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ /*
+  * Implement the new tunable free RAM algorithm.  We check the free pages
+  * against the minimum specified target and the percentage that should be
+  * free.  If we're low we ask for ARC cache shrinkage.  If this is defined
+  * on a FreeBSD system the older checks are not performed.
+  *
+  * Check first to see if we need to init freepages, then test.
+  */
+ 	if (!freepages) {		/* If zero then (re)init */
+ 		vmsize = sizeof(vmtotal);
+ 		kernel_sysctlbyname(curthread, "vm.stats.vm.v_free_target", &vmtotal, &vmsize, NULL, 0, NULL, 0);
+ 		freepages = vmtotal - (vmtotal / 5);
+ #ifdef	NEWRECLAIM_DEBUG
+ 		printf("ZFS ARC: Default vfs.zfs.arc_freepages to [%u] [%u less 20%%]\n", freepages, vmtotal);
+ #endif	/* NEWRECLAIM_DEBUG */
+ 	}
+
+ 	vmsize = sizeof(vmtotal);
+         kernel_sysctlbyname(curthread, "vm.stats.vm.v_page_count", &vmtotal, &vmsize, NULL, 0, NULL, 0);
+ 	vmsize = sizeof(vmfree);
+         kernel_sysctlbyname(curthread, "vm.stats.vm.v_free_count", &vmfree, &vmsize, NULL, 0, NULL, 0);
+ #ifdef	NEWRECLAIM_DEBUG
+ 	if (percent_target != oldpercent) {
+ 		printf("ZFS ARC: Reservation percent change to [%d], [%d] pages, [%d] free\n", percent_target, vmtotal, vmfree);
+ 		oldpercent = percent_target;
+ 	}
+ 	if (freepages != oldfreepages) {
+ 		printf("ZFS ARC: Low RAM page change to [%d], [%d] pages, [%d] free\n", freepages, vmtotal, vmfree);
+ 		oldfreepages = freepages;
+ 	}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 	if (!vmtotal) {
+ 		vmtotal = 1;	/* Protect against divide by zero */
+ 				/* (should be impossible, but...) */
+ 	}
+ /*
+  * Now figure out how much free RAM we require to call the ARC cache status
+  * "ok".  Add the percentage specified of the total to the base requirement.
+  */
+
+ 	if (vmfree < freepages + ((vmtotal / 100) * percent_target)) {
+ #ifdef	NEWRECLAIM_DEBUG
+ 		if (xval != 1) {
+ 			printf("ZFS ARC: RECLAIM total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", vmtotal, vmfree, ((vmfree * 100) / vmtotal), freepages, percent_target);
+ 			xval = 1;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		return(1);
+ 	} else {
+ #ifdef	NEWRECLAIM_DEBUG
+ 		if (xval != 0) {
+ 			printf("ZFS ARC: NORMAL total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", vmtotal, vmfree, ((vmfree * 100) / vmtotal), freepages, percent_target);
+ 			xval = 0;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		return(0);
+ 	}
+
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   	if (kmem_used() > (kmem_size() * 3) / 4)
   		return (1);
   #endif	/* sun */

   	if (spa_get_random(100) == 0)
   		return (1);
   #endif
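
One thing to watch in the NEWRECLAIM_DEBUG printf lines: the expression
(vmfree * 100) / vmtotal multiplies two 32-bit quantities, so it wraps once
vmfree exceeds 42949672 pages (roughly 164GB free with 4K pages).  It only
affects the debug output, not the reclaim decision.  A minimal userland
sketch of the difference follows; the function names are hypothetical and
are not part of the patch, and the caller is assumed to guard against a
zero total as the patch does.

```c
#include <stdint.h>

/*
 * Same expression as the debug printf: the multiply is performed in
 * 32-bit unsigned arithmetic and can wrap for very large page counts.
 */
static uint32_t
free_pct_32(uint32_t vmfree, uint32_t vmtotal)
{
	return ((vmfree * 100) / vmtotal);
}

/*
 * Widen to 64 bits before multiplying so the intermediate product
 * cannot overflow.
 */
static uint32_t
free_pct_64(uint32_t vmfree, uint32_t vmtotal)
{
	return ((uint32_t)(((uint64_t)vmfree * 100) / vmtotal));
}
```

For small machines both versions agree; they diverge only when the free
page count is large enough for the 32-bit product to wrap.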

-- 
-- Karl
karl at denninger.net
