kern/187572: ZFS ARC cache code does not properly handle low memory

Karl Denninger karl at fs.denninger.net
Fri Mar 14 11:20:00 UTC 2014


>Number:         187572
>Category:       kern
>Synopsis:       ZFS ARC cache code does not properly handle low memory
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Mar 14 11:20:00 UTC 2014
>Closed-Date:
>Last-Modified:
>Originator:     Karl Denninger
>Release:        FreeBSD 10.0-STABLE amd64
>Organization:
Karls Sushi and Packet Smashers
>Environment:
System: FreeBSD NewFS.denninger.net 10.0-STABLE FreeBSD 10.0-STABLE #11 r263037M: Thu Mar 13 15:47:15 CDT 2014 karl at NewFS.denninger.net:/usr/obj/usr/src/sys/KSD-SMP amd64

Note: Also applies to previous releases

>Description:
ZFS can be convinced to engage in what I can only surmise is pathological
behavior, and I've seen no fix for it when it happens -- but there are
things you can do to mitigate it.

What IMHO _*should*_ happen is that the ARC cache should shrink as necessary
to prevent paging, subject to vfs.zfs.arc_min.  There is one pathological
case to guard against: segments that were paged out hours (or more!) ago and
never get paged back in because that particular piece of code never executes
again.  The process is still alive, so the system cannot reclaim the space
-- it shows as "committed" in pstat -s -- but unless it is paged back in it
has no impact on system performance.  The policing here would therefore have
to apply a "reasonableness" filter to those pages (e.g. if a page has been
out on the page file for longer than "X", ignore that allocation unit for
this purpose.)

This would cause the ARC cache to flush itself down automatically as
executable and data segment RAM commitments increase.

The documentation says this is how it should work, but in practice it
doesn't appear to behave this way for many workloads.  I have seen "wired"
RAM pinned at 20GB on one of my servers here with a fairly large DBMS
running -- with pieces of its working set and even a user's shell (!)
getting paged off, yet the ARC cache is not pared down to release memory.
Indeed you can let the system run for hours under these conditions and the
ARC wired memory will not decrease.  Cutting back the DBMS's internal
buffering does not help.

What I've done here is restrict the ARC cache size in an attempt to prevent
this particular bit of bogosity from biting me, and it appears to (sort of)
work.  Unfortunately you cannot tune this while the system is running; it
can only be set at boot time in /boot/loader.conf.  If it were a runtime
tunable, a user daemon could conceivably slash away at the arc_max sysctl
and force the deallocation of wired memory whenever it detected paging --
or near-paging, such as free memory below some user-configured threshold.

This is something that, should I get myself a nice hunk of free time, I may
dive into and attempt to fix.  It would likely take me quite a while to get
up to speed, as I've not gotten into the ZFS code at all -- and mistakes in
there could easily corrupt files (in other words, definitely NOT something
to play with on a production system!)

I have to assume there's a pretty good reason why you can't change arc_max
while the system is running; it _*can*_ be changed on a running system in
some other implementations (e.g. Solaris.)  It is marked with CTLFLAG_RDTUN
in the ARC management file, which prohibits run-time changes, and the only
place I see it referenced on a quick look is in the arc_init code.
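
For reference, the declaration in question (it also appears in the diff
context further down) is:

SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
    "Maximum ARC size");

Switching CTLFLAG_RDTUN to CTLFLAG_RWTUN (as the patch below does for its
new knob) is what would permit writes on a running system; presumably the
boot-time-only consumption of the value in arc_init would also have to be
revisited before that was actually safe.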

Note that the test in arc.c for "arc_reclaim_needed" appears to be pretty
basic -- essentially the system will not aggressively try to reclaim memory
unless used kmem exceeds 3/4 of the kmem arena's size.

(snippet from around line 2494 of arc.c in 10-STABLE; path
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c)

#else   /* !sun */
        if (kmem_used() > (kmem_size() * 3) / 4)
                return (1);
#endif  /* sun */

Up above that there's a test of "vm_paging_needed()" that would
(theoretically) appear to trigger first in these situations, but in many
cases it doesn't.
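
That earlier check reads roughly as follows (paraphrased from the same
function in 10-STABLE, so treat the exact placement as approximate):

        if (vm_paging_needed())
                return (1);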

IMHO this is too basic a test and leads to pathological situations, in
that the system may wind up paging things out as opposed to paring back the
ARC cache.  As soon as the working set of something that's actually getting
cycles is paged out, in most cases system performance goes straight into
the trash.

On Sun machines (from reading the code) it will allegedly try to pare the
ARC any time free memory drops below the "lotsfree" (plus "needfree" plus
"extra") threshold.
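
From memory of the Solaris-derived section of the same function, that check
looks approximately like this (a sketch, not a verbatim quote):

        /*
         * Stay out of the way of the pageout scanner, which starts
         * paging when freemem drops below lotsfree + needfree.
         */
        if (freemem < lotsfree + needfree + extra)
                return (1);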

As an example, this is what a server of mine that is exhibiting this
behavior shows right now:

 20202500 wire
  1414052 act
  2323280 inact
   110340 cache
   414484 free
  1694896 buf

Of that "wired" mem 15.7G of it is ARC cache (with a target of 15.81, so
it's essentially right up against it.)

That "free" number would be ok if it didn't result in the system having
trashy performance -- but it does on occasion. Incidentally the allocated
swap is about 195k blocks (~200 Megabytes) which isn't much all-in, but it's
enough to force actual fetches of recently-used programs (e.g. your shell!)
from paged-off space.  The thing is that if the test in the code (75% of
kmem available consumed) was looking only at "free" the system should be
aggressively trying to free up ARC cache.  It clearly is not; the included
code calls this:

uint64_t
kmem_used(void)
{

        return (vmem_size(kmem_arena, VMEM_ALLOC));
}

What's quite clear is that the system _*thinks*_ it has plenty of free
memory when it very clearly is essentially out!  In fact free memory at
the moment (~400MB) is 1.7% of the total, _*not*_ 25%.  From this I surmise
that the "vmem_size" call is not returning the sum of all the above "in
use" sizes (except perhaps "inact"); were it to do so, that would be
essentially 100% of installed RAM, and the ARC cache should be under active
shrinkage -- but it clearly is not.
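
To watch the same arithmetic from userland, here is a minimal sketch (not
part of the patch below; it assumes only the standard vm.stats.vm.*
counters the patch also reads) that prints the free-page percentage the new
test would see:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

/* Fetch one vm.stats.vm.* page counter; returns 0 on lookup failure. */
static u_int
pages(const char *oid)
{
        u_int val = 0;
        size_t len = sizeof(val);

        if (sysctlbyname(oid, &val, &len, NULL, 0) == -1)
                return (0);
        return (val);
}

int
main(void)
{
        u_int wire = pages("vm.stats.vm.v_wire_count");
        u_int act = pages("vm.stats.vm.v_active_count");
        u_int inact = pages("vm.stats.vm.v_inactive_count");
        u_int cache = pages("vm.stats.vm.v_cache_count");
        u_int freep = pages("vm.stats.vm.v_free_count");
        u_int total = wire + act + inact + cache + freep;

        if (total == 0)         /* shouldn't happen; avoid divide by zero */
                total = 1;
        printf("total %u pages, free %u pages (%u%% free)\n",
            total, freep, (freep * 100) / total);
        return (0);
}

Compile it with cc and run it while the box is under load; on the machine
described above it would report roughly the 1.7% figure computed by hand.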

>How-To-Repeat:
	Set up a cache-heavy workload on large (terabyte-sized or bigger)
	ZFS filesystems and note that free RAM drops to the point that
	starvation occurs, while "wired" memory pins at the maximum ARC
	cache size -- even though there are other demands for RAM that
	should cause the ARC's memory-congestion control algorithm to
	evict some of the cache as demand rises.

>Fix:

	The context diff below resolves the problem.

	We now add up wired, active, inactive, cache and free memory and
	compute what percentage of that whole is free.  If the free
	percentage drops below the selected value, the flag is set that
	asks the ARC cache to free RAM.

	This also introduces a runtime tunable that allows you to select the
	free RAM target for the ARC cache in real time, rather than forcing
	you to reboot to set the ARC's maximum size in /boot/loader.conf.  
	The target is exported via sysctl as:

		vfs.zfs.arc_freepage_percent_target: 25

	Changes to this value take effect immediately, allowing runtime
	configuration to suit your workload.  The default is set to 25% to
	match the original code's intent, but for large RAM sizes this is
	probably more conservative than required.
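
	For instance, on a big-RAM box you might lower the reservation
	with "sysctl vfs.zfs.arc_freepage_percent_target=10" to let the
	ARC run closer to the edge, or raise it to force earlier eviction
	(values purely illustrative).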

	Defining NEWRECLAIM_DEBUG will cause the code to print (on the
	console) status messages on state changes, along with any picked-up
	changes to the reservation percentage.  Note that on a busy system
	that is actively invading the free-space reservation these notices
	can get rather "busy" themselves, so the option is off by default.


*** arc.c.original	Thu Mar 13 09:18:48 2014
--- /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c	Thu Mar 13 15:43:38 2014
***************
*** 18,23 ****
--- 18,84 ----
   *
   * CDDL HEADER END
   */
+ 
+ /* Karl Denninger (karl at denninger.net), 3/13/2014, FreeBSD-specific
+  * 
+  * If "NEWRECLAIM" is defined, change the "low memory" warning that causes
+  * the ARC cache to be pared down.  The reason for the change is that the
+  * apparent attempted algorithm is to start evicting ARC cache when free
+  * pages fall below 25% of installed RAM.  This maps reasonably well to how
+  * Solaris is documented to behave; when "lotsfree" is invaded ZFS is told 
+  * to pare down.  
+  *
+  * The problem is that on FreeBSD machines the system doesn't appear to be 
+  * getting what the authors of the original code thought they were looking at
+  * with its test and as a result that test never triggers.  That leaves the 
+  * only reclaim trigger as the "paging needed" status flag, and by the time 
+  * that trips the system is already in low-memory trouble.  This can lead to 
+  * severe pathological behavior under the following scenario:
+  * 1. The system starts to page and ARC is evicted.
+  * 2. The system stops paging as ARC's eviction drops wired RAM a bit.
+  * 3. ARC starts increasing its allocation again, and wired memory grows.
+  * 4. A new image is activated, and the system once again attempts to page.
+  * 5. ARC starts to be evicted again.
+  * 6. Back to step 2.
+  * 
+  * Note that ZFS's ARC default (unless you override it in /boot/loader.conf)
+  * is to allow the ARC cache to grab nearly all of free RAM, provided nobody
+  * else needs it.  That would be ok if we evicted cache when required.
+  * 
+  * Unfortunately the system can get into a state where it never
+  * manages to page anything of materiality back in, as if there is active
+  * I/O the ARC will start grabbing space once again as soon as the memory 
+  * contention state drops.  For this reason the "paging is occurring" flag 
+  * should be the **last resort** condition for ARC eviction; you want to 
+  * (as Solaris does) start when there is material free RAM left in the hope 
+  * of never getting into the condition where you're potentially paging off 
+  * executables in favor of leaving disk cache allocated.  That's a recipe 
+  * for terrible overall system performance.
+  *
+  * To fix this we instead grab four OIDs out of the sysctl status
+  * messages -- wired pages, active pages, inactive pages and cache (vnodes?)
+  * pages, sum those and compare against the free page count from the
+  * VM sysctl status OID, giving us a percentage of pages free.  This
+  * is checked against a new tunable "vfs.zfs.arc_freepage_percent_target"
+  * and if less, we declare the system low on memory.
+  * 
+  * Note that this sysctl variable is runtime tunable if you have reason
+  * to change it (e.g. you want more or less RAM free to be the "clean up"
+  * threshold.)
+  *
+  * If this test is enabled the previous algorithm is still checked in the 
+  * event this test fails, although that previous test should be a no-op.
+  *
+  * If you turn on NEWRECLAIM_DEBUG then the kernel will print on the console
+  * status messages when the reclaim status trips on and off, along with the
+  * page count aggregate that triggered it (and the free space) for each
+  * event. 
+  */
+ 
+ #define	NEWRECLAIM
+ #undef	NEWRECLAIM_DEBUG
+ 
+ 
  /*
   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
   * Copyright (c) 2013 by Delphix. All rights reserved.
***************
*** 139,144 ****
--- 200,211 ----
  
  #include <vm/vm_pageout.h>
  
+ #ifdef	NEWRECLAIM
+ #ifdef	__FreeBSD__
+ #include <sys/sysctl.h>
+ #endif
+ #endif	/* NEWRECLAIM */
+ 
  #ifdef illumos
  #ifndef _KERNEL
  /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
***************
*** 203,218 ****
--- 270,302 ----
  int zfs_arc_shrink_shift = 0;
  int zfs_arc_p_min_shift = 0;
  int zfs_disable_dup_eviction = 0;
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ static	int percent_target = 25;
+ #endif
+ #endif
  
  TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
  TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
  TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ TUNABLE_INT("vfs.zfs.arc_freepage_percent_target", &percent_target);
+ #endif
+ #endif
+ 
  SYSCTL_DECL(_vfs_zfs);
  SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
      "Maximum ARC size");
  SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
      "Minimum ARC size");
  
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepage_percent_target, CTLFLAG_RWTUN, &percent_target, 0, "ARC Free RAM Target percentage");
+ #endif
+ #endif
+ 
  /*
   * Note that buffers can be in one of 6 states:
   *	ARC_anon	- anonymous (discussed below)
***************
*** 2438,2443 ****
--- 2522,2543 ----
  {
  
  #ifdef _KERNEL
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ 	u_int	vmwire = 0;
+ 	u_int	vmactive = 0;
+ 	u_int	vminactive = 0;
+ 	u_int	vmcache = 0;
+ 	u_int	vmfree = 0;
+ 	u_int	vmtotal = 0;
+ 	int	percent = 25;
+ 	size_t	vmsize;
+ #ifdef	NEWRECLAIM_DEBUG
+ 	static	int	xval = -1;
+ 	static	int	oldpercent = 0;
+ #endif	/* NEWRECLAIM_DEBUG */
+ #endif	/* NEWRECLAIM */
+ #endif
  
  	if (needfree)
  		return (1);
***************
*** 2492,2502 ****
  		return (1);
  #endif
  #else	/* !sun */
  	if (kmem_used() > (kmem_size() * 3) / 4)
  		return (1);
  #endif	/* sun */
  
  #else
  	if (spa_get_random(100) == 0)
  		return (1);
  #endif
--- 2592,2657 ----
  		return (1);
  #endif
  #else	/* !sun */
+ 
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ /*
+  * Implement the new tunable free RAM algorithm.  We check the various page
+  * VM stats and add them up, then check the free count percentage against
+  * the specified target.  If we're under the target we are memory constrained
+  * and ask for ARC cache shrinkage.
+  */
+ 	vmsize = sizeof(vmwire);
+ 	kernel_sysctlbyname(curthread, "vm.stats.vm.v_wire_count", &vmwire, &vmsize, NULL, 0, NULL, 0);
+ 	vmsize = sizeof(vmactive);
+ 	kernel_sysctlbyname(curthread, "vm.stats.vm.v_active_count", &vmactive, &vmsize, NULL, 0, NULL, 0);
+ 	vmsize = sizeof(vminactive);
+ 	kernel_sysctlbyname(curthread, "vm.stats.vm.v_inactive_count", &vminactive, &vmsize, NULL, 0, NULL, 0);
+ 	vmsize = sizeof(vmcache);
+ 	kernel_sysctlbyname(curthread, "vm.stats.vm.v_cache_count", &vmcache, &vmsize, NULL, 0, NULL, 0);
+ 	vmsize = sizeof(vmfree);
+ 	kernel_sysctlbyname(curthread, "vm.stats.vm.v_free_count", &vmfree, &vmsize, NULL, 0, NULL, 0);
+ 	vmsize = sizeof(percent);
+ 	kernel_sysctlbyname(curthread, "vfs.zfs.arc_freepage_percent_target", &percent, &vmsize, NULL, 0, NULL, 0);
+ 	vmtotal = vmwire + vmactive + vminactive + vmcache + vmfree;
+ #ifdef	NEWRECLAIM_DEBUG
+ 	if (percent != oldpercent) {
+ 		printf("ZFS ARC: Reservation change to [%d], [%d] pages, [%d] free\n", percent, vmtotal, vmfree);
+ 		oldpercent = percent;
+ 	}
+ #endif
+ 
+ 	if (!vmtotal) {
+ 		vmtotal = 1;	/* Protect against divide by zero */
+ 				/* (should be impossible, but...) */
+ 	}
+ 
+ 	if (((vmfree * 100) / vmtotal) < percent) {
+ #ifdef	NEWRECLAIM_DEBUG
+ 		if (xval != 1) {
+ 			printf("ZFS ARC: RECLAIM total %u, free %u, free pct (%u), target pct (%u)\n", vmtotal, vmfree, ((vmfree * 100) / vmtotal), percent);
+ 			xval = 1;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		return(1);
+ #ifdef	NEWRECLAIM_DEBUG
+ 	} else {
+ 		if (xval != 0) {
+ 			printf("ZFS ARC: NORMAL total %u, free %u, free pct (%u), target pct (%u)\n", vmtotal, vmfree, ((vmfree * 100) / vmtotal), percent);
+ 			xval = 0;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 	}
+ 	
+ 
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+ 
  	if (kmem_used() > (kmem_size() * 3) / 4)
  		return (1);
  #endif	/* sun */
  
  #else
  	if (spa_get_random(100) == 0)
  		return (1);
  #endif


>Release-Note:
>Audit-Trail:
>Unformatted:

