Strange ARC/Swap/CPU on yesterday's -CURRENT

Wed Apr 4 17:49:57 UTC 2018

On Tue, Apr 03, 2018 at 09:42:48PM -0700, Don Lewis wrote:
> On  3 Apr, Don Lewis wrote:
> > I reconfigured my Ryzen box to be more similar to my default package
> > builder by disabling SMT and half of the RAM, to limit it to 8 cores
> > and 32 GB and then started bisecting to try to track down the problem.
> > For each test, I first filled ARC by tarring /usr/ports/distfiles to
> > /dev/null.  The commit range that I was searching was r329844 to
> > r331716.  I narrowed the range to r329844 to r329904.  With r329904
> > and newer, ARC is totally unresponsive to memory pressure and the
> > machine pages heavily.  I see ARC sizes of 28-29GB and 30GB of wired
> > RAM, so there is not much leftover for getting useful work done.  Active
> > memory and free memory both hover under 1GB each.  Looking at the
> > commit logs over this range, the most likely culprit is:
> > 
> > r329882 | jeff | 2018-02-23 14:51:51 -0800 (Fri, 23 Feb 2018) | 13 lines
> > 
> > Add a generic Proportional Integral Derivative (PID) controller algorithm and
> > use it to regulate page daemon output.
> > 
> > This provides much smoother and more responsive page daemon output, anticipating
> > demand and avoiding pageout stalls by increasing the number of pages to match
> > the workload.  This is a reimplementation of work done by myself and mlaier at
> > Isilon.
> > 
> > 
> > It is quite possible that the recent fixes to the PID controller will
> > fix the problem.  Not that r329844 was trouble free ... I left tar
> > running over lunchtime to fill ARC and the OOM killer nuked top, tar,
> > ntpd, both of my ssh sessions into the machine, and multiple instances
> > of getty while I was away.  I was able to log in again and successfully
> > run poudriere, and ARC did respond to the memory pressure and cranked
> > itself down to about 5 GB by the end of the run.  I did not see the same
> > problem with tar when I did the same with r329904.
> 
> I just tried r331966 and see no improvement.  No OOM process kills
> during the tar run to fill ARC, but with ARC filled, the machine is
> thrashing itself at the start of the poudriere run while trying to build
> ports-mgmt/pkg (39 minutes so far).  ARC appears to be unresponsive to
> memory demand.  I've seen no decrease in ARC size or wired memory since
> starting poudriere.

Re-reading the ARC reclaim code, I see a couple of issues which might be
at the root of the behaviour you're seeing.

1. zfs_arc_free_target is too low now. It is initialized to the page
   daemon wakeup threshold, which is slightly above v_free_min. With the
   PID controller, the page daemon uses a setpoint of v_free_target.
   Moreover, it now wakes up regularly rather than having wakeups be
   synchronized by a mutex, so it will respond quickly if the free page
   count dips below v_free_target. The free page count will dip below
   zfs_arc_free_target only in the face of sudden and extreme memory
   pressure now, so the FMR_LOTSFREE case probably isn't getting
   exercised. Try initializing zfs_arc_free_target to v_free_target.

2. In the inactive queue scan, we used to compute the shortage after
   running uma_reclaim() and the lowmem handlers (which include a
   synchronous call to arc_lowmem()). Now it's computed before, so we're
   not taking into account the pages that get freed by the ARC and UMA.
   The following rather hacky patch may help. I note that the lowmem
   logic is now somewhat broken when multiple NUMA domains are
   configured, however, since it fires only when domain 0 has a free
   page shortage.
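
As an aside on point 1, the threshold mismatch can be sketched with a
toy model in C. This is not kernel code: every toy_* name is
hypothetical, and the (free_min / 10) * 11 formula is only an assumed
stand-in for "slightly above v_free_min". It counts how often a
"free pages < target" check would fire when the free page count hovers
just under the page daemon's setpoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Assumed model of the page daemon wakeup threshold: a value slightly
 * above free_min.  The exact formula is an assumption for illustration.
 */
static unsigned int
toy_wakeup_thresh(unsigned int free_min)
{
	return ((free_min / 10) * 11);
}

/* Models an ARC-style check: reclaim only when free pages drop below target. */
static bool
toy_arc_wants_reclaim(unsigned int free_count, unsigned int arc_free_target)
{
	return (free_count < arc_free_target);
}

/* Count how many free-page samples would trigger reclaim for a given target. */
static unsigned int
toy_count_triggers(const unsigned int *samples, size_t n,
    unsigned int arc_free_target)
{
	unsigned int hits = 0;
	size_t i;

	assert(samples != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (toy_arc_wants_reclaim(samples[i], arc_free_target))
			hits++;
	}
	return (hits);
}
```

With a made-up free_min of 1000 pages and free_target of 4000 pages,
samples between 3700 and 3950 never cross the 1100-page threshold, but
every one of them crosses a target of 4000. That is the motivation for
trying zfs_arc_free_target = v_free_target.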

Index: sys/vm/vm_pageout.c
===================================================================
--- sys/vm/vm_pageout.c	(revision 331933)
+++ sys/vm/vm_pageout.c	(working copy)
@@ -1114,25 +1114,6 @@
 	boolean_t queue_locked;
 
 	/*
-	 * If we need to reclaim memory ask kernel caches to return
-	 * some.  We rate limit to avoid thrashing.
-	 */
-	if (vmd == VM_DOMAIN(0) && pass > 0 &&
-	    (time_uptime - lowmem_uptime) >= lowmem_period) {
-		/*
-		 * Decrease registered cache sizes.
-		 */
-		SDT_PROBE0(vm, , , vm__lowmem_scan);
-		EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_PAGES);
-		/*
-		 * We do this explicitly after the caches have been
-		 * drained above.
-		 */
-		uma_reclaim();
-		lowmem_uptime = time_uptime;
-	}
-
-	/*
 	 * The addl_page_shortage is the number of temporarily
 	 * stuck pages in the inactive queue.  In other words, the
 	 * number of pages from the inactive count that should be
@@ -1824,6 +1805,26 @@
 		atomic_store_int(&vmd->vmd_pageout_wanted, 1);
 
 		/*
+		 * If we need to reclaim memory ask kernel caches to return
+		 * some.  We rate limit to avoid thrashing.
+		 */
+		if (vmd == VM_DOMAIN(0) &&
+		    vmd->vmd_free_count < vmd->vmd_free_target &&
+		    (time_uptime - lowmem_uptime) >= lowmem_period) {
+			/*
+			 * Decrease registered cache sizes.
+			 */
+			SDT_PROBE0(vm, , , vm__lowmem_scan);
+			EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_PAGES);
+			/*
+			 * We do this explicitly after the caches have been
+			 * drained above.
+			 */
+			uma_reclaim();
+			lowmem_uptime = time_uptime;
+		}
+
+		/*
 		 * Use the controller to calculate how many pages to free in
 		 * this interval.
 		 */