ZFS arc_reclaim_needed: better cooperation with pagedaemon

Sun Aug 22 21:46:37 UTC 2010

I propose that the following code in arc_reclaim_needed
(sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c)
/*
 * If pages are needed or we're within 2048 pages
 * of needing to page need to reclaim
 */
if (vm_pages_needed || (vm_paging_target() > -2048))

be changed to

if (vm_paging_needed())

Rationale.

1. Why not the current checks.

ARC sizing should cooperate with pagedaemon in freeing pages.
If ARC starts shrinking "prematurely", before pagedaemon is woken up, then no
potentially eligible inactive pages will be recycled and no potentially eligible
active pages will be deactivated (subject to v_inactive_target).
This would lead to ARC size going to its minimum value (which could hurt ZFS
performance).  Only after that is there a chance that pagedaemon would be woken
up to do its cleaning.
And conversely, if ARC doesn't shrink in time, then pagedaemon would have to
recycle pages with data that could be needed again soon, and that would lead to
excessive swapping and disk I/O.

vm_paging_target() is used only by pagedaemon internally; it effectively sets
an _upper_ limit on how many pages pagedaemon would free when it's activated.
It is no indication of whether pagedaemon should be scanning/freeing pages.
Thus the check of vm_paging_target() leads to premature ARC shrinking.
I believe that many people observe this behavior on sufficiently active systems
(not dedicated file servers) with a few GB of RAM (1-8).
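
To make the difference between the two checks concrete, here is a small
userspace model.  It is only a sketch: struct vm_model and the model_*,
old_check and new_check names are hypothetical, and the two formulas are
written from the thresholds given in the graph legends at the end of this
message (vm_paging_target() == 0 at v_free_target+v_cache_min, and the
vm_paging_needed() threshold at v_free_reserved+v_cache_min), not copied from
the kernel headers.  The vm_pages_needed flag is left out of the model.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's page counters. */
struct vm_model {
	int v_free_count;	/* pages currently free */
	int v_cache_count;	/* pages currently on the cache queue */
	int v_free_target;	/* free page target used by pagedaemon */
	int v_free_reserved;	/* reserve level tied to pagedaemon wakeup */
	int v_cache_min;	/* minimum desired number of cache pages */
};

/* Pages short of the pagedaemon target; a negative value is a surplus. */
static int
model_paging_target(const struct vm_model *m)
{
	return ((m->v_free_target + m->v_cache_min) -
	    (m->v_free_count + m->v_cache_count));
}

/* True once free+cache drops below the pagedaemon wakeup threshold. */
static bool
model_paging_needed(const struct vm_model *m)
{
	return ((m->v_free_reserved + m->v_cache_min) >
	    (m->v_free_count + m->v_cache_count));
}

/* Condition currently used in arc_reclaim_needed (flag omitted). */
static bool
old_check(const struct vm_model *m)
{
	return (model_paging_target(m) > -2048);
}

/* Proposed condition. */
static bool
new_check(const struct vm_model *m)
{
	return (model_paging_needed(m));
}
```

In this model, for any free+cache value between v_free_reserved+v_cache_min
and v_free_target+v_cache_min+2048, old_check() is true while new_check() is
false.  That is the range in which ARC would be shrinking although pagedaemon
has not been asked to do anything.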

The vm_pages_needed check is redundant, because this is a flag that is used to
wake up pagedaemon.  So when it is set, vm_paging_needed() is true and
vm_paging_target() is "way" above zero.  And this flag is reset to zero when
vm_page_count_min() becomes false, which corresponds to even fewer free pages
than when vm_paging_needed() is true.

2. Why the new check.

vm_paging_needed() is the (earliest) condition that wakes up pagedaemon (see
vm_page_alloc).  pagedaemon would first of all run the vm_lowmem event, for
which ARC already has a handler and which would cause ARC size to shrink.
It would seem that having the vm_paging_needed() check would be redundant then.
Almost - if memory pressure is significant, then vm_paging_needed() may stay
true for a while and that would cause additional ARC reduction by
arc_reclaim_thread.

Final notes.

I think that
vm_paging_target() > -2048
check was modeled after the check in the original OpenSolaris code:
freemem < lotsfree + needfree + extra

The issue is that, in my understanding, OpenSolaris pagedaemon works differently
from FreeBSD pagedaemon.

OpenSolaris pagedaemon is activated when freemem (equivalent of our free +
cache) falls down to a certain higher mark (lotsfree).  Initially it scans pages
at a slow rate.  If freemem falls further, the rate linearly increases until it
reaches its maximum when freemem goes to or below a certain lower mark.
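
The linear ramp described above can be sketched as follows.  This is only an
illustration of the shape of the curve, not the OpenSolaris code: the function
name and the low_mark, slow_rate and fast_rate parameters are hypothetical, and
the sketch assumes lotsfree > low_mark.

```c
/*
 * Hypothetical model of the scan rate as a function of freemem.
 * Assumes lotsfree > low_mark and slow_rate <= fast_rate.
 */
static long
model_scan_rate(long freemem, long lotsfree, long low_mark,
    long slow_rate, long fast_rate)
{
	if (freemem > lotsfree)
		return (0);		/* pagedaemon is not active */
	if (freemem <= low_mark)
		return (fast_rate);	/* maximum at or below the lower mark */
	/* Linear interpolation between the two marks. */
	return (slow_rate + (fast_rate - slow_rate) * (lotsfree - freemem) /
	    (lotsfree - low_mark));
}
```

With such a scheme freemem sitting a little below lotsfree is a normal state
that only produces slow background scanning, so a check relative to lotsfree
makes sense there.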

Our pagedaemon is activated when free + cache falls down to a value at which
vm_paging_needed() is true (see definition of this function).  When it is
activated it makes a scan pass through inactive and active pages, setting a
certain target for free+cache, but that target is "soft" and actually is an
upper limit on how many pages could be freed during the pass.  pagedaemon would
make a second (or subsequent) pass only if free+cache falls to a value that is
even lower than the threshold in vm_paging_needed(), which means significant
(severe even) memory pressure/shortage.
So on a sufficiently active system free+cache would typically oscillate between
v_free_reserved+v_cache_min at the bottom and some semi-random values "near"
v_free_target+v_cache_min at the top.  That is with ARC excluded from the
picture.
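
A toy model of that oscillation, under heavy simplifications: demand is a
constant number of pages per tick, and a pagedaemon pass is assumed to bring
free+cache exactly up to the target (in reality the target is only an upper
limit, hence the "semi-random" tops).  All names here are hypothetical;
wake_mark stands for v_free_reserved+v_cache_min and target_mark for
v_free_target+v_cache_min.

```c
/* Summary of a simulated run. */
struct osc_stats {
	long min_avail;	/* lowest free+cache value seen */
	long max_avail;	/* highest free+cache value right after a pass */
	int passes;	/* number of simulated pagedaemon passes */
};

/*
 * Consume 'demand' pages per tick; whenever free+cache drops below
 * wake_mark, simulate one pass that refills it to target_mark.
 */
static struct osc_stats
model_oscillate(long avail, long wake_mark, long target_mark,
    long demand, int ticks)
{
	struct osc_stats s = { avail, 0, 0 };

	for (int i = 0; i < ticks; i++) {
		avail -= demand;
		if (avail < s.min_avail)
			s.min_avail = avail;
		if (avail < wake_mark) {
			avail = target_mark;	/* simplified: exact refill */
			s.passes++;
			if (avail > s.max_avail)
				s.max_avail = avail;
		}
	}
	return (s);
}
```

Even this crude model shows free+cache bouncing between the two marks, which
is the sawtooth one would expect to see with ARC out of the picture.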

And about pictures :-)
Behavior of free+cache with current arc_reclaim_needed code:
http://people.freebsd.org/~avg/avail-mem-before.png
and its behavior after the patch:
http://people.freebsd.org/~avg/avail-mem-after.png

The legends on the pictures are incorrect, sorry, my mastery of drraw is not
good yet.
Correct legends:
"aqua" color - v_free_target+v_cache_min (vm_paging_target() == 0)
"fuchsia" color - v_free_reserved+v_cache_min (vm_paging_needed() threshold)
"lime" color - v_free_count+v_cache_count indeed :)
Y axis - % of total page count.

I think the graphs speak for themselves.

-- 
Andriy Gapon