ZFS ARC under memory pressure

Karl Denninger karl at denninger.net
Fri Aug 19 20:39:09 UTC 2016


On 8/19/2016 15:18, Slawa Olhovchenkov wrote:
> On Thu, Aug 18, 2016 at 03:31:26PM -0500, Karl Denninger wrote:
>
>> On 8/18/2016 15:26, Slawa Olhovchenkov wrote:
>>> On Thu, Aug 18, 2016 at 11:00:28PM +0300, Andriy Gapon wrote:
>>>
>>>> On 16/08/2016 22:34, Slawa Olhovchenkov wrote:
>>>>> I see issues with ZFS ARC under memory pressure.
>>>>> The ZFS ARC size can be dramatically reduced, down to arc_min.
>>>>>
>>>>> As I see it, a memory pressure event causes a call to arc_lowmem, which
>>>>> sets needfree:
>>>>>
>>>>> arc.c:arc_lowmem
>>>>>
>>>>>         needfree = btoc(arc_c >> arc_shrink_shift);
>>>>>
>>>>> After this, arc_available_memory returns negative values (PAGESIZE *
>>>>> (-needfree)) until needfree is zero, regardless of how much memory has
>>>>> already been freed.  needfree is only set back to 0 in
>>>>> arc_reclaim_thread(), when arc_size <= arc_c, and until arc_size drops
>>>>> below arc_c, arc_c is decreased on every loop iteration.
>>>>>
>>>>> arc_c is dropped to its minimum value if arc_size drops fast enough.
>>>>>
>>>>> Nothing ties the amount reclaimed back to the initial memory demand.
>>>>>
>>>>> As a result, I can see needless ARC reclaim, from 10x to 100x the
>>>>> amount actually required.
>>>>>
>>>>> Can someone check me and comment on this?
>>>> You might have found a real problem here, but I am short of time right now to
>>>> properly analyze the issue.  I think that on illumos 'needfree' is a variable
>>>> that's managed by the virtual memory system and it is akin to our
>>>> vm_pageout_deficit.  But during the porting it became an artificial value and
>>>> its handling might be sub-optimal.
>>> As I see it, it is totally not optimal.
>>> I have created a patch for this sub-optimal handling and am now testing it.
>> You might want to look at the code contained in here:
>>
>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594
> In my case the arc.c issue is caused by revision r286625 in HEAD (and
> r288562 in STABLE) -- all in 2015, not touched in 2014.
>
>> There are some ugly interactions with the VM system you can run into if
>> you're not careful; I've chased this issue before, and while I haven't
>> yet done the work to integrate it into 11.x (the underlying code *has*
>> changed since the 10.x patches I developed), if you wind up driving the
>> VM system to evict pages to swap rather than pare back the ARC you're
>> probably making the wrong choice.
>>
>> In addition, UMA can come into the picture too and (at least previously)
>> was a severe contributor to pathological behavior.
> I only do a less aggressive (and more controlled) shrink of the ARC size.
> As it stands, the ARC just collapses.
>
> The PR you point to is really BIG. I can't read and understand all of it.
> r286625 changed the behavior of the interaction between ARC and VM.
> Does your problem still exist? Can you explain (on the list)?
>
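
The path Slawa describes boils down to roughly the following.  This is
a simplified sketch of the arc.c logic from memory, not a verbatim
quote of any particular revision; the names match the ones quoted
above:

static void
arc_lowmem(void *arg __unused, int howto __unused)
{
        /* vm_lowmem event: demand back arc_c / 2^arc_shrink_shift */
        needfree = btoc(arc_c >> arc_shrink_shift);
        /* ...then wake arc_reclaim_thread */
}

static int64_t
arc_available_memory(void)
{
        int64_t lowest = INT64_MAX;

        /*
         * While needfree is non-zero this reports a deficit, no matter
         * how much memory has actually been freed since the event fired.
         */
        if (needfree)
                lowest = PAGESIZE * (-(int64_t)needfree);

        /* (the other free-memory measures are consulted here as well) */
        return (lowest);
}

/*
 * arc_reclaim_thread() keeps shrinking arc_c and evicting while
 * arc_available_memory() stays negative, and only clears needfree once
 * arc_size <= arc_c; that is how arc_c can be driven far below what the
 * original pressure event actually required.
 */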

Essentially ZFS is a "bolt-on": unlike UFS, which uses the unified
buffer cache (which the VM system manages), ZFS does not.  The ARC is
allocated out of kernel memory and (by default) also uses UMA; the VM
system is not involved in its management.

When the VM system gets constrained (low memory) it thus cannot tell
the ARC to pare back, so it will start to page.  The problem with this
is that if the system is low on RAM because the ARC is consuming the
memory, you do NOT want to page; you want to evict some of the ARC.

Consider this: ARC data *at best* prevents one I/O.  That is, if the
data is in the cache when you go to read it, you avoid one I/O per unit
of data in the ARC that you didn't have to read from disk.

Paging *always* requires one I/O (to write the page(s) out to swap) and
MAY involve two (to later page them back in).  It is never a "win" to
spend a *guaranteed* I/O when you can instead act in a way that *might*
cause you to (later) need to execute one.

Unfortunately the VM system has another interaction that causes trouble
too.  The VM system will "demote" a page to inactive or cache status but
not actually free it.  It only starts to go through those pages and free
them when the page-out daemon wakes up, and that only happens when free
space gets low enough to trigger it.
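
If you want to watch that happen, the relevant page-queue counters are
exported via sysctl.  Something like the following (a sketch; the
sysctl names are the stock ones, adjust if your tree differs) shows how
much is parked on the inactive queue versus actually free, and roughly
where the page-out daemon's wakeup point sits:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>
#include <unistd.h>

static u_int
getcnt(const char *name)
{
        u_int v;
        size_t len = sizeof(v);

        if (sysctlbyname(name, &v, &len, NULL, 0) != 0)
                err(1, "%s", name);
        return (v);
}

int
main(void)
{
        u_long pgsz = (u_long)getpagesize();
        u_long mb = 1024 * 1024;

        /* pages parked on the inactive queue vs. pages actually free */
        printf("inactive:    %lu MB\n",
            getcnt("vm.stats.vm.v_inactive_count") * pgsz / mb);
        printf("free:        %lu MB\n",
            getcnt("vm.stats.vm.v_free_count") * pgsz / mb);
        /* the page-out daemon only starts freeing below (roughly) this */
        printf("free target: %lu MB\n",
            getcnt("vm.v_free_target") * pgsz / mb);
        return (0);
}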

Finally, there's another problem that comes into play: UMA.  Kernel
memory allocation is fairly expensive.  UMA grabs memory from the kernel
allocation system in big chunks and manages it, and by doing so gains a
pretty significant performance boost.  But this means that you can have
large amounts of RAM that are allocated, not in use, and yet the VM
system cannot reclaim them on its own.  The ZFS code has to reap those
caches, but reaping them is a moderately expensive operation too, so
you don't want to do it unnecessarily.
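
For the curious, "reaping" here means walking the UMA-backed kmem
caches and draining their idle buckets back to the VM.  If memory
serves, the reap path in arc.c (arc_kmem_reap_now()) amounts to roughly
the following; treat it as a sketch rather than the exact code of any
given revision:

static void
arc_kmem_reap_now(void)
{
        size_t i;

        /* one zio buffer cache per block-size class */
        for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
                kmem_cache_reap_now(zio_buf_cache[i]);
                kmem_cache_reap_now(zio_data_buf_cache[i]);
        }
        kmem_cache_reap_now(buf_cache);
        kmem_cache_reap_now(hdr_full_cache);
        kmem_cache_reap_now(hdr_l2only_cache);

        /*
         * On FreeBSD each kmem_cache_reap_now() ends up draining the
         * underlying UMA zone's cached buckets; that drain is what
         * actually hands idle pages back to the VM, and it is also why
         * a reap isn't free.
         */
}

Until something actually calls that, the freed-but-cached items just
sit in UMA, which is exactly the allocated-but-unused memory I show
below.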

I've not yet gone through the 11.x code to see what changed from 10.x;
what I do know is that it is materially better-behaved than it used to
be.  Prior to 11.x I would (by now) pretty much have been forced into
rolling that patch forward and testing it, because the misbehavior on
one of my production systems was severe enough to render it basically
unusable without the patch in that PR applied.  The most serious
misbehavior was paging-induced stalls that could reach tens of seconds
or more in duration.

11.x hasn't exhibited the severe problems, unpatched, that 10.x was
known to on my production systems -- but it is far from great, in that
it sure as heck does have UMA coherence issues.  Right now, for example:

ARC Size:                               38.58%  8.61    GiB
        Target Size: (Adaptive)         70.33%  15.70   GiB
        Min Size (Hard Limit):          12.50%  2.79    GiB
        Max Size (High Water):          8:1     22.32   GiB

I have 20GB out in kernel memory on this machine right now but only
8.6GB of it in the ARC; the rest is (mostly) sitting in UMA, allocated
but unused -- so despite the belief expressed by some that the 11.x code
is "better" at reaping UMA, I'm sure not seeing it here.

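A quick way to put numbers on that gap is to compare wired memory
against the ARC's own idea of its size.  This is only a sketch (wired
covers more than ZFS and UMA), but the sysctls used should be present
on any ZFS-enabled 10.x/11.x kernel:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        uint64_t arc_size;
        u_int wired_pages;
        size_t len;

        len = sizeof(arc_size);
        if (sysctlbyname("kstat.zfs.misc.arcstats.size", &arc_size, &len,
            NULL, 0) != 0)
                err(1, "arcstats.size");

        len = sizeof(wired_pages);
        if (sysctlbyname("vm.stats.vm.v_wire_count", &wired_pages, &len,
            NULL, 0) != 0)
                err(1, "v_wire_count");

        uint64_t wired = (uint64_t)wired_pages * (uint64_t)getpagesize();

        printf("wired:           %ju MB\n", (uintmax_t)(wired >> 20));
        printf("ARC size:        %ju MB\n", (uintmax_t)(arc_size >> 20));
        /* the difference is ARC-external kernel memory: UMA free items, etc. */
        printf("wired minus ARC: %ju MB\n",
            (uintmax_t)((wired - arc_size) >> 20));
        return (0);
}

It isn't exact accounting, but when the difference runs to many
gigabytes and "vmstat -z" shows large FREE columns for the zio_buf_*
and zio_data_buf_* zones, that's the allocated-but-unused UMA memory at
work.
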
I'll get around to rolling forward and modifying that PR, since that
particular bit of jackassery with UMA is a definite performance
problem.  I suspect a big part of what you're seeing lies there as
well.  When I do get that code done and tested, I suspect it may solve
your problems too.

-- 
Karl Denninger
karl at denninger.net <mailto:karl at denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/