ZFS ARC under memory pressure

Karl Denninger karl at denninger.net
Fri Aug 19 21:52:09 UTC 2016



On 8/19/2016 16:34, Slawa Olhovchenkov wrote:
> On Fri, Aug 19, 2016 at 03:38:55PM -0500, Karl Denninger wrote:
>
>> On 8/19/2016 15:18, Slawa Olhovchenkov wrote:
>>> On Thu, Aug 18, 2016 at 03:31:26PM -0500, Karl Denninger wrote:
>>>
>>>> On 8/18/2016 15:26, Slawa Olhovchenkov wrote:
>>>>> On Thu, Aug 18, 2016 at 11:00:28PM +0300, Andriy Gapon wrote:
>>>>>
>>>>>> On 16/08/2016 22:34, Slawa Olhovchenkov wrote:
>>>>>>> I see issues with ZFS ARC under memory pressure.
>>>>>>> The ZFS ARC size can be reduced dramatically, all the way down to arc_min.
>>>>>>>
>>>>>>> As I see it, a memory-pressure event causes a call to arc_lowmem(),
>>>>>>> which sets needfree:
>>>>>>>
>>>>>>> arc.c:arc_lowmem
>>>>>>>
>>>>>>>         needfree = btoc(arc_c >> arc_shrink_shift);
>>>>>>>
>>>>>>> After this, arc_available_memory() returns a negative value
>>>>>>> (PAGESIZE * (-needfree)) until needfree becomes zero, no matter how
>>>>>>> much memory has already been freed.  needfree is only set back to 0
>>>>>>> in arc_reclaim_thread(), once arc_size <= arc_c -- that is, not until
>>>>>>> arc_size drops below arc_c, while arc_c itself is decreased on every
>>>>>>> loop iteration.
>>>>>>>
>>>>>>> arc_c drops to its minimum value if arc_size drops quickly enough.
>>>>>>>
>>>>>>> There is currently no control tied back to the initially requested
>>>>>>> amount of memory.
>>>>>>>
>>>>>>> As a result, I see needless ARC reclaim, from 10x to 100x more than
>>>>>>> was asked for.
>>>>>>>
>>>>>>> Can someone check me and comment on this?
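To make that concrete, here is a toy user-space model of the loop described
above.  This is not arc.c, and the per-pass eviction rate is an assumption
made purely for illustration; the point is that the stop condition
(arc_size <= arc_c) lets the reclaim run far past the arc_c >> arc_shrink_shift
(1/128 with the default shift of 7) that the low-memory event asked for:

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        int64_t arc_c_max = 100LL << 30;      /* ~100 GiB ARC, as in the stats below */
        int64_t arc_c_min = arc_c_max / 8;    /* "Min Size (Hard Limit): 12.50%" */
        int64_t arc_c = arc_c_max;
        int64_t arc_size = arc_c_max;         /* ARC is warm and full */
        int64_t requested = arc_c >> 7;       /* what arc_lowmem() asked for: 1/128 */
        int64_t evict_per_pass = 512LL << 20; /* ASSUMED eviction speed per pass */
        int passes = 0;

        /* "needfree" clears only once arc_size <= arc_c, and arc_c is cut
         * again on every pass, so eviction chases a moving target. */
        do {
            if (arc_c > arc_c_min) {
                arc_c -= arc_c >> 7;
                if (arc_c < arc_c_min)
                    arc_c = arc_c_min;
            }
            arc_size -= evict_per_pass;
            passes++;
        } while (arc_size > arc_c);

        printf("requested %jd MiB, reclaimed %jd MiB in %d passes\n",
            (intmax_t)(requested >> 20),
            (intmax_t)((arc_c_max - arc_size) >> 20), passes);
        return (0);
    }

With these numbers it reports roughly 80x the requested amount reclaimed;
assume a slower eviction rate and it runs all the way down to arc_c_min,
which matches the "mostly after drop to Min Size" observation later in the
thread.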
>>>>>> You might have found a real problem here, but I am short of time right now to
>>>>>> properly analyze the issue.  I think that on illumos 'needfree' is a variable
>>>>>> that's managed by the virtual memory system and it is akin to our
>>>>>> vm_pageout_deficit.  But during the porting it became an artificial value and
>>>>>> its handling might be sub-optimal.
>>>>> As I see it, it is totally not optimal.
>>>>> I have created a patch for the sub-optimal handling and am now testing it.
>>>> You might want to look at the code contained in here:
>>>>
>>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594
>>> In my case the arc.c issue is caused by revision r286625 in HEAD (and
>>> r288562 in STABLE) -- all in 2015, not touched in 2014.
>>>
>>>> There are some ugly interactions with the VM system that you can run into
>>>> if you're not careful.  I've chased this issue before, and while I haven't
>>>> yet done the work to integrate my fix into 11.x (and the underlying code
>>>> *has* changed since the 10.x patches I developed), if you wind up driving
>>>> the VM system to evict pages to swap rather than paring back the ARC,
>>>> you're probably making the wrong choice.
>>>>
>>>> In addition UMA can come into the picture too and (at least previously)
>>>> was a severe contributor to pathological behavior.
>>> I only make the shrinking of the ARC less aggressive (and more controlled).
>>> Right now the ARC just collapses.
>>>
>>> The PR you pointed to is really BIG; I can't read and understand all of it.
>>> r286625 changed the behavior of the interaction between the ARC and the VM.
>>> Does your problem still exist?  Can you explain (on the list)?
>>>
>> Essentially ZFS is a "bolt-on": unlike UFS, which uses the unified buffer
>> cache (managed by the VM system), ZFS does not.  The ARC is allocated out
>> of kernel memory and (by default) also uses UMA; the VM system is not
>> involved in its management.
>>
>> When the VM system gets constrained (low memory) it thus cannot tell the
>> ARC to pare back.  So when the VM system gets low on RAM it will start
> Currently the VM generates an event and the ARC listens for that event,
> handling it in arc.c:arc_lowmem().
>
>> to page.  The problem with this is that if the VM system is low on RAM
>> because the ARC is consuming memory you do NOT want to page, you want to
>> evict some of the ARC.
> Right now, on a `lowmem` event, the ARC tries to evict 1/128 of itself.
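(For reference, that 1/128 is the default arc_shrink_shift of 7 in the line
quoted earlier, needfree = btoc(arc_c >> arc_shrink_shift), i.e. arc_c / 128.
Against the ~98 GiB target shown below that works out to only about 0.77 GiB
per low-memory event -- yet far more than that ends up being reclaimed.)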
>
>> Unfortunately the VM system has another interaction that causes trouble
>> too.  The VM system will "demote" a page to inactive or cache status but
>> not actually free it.  It only starts to go through those pages and free
>> them when the vm system wakes up, and that only happens when free space
>> gets low enough to trigger it.
>
>> Finally, there's another problem that comes into play; UMA.  Kernel
>> memory allocation is fairly expensive.  UMA grabs memory from the kernel
>> allocation system in big chunks and manages it, and by doing so gains a
>> pretty-significant performance boost.  But this means that you can have
>> large amounts of RAM that are allocated, not in use, and yet the VM
>> system cannot reclaim them on its own.  The ZFS code has to reap those
>> caches, but reaping them is a moderately expensive operation too, thus
>> you don't want to do it unnecessarily.
> I'm not sure, but some code in ZFS may already handle this:
> arc.c:arc_kmem_reap_now().
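The effect being described is easy to see with a toy zone allocator in user
space -- an analogy only, not the kernel UMA code: freed items stay cached on
the zone's own free list, so the "VM" still counts them as allocated until an
explicit reap pass hands them back.

    #include <stdio.h>
    #include <stdlib.h>

    #define ITEMS 1024

    static size_t vm_allocated;              /* what the "VM system" sees */

    struct zone {
        void  *freelist[ITEMS];              /* cached, reusable items */
        int    nfree;
        size_t itemsize;
    };

    static void *
    zone_alloc(struct zone *z)
    {
        if (z->nfree > 0)                    /* reuse a cached item: cheap */
            return (z->freelist[--z->nfree]);
        vm_allocated += z->itemsize;         /* otherwise go to the "VM" */
        return (malloc(z->itemsize));
    }

    static void
    zone_free(struct zone *z, void *p)
    {
        /* The item stays charged to the zone; the "VM" gets nothing back. */
        z->freelist[z->nfree++] = p;
    }

    static void
    zone_reap(struct zone *z)
    {
        /* Only an explicit reap returns the cached items to the "VM". */
        while (z->nfree > 0) {
            free(z->freelist[--z->nfree]);
            vm_allocated -= z->itemsize;
        }
    }

    int
    main(void)
    {
        static struct zone z = { .itemsize = 128 * 1024 };
        static void *p[ITEMS];
        int i;

        for (i = 0; i < ITEMS; i++)
            p[i] = zone_alloc(&z);
        for (i = 0; i < ITEMS; i++)
            zone_free(&z, p[i]);
        printf("after freeing all items: %zu KiB still allocated from the \"VM\"\n",
            vm_allocated >> 10);
        zone_reap(&z);
        printf("after reaping the zone:  %zu KiB still allocated from the \"VM\"\n",
            vm_allocated >> 10);
        return (0);
    }

That explicit reap step is presumably what arc.c:arc_kmem_reap_now() is for,
and as noted above it is expensive enough that you don't want to run it
unnecessarily.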
>
>> I've not yet gone through the 11.x code to see what changed from 10.x.
>> What I do know is that it is materially better-behaved than it used to
>> be: prior to 11.x I would (by now) pretty much have been forced into
>> rolling that patch forward and testing it, because the misbehavior on one
>> of my production systems was severe enough to render it basically
>> unusable without the patch in that PR applied, the most serious
>> misbehavior being paging-induced stalls that could reach tens of seconds
>> or more in duration.
>>
>> 11.x hasn't exhibited the severe problems, unpatched, that 10.x was
>> known to do on my production systems -- but it is far less than great in
>> that it sure as heck does have UMA coherence issues.....
>>
>> ARC Size:                               38.58%  8.61    GiB
>>         Target Size: (Adaptive)         70.33%  15.70   GiB
>>         Min Size (Hard Limit):          12.50%  2.79    GiB
>>         Max Size (High Water):          8:1     22.32   GiB
>>
>> I have 20GB out in kernel memory on this machine right now but only 8.6GB
>> of it in the ARC; the rest is (mostly) sitting in UMA, allocated but unused
>> -- so despite the belief expressed by some that the 11.x code is "better"
>> at reaping UMA, I'm sure not seeing it here.
> I see.
> In my case:
>
> ARC Size:                               79.65%  98.48   GiB
>         Target Size: (Adaptive)         79.60%  98.42   GiB
>         Min Size (Hard Limit):          12.50%  15.46   GiB
>         Max Size (High Water):          8:1     123.64  GiB
>
> System Memory:
>
>         2.27%   2.83    GiB Active,     9.58%   11.94   GiB Inact
>         86.34%  107.62  GiB Wired,      0.00%   0 Cache
>         1.80%   2.25    GiB Free,       0.00%   0 Gap
>
>         Real Installed:                         128.00  GiB
>         Real Available:                 99.96%  127.95  GiB
>         Real Managed:                   97.41%  124.64  GiB
>
>         Logical Total:                          128.00  GiB
>         Logical Used:                   88.92%  113.81  GiB
>         Logical Free:                   11.08%  14.19   GiB
>
> Kernel Memory:                                  758.25  MiB
>         Data:                           97.81%  741.61  MiB
>         Text:                           2.19%   16.64   MiB
>
> Kernel Memory Map:                              124.64  GiB
>         Size:                           81.84%  102.01  GiB
>         Free:                           18.16%  22.63   GiB
>
> Mem: 2895M Active, 12G Inact, 108G Wired, 528K Buf, 2303M Free
> ARC: 98G Total, 89G MFU, 9535M MRU, 35M Anon, 126M Header, 404M Other
> Swap: 32G Total, 394M Used, 32G Free, 1% Inuse
>
> Is this 12G Inactive the 'UMA allocated-but-unused'?
> It may also be freed-but-not-yet-reclaimed network bufs.
>
>> I'll get around to rolling forward and modifying that PR since that
>> particular bit of jackassery with UMA is a definite performance
>> problem.  I suspect a big part of what you're seeing lies there as
>> well.  When I do get that code done and tested I suspect it may solve
>> your problems as well.
> No.  My problem is completely different: under memory pressure, after
> arc_lowmem() sets needfree to a non-zero value, arc_reclaim_thread() starts
> to shrink the ARC.  But arc_reclaim_thread() (in the FreeBSD case) doesn't
> correctly control this process, and the shrinking stops at a random point
> (whenever, after the next iteration, arc_size <= arc_c), mostly only after
> dropping to the Min Size (Hard Limit).
>
> I just restore control over the shrink process.
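For concreteness, the "restored control" described above -- expressed against
the toy model earlier in the thread, and purely as a hypothetical sketch, not
the actual patch -- would amount to bounding the loop by the amount originally
asked for rather than by the ever-moving arc_c target:

    /* Stop once the amount the low-memory event asked for has been
     * evicted; arc_c is left alone instead of being cut on every pass. */
    int64_t reclaimed = 0;

    while (reclaimed < requested && arc_size > arc_c_min) {
        arc_size -= evict_per_pass;
        reclaimed += evict_per_pass;
    }

(The variables are the ones from that toy model.)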
Not quite, due to the UMA issue, among other things.  There's also a
potential "stall" issue that can arise, having to do with dirty_max
sizing, especially if you are using rotating media.  The PR patch scaled
that back dynamically under memory pressure as well and eliminated that
issue.
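For illustration only -- this is not the code from the PR, and "dirty_max" is
assumed here to mean the ZFS dirty-data ceiling (vfs.zfs.dirty_data_max), with
the scaling rule invented for the sketch -- the idea is simply to shrink that
ceiling when free memory gets tight, so a large dirty backlog can't pile up
behind slow rotating media and stall writers:

    #include <stdint.h>

    /* Hypothetical sketch of the idea, not the PR code: halve the
     * dirty-data ceiling under memory pressure, but keep a floor. */
    static uint64_t
    scale_dirty_max(uint64_t dirty_max, uint64_t free_bytes, uint64_t want_free)
    {
        const uint64_t floor = 256ULL << 20;  /* keep at least 256 MiB */

        if (free_bytes >= want_free)
            return (dirty_max);               /* no pressure: leave it alone */
        return (dirty_max / 2 > floor ? dirty_max / 2 : floor);
    }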

I won't have time to look at this on my test machine for at least another
week, as I'm unfortunately buried with unrelated work at present, but I
should be able to put some effort into it within the next couple of weeks
and see if I can quickly roll forward the important parts of the previous
PR patch.

I think you'll find that it stops the behavior you're seeing -- I'm just
pointing out that this was more complex internally than it first appeared
in the 10.x branch, and given the symptoms you're describing I have no
reason to believe the interactions that lead to the bad behavior are not
still in play.

-- 
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/