ZFS ARC under memory pressure

Slawa Olhovchenkov slw at zxy.spb.ru
Fri Aug 19 21:34:50 UTC 2016


On Fri, Aug 19, 2016 at 03:38:55PM -0500, Karl Denninger wrote:

> On 8/19/2016 15:18, Slawa Olhovchenkov wrote:
> > On Thu, Aug 18, 2016 at 03:31:26PM -0500, Karl Denninger wrote:
> >
> >> On 8/18/2016 15:26, Slawa Olhovchenkov wrote:
> >>> On Thu, Aug 18, 2016 at 11:00:28PM +0300, Andriy Gapon wrote:
> >>>
> >>>> On 16/08/2016 22:34, Slawa Olhovchenkov wrote:
> >>>>> I see issues with ZFS ARC under memory pressure.
> >>>>> The ZFS ARC size can be dramatically reduced, down to arc_min.
> >>>>>
> >>>>> As I see it, a memory pressure event causes a call to
> >>>>> arc_lowmem() and sets needfree:
> >>>>>
> >>>>> arc.c:arc_lowmem
> >>>>>
> >>>>>         needfree = btoc(arc_c >> arc_shrink_shift);
> >>>>>
> >>>>> After this, arc_available_memory() returns negative values
> >>>>> (PAGESIZE * (-needfree)) until needfree is zero, no matter how
> >>>>> much memory has already been freed. needfree is set to 0 in
> >>>>> arc_reclaim_thread() only when arc_size <= arc_c, i.e. not
> >>>>> until arc_size drops below arc_c (and arc_c is decreased at
> >>>>> every loop iteration).
> >>>>>
> >>>>> arc_c is dropped to its minimum value if arc_size drops fast
> >>>>> enough.
> >>>>>
> >>>>> Currently there is no control tied to the initial memory
> >>>>> demand.
> >>>>>
> >>>>> As a result, I can see needless ARC reclaim, from 10x to 100x
> >>>>> more than needed.
> >>>>>
> >>>>> Can someone check me and comment on this?
> >>>> You might have found a real problem here, but I am short of time right now to
> >>>> properly analyze the issue.  I think that on illumos 'needfree' is a variable
> >>>> that's managed by the virtual memory system and it is akin to our
> >>>> vm_pageout_deficit.  But during the porting it became an artificial value and
> >>>> its handling might be sub-optimal.
> >>> As I see it, it is totally not optimal.
> >>> I have created a patch for this sub-optimal handling and am now
> >>> testing it.
> >> You might want to look at the code contained in here:
> >>
> >> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594
> > In my case the arc.c issue is caused by revision r286625 in HEAD
> > (and r288562 in STABLE) -- all in 2015, not touched in 2014.
> >
> >> There are some ugly interactions with the VM system you can run into if
> >> you're not careful; I've chased this issue before and while I haven't
> >> yet done the work to integrate it into 11.x (and the underlying code
> >> *has* changed since the 10.x patches I developed) if you wind up driving
> >> the VM system to evict pages to swap rather than pare back ARC you're
> >> probably making the wrong choice.
> >>
> >> In addition UMA can come into the picture too and (at least previously)
> >> was a severe contributor to pathological behavior.
> > I am only doing a less aggressive (and more controlled) shrink of
> > the ARC size. Right now the ARC just collapses.
> >
> > The pointed-to PR is really BIG. I can't read and understand all of
> > it. r286625 changed the behavior of the interaction between the ARC
> > and the VM. Does your problem still exist? Can you explain (on the
> > list)?
> >
> 
> Essentially ZFS is a "bolt-on" and unlike UFS which uses the unified
> buffer cache (which the VM system manages) ZFS does not.  ARC is
> allocated out of kernel memory and (by default) also uses UMA; the VM
> system is not involved in its management.
> 
> When the VM system gets constrained (low memory) it thus cannot tell the
> ARC to pare back.  So when the VM system gets low on RAM it will start

Currently the VM generates an event, and the ARC listens for this
event and handles it in arc.c:arc_lowmem().
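A simplified sketch of that hookup, as I read the r286625-era arc.c
(not a verbatim excerpt):

    /* arc.c (FreeBSD): at init time the ARC subscribes to the VM's
     * vm_lowmem event; the VM fires it when free pages run short. */
    static eventhandler_tag arc_event_lowmem = NULL;

    void
    arc_init(void)
    {
            /* ... */
            arc_event_lowmem = EVENTHANDLER_REGISTER(vm_lowmem,
                arc_lowmem, NULL, EVENTHANDLER_PRI_FIRST);
            /* ... */
    }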

> to page.  The problem with this is that if the VM system is low on RAM
> because the ARC is consuming memory you do NOT want to page, you want to
> evict some of the ARC.

Now, on a `lowmem` event, the ARC tries to evict 1/128 of the ARC.
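That 1/128 comes from arc_shrink_shift (default 7, and 2^7 = 128).
Roughly, again as I read the code (simplified, not verbatim):

    /* arc.c (FreeBSD): on vm_lowmem, record a deficit of 1/128 of
     * the current target size arc_c, converted to pages (btoc() is
     * bytes-to-clicks), then kick the reclaim thread. */
    static void
    arc_lowmem(void *arg __unused, int howto __unused)
    {
            mutex_enter(&arc_reclaim_lock);
            needfree = btoc(arc_c >> arc_shrink_shift);
            cv_signal(&arc_reclaim_thread_cv);
            /* ... the pagedaemon also waits here for progress ... */
            mutex_exit(&arc_reclaim_lock);
    }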

> Unfortunately the VM system has another interaction that causes trouble
> too.  The VM system will "demote" a page to inactive or cache status but
> not actually free it.  It only starts to go through those pages and free
> them when the vm system wakes up, and that only happens when free space
> gets low enough to trigger it.


> Finally, there's another problem that comes into play; UMA.  Kernel
> memory allocation is fairly expensive.  UMA grabs memory from the kernel
> allocation system in big chunks and manages it, and by doing so gains a
> pretty-significant performance boost.  But this means that you can have
> large amounts of RAM that are allocated, not in use, and yet the VM
> system cannot reclaim them on its own.  The ZFS code has to reap those
> caches, but reaping them is a moderately expensive operation too, thus
> you don't want to do it unnecessarily.

I am not sure, but some code in ZFS may handle this:
arc.c:arc_kmem_reap_now().
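If I read it right, that function just walks the ZFS kmem caches and
asks the allocator (UMA on FreeBSD) to return cached-but-unused slabs
to the VM. A rough sketch, assuming the usual cache names (simplified,
not verbatim):

    /* arc.c: reap the zio buffer caches and the ARC's own caches so
     * UMA hands free slabs back to the kernel; this is the
     * "moderately expensive" reap mentioned above. */
    static void
    arc_kmem_reap_now(void)
    {
            size_t i;

            for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
                    kmem_cache_reap_now(zio_buf_cache[i]);
                    kmem_cache_reap_now(zio_data_buf_cache[i]);
            }
            kmem_cache_reap_now(buf_cache);
            kmem_cache_reap_now(hdr_full_cache);
            kmem_cache_reap_now(hdr_l2only_cache);
    }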

> I've not yet gone through the 11.x code to see what changed from 10.x;
> what I do know is that it is materially better-behaved than it used to
> be, in that prior to 11.x I would have (by now) pretty much been forced
> into rolling that forward and testing it because the misbehavior in one
> of my production systems was severe enough to render it basically
> unusable without the patch in that PR inline, with the most-serious
> misbehavior being paging-induced stalls that could reach 10s of seconds
> or more in duration.
> 
> 11.x hasn't exhibited the severe problems, unpatched, that 10.x was
> known to do on my production systems -- but it is far less than great in
> that it sure as heck does have UMA coherence issues.....
> 
> ARC Size:                               38.58%  8.61    GiB
>         Target Size: (Adaptive)         70.33%  15.70   GiB
>         Min Size (Hard Limit):          12.50%  2.79    GiB
>         Max Size (High Water):          8:1     22.32   GiB
> 
> I have 20GB out in kernel memory on this machine right now but only 8.6
> of it in ARC; the rest is (mostly) sitting in UMA allocated-but-unused
> -- so despite the belief expressed by some that the 11.x code is
> "better" at reaping UMA I'm sure not seeing it here.

I see.
In my case:

ARC Size:                               79.65%  98.48   GiB
        Target Size: (Adaptive)         79.60%  98.42   GiB
        Min Size (Hard Limit):          12.50%  15.46   GiB
        Max Size (High Water):          8:1     123.64  GiB

System Memory:

        2.27%   2.83    GiB Active,     9.58%   11.94   GiB Inact
        86.34%  107.62  GiB Wired,      0.00%   0 Cache
        1.80%   2.25    GiB Free,       0.00%   0 Gap

        Real Installed:                         128.00  GiB
        Real Available:                 99.96%  127.95  GiB
        Real Managed:                   97.41%  124.64  GiB

        Logical Total:                          128.00  GiB
        Logical Used:                   88.92%  113.81  GiB
        Logical Free:                   11.08%  14.19   GiB

Kernel Memory:                                  758.25  MiB
        Data:                           97.81%  741.61  MiB
        Text:                           2.19%   16.64   MiB

Kernel Memory Map:                              124.64  GiB
        Size:                           81.84%  102.01  GiB
        Free:                           18.16%  22.63   GiB

Mem: 2895M Active, 12G Inact, 108G Wired, 528K Buf, 2303M Free
ARC: 98G Total, 89G MFU, 9535M MRU, 35M Anon, 126M Header, 404M Other
Swap: 32G Total, 394M Used, 32G Free, 1% Inuse

Is this 12G Inact the 'UMA allocated-but-unused' memory?
It may also be freed-but-not-yet-reclaimed network bufs.

> I'll get around to rolling forward and modifying that PR since that
> particular bit of jackassery with UMA is a definite performance
> problem.  I suspect a big part of what you're seeing lies there as
> well.  When I do get that code done and tested I suspect it may solve
> your problems as well.

No. My problem is completely different: under memory pressure, after
arc_lowmem() sets needfree to non-zero, arc_reclaim_thread() starts to
shrink the ARC. But arc_reclaim_thread() (in the FreeBSD case) doesn't
correctly control this process, and the shrinking stops at a random
time (when, after the next iteration, arc_size <= arc_c), mostly after
dropping to the Min Size (Hard Limit).

I just restore control of the shrink process.
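For reference, the feedback loop I am describing looks roughly like
this (condensed from arc.c; not verbatim):

    /* While needfree != 0, arc_available_memory() reports a deficit
     * of exactly needfree pages, no matter how much memory has been
     * freed since arc_lowmem() ran. */
    static int64_t
    arc_available_memory(void)
    {
            int64_t lowest = INT64_MAX;

            if (needfree > 0)
                    lowest = PAGESIZE * (-(int64_t)needfree);
            /* ... other checks: zio arena, kmem map, ... */
            return (lowest);
    }

    /* In arc_reclaim_thread(): every pass cuts arc_c by another
     * arc_c/128 plus the reported deficit, and needfree is cleared
     * only once arc_size <= arc_c.  If eviction lags behind, arc_c
     * races down toward arc_c_min before that condition is met. */
    for (;;) {
            int64_t free_memory = arc_available_memory();

            if (free_memory < 0) {
                    arc_kmem_reap_now();
                    free_memory = arc_available_memory();
                    int64_t to_free =
                        (arc_c >> arc_shrink_shift) - free_memory;
                    if (to_free > 0)
                            arc_shrink(to_free);    /* lowers arc_c */
            }
            if (arc_size <= arc_c)
                    needfree = 0;   /* only way out of the deficit */
            /* ... sleep / wake waiters ... */
    }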

