ZFS ARC under memory pressure
Slawa Olhovchenkov
slw at zxy.spb.ru
Fri Aug 19 21:34:50 UTC 2016
On Fri, Aug 19, 2016 at 03:38:55PM -0500, Karl Denninger wrote:
> On 8/19/2016 15:18, Slawa Olhovchenkov wrote:
> > On Thu, Aug 18, 2016 at 03:31:26PM -0500, Karl Denninger wrote:
> >
> >> On 8/18/2016 15:26, Slawa Olhovchenkov wrote:
> >>> On Thu, Aug 18, 2016 at 11:00:28PM +0300, Andriy Gapon wrote:
> >>>
> >>>> On 16/08/2016 22:34, Slawa Olhovchenkov wrote:
> >>>>> I see issues with the ZFS ARC under memory pressure.
> >>>>> The ZFS ARC size can be dramatically reduced, all the way down to arc_min.
> >>>>>
> >>>>> As I see it, a memory-pressure event causes a call to arc_lowmem(), which sets needfree:
> >>>>>
> >>>>> arc.c:arc_lowmem
> >>>>>
> >>>>> needfree = btoc(arc_c >> arc_shrink_shift);
> >>>>>
> >>>>> After this, arc_available_memory() returns negative values (PAGESIZE *
> >>>>> (-needfree)) until needfree is zero, independent of how much memory
> >>>>> has been freed. needfree is set to 0 in arc_reclaim_thread() only when
> >>>>> arc_size <= arc_c, and arc_c is decreased at every loop iteration
> >>>>> until arc_size drops below it.
> >>>>>
> >>>>> arc_c can drop all the way to its minimum value if arc_size drops fast enough.
> >>>>>
> >>>>> There is currently no control tying the shrinking to the initially requested amount of memory.
> >>>>>
> >>>>> As a result, I can see needless ARC reclaim, from 10x to 100x the requested amount.
> >>>>>
> >>>>> Can someone check my analysis and comment on it?
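To illustrate, a compilable sketch of the two functions involved; the
names follow sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, but
the bodies and the stubbed globals are my simplified paraphrase, not
the verbatim source:

#include <stdint.h>

#define PAGESIZE 4096
#define btoc(x)  (((uint64_t)(x) + PAGESIZE - 1) / PAGESIZE)

static uint64_t needfree;            /* pages the VM asked the ARC to free */
static uint64_t arc_c = 1ULL << 37;  /* ARC target size (placeholder) */
static int arc_shrink_shift = 7;     /* default: shrink by 1/128 */

/* Handler for the VM low-memory event. */
static void
arc_lowmem(void)
{
        needfree = btoc(arc_c >> arc_shrink_shift);
        /* ...then signal arc_reclaim_thread... */
}

/*
 * While needfree is non-zero this stays negative no matter how much
 * memory has actually been freed since the event, so the reclaim
 * loop keeps shrinking the ARC.
 */
static int64_t
arc_available_memory(void)
{
        if (needfree > 0)
                return (-(int64_t)(PAGESIZE * needfree));
        /* ...otherwise the minimum of several other heuristics... */
        return (0);
}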
> >>>> You might have found a real problem here, but I am short of time right now to
> >>>> properly analyze the issue. I think that on illumos 'needfree' is a variable
> >>>> that's managed by the virtual memory system and it is akin to our
> >>>> vm_pageout_deficit. But during the porting it became an artificial value and
> >>>> its handling might be sub-optimal.
> >>> As I see it, totally not optimal.
> >>> I have created a patch for this sub-optimal handling and am now testing it.
> >> You might want to look at the code contained in here:
> >>
> >> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594
> > In my case the arc.c issue is caused by revision r286625 in HEAD (and
> > r288562 in STABLE) -- both from 2015, not touched in 2014.
> >
> >> There are some ugly interactions with the VM system you can run into if
> >> you're not careful; I've chased this issue before, and while I haven't
> >> yet done the work to integrate it into 11.x (and the underlying code
> >> *has* changed since the 10.x patches I developed), if you wind up driving
> >> the VM system to evict pages to swap rather than pare back the ARC you're
> >> probably making the wrong choice.
> >>
> >> In addition UMA can come into the picture too and (at least previously)
> >> was a severe contributor to pathological behavior.
> > I only make the shrinking of the ARC size less aggressive (and more
> > controlled). Right now the ARC just collapses.
> >
> > The PR you pointed to is really BIG; I can't read and understand all of it.
> > r286625 changed the behavior of the interaction between the ARC and the VM.
> > Does your problem still exist? Can you explain (on the list)?
> >
>
> Essentially ZFS is a "bolt-on": unlike UFS, it does not use the
> unified buffer cache (which the VM system manages). The ARC is
> allocated out of kernel memory and (by default) also uses UMA; the VM
> system is not involved in its management.
>
> When the VM system gets constrained (low memory) it thus cannot tell the
> ARC to pare back. So when the VM system gets low on RAM it will start
Currently the VM generates an event and the ARC listens for it,
handling it in arc.c:arc_lowmem().
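The hook-up is via the eventhandler(9) mechanism; roughly like this (a
paraphrase of arc_init(), not a verbatim quote):

	static eventhandler_tag arc_event_lowmem;
	...
	/* In arc_init(): subscribe to the VM low-memory event, so the
	 * VM can poke the ARC even though it does not manage it. */
	arc_event_lowmem = EVENTHANDLER_REGISTER(vm_lowmem, arc_lowmem,
	    NULL, EVENTHANDLER_PRI_FIRST);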
> to page. The problem with this is that if the VM system is low on RAM
> because the ARC is consuming memory you do NOT want to page, you want to
> evict some of the ARC.
Now, on the `lowmem` event, the ARC tries to evict 1/128 of its target size.
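The 1/128 comes from the default arc_shrink_shift of 7; worked out with
a made-up target size:

	needfree = btoc(arc_c >> arc_shrink_shift);
	/* e.g. arc_c = 16 GiB:  16 GiB >> 7 = 128 MiB,
	 * and btoc(128 MiB) = 32768 pages of 4 KiB */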
> Unfortunately the VM system has another interaction that causes trouble
> too. The VM system will "demote" a page to inactive or cache status but
> not actually free it. It only starts to go through those pages and free
> them when the VM system wakes up, and that only happens when free space
> gets low enough to trigger it.
> Finally, there's another problem that comes into play: UMA. Kernel
> memory allocation is fairly expensive. UMA grabs memory from the kernel
> allocation system in big chunks and manages it, and by doing so gains a
> pretty-significant performance boost. But this means that you can have
> large amounts of RAM that are allocated, not in use, and yet the VM
> system cannot reclaim them on its own. The ZFS code has to reap those
> caches, but reaping them is a moderately expensive operation too, thus
> you don't want to do it unnecessarily.
Not sure, but some code in ZFS may handle this:
arc.c:arc_kmem_reap_now().
Not sure.
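As far as I can see, arc_kmem_reap_now() asks the allocator to drain
the zio buffer caches, among others; a rough sketch from memory, not
the verbatim source:

static void
arc_kmem_reap_now(void)
{
	size_t i;

	/* Return cached-but-unused slabs of each power-of-two zio
	 * buffer cache back to the VM. */
	for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
		kmem_cache_reap_now(zio_buf_cache[i]);
		kmem_cache_reap_now(zio_data_buf_cache[i]);
	}
	/* ...plus the header, buf and other caches... */
}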
> I've not yet gone through the 11.x code to see what changed from 10.x;
> what I do know is that it is materially better-behaved than it used to
> be. Prior to 11.x I would have (by now) pretty much been forced
> into rolling that patch forward and testing it, because the misbehavior
> in one of my production systems was severe enough to render it basically
> unusable without the patch in that PR inline, with the most serious
> misbehavior being paging-induced stalls that could reach tens of seconds
> or more in duration.
>
> 11.x hasn't exhibited the severe problems, unpatched, that 10.x was
> known to do on my production systems -- but it is far less than great in
> that it sure as heck does have UMA coherence issues.....
>
> ARC Size: 38.58% 8.61 GiB
> Target Size: (Adaptive) 70.33% 15.70 GiB
> Min Size (Hard Limit): 12.50% 2.79 GiB
> Max Size (High Water): 8:1 22.32 GiB
>
> I have 20GB out in kernel memory on this machine right now but only 8.6
> of it in ARC; the rest is (mostly) sitting in UMA allocated-but-unused
> -- so despite the belief expressed by some that the 11.x code is
> "better" at reaping UMA I'm sure not seeing it here.
I see.
In my case:
ARC Size: 79.65% 98.48 GiB
Target Size: (Adaptive) 79.60% 98.42 GiB
Min Size (Hard Limit): 12.50% 15.46 GiB
Max Size (High Water): 8:1 123.64 GiB
System Memory:
2.27% 2.83 GiB Active, 9.58% 11.94 GiB Inact
86.34% 107.62 GiB Wired, 0.00% 0 Cache
1.80% 2.25 GiB Free, 0.00% 0 Gap
Real Installed: 128.00 GiB
Real Available: 99.96% 127.95 GiB
Real Managed: 97.41% 124.64 GiB
Logical Total: 128.00 GiB
Logical Used: 88.92% 113.81 GiB
Logical Free: 11.08% 14.19 GiB
Kernel Memory: 758.25 MiB
Data: 97.81% 741.61 MiB
Text: 2.19% 16.64 MiB
Kernel Memory Map: 124.64 GiB
Size: 81.84% 102.01 GiB
Free: 18.16% 22.63 GiB
Mem: 2895M Active, 12G Inact, 108G Wired, 528K Buf, 2303M Free
ARC: 98G Total, 89G MFU, 9535M MRU, 35M Anon, 126M Header, 404M Other
Swap: 32G Total, 394M Used, 32G Free, 1% Inuse
Is this 12G Inact the 'UMA allocated-but-unused' you describe?
It may also be freed-but-not-reclaimed network bufs.
> I'll get around to rolling forward and modifying that PR since that
> particular bit of jackassery with UMA is a definite performance
> problem. I suspect a big part of what you're seeing lies there as
> well. When I do get that code done and tested I suspect it may solve
> your problems as well.
No. My problem is completely different: under memory pressure, after arc_lowmem()
sets needfree to non-zero, arc_reclaim_thread() starts to shrink the ARC. But
arc_reclaim_thread() (in the FreeBSD case) doesn't correctly control this
process, and the shrinking stops at a quasi-random time (when, after the next
iteration, arc_size <= arc_c), mostly after dropping to Min Size (Hard Limit).
My patch just restores control over the shrink process, as modeled below.
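Here is a toy userland model of the loop, with made-up sizes and an
assumed fixed eviction rate per pass; it mimics the structure of
arc_reclaim_thread(), not the kernel code itself. With eviction lagging
behind the falling target, arc_c races down to arc_c_min and arc_size
then has to follow it all the way:

#include <stdint.h>
#include <stdio.h>

#define GiB (1ULL << 30)

int
main(void)
{
	uint64_t arc_c_min = 16 * GiB;
	uint64_t arc_c = 124 * GiB;        /* target size */
	uint64_t arc_size = 124 * GiB;     /* bytes held by the ARC */
	uint64_t needfree = arc_c >> 7;    /* request (bytes here; the
	                                    * kernel tracks pages) */
	uint64_t evict_per_pass = 1 * GiB; /* assumed eviction rate */
	int pass = 0;

	while (needfree > 0) {
		/* arc_available_memory() < 0, so cut the target by
		 * 1/128 plus the still-outstanding needfree... */
		uint64_t to_free = (arc_c >> 7) + needfree;
		arc_c = (arc_c > arc_c_min + to_free) ?
		    arc_c - to_free : arc_c_min;
		/* ...while eviction lags behind the falling target. */
		if (arc_size > arc_c) {
			uint64_t gap = arc_size - arc_c;
			arc_size -= (gap < evict_per_pass) ?
			    gap : evict_per_pass;
		}
		pass++;
		/* needfree clears only once arc_size catches arc_c,
		 * no matter how much memory was freed on the way. */
		if (arc_size <= arc_c)
			needfree = 0;
	}
	printf("stopped after %d passes: arc_c = %ju GiB, "
	    "arc_size = %ju GiB (min = %ju GiB)\n", pass,
	    (uintmax_t)(arc_c / GiB), (uintmax_t)(arc_size / GiB),
	    (uintmax_t)(arc_c_min / GiB));
	return (0);
}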