kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix

Karl Denninger karl at denninger.net
Wed Mar 19 13:07:02 UTC 2014


On 3/19/2014 7:51 AM, Adrian Gschwend wrote:
> On 18.03.14 18:45, Andriy Gapon wrote:
>
>>> This is consistent with what I and others have observed on both 9.2
>>> and 10.0; the ARC will expand until it hits the configured maximum,
>>> even at the expense of forcing pages out to swap. In this specific
>>> machine's case, left at the defaults it will grab nearly all
>>> physical memory (over 20 GB of 24) and wire it down.
>> Well, this does not match my experience from before 10.x times.
> I reported the issue on which Karl gave feedback and developed the
> patch. The original thread of my report started here:
>
> http://lists.freebsd.org/pipermail/freebsd-fs/2014-March/019043.html
>
> Note that I don't have big memory eaters like VMs; it's just a bunch of
> jails and the services running in them, including some JVMs.
>
> Check out the munin graphs before and after:
>
> Daily, which does not seem to grow much anymore:
> http://ktk.netlabs.org/misc/munin-mem-zfs1.png
>
> Weekly:
> http://ktk.netlabs.org/misc/munin-mem-zfs2.png
>
> You can actually see where I activated the patch (16.3); the system
> has behaved *much* better since then. I did one more reboot, which is
> why the graph dips again, but I have not rebooted since.
>
> The moments where munin did not report anything are when the system
> was in the ARC/swap lock-up and virtually dead. Working on the system
> now, it feels like a new machine; everything is super fast and snappy.
>
> I don't understand much of the discussion you guys are having, but I'm
> pretty sure Karl fixed an issue which gave me headaches on BSD for
> years. I first saw it in 8.x when I started to use ZFS in production,
> and I've seen it in every 9.x release as well, up to this patch.
>
> regards
>
> Adrian
>
I have a newer version of this patch that responds to the criticisms 
raised on GNATS; it is being tested now.

The salient difference is that it now does two things differently:

1. It takes the VM system's "first level" free-page warning threshold 
(vm.v_free_target), deducts 20% from it, and uses the result as the 
low-RAM warning level.

2. It also allows a free-memory reservation to be set, expressed as a 
percentage, which acts as an "additional" reservation on top of the 
low-RAM warning level.

Both are exposed via sysctl and thus can be tuned during runtime.
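
To make the arithmetic concrete, here is a rough standalone illustration
of what items 1 and 2 describe -- this is NOT the patch itself, just my
own example, and the 5% "additional" reservation below is an arbitrary
number -- which reads the relevant VM thresholds via sysctl and computes
a warning level the same way:

/*
 * Illustration only -- not the patch.  Compute a low-RAM warning level
 * as described above: vm.v_free_target less 20%, plus an optional
 * percentage-of-RAM reservation (the 5% here is an arbitrary example).
 *
 * Build on FreeBSD with: cc -o arcthresh arcthresh.c
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdint.h>
#include <stdio.h>

static u_int
read_uint_sysctl(const char *name)
{
    u_int val;
    size_t len = sizeof(val);

    if (sysctlbyname(name, &val, &len, NULL, 0) != 0)
        err(1, "sysctlbyname(%s)", name);
    return (val);
}

int
main(void)
{
    u_int free_target = read_uint_sysctl("vm.v_free_target");
    u_int free_min    = read_uint_sysctl("vm.v_free_min");
    u_int page_count  = read_uint_sysctl("vm.stats.vm.v_page_count");
    u_int reserve_pct = 5;      /* example "additional" reservation */
    u_int warn_level;

    /* Item 1: the first-level warning less 20%. */
    warn_level = free_target - free_target / 5;

    /* Item 2: add the percentage reservation on top. */
    warn_level += (u_int)(((uint64_t)page_count * reserve_pct) / 100);

    printf("vm.v_free_target:       %u pages\n", free_target);
    printf("vm.v_free_min:          %u pages\n", free_min);
    printf("computed warning level: %u pages\n", warn_level);
    return (0);
}

The sysctl names the patch actually adds may differ; the above is only
meant to show where the computed level sits relative to vm.v_free_min.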

The reason for the change is a legitimate criticism of the earlier 
version: the pager may allow inactive pages to grow without bound if 
the system never reaches the VM system's first warning level on free 
pages; that is, the pager is never called upon to perform page 
stealing.  "Never" seems like a bad decision (shouldn't things get 
cleaned up eventually anyway?), but it is what it is, and the VM system 
has proved over time to be stable and fast.  For mixed workloads I can 
see where there could be trouble there, in that the ARC cache could be 
convinced to evict unnecessarily.  Unbounded inactive-page growth 
doesn't happen on my systems here, but since it might, and since it 
appears reasonably easy to defend against without causing other bad 
side effects, it seemed worth eliminating as a potential problem.

So instead I try to be more intelligent about choosing the ARC eviction 
level: I want it in the zone where the system will steal pages back, 
but I *do not*, under any circumstances, want to allow vm.v_free_min to 
be invaded, because that is where processes asking for memory get 
**SUSPENDED** (that is, where stalls start to happen).
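
Put another way, the zones I am aiming for look roughly like this (again
just an illustration of the idea, not code from the patch; all counts
are in pages):

/*
 * Illustration of the target zones described above, not the patch code.
 */
enum mem_zone {
    ZONE_OK,        /* plenty free: leave the ARC alone */
    ZONE_PARE_ARC,  /* below the warning level: evict from the ARC */
    ZONE_STALL      /* v_free_min invaded: processes get suspended */
};

static enum mem_zone
classify_free_pages(unsigned int free_count, unsigned int warn_level,
    unsigned int free_min)
{
    if (free_count <= free_min)
        return (ZONE_STALL);    /* the case that must never happen */
    if (free_count < warn_level)
        return (ZONE_PARE_ARC); /* page-stealing zone: pare the cache */
    return (ZONE_OK);
}

The point is simply that the eviction level lands between those two
thresholds rather than at (or below) vm.v_free_min.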

Since the knobs are exposed you can get the behavior you have now if you 
want it, or you can leave them alone and let the code choose what it 
thinks are intelligent values.  If you diddle the knobs and don't like 
the result, you can reset the percentage reservation to zero along with 
freepages, and the system will pick up the defaults again for you in 
real time, without rebooting.
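
The "reset to zero" behavior is just the usual zero-means-automatic
convention; roughly (my sketch, not the code from the patch):

/*
 * Sketch of the zero-means-automatic convention, not the patch code:
 * if the operator sets a knob to 0, fall back to the value the code
 * computes for itself, so the defaults return at runtime, no reboot.
 */
static unsigned int
effective_reserve_pct(unsigned int tunable_pct, unsigned int computed_pct)
{
    return (tunable_pct != 0 ? tunable_pct : computed_pct);
}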

Also, and very importantly, with the knobs exposed I can now trivially 
provoke an INTENTIONAL stall: set the reservation down far enough (which 
effectively reverts to paring the cache only when paging_needed is set, 
as the as-shipped arc.c does), then simply copy a file big enough to 
fill the cache to /dev/null, and bang -- an INSTANT 15-second stall.  
Turn the reservation back up so the ARC cache is not allowed to drive 
the system into hard paging, and the problem disappears.

I'm going to let it run through the day today before sending it up; it 
ran overnight without problems and looks good, but I want to take it 
through a heavy-load period before publishing it.

I note that there are complaints on the lists about this behavior going 
back to at least 2010...

-- 
-- Karl
karl at denninger.net

