kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix

Volodymyr Kostyrko c.kworr at gmail.com
Tue Mar 18 16:55:27 UTC 2014


18.03.2014 17:20, Andriy Gapon wrote:
>   Karl Denninger <karl at fs.denninger.net> wrote:
>   > ZFS can be convinced to engage in pathological behavior due to a bad
>   > low-memory test in arc.c
>   >
>   > The offending file is at
>   > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c; it allegedly
>   > checks for 25% free memory, and if it is less asks for the cache to shrink.
>   >
>   > (snippet from around line 2494 of arc.c in 10-STABLE; path
>   > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
>   >
>   > #else /* !sun */
>   > if (kmem_used() > (kmem_size() * 3) / 4)
>   > return (1);
>   > #endif /* sun */
>   >
>   > Unfortunately these two functions do not return what the authors thought
>   > they did. It's clear what they're trying to do from the Solaris-specific
>   > code up above this test.
>
>   No, these functions do return what the authors think they do.
>   The check is for KVA usage (kernel virtual address space), not for physical memory.
>
>   > The result is that the cache only shrinks when vm_paging_needed() tests
>   > true, but by that time the system is in serious memory trouble and by
>
>   No, it is not.
>   The description and numbers here are a little bit outdated but they should give
>   an idea of how paging works in general:
>   https://wiki.freebsd.org/AvgPageoutAlgorithm
>
>   > triggering only there it actually drives the system further into paging,
>
>   How does ARC eviction drive the system further into paging?
>
>   > because the pager will not recall pages from the swap until they are next
>   > executed. This leads the ARC to try to fill in all the available RAM even
>   > though pages have been pushed off onto swap. Not good.
>
>   Unused physical memory is a waste.  It is true that ARC tries to use as much
>   memory as it is allowed.  The same applies to the page cache (Active, Inactive).
>   Memory management is a dynamic system and there are a few competing agents.

I'd rather see it capped at 500M or 5% of memory. On a loaded server 
this wouldn't hurt performance, but it would provide a good window for 
the VM system to stay reasonable.
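
For illustration only, here is a tiny userland sketch of that target, 
assuming the intent is "the smaller of 500 MB or 5% of physical RAM"; 
that reading, and all the names below, are mine and not from Karl's patch:

/*
 * Sketch: compute a free-memory target as the smaller of 500 MB or
 * 5% of physical RAM.  Userland only; the exact rule is my guess.
 */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

int
main(void)
{
        uint64_t pagesize = (uint64_t)sysconf(_SC_PAGESIZE);
        uint64_t physmem  = (uint64_t)sysconf(_SC_PHYS_PAGES) * pagesize;
        uint64_t five_pct = physmem / 20;          /* 5% of RAM */
        uint64_t cap      = 500ULL << 20;          /* 500 MB */
        uint64_t target   = five_pct < cap ? five_pct : cap;

        printf("physical memory: %ju MB\n", (uintmax_t)(physmem >> 20));
        printf("free target:     %ju MB\n", (uintmax_t)(target >> 20));
        return (0);
}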

>   It is hard to correctly tune that system using a large hammer such as your
>   patch.  I believe that with your patch ARC will get shrunk to its minimum size
>   in due time.  Active + Inactive will grow to use the memory that you are denying
>   to ARC driving Free below a threshold, which will reduce ARC.  Repeated enough
>   times this will drive ARC to its minimum.

But which is worse: having program memory paged out to disk, or having 
some random data from the disk cached? Yes, I know there are situations 
where a large amount of inactive memory would hurt performance. But 
putting the file cache above inactive memory is bad too. I see no 
benefit in having a 4G ARC while 2G of inactive memory is swapped out, 
leaving inactive at 50M. Any Java service can hold on to a lot of 
memory that it only needs occasionally; most of that memory gets 
swapped out, so the process is slow, but at least we can browse the 
disk faster...

The only real solution for this is to give ARC pages and inactive pages 
even odds of being evicted.

>   Also, there are a few technical problems with the patch:
>   - you don't need to use sysctl interface in kernel, the values you need are
>   available directly, just take a look at e.g. implementation of vm_paging_needed()
>   - similarly, querying vfs.zfs.arc_freepage_percent_target value via
>   kernel_sysctlbyname is just bogus; you can use percent_target directly
>   - you don't need to sum various page counters to get a total count, there is
>   v_page_count
>
>   Lastly, can you try to test reverting your patch and instead setting
>   vm.lowmem_period=0 ?
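
For reference, here is my rough, untested reading of the in-kernel 
approach suggested in the notes above: read the vmmeter counters 
directly instead of going through kernel_sysctlbyname. On 10-STABLE the 
global vmmeter is still named cnt; arc_freepage_percent_target below is 
just a placeholder for the tunable from Karl's patch, which I have not 
reproduced here:

#include <sys/param.h>
#include <sys/vmmeter.h>

extern u_int arc_freepage_percent_target;  /* placeholder for the patch's tunable */

/*
 * Return non-zero when free + cached pages fall below the requested
 * percentage of all pages.  v_page_count already holds the total, so
 * there is no need to sum the individual counters.
 */
static int
arc_free_memory_low(void)
{
        u_int avail = cnt.v_free_count + cnt.v_cache_count;

        return ((uint64_t)cnt.v_page_count *
            arc_freepage_percent_target / 100 > avail);
}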

Actually, I already tried that patch and compared it to lowmem_period. 
The patch works much better despite actually being a crutch...

The whole thing comes down to two issues:

1. The kernel cannot rearrange memory when some process (like 
VirtualBox) needs to allocate a big chunk at once. Right now the 
kernel's only working option is to push inactive pages to swap, even 
when there is enough free memory to hold the whole allocation; there's 
no in-memory reordering. And as the ARC shrinks only when free memory 
is low, it completely ignores this condition and doesn't return a 
single page to the VM.

2. What the ARC takes can't be freed, because there's no simple 
opposite interface to get X blocks back from the ARC. It would be much 
better if the ARC were arranged so that the system could shrink it with 
a simple call, like the page cache. Without this we keep taking this 
route:

* the system needs space;
* the ARC starts shrinking;
* while the ARC shrinks, some memory is paged out to swap and becomes 
available;
* the memory freed by swapping is taken and the process starts working;
* the ARC finishes shrinking and starts to grow again because of disk 
activity.

As far as I understand, our VM system tries to maintain a predefined 
percentage of memory that is clean, or at least already written to 
swap, so it can be quickly reclaimed. So swapping wins, the ARC loses, 
and the swapped-out pages are never read back in unless explicitly 
required. This is because it's too late to evict anything from the ARC 
by the time we need memory.

If there were a way for the ARC to mark some pages as freely purgeable 
(probably with a callback to tell the ARC which pages were purged), I 
think this problem would be gone.
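
Purely as a sketch of what I mean (none of this exists anywhere, the 
names are invented), the interface could look roughly like this:

#include <sys/types.h>

/* Hypothetical: called by the VM after it has purged a marked range,
 * so the owner (the ARC) can drop its bookkeeping for those buffers. */
typedef void (*purge_notify_fn)(void *owner, void *addr, size_t len);

/* Hypothetical ARC side: mark a clean, rebuildable buffer as something
 * the VM may reclaim at any time, and take the mark back off before
 * the buffer is reused. */
int  vm_mark_purgeable(void *addr, size_t len, purge_notify_fn cb, void *owner);
void vm_unmark_purgeable(void *addr, size_t len);

/* Hypothetical VM side: reclaim up to `wanted` bytes of marked memory
 * from the pageout path before touching inactive anonymous pages;
 * returns the number of bytes actually recovered. */
size_t vm_reclaim_purgeable(size_t wanted);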

-- 
Sphinx of black quartz, judge my vow.


More information about the freebsd-fs mailing list