ZFS ARC and mmap/page cache coherency question

Karl Denninger karl at denninger.net
Tue Jul 5 19:09:14 UTC 2016


You'd get most of the way to what Oracle did, I suspect, if the system:

1. Dynamically resized the write cache on a per-vdev basis so as to
prevent a flush from stalling all write I/O for a material amount of
time (which can and *does* happen now)

2. Made VM aware of UMA "committed-but-free" space on an ongoing basis
and policed it on a sliding scale (that is, as RAM pressure rises, VM
considers it more important to reap UMA so that
marked-used-but-in-fact-free RAM does not accumulate while RAM is under
pressure.)

3. Bi-directionally hooked VM so that it initiates and cooperates with
ZFS on ARC size management.  Specifically, if ZFS decides the ARC is to
be reaped it must notify VM so that UMA can be reaped first if
necessary; then, if the ARC *still* needs to be reaped, that happens
*before* VM pages anything out.  If and only if the ARC is at its
minimum should the VM system evict working set to the pagefile.

#1 is entirely within ZFS but is fairly hard to do well, and neither
the Illumos nor the FreeBSD team has taken a serious crack at it.
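
To make #1 concrete, the feedback loop would look something like the
sketch below.  Every name in it is invented for illustration -- there is
no vdev_write_cache structure in ZFS -- but the idea is simply to cut
the per-vdev dirty ceiling when a flush ran long and to grow it again
when flushes finish comfortably early:

/*
 * Illustrative sketch only: none of these names exist in ZFS.
 * Shrink or grow a per-vdev dirty-data ceiling based on how long the
 * last flush to that vdev took, so one slow vdev cannot stall all
 * write I/O for a material amount of time.
 */
#include <stdint.h>

struct vdev_write_cache {
        uint64_t dirty_max;        /* per-vdev dirty ceiling, in bytes */
        uint64_t last_flush_ns;    /* duration of the most recent flush */
        uint64_t target_flush_ns;  /* longest stall we will tolerate */
};

static void
vdev_resize_write_cache(struct vdev_write_cache *vwc)
{
        if (vwc->last_flush_ns > vwc->target_flush_ns) {
                /* Flush ran long: cut the ceiling so the next one is shorter. */
                vwc->dirty_max -= vwc->dirty_max / 4;
        } else if (vwc->last_flush_ns < vwc->target_flush_ns / 2) {
                /* Flush finished early: let more dirty data accumulate. */
                vwc->dirty_max += vwc->dirty_max / 8;
        }
}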

#2 I've taken a fairly decent look at, but I have not implemented code
on the VM side to do it.  What I *have* done is implement code on the
ZFS side, within the ZFS paradigm; that's technically the wrong place,
but it works pretty well -- so long as the UMA fragmentation is coming
from ZFS.
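
As a rough illustration of what I mean by a sliding policy -- the two
helpers below are placeholders, not real FreeBSD or ZFS interfaces --
the check amounts to tolerating less idle UMA space as free RAM shrinks:

/*
 * Illustrative sketch only: uma_idle_bytes() and reap_uma_caches()
 * are placeholders for "how much RAM UMA holds but is not using" and
 * "drain the per-zone free buckets back to the page allocator".
 */
#include <stdbool.h>
#include <stdint.h>

extern uint64_t uma_idle_bytes(void);
extern uint64_t reap_uma_caches(void);

static bool
maybe_reap_uma(uint64_t free_bytes)
{
        /*
         * Tolerate idle UMA space equal to half of what is still free:
         * plenty of slack while RAM is plentiful, near-zero tolerance
         * as free memory approaches exhaustion.
         */
        uint64_t tolerated = free_bytes / 2;

        if (uma_idle_bytes() > tolerated) {
                (void)reap_uma_caches();
                return (true);
        }
        return (false);
}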

#3 is a bear, especially if you don't move that code into VM (which
intimately "marries" the ZFS and VM code; that's very bad from a
maintainability perspective.)  What I've implemented is somewhat of a
hack in that regard: ZFS triggers before VM does, getting aggressive
about reaping its own UMA areas and the write-back cache when there is
RAM pressure, and thus *most* of the time it avoids the paging
pathology while still allowing the ARC to use the truly-free RAM.  It
ought to be in the VM code, however, because the pressure sometimes
does not come from ZFS.
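
The ordering I'm arguing for in #3 -- again with placeholder names
rather than real kernel interfaces -- is essentially this:

/*
 * Illustrative sketch only: each function stands in for the
 * corresponding VM/ZFS action and returns the bytes it recovered.
 */
#include <stdint.h>

extern uint64_t reap_uma_caches(void);
extern uint64_t shrink_arc(uint64_t want);
extern uint64_t arc_size(void);
extern uint64_t arc_min(void);
extern void     page_out_working_set(uint64_t want);

static void
reclaim_memory(uint64_t shortfall)
{
        uint64_t freed;

        /* 1. Marked-used-but-free UMA space is the cheapest to recover. */
        freed = reap_uma_caches();
        if (freed >= shortfall)
                return;

        /* 2. Give back ARC, but never below its configured floor. */
        if (arc_size() > arc_min())
                freed += shrink_arc(shortfall - freed);
        if (freed >= shortfall)
                return;

        /*
         * 3. Only now trade a speculative future I/O (a lost cache hit)
         * for a guaranteed one (a pageout, and maybe a page-in later).
         */
        page_out_working_set(shortfall - freed);
}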

This is what one of my production machines looks like right now with
the patch in -- this system runs a quite-active Postgres database along
with a material number of other things at the same time, and it doesn't
look bad at all in terms of efficiency.

[karl@NewFS ~]$ zfs-stats -A

------------------------------------------------------------------------
ZFS Subsystem Report                            Tue Jul  5 14:05:06 2016
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                29.11m
        Recycle Misses:                         0
        Mutex Misses:                           67.14k
        Evict Skips:                            72.84m

ARC Size:                               72.10%  16.10   GiB
        Target Size: (Adaptive)         83.00%  18.53   GiB
        Min Size (Hard Limit):          12.50%  2.79    GiB
        Max Size (High Water):          8:1     22.33   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       81.84%  15.17   GiB
        Frequently Used Cache Size:     18.16%  3.37    GiB

ARC Hash Breakdown:
        Elements Max:                           1.84m
        Elements Current:               33.47%  614.39k
        Collisions:                             41.78m
        Chain Max:                              6
        Chains:                                 39.45k

------------------------------------------------------------------------

ARC Efficiency:                                 1.88b
        Cache Hit Ratio:                78.45%  1.48b
        Cache Miss Ratio:               21.55%  405.88m
        Actual Hit Ratio:               77.46%  1.46b

        Data Demand Efficiency:         77.97%  1.45b
        Data Prefetch Efficiency:       24.82%  9.07m

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             0.52%   7.62m
          Most Recently Used:           8.38%   123.87m
          Most Frequently Used:         90.36%  1.34b
          Most Recently Used Ghost:     0.18%   2.65m
          Most Frequently Used Ghost:   0.56%   8.30m

        CACHE HITS BY DATA TYPE:
          Demand Data:                  76.71%  1.13b
          Prefetch Data:                0.15%   2.25m
          Demand Metadata:              21.82%  322.33m
          Prefetch Metadata:            1.33%   19.58m

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  78.91%  320.29m
          Prefetch Data:                1.68%   6.82m
          Demand Metadata:              16.70%  67.79m
          Prefetch Metadata:            2.70%   10.97m

------------------------------------------------------------------------

The system currently has 20 GB wired, ~3 GB free and ~1 GB inactive,
with a tiny amount in the cache bucket (~46 MB).
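
For reference, the wired/free/inactive/cache figures above come from
the vm.stats.vm sysctl counters, which are reported in pages; a trivial
userland program like the one below prints roughly the same numbers
top(1) shows.  (v_cache_count is present on this 10.2 system; if a
release has retired the cache queue, that counter simply reads as zero
here.)

/*
 * Print the VM page-queue counters, scaled from pages to GiB.  A
 * counter that does not exist on the running release is reported
 * as zero.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>
#include <unistd.h>

static double
vmstat_gib(const char *name, long pgsz)
{
        unsigned int pages = 0;
        size_t len = sizeof(pages);

        (void)sysctlbyname(name, &pages, &len, NULL, 0);
        return ((double)pages * pgsz / (1024.0 * 1024.0 * 1024.0));
}

int
main(void)
{
        long pgsz = getpagesize();

        printf("wired:    %6.2f GiB\n",
            vmstat_gib("vm.stats.vm.v_wire_count", pgsz));
        printf("free:     %6.2f GiB\n",
            vmstat_gib("vm.stats.vm.v_free_count", pgsz));
        printf("inactive: %6.2f GiB\n",
            vmstat_gib("vm.stats.vm.v_inactive_count", pgsz));
        printf("cache:    %6.2f GiB\n",
            vmstat_gib("vm.stats.vm.v_cache_count", pgsz));
        return (0);
}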

On 7/5/2016 13:40, Lionel Cons wrote:
> So what Oracle did (based on work done by SUN for Opensolaris) was to:
> 1. Modify ZFS to prevent *ANY* double/multi caching [this is
> considered a design defect]
> 2. Introduce a new VM subsystem which scales a lot better and provides
> hooks for [1] so there are never two or more copies of the same data
> in the system
>
> Given that this was a huge, paid, multiyear effort, it's not likely
> that the design defects in open-source ZFS will ever go away.
>
> Lionel
>
> On 5 July 2016 at 19:50, Karl Denninger <karl at denninger.net> wrote:
>> On 7/5/2016 12:19, Matthew Macy wrote:
>>>
>>>  ---- On Mon, 04 Jul 2016 19:26:06 -0700 Karl Denninger <karl at denninger.net> wrote ----
>>>  >
>>>  >
>>>  > On 7/4/2016 18:45, Matthew Macy wrote:
>>>  > >
>>>  > >
>>>  > >  ---- On Sun, 03 Jul 2016 08:43:19 -0700 Karl Denninger <karl at denninger.net> wrote ----
>>>  > >  >
>>>  > >  > On 7/3/2016 02:45, Matthew Macy wrote:
>>>  > >  > >
>>>  > >  > >             Cedric greatly overstates the intractability of resolving it. Nonetheless, since the initial import very little has been done to improve integration, and I don't know of anyone who is up to the task taking an interest in it. Consequently, mmap() performance is likely "doomed" for the foreseeable future.-M----
>>>  > >  >
>>>  > >  > Wellllll....
>>>  > >  >
>>>  > >  > I've done a fair bit of work here (see
>>>  > >  > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
>>>  > >  > political issues are at least as bad as the coding ones.
>>>  > >  >
>>>  > >
>>>  > >
>>>  > > Strictly speaking, the root of the problem is the ARC. Not ZFS per se. Have you ever tried disabling MFU caching to see how much worse LRU only is? I'm not really convinced the ARC's benefits justify its cost.
>>>  > >
>>>  > > -M
>>>  > >
>>>  >
>>>  > The ARC is very useful when it gets a hit, as it avoids an I/O that would
>>>  > otherwise take place.
>>>  >
>>>  > Where it sucks is when the system evicts working set to preserve ARC.
>>>  > That's always wrong in that you're trading a speculative I/O (if the
>>>  > cache is hit later) for a *guaranteed* one (to page out) and maybe *two*
>>>  > (to page back in.)
>>>
>>> The question wasn't ARC vs. no-caching. It was LRU only vs LRU + MFU. There are a lot of issues stemming from the fact that ZFS is a transactional object store with a POSIX FS on top. One is that it caches disk blocks as opposed to file blocks. However, if one could resolve that and have the page cache manage these blocks life would be much much better. However, you'd lose MFU. Hence my question.
>>>
>>> -M
>>>
>> I suspect there's an argument to be made there but the present problems
>> make determining the impact of that difficult or impossible as those
>> effects are swamped by the other issues.
>>
>> I can fairly-easily create workloads on the base code where simply
>> typing "vi <some file>", making a change and hitting ":w" will result in
>> a stall of tens of seconds or more while the cache flush that gets
>> requested is run down.  I've resolved a good part (but not all
>> instances) of this through my work.
>>
>> My understanding is that 11- has had additional work done to the base
>> code, but three underlying issues are not, from what I can see in the
>> commit logs and discussions, addressed: the VM system will page out
>> working set while leaving ARC alone; UMA reserved-but-not-in-use space
>> is not policed adequately when memory pressure exists, *before* the
>> pager starts considering evicting working set; and the write-back cache
>> is for many machine configurations grossly inappropriate and cannot be
>> tuned adequately by hand (particularly on a system with vdevs that have
>> materially-varying performance levels.)
>>
>> I have more-or-less stopped work on the tree on a forward basis since I
>> got to a place with 10.2 that (1) works for my production requirements,
>> resolving the problems and (2) ran into what I deemed to be intractable
>> political issues within core on progress toward eradicating the root of
>> the problem.
>>
>> I will probably revisit the situation with 11- at some point, as I'll
>> want to roll my production systems forward.  However, I don't know when
>> that will be -- right now 11- is stable enough for some of my embedded
>> work (e.g. on the Raspberry Pi2) but is not on my server and
>> client-class machines.  Indeed just yesterday I got a lock-order
>> reversal panic while doing a shutdown after a kernel update on one of my
>> lab boxes running a just-updated 11- codebase.
>>
>> --
>> Karl Denninger
>> karl at denninger.net
>> /The Market Ticker/
>> /[S/MIME encrypted email preferred]/
>
>

-- 
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/