ZFS ARC and mmap/page cache coherency question

Lionel Cons lionelcons1972 at gmail.com
Tue Jul 5 18:40:33 UTC 2016


So what Oracle did (based on work done by Sun for OpenSolaris) was to:
1. Modify ZFS to prevent *ANY* double/multi caching [the double caching
itself is considered a design defect]
2. Introduce a new VM subsystem which scales a lot better and provides
hooks for [1] so there are never two or more copies of the same data
in the system

Given that this was a huge, paid, multi-year effort, it's unlikely that
the design defects in open-source ZFS will ever go away.
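
For anyone who has not run into it directly: the problem shows up as soon
as the same file is touched through both mmap() and plain read()/write().
On FreeBSD the mapped pages are owned by the VM page cache while the
syscall path is served out of the ARC, so the same data ends up cached
twice and the two copies have to be kept in sync on every write.  A
minimal illustration of that access pattern (the path is made up, and the
code itself is plain POSIX, nothing ZFS-specific):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char *path = "/tank/scratch/testfile";   /* hypothetical path */
    struct stat st;
    char buf[16];
    char *map;
    int fd;

    if ((fd = open(path, O_RDWR)) == -1 || fstat(fd, &st) == -1) {
        perror(path);
        return (1);
    }
    if (st.st_size < (off_t)sizeof(buf)) {
        fprintf(stderr, "%s: too small for the demo\n", path);
        return (1);
    }

    /* 1. read(2): on ZFS this is served from the ARC. */
    if (pread(fd, buf, sizeof(buf), 0) == -1) {
        perror("pread");
        return (1);
    }

    /* 2. mmap(2): the same bytes now also live in the VM page cache. */
    map = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        return (1);
    }

    /* 3. Dirty the mapping; this copy has to reach the ARC-side path
     *    before a later read(2) may see it. */
    memcpy(map, "via mmap", 8);
    msync(map, st.st_size, MS_SYNC);

    /* 4. read(2) again: the filesystem must notice the cached pages and
     *    copy from them, or the two caches drift apart. */
    if (pread(fd, buf, sizeof(buf), 0) == -1) {
        perror("pread");
        return (1);
    }
    printf("read(2) after the mmap write sees: %.8s\n", buf);

    munmap(map, st.st_size);
    close(fd);
    return (0);
}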

Lionel

On 5 July 2016 at 19:50, Karl Denninger <karl at denninger.net> wrote:
>
> On 7/5/2016 12:19, Matthew Macy wrote:
>>
>>
>>  ---- On Mon, 04 Jul 2016 19:26:06 -0700 Karl Denninger <karl at denninger.net> wrote ----
>>  >
>>  >
>>  > On 7/4/2016 18:45, Matthew Macy wrote:
>>  > >
>>  > >
>>  > >  ---- On Sun, 03 Jul 2016 08:43:19 -0700 Karl Denninger <karl at denninger.net> wrote ----
>>  > >  >
>>  > >  > On 7/3/2016 02:45, Matthew Macy wrote:
>>  > >  > >
>>  > >  > > Cedric greatly overstates the intractability of resolving it. Nonetheless, since the initial import very little has been done to improve integration, and I don't know of anyone who is up to the task taking an interest in it. Consequently, mmap() performance is likely "doomed" for the foreseeable future.  -M
>>  > >  >
>>  > >  > Wellllll....
>>  > >  >
>>  > >  > I've done a fair bit of work here (see
>>  > >  > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
>>  > >  > political issues are at least as bad as the coding ones.
>>  > >  >
>>  > >
>>  > >
>>  > > Strictly speaking, the root of the problem is the ARC. Not ZFS per se. Have you ever tried disabling MFU caching to see how much worse LRU only is? I'm not really convinced the ARC's benefits justify its cost.
>>  > >
>>  > > -M
>>  > >
>>  >
>>  > The ARC is very useful when it gets a hit, as it avoids an I/O that
>>  > would otherwise take place.
>>  >
>>  > Where it sucks is when the system evicts working set to preserve ARC.
>>  > That's always wrong in that you're trading a speculative I/O (if the
>>  > cache is hit later) for a *guaranteed* one (to page out) and maybe *two*
>>  > (to page back in.)
>>
>> The question wasn't ARC vs. no caching. It was LRU-only vs. LRU + MFU. There are a lot of issues stemming from the fact that ZFS is a transactional object store with a POSIX FS on top. One is that it caches disk blocks as opposed to file blocks. If one could resolve that and have the page cache manage those blocks, life would be much, much better. However, you'd lose MFU. Hence my question.
>>
>> -M
>>
> I suspect there's an argument to be made there, but the present problems
> make determining the impact of that difficult or impossible, as those
> effects are swamped by the other issues.
>
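> That said, the intuition behind the LRU-vs-MFU question is easy to show
> with a toy model.  The sketch below is nothing like the real ARC (no
> ghost lists, no adaptive sizing, made-up block numbers); it just
> contrasts a single LRU list with a crude split recency/frequency cache
> when a small hot working set competes with one-pass scan traffic:
>
> #include <stdio.h>
> #include <string.h>
>
> #define SLOTS 8                 /* total cache size, in blocks */
>
> /* A plain LRU list: blk[0] is the most recently used entry. */
> struct lru {
>     int blk[SLOTS];
>     int n, cap;
> };
>
> /* Touch a block; returns 1 on a hit, 0 on a miss (evicting the LRU). */
> static int
> lru_access(struct lru *c, int blk)
> {
>     int i, hit = 0;
>
>     for (i = 0; i < c->n; i++)
>         if (c->blk[i] == blk) {
>             hit = 1;
>             break;
>         }
>     if (!hit && c->n < c->cap)
>         c->n++;
>     if (!hit)
>         i = c->n - 1;           /* victim slot: least recently used */
>     memmove(&c->blk[1], &c->blk[0], i * sizeof(int));
>     c->blk[0] = blk;
>     return (hit);
> }
>
> /* Crude recency/frequency split: a second touch promotes a block from
>  * the "recent" list to the "frequent" list, so one-pass scans only
>  * ever churn the recent half. */
> struct twolist {
>     struct lru recent, frequent;
> };
>
> static int
> twolist_access(struct twolist *c, int blk)
> {
>     int i;
>
>     for (i = 0; i < c->frequent.n; i++)
>         if (c->frequent.blk[i] == blk)
>             return (lru_access(&c->frequent, blk));
>     for (i = 0; i < c->recent.n; i++)
>         if (c->recent.blk[i] == blk) {
>             memmove(&c->recent.blk[i], &c->recent.blk[i + 1],
>                 (c->recent.n - i - 1) * sizeof(int));
>             c->recent.n--;
>             lru_access(&c->frequent, blk);      /* promote */
>             return (1);
>         }
>     return (lru_access(&c->recent, blk));       /* first touch: a miss */
> }
>
> int
> main(void)
> {
>     struct lru plain = { .n = 0, .cap = SLOTS };
>     struct twolist split = {
>         .recent   = { .n = 0, .cap = SLOTS / 2 },
>         .frequent = { .n = 0, .cap = SLOTS / 2 },
>     };
>     int hits_lru = 0, hits_2l = 0, total = 0;
>     int round, pass, i, scan = 100;
>
>     for (round = 0; round < 64; round++) {
>         for (pass = 0; pass < 2; pass++)        /* hot set, touched twice */
>             for (i = 0; i < 4; i++) {
>                 hits_lru += lru_access(&plain, i);
>                 hits_2l += twolist_access(&split, i);
>                 total++;
>             }
>         for (i = 0; i < SLOTS; i++) {           /* one-pass scan traffic */
>             hits_lru += lru_access(&plain, scan);
>             hits_2l += twolist_access(&split, scan);
>             scan++;
>             total++;
>         }
>     }
>     printf("%d accesses: LRU-only %d hits, recency+frequency %d hits\n",
>         total, hits_lru, hits_2l);
>     return (0);
> }
>
> On that trace the single list loses the hot set to the scan every round,
> while the split cache keeps it pinned in the frequent half -- which is
> roughly the benefit the MFU side of the ARC buys, and what you'd give up
> by letting a plain LRU page cache own the blocks.
>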
> I can fairly easily create workloads on the base code where simply
> typing "vi <some file>", making a change, and hitting ":w" will result in
> a stall of tens of seconds or more while the cache flush that gets
> requested is run down.  I've resolved a good part (but not all
> instances) of this through my work.
>
> My understanding is that 11- has had additional work done to the base
> code, but from what I can see in the commit logs and discussions, three
> underlying issues remain unaddressed:
>
> 1. The VM system will page out working set while leaving the ARC alone.
> 2. UMA reserved-but-not-in-use space is not policed adequately when
> memory pressure exists, *before* the pager starts considering evicting
> working set.
> 3. The write-back cache is grossly inappropriate for many machine
> configurations and cannot be tuned adequately by hand with the static
> knobs shown below (particularly on a system with vdevs that have
> materially-varying performance levels.)
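>
> For reference, those knobs are plain loader tunables along the lines
> below.  The sizes are made up and machine-specific, which is exactly
> the problem: none of them adapts to what the pager or UMA actually
> needs at any given moment, and no single static number suits a pool
> whose vdevs perform very differently.
>
> # /boot/loader.conf (example values only, in bytes)
> # hard cap on the ARC
> vfs.zfs.arc_max="8589934592"
> # floor below which the ARC will not shrink
> vfs.zfs.arc_min="1073741824"
> # cap on dirty (write-back) data; present on trees that carry the
> # newer OpenZFS write throttle
> vfs.zfs.dirty_data_max="536870912"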
>
> I have more-or-less stopped work on the tree on a forward basis, since I
> (1) got to a place with 10.2 that works for my production requirements,
> resolving the problems, and (2) ran into what I deemed to be intractable
> political issues within core regarding progress toward eradicating the
> root of the problem.
>
> I will probably revisit the situation with 11- at some point, as I'll
> want to roll my production systems forward.  However, I don't know when
> that will be -- right now 11- is stable enough for some of my embedded
> work (e.g. on the Raspberry Pi 2) but not on my server- and
> client-class machines.  Indeed, just yesterday I got a lock-order
> reversal panic while doing a shutdown after a kernel update on one of my
> lab boxes running a just-updated 11- codebase.
>
> --
> Karl Denninger
> karl at denninger.net
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/



-- 
Lionel

