ZFS ARC and mmap/page cache coherency question

Karl Denninger karl at denninger.net
Sun Jul 3 15:50:58 UTC 2016


On 7/3/2016 02:45, Matthew Macy wrote:
> Cedric greatly overstates the intractability of resolving it. Nonetheless,
> since the initial import very little has been done to improve integration,
> and I don't know of anyone who is up to the task taking an interest in it.
> Consequently, mmap() performance is likely "doomed" for the foreseeable
> future. -M

Wellllll....

I've done a fair bit of work here (see
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
political issues are at least as bad as the coding ones.

In short, what Cedric says about the root of the issue is real.  The VM
system is very well implemented for what it handles, but the problem is
that while the UFS data cache is part of VM, and thus VM "knows" about
it, the ZFS ARC is not, because ZFS is a "bolt-on."  ZFS's use of UMA
for its allocations leads to further (severe) complications under
certain workloads.
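
To make the "outside of VM" point concrete, here's a tiny userland
sketch (mine, purely illustrative, not from the PR) that reads the ARC
size and a couple of VM page counters via sysctl.  The ARC bytes simply
do not appear in the VM page queues, which is exactly why VM can't
reason about them.  Sysctl names are as found on 10.x/11-era FreeBSD.

/*
 * Illustrative only: show that ARC memory is accounted separately from
 * the VM page queues.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
        uint64_t arc_size;
        u_int free_pages, inactive_pages;
        int pagesize;
        size_t len;

        len = sizeof(arc_size);
        if (sysctlbyname("kstat.zfs.misc.arcstats.size", &arc_size, &len,
            NULL, 0) == -1)
                arc_size = 0;           /* ZFS not loaded */

        len = sizeof(free_pages);
        sysctlbyname("vm.stats.vm.v_free_count", &free_pages, &len, NULL, 0);
        len = sizeof(inactive_pages);
        sysctlbyname("vm.stats.vm.v_inactive_count", &inactive_pages, &len,
            NULL, 0);
        len = sizeof(pagesize);
        sysctlbyname("hw.pagesize", &pagesize, &len, NULL, 0);

        /* The ARC bytes below are invisible to the VM page queues. */
        printf("ARC: %ju MB  free: %ju MB  inactive: %ju MB\n",
            (uintmax_t)arc_size >> 20,
            (uintmax_t)free_pages * pagesize >> 20,
            (uintmax_t)inactive_pages * pagesize >> 20);
        return (0);
}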

Finally, the underlying ZFS dmu_tx sizing code is just plain wrong, and
in fact this is one of the biggest issues: when the system runs into
trouble it can take a bad situation and make it a *lot* worse.  There is
only one write-back cache maintained instead of one per zvol, and that's
flat-out broken.  Being able to re-order async writes to disk (where
fsync() has not been called) and thereby minimize seek latency is
excellent.  Sadly, rotating media these days sabotages much of this due
to opacity introduced at the drive level (e.g. varying sector counts per
track, etc.), but it can still help.

Where things go dramatically wrong is on a system where a large
write-back cache is allocated relative to the underlying zvol I/O
performance (this occurs on moderately-large and bigger RAM systems)
with a moderate number of modest-performance rotating disks.  In that
case it is entirely possible for a flush of the write buffers to require
upwards of a *minute* to complete, during which all other writes block.
If this happens during a period of high RAM demand and you manage to
trigger a page-out at the same time, system performance goes straight
into the toilet.  I have seen instances where simply trying to edit a
text file with vi (or run a "select" against a database table) will hang
for upwards of a minute, leading you to believe the system has crashed
when in fact it has not.
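
A back-of-the-envelope illustration of why the stall gets that long
(the numbers below are made up for the example, not measurements from
my systems):

/*
 * If the write-back cache is sized off RAM rather than off what the
 * disks can actually sink, worst-case flush time is roughly the cache
 * size divided by sustained write throughput.
 */
#include <stdio.h>

int
main(void)
{
        double dirty_max_mb = 4096.0;   /* e.g. 4 GB of dirty data allowed */
        double pool_write_mbs = 80.0;   /* a few modest spinning mirrors   */

        /* ~51 seconds during which other writers stall behind the flush */
        printf("worst-case flush: ~%.0f seconds\n",
            dirty_max_mb / pool_write_mbs);
        return (0);
}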

The interaction of VM with the above can lead to severely pathological
behavior, because the VM system has no way to tell the ZFS subsystem to
pare back the ARC (and, at least as important, perhaps more so -- unused
but still-allocated UMA) when memory pressure exists, *before* it pages.
ZFS tries to detect memory pressure and do this itself, but it winds up
competing with the VM system.  This leads to demonstrably wrong behavior,
because you never want to hold disk cache in preference to RSS: if you
have a block of data from the disk, the best case is that you avoid one
I/O (to re-read it); if you page, you are *guaranteed* to take one I/O
(to write the paged-out RSS to disk) and *might* take two (if you then
must read it back in).

In short, trading the avoidance of one *possible* I/O for a *guaranteed*
I/O plus a second possible one is *always* a net loss.
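
Spelled out as a (deliberately trivial) sketch of that accounting -- the
helper names are mine, purely for illustration:

#include <stdio.h>

struct io_cost {
        int guaranteed;
        int possible;
};

static struct io_cost
evict_disk_cache(void)
{
        /* Worst case: one read to bring the block back from disk. */
        return ((struct io_cost){ .guaranteed = 0, .possible = 1 });
}

static struct io_cost
page_out_rss(void)
{
        /* One write to swap now, maybe one read to fault it back in. */
        return ((struct io_cost){ .guaranteed = 1, .possible = 1 });
}

int
main(void)
{
        struct io_cost cache = evict_disk_cache();
        struct io_cost page = page_out_rss();

        printf("evict disk cache: %d guaranteed + %d possible I/O\n",
            cache.guaranteed, cache.possible);
        printf("page out RSS:     %d guaranteed + %d possible I/O\n",
            page.guaranteed, page.possible);
        return (0);
}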

To "fix" all of this "correctly" (for all cases, instead of just certain
cases), VM would have to "know" about the ARC and its use of UMA, and be
able to police both.  ZFS would also have to size the dmu_tx write-back
cache per-zvol, with each cache's size chosen from the actual I/O
performance characteristics of the disks backing that zvol.  I've looked
into doing both and it's fairly complex; what's worse is that it would
effectively "marry" VM and ZFS, removing the "bolt-on" aspect of things.
That in turn leads to a lot of maintenance work over time, because any
time the ZFS code changes (and it does, quite a bit) you have to go back
through that process in order to stay coherent with Illumos.
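
For what it's worth, here is roughly the shape of the per-zvol sizing
idea -- a hypothetical userland sketch, with names, numbers and limits
of my own invention for illustration, not code from ZFS or from the PR:

/*
 * Cap each volume's dirty-data (write-back) limit so that a full flush
 * can complete within a target number of seconds at the throughput the
 * backing disks have actually demonstrated.
 */
#include <stdio.h>
#include <stdint.h>

#define MB              (1024ULL * 1024ULL)
#define FLUSH_TARGET_S  5               /* longest acceptable flush */
#define DIRTY_MIN       (64 * MB)       /* floor so small volumes still batch */
#define DIRTY_MAX       (4096 * MB)     /* global ceiling */

struct vol_stats {
        const char      *name;
        uint64_t         meas_write_bps;  /* measured sustained write rate */
};

static uint64_t
dirty_limit_for(const struct vol_stats *vs)
{
        uint64_t limit = vs->meas_write_bps * FLUSH_TARGET_S;

        if (limit < DIRTY_MIN)
                limit = DIRTY_MIN;
        if (limit > DIRTY_MAX)
                limit = DIRTY_MAX;
        return (limit);
}

int
main(void)
{
        struct vol_stats vols[] = {
                { "fast-vol", 400 * MB },       /* SSD mirror            */
                { "slow-vol",  80 * MB },       /* a few spinning mirrors */
        };

        for (size_t i = 0; i < sizeof(vols) / sizeof(vols[0]); i++)
                printf("%s: dirty-data cap ~%ju MB\n", vols[i].name,
                    (uintmax_t)(dirty_limit_for(&vols[i]) / MB));
        return (0);
}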

The PR above resolved (completely) the issues that I, along with a
number of other people, was having on 10.x and before (I've not yet
rolled it forward to 11), but it's quite clearly a hack of sorts, in
that it detects and treats symptoms (e.g. dynamic TX cache size
modification, etc.) rather than integrating VM and ZFS cache management.

-- 
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/