ZFS ARC and mmap/page cache coherency question

Paul Koch paul.koch137 at gmail.com
Fri Jul 1 01:32:53 UTC 2016


Hi Andrew, further info below...

> Heya Paul,
> 
> How is your ZFS configured ( zfs get all tank0 )?
> 
> These certainly aren't absolute, law, or perfect - but if you haven't yet,
> I suggest you take a peek at the following:
> 
> * http://open-zfs.org/wiki/Performance_tuning
> * https://www.joyent.com/blog/bruning-questions-zfs-record-size
> * http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
> 
> On Wed, Jun 29, 2016 at 9:06 PM, Paul Koch <paul.koch137 at gmail.com> wrote:
>
> >
> > Posted this to -stable on the 15th June, but no feedback...
> >
> > We are trying to understand a performance issue when syncing large mmap'ed
> > files on ZFS.
> >
> > Example test box setup:
> >  FreeBSD 10.3-p5
> >  Intel i7-5820K 3.30GHz with 64G RAM
> >  6 * 2 Tbyte Seagate ST2000DM001-1ER164 in a ZFS stripe
> >
> > Read performance of a sequentially written large file on the pool is
> > typically around 950Mbytes/sec using dd.
> >
> > Our software mmap's some large database files using MAP_NOSYNC, and
> > we call fsync() every 10 minutes when we know the file system is
> > mostly idle.  In our test setup, the database files are 1.1G, 2G,
> > 1.4G, 12G, 4.7G and ~20 small files (under 10M).  All of the memory
> > pages in the mmap'ed files are updated every minute with new values,
> > so the entire mmap'ed file needs to be synced to disk, not just
> > fragments.
> >
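Since the details matter here, this is roughly the update/flush pattern,
boiled down (the path, the page-touch loop and the timing are
illustrative only, not our real code):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <err.h>

int
main(void)
{
	const char *path = "/akips/db/example.dat";	/* hypothetical */
	struct stat st;
	uint8_t *db;
	int fd, pass;

	if ((fd = open(path, O_RDWR)) == -1)
		err(1, "open");
	if (fstat(fd, &st) == -1)
		err(1, "fstat");

	/*
	 * MAP_NOSYNC: the syncer won't flush dirty pages behind our
	 * back; the data only goes to disk when we fsync() ourselves.
	 */
	db = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
	    MAP_SHARED | MAP_NOSYNC, fd, 0);
	if (db == MAP_FAILED)
		err(1, "mmap");

	for (pass = 1;; pass++) {
		off_t off;

		/* Every minute, touch every page with new values. */
		for (off = 0; off < st.st_size; off += 4096)
			db[off]++;	/* stand-in for real updates */

		/* Every 10th pass (~10 minutes), flush the lot. */
		if (pass % 10 == 0)
			fsync(fd);
		sleep(60);
	}
}
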
> > When the 10 minute fsync() occurs, gstat typically shows very little
> > disk read activity and very high write speeds, which is what we
> > expect.  But every 80 minutes we process the data in the large
> > mmap'ed files and store it in highly compressed blocks of a ~300G
> > file using pread/pwrite (i.e. not mmap'ed).  After that, the
> > performance of the next fsync() of the mmap'ed files falls off a
> > cliff.  We are assuming it is because the ARC has thrown away the
> > cached data of the mmap'ed files.  gstat shows lots of read/write
> > contention, and lots of things tend to stall waiting for disk.
> >
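The 80 minute processing pass has roughly this shape; the block size
and compress_block() here are placeholders for our real encoder:

#include <sys/types.h>
#include <unistd.h>
#include <err.h>

/* Hypothetical encoder; stands in for our real compressor. */
size_t compress_block(char *dst, const char *src, size_t len);

static void
process_pass(int db_fd, off_t db_size, int out_fd)
{
	static char raw[1 << 20], packed[2 << 20];
	off_t off, out_off = 0;
	ssize_t n;
	size_t m;

	/* Stream the whole database file and append compressed blocks. */
	for (off = 0; off < db_size; off += (off_t)sizeof(raw)) {
		if ((n = pread(db_fd, raw, sizeof(raw), off)) <= 0)
			break;
		m = compress_block(packed, raw, (size_t)n);
		if (pwrite(out_fd, packed, m, out_off) != (ssize_t)m)
			err(1, "pwrite");
		out_off += (off_t)m;
	}
}
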
> > Is this just a lack of ZFS ARC and page cache coherency ??
> >
> > Is there a way to prime the ARC with the mmap'ed files again before we
> > call fsync() ?
> >
> > We've tried cat and read() on the mmap'ed files, but that doesn't
> > seem to touch the disk at all and the fsync() performance is still
> > poor, so it looks like the ARC is not being filled.  msync() doesn't
> > seem to be much different.  mincore() stats show the mmap'ed data is
> > entirely incore and referenced.
> >
> >         Paul.
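
For reference, the mincore() check mentioned above is essentially the
following (boiled down, with most of the error handling trimmed):

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void
residency(const void *db, size_t len)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t i, npages = (len + psz - 1) / psz;
	size_t incore = 0, ref = 0, mod = 0;
	char *vec;

	if ((vec = malloc(npages)) == NULL)
		return;
	if (mincore(db, len, vec) == -1) {
		free(vec);
		return;
	}
	for (i = 0; i < npages; i++) {
		if (vec[i] & MINCORE_INCORE)
			incore++;
		if (vec[i] & MINCORE_REFERENCED)
			ref++;
		if (vec[i] & MINCORE_MODIFIED)
			mod++;
	}
	printf("%zu/%zu pages in core, %zu referenced, %zu modified\n",
	    incore, npages, ref, mod);
	free(vec);
}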


Here is our ZFS configuration:

zfs get all akips
NAME   PROPERTY              VALUE                  SOURCE
akips  type                  filesystem             -
akips  creation              Sat Apr  9  7:29 2016  -
akips  used                  835G                   -
akips  available             9.70T                  -
akips  referenced            96K                    -
akips  compressratio         1.00x                  -
akips  mounted               no                     -
akips  quota                 none                   default
akips  reservation           none                   default
akips  recordsize            128K                   default
akips  mountpoint            none                   local
akips  sharenfs              off                    default
akips  checksum              on                     default
akips  compression           off                    default
akips  atime                 off                    local
akips  devices               on                     default
akips  exec                  on                     default
akips  setuid                on                     default
akips  readonly              off                    default
akips  jailed                off                    default
akips  snapdir               hidden                 default
akips  aclmode               discard                default
akips  aclinherit            restricted             default
akips  canmount              on                     default
akips  xattr                 on                     default
akips  copies                1                      default
akips  version               5                      -
akips  utf8only              off                    -
akips  normalization         none                   -
akips  casesensitivity       sensitive              -
akips  vscan                 off                    default
akips  nbmand                off                    default
akips  sharesmb              off                    default
akips  refquota              none                   default
akips  refreservation        none                   default
akips  primarycache          all                    default
akips  secondarycache        all                    default
akips  usedbysnapshots       0                      -
akips  usedbydataset         96K                    -
akips  usedbychildren        835G                   -
akips  usedbyrefreservation  0                      -
akips  logbias               latency                default
akips  dedup                 off                    default
akips  mlslabel                                     -
akips  sync                  standard               default
akips  refcompressratio      1.00x                  -
akips  written               96K                    -
akips  logicalused           834G                   -
akips  logicalreferenced     9.50K                  -
akips  volmode               default                default
akips  filesystem_limit      none                   default
akips  snapshot_limit        none                   default
akips  filesystem_count      none                   default
akips  snapshot_count        none                   default
akips  redundant_metadata    all                    default


The problem appears to be similar to what is described here:
 http://zfs-discuss.opensolaris.narkive.com/tgP1NV9l/remedies-for-suboptimal-mmap-performance-on-zfs

So basically our problem is: we mmap a large file, which gets double
cached in both the ZFS ARC and the page cache.  We update every page
in the mmap'ed data, then flush it out every 10 minutes when we know
the disks are mostly idle.  Performance is great... unless the ZFS ARC
no longer holds the double-cached mmap'ed data, in which case the
fsync() has to go back to disk, causing heaps of read/write contention,
and performance falls off a cliff.

What we want to know is: if there is no ARC/page cache coherency, how
do we prime the ARC with the same data again, so that the next fsync()
of the mmap'ed files gets good write performance?
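
For completeness, the re-priming we have tried boils down to reading
each file back through read(2) and discarding the data, in the hope
that it repopulates the ARC:

#include <fcntl.h>
#include <unistd.h>

static void
prime(const char *path)
{
	static char buf[1 << 20];
	int fd;

	if ((fd = open(path, O_RDONLY)) == -1)
		return;
	while (read(fd, buf, sizeof(buf)) > 0)
		;	/* discard; we only want the caching side effect */
	close(fd);
}

As noted above, gstat shows those reads never touch the disks, so they
appear to be satisfied from the page cache rather than refilling the
ARC.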

	Paul.

