ZFS ARC and mmap/page cache coherency question
Paul Koch
paul.koch137 at gmail.com
Fri Jul 1 01:32:53 UTC 2016
Hi Andrew, further info below...
> Heya Paul,
>
> How is your ZFS configured ( zfs get all tank0 )?
>
> These certainly aren't absolute, law, or perfect - but if you haven't yet,
> I suggest you take a peek at the following:
>
> * http://open-zfs.org/wiki/Performance_tuning
> * https://www.joyent.com/blog/bruning-questions-zfs-record-size
> * http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
>
> On Wed, Jun 29, 2016 at 9:06 PM, Paul Koch <paul.koch137 at gmail.com> wrote:
>
> >
> > Posted this to -stable on the 15th June, but no feedback...
> >
> > We are trying to understand a performance issue when syncing large mmap'ed
> > files on ZFS.
> >
> > Example test box setup:
> > FreeBSD 10.3-p5
> > Intel i7-5820K 3.30GHz with 64G RAM
> > 6 * 2 Tbyte Seagate ST2000DM001-1ER164 in a ZFS stripe
> >
> > Read performance of a sequentially written large file on the pool is
> > typically around 950Mbytes/sec using dd.
> >
> > Our software mmap's some large database files using MAP_NOSYNC, and we
> > call fsync() every 10 minutes when we know the file system is mostly
> > idle. In our test setup, the database files are 1.1G, 2G, 1.4G, 12G,
> > 4.7G and ~20 small files (under 10M). All of the memory pages in the
> > mmap'ed files are updated every minute with new values, so the entire
> > mmap'ed file needs to be synced to disk, not just fragments.
> >
> > When the 10 minute fsync() occurs, gstat typically shows very few disk
> > reads and very high write speeds, which is what we expect. But every 80
> > minutes we process the data in the large mmap'ed files and store it in
> > highly compressed blocks of a ~300G file using pread/pwrite (i.e. not
> > mmap'ed). After that, the performance of the next fsync() of the mmap'ed
> > files falls off a cliff. We are assuming it is because the ARC has thrown
> > away the cached data of the mmap'ed files. gstat shows lots of read/write
> > contention and lots of things tend to stall waiting for disk.
> >
> > Is this just a lack of ZFS ARC and page cache coherency ??
> >
> > Is there a way to prime the ARC with the mmap'ed files again before we
> > call fsync() ?
> >
> > We've tried cat and read() on the mmap'ed files, but they don't seem to
> > touch the disk at all and the fsync() performance is still poor, so it
> > looks like the ARC is not being filled. msync() doesn't seem to be much
> > different.
> > mincore() stats show the mmap'ed data is entirely incore and referenced.
> >
> > Paul.
Here is the output of "zfs get all akips" on our pool:
NAME   PROPERTY              VALUE                 SOURCE
akips  type                  filesystem            -
akips  creation              Sat Apr 9 7:29 2016   -
akips  used                  835G                  -
akips  available             9.70T                 -
akips  referenced            96K                   -
akips  compressratio         1.00x                 -
akips  mounted               no                    -
akips  quota                 none                  default
akips  reservation           none                  default
akips  recordsize            128K                  default
akips  mountpoint            none                  local
akips  sharenfs              off                   default
akips  checksum              on                    default
akips  compression           off                   default
akips  atime                 off                   local
akips  devices               on                    default
akips  exec                  on                    default
akips  setuid                on                    default
akips  readonly              off                   default
akips  jailed                off                   default
akips  snapdir               hidden                default
akips  aclmode               discard               default
akips  aclinherit            restricted            default
akips  canmount              on                    default
akips  xattr                 on                    default
akips  copies                1                     default
akips  version               5                     -
akips  utf8only              off                   -
akips  normalization         none                  -
akips  casesensitivity       sensitive             -
akips  vscan                 off                   default
akips  nbmand                off                   default
akips  sharesmb              off                   default
akips  refquota              none                  default
akips  refreservation        none                  default
akips  primarycache          all                   default
akips  secondarycache        all                   default
akips  usedbysnapshots       0                     -
akips  usedbydataset         96K                   -
akips  usedbychildren        835G                  -
akips  usedbyrefreservation  0                     -
akips  logbias               latency               default
akips  dedup                 off                   default
akips  mlslabel              -
akips  sync                  standard              default
akips  refcompressratio      1.00x                 -
akips  written               96K                   -
akips  logicalused           834G                  -
akips  logicalreferenced     9.50K                 -
akips  volmode               default               default
akips  filesystem_limit      none                  default
akips  snapshot_limit        none                  default
akips  filesystem_count      none                  default
akips  snapshot_count        none                  default
akips  redundant_metadata    all                   default
The problem appears to be similar to what is described here:
http://zfs-discuss.opensolaris.narkive.com/tgP1NV9l/remedies-for-suboptimal-mmap-performance-on-zfs
So basically our problem is: we mmap a large file, which gets double-cached in
both the ZFS ARC and the page cache. We update every page of the mmap'ed data,
then flush it out every 10 minutes when we know the disks are mostly idle.
Performance is great.... unless the ZFS ARC no longer holds the double-cached
mmap'ed data, in which case ZFS has to go to disk, causing heaps of read/write
contention, and performance falls off a cliff.
What we want to know is: if there is no ARC/page cache coherency, how do we
prime the ARC with the same data again, so that the next fsync() gets good
write performance?
Paul.