ZFS ARC and mmap/page cache coherency question
cedric.blancher at gmail.com
Sat Jul 2 18:26:57 UTC 2016
Short story: ZFS was tacked onto the kernel and was never properly
integrated into the VM page management, which leads to DRAMATICALLY poor
performance for anything that uses mmap() for write I/O. This was
solved in Oracle Solaris with the great VM allocator rewrite, which
landed after OpenSolaris was made closed source again.
Without a complete rewrite of the VM system this problem is unsolvable.
On 30 June 2016 at 06:06, Paul Koch <paul.koch137 at gmail.com> wrote:
> Posted this to -stable on the 15th June, but no feedback...
> We are trying to understand a performance issue when syncing large mmap'ed
> files on ZFS.
> Example test box setup:
> FreeBSD 10.3-p5
> Intel i7-5820K 3.30GHz with 64G RAM
> 6 * 2 Tbyte Seagate ST2000DM001-1ER164 in a ZFS stripe
> Read performance of a sequentially written large file on the pool is
> typically around 950Mbytes/sec using dd.
> Our software mmap's some large database files using MAP_NOSYNC, and we call
> fsync() every 10 minutes when we know the file system is mostly idle. In
> our test setup, the database files are 1.1G, 2G, 1.4G, 12G, 4.7G and ~20
> small files (under 10M). All of the memory pages in the mmap'ed files are
> updated every minute with new values, so the entire mmap'ed file needs to be
> synced to disk, not just fragments.
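The mapping and periodic flush described in the quoted paragraph above can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the poster's actual code: the helper name update_and_sync, the ftruncate() call that sizes the file, and the MAP_NOSYNC fallback for non-FreeBSD systems are all invented for the example.

```c
#define _DEFAULT_SOURCE 1   /* glibc feature-test macro; has no effect on FreeBSD */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* MAP_NOSYNC is FreeBSD-specific; define it as 0 elsewhere so the sketch builds. */
#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0
#endif

/*
 * Hypothetical helper: map `len` bytes of `path` shared and writable,
 * overwrite every byte with `value`, then flush the file with fsync().
 * The ftruncate() call only exists to make the sketch self-contained;
 * it would resize an existing file. Returns 0 on success, -1 on failure.
 */
int update_and_sync(const char *path, size_t len, unsigned char value)
{
    assert(path != NULL && len > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_NOSYNC, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memset(p, value, len);   /* dirty every page, as the post describes */
    int rc = fsync(fd);      /* stands in for the 10-minute flush */

    if (munmap(p, len) != 0)
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A caller would invoke something like update_and_sync("/tmp/db.bin", 1 << 20, 0xAB) once per flush interval. A real application would keep the mapping open between flushes rather than remapping each time.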
> When the 10 minute fsync() occurs, gstat typically shows very few disk
> reads and very high write speeds, which is what we expect. But, every 80
> minutes we process the data in the large mmap'ed files and store it in highly
> compressed blocks of a ~300G file using pread/pwrite (i.e. not mmap'ed).
> After that, the performance of the next fsync() of the mmap'ed files falls
> off a cliff. We are assuming it is because the ARC has thrown away the
> cached data of the mmap'ed files. gstat shows lots of read/write contention
> and lots of things tend to stall waiting for disk.
> Is this just a lack of ZFS ARC and page cache coherency?
> Is there a way to prime the ARC with the mmap'ed files again before we call
> fsync()?
> We've tried cat and read() on the mmap'ed files but that doesn't seem to touch the
> disk at all and the fsync() performance is still poor, so it looks like the
> ARC is not being filled. msync() doesn't seem to be much different.
> mincore() stats show the mmap'ed data is entirely incore and referenced.
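For readers who want to reproduce the mincore() check mentioned in the quoted message, a minimal C sketch follows. It is an assumption about how such a check might look, not the poster's code: the helper name resident_pages is hypothetical, and it counts only the "in core" bit (bit 0 on both FreeBSD and Linux). FreeBSD additionally reports referenced and modified bits, which this sketch ignores.

```c
#define _DEFAULT_SOURCE 1   /* glibc feature-test macro; has no effect on FreeBSD */
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* FreeBSD defines MINCORE_INCORE as 0x1; Linux uses the same low bit unnamed. */
#ifndef MINCORE_INCORE
#define MINCORE_INCORE 0x1
#endif

/*
 * Hypothetical helper: count the pages in [addr, addr + len) that
 * mincore() reports as resident in memory. `addr` must be page-aligned.
 * Returns the number of resident pages, or -1 on error.
 */
long resident_pages(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);

    long pagesz = sysconf(_SC_PAGESIZE);
    if (pagesz <= 0)
        return -1;

    size_t npages = (len + (size_t)pagesz - 1) / (size_t)pagesz;
    unsigned char *vec = malloc(npages);
    if (vec == NULL)
        return -1;

    /* The vector type differs (char * vs unsigned char *), so pass via void *. */
    if (mincore(addr, len, (void *)vec) != 0) {
        free(vec);
        return -1;
    }

    long resident = 0;
    for (size_t i = 0; i < npages; i++) {
        if (vec[i] & MINCORE_INCORE)
            resident++;
    }
    free(vec);
    return resident;
}
```

Comparing the returned count with the total page count of the mapping shows whether the file's pages are still in the page cache. It says nothing about whether the corresponding blocks are also held in the ARC, which is the distinction the question turns on.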
> freebsd-hackers at freebsd.org mailing list
Cedric Blancher <cedric.blancher at gmail.com>