ZFS-backed NFS export with vSphere

Rick Macklem rmacklem at uoguelph.ca
Sat Jun 29 00:19:35 UTC 2013


Zoltan Nagy wrote:
> Right. As I said, increasing it to 1M increased my throughput from
> 17MB/s to 76MB/s.
> 
I vaguely recall running into problems with UFS when I made MAXBSIZE > 128Kbytes,
but it has been a while since I tried it. Even doubling it would be a significant
change to the default, with the potential for secondary effects I can't predict.
(That's why I haven't even doubled it in head as of yet. I may get that daring
someday, after an email list discussion about the implications.)
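For reference, both limits are compile-time constants rather than sysctls. If you
want to see where they live in a source tree, something like this works (a sketch
that assumes a stock /usr/src layout; the exact files can differ between branches):

  # MAXBSIZE is the buffer cache limit; the NFS code caps itself at NFS_MAXBSIZE
  grep -n MAXBSIZE /usr/src/sys/sys/param.h
  grep -rn NFS_MAXBSIZE /usr/src/sys/fs/nfs/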

You didn't mention which client you are using. (The FreeBSD client won't do an
rsize/wsize greater than MAXBSIZE either, so it will be capped at 64Kbytes.)
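If it is a Linux client, you can check which rsize/wsize actually got negotiated
by looking at the options the kernel reports for the mount (the output line below
is only an illustration, with a made-up server and path):

  # rsize/wsize show up in the option string of each NFS mount
  grep nfs /proc/mounts
  # e.g. server:/tank/vmstore /mnt nfs rw,vers=3,rsize=65536,wsize=65536,... 0 0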

You can also try things like changing rsize, wsize and readahead in the client
mount. (Most clients have some variant of these mount options.)
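Something like the following, just to illustrate the option names (the server,
export path, mount point and sizes are placeholders, not recommendations):

  # FreeBSD client; see mount_nfs(8) for the option names
  mount -t nfs -o nfsv3,rsize=65536,wsize=65536,readahead=4 server:/tank/vmstore /mnt
  # Linux client; see nfs(5) - readahead there is tuned via the bdi read_ahead_kb
  # knob rather than a mount option
  mount -t nfs -o vers=3,rsize=65536,wsize=65536 server:/tank/vmstore /mnt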

Also, ken@ recently added NFSv3 file handle affinity support, which apparently helps
reading from ZFS, since without it the ZFS algorithm that recognizes sequential
access fails and that slows things down. (i.e., you could try a recent 10/current or
stable/9 on the server and see what effect that has.)
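If memory serves, the affinity code hangs its tuning knobs off vfs.nfsd.fha.* on
the new NFS server, so you can tell whether a given build has it with something
like this (the sysctl names are from memory, so treat them as an assumption):

  # An empty vfs.nfsd.fha tree means the running server predates the change
  sysctl vfs.nfsd.fha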

> However, the SSD can do many more random writes; any idea why I don't see the
> ZIL go over this value?
> (vSphere always uses sync writes.)
> 
That's a question for someone else. I have never used ZFS, rick

> Thanks,
> Zoltan
> 
> On Thu, Jun 27, 2013 at 11:58 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> 
> Zoltan Nagy wrote:
> > Hi list,
> > 
> > I'd love to have a ZFS-backed NFS export as my VM datastore, but as
> > much as I'd like to tune it, the performance doesn't even get close
> > to Solaris 11's.
> > 
> > I currently have the system set up like this:
> > 
> >   pool: tank
> >  state: ONLINE
> >   scan: none requested
> > config:
> > 
> >     NAME          STATE     READ WRITE CKSUM
> >     tank          ONLINE       0     0     0
> >       mirror-0    ONLINE       0     0     0
> >         da0       ONLINE       0     0     0
> >         da1       ONLINE       0     0     0
> >       mirror-1    ONLINE       0     0     0
> >         da2       ONLINE       0     0     0
> >         da3       ONLINE       0     0     0
> >     logs
> >       ada0p4      ONLINE       0     0     0
> >     spares
> >       da4         AVAIL
> > 
> > ada0 is a Samsung 840 Pro SSD, which I'm using for system+ZIL.
> > The daX devices are 1TB, 7200rpm Seagate disks.
> > (From this test's perspective, it doesn't matter whether I use a
> > separate ZIL device or just a partition - I get roughly the same
> > numbers.)
> > 
> > The first thing I noticed is that the FSINFO reply from FreeBSD is
> > advertising untunable values (I did not find them documented either
> > in the manpage or as a sysctl):
> > 
> > rtmax, rtpref, wtmax, wtpref: 64k (fbsd), 1M (solaris)
> > dtpref: 64k (fbsd), 8k (solaris)
> > 
> > After manually patching the nfs code (changing NFS_MAXBSIZE to 1M
> > instead of MAXBSIZE) to advertise the same read/write values (I
> > didn't touch dtpref), my performance went up from 17MB/s to 76MB/s.
> > 
> > Is there a reason NFS_MAXBSIZE is not tunable, and/or why is it so slow?
> > 
> For exporting other file system types (UFS, ...) the buffer cache is
> used, and MAXBSIZE is the largest block you can use for the buffer
> cache. Some increase of MAXBSIZE would be nice. (I've tried 128Kb
> without observing difficulties, and from what I've been told 128Kb is
> the ZFS block size.)
> 
> 
> 
> > Here's my iozone output (run on an ext4 partition created on a Linux
> > VM whose disk is backed by the NFS export from the FreeBSD box):
> > 
> > Record Size 4096 KB
> > File size set to 2097152 KB
> > Command line used: iozone -b results.xls -r 4m -s 2g -t 6 -i 0 -i 1 -i 2
> > Output is in Kbytes/sec
> > Time Resolution = 0.000001 seconds.
> > Processor cache size set to 1024 Kbytes.
> > Processor cache line size set to 32 bytes.
> > File stride size set to 17 * record size.
> > Throughput test with 6 processes
> > Each process writes a 2097152 Kbyte file in 4096 Kbyte records
> > 
> > Children see throughput for 6 initial writers  =  76820.31 KB/sec
> > Parent sees throughput for 6 initial writers   =  74899.44 KB/sec
> > Min throughput per process                     =  12298.62 KB/sec
> > Max throughput per process                     =  12972.72 KB/sec
> > Avg throughput per process                     =  12803.38 KB/sec
> > Min xfer                                       = 1990656.00 KB
> > 
> > Children see throughput for 6 rewriters        =  76030.99 KB/sec
> > Parent sees throughput for 6 rewriters         =  75062.91 KB/sec
> > Min throughput per process                     =  12620.45 KB/sec
> > Max throughput per process                     =  12762.80 KB/sec
> > Avg throughput per process                     =  12671.83 KB/sec
> > Min xfer                                       = 2076672.00 KB
> > 
> > Children see throughput for 6 readers          = 114221.39 KB/sec
> > Parent sees throughput for 6 readers           = 113942.71 KB/sec
> > Min throughput per process                     =  18920.14 KB/sec
> > Max throughput per process                     =  19183.80 KB/sec
> > Avg throughput per process                     =  19036.90 KB/sec
> > Min xfer                                       = 2068480.00 KB
> > 
> > Children see throughput for 6 re-readers       = 117018.50 KB/sec
> > Parent sees throughput for 6 re-readers        = 116917.01 KB/sec
> > Min throughput per process                     =  19436.28 KB/sec
> > Max throughput per process                     =  19590.40 KB/sec
> > Avg throughput per process                     =  19503.08 KB/sec
> > Min xfer                                       = 2080768.00 KB
> > 
> > Children see throughput for 6 random readers   = 110072.68 KB/sec
> > Parent sees throughput for 6 random readers    = 109698.99 KB/sec
> > Min throughput per process                     =  18260.33 KB/sec
> > Max throughput per process                     =  18442.55 KB/sec
> > Avg throughput per process                     =  18345.45 KB/sec
> > Min xfer                                       = 2076672.00 KB
> > 
> > Children see throughput for 6 random writers   =  76389.71 KB/sec
> > Parent sees throughput for 6 random writers    =  74816.45 KB/sec
> > Min throughput per process                     =  12592.09 KB/sec
> > Max throughput per process                     =  12843.75 KB/sec
> > Avg throughput per process                     =  12731.62 KB/sec
> > Min xfer                                       = 2056192.00 KB
> > 
> > The other interesting thing is that the system doesn't cache the data
> > file in RAM (the box has 32G), so even for re-reads I get miserable
> > numbers. With Solaris, the re-reads happen at nearly wire speed.
> > 
> > Any ideas what else I could tune? While 76MB/s is much better than the
> > original 17MB/s I was seeing, it's still far from Solaris's ~220MB/s...
> > 
> > Thanks a lot,
> > Zoltan