Terrible NFS4 performance: FreeBSD 9.1 + ZFS + AWS EC2

Rick Macklem rmacklem at uoguelph.ca
Tue Jul 9 23:57:03 UTC 2013


Garrett Wollman wrote:
> <<On Mon, 8 Jul 2013 21:43:52 -0400 (EDT), Rick Macklem
> <rmacklem at uoguelph.ca> said:
> 
> > Berend de Boer wrote:
> >> >>>>> "Rick" == Rick Macklem <rmacklem at uoguelph.ca> writes:
> >> 
> Rick> After you apply the patch and boot the rebuilt kernel, the
> Rick> cpu overheads should be reduced after you increase the value
> Rick> of vfs.nfsd.tcphighwater.
> >> 
> >> What number would I be looking at? 100? 100,000?
> >> 
> Garrett Wollman might have more insight into this, but I would say
> on the order of 100s to maybe 1000s.
> 
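Just to make it concrete: tcphighwater is an ordinary sysctl, so you
can check it and raise it at runtime and watch what happens to the
nfsd cpu time; no reboot is needed. The 4096 below is only a
placeholder, pick something above your peak ops/sec along the lines
Garrett describes below:

# see the current setting
sysctl vfs.nfsd.tcphighwater
# raise it on the fly (example value only)
sysctl vfs.nfsd.tcphighwater=4096
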
> On my production servers, I'm running with the following tuning
> (after Rick's drc4.patch):
> 
> ----loader.conf----
> kern.ipc.nmbclusters="1048576"
> vfs.zfs.scrub_limit="16"
> vfs.zfs.vdev.max_pending="24"
> vfs.zfs.arc_max="48G"
> #
> # Tunable per mps(4).  We had significant numbers of allocation
> # failures with the default value of 2048, so bump it up and see
> # whether there's still an issue.
> #
> hw.mps.max_chains="4096"
> #
> # Simulate the 10-CURRENT autotuning of maxusers based on available
> # memory
> #
> kern.maxusers="8509"
> #
> # Attempt to make the message buffer big enough to retain all the
> # crap that gets spewed on the console when we boot.  64K (the
> # default) isn't enough to even list all of the disks.
> #
> kern.msgbufsize="262144"
> #
> # Tell the TCP implementation to use the specialized, faster but
> # possibly fragile implementation of soreceive.  NFS calls
> # soreceive() a lot and using this implementation, if it works,
> # should improve performance significantly.
> #
> net.inet.tcp.soreceive_stream="1"
> #
> # Six queues per interface means twelve queues total
> # on this hardware, which is a good match for the number
> # of processor cores we have.
> #
> hw.ixgbe.num_queues="6"
> 
> ----sysctl.conf----
> # Make sure that device interrupts are not throttled (10GbE can make
> # lots and lots of interrupts).
> hw.intr_storm_threshold=12000
> 
> # If the NFS replay cache isn't larger than the number of operations
> # nfsd can perform in a second, the nfsd service threads will spend
> # all of their time contending for the mutex that protects the cache
> # data structure so that they can trim it.  If the cache is big
> # enough, the trim will only happen once a second.
> vfs.nfsd.tcpcachetimeo=300
> vfs.nfsd.tcphighwater=150000
> 
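A quick way to sanity check a value like the above against real load
is to watch the per-second server op rate and the cache size itself.
Roughly (exact output varies between releases, so treat this as a
sketch):

# rolling one-second summary of server NFS activity
nfsstat -s -w 1
# extended server stats; with the drc patches this includes CacheSize
nfsstat -e -s

The idea is that vfs.nfsd.tcphighwater should comfortably exceed the
peak ops/sec you see, so the cache only gets trimmed on the
once-a-second timeout rather than on every request.
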
> ----modules/nfs/server/freebsd.pp----
>   exec {'sysctl vfs.nfsd.minthreads':
>     command  => "sysctl vfs.nfsd.minthreads=${min_threads}",
>     onlyif   => "test $(sysctl -n vfs.nfsd.minthreads) -ne ${min_threads}",
>     require  => Service['nfsd'],
>   }
> 
>   exec {'sysctl vfs.nfsd.maxthreads':
>     command  => "sysctl vfs.nfsd.maxthreads=${max_threads}",
>     onlyif   => "test $(sysctl -n vfs.nfsd.maxthreads) -ne ${max_threads}",
>     require  => Service['nfsd'],
>   }
> 
> ($min_threads and $max_threads are manually configured based on
> hardware, currently 16/64 on 8-core machines and 16/96 on 12-core
> machines.)
> 
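For anyone not using Puppet, the same idempotent check-and-set is easy
to do in plain sh once nfsd is running; a minimal sketch, with 16/64
as example values:

#!/bin/sh
# adjust nfsd thread limits only if they differ from the desired values
min_threads=16
max_threads=64
[ "$(sysctl -n vfs.nfsd.minthreads)" -ne "$min_threads" ] && \
    sysctl vfs.nfsd.minthreads=$min_threads
[ "$(sysctl -n vfs.nfsd.maxthreads)" -ne "$max_threads" ] && \
    sysctl vfs.nfsd.maxthreads=$max_threads
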
> As this is the summer, we are currently very lightly loaded.  There's
> apparently still a bug in drc4.patch, because both of my non-scratch
> production servers show a negative CacheSize in nfsstat.
> 
> (I hope that all of these patches will make it into 9.2 so we don't
> have to maintain our own mutant NFS implementation.)
> 
Afraid not. I was planning on getting it in, but the release schedule
came out with only a short time before the code slush. Hopefully a
cleaned-up version of this will make it into 10.0 and 9.3.

rick

> -GAWollman
> 
> 

