NFS on 10G interface terribly slow

Fri Jun 26 23:54:01 UTC 2015

Damien Fleuriot wrote:
> Gerrit,
> 
> 
> Everyone's talking about the network performance and to some extent NFS
> tuning.
> I would argue that given your iperf results, the network itself is not at
> fault.
> 
In this case, I think you might be correct.

However, I need to note that NFS traffic is very different from what iperf
generates and a good result from iperf does not imply that there isn't a
network related problem causing NFS grief.
A couple of examples:
- NFS generates TSO segments that are sometimes just under 64K in length.
  If the network interface has TSO enabled but cannot handle a list of
  35 or more transmit segments (mbufs in the list), this can cause problems.
  Systems more than about a year old could fail completely when the TSO
  segment + IP header exceeded 64K for network interfaces limited to 32
  transmit segments (32 * MCLBYTES == 64K). Also, some interfaces used
  m_collapse() to try to fix the case where the TSO segment had too many
  transmit segments in it, and this almost always failed (you need to use
  m_defrag()).
  --> The worst case failures have been fixed by reducing the default
      maximum TSO segment size to slightly less than 64K (by the maximum
      MAC header length).
      However, drivers limited to fewer than 35 transmit segments (which
      includes at least one of the most common Intel chips) still end up
      generating a lot of overhead by calling m_defrag() over and over and
      over again (with the possibility of failure if mbuf clusters become
      exhausted).
  --> To fix this well, net device drivers need to set a field called
      if_hw_tsomaxsegcount, but if you look in -head, you won't find it
      set in many drivers. (I've posted to freebsd-net multiple times
      asking the net device driver authors to do this, but it hasn't happened
      yet.)
These problems can usually be avoided by disabling TSO.
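
To make the segment arithmetic concrete, here is a small user-space C sketch. It is a simplified model, not the kernel's code: it assumes an outgoing chain is two small header mbufs followed by file data in 2048-byte clusters (SKETCH_MCLBYTES stands in for MCLBYTES), with the data possibly starting partway into the first cluster. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for FreeBSD's standard mbuf cluster size, MCLBYTES (2048). */
#define SKETCH_MCLBYTES 2048

/*
 * Simplified model (hypothetical): a chain of `hdr_mbufs` small header
 * mbufs followed by `data_len` bytes of data in MCLBYTES clusters, where
 * the data starts `first_off` bytes into the first cluster.
 * Returns the number of transmit segments (mbufs) in the chain.
 */
static size_t chain_segments(size_t hdr_mbufs, size_t data_len, size_t first_off)
{
    assert(first_off < SKETCH_MCLBYTES);
    if (data_len == 0)
        return hdr_mbufs;
    /* Round up: a partially filled cluster still costs a whole segment. */
    return hdr_mbufs +
        (first_off + data_len + SKETCH_MCLBYTES - 1) / SKETCH_MCLBYTES;
}

/* A driver limited to `maxsegs` transmit segments must m_defrag() the chain. */
static int needs_defrag(size_t segs, size_t maxsegs)
{
    return segs > maxsegs;
}

/*
 * Largest data length that still fits in `maxsegs` segments under the same
 * model. If the driver advertised its limit (if_hw_tsomaxsegcount), the
 * stack could cap the TSO segment near this size instead of handing the
 * driver a chain it has to m_defrag().
 */
static size_t max_data_for_segs(size_t hdr_mbufs, size_t first_off, size_t maxsegs)
{
    assert(first_off < SKETCH_MCLBYTES);
    if (maxsegs <= hdr_mbufs)
        return 0;
    return (maxsegs - hdr_mbufs) * SKETCH_MCLBYTES - first_off;
}
```

Under these assumptions a TSO segment carrying 65000 bytes of cluster-aligned data needs 34 segments, and 35 when the data starts 1000 bytes into its first cluster, so a driver limited to 32 segments has to m_defrag() it. With a limit of 32, the same model allows at most 30 * 2048 = 61440 bytes of aligned data per TSO segment.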

Another failure case I've seen in the past was where a network interface
would drop a packet in a stream of closely spaced packets on the receive
side while concurrently transmitting. (NFS traffic is bi-directional and
it is common to be receiving and transmitting on a TCP socket concurrently.)

NFS traffic is also very bursty, and that seems to cause problems for certain
network interfaces.
These can usually be worked around by reducing rsize and wsize. (Reducing
rsize and wsize also "fixes" the 64K TSO segment problem, since the TSO
segments won't be as large.)
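
A rough illustration of why a smaller rsize/wsize helps: this sketch counts how many back-to-back 1448-byte TCP segments (a 1500-byte MTU with the TCP timestamp option) one write RPC turns into, and whether that RPC by itself exceeds a given TSO limit. SKETCH_RPC_HDR, a flat 200-byte allowance for RPC/NFS header bytes, is an assumption rather than a measured value, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* MSS for a 1500-byte MTU: 1500 - 20 (IPv4) - 20 (TCP) - 12 (timestamps). */
#define SKETCH_MSS 1448

/* Hypothetical flat allowance for RPC/NFS header bytes per write RPC. */
#define SKETCH_RPC_HDR 200

/*
 * Number of back-to-back wire packets needed to carry one write RPC with
 * `wsize` bytes of data (rounded up). A smaller wsize means a shorter burst.
 */
static size_t packets_per_rpc(size_t wsize)
{
    assert(wsize > 0);
    size_t rpc_bytes = wsize + SKETCH_RPC_HDR;
    return (rpc_bytes + SKETCH_MSS - 1) / SKETCH_MSS;
}

/* Does one write RPC (data plus assumed header bytes) exceed `tso_max` bytes? */
static int rpc_exceeds_tso(size_t wsize, size_t tso_max)
{
    return wsize + SKETCH_RPC_HDR > tso_max;
}
```

With these assumptions, wsize=65536 gives a burst of 46 packets and exceeds a TSO limit just under 64K (e.g. 65517), while wsize=32768 gives a burst of 23 packets and no longer exceeds that limit by itself.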

There are also issues with exhaustion of kernel address space (the area
used to map mbuf clusters) when jumbo packets are used, since this results
in the allocation of mbuf clusters of multiple sizes.
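
As a sketch of where the multiple cluster sizes come from: many drivers pick their receive buffer size from the MTU roughly as below, so a 9000-byte MTU makes the driver allocate 9K clusters in addition to the 2K clusters used elsewhere. The constants mirror FreeBSD's MCLBYTES, MJUMPAGESIZE (assuming 4K pages, as on amd64), MJUM9BYTES and MJUM16BYTES; the 22-byte frame overhead and the function name are assumptions for illustration, not any particular driver's code.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors of FreeBSD mbuf cluster sizes (MJUMPAGESIZE assumes 4K pages). */
#define SKETCH_MCLBYTES     2048
#define SKETCH_MJUMPAGESIZE 4096
#define SKETCH_MJUM9BYTES   (9 * 1024)
#define SKETCH_MJUM16BYTES  (16 * 1024)

/* Assumed overhead: Ethernet header (14) + 802.1Q tag (4) + CRC (4). */
#define SKETCH_FRAME_OVERHEAD (14 + 4 + 4)

/*
 * Pick the smallest receive cluster that holds a full frame for `mtu`.
 * Returns 0 if no cluster size is large enough.
 */
static size_t rx_cluster_size(size_t mtu)
{
    assert(mtu > 0);
    size_t frame = mtu + SKETCH_FRAME_OVERHEAD;

    if (frame <= SKETCH_MCLBYTES)
        return SKETCH_MCLBYTES;
    if (frame <= SKETCH_MJUMPAGESIZE)
        return SKETCH_MJUMPAGESIZE;
    if (frame <= SKETCH_MJUM9BYTES)
        return SKETCH_MJUM9BYTES;
    if (frame <= SKETCH_MJUM16BYTES)
        return SKETCH_MJUM16BYTES;
    return 0;
}
```

For example, rx_cluster_size(1500) returns 2048, while rx_cluster_size(9000) returns 9216 (9 * 1024).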

I think you can see that not all of these will be evident from iperf results.

rick

> In your first post I see no information regarding the local performance of
> your disks, without NFS, that is.
> 
> You may want to look into that first and ensure you get good read and write
> results on the Solaris box, before trying to fix that which might not be at
> fault.
> Perhaps your NFS implementation is already giving you the maximum speed the
> disks can achieve, or close enough.
> 
> You may also want to compare the results with another NFS client to the
> Oracle server, say, god forbid, a *nux box for example.
> 
> 
> On 26 June 2015 at 11:59, Gerrit Kühn <gerrit.kuehn at aei.mpg.de> wrote:
> 
> > On Thu, 25 Jun 2015 20:49:11 -0400 (EDT) Rick Macklem
> > <rmacklem at uoguelph.ca> wrote about Re: NFS on 10G interface terribly slow:
> >
> >
> > RM> Recent commits to stable/10 (not in 10.1) done by Alexander Motin
> > RM> (mav@) might help w.r.t. write performance (it avoids large writes
> > RM> doing synchronous writes when the wcommitsize is exceeded). If you can
> > RM> try stable/10, that might be worth it.
> >
> > Ok, I'll schedule an update then, I guess. OTOH, Scott reported that a
> > similar setup is working fine for him with 10.0 and 10.1, so there is
> > probably not much to gain. I'll try anyway...
> >
> > RM> Otherwise, the main mount option you can try is "wcommitsize", which
> > RM> you probably want to make larger.
> >
> > Hm, which size would you recommend? I cannot find anything about this
> > setting, not even what the default value would be. Is this reflected in
> > some sysctl, or how can I find out what the actual value is?
> >
> >
> > cu
> >   Gerrit
> > _______________________________________________
> > freebsd-net at freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-net
> > To unsubscribe, send any mail to "freebsd-net-unsubscribe at freebsd.org"
> >