review/test: NFS patch to use pagesize mbuf clusters

Rick Macklem rmacklem at uoguelph.ca
Thu Mar 27 22:05:35 UTC 2014


Marcelo Araujo wrote:
> 
> Hello Rick,
> 
> 
> We ran a few tests here, and we could see a little improvement for
> READ!
Cool. Just eyeballing the graphs, it looks like about 10-20% improvement.

Btw, "rsize=262144" will be ignored and it will use a maximum of MAXBSIZE
(65536). (I don't think it's in 9.1, but on newer systems you can "nfsstat -m"
to see what is actually being used.)

A couple of things you might try:
- You didn't mention this, so I don't know, but you probably want more
  than the default # of nfsd threads on the server.
  You can set that with nfs_server_flags="-u -t -n 64" (to set it to 64)
  in /etc/rc.conf. (Double check in /etc/defaults/rc.conf to make sure I got
  the name of it correct.)

- You might want to try increasing readahead with the "readahead=8" mount
  option. (It defaults to only 1, but can be increased to 16. It's kinda
  fun to try values and see what works best; a combined sketch of both of
  these suggestions is below.)
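
For example, something like this (the thread count, export path and mount
point are just made-up examples to illustrate, not recommendations):

    # on the server, in /etc/rc.conf
    nfs_server_enable="YES"
    nfs_server_flags="-u -t -n 64"   # 64 nfsd threads instead of the default

    # on the client (rsize/wsize will be capped at 65536 anyway)
    mount -t nfs -o rsize=65536,wsize=65536,readahead=8 server:/export /mnt
    nfsstat -m    # on newer systems, shows the rsize/readahead actually in use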

> We are still double checking it. All our systems have 10G Intel
> interfaces with TSO enabled, and we have that 32 transmit segment
> limitation. We ran the test several times, and we didn't see any
> regression.
> 
The regression would be threads stuck looping in the kernel, so it will be pretty
obvious when it happens (due to exhaustion of kernel memory, so that it can no
longer allocate "boundary tags", if I understand the problem correctly).
(I doubt this will happen on your hardware. I was able to intermittently
 reproduce it on a 256Mbyte i386, which has only about 77Mbytes of kernel memory.)

Have fun with it, rick

> 
> All our systems are based on 9.1-RELEASE with some NFS and
> IXGBE merges from 10-RELEASE.
> 
> 
> Our machine:
> NIC - 10G Intel X540, which is based on the 82599 chipset.
> 
> RAM - 24G
> CPU - Intel Xeon E5-2448L, 1.80GHz.
> Motherboard - Homemade.
> 
> 
> Attached is a small report; from page 18 on, you can see some graphs
> that will make it easier for you to see the results. So, let me know if
> you want to try anything else, any other patch, and so on. I can keep
> the environment for 1 more week and can run more tests.
> 
> 
> Best Regards,
> 
> 
> 
> 2014-03-19 8:06 GMT+08:00 Rick Macklem < rmacklem at uoguelph.ca > :
> 
> 
> 
> Marcelo Araujo wrote:
> > 
> > Hello Rick,
> > 
> > 
> > I have a couple of machines with 10G interfaces capable of TSO.
> > What kind of result are you expecting? Is it a speed-up in read?
> > 
> Well, if NFS is working well on these systems, I would hope you
> don't see any regression.
> 
> If your TSO enabled interfaces can handle more than 32 transmit
> segments (there is usually a #define constant in the driver with
> something like TX_SEGMAX in its name), and in particular if that
> constant is >= 34, you should see very little effect.
> 
> Even if your network interface is one of the ones limited to 32
> transmit segments, the driver usually fixes the list via a call
> to m_defrag(). Although this involves a bunch of bcopy()'ing, you
> still might not see any easily measured performance improvement,
> assuming m_defrag() is getting the job done.
> (Network latency and disk latency in the server will predominate,
> I suspect. A server built entirely using SSDs might be a different
> story?)
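> 
> To give a rough idea of what the driver ends up doing, the usual pattern
> looks something like this (a hand-waved sketch from memory, not code from
> any particular driver; the txr->txtag/map/segs names are made up):
> 
>     error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head,
>         segs, &nsegs, BUS_DMA_NOWAIT);
>     if (error == EFBIG) {
>             /* Too many segments for the hardware; m_defrag() copies   */
>             /* the chain into as few 2K clusters as possible, then the */
>             /* load is retried once.                                   */
>             struct mbuf *m = m_defrag(m_head, M_NOWAIT);
> 
>             if (m == NULL) {
>                     m_freem(m_head);
>                     return (ENOBUFS);
>             }
>             m_head = m;
>             error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head,
>                 segs, &nsegs, BUS_DMA_NOWAIT);
>     }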
> 
> Thanks for doing testing, since a lack of a regression is what I
> care about most. (I am hoping this resolves cases where users have
> had to disable TSO to make NFS work ok for them.)
> 
> rick
> 
> 
> 
> > 
> > I'm going to run some tests today, but against 9.1-RELEASE, which my
> > servers are running on.
> > 
> > 
> > Best Regards,
> > 
> > 
> > 
> > 
> > 
> > 2014-03-18 9:26 GMT+08:00 Rick Macklem < rmacklem at uoguelph.ca > :
> > 
> > 
> > Hi,
> > 
> > Several of the TSO capable network interfaces have a limit of
> > 32 mbufs in the transmit mbuf chain (the drivers call these
> > transmit
> > segments, which I admit I find confusing).
> > 
> > For a 64K read/readdir reply or 64K write request, NFS passes
> > a list of 34 mbufs down to TCP. TCP will split the list, since
> > it is slightly more than 64K bytes, but that split will normally
> > be a copy by reference of the last mbuf cluster. As such, normally
> > the network interface will get a list of 34 mbufs.
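> > 
> > (To spell out the arithmetic: 65536 bytes of data in regular 2K
> > (MCLBYTES == 2048) clusters is 65536 / 2048 = 32 clusters, plus the
> > extra mbuf(s) at the front holding the RPC header, which is how the
> > chain ends up at 34 mbufs, just over the 32 segment limit.)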
> > 
> > For TSO enabled interfaces that are limited to 32 mbufs in the
> > list, the usual workaround in the driver is to copy { real copy,
> > not copy by reference } the list to 32 mbuf clusters via
> > m_defrag().
> > (A few drivers use m_collapse() which is less likely to succeed.)
> > 
> > As a workaround to this problem, the attached patch modifies NFS
> > to use larger pagesize clusters, so that the 64K RPC message is
> > in 18 mbufs (assuming a 4K pagesize).
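> > 
> > For anyone not familiar with the mbuf(9) allocators, the change boils
> > down to something like the following (a hand-waved sketch, not the
> > patch itself; the real change is where NFS builds the RPC message):
> > 
> >     struct mbuf *m;
> > 
> >     /* Before: regular 2K clusters, so 32 of them for 64K of data. */
> >     m = m_getcl(M_WAITOK, MT_DATA, 0);
> > 
> >     /* After: pagesize (4K) jumbo clusters, so only 16 of them for */
> >     /* 64K of data and about 18 mbufs for the whole RPC message.   */
> >     m = m_getjcl(M_WAITOK, MT_DATA, 0, MJUMPAGESIZE);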
> > 
> > Testing on my slow hardware, which does not have TSO capability,
> > shows it to be performance neutral, but I believe avoiding the
> > overhead of copying via m_defrag() { and possible failures
> > resulting in the message never being transmitted } makes this
> > patch worth doing.
> > 
> > As such, I'd like to request review and/or testing of this patch
> > by anyone who can do so.
> > 
> > Thanks in advance for your help, rick
> > ps: If you don't get the attachment, just email me and I'll
> > send you a copy.
> > 
> > _______________________________________________
> > freebsd-fs at freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> > To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"
> > 
> > 
> > 
> > 
> > --
> > Marcelo Araujo
> > araujo at FreeBSD.org
> 
> 
> 
> 
> --
> Marcelo Araujo
> araujo at FreeBSD.org

