Terrible NFS performance under 9.2-RELEASE?

Sun Feb 2 16:15:39 UTC 2014

Daniel Braniss wrote:
> hi Rick, et.all.
> 
> tried your patch but it didn’t help,the server is stuck.
Oh well. I was hoping that was going to make TSO work reliably.
Just to comfirm it, this server works reliably when TSO is disabled?

Thanks for doing the testing, rick

> just for fun, I tried a different client/host, this one has a
> broadcom NextXtreme II  that was
> MFC’ed lately, and the results are worse than the Intel (5hs instead
> of 4hs) but faster without TSO
> 
> with TSO enabled and bs=32k:
> 5.09hs		18325.62 real      1109.23 user      4591.60 sys
> 
> without TSO:
> 4.75hs		17120.40 real      1114.08 user      3537.61 sys
> 
> So what is the advantage of using TSO? (no complain here, just
> curious)
> 
> I’ll try to see if as a server it has the same TSO related issues.
> 
> cheers,
> 	danny
> 
> On Jan 28, 2014, at 3:51 AM, Rick Macklem <rmacklem at uoguelph.ca>
> wrote:
> 
> > Jack Vogel wrote:
> >> That header file is for the VF driver :) which I don't believe is
> >> being
> >> used in this case.
> >> The driver is capable of handling 256K but its limited by the
> >> stack
> >> to 64K
> >> (look in
> >> ixgbe.h), so its not a few bytes off due to the vlan header.
> >> 
> >> The scatter size is not an arbitrary one, its due to hardware
> >> limitations
> >> in Niantic
> >> (82599).  Turning off TSO in the 10G environment is not practical,
> >> you will
> >> have
> >> trouble getting good performance.
> >> 
> >> Jack
> >> 
> > Well, if you look at this thread, Daniel got much better
> > performance
> > by turning off TSO. However, I agree that this is not an ideal
> > solution.
> > http://docs.FreeBSD.org/cgi/mid.cgi?2C287272-7B57-4AAD-B22F-6A65D9F8677B
> > 
> > rick
> > 
> >> 
> >> 
> >> On Mon, Jan 27, 2014 at 4:58 PM, Yonghyeon PYUN <pyunyh at gmail.com>
> >> wrote:
> >> 
> >>> On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote:
> >>>> pyunyh at gmail.com wrote:
> >>>>> On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
> >>>>>> Adam McDougall wrote:
> >>>>>>> Also try rsize=32768,wsize=32768 in your mount options,
> >>>>>>> made a
> >>>>>>> huge
> >>>>>>> difference for me.  I've noticed slow file transfers on NFS
> >>>>>>> in 9
> >>>>>>> and
> >>>>>>> finally did some searching a couple months ago, someone
> >>>>>>> suggested
> >>>>>>> it
> >>>>>>> and
> >>>>>>> they were on to something.
> >>>>>>> 
> >>>>>> I have a "hunch" that might explain why 64K NFS reads/writes
> >>>>>> perform
> >>>>>> poorly for some network environments.
> >>>>>> A 64K NFS read reply/write request consists of a list of 34
> >>>>>> mbufs
> >>>>>> when
> >>>>>> passed to TCP via sosend() and a total data length of around
> >>>>>> 65680bytes.
> >>>>>> Looking at a couple of drivers (virtio and ixgbe), they seem
> >>>>>> to
> >>>>>> expect
> >>>>>> no more than 32-33 mbufs in a list for a 65535 byte TSO xmit.
> >>>>>> I
> >>>>>> think
> >>>>>> (I don't have anything that does TSO to confirm this) that
> >>>>>> NFS will
> >>>>>> pass
> >>>>>> a list that is longer (34 plus a TCP/IP header).
> >>>>>> At a glance, it appears that the drivers call m_defrag() or
> >>>>>> m_collapse()
> >>>>>> when the mbuf list won't fit in their scatter table (32 or 33
> >>>>>> elements)
> >>>>>> and if this fails, just silently drop the data without
> >>>>>> sending it.
> >>>>>> If I'm right, there would considerable overhead from
> >>>>>> m_defrag()/m_collapse()
> >>>>>> and near disaster if they fail to fix the problem and the
> >>>>>> data is
> >>>>>> silently
> >>>>>> dropped instead of xmited.
> >>>>>> 
> >>>>> 
> >>>>> I think the actual number of DMA segments allocated for the
> >>>>> mbuf
> >>>>> chain is determined by bus_dma(9).  bus_dma(9) will coalesce
> >>>>> current segment with previous segment if possible.
> >>>>> 
> >>>> Ok, I'll have to take a look, but I thought that an array of
> >>>> sized
> >>>> by "num_segs" is passed in as an argument. (And num_segs is set
> >>>> to
> >>>> either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).)
> >>>> It looked to me that the ixgbe driver called itself ix, so it
> >>>> isn't
> >>>> obvious to me which we are talking about. (I know that Daniel
> >>>> Braniss
> >>>> had an ix0 and ix1, which were fixed for NFS by disabling TSO.)
> >>>> 
> >>> 
> >>> It's ix(4). ixbge(4) is a different driver.
> >>> 
> >>>> I'll admit I mostly looked at virtio's network driver, since
> >>>> that
> >>>> was the one being used by J David.
> >>>> 
> >>>> Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have
> >>>> been
> >>>> cropping up for quite a while, and I am just trying to find out
> >>>> why.
> >>>> (I have no hardware/software that exhibits the problem, so I can
> >>>> only look at the sources and ask others to try testing stuff.)
> >>>> 
> >>>>> I'm not sure whether you're referring to ixgbe(4) or ix(4) but
> >>>>> I
> >>>>> see the total length of all segment size of ix(4) is 65535 so
> >>>>> it has no room for ethernet/VLAN header of the mbuf chain.  The
> >>>>> driver should be fixed to transmit a 64KB datagram.
> >>>> Well, if_hw_tsomax is set to 65535 by the generic code (the
> >>>> driver
> >>>> doesn't set it) and the code in tcp_output() seems to subtract
> >>>> the
> >>>> size of an tcp/ip header from that before passing data to the
> >>>> driver,
> >>>> so I think the mbuf chain passed to the driver will fit in one
> >>>> ip datagram. (I'd assume all sorts of stuff would break for TSO
> >>>> enabled drivers if that wasn't the case?)
> >>> 
> >>> I believe the generic code is doing right.  I'm under the
> >>> impression the non-working TSO indicates a bug in driver.  Some
> >>> drivers didn't account for additional ethernet/VLAN header so the
> >>> total size of DMA segments exceeded 65535.  I've attached a diff
> >>> for ix(4). It wasn't tested at all as I don't have hardware to
> >>> test.
> >>> 
> >>>> 
> >>>>> I think the use of m_defrag(9) in TSO is suboptimal. All TSO
> >>>>> capable controllers are able to handle multiple TX buffers so
> >>>>> it
> >>>>> should have used m_collapse(9) rather than copying entire chain
> >>>>> with m_defrag(9).
> >>>>> 
> >>>> I haven't looked at these closely yet (plan on doing so to-day),
> >>>> but
> >>>> even m_collapse() looked like it copied data between mbufs and
> >>>> that
> >>>> is certainly suboptimal, imho. I don't see why a driver can't
> >>>> split
> >>>> the mbuf list, if there are too many entries for the
> >>>> scatter/gather
> >>>> and do it in two iterations (much like tcp_output() does
> >>>> already,
> >>>> since the data length exceeds 65535 - tcp/ip header size).
> >>>> 
> >>> 
> >>> It can split the mbuf list if controllers supports increased
> >>> number
> >>> of TX buffers.  Because controller shall consume the same number
> >>> of
> >>> DMA descriptors for the mbuf list, drivers tend to impose a limit
> >>> on the number of TX buffers to save resources.
> >>> 
> >>>> However, at this point, I just want to find out if the long
> >>>> chain
> >>>> of mbufs is why TSO is problematic for these drivers, since I'll
> >>>> admit I'm getting tired of telling people to disable TSO (and I
> >>>> suspect some don't believe me and never try it).
> >>>> 
> >>> 
> >>> TSO capable controllers tend to have various limitations(the
> >>> first
> >>> TX buffer should have complete ethernet/IP/TCP header, ip_len of
> >>> IP
> >>> header should be reset to 0, TCP pseudo checksum should be
> >>> recomputed etc) and cheap controllers need more assistance from
> >>> driver to let its firmware know various IP/TCP header offset
> >>> location in the mbuf.  Because this requires a IP/TCP header
> >>> parsing, it's error prone and very complex.
> >>> 
> >>>>>> Anyhow, I have attached a patch that makes NFS use
> >>>>>> MJUMPAGESIZE
> >>>>>> clusters,
> >>>>>> so the mbuf count drops from 34 to 18.
> >>>>>> 
> >>>>> 
> >>>>> Could we make it conditional on size?
> >>>>> 
> >>>> Not sure what you mean? If you mean "the size of the
> >>>> read/write",
> >>>> that would be possible for NFSv3, but less so for NFSv4. (The
> >>>> read/write
> >>>> is just one Op. in the compound for NFSv4 and there is no way to
> >>>> predict how much more data is going to be generated by
> >>>> subsequent
> >>>> Ops.)
> >>>> 
> >>> 
> >>> Sorry, I should have been more clearer. You already answered my
> >>> question.  Thanks.
> >>> 
> >>>> If by "size" you mean amount of memory in the machine then, yes,
> >>>> it
> >>>> certainly could be conditional on that. (I plan to try and look
> >>>> at
> >>>> the allocator to-day as well, but if others know of
> >>>> disadvantages
> >>>> with
> >>>> using MJUMPAGESIZE instead of MCLBYTES, please speak up.)
> >>>> 
> >>>> Garrett Wollman already alluded to the MCLBYTES case being
> >>>> pre-allocated,
> >>>> but I'll admit I have no idea what the implications of that are
> >>>> at this
> >>>> time.
> >>>> 
> >>>>>> If anyone has a TSO scatter/gather enabled net interface and
> >>>>>> can
> >>>>>> test this
> >>>>>> patch on it with NFS I/O (default of 64K rsize/wsize) when
> >>>>>> TSO is
> >>>>>> enabled
> >>>>>> and see what effect it has, that would be appreciated.
> >>>>>> 
> >>>>>> Btw, thanks go to Garrett Wollman for suggesting the change
> >>>>>> to
> >>>>>> MJUMPAGESIZE
> >>>>>> clusters.
> >>>>>> 
> >>>>>> rick
> >>>>>> ps: If the attachment doesn't make it through and you want
> >>>>>> the
> >>>>>> patch, just
> >>>>>>    email me and I'll send you a copy.
> >>>>>> 
> >>> 
> >>> _______________________________________________
> >>> freebsd-net at freebsd.org mailing list
> >>> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> >>> To unsubscribe, send any mail to
> >>> "freebsd-net-unsubscribe at freebsd.org"
> >>> 
> >> _______________________________________________
> >> freebsd-net at freebsd.org mailing list
> >> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> >> To unsubscribe, send any mail to
> >> "freebsd-net-unsubscribe at freebsd.org"
> 
> _______________________________________________
> freebsd-net at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscribe at freebsd.org"
>