RFC: How to fix the NFS/iSCSI vs TSO problem

Thu Mar 27 00:27:56 UTC 2014

pyunyh at gmail.com wrote:
> On Tue, Mar 25, 2014 at 07:10:35PM -0400, Rick Macklem wrote:
> > Hi,
> > 
> > First off, I hope you don't mind that I cross-posted this,
> > but I wanted to make sure both the NFS/iSCSI and networking
> > types say it.
> > If you look in this mailing list thread:
> >   http://docs.FreeBSD.org/cgi/mid.cgi?1850411724.1687820.1395621539316.JavaMail.root
> > you'll see that several people have been working hard at testing
> > and
> > thanks to them, I think I now know what is going on.
> 
> 
> Thanks for your hard work on narrowing down that issue.  I'm too
> busy for $work in these days so I couldn't find time to investigate
> the issue.
> 
> > (This applies to network drivers that support TSO and are limited
> > to 32 transmit
> >  segments->32 mbufs in chain.) Doing a quick search I found the
> >  following
> > drivers that appear to be affected (I may have missed some):
> >  jme, fxp, age, sge, msk, alc, ale, ixgbe/ix, nfe, e1000/em, re
> > 
> 
> The magic number 32 was chosen long time ago when I implemented TSO
> in non-Intel drivers.  I tried to find optimal number to reduce the
> size kernel stack usage at that time.  bus_dma(9) will coalesce
> with previous segment if possible so I thought the number 32 was
> not an issue.  Not sure current bus_dma(9) also has the same code
> though.  The number 32 is arbitrary one so you can increase
> it if you want.
> 
Well, in the case of "ix" Jack Vogel says it is a hardware limitation.
I can't change drivers that I can't test and don't know anything about
the hardware. Maybe replacing m_collapse() with m_defrag() is an exception,
since I know what that is doing and it isn't hardware related, but I
would still prefer a review by the driver author/maintainer before making
such a change.

If there are drivers that you know can be increased from 32->35 please do
so, since that will not only avoid the EFBIG failures but also avoid a
lot of calls to m_defrag().

> > Further, of these drivers, the following use m_collapse() and not
> > m_defrag()
> > to try and reduce the # of mbufs in the chain. m_collapse() is not
> > going to
> > get the 35 mbufs down to 32 mbufs, as far as I can see, so these
> > ones are
> > more badly broken:
> >  jme, fxp, age, sge, alc, ale, nfe, re
> 
> I guess m_defeg(9) is more optimized for non-TSO packets. You don't
> want to waste CPU cycles to copy the full frame to reduce the
> number of mbufs in the chain.  For TSO packets, m_defrag(9) looks
> better but if we always have to copy a full TSO packet to make TSO
> work, driver writers have to invent better scheme rather than
> blindly relying on m_defrag(9), I guess.
> 
Yes, avoiding m_defrag() calls would be nice. For this issue, increasing
the transmit segment limit from 32->35 does that, if the change can be
done easily/safely.

Otherwise, all I can think of is my suggestion to add something like
if_hw_tsomaxseg which the driver can use to tell tcp_output() the
driver's limit for # of mbufs in the chain.

> > 
> > The long description is in the above thread, but the short version
> > is:
> > - NFS generates a chain with 35 mbufs in it for (read/readdir
> > replies and write requests)
> >   made up of (tcpip header, RPC header, NFS args, 32 clusters of
> >   file data)
> > - tcp_output() usually trims the data size down to tp->t_tsomax
> > (65535) and
> >   then some more to make it an exact multiple of TCP transmit data
> >   size.
> >   - the net driver prepends an ethernet header, growing the length
> >   by 14 (or
> >     sometimes 18 for vlans), but in the first mbuf and not adding
> >     one to the chain.
> >   - m_defrag() copies this to a chain of 32 mbuf clusters (because
> >   the total data
> >     length is <= 64K) and it gets sent
> > 
> > However, it the data length is a little less than 64K when passed
> > to tcp_output()
> > so that the length including headers is in the range
> > 65519->65535...
> > - tcp_output() doesn't reduce its size.
> >   - the net driver adds an ethernet header, making the total data
> >   length slightly
> >     greater than 64K
> >   - m_defrag() copies it to a chain of 33 mbuf clusters, which
> >   fails with EFBIG
> > --> trainwrecks NFS performance, because the TSO segment is dropped
> > instead of sent.
> > 
> > A tester also stated that the problem could be reproduced using
> > iSCSI. Maybe
> > Edward Napierala might know some details w.r.t. what kind of mbuf
> > chain iSCSI
> > generates?
> > 
> > Also, one tester has reported that setting if_hw_tsomax in the
> > driver before
> > the ether_ifattach() call didn't make the value of tp->t_tsomax
> > smaller.
> > However, reducing IP_MAXPACKET (which is what it is set to by
> > default) did
> > reduce it. I have no idea why this happens or how to fix it, but it
> > implies
> > that setting if_hw_tsomax in the driver isn't a solution until this
> > is resolved.
> > 
> > So, what to do about this?
> > First, I'd like a simple fix/workaround that can go into 9.3.
> > (which is code
> > freeze in May). The best thing I can think of is setting
> > if_hw_tsomax to a
> > smaller default value. (Line# 658 of sys/net/if.c in head.)
> > 
> > Version A:
> > replace
> >   ifp->if_hw_tsomax = IP_MAXPACKET;
> > with
> >   ifp->if_hw_tsomax = min(32 * MCLBYTES - (ETHER_HDR_LEN +
> >   ETHER_VLAN_ENCAP_LEN),
> >       IP_MAXPACKET);
> > plus
> >   replace m_collapse() with m_defrag() in the drivers listed above.
> > 
> > This would only reduce the default from 65535->65518, so it only
> > impacts
> > the uncommon case where the output size (with tcpip header) is
> > within
> > this range. (As such, I don't think it would have a negative impact
> > for
> > drivers that handle more than 32 transmit segments.)
> > From the testers, it seems that this is sufficient to get rid of
> > the EFBIG
> > errors. (The total data length including ethernet header doesn't
> > exceed 64K,
> > so m_defrag() fits it into 32 mbuf clusters.)
> > 
> > The main downside of this is that there will be a lot of m_defrag()
> > calls
> > being done and they do quite a bit of bcopy()'ng.
> > 
> > Version B:
> > replace
> >   ifp->if_hw_tsomax = IP_MAXPACKET;
> > with
> >   ifp->if_hw_tsomax = min(29 * MCLBYTES, IP_MAXPACKET);
> > 
> > This one would avoid the m_defrag() calls, but might have a
> > negative
> > impact on TSO performance for drivers that can handle 35 transmit
> > segments,
> > since the maximum TSO segment size is reduced by about 6K. (Because
> > of the
> > second size reduction to an exact multiple of TCP transmit data
> > size, the
> > exact amount varies.)
> > 
> > Possible longer term fixes:
> > One longer term fix might be to add something like if_hw_tsomaxseg
> > so that
> > a driver can set a limit on the number of transmit segments (mbufs
> > in chain)
> > and tcp_output() could use that to limit the size of the TSO
> > segment, as
> > required. (I have a first stab at such a patch, but no way to test
> > it, so
> > I can't see that being done by May. Also, it would require changes
> > to a lot
> > of drivers to make it work. I've attached this patch, in case
> > anyone wants
> > to work on it?)
> > 
> > Another might be to increase the size of MCLBYTES (I don't see this
> > as
> > practical for 9.3, although the actual change is simple.). I do
> > think
> > that increasing MCLBYTES might be something to consider doing in
> > the
> > future, for reasons beyond fixing this.
> > 
> > So, what do others think should be done? rick
> > 
> 
> AFAIK all TSO capable drivers you mentioned above have no limit on
> the number of TX segments in TSO path.  Not sure about Intel
> controllers though.  Increasing the number of segment will consume
> lots of kernel stack in those drivers. Given that ixgbe, which
> seems to use 100, didn't show any kernal stack shortage, I think
> bumping the number of segments would be quick way to address the
> issue.
> 
Well, bumping it from 32->35 is all it would take for NFS (can't comment
w.r.t. iSCSI). ixgbe uses 100 for the 82598 chip and 32 for the 82599
(just so others aren't confused by the above comment). I understand
your point was w.r.t. using 100 without blowing the kernel stack, but
since the testers have been using "ix" with the 82599 chip which is
limited to 32 transmit segments...

However, please increase any you know can be safely done from 32->35, rick

_______________________________________________
> freebsd-net at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscribe at freebsd.org"
>