RFC: How to fix the NFS/iSCSI vs TSO problem

Yonghyeon PYUN pyunyh at gmail.com
Wed Mar 26 02:33:46 UTC 2014


On Tue, Mar 25, 2014 at 07:10:35PM -0400, Rick Macklem wrote:
> Hi,
> 
> First off, I hope you don't mind that I cross-posted this,
> but I wanted to make sure both the NFS/iSCSI and networking
> types see it.
> If you look in this mailing list thread:
>   http://docs.FreeBSD.org/cgi/mid.cgi?1850411724.1687820.1395621539316.JavaMail.root
> you'll see that several people have been working hard at testing and
> thanks to them, I think I now know what is going on.


Thanks for your hard work on narrowing down that issue.  I'm too busy
with $work these days, so I couldn't find time to investigate it.

> (This applies to network drivers that support TSO and are limited to 32 transmit
>  segments->32 mbufs in chain.) Doing a quick search I found the following
> drivers that appear to be affected (I may have missed some):
>  jme, fxp, age, sge, msk, alc, ale, ixgbe/ix, nfe, e1000/em, re
> 

The magic number 32 was chosen a long time ago, when I implemented TSO
in the non-Intel drivers.  At the time I tried to pick a number that
kept kernel stack usage down.  bus_dma(9) will coalesce a segment with
the previous one when possible, so I didn't think 32 would be an
issue.  I'm not sure whether current bus_dma(9) still has that code,
though.  The number 32 is an arbitrary one, so you can increase it if
you want.
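
For anyone not familiar with where that 32 lives, here is a rough
sketch of how a driver typically bakes it into the bus_dma(9) tag it
uses for TSO transmits.  The foo_* names and the maxsize/maxsegsz
values are made up for illustration; the point is the nsegments
argument, which is what later makes the DMA load fail with EFBIG once
a chain needs more segments than that:

#define FOO_MAXTXSEGS   32      /* the arbitrary "magic" number */

static int
foo_tso_tag_create(struct foo_softc *sc)
{
        int error;

        /* One tag per TSO frame: at most FOO_MAXTXSEGS DMA segments. */
        error = bus_dma_tag_create(
            bus_get_dma_tag(sc->foo_dev),       /* parent */
            1, 0,                               /* alignment, boundary */
            BUS_SPACE_MAXADDR,                  /* lowaddr */
            BUS_SPACE_MAXADDR,                  /* highaddr */
            NULL, NULL,                         /* filter, filterarg */
            65535 + sizeof(struct ether_vlan_header), /* maxsize */
            FOO_MAXTXSEGS,                      /* nsegments: EFBIG past this */
            65535,                              /* maxsegsz */
            0,                                  /* flags */
            NULL, NULL,                         /* lockfunc, lockfuncarg */
            &sc->foo_tx_tso_tag);
        return (error);
}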

> Further, of these drivers, the following use m_collapse() and not m_defrag()
> to try and reduce the # of mbufs in the chain. m_collapse() is not going to
> get the 35 mbufs down to 32 mbufs, as far as I can see, so these ones are
> more badly broken:
>  jme, fxp, age, sge, alc, ale, nfe, re

I guess m_defrag(9) is more optimized for non-TSO packets.  You don't
want to waste CPU cycles copying the full frame just to reduce the
number of mbufs in the chain.  For TSO packets m_defrag(9) looks
better, but if we always have to copy a full TSO packet to make TSO
work, driver writers will have to invent a better scheme rather than
blindly relying on m_defrag(9), I guess.
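
For reference, the code path in question typically looks like the
sketch below (same hypothetical foo_* driver as above).  The single
EFBIG retry is the only chance the driver gets to shrink the chain,
which is why the choice matters: m_collapse(9) only squeezes data into
adjacent mbufs and won't turn a 35-mbuf chain into 32, while
m_defrag(9) copies the whole chain into fresh clusters and succeeds as
long as the total length fits in 32 of them:

static int
foo_encap(struct foo_softc *sc, struct mbuf **m_head)
{
        bus_dma_segment_t segs[FOO_MAXTXSEGS];
        struct mbuf *m;
        int error, nsegs;

        error = bus_dmamap_load_mbuf_sg(sc->foo_tx_tso_tag, sc->foo_tx_map,
            *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
        if (error == EFBIG) {
                /* Too many segments; one attempt to compact the chain. */
                m = m_defrag(*m_head, M_NOWAIT); /* the drivers above use m_collapse() */
                if (m == NULL) {
                        m_freem(*m_head);
                        *m_head = NULL;
                        return (ENOBUFS);
                }
                *m_head = m;
                error = bus_dmamap_load_mbuf_sg(sc->foo_tx_tso_tag,
                    sc->foo_tx_map, *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
        }
        if (error != 0) {
                m_freem(*m_head);
                *m_head = NULL;
                return (error);
        }
        /* ... fill the TX descriptors from segs[0..nsegs-1] ... */
        return (0);
}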

> 
> The long description is in the above thread, but the short version is:
> - NFS generates a chain with 35 mbufs in it for (read/readdir replies and write requests)
>   made up of (tcpip header, RPC header, NFS args, 32 clusters of file data)
> - tcp_output() usually trims the data size down to tp->t_tsomax (65535) and
>   then some more to make it an exact multiple of TCP transmit data size.
>   - the net driver prepends an ethernet header, growing the length by 14 (or
>     sometimes 18 for vlans), but in the first mbuf and not adding one to the chain.
>   - m_defrag() copies this to a chain of 32 mbuf clusters (because the total data
>     length is <= 64K) and it gets sent
> 
> However, if the data length is a little less than 64K when passed to tcp_output()
> so that the length including headers is in the range 65519->65535...
> - tcp_output() doesn't reduce its size.
>   - the net driver adds an ethernet header, making the total data length slightly
>     greater than 64K
>   - m_defrag() copies it to a chain of 33 mbuf clusters, which fails with EFBIG
> --> trainwrecks NFS performance, because the TSO segment is dropped instead of sent.
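
To make that boundary concrete, a throwaway userland check of the
arithmetic (assuming the stock 2K cluster, MCLBYTES = 2048, and a 14
or 18 byte Ethernet header):

#include <stdio.h>

#define MCLBYTES                2048
#define ETHER_HDR_LEN           14
#define ETHER_VLAN_HDR_LEN      18

int
main(void)
{
        int tso_len, hdr, clusters;

        /* 32 * MCLBYTES = 65536, so anything past that needs 33 clusters. */
        for (tso_len = 65515; tso_len <= 65535; tso_len += 4)
                for (hdr = ETHER_HDR_LEN; hdr <= ETHER_VLAN_HDR_LEN; hdr += 4) {
                        clusters = (tso_len + hdr + MCLBYTES - 1) / MCLBYTES;
                        printf("tso %5d + %2d byte header -> %d clusters%s\n",
                            tso_len, hdr, clusters, clusters > 32 ?
                            " (EFBIG with a 32 segment limit)" : "");
                }
        return (0);
}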
> 
> A tester also stated that the problem could be reproduced using iSCSI. Maybe
> Edward Napierala might know some details w.r.t. what kind of mbuf chain iSCSI
> generates?
> 
> Also, one tester has reported that setting if_hw_tsomax in the driver before
> the ether_ifattach() call didn't make the value of tp->t_tsomax smaller.
> However, reducing IP_MAXPACKET (which is what it is set to by default) did
> reduce it. I have no idea why this happens or how to fix it, but it implies
> that setting if_hw_tsomax in the driver isn't a solution until this is resolved.
> 
> So, what to do about this?
> First, I'd like a simple fix/workaround that can go into 9.3. (which is code
> freeze in May). The best thing I can think of is setting if_hw_tsomax to a
> smaller default value. (Line# 658 of sys/net/if.c in head.)
> 
> Version A:
> replace
>   ifp->if_hw_tsomax = IP_MAXPACKET;
> with
>   ifp->if_hw_tsomax = min(32 * MCLBYTES - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN),
>       IP_MAXPACKET);
> plus
>   replace m_collapse() with m_defrag() in the drivers listed above.
> 
> This would only reduce the default from 65535->65518, so it only impacts
> the uncommon case where the output size (with tcpip header) is within
> this range. (As such, I don't think it would have a negative impact for
> drivers that handle more than 32 transmit segments.)
> From the testers, it seems that this is sufficient to get rid of the EFBIG
> errors. (The total data length including ethernet header doesn't exceed 64K,
> so m_defrag() fits it into 32 mbuf clusters.)
> 
> The main downside of this is that there will be a lot of m_defrag() calls
> being done and they do quite a bit of bcopy()'ng.
> 
> Version B:
> replace
>   ifp->if_hw_tsomax = IP_MAXPACKET;
> with
>   ifp->if_hw_tsomax = min(29 * MCLBYTES, IP_MAXPACKET);
> 
> This one would avoid the m_defrag() calls, but might have a negative
> impact on TSO performance for drivers that can handle 35 transmit segments,
> since the maximum TSO segment size is reduced by about 6K. (Because of the
> second size reduction to an exact multiple of TCP transmit data size, the
> exact amount varies.)
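
Just to spell out the sizes those two expressions produce (again
assuming MCLBYTES = 2048), a quick userland check:

#include <stdio.h>

#define MCLBYTES        2048
#define IP_MAXPACKET    65535

int
main(void)
{
        /* Version A: still fits 32 clusters after a VLAN-tagged header. */
        printf("A: %d bytes\n", 32 * MCLBYTES - (14 + 4));      /* 65518 */
        /* Version B: 29 clusters, well under 32 even with headers added. */
        printf("B: %d bytes\n", 29 * MCLBYTES);                 /* 59392 */
        printf("B gives up %d bytes per TSO burst\n",
            IP_MAXPACKET - 29 * MCLBYTES);                      /* 6143, ~6K */
        return (0);
}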
> 
> Possible longer term fixes:
> One longer term fix might be to add something like if_hw_tsomaxseg so that
> a driver can set a limit on the number of transmit segments (mbufs in chain)
> and tcp_output() could use that to limit the size of the TSO segment, as
> required. (I have a first stab at such a patch, but no way to test it, so
> I can't see that being done by May. Also, it would require changes to a lot
> of drivers to make it work. I've attached this patch, in case anyone wants
> to work on it?)
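
Not having seen the attached patch, I'd guess the idea is roughly
something like the fragment below inside tcp_output().  t_tsomaxseg is
the hypothetical new per-connection copy of the driver's segment
limit; tso, len, hdrlen and sendalot are the existing tcp_output()
locals, and the sketch naively assumes one full cluster per transmit
segment:

        /*
         * Hypothetical: clamp the TSO burst so that it fits the
         * driver's segment budget as well as t_tsomax.
         */
        if (tso && tp->t_tsomaxseg != 0) {
                long seglimit;

                seglimit = (long)tp->t_tsomaxseg * MCLBYTES - hdrlen;
                if (len > seglimit) {
                        len = seglimit;
                        sendalot = 1;
                }
        }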
> 
> Another might be to increase the size of MCLBYTES (I don't see this as
> practical for 9.3, although the actual change is simple.). I do think
> that increasing MCLBYTES might be something to consider doing in the
> future, for reasons beyond fixing this.
> 
> So, what do others think should be done? rick
> 

AFAIK, none of the TSO-capable drivers you mentioned above has a
hardware limit on the number of TX segments in the TSO path (I'm not
sure about the Intel controllers, though).  Increasing the number of
segments will consume a lot of kernel stack in those drivers, but
given that ixgbe, which seems to use 100, didn't show any kernel stack
shortage, I think bumping the number of segments would be the quick
way to address the issue.
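
To put rough numbers on the stack cost of the on-stack segment array,
here is a quick userland check (bus_dma_segment_t is just an address
plus a length, 16 bytes on a typical 64-bit platform):

#include <stdio.h>

/* Stand-in for bus_dma_segment_t on a 64-bit platform. */
struct fake_seg {
        unsigned long   ds_addr;        /* bus_addr_t */
        unsigned long   ds_len;         /* bus_size_t */
};

int
main(void)
{
        printf("32 segs : %4zu bytes of stack\n", 32 * sizeof(struct fake_seg));
        printf("35 segs : %4zu bytes\n", 35 * sizeof(struct fake_seg));
        printf("100 segs: %4zu bytes\n", 100 * sizeof(struct fake_seg));
        return (0);
}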

