review/test: NFS patch to use pagesize mbuf clusters

Alexander Motin mav at FreeBSD.org
Tue Mar 18 08:40:03 UTC 2014


Hi.

On 18.03.2014 03:26, Rick Macklem wrote:
> Several of the TSO capable network interfaces have a limit of
> 32 mbufs in the transmit mbuf chain (the drivers call these transmit
> segments, which I admit I find confusing).
>
> For a 64K read/readdir reply or 64K write request, NFS passes
> a list of 34 mbufs down to TCP. TCP will split the list, since
> it is slightly more than 64K bytes, but that split will normally
> be a copy by reference of the last mbuf cluster. As such, normally
> the network interface will get a list of 34 mbufs.
>
> For TSO enabled interfaces that are limited to 32 mbufs in the
> list, the usual workaround in the driver is to copy { real copy,
> not copy by reference } the list to 32 mbuf clusters via m_defrag().
> (A few drivers use m_collapse() which is less likely to succeed.)
>
> As a workaround to this problem, the attached patch modifies NFS
> to use larger pagesize clusters, so that the 64K RPC message is
> in 18 mbufs (assuming a 4K pagesize).
>
> Testing on my slow hardware which does not have TSO capability
> shows it to be performance neutral, but I believe avoiding the
> overhead of copying via m_defrag() { and possible failures
> resulting in the message never being transmitted } makes this
> patch worth doing.
>
> As such, I'd like to request review and/or testing of this patch
> by anyone who can do so.

First, I tried to find an affected NIC to test: cxgb/cxgbe have a limit 
of 36, and so are probably unaffected; ixgb -- 100, igb -- 64. Only on em 
did I find a limit of 32.

I ran several profiles on the em NIC with and without the patch. I can 
confirm that without the patch m_defrag() is indeed called, while with 
the patch it no longer is. But the profiler shows that only a very small 
amount of time (a few percent, or even fractions of a percent) is spent 
there. I can't measure the effect (my Core-i7 desktop test system has 
only about 5% CPU load while serving a full 1Gbps of NFS over the em), 
though I can't say for sure that there is no effect on some low-end 
system.

I am also not very sure about replacing M_WAITOK with M_NOWAIT. Instead 
of waiting a bit while the VM finds a cluster, NFSMCLGET() will return a 
single plain mbuf; as a result, a chain of 2K (or now 4K) clusters is 
replaced with a chain of 256-byte mbufs.

-- 
Alexander Motin


More information about the freebsd-fs mailing list