svn commit: r254520 - in head/sys: kern sys

Andre Oppermann andre at freebsd.org
Mon Aug 19 20:58:55 UTC 2013


On 19.08.2013 19:33, Navdeep Parhar wrote:
> On 08/19/13 04:16, Andre Oppermann wrote:
>> Author: andre
>> Date: Mon Aug 19 11:16:53 2013
>> New Revision: 254520
>> URL: http://svnweb.freebsd.org/changeset/base/254520
>>
>> Log:
>>    Remove the unused M_NOFREE mbuf flag.  It didn't have any in-tree users
>>    for a very long time, if ever.
>>
>>    Should such a functionality ever be needed again the appropriate and
>>    much better way to do it is through a custom EXT_SOMETHING external mbuf
>>    type together with a dedicated *ext_free function.
>>
>>    Discussed with:	trociny, glebius
>>
>> Modified:
>>    head/sys/kern/kern_mbuf.c
>>    head/sys/kern/uipc_mbuf.c
>>    head/sys/sys/mbuf.h
>>
>
> Hello Andre,
>
> Is this just garbage collection or is there some other reason for this?

This is garbage collection and removal of not quite right, rotten, functionality.

> I recently tried some experiments to reduce the number of mbuf and
> cluster allocations in a 40G NIC driver.  M_NOFREE and EXT_EXTREF proved
> very useful and the code changes to the kernel were minimal.  See
> user/np/cxl_tuning.  The experiment was quite successful and I was
> planning to bring in most of those changes to HEAD.  I was hoping to get
> some runtime mileage on the approach in general before tweaking the
> ctors/dtors for jumpbo, jumbo9, jumbo16 to allow for an mbuf+refcnt
> within the cluster.  But now M_NOFREE has vanished without a warning...

I'm looking through your experimental code and that is some really good
numbers you're achieving there!

However a couple things don't feel quite right, hackish even, and not
fit for HEAD.  This is a bit the same situation we had with some of the
first 1GigE cards quite a number of years back (mostly ti(4)).  There
we ended up with a couple of just good enough hacks to make it fast.
Most of the remains I've collected today.

I believe most of the improvements you've shown can be implemented in
a more generic and safe way into the mbuf system.  Also a number of things
in your experimental code may have side-effects in situations other than
netperf runs.

To summarize what I get from your experimental branch commits:
- the Chelsio T4/T5 card can DMA multiple ethernet frames (packets) into
   a single memory buffer, provided it is large enough.
- you make use of that feature and point multiple m_ext mbufs into that
   buffer to separate the packets and send them up the stack.
- you embed the m_ext refcount into the single memory buffer as well.
- you recycle mbufs? (I'm not entirely clear on that as I'm not familiar
   with the cxgbe code)

Lets examine and discuss these parts:
- M_NOFREE wasn't really safe with bpf anyway at least for packets going
   down the stack.
- Instead of M_NOFREE a custom *ext_free should be used that has the same
   and even more functionality.
- Recycling mbufs may cause imbalances to the mbuf pool with multiple cores
   and possibly NUMA in the future.  Not that we are very good with it at
   the moment but bypassing the mbuf allocator shouldn't become the norm.
   If it is a problem/bottleneck again it should be fixed, not worked around
   and copy-pasted n-times in so many drivers.
- jumbo9 and jumbo16 mbufs should not be used because they are more special
   than necessary with being KVM and physically contiguous.  This probably
   isn't necessary for the T4/T5 cards and any other modern DMA engine.
   Under heavy diverse network the kernel memory becomes fragmented and can't
   find memory fulfilling both criteria anymore.  In fact both are an artifact
   of the early GigE hacks when high speed DMA engines were in their infancy.
   Both jumbo9 and jumbo16 should go away without direct replacement.
   In your T4/T5 case the alternative would be either to a) allocate your
   own memory directly from KVM with the necessary properties (KVM and/or phys
   contig); b) have such a generic kernel mbuf type fulfilling the same role.
   There may be some cache line issues on non-x86 systems that have to be
   though and taken care of.
- Refcounts should have the option of being separately allocated.  It was
   mandatory to use the refcount zone so far because there wasn't any other
   type of refcount.  Now that we have one we should revisit the issue.
   Actually the entire mbuf allocation and initialization could be streamlined
   as a whole.

On a side note a different mbuf handling for the RX DMA rings may give some
significant improvements as well: allocate just a cluster without a mbuf through
m_cljget() and put it into the RX ring.  Only when the DMA has put a packet into
it allocated attach the mbuf (in the drivers RX function).  This avoids the cache
pollution from touching the mbuf fields during initial allocation, including the
refcount.

Another nice trick would be to shorten the mbuf by 4 bytes (in ext_size) and put
the refcount there.

Lets work on these together.

-- 
Andre



More information about the svn-src-head mailing list