svn commit: r254520 - in head/sys: kern sys

Mon Aug 19 21:45:24 UTC 2013

On 08/19/13 13:58, Andre Oppermann wrote:
> On 19.08.2013 19:33, Navdeep Parhar wrote:
>> On 08/19/13 04:16, Andre Oppermann wrote:
>>> Author: andre
>>> Date: Mon Aug 19 11:16:53 2013
>>> New Revision: 254520
>>> URL: http://svnweb.freebsd.org/changeset/base/254520
>>>
>>> Log:
>>>    Remove the unused M_NOFREE mbuf flag.  It didn't have any in-tree
>>> users
>>>    for a very long time, if ever.
>>>
>>>    Should such a functionality ever be needed again the appropriate and
>>>    much better way to do it is through a custom EXT_SOMETHING
>>> external mbuf
>>>    type together with a dedicated *ext_free function.
>>>
>>>    Discussed with:    trociny, glebius
>>>
>>> Modified:
>>>    head/sys/kern/kern_mbuf.c
>>>    head/sys/kern/uipc_mbuf.c
>>>    head/sys/sys/mbuf.h
>>>
>>
>> Hello Andre,
>>
>> Is this just garbage collection or is there some other reason for this?
> 
> This is garbage collection and removal of not quite right, rotten,
> functionality.
> 
>> I recently tried some experiments to reduce the number of mbuf and
>> cluster allocations in a 40G NIC driver.  M_NOFREE and EXT_EXTREF proved
>> very useful and the code changes to the kernel were minimal.  See
>> user/np/cxl_tuning.  The experiment was quite successful and I was
>> planning to bring in most of those changes to HEAD.  I was hoping to get
>> some runtime mileage on the approach in general before tweaking the
>> ctors/dtors for jumpbo, jumbo9, jumbo16 to allow for an mbuf+refcnt
>> within the cluster.  But now M_NOFREE has vanished without a warning...
> 
> I'm looking through your experimental code and that is some really good
> numbers you're achieving there!
> 
> However a couple things don't feel quite right, hackish even, and not
> fit for HEAD.  This is a bit the same situation we had with some of the
> first 1GigE cards quite a number of years back (mostly ti(4)).  There
> we ended up with a couple of just good enough hacks to make it fast.
> Most of the remains I've collected today.

If M_NOFREE and EXT_EXTREF are properly supported in the tree (and I'm
arguing that they were, before r254520) then the changes are perfectly
legitimate.  The only hackish part was that I was getting the cluster
from the jumbop zone while bypassing its normal refcnt mechanism.  This
I did so as to use the same zone as m_uiotombuf to keep it "hot" for all
consumers (driver + network stack).

> I believe most of the improvements you've shown can be implemented in
> a more generic and safe way into the mbuf system.  Also a number of things
> in your experimental code may have side-effects in situations other than
> netperf runs.

Agreed.  As I mentioned my long term plan was to tweak the jumboX zones
to allow them to allocate cluster with embedded mbuf + refcount.  The
M_NOFREE+EXT_EXTREF approach is the perfect bridge from here to there.
It is non-intrusive and lends itself well to experimentation.

> 
> To summarize what I get from your experimental branch commits:
> - the Chelsio T4/T5 card can DMA multiple ethernet frames (packets) into
>   a single memory buffer, provided it is large enough.
> - you make use of that feature and point multiple m_ext mbufs into that
>   buffer to separate the packets and send them up the stack.
> - you embed the m_ext refcount into the single memory buffer as well.

yes, yes, and yes.

> - you recycle mbufs? (I'm not entirely clear on that as I'm not familiar
>   with the cxgbe code)

I recycle the cluster (and the embedded mbuf in it) when possible.  All
depends on whether it's still in use by the time the rx queue needs it back.

> 
> Lets examine and discuss these parts:
> - M_NOFREE wasn't really safe with bpf anyway at least for packets going
>   down the stack.

I see.  Can you point out the parts of bpf unable to deal with M_NOFREE?

> - Instead of M_NOFREE a custom *ext_free should be used that has the same
>   and even more functionality.

Yes, that's what I was planning too, with the jumboX zone changes.  It
would be faster than running the m_ext's free function (which is a
function dereference+call).

> - Recycling mbufs may cause imbalances to the mbuf pool with multiple cores
>   and possibly NUMA in the future.  Not that we are very good with it at
>   the moment but bypassing the mbuf allocator shouldn't become the norm.
>   If it is a problem/bottleneck again it should be fixed, not worked around
>   and copy-pasted n-times in so many drivers.

If/when a cluster is recycled, it is given back to the same rx ithread
that originally allocated it, and not not any other queue.  If the
ithread stays in the same NUMA domain (and it really should) then
recycling the cluster in the same queue really shouldn't cause imbalances.

> - jumbo9 and jumbo16 mbufs should not be used because they are more special
>   than necessary with being KVM and physically contiguous.  This probably
>   isn't necessary for the T4/T5 cards and any other modern DMA engine.
>   Under heavy diverse network the kernel memory becomes fragmented and
> can't
>   find memory fulfilling both criteria anymore.  In fact both are an
> artifact
>   of the early GigE hacks when high speed DMA engines were in their
> infancy.
>   Both jumbo9 and jumbo16 should go away without direct replacement.

The devices I deal will be able to cope, but I suspect this will be
disruptive (for not enough reason to justify the disruption, imho) to
others.  That's a separate discussion anyway.

>   In your T4/T5 case the alternative would be either to a) allocate your
>   own memory directly from KVM with the necessary properties (KVM and/or
> phys
>   contig); b) have such a generic kernel mbuf type fulfilling the same
> role.
>   There may be some cache line issues on non-x86 systems that have to be
>   though and taken care of.
> - Refcounts should have the option of being separately allocated.  It was
>   mandatory to use the refcount zone so far because there wasn't any other
>   type of refcount.  Now that we have one we should revisit the issue.
>   Actually the entire mbuf allocation and initialization could be
> streamlined
>   as a whole.
> 
> On a side note a different mbuf handling for the RX DMA rings may give some
> significant improvements as well: allocate just a cluster without a mbuf
> through
> m_cljget() and put it into the RX ring.  Only when the DMA has put a
> packet into
> it allocated attach the mbuf (in the drivers RX function).  This avoids
> the cache
> pollution from touching the mbuf fields during initial allocation,
> including the
> refcount.

Er, this is pretty much how cxgbe(4) has always done it.  A cluster is
allocated with m_cljget, an mbuf is allocated with MT_NOINIT to avoid
touching any line in the mbuf.  m_init is called later in the rx ithread.

> 
> Another nice trick would be to shorten the mbuf by 4 bytes (in ext_size)
> and put
> the refcount there.
> 
> Lets work on these together.

Yes.  I think we agree on the long term optimization, or are at least
headed in the same general direction.  But removing M_NOFREE removed a
dirt-cheap way to test and exercise the approach without implementing
the "real thing."  Why not just let it be?

Regards,
Navdeep