MQ Patch.
Navdeep Parhar
np at FreeBSD.org
Tue Oct 29 22:03:14 UTC 2013
On 10/29/13 14:25, Andre Oppermann wrote:
> On 29.10.2013 22:03, Navdeep Parhar wrote:
>> On 10/29/13 13:41, Andre Oppermann wrote:
>>> Let me jump in here and explain roughly the ideas/path I'm exploring
>>> in creating and eventually implementing a big picture for drivers,
>>> queues, queue management, various QoS and so on:
>>>
>>> Situation: We're still mostly based on the old 4.4BSD IFQ model, with
>>> a couple of work-arounds (sndring, drbr), and the bit-rotten ALTQ we
>>> have in tree isn't helpful at all.
>>>
>>> Steps:
>>>
>>> 1. take the soft-queuing method out of the ifnet layer and make it
>>> a property of the driver, so that the upper stack (or actually the
>>> protocol L3/L2 mapping/encapsulation layer) calls (*if_transmit)
>>> without any queuing at that point. It is then up to the driver
>>> to decide how it multiplexes multi-core access to its queue(s)
>>> and how they are configured.
>>
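To make step 1 concrete: the entry point Andre is talking about already
exists as the (*if_transmit) member of struct ifnet, taking just the ifnet
and the mbuf. A bare-bones sketch of the handoff, with "l2_handoff" as a
made-up name for wherever this ends up living:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

/*
 * The L3/L2 mapping layer hands the finished frame straight to the
 * driver; no IFQ_ENQUEUE or any other queuing above this call.  The
 * driver's (*if_transmit) owns all queuing decisions from here on.
 */
static int
l2_handoff(struct ifnet *ifp, struct mbuf *m)
{

	return ((ifp->if_transmit)(ifp, m));
}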
>> It would work out much better if the kernel was aware of the number of
>> tx queues of a multiq driver and explicitly selected one in if_transmit.
>> The driver has no information on the CPU affinity etc. of the
>> applications generating the traffic; the kernel does. In general, the
>> kernel has a much better "global view" of the system and some of the
>> stuff currently in the drivers really should move up into the stack.
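To illustrate what I meant: if the stack knew how many tx queues the
interface has, the queue selection could happen above the driver. This is
a hypothetical sketch only; neither the queue count nor the three-argument
transmit hook below exists in struct ifnet today.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>
#include <sys/pcpu.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

/*
 * Hypothetical glue, made up for this sketch: the tx queue count a
 * multiq driver would export, plus a transmit hook that takes an
 * explicit queue index chosen by the stack.
 */
struct if_txq_info {
	u_int	ntxq;
	int	(*transmit_q)(struct ifnet *, struct mbuf *, u_int);
};

/*
 * The stack has the global view (flowid, where the sender is running),
 * so it picks the queue instead of leaving that to the driver.
 */
static int
stack_select_and_transmit(struct ifnet *ifp, struct if_txq_info *txq,
    struct mbuf *m)
{
	u_int qidx;

	if (m->m_flags & M_FLOWID)
		qidx = m->m_pkthdr.flowid % txq->ntxq;
	else
		qidx = curcpu % txq->ntxq;

	return (txq->transmit_q(ifp, m, qidx));
}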
>
> I've been thinking a lot about this and have come to the preliminary
> conclusion that the upper stack should not tell the driver which queue
> to use. There are way too many possible approaches that, depending on
> the use case, perform better or worse. We also have a big problem with
> cores vs. queues mismatches either way (more cores than queues or more
> queues than cores, though the latter is much less of a problem).
>
> For now I see these primary multi-hardware-queue approaches to be
> implemented first:
>
> a) the driver's (*if_transmit) takes the flowid from the mbuf header and
> selects one of the N hardware DMA rings based on it [sketched below,
> after point c]. Each of the DMA rings is protected by a lock. Here the
> assumption is that, by having enough DMA rings, the contention on each
> of them will be relatively low, and ideally a flow and its ring sort of
> stick to a core that sends lots of packets into that flow. Of course it
> is a statistical certainty that some bouncing will be going on.
>
> b) the driver assigns the DMA rings to particular cores, which can then
> drive them lockless through a critnest++ [also sketched below]. The
> driver's (*if_transmit) will look up the core it got called on and push
> the traffic out on that DMA ring. The problem is the actual upper
> stack's affinity, which is not guaranteed. This has two consequences:
> there may be reordering of packets of the same flow because the
> protocol's send function happens to be called from a different core the
> second time. Or the driver's (*if_transmit) has to switch to the right
> core to complete the transmit for this flow if the upper stack
> migrated/bounced around. It is rather difficult to ensure full affinity
> from userspace down through the upper stack and then to the driver.
>
> c) non-multi-queue capable hardware uses a kernel-provided set of
> functions to manage the contention for the single resource of a DMA
> ring.
>
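For concreteness, (a) and (b) would look roughly like this inside a
driver's (*if_transmit). This is only a sketch: struct txring, the softc
layout, and ring_encap_and_doorbell() are made-up stand-ins, not any real
driver's code.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/pcpu.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

struct txring {
	struct mtx	 lock;		/* protects this ring in approach (a) */
	/* descriptor ring, doorbell register, stats, ... */
};

struct drv_softc {
	struct txring	*txr;		/* array of ntxr rings */
	u_int		 ntxr;
};

/* Stand-in for the real work: write tx descriptors, ring the doorbell. */
static int
ring_encap_and_doorbell(struct txring *txr, struct mbuf *m)
{

	(void)txr;
	m_freem(m);			/* placeholder; details omitted */
	return (0);
}

/*
 * Approach (a): hash the flowid onto one of N rings, one mutex per ring.
 * Contention stays low as long as the flows spread over enough rings.
 */
static int
drv_transmit_flowid(struct ifnet *ifp, struct mbuf *m)
{
	struct drv_softc *sc = ifp->if_softc;
	struct txring *txr;
	u_int idx;
	int error;

	if (m->m_flags & M_FLOWID)
		idx = m->m_pkthdr.flowid % sc->ntxr;
	else
		idx = curcpu % sc->ntxr;	/* no flowid; use the core */
	txr = &sc->txr[idx];

	mtx_lock(&txr->lock);
	error = ring_encap_and_doorbell(txr, m);
	mtx_unlock(&txr->lock);

	return (error);
}

And (b), with the same made-up structures: one ring per core, driven
without a mutex by entering a critical section (the critnest++ above),
which keeps the thread from being preempted off the core mid-transmit.
As you note, that does nothing about the upper stack having already
bounced between cores, which is where the reordering risk comes from.

/*
 * Approach (b): the ring belongs to the core we are running on, so no
 * lock is taken; critical_enter()/critical_exit() pin us to this core
 * for the duration of the transmit.
 */
static int
drv_transmit_percpu(struct ifnet *ifp, struct mbuf *m)
{
	struct drv_softc *sc = ifp->if_softc;
	struct txring *txr;
	int error;

	critical_enter();
	txr = &sc->txr[curcpu % sc->ntxr];
	error = ring_encap_and_doorbell(txr, m);
	critical_exit();

	return (error);
}

For (c), I'd guess the existing buf_ring/drbr_* helpers are the natural
starting point for the "kernel provided set of functions" that serialize
access to the single ring.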
> The point here is that the driver is the right place to make these
> decisions because the upper stack lacks (and shouldn't care about) the
> actual available hardware and its capabilities. All necessary
> information is available to the driver as well, through the appropriate
> mbuf header fields and the core it is called on.
>
I mildly disagree with most of this, specifically with the part that the
driver is the right place to make these decisions. But you did say this
was a "preliminary conclusion" so there's hope yet ;-)
Let's wait till you have an early implementation and we are all able to
experiment with it. To be continued...
Regards,
Navdeep