Flow ID, LACP, and igb

Andre Oppermann andre at freebsd.org
Tue Aug 27 07:28:03 UTC 2013


On 27.08.2013 01:30, Adrian Chadd wrote:
> ... is there any reason we wouldn't want to have the TX and RX for a given flow mapped to the same core?

They are.  The thing is that the inbound and outbound packet flow IDs are
totally independent of each other.  The inbound one determines the RX ring
the packet takes to go up the stack.  If that ring is bound to a core, that's
fine and gives affinity.  If the socket and the user-space application are
bound to the same core as well, there is full affinity.
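
To make that concrete, here is a minimal sketch of the RX side.  The field
and flag names (m_pkthdr.flowid, M_FLOWID) follow sys/mbuf.h; the descriptor
layout and function names are made up for illustration.  The driver stamps
the RSS hash from the hardware descriptor into the mbuf, so the flow ID rides
up the stack with the packet on that ring's CPU:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>

struct example_rx_desc {                /* illustrative only, not real hardware */
        uint32_t        rss_hash;       /* RSS hash computed by the NIC */
};

static void
example_rx_input(struct ifnet *ifp, struct mbuf *m, struct example_rx_desc *rxd)
{
        /* Stamp the inbound flow ID; the stack just carries it along. */
        m->m_pkthdr.flowid = rxd->rss_hash;
        m->m_flags |= M_FLOWID;

        /* Hand the packet up the stack on this RX ring's CPU. */
        (*ifp->if_input)(ifp, m);
}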

Now on the way down, the core doing the write to the socket is the one that
enters the kernel.  Processing stays on that core until the packet is
generated (in tcp_output, for example).  The flow ID of the packet doesn't
matter at all up to that point because it is only filled in then.  The packet
then goes down the stack, and the flow ID is used only at the very end to
decide on an outbound TX queue.  This outbound TX ring doesn't have to be the
same one the flow came in on, as long as it stays the same for that flow to
prevent reordering.
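
As a rough sketch of that last step, this is the kind of queue-selection
logic an igb-style multiqueue driver has in its if_transmit path (simplified;
the softc layout, field names, and the per-ring enqueue function are
placeholders, not the real driver's): if the mbuf carries a valid flow ID the
TX ring is derived from it, otherwise the current CPU is used.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/pcpu.h>           /* curcpu */
#include <net/if.h>
#include <net/if_var.h>

struct example_tx_ring;
struct example_softc {
        u_int                   num_queues;
        struct example_tx_ring  *tx_rings;
};

static int      example_txr_enqueue(struct example_tx_ring *, struct mbuf *);

static int
example_mq_start(struct ifnet *ifp, struct mbuf *m)
{
        struct example_softc *sc = ifp->if_softc;
        u_int qidx;

        if (m->m_flags & M_FLOWID)      /* valid flow ID from the stack */
                qidx = m->m_pkthdr.flowid % sc->num_queues;
        else                            /* fall back to the current CPU */
                qidx = curcpu % sc->num_queues;

        return (example_txr_enqueue(&sc->tx_rings[qidx], m));
}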

This fixes Justin's issue with if_lagg and poor balancing.  He can simply
choose a good hash for the packets going out and stop worrying about it.
More importantly, he is no longer hostage to random switches with poor
hashing.
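
The idea behind such an outbound hash is simply to hash the packet's own
headers and pick the egress port from the result, which is roughly what
lagg's hashing code does.  Here is a heavily simplified, illustrative version
(function and parameter names are placeholders, not lagg's own):

#include <sys/param.h>
#include <sys/hash.h>           /* hash32_buf(), HASHINIT */

static u_int
example_lagg_port(uint32_t src_ip, uint32_t dst_ip,
    uint16_t sport, uint16_t dport, u_int nports)
{
        uint32_t key[3];

        /* Hash the L3/L4 tuple; real lagg also mixes in L2 when configured. */
        key[0] = src_ip;
        key[1] = dst_ip;
        key[2] = ((uint32_t)sport << 16) | dport;

        return (hash32_buf(key, sizeof(key), HASHINIT) % nports);
}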

Ultimately you could try to bind each TX ring to a particular CPU as well and
try to run it lockless.  That is fraught with some difficult problems, though.
First, you must have exactly as many RX/TX queues as cores.  That is often not
the case, since many cards support only a limited number of rings.  Then, for
packets generated locally (think of a DNS query over UDP), you either simply
stick to the locally CPU-assigned queue and send without looking at the
computed flow ID, or you have to switch cores to send the packet on the
correct queue.  Such a very strong core binding is typically only really
useful in embarrassingly parallel applications that do nothing but push
packets.  If your application is also compute-intensive, you may want some
more flexibility to schedule threads and prevent stalls from busy cores.  In
that case not binding TX to a core is a win.  So we will pretty much end up
with one lock per TX ring to protect the DMA descriptor structures.
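
In practice that last point looks roughly like the sketch below: each TX ring
carries its own mutex and the descriptor fill happens under it, so two cores
that pick the same ring serialize only on that ring (names are illustrative,
not a real driver's; the mutex is assumed to be set up with mtx_init() at
attach time):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>

struct example_tx_ring {
        struct mtx      txr_mtx;        /* protects the DMA descriptor ring */
        /* ... descriptor ring, head/tail indices, dmamaps ... */
};

static int
example_txr_enqueue(struct example_tx_ring *txr, struct mbuf *m)
{
        mtx_lock(&txr->txr_mtx);
        /*
         * Fill the DMA descriptors for the mbuf chain and bump the
         * hardware tail register here; elided in this sketch.
         */
        mtx_unlock(&txr->txr_mtx);
        return (0);
}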

We're still far away from having to worry about this TX issue.  The big win
is the RX queue - socket - application affinity (to the same core).
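
For completeness, the application end of that affinity chain can be pinned
from user space with cpuset(2); a minimal sketch that binds the calling
thread to a given CPU (lining it up with the RX queue's interrupt binding is
left to the administrator):

#include <sys/param.h>
#include <sys/cpuset.h>

/* Bind the calling thread to the given CPU. */
static int
pin_thread_to_cpu(int cpu)
{
        cpuset_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID,
            -1, sizeof(set), &set));
}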

-- 
Andre


