irq cpu binding

Sun Mar 29 05:46:55 UTC 2015

On 28 March 2015 at 17:33, Slawa Olhovchenkov <slw at zxy.spb.ru> wrote:
> On Sat, Mar 28, 2015 at 04:58:53PM -0700, Adrian Chadd wrote:
>
>> Hi,
>>
>> * It turns out that fragments were being 100% handled out of order
>> (compared to non-fragments in the same stream) when doing fragment
>> reassembly, because the current system was assuming direct dispatch
>> netisr and not checking any packet contents for whether they're on the
>> wrong CPU. I checked. It's not noticeable unless you go digging, but
>> it's absolutely happening. That's why I spun a lot of cycles looking
>> at the IP fragment reassembly path and which methods get called on the
>> frames as they're reinjected.
>
> For a fragmented packet, the first fragment (which may not arrive
> first) contains the L4 information and is dispatched to the correct
> bucket; the other fragments don't contain this information and are
> dispatched anywhere. As I understand it, the IP stack gathers the
> whole packet before processing. All we need is to do the processing
> on the CPU where the first fragment arrived.

I'm pretty sure that wasn't what was happening when I went digging. I
was using UDP and varying the transmit size so I had exact control
over the fragmentation.

The driver rx path does direct dispatch netisr processing, and
fragments were hashed on L3 details only, not L4. Even the first
frame is hashed on L3 only. So it'd go to a different queue compared
to L4 hashing, and subsequent fragments would come in on the same
queue. Once reassembly completed, the packet was processed up the
stack inline - it wasn't going back into netisr and getting re-checked
for the right queue.
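
The effect described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: zlib.crc32
stands in for the NIC's real RSS hash, and NUM_BUCKETS, bucket_for,
classify and mismatched_flows are hypothetical names, not anything
from the FreeBSD tree. It shows that hashing fragments on L3 fields
only, while hashing unfragmented packets on L3 + L4, can put two
datagrams of the same UDP flow in different buckets.

```python
import socket
import struct
import zlib

NUM_BUCKETS = 8  # hypothetical bucket count


def bucket_for(src_ip, dst_ip, src_port=None, dst_port=None):
    """Pick a bucket from L3 fields, plus L4 ports when they are given.

    zlib.crc32 is only a stand-in for the hardware hash.
    """
    key = socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
    if src_port is not None and dst_port is not None:
        key += struct.pack("!HH", src_port, dst_port)
    return zlib.crc32(key) % NUM_BUCKETS


def classify(pkt):
    """Fragments are hashed on L3 only; whole packets on L3 + L4."""
    if pkt["is_fragment"]:
        return bucket_for(pkt["src"], pkt["dst"])
    return bucket_for(pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"])


def mismatched_flows(num_flows=256):
    """Count UDP flows whose fragmented and unfragmented datagrams
    land in different buckets."""
    mismatches = 0
    for i in range(num_flows):
        flow = {"src": "10.0.0.1", "dst": "10.0.0.2",
                "sport": 10000 + i, "dport": 4000}
        whole = dict(flow, is_fragment=False)
        frag = dict(flow, is_fragment=True)
        if classify(whole) != classify(frag):
            mismatches += 1
    return mismatches
```

Every fragment between the same two hosts lands in one bucket
regardless of ports, so any flow whose 4-tuple hash picks another
bucket is counted as a mismatch by mismatched_flows().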

>> * We're going to have to modify drivers, because the way drivers
>> currently assign interrupts, pick CPUs for queues, auto-select how
>> many queues to use, etc. is all completely ad hoc and not consistent. So
>
> Yes. I don't see a problem (except re-binding IRQs by cpuset).
> All interesting drivers provide a tunable to control how many queues
> to use. I don't know how to automate this:
>
> - one 1-port card
> - one 2-port card
> - one port of 2-port card
> - two 1-port cards
> - two different cards
> ....
>
> Manual selection is acceptable here.
>
>> yeah, we're going to change the drivers and they're going to be
>> consistent and configurable. That way you can choose how you want to
>> distribute work and pin or not pin things - and it's not done ad hoc
>> differently in each driver. Even igb, ixgbe and cxgbe differ in how
>> they implement these three things.
>>
>> * For RSS, there'll be a consistent configuration for what the
>> hardware is doing with hashing, rather than it being driver dependent.
>> Again, otherwise you may end up with some NICs doing 2-tuple hashing
>> where others are doing 4-tuple hashing, and behaviour changes
>> dramatically based on what NIC you're using.
>
> What's the problem there?
> I'm not interested in how the NIC does hashing (anyway, hashing for
> direct and reverse traffic is different -- this is not Tilera).
> All I need is distributing flows to CPUs, to balance load and reduce
> lock contention.

Right, but you assume all packets in a flow go to the same CPU, and I
discovered this wasn't the case.
That's why I went down the path with RSS to make it right.
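
One way to check the assumption that every packet of a flow reaches
the same CPU is to record which CPU handled each packet and look for
flows seen on more than one. A minimal sketch, assuming such
(flow, CPU) observations are available from some tracing mechanism;
the function name and the tuple layout are hypothetical:

```python
from collections import defaultdict


def flows_split_across_cpus(observations):
    """Return {flow: sorted CPU list} for flows seen on more than one CPU.

    observations: iterable of (flow, cpu) pairs, where flow is any
    hashable flow identifier such as a 4-tuple.
    """
    cpus_by_flow = defaultdict(set)
    for flow, cpu in observations:
        cpus_by_flow[flow].add(cpu)
    return {flow: sorted(cpus)
            for flow, cpus in cpus_by_flow.items()
            if len(cpus) > 1}
```

An empty result means every observed flow stayed on a single CPU; any
entry names a flow whose packets were handled on several CPUs.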

>
>> * For applications - I'm not sure yet, but at the minimum the librss
>> API I have vaguely sketched out and coded up in a git branch lets you
>> pull out the list of buckets and which CPU each is on. I'm going to
>> extend that a bit more, but it should be enough for things like nginx
>> to say "ok, start up one nginx process per RSS bucket, and here's the
>> CPU set for it to bind to." You said it has worker groups - that's
>> great; I want that to be auto configured.
>
> For applications, the minimum is that (per socket) select/kqueue/accept
> work only on flows that arrived on the CPU the application is running
> on at the time of the select/kqueue/accept call (yes, for this to work
> correctly the application must be pinned to that CPU).
>
> And the application doesn't need to know anything about buckets, etc.
>
> After this, an arriving packet activates the IRQ handler, ithread,
> driver interrupt thread, TCP stack, select/accept, read, write,
> tcp_output -- all on the same CPU. I may be wrong, but this preserves
> the L2/L3 cache.
>
> Where did I misunderstand?

The other half of the network stack - the sending side - also needs to
be on either the same or a nearby CPU, or you still end up with lock
contention and cache thrashing.
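
To make the "same CPU for receive and send" idea concrete, here is a
small sketch of a per-bucket worker plan. It assumes a bucket-to-CPU
table has already been obtained somehow (for example from an API like
the librss one mentioned earlier in the thread); plan_workers and the
dictionary layout are hypothetical and do not reflect any real librss
interface.

```python
def plan_workers(bucket_to_cpu):
    """Build one worker entry per RSS bucket.

    Each worker is given a CPU set containing only the CPU that
    services its bucket, so that receive and send processing for a
    flow can stay on the same CPU.
    """
    plan = []
    for bucket in sorted(bucket_to_cpu):
        cpu = bucket_to_cpu[bucket]
        plan.append({"worker": bucket, "bucket": bucket, "cpuset": {cpu}})
    return plan
```

An application could then start one process per entry and bind it to
the entry's cpuset with whatever affinity call the platform provides.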

-a