irq cpu binding

Slawa Olhovchenkov slw at zxy.spb.ru
Sun Mar 29 08:19:04 UTC 2015


On Sat, Mar 28, 2015 at 10:46:54PM -0700, Adrian Chadd wrote:

> >> * It turns out that fragments were being 100% handled out of order
> >> (compared to non-fragments in the same stream) when doing fragment
> >> reassembly, because the current system was assuming direct dispatch
> >> netisr and not checking any packet contents for whether they're on the
> >> wrong CPU. I checked. It's not noticeable unless you go digging, but
> >> it's absolutely happening. That's why I spun a lot of cycles looking
> >> at the IP fragment reassembly path and which methods get called on the
> >> frames as they're reinjected.
> >
> > In the case of a fragmented packet the first fragment (which may not
> > arrive first) contains the L4 information and is dispatched to the
> > correct bucket, while the other fragments don't contain this
> > information and are dispatched anywhere. As I understand it, the IP
> > stack gathers the whole packet before processing. All we need is to
> > do the processing on the CPU where the first fragment arrived.
> 
> I'm pretty sure that wasn't what was happening when I went digging. I
> was using UDP and varying the transmit size so I had exact control
> over the fragmentation.
> 
> The driver rx path does direct dispatch netisr processing, and for
> fragments it was hashed on only L3 details not L4. Even the first
> frame is hashed on L3 only. So it'd go to a different queue compared
> to L4 hashing, and subsequent fragments would come in on the same
> queue. Once it was completed, it was processed up inline - it wasn't
> going back into netisr and getting re-checked for the right queue.

There are two options:
1) leave this behavior as is
2) rewrite it to reschedule fragments onto the right CPU.

I think 1) is acceptable -- fragmented packets are very rare compared
to the target data rate (2 Mpps and more).
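
To make the mismatch concrete, here is a simplified sketch; the hash
function below is only a placeholder for the real Toeplitz/RSS code,
and the point being illustrated is only which fields feed the hash:

    #include <stdint.h>

    /* Stand-in for the Toeplitz hash the NIC really uses. */
    static uint32_t
    toy_hash(const uint32_t *words, int n)
    {
            uint32_t h = 0;

            for (int i = 0; i < n; i++)
                    h = h * 31 + words[i];
            return (h);
    }

    /* Non-fragmented UDP/TCP: hashed over the full 4-tuple. */
    static uint32_t
    hash_l4(uint32_t src, uint32_t dst, uint16_t sport, uint16_t dport)
    {
            uint32_t w[3] = { src, dst, ((uint32_t)sport << 16) | dport };

            return (toy_hash(w, 3));
    }

    /* Any IP fragment, including the first one: hashed over L3 only. */
    static uint32_t
    hash_l3(uint32_t src, uint32_t dst)
    {
            uint32_t w[2] = { src, dst };

            return (toy_hash(w, 2));
    }

hash_l4() and hash_l3() give different values for the same flow, so the
fragments of a datagram land in a different RX queue/netisr bucket than
the unfragmented datagrams of that flow, and after reassembly the packet
is processed inline on that "wrong" CPU.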

> > What's the problem there?
> > I am not interested in how the NIC does the hashing (anyway, hashing
> > for direct and return traffic is different -- this is not Tilera).
> > All I need is to distribute flows across CPUs, to balance the load
> > and reduce lock contention.
> 
> Right, but you assume all packets in a flow go to the same CPU, and I
> discovered this wasn't the case.
> That's why I went down the path with RSS to make it right.

Only in the fragmented-packet case, or in other cases too?
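
For context, by distributing flows to CPUs I mean only the usual
hash -> bucket -> CPU indirection, roughly like this (the table size
and CPU assignment are arbitrary illustrative values, not the real
FreeBSD RSS configuration):

    #include <stdint.h>

    #define TOY_RSS_BUCKETS 8

    /* Illustrative only: e.g. 8 buckets spread over 4 CPUs. */
    static const int toy_bucket_to_cpu[TOY_RSS_BUCKETS] = {
            0, 1, 2, 3, 0, 1, 2, 3
    };

    static int
    toy_hash_to_cpu(uint32_t rss_hash)
    {
            int bucket = rss_hash & (TOY_RSS_BUCKETS - 1);

            return (toy_bucket_to_cpu[bucket]);
    }

If every packet of a flow produces the same hash, the whole flow stays
on one CPU; the fragmented case above is exactly where that assumption
breaks.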

> >
> >> * For applications - I'm not sure yet, but at the minimum the librss
> >> API I have vaguely sketched out and coded up in a git branch lets you
> >> pull out the list of buckets and which CPU it's on. I'm going to
> >> extend that a bit more, but it should be enough for things like nginx
> >> to say "ok, start up one nginx process per RSS bucket, and here's the
> >> CPU set for it to bind to." You said it has worker groups - that's
> >> great; I want that to be auto configured.
> >
> > For applications the minimum is that (per socket) select/kqueue/accept
> > only report flows that arrived on the CPU matching the CPU at the time
> > of the select/kqueue/accept call (yes, for correct operation the
> > application must be pinned to this CPU).
> >
> > And the application doesn't need to know anything about buckets etc.
> >
> > After this, an arriving packet activates the IRQ handler, ithread,
> > driver interrupt thread, TCP stack, select/accept, read, write,
> > tcp_output -- all on the same CPU. I may be wrong, but this should
> > preserve the L2/L3 cache.
> >
> > Where am I misunderstanding?
> 
> The other half of the network stack - the sending side - also needs to
> be either on the same or nearby CPU, or you still end up with lock
> contention and cache thrashing.

For incoming connections this will be automatic -- sending will happen
from the CPU bound to the receiving queue.

Outgoing connections are a more complex case, yes.
We would need to transfer the FD (with re-binding) and have the kernel
signal the preferred CPU to the application; the preferred CPU is the
one that handles the SYN-ACK. This needs assistance from the
application, but I currently can't think of an application that serves
a massive number of outgoing connections.
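
A rough sketch of the per-RSS-bucket worker model discussed above, for
the incoming-connection case: one process per bucket, pinned to that
bucket's CPU, running its own kqueue/accept loop. rss_bucket_cpu() is
a made-up placeholder here (the bucket-to-CPU mapping would come from
something like the librss API Adrian mentions), and the assumption that
bucket i is served by CPU i is arbitrary:

    #include <sys/param.h>
    #include <sys/cpuset.h>
    #include <sys/event.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <err.h>
    #include <unistd.h>

    /* Placeholder: real code would ask the kernel/librss for this. */
    static int
    rss_bucket_cpu(int bucket)
    {
            return (bucket);        /* assume bucket i -> CPU i */
    }

    static void
    worker(int bucket, int lsock)
    {
            cpuset_t mask;
            struct kevent kev;
            int kq;

            /* Pin this worker to the CPU serving its RSS bucket. */
            CPU_ZERO(&mask);
            CPU_SET(rss_bucket_cpu(bucket), &mask);
            if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
                sizeof(mask), &mask) != 0)
                    err(1, "cpuset_setaffinity");

            if ((kq = kqueue()) == -1)
                    err(1, "kqueue");
            EV_SET(&kev, lsock, EVFILT_READ, EV_ADD, 0, 0, NULL);
            if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
                    err(1, "kevent");

            for (;;) {
                    struct kevent ev;
                    int c;

                    if (kevent(kq, NULL, 0, &ev, 1, NULL) < 1)
                            continue;
                    c = accept(lsock, NULL, NULL);
                    if (c < 0)
                            continue;
                    /*
                     * accept/read/write/tcp_output now all run on the
                     * CPU that receives this flow's packets -- provided
                     * the stack really steers this bucket's flows here,
                     * which is the part under discussion.
                     */
                    close(c);
            }
    }

The parent would create one such worker per bucket and hand each its
own listening socket; how those sockets share the port and how accepted
connections get steered to the matching worker is exactly the part that
needs help from the kernel side.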

