irq cpu binding

Sun Mar 29 15:20:26 UTC 2015

On 29 March 2015 at 01:19, Slawa Olhovchenkov <slw at zxy.spb.ru> wrote:
> On Sat, Mar 28, 2015 at 10:46:54PM -0700, Adrian Chadd wrote:
>
>> >> * It turns out that fragments were being 100% handled out of order
>> >> (compared to non-fragments in the same stream) when doing fragment
>> >> reassembly, because the current system was assuming direct dispatch
>> >> netisr and not checking any packet contents for whether they're on the
>> >> wrong CPU. I checked. It's not noticeable unless you go digging, but
>> >> it's absolutely happening. That's why I spun a lot of cycles looking
>> >> at the IP fragment reassembly path and which methods get called on the
>> >> frames as they're reinjected.
>> >
>> > With a fragmented packet, the first fragment (which may not arrive
>> > first) contains the L4 information and is dispatched to the correct
>> > bucket; the other fragments don't contain this information and may be
>> > dispatched anywhere. As I understand it, the IP stack gathers all
>> > fragments before processing. All we need is to do the processing on
>> > the CPU where the first fragment arrived.
>>
>> I'm pretty sure that wasn't what was happening when I went digging. I
>> was using UDP and varying the transmit size so I had exact control
>> over the fragmentation.
>>
>> The driver rx path does direct dispatch netisr processing, and for
>> fragments it was hashed on only L3 details not L4. Even the first
>> frame is hashed on L3 only. So it'd go to a different queue compared
>> to L4 hashing, and subsequent fragments would come in on the same
>> queue. Once it was completed, it was processed up inline - it wasn't
>> going back into netisr and getting re-checked for the right queue.
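
To make the mismatch described above concrete, here is a minimal sketch in C.
Everything in it is illustrative: the struct, the function names and the
FNV-1a hash are assumptions standing in for whatever the NIC and stack
really compute (it is not Toeplitz and not the actual driver code). It only
shows that a fragment hashed on L3 alone can pick a different bucket than
the same flow hashed on L3+L4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key; field names are illustrative only. */
struct flow_key {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
};

#define FNV_SEED  2166136261u
#define FNV_PRIME 16777619u

/* FNV-1a over a byte range; a stand-in for the real RSS hash. */
static uint32_t fnv1a(uint32_t h, const void *data, size_t len)
{
    const unsigned char *p = data;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= FNV_PRIME;
    }
    return h;
}

/* Hash on addresses only: all a non-first fragment can offer. */
static uint32_t hash_l3(const struct flow_key *k)
{
    uint32_t h = fnv1a(FNV_SEED, &k->src_ip, sizeof(k->src_ip));

    return fnv1a(h, &k->dst_ip, sizeof(k->dst_ip));
}

/* Hash on addresses and ports: what an unfragmented packet gets. */
static uint32_t hash_l4(const struct flow_key *k)
{
    uint32_t h = hash_l3(k);

    h = fnv1a(h, &k->src_port, sizeof(k->src_port));
    return fnv1a(h, &k->dst_port, sizeof(k->dst_port));
}

/* Fragments are hashed on L3 only; everything else on L3+L4. */
static unsigned pick_bucket(const struct flow_key *k, int is_fragment,
    unsigned nbuckets)
{
    assert(nbuckets > 0);
    return (is_fragment ? hash_l3(k) : hash_l4(k)) % nbuckets;
}
```

Calling pick_bucket() with is_fragment set ignores the ports entirely, so
two flows between the same hosts always share a fragment bucket, while
their non-fragment packets are spread by the L4 hash.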
>
> Two options:
> 1) leave this behavior as it is
> 2) rewrite it to reschedule.
>
> I think 1) is acceptable: fragmented packets are very rare compared to
> the target data rate (2 Mpps and more).
>
>> > What's the problem there?
>> > I am not interested in how the NIC does hashing (anyway, hashing for
>> > forward and return traffic is different; this is not Tilera).
>> > All I need is to distribute flows across CPUs, to balance load and
>> > reduce lock contention.
>>
>> Right, but you assume all packets in a flow go to the same CPU, and I
>> discovered this wasn't the case.
>> That's why I went down the path with RSS to make it right.
>
> Only in the fragmented-packet case, or in other cases too?
>
>> >
>> >> * For applications - I'm not sure yet, but at the minimum the librss
>> >> API I have vaguely sketched out and coded up in a git branch lets you
>> >> pull out the list of buckets and which CPU it's on. I'm going to
>> >> extend that a bit more, but it should be enough for things like nginx
>> >> to say "ok, start up one nginx process per RSS bucket, and here's the
>> >> CPU set for it to bind to." You said it has worker groups - that's
>> >> great; I want that to be auto configured.
>> >
>> > For applications, the minimum is that (per socket) select/kqueue/accept
>> > only return flows that arrived on the same CPU the caller is running on
>> > at the time of the select/kqueue/accept call (yes, to work correctly the
>> > application must be pinned to this CPU).
>> >
>> > And the application doesn't need to know anything about buckets etc.
>> >
>> > After this, an arriving packet goes through the IRQ handler, ithread,
>> > driver interrupt thread, TCP stack, select/accept, read, write,
>> > tcp_output, all on the same CPU. I may be wrong, but this saves L2/L3
>> > cache.
>> >
>> > Where did I misunderstand?
>>
>> The other half of the network stack - the sending side - also needs to
>> be either on the same or nearby CPU, or you still end up with lock
>> contention and cache thrashing.
>
> For incoming connections this will be automatic: sending will be done
> from the CPU bound to the receiving queue.
>
> Outgoing connections are a more complex case, yes.
> We need to transfer the FD (with re-binding) and to signal (from kernel
> to application) the preferred CPU. The preferred CPU is the CPU that
> handles the SYN-ACK.
> And this needs assistance from the application. But I currently can't
> remember an application making outgoing connections at massive scale.

Or you realise you need to rewrite your userland application so it
doesn't have to do this, and instead uses an IOCP/libdispatch style IO
API to register for IO events and get IO completions to occur in any
given completion thread.

Then it doesn't have to care about moving descriptors around - it just
creates an outbound socket, and then the IO completion callbacks will
happen wherever they need to happen. If that needs to shuffle around
due to RSS rebalancing then it'll "just happen".
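
As a rough illustration of that shape (not IOCP or libdispatch themselves),
here is a minimal sketch in C. The names io_request, io_done_fn,
io_complete_one and count_bytes are made up for this example, and poll(2)
is only a single-threaded stand-in for a real completion backend. The point
is that the application hands over a descriptor plus a callback and never
does the read itself, so the runtime is free to decide where the callback
runs.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical completion callback: gets the bytes that were read. */
typedef void (*io_done_fn)(void *ctx, const char *buf, ssize_t len);

/* Hypothetical registration: "read from fd, then call done(ctx, ...)". */
struct io_request {
    int fd;
    io_done_fn done;
    void *ctx;
};

/*
 * Wait until the descriptor is readable, perform the read on behalf of
 * the application, and hand the result to its callback.
 * Returns 0 on success, -1 on timeout or error.
 */
static int io_complete_one(const struct io_request *req, int timeout_ms)
{
    struct pollfd pfd = { .fd = req->fd, .events = POLLIN };
    char buf[256];
    ssize_t n;

    assert(req->done != NULL);
    if (poll(&pfd, 1, timeout_ms) <= 0)
        return -1;
    n = read(req->fd, buf, sizeof(buf));
    if (n < 0)
        return -1;
    req->done(req->ctx, buf, n);
    return 0;
}

/* Example callback: accumulate the number of bytes completed. */
static void count_bytes(void *ctx, const char *buf, ssize_t len)
{
    (void)buf;
    *(size_t *)ctx += (size_t)len;
}
```

In a real design the loop in io_complete_one() would live in a pool of
completion threads, one per RSS bucket or CPU set, and the callback would
simply fire on whichever thread owns the flow at that moment.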

And yeah, I know of plenty of applications doing massive outbound
connections - anything being an intermediary HTTP proxy. :)

-adrian