Best practice for accepting TCP connections on multicore?

Adrian Chadd adrian at freebsd.org
Sat Jun 7 20:45:05 UTC 2014


On 7 June 2014 16:37, Igor Mozolevsky <igor at hybrid-lab.co.uk> wrote:
>
>
>
> On 7 June 2014 21:18, Adrian Chadd <adrian at freebsd.org> wrote:
>>
>> > Not quite - the gist (and the point) of that slide with Rob's story
>> > was that by the time Rob wrote something that could comprehensively
>> > deal with states in an event-driven server, he ended up essentially
>> > re-inventing the wheel.
>>
>> I read the same slides you did. He didn't reinvent the wheel - threads
>> are a different concept - at any point the state can change and you
>> switch to a new thread. Event-driven, asynchronous programming isn't
>> quite like that.
>
>
> Not quite - unless you're dealing with stateless HTTP, you still need to
> know what the "current" state of the "current" connection is, which is
> the point of that slide.
>
>
>> > Paul Tyma's presentation posted earlier did conclude with various
>> > models for different types of daemons, which the OP might find at
>> > least interesting.
>>
>> Agreed, but again - it's all Java, it's all Linux, and it's 2008.
>
>
> Agreed, but threading models are platform-agnostic.
>
>
>> The current state is that threads and thread context switching are
>> more expensive than you'd like. You really want to (a) avoid locking
>> at all, (b) keep the CPU hot with cached data, and (c) keep it from
>> changing contexts.
>
>
> Agreed, but uncontended locking should be virtually cost-free (or close
> to it), modern CPUs have plenty of L2/L3 cache to keep enough data
> nearby, there are plenty of cores to keep cycling in the same
> thread-loop, and hyper-threading helps with context switching (or at
> least is supposed to). In any event, the cost of shuttling data between
> RAM and cache (especially with on-die memory controllers, even if data
> has to go through QPI/HyperTransport) and of changing contexts is tiny
> compared to that of disk and network I/O.
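
To make the per-connection-state point in the quoted exchange above
concrete, here's a minimal sketch of a kqueue(2)-based accept/read loop
that drags a small state struct around with each connection. The struct,
its fields, the port number and the buffer handling are invented purely
for illustration, and error handling is skipped:

/*
 * Minimal single-threaded sketch of an event-driven server on FreeBSD:
 * one kqueue, one listen socket, and a tiny per-connection state struct
 * handed back to us via the kevent udata pointer.  Illustrative only.
 */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct conn {                   /* per-connection state (made up) */
        int     fd;
        size_t  bytes_in;       /* how far along this connection is */
        int     phase;          /* e.g. reading request vs. writing reply */
};

int
main(void)
{
        struct sockaddr_in sin;
        struct kevent ev, events[64];
        int lfd, kq, i, n, one = 1;

        lfd = socket(AF_INET, SOCK_STREAM, 0);
        setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(8080);
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(lfd, (struct sockaddr *)&sin, sizeof(sin));
        listen(lfd, 128);

        kq = kqueue();
        EV_SET(&ev, lfd, EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq, &ev, 1, NULL, 0, NULL);

        for (;;) {
                n = kevent(kq, NULL, 0, events, 64, NULL);
                for (i = 0; i < n; i++) {
                        if ((int)events[i].ident == lfd) {
                                /* New connection: allocate its state, register it. */
                                struct conn *c = calloc(1, sizeof(*c));
                                c->fd = accept(lfd, NULL, NULL);
                                fcntl(c->fd, F_SETFL, O_NONBLOCK);
                                EV_SET(&ev, c->fd, EVFILT_READ, EV_ADD, 0, 0, c);
                                kevent(kq, &ev, 1, NULL, 0, NULL);
                        } else {
                                /* Existing connection: its state comes back in udata. */
                                struct conn *c = events[i].udata;
                                char buf[4096];
                                ssize_t r = read(c->fd, buf, sizeof(buf));
                                if (r <= 0) {
                                        close(c->fd);   /* close() also drops the kevent */
                                        free(c);
                                } else {
                                        c->bytes_in += r;
                                        /* ... advance c->phase, queue replies, etc. ... */
                                }
                        }
                }
        }
}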

I was doing 40Gbit/sec testing over 2^16 connections (and was hoping to
get the chance to optimise this stuff to reach 2^17 active streaming
connections, but I ran out of CPU). If you're not careful about keeping
work on a local CPU, you end up blowing your caches and hitting lock
contention pretty quickly.
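
For what it's worth, explicitly pinning a worker thread to one core on
FreeBSD is cheap to do with cpuset_setaffinity(2), which is one way to
keep a connection's data hot in that core's caches. A minimal sketch
(the cpu_id choice and the lack of error handling are placeholders):

/*
 * Pin the calling thread to a single CPU so the connections it services
 * keep their data in that CPU's caches.  Sketch only; pick cpu_id however
 * your worker layout dictates.
 */
#include <sys/param.h>
#include <sys/cpuset.h>
#include <err.h>

static void
pin_self_to_cpu(int cpu_id)
{
        cpuset_t mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu_id, &mask);
        /* CPU_WHICH_TID with id -1 means "the current thread". */
        if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
            sizeof(mask), &mask) != 0)
                err(1, "cpuset_setaffinity");
}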

And QPI isn't free. There's a cost to shuttling packet data and cache
lines back and forth, even for uncontended data. I'm not going to worry
about QPI and socket awareness for now - that's a bigger problem to
solve. I'll first worry about getting RSS working for a single-socket
setup and then convert a couple of drivers over to be RSS aware. After
that I'll worry about multi-socket awareness and knowing whether a NIC
is local to a given socket.
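
As a rough illustration of the accept-side shape (not the actual RSS
plumbing): one listener thread per CPU, each pinned, each with its own
SO_REUSEPORT listen socket and its own event loop. Whether the stack
actually spreads incoming connections across these per-CPU listeners by
RSS bucket is exactly what the in-kernel work above has to provide;
pin_self_to_cpu() and worker_loop() stand in for the earlier sketches.

/*
 * Per-CPU listener sketch: N worker threads, each pinned to its own core,
 * each binding its own listen socket on the same port with SO_REUSEPORT
 * and running its own accept/read loop.  NWORKERS, the port and the two
 * extern helpers are placeholders for illustration.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define NWORKERS 4                      /* one per core on the test box */

extern void pin_self_to_cpu(int);       /* from the sketch above */
extern void worker_loop(int lfd);       /* kqueue accept/read loop, as earlier */

static void *
worker(void *arg)
{
        int cpu = (int)(intptr_t)arg;
        int one = 1, lfd;
        struct sockaddr_in sin;

        pin_self_to_cpu(cpu);

        lfd = socket(AF_INET, SOCK_STREAM, 0);
        setsockopt(lfd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(8080);
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(lfd, (struct sockaddr *)&sin, sizeof(sin));
        listen(lfd, 128);

        worker_loop(lfd);               /* never returns */
        return (NULL);
}

int
main(void)
{
        pthread_t tds[NWORKERS];
        int i;

        for (i = 0; i < NWORKERS; i++)
                pthread_create(&tds[i], NULL, worker, (void *)(intptr_t)i);
        for (i = 0; i < NWORKERS; i++)
                pthread_join(tds[i], NULL);
        return (0);
}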

I'm hoping that with this work and the Verisign TCP locking changes,
we'll be able to handle 40Gbit of bulk data on single-socket Sandy
Bridge Xeon hardware and/or more than 100,000 TCP sessions a second with
plenty of CPU to spare. Then it's getting to 80Gbit on Ivy Bridge-class
single-socket hardware. I'm hoping we can aim much higher (a million or
more transactions a second) on current-generation hardware, but that
requires a bunch more locking work. And, well, whatever hardware I can
play with. All I have at home is a 4-core Ivy Bridge desktop box with
igb(4). :-P


-a

