Working on NUMA support
Andrew Bates
andrewbates09 at gmail.com
Tue Jul 29 21:51:20 UTC 2014
Hey Adrian,
Yes, there has been progress on this, although admittedly not as much as
we'd like at this point. Regarding what you're talking about: we have the
layout for CPU affinity/locality in place. I need to go through and clean
up a good half-dozen branches of code.
Myself a mere mortal standing on the shoulders of giants in a room of
titans, I have to merge my changes with Jeff's pertinent branch to get
this closer to usable.
From my experience and research, in terms of access/response time:
1. localized DMA < all remote (fully local DMA is fastest);
2. (localized DMA + spillover to remote) >= all remote (a mix is no
   faster than going all-remote, and can be slower).
As ugly as it may be, I think I said that right.
There have been a few changes since that original email, but yes, what
we're working to address is the userland <---> kernelspace interface.
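For a concrete picture of the userland side as it stands today, plain
CPU affinity already goes through cpuset(2); a minimal sketch of the
existing interface (the NUMA-aware KPI would be an extension of this):

    #include <sys/param.h>
    #include <sys/cpuset.h>

    #include <stdio.h>

    int
    main(void)
    {
            cpuset_t mask;

            /* Pin the current thread (id -1 == self) to CPU 0. */
            CPU_ZERO(&mask);
            CPU_SET(0, &mask);
            if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID,
                -1, sizeof(mask), &mask) != 0) {
                    perror("cpuset_setaffinity");
                    return (1);
            }
            printf("pinned to CPU 0\n");
            return (0);
    }
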
On Sat, Jul 26, 2014 at 1:11 PM, Adrian Chadd <adrian at freebsd.org> wrote:
> Hi all!
>
> Has there been any further progress on this?
>
> I've been working on making the receive side scaling support usable by
> mere mortals and I've reached a point where I'm going to need this
> awareness in the 10ge/40ge drivers for the hardware I have access to.
>
> I'm right now more interested in the kernel driver/allocator side of
> things, so:
>
> * when bringing up a NIC, figure out what are the "most local" CPUs to run
> on;
> * for each NIC queue, figure out what the "most local" bus resources
> are for NIC resources like descriptors and packet memory (eg mbufs);
> * for each NIC queue, figure out what the "most local" resources are
> for local driver structures that the NIC doesn't touch (eg per-queue
> state);
> * for each RSS bucket, figure out what the "most local" resources are
> for things like packet memory (mbufs), tcp/udp/inp control structures,
> etc. (a sketch of the discovery side follows below).
>
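> To make the discovery side concrete, here's a sketch with an entirely
> hypothetical KPI name -- bus_get_domain() doesn't exist today, it just
> stands in for "ask the bus which NUMA domain this device hangs off":
>
>     #include <sys/param.h>
>     #include <sys/bus.h>
>     #include <sys/kernel.h>
>
>     struct mynic_softc {
>             device_t        sc_dev;
>             int             sc_domain;      /* NUMA domain of the NIC */
>     };
>
>     static int
>     mynic_attach(device_t dev)
>     {
>             struct mynic_softc *sc = device_get_softc(dev);
>
>             sc->sc_dev = dev;
>             /* Hypothetical: which domain is this device wired into? */
>             if (bus_get_domain(dev, &sc->sc_domain) != 0)
>                     sc->sc_domain = 0;      /* unknown; fall back */
>             device_printf(dev, "local NUMA domain %d\n", sc->sc_domain);
>             return (0);
>     }
>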
> I had a chat with jhb yesterday and he reminded me that y'all at
> Isilon have been looking into this.
>
> He described a few interesting cases from the kernel side to me.
>
> * On architectures with external IO controllers, the path cost from an
> IO device to multiple CPUs may be (almost) equivalent, so there's not
> a huge penalty to allocate things on the wrong CPU. I think it'll be
> nice to get CPU local affinity where possible so we can parallelise
> DRAM access fully, but we can play with this and see.
> * On architectures with CPU-integrated IO controllers, there's a large
> penalty for doing inter-CPU IO, but not such a huge penalty for doing
> inter-CPU memory access.
>
> Given that, we may find that we should always put the IO resources
> local to the CPU the device is attached to, even if we decide to run
> some / all of the IO for the device on another CPU. I.e., any RAM that
> the IO device is doing data or descriptor DMA into should be local to
> that device. John said that in his experience the penalty for a
> non-local CPU touching memory was much less than device DMA crossing
> QPI; a sketch of what that allocation policy could look like follows.
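>
> (malloc_domain() below is a made-up allocator KPI, shown only to make
> the policy concrete; nothing like it exists in the tree yet.)
>
>     #define MYNIC_RING_BYTES   (512 * 16)   /* 512 16-byte descriptors */
>
>     struct mynic_ring {
>             void    *desc;          /* DMA-visible descriptor ring */
>     };
>
>     /*
>      * Keep the descriptor ring in the device's own domain, since
>      * device DMA crossing QPI hurts more than a remote CPU
>      * touching memory does.
>      */
>     static int
>     mynic_alloc_ring(struct mynic_softc *sc, struct mynic_ring *r)
>     {
>             r->desc = malloc_domain(MYNIC_RING_BYTES, M_DEVBUF,
>                 sc->sc_domain, M_NOWAIT | M_ZERO);
>             if (r->desc == NULL)
>                     return (ENOMEM);
>             return (0);
>     }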
>
> So the tricky bit is figuring that out and expressing it all in a way
> that allows us to do memory allocation and CPU binding in a more aware
> way. The other half of this tricky thing is to allow it to be easily
> overridden by a curious developer or system administrator who wants
> to experiment with different policies (a sysctl knob, say; sketch
> below).
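>
> The existing per-device sysctl plumbing seems like a natural fit for
> that knob; hung off the hypothetical driver above, something like
> dev.mynic.0.numa_domain could then be inspected and overridden:
>
>     /* In mynic_attach(), after picking a default domain
>      * (requires <sys/sysctl.h>). */
>     SYSCTL_ADD_INT(device_get_sysctl_ctx(dev),
>         SYSCTL_CHILDREN(device_get_sysctl_tree(dev)), OID_AUTO,
>         "numa_domain", CTLFLAG_RW, &sc->sc_domain, 0,
>         "NUMA domain used for this NIC's allocations");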
>
> Now, I'm very specifically only addressing the low-level kernel IO /
> memory allocation requirements here. There are other things to worry
> about up in userland; I think you're trying to address those in your
> KPI descriptions.
>
> Thoughts?
>
>
> -a
>
--
V/Respectfully,
Andrew M Bates