status of projects/numa?

Adrian Chadd adrian at freebsd.org
Wed Dec 24 03:09:13 UTC 2014


Ok, so to summarise the various bits and pieces I've heard through the
grapevine and from speaking with NUMA people like jhb:

The projects/numa branch has a lot of good stuff in it. Jeff/Isilon's
idea seems to be to create an allocator policy framework that can be
specified globally, per-process and/or per-thread. That's pretty good
stuff right there.
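
To illustrate the layering I mean, here's a sketch - none of these
names are from the branch, they're purely mine:

/*
 * Sketch only: a policy enum plus a lookup that layers thread over
 * process over global.  td_numa_policy / p_numa_policy are invented
 * fields, not anything in projects/numa.
 */
enum numa_alloc_policy {
	NUMA_POLICY_NONE,		/* no preference (current behaviour) */
	NUMA_POLICY_FIRST_TOUCH,	/* allocate from the requesting CPU's domain */
	NUMA_POLICY_FIXED,		/* allocate from one specific domain */
	NUMA_POLICY_ROUND_ROBIN		/* stripe across all domains */
};

struct numa_policy {
	enum numa_alloc_policy	np_policy;
	int			np_domain;	/* only for NUMA_POLICY_FIXED */
};

/* Most specific wins: thread, then process, then the global default. */
static struct numa_policy *
numa_policy_lookup(struct thread *td)
{
	if (td->td_numa_policy != NULL)
		return (td->td_numa_policy);
	if (td->td_proc->p_numa_policy != NULL)
		return (td->td_proc->p_numa_policy);
	return (&numa_global_policy);
}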

What's missing, however, before drivers can usefully consume it is the
rest of the UMA work, and I think there's some VM pagetable work that
needs doing. I haven't dug into the latter because I was trusting that
those working on projects/numa would get to it like they said, but ..
well, it's still not done.

From a UMA API perspective, it isn't trying very hard to return memory
to the domain it was allocated from. I'm worried that on a real NUMA
box with no policy configured or used by userland threads (ie,
everything is being scheduled everywhere), we'll end up with threads
allocating from a CPU-local pool, being migrated to another CPU, then
freeing the memory there. So over time the per-CPU caches will end up
polluted with non-local pages. I talked to jhb about it, but I don't
think we have consensus on it at all. I think we should go the extra
mile of ensuring that when we return pages to UMA they go back into
the right NUMA domain - I don't want to have to debug systems that
start really fast and then get slower over time as stuff ends up on
the wrong CPU. But that's just me.
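
To make that concrete, here's the free path I'd like to see, as a
sketch only. The page lookup is a real primitive; the page-to-domain
query is assumed to exist in some vm_phys_domain()-style form, and
the per-domain bucket array is invented for illustration:

/*
 * Sketch: free a UMA item back to the domain its backing page came
 * from, rather than to the current CPU's bucket.  uz_domain_buckets
 * and bucket_push() are invented.
 */
static void
uma_zfree_domain_sketch(uma_zone_t zone, void *item)
{
	vm_page_t m;
	int domain;

	/* Find the physical page backing this item ... */
	m = PHYS_TO_VM_PAGE(pmap_kextract((vm_offset_t)item));
	/* ... and the NUMA domain that page lives in. */
	domain = vm_phys_domain(m);

	/*
	 * Push the item onto that domain's bucket, so a thread that
	 * migrated across domains doesn't pollute its new domain's
	 * cache with remote pages.
	 */
	bucket_push(zone->uz_domain_buckets[domain], item);
}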

From the physmem perspective, I'll likely implement an SRAT-aware
allocator option that allocates local-first, then tries progressively
more distant domains until it just round-robins. We have that cost
matrix that says how expensive accesses are from a given NUMA domain,
so the per-domain search order shouldn't be too difficult to
pre-compute.
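
The pre-computation really is cheap - something like this, with a
made-up 4-domain cost matrix standing in for what we'd read from
ACPI (this one compiles and runs as-is in userland):

#include <stdio.h>
#include <stdlib.h>

#define NDOMAINS 4

/* Example SRAT/SLIT-style matrix: cost[i][j] is the relative cost of
 * domain i touching memory in domain j (10 == local). */
static const int cost[NDOMAINS][NDOMAINS] = {
	{ 10, 16, 16, 22 },
	{ 16, 10, 22, 16 },
	{ 16, 22, 10, 16 },
	{ 22, 16, 16, 10 },
};

/* For each domain, an allocation search order: itself first, then the
 * remaining domains from cheapest to most expensive. */
static int search_order[NDOMAINS][NDOMAINS];

static int cmp_dom;	/* domain being sorted (qsort has no context arg) */

static int
cost_cmp(const void *a, const void *b)
{
	return (cost[cmp_dom][*(const int *)a] -
	    cost[cmp_dom][*(const int *)b]);
}

static void
precompute_search_order(void)
{
	for (int i = 0; i < NDOMAINS; i++) {
		for (int j = 0; j < NDOMAINS; j++)
			search_order[i][j] = j;
		cmp_dom = i;
		qsort(search_order[i], NDOMAINS, sizeof(int), cost_cmp);
	}
}

int
main(void)
{
	precompute_search_order();
	for (int i = 0; i < NDOMAINS; i++) {
		printf("domain %d tries:", i);
		for (int j = 0; j < NDOMAINS; j++)
			printf(" %d", search_order[i][j]);
		printf("\n");
	}
	return (0);
}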

From a driver perspective, I've added the basic "which domain am I in"
call. John's been playing with a couple of drivers (igb/ixgbe I think)
in his local repository, teaching them what their local interrupts are
and which local CPU set to start threads on. Ideally all drivers would
use the same API for querying their local cpuset and assigning worker
threads / interrupts appropriately. I should poke him again to find
out the status of that work and at least get it into -HEAD so people
can evaluate and use it.

What I'm hoping to do with that in the short term is make it generic
enough that a consistent set of hints can be configured for a given
driver to set up its worker thread and cpuset map. Ie, instead of each
driver having its own "how many queues" sysctl and probe logic, it'll
have some API to say "give me my number of threads and local
cpuset(s)", so if someone wants to override it, that's done by common
code in the bus layer rather than by per-driver hacks. We can also add
options for things like whether to pin threads.

From the driver /allocation/ perspective, there are a few things to
think about:

* how we allocate busdma memory for things like descriptors;
* how we allocate DMA memory for things like mbufs, bufs, etc - what
we're DMAing into and out of.

None of that is currently specified, although there are a couple of
tidbits in the projects/numa branch that haven't been fleshed out.

In my vague drawing-on-paper sense of this, I think we should also
extend busdma a little to be aware of a numaset when allocating memory
for descriptors. Same for malloc and contigmalloc. Ideally we'd keep
descriptor accesses in memory local to the device.
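
For example, one way to thread the domain through busdma would be a
tag-creation variant - bus_dma_tag_create() is real, but the _domain
variant and DESC_RING_SIZE here are invented for illustration:

	/* Create a descriptor-ring tag whose allocations come from
	 * the device's local domain. */
	error = bus_dma_tag_create_domain(
	    bus_get_dma_tag(dev),	/* parent tag */
	    PAGE_SIZE, 0,		/* alignment, boundary */
	    BUS_SPACE_MAXADDR,		/* lowaddr */
	    BUS_SPACE_MAXADDR,		/* highaddr */
	    NULL, NULL,			/* filter, filterarg */
	    DESC_RING_SIZE, 1,		/* maxsize, nsegments */
	    DESC_RING_SIZE,		/* maxsegsize */
	    0,				/* flags */
	    NULL, NULL,			/* lockfunc, lockarg */
	    sc->sc_domain,		/* new: NUMA domain to allocate from */
	    &sc->sc_desc_tag);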

Now for things like bufs/mbufs/etc - the DMA bits - that's where it
gets a little tricky. Sometimes we're handed memory to send from/to.
The NIC RX path allocates memory itself. Storage controllers get
handed memory to read storage data into. So the higher-layer question
of "how do we tell something where to allocate memory from" gets a
little complicated, because we (a) may have to allocate memory for a
non-local device, knowing we're going to hand it to some device in
another domain, and (b) then have to return it to the right NUMA-aware
pool (UMA, the vm_page stuff, etc.)

The other tricksy bit is when drivers want to allocate memory for
local ephemeral work (e.g. M_TEMP) versus DMA/descriptor/busdma stuff.
Here's about where I say "don't worry about this - do all the above
bits first, then worry about worrying about this":

Say we have a NIC driver on an 8-core CPU in a 4-socket box (so 4
sockets * 8 cores each = 32 cores). The receive threads can be told to
run in a specific cpuset local to the CPU the NIC is plugged into. But
the transmit threads, taskqueue bits, etc may run on any CPU -
remember, we don't queue traffic and wake up a thread any more; we
call if_transmit() from whichever CPU is running the transmit code. So
if that CPU is local, the memory comes from the local NUMA memory
domain and everything is fine. But if it's transmitting from some
/remote/ thread:

* the mbuf that's been allocated to hold the data being sent was
likely allocated from the wrong NUMA domain;
* the local memory the driver uses to do its work should come from
memory local to that CPU, but it may call malloc/etc for temporary
working memory - hopefully that also comes from the local memory
domain;
* if it has to allocate descriptor memory or something, it may need to
allocate from the remote memory domain;
* .. and then it touches the NIC hardware, which is remote.

This is where I say "screw it, don't get bogged down in the details."
It gets hairier when we think about what's local versus remote for
vm_page entries, because, hey, who knows what the right thing to do
there is. But I think the right call here is to not worry about it and
get the above stuff done. NUMA-aware network applications are likely
going to be cpuset to threads local to the same domain as the NIC,
and if that isn't good enough, we'll have to come up with some way to
mark a socket as local to a NUMA domain so things like mbufs, TCP
timers, etc get scheduled appropriately.
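
If we ever do that, I'd imagine it being as simple as a socket option
- SO_NUMA_DOMAIN below is purely hypothetical, nothing like it exists:

	/* Hypothetical: pin this socket's mbufs / timers to domain 1. */
	int domain = 1;
	setsockopt(s, SOL_SOCKET, SO_NUMA_DOMAIN, &domain, sizeof(domain));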

(This is where I say "And this is where the whole RSS framework and
NUMA need to know about each other", but I'm not really ready for that
yet.)

So, that's my braindump of what I remember from discussing with others.

I think the stages to go here are:

* get a firmer idea of what's missing from projects/numa for UMA,
physmem and VM allocator / pagedaemon/etc stuff;
* get jhb's numaset code into -HEAD;
* get jhb's driver NUMA awareness bits out, add more generic hooks for
expressing this stuff via hints, and dump that into -HEAD;
* evaluate what we're going to do about mbufs, malloc/contigmalloc for
busdma, bounce buffers, descriptors, etc and get that into -HEAD;
* make sure the intel-pcm tools (and then hwpmc!) can measure the
uncore CPU interconnect counters, so we can tell how well we are or
aren't doing things;
* come back to the drawing board once the above is done and we've got
some experience with it.

I haven't even started talking about what's needed for /userland/
memory pages. I think that's part of what needs to be fleshed out /
discussed with the VM bits.



-adrian

