status of projects/numa?

Gavin Mu gavin.mu at gmail.com
Thu Dec 25 00:28:25 UTC 2014


Hi, Adrian,

Thanks for such detailed information. I think now I can and will:
1. start from projects/numa and have a summary first.
2. post the summary and thoughts out for discussion.
3. merge latest HEAD code.
4. implement the idea and post the patch for review.

Regards,
Gavin Mu

> On Dec 24, 2014, at 11:09, Adrian Chadd <adrian at freebsd.org> wrote:
> 
> Ok, so to summarise the various bits/pieces I've heard from around the
> grapevine and from speaking with NUMA people like jhb:
> 
> The projects/numa branch has a lot of good stuff in it. Jeff/Isilon's
> idea seems to be to create an allocator policy framework that can be
> specified globally, per-process and/or per-thread. That's pretty good
> stuff right there.
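> 
> To make that concrete, here's the rough shape I have in my head (a
> sketch only -- the names below are invented for illustration and are
> not the actual projects/numa API):
> 
> 	#include <stddef.h>
> 
> 	/* Illustrative allocation policies. */
> 	enum numa_alloc_policy {
> 		NUMA_POLICY_NONE,	 /* no preference; current behaviour */
> 		NUMA_POLICY_FIRST_TOUCH, /* allocate from the requesting CPU's domain */
> 		NUMA_POLICY_ROUND_ROBIN, /* spread allocations across domains */
> 		NUMA_POLICY_FIXED	 /* always use np_domain */
> 	};
> 
> 	struct numa_policy {
> 		enum numa_alloc_policy	np_policy;
> 		int			np_domain; /* only for NUMA_POLICY_FIXED */
> 	};
> 
> 	/*
> 	 * Resolution order: a thread policy overrides the process
> 	 * policy, which overrides the system-wide default.
> 	 */
> 	static inline const struct numa_policy *
> 	numa_policy_resolve(const struct numa_policy *td_pol,
> 	    const struct numa_policy *proc_pol, const struct numa_policy *sys_pol)
> 	{
> 		if (td_pol != NULL && td_pol->np_policy != NUMA_POLICY_NONE)
> 			return (td_pol);
> 		if (proc_pol != NULL && proc_pol->np_policy != NUMA_POLICY_NONE)
> 			return (proc_pol);
> 		return (sys_pol);
> 	}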
> 
> What's missing, however, before it can be used by drivers is the rest
> of the UMA stuff, and I think there's some VM pagetable work that
> needs doing. I haven't dug into the latter because I was trusting that
> those working on projects/numa would get to it like they've said, but
> .. well, it's still not done.
> 
> From a UMA API perspective, it isn't trying very hard to return memory
> to the domain that it was allocated from. I'm worried that on a
> real NUMA box with no policy configured or used by userland threads
> (ie, everything is being scheduled everywhere), we'll end up with
> threads allocating from a CPU-local pool, being migrated to another
> CPU, then freeing the memory there. So over time the per-CPU caches
> will end up polluted with non-local pages. I talked to jhb about it
> but I don't think we have a consensus about it at all. I think we
> should go the extra mile of ensuring that when we return pages to UMA
> they go back into the right NUMA domain - I don't want to have to
> debug systems that start really fast and then end up getting slower
> over time as memory ends up local to the wrong CPU. But that's just me.
> 
> From the physmem perspective, I'll likely implement an SRAT-aware
> allocator option that allocates local-first, then tries progressively
> less-local domains until it just round-robins. We have the cost
> matrix that says how expensive accesses are from a given NUMA domain,
> so pre-computing that ordering shouldn't be too difficult.
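> 
> Roughly, I'd pre-compute a per-domain search order from that matrix
> once at boot, something like this (illustrative only; the real cost
> table is the ACPI SLIT, and the names here are made up):
> 
> 	#define	NUMA_MAXDOM	8	/* illustrative cap on domains */
> 
> 	/*
> 	 * slit[i][j] is the cost of domain i accessing memory in domain
> 	 * j (10 == local in ACPI's encoding).  For each domain, build
> 	 * the list of domains sorted cheapest-first, so the allocator
> 	 * can walk it local-first before falling back to round-robin.
> 	 */
> 	static void
> 	numa_build_search_order(int slit[NUMA_MAXDOM][NUMA_MAXDOM],
> 	    int ndom, int order[NUMA_MAXDOM][NUMA_MAXDOM])
> 	{
> 		int i, j, k, tmp;
> 
> 		for (i = 0; i < ndom; i++) {
> 			for (j = 0; j < ndom; j++)
> 				order[i][j] = j;
> 			/* Insertion sort by cost as seen from domain i. */
> 			for (j = 1; j < ndom; j++) {
> 				for (k = j; k > 0 && slit[i][order[i][k]] <
> 				    slit[i][order[i][k - 1]]; k--) {
> 					tmp = order[i][k];
> 					order[i][k] = order[i][k - 1];
> 					order[i][k - 1] = tmp;
> 				}
> 			}
> 		}
> 	}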
> 
> From a driver perspective, I've added the basic "which domain am I in"
> call.  John's been playing with a couple of drivers (igb/ixgbe I
> think) in his local repository, teaching them which interrupts are
> local and which local CPU set to start threads on. Ideally all the
> drivers would use the same API for querying what their local cpuset is
> and assigning worker threads / interrupts appropriately. I should poke
> him again to find the status of that work and at least get that into
> -HEAD so it can be evaluated and used by people.
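> 
> The usage I have in mind from a driver looks roughly like this (a
> sketch; "bus_get_domain" is just my name for the domain query above,
> and numa_domain_cpuset() is a made-up helper that returns the CPUs
> belonging to a domain):
> 
> 	#include <sys/param.h>
> 	#include <sys/bus.h>
> 	#include <sys/cpuset.h>
> 
> 	/* Hypothetical helper: fill in the CPUs local to 'domain'. */
> 	void	numa_domain_cpuset(int domain, cpuset_t *set);
> 
> 	static void
> 	foo_setup_queues(device_t dev)
> 	{
> 		cpuset_t cpus;
> 		int domain;
> 
> 		/* Ask the bus which NUMA domain this device lives in. */
> 		if (bus_get_domain(dev, &domain) != 0)
> 			domain = 0;	/* no NUMA information, fall back */
> 
> 		numa_domain_cpuset(domain, &cpus);
> 
> 		/* ... bind interrupts and worker threads to 'cpus' ... */
> 	}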
> 
> What I'm hoping to do with that in the short term is to make it
> generic enough that a consistent set of hints can be configured for a
> given driver to set up its worker thread and cpuset map. Ie, instead
> of each driver having its own "how many queues" sysctl and probe
> logic, it'll have some API to say "give me my number of threads and
> local cpuset(s)", so if someone wants to override it, it's done by
> code in the bus layer rather than by per-driver hacks. We can also add
> options for things like "pin threads" or not.
> 
> From the driver /allocation/ perspective, there are a few things to think about:
> 
> * how we allocate busdma memory for things like descriptors;
> * how we allocate DMA memory for things like mbufs, bufs, etc - what
> we're dma'ing into and out of.
> 
> None of that is currently specified, although there are a couple of
> tidbits in the projects/numa branch that haven't been fleshed out.
> 
> In my vague drawing-on-paper sense of this, I think we should also
> extend busdma a little to be aware of a numaset when allocating
> memory for descriptors. Same with calling malloc and contigmalloc.
> Ideally we'd keep descriptor memory local to the device.
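> 
> For example, a driver's descriptor-ring allocation might eventually
> grow an extra step like this (purely hypothetical: neither
> bus_dma_tag_set_domain() nor a domain-aware contigmalloc exists yet,
> "bus_get_domain" is again just my name for the domain query, and the
> sizes and softc are placeholders):
> 
> 	#include <sys/param.h>
> 	#include <sys/bus.h>
> 	#include <machine/bus.h>
> 
> 	#define	FOO_DESC_ALIGN		4096		/* placeholder */
> 	#define	FOO_DESC_RING_SIZE	(1024 * 16)	/* placeholder */
> 
> 	struct foo_softc {
> 		bus_dma_tag_t	desc_tag;
> 	};
> 
> 	static int
> 	foo_alloc_desc_ring(device_t dev, struct foo_softc *sc)
> 	{
> 		int domain, error;
> 
> 		error = bus_dma_tag_create(bus_get_dma_tag(dev),
> 		    FOO_DESC_ALIGN, 0,		/* alignment, boundary */
> 		    BUS_SPACE_MAXADDR,		/* lowaddr */
> 		    BUS_SPACE_MAXADDR,		/* highaddr */
> 		    NULL, NULL,			/* filter, filterarg */
> 		    FOO_DESC_RING_SIZE,		/* maxsize */
> 		    1,				/* nsegments */
> 		    FOO_DESC_RING_SIZE,		/* maxsegsize */
> 		    0, NULL, NULL,		/* flags, lockfunc, lockarg */
> 		    &sc->desc_tag);
> 		if (error != 0)
> 			return (error);
> 
> 		/*
> 		 * Hypothetical: steer the descriptor memory busdma hands
> 		 * back to the NUMA domain the device is attached to.
> 		 */
> 		if (bus_get_domain(dev, &domain) == 0)
> 			error = bus_dma_tag_set_domain(sc->desc_tag, domain);
> 		return (error);
> 	}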
> 
> Now for things like bufs/mbufs/etc - the DMA bits - that's where it
> gets a little tricky. Sometimes we're handed memory to send from/to.
> The NIC RX path allocates memory itself. Storage controllers get
> handed memory to read storage data into. So, the higher-layer question
> of "how do we tell something where to allocate memory from" gets a
> little complicated, because (a) we may have to allocate memory for a
> non-local device, knowing we're going to hand it to a device hanging
> off another CPU, and (b) we then have to return it to the right
> NUMA-aware pool (UMA, the vm_page stuff, etc.)
> 
> The other tricksy bit is when drivers want to allocate memory for
> local ephemeral work (eg M_TEMP) versus DMA/descriptor/busdma stuff.
> Here's about where I say "don't worry about this - do all the above
> bits first then worry about worrying about this":
> 
> say we have a NIC driver on an 8-core CPU, and it's in a 4-socket box
> (so 4 sockets * 8 cores per socket = 32 cores.) The receive threads
> can be told to run in a specific cpuset local to the CPU the NIC is
> plugged into. But the transmit threads, taskqueue bits, etc. may run
> on any CPU - remember, we don't queue traffic and wake up a thread
> now; we call if_transmit() from whichever CPU is running the transmit
> code. So if that CPU is local then the memory will come from the local
> NUMA domain and things are fine. But if it's transmitting from some
> /remote/ thread:
> 
> * the mbuf that's been allocated to send data is likely allocated from
> the wrong NUMA domain;
> * the local memory used by the driver to do work should be coming from
> memory local to that CPU, but it may call malloc/etc to get temporary
> working memory - and that hopefully is also coming from a local memory
> domain;
> * if it has to allocate descriptor memory or something then it may
> need to allocate memory from the remote memory domain;
> * .. then touch the NIC hardware, which is remote.
> 
> This is where I say "screw it, don't get bogged down in the details."
> It gets hairier when we think about what's local versus remote for
> vm_page entries, because hey, who knows what the right thing to do
> there is. But I think the right thing to do here is to not worry about
> it and get the above stuff done. NUMA-aware network applications are
> likely going to have their threads cpuset-pinned to the same domain
> as the NIC, and if that isn't good enough, we'll have to come up with
> some way to mark a socket as local to a NUMA domain so things like
> mbufs, TCP timers, etc. get scheduled appropriately.
> 
> (This is where I say "And this is where the whole RSS framework and
> NUMA need to know about each other", but I'm not really ready for that
> yet.)
> 
> So, that's my braindump of what I remember from discussing with others.
> 
> I think the stages to go here are:
> 
> * get a firmer idea of what's missing from projects/numa for UMA,
> physmem and VM allocator / pagedaemon/etc stuff;
> * get jhb's numaset code into -HEAD;
> * get jhb's driver NUMA awareness bits out, add more generic hooks for
> expressing this stuff via hints, and dump that into -HEAD;
> * evaluate what we're going to do about mbufs, malloc/contigmalloc for
> busdma, bounce buffers, descriptors, etc and get that into -HEAD;
> * make sure the intel-pcm tools (and then hwpmc!) work for the uncore
> CPU interconnect counters so we can measure how well we are/aren't
> doing things;
> * come back to the drawing board once the above is done and we've got
> some experience with it.
> 
> I haven't even started talking about what's needed for /userland/
> memory pages. I think that's part of what needs to be fleshed out /
> discussed with the VM bits.
> 
> 
> 
> -adrian

