RFT: numa policy branch

Adrian Chadd adrian at freebsd.org
Wed Apr 22 02:42:54 UTC 2015


Hi!

I have a branch off of -HEAD that implements the bare minimum for
default, per-thread, per-process NUMA allocation policies and
associated syscalls / tool to manipulate it.

You can all thank Norse for providing me with kit to test this on
(including a Dell R910, which is a quad-socket 40-core, 80-thread
Westmere-EX box with ~1TB of RAM) and time to do the work, and Dell
for loaning me way too much hardware to make this happen.

It's not ready for formal commit review (hence this being an "RFT", a
request for testing), but it works well enough in my local test setup
that I think it's worth sharing.

What it does:

* adds VM domain policy and iterator types;
* the system default policy is "first-touch-round-robin", which is
"first-touch, and if fail, round-robin to other domains";
* there are per-proc and per-thread policy entries in struct proc /
struct thread - enough to play with, but certainly not in its final
form;
* two syscalls - numa_setaffinity() and numa_getaffinity();
* a very basic numactl program, complete with adrian-standard "MAN=".

This doesn't teach ULE or the proc/thread stuff anything about NUMA
/scheduling/. That's a whole different ballgame. It also has nothing
to do with kernel memory allocation - no UMA, no contigmalloc, no
driver affinity, etc. This is purely for controlling the initial page
allocation for processes - which for a lot of NUMA workloads is all
they need.

How to use:

* look at the NUMA kernel config file. You have to add memory domain
support or the domains won't be set up;
* sysctl vm.default_domain controls the default policy. "rr",
"first-touch-rr" and "first-touch" are supported here.
* numactl takes (--tid=tid or --pid=pid), --policy=policy,
--domain=domain, (--get or --set) and an optional command to run - much
like cpuset.

So, some examples:

Get the current policy for the given PID:

# ./numactl --pid=1 --get
  Policy: none; domain: -1

Run a job with a fixed-domain allocation from domain 1, but pinned to
CPU 0 (which on my system is in domain 0, so it's 100% remote memory
access):

$ cpuset -l 0 ./numactl --policy=fixed-domain --domain=1 ~/himenobmtxpa xl 0

Run the same job with round-robin allocation across all domains, again
pinned to CPU 0:

$ cpuset -l 0 ./numactl --policy=rr ~/himenobmtxpa xl 0

I'm using the 'pcm-numa.x' tool from the intel-pcm package to ensure
that memory accesses are correctly local/remote/round-robin as
appropriate.

I'd appreciate feedback and any improvements (yes, including a
manpage) that people have.

Thanks!



-adrian


More information about the freebsd-arch mailing list