RFT: numa policy branch

Adrian Chadd adrian at freebsd.org
Tue Apr 28 23:32:30 UTC 2015


On 28 April 2015 at 14:39, Rui Paulo <rpaulo at me.com> wrote:
> On Apr 26, 2015, at 01:30 PM, Adrian Chadd <adrian at freebsd.org> wrote:
>
>> Hi!
>>
>> Another update:
>>
>> * updated to recent -HEAD;
>> * numactl now can set memory policy and cpuset domain information - so
>> it's easy to say "this runs in memory domain X and cpu domain Y" in
>> one pass with it;
>
>
> That works, but --mempolicy=first-touch should ignore the --memdomain
> argument (or print an error) if it's present.

Ok.
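(For anyone trying this out in the meantime, the "one pass" invocation above
looks roughly like:

    numactl --mempolicy=fixed-domain-rr --memdomain=1 --cpudomain=1 ./myapp

--mempolicy and --memdomain are as discussed above; --cpudomain is only
illustrative here, so check the numactl usage output in the branch for the
exact spelling of the cpuset-domain flag.)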

>> * the locality matrix is now available. Here's an example from scott's
>> 2x haswell v3, with cluster-on-die enabled:
>>
>> vm.phys_locality:
>> 0: 10 21 31 31
>> 1: 21 10 31 31
>> 2: 31 31 10 21
>> 3: 31 31 21 10
>>
>> And on the westmere-ex box, with no SLIT table:
>>
>> vm.phys_locality:
>> 0: -1 -1 -1 -1
>> 1: -1 -1 -1 -1
>> 2: -1 -1 -1 -1
>> 3: -1 -1 -1 -1
>>
>
> This worked for us on IvyBridge with a SLIT table.

Cool.
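Side note for anyone reading the matrix: the values are the usual ACPI SLIT
relative distances (10 is local, larger means further away), and -1 means no
SLIT table was found. If you want the matrix programmatically rather than via
sysctl(8), a minimal sketch - assuming the OID stays a plain printable string,
which is what the output above suggests:

#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>

int
main(void)
{
	char buf[1024];
	size_t len = sizeof(buf);

	/* vm.phys_locality is the per-domain distance matrix shown above. */
	if (sysctlbyname("vm.phys_locality", buf, &len, NULL, 0) == -1) {
		perror("sysctlbyname(vm.phys_locality)");
		return (1);
	}
	printf("%s\n", buf);
	return (0);
}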

>> * I've tested it on westmere-ex (4x socket), sandybridge, ivybridge,
>> haswell v3 and haswell v3 cluster on die.
>> * I've discovered that our implementation of libgomp (from gcc-4.2) is
>> very old and doesn't include some of the thread control environment
>> variables, grr.
>> * .. and that the gcc libgomp code doesn't at all have freebsd thread
>> affinity routines, so I added them to gcc-4.8.
>
>
> I used gcc 4.9
>
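The FreeBSD affinity glue itself is small - roughly the shape below, as a
sketch of the underlying cpuset_setaffinity(2) call rather than the literal
libgomp patch:

#include <sys/param.h>
#include <sys/cpuset.h>

#include <err.h>

/*
 * Pin the calling thread to a single CPU.  Sketch only: the libgomp glue
 * ends up mapping its affinity settings onto calls like this.
 */
static void
pin_current_thread(int cpu)
{
	cpuset_t mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);

	/* id == -1 with CPU_WHICH_TID means "the calling thread". */
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
	    sizeof(mask), &mask) != 0)
		err(1, "cpuset_setaffinity");
}

int
main(void)
{
	pin_current_thread(0);
	return (0);
}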
>> I'd appreciate any reviews / testing people are able to provide. I'm
>> about at the functionality point where I'd like to submit it for
>> formal review and try to land it in -HEAD.
>
> There's a bug in the default sysctl policy.  You're calling strcat on an
> uninitialised string, so it produces garbage output.  We also hit a
> panic when our application starts allocating many GBs of memory.  In this
> case, the memory is split between two sockets and I think it's crashing like
> you described on IRC.

I'll fix the former soon, thanks for pointing that out.
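For the record, the fix there is the usual one - make sure the buffer starts
out as an empty string before anything is appended to it. The names below are
made up, but the shape is:

	char buf[128];

	buf[0] = '\0';	/* strcat()/strlcat() need a NUL-terminated string to append to */
	strlcat(buf, "first-touch", sizeof(buf));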

As for the crash - yeah, I reproduced it and sent a patch to alc for
review. It's because vm_page_alloc() doesn't expect calls to vm_phys
to fail a second time around.

Trouble is, the VM thresholds are all global. Failing an allocation
in one domain does wake the pagedaemon for that domain, but no paging
actually occurs: the thresholds it checks are global, so it still
thinks there's plenty of free memory and decides it has nothing to do.
There's a pagedaemon per domain, but no per-domain thresholds or
per-domain paging targets. I don't think we're going to be able to fix
that this pass - I'd rather get this (or something like it) into the
kernel so at least first-touch-rr, fixed-domain-rr and rr work. Then,
yes, the VM will need some updating.
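Schematically, the shape of the problem is the below - not the real code,
and the field names are only meant to be suggestive:

	/*
	 * The per-domain pagedaemon computes its shortage from the global
	 * counters, so a single exhausted domain looks like "nothing to do".
	 */
	shortage = vm_cnt.v_free_target - vm_cnt.v_free_count;	/* global, not per-domain */
	if (shortage <= 0)
		return;	/* plenty free globally, so no paging in the starved domain */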



-adrian

