[rfc] enumerating device / bus domain information

John Baldwin jhb at freebsd.org
Fri Oct 10 15:58:09 UTC 2014


On Thursday, October 09, 2014 09:53:52 PM Warner Losh wrote:
> On Oct 8, 2014, at 5:12 PM, Adrian Chadd <adrian at FreeBSD.org> wrote:
> > On 8 October 2014 12:07, Warner Losh <imp at bsdimp.com> wrote:
> >> On Oct 7, 2014, at 7:37 PM, Adrian Chadd <adrian at FreeBSD.org> wrote:
> >>> Hi,
> >>> 
> >>> Right now we're not enumerating any NUMA domain information about
> >>> devices.
> >>> 
> >>> The more recent intel NUMA stuff has some extra affinity information
> >>> for devices that (eventually) will allow us to bind kernel/user
> >>> threads and/or memory allocation to devices to keep access local.
> >>> There's a penalty for DMAing in/out of remote memory, so we'll want to
> >>> figure out what counts as "Local" for memory allocation and perhaps
> >>> constrain the CPU set that worker threads for a device run on.
> >>> 
> >>> This patch adds a few things:
> >>> 
> >>> * it adds a bus_if.m method for fetching the VM domain ID of a given
> >>> device; or ENOENT if it's not in a VM domain;
> >> 
> >> Maybe a default VM domain. All devices are in VM domains :) By default
> >> today, we have only one VM domain, and that’s the model that most of the
> >> code expects…
> > 
> > Right, and that doesn't change until you compile in with num domains > 1.
> 
> The first part of the statement doesn’t change when the number of domains
> is more than one. All devices are in a VM domain.
> 
> > Then, CPUs and memory have VM domains, but devices may or may not have
> > a VM domain. There's no "default" VM domain defined if num domains >
> > 1.
> 
> Please explain how a device cannot have a VM domain? For the
> terminology I'm familiar with, to even get cycles to the device, you have to
> have a memory address (or an I/O port). That memory address has to
> necessarily map to some domain, even if that domain is equally sucky to get
> to from all CPUs (as is the case with I/O ports). While there may not be a
> “default” domain, by virtue of its physical location it has to have one.
> 
> > The devices themselves don't know about VM domains right now, so
> > there's nothing constraining things like IRQ routing, CPU set, memory
> > allocation, etc. The isilon team is working on extending the cpuset
> > and allocators to "know" about numa and I'm sure this stuff will fall
> > out of whatever they're working on.
> 
> Why would the device need to know the domain? Why aren’t the IRQs,
> for example, steered to the appropriate CPU? Why doesn’t the bus handle
> allocating memory for it in the appropriate place? How does this “domain”
> tie into memory allocation and thread creation?

Because that's not always what you want (though it often is).  Another
reason is that system administrators want to know which domain devices
are close to.  You can sort of figure it out from devinfo on a modern
x86 machine if you squint right, but it isn't super obvious.  I have a followup
patch that adds a new per-device '%domain' sysctl node so that it is
easier to see which domain a device is close to.  In real-world experience
this can be useful as it lets a sysadmin/developer know which CPUs to
schedule processes on.  (Note that it doesn't always mean you put them
close to the device.  Sometimes you have processes that are more important 
than others, so you tie those close to the NIC and shove the other ones over 
to the "wrong" domain because you don't care if they have higher latency.)
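To make that concrete, here is a rough sketch (not the actual patch; the
method name, the bus_get_domain() wrapper and the sysctl plumbing here are
just placeholders) of how a bus_if.m domain accessor and the per-device
'%domain' node could fit together:

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/sysctl.h>

#include "bus_if.h"

/*
 * Sketch of a C-level wrapper around the proposed bus_if.m method:
 * returns 0 and fills in *domain, or ENOENT if no single domain can
 * be determined for this device.
 */
int
bus_get_domain(device_t dev, int *domain)
{

	return (BUS_GET_DOMAIN(device_get_parent(dev), dev, domain));
}

/*
 * Sketch of hanging a read-only "%domain" node off the device's
 * sysctl tree; if the domain is unknown we omit the node rather than
 * report something misleading.
 */
static void
device_sysctl_add_domain(device_t dev)
{
	int domain;

	if (bus_get_domain(dev, &domain) != 0)
		return;
	SYSCTL_ADD_INT(device_get_sysctl_ctx(dev),
	    SYSCTL_CHILDREN(device_get_sysctl_tree(dev)), OID_AUTO,
	    "%domain", CTLFLAG_RD, NULL, domain,
	    "NUMA domain the device is closest to");
}

A sysadmin would then see it as something like 'sysctl dev.ix.0.%domain'
(device name invented here) next to the existing %desc/%driver nodes.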

> > So when I go to add sysctl and other tree knowledge for device -> vm
> > domain mapping I'm going to make them return -1 for "no domain."
> 
> Seems like there’s too many things lumped together here. First off, how
> can there be no domain. That just hurts my brain. It has to be in some
> domain, or it can’t be seen. Maybe this domain is one that sucks for
> everybody to access, maybe it is one that’s fast for some CPU or package of
> CPUs to access, but it has to have a domain.

They are not always tied to a single NUMA domain.  On some dual-socket 
Nehalem/Westmere class machines with per-CPU memory controllers (so 2 NUMA 
domains) you will have a single I/O hub that is directly connected to both 
CPUs.  Thus, all memory in the system is equidistant for I/O (but not for CPU 
access).

The other problem is that you simply may not know.  Not all BIOSes correctly 
communicate this information for devices.  For example, certain 1U Romley 
servers I have worked with properly enumerate CPU <--> memory relationships in 
the SRAT table, but they fail to include the necessary _PXM method in the top-
level PCI bus devices (that correspond to the I/O hub).  In that case, 
returning a domain of 0 may very well be wrong.  (In fact, for these 
particular machines it mostly _is_ wrong as the expansion slots are all tied 
to NUMA domain 1, not 0.)
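For reference, the lookup the ACPI PCI bridge driver has to do is roughly
the following.  This is a sketch from memory rather than the patch itself,
and it assumes a helper along the lines of acpi_map_pxm_to_vm_domainid()
that turns an ACPI proximity ID into a VM domain using the parsed SRAT:

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/errno.h>

#include <contrib/dev/acpica/include/acpi.h>
#include <dev/acpica/acpivar.h>

/*
 * Sketch: derive the VM domain for devices below an ACPI PCI bridge
 * from its _PXM object.  If the BIOS never supplies _PXM (as on the
 * Romley boxes above), all we can honestly return is ENOENT.
 */
static int
acpi_pcib_get_domain(device_t pcib, device_t child, int *domain)
{
	ACPI_HANDLE handle;
	int d, pxm;

	handle = acpi_get_handle(pcib);
	if (handle == NULL ||
	    ACPI_FAILURE(acpi_GetInteger(handle, "_PXM", &pxm)))
		return (ENOENT);
	d = acpi_map_pxm_to_vm_domainid(pxm);
	if (d < 0)
		return (ENOENT);
	*domain = d;
	return (0);
}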

> > (Things will get pretty hilarious later on if we have devices that are
> > "local" to two or more VM domains ..)
> 
> Well, devices aren’t local to domains, per se. Devices can communicate with
> other components in a system at a given cost. One NUMA model is “near” vs
> “far” where a single near domain exists and all the “far” resources are
> quite costly. Other NUMA models may have a wider range of costs so that
> some resources are cheap, others are a little less cheap, while others are
> downright expensive depending on how far across the fabric of
> interconnects the messages need to travel. While one can model this as a
> full 1-1 partitioning, that doesn’t match all of the extant
> implementations, even today. It is easy, but an imperfect match to the
> underlying realities in many cases (though a very good match to x86, which
> is mostly what we care about).

Even x86 already has a notion of multiple layers of cost.  You can see that 
today if you buy a 4-socket Intel system.  It seems you might also get it on a 
dual-socket Haswell system with more than 8 cores per package (due to the 
funky split-brain thing on higher core count Haswells).  I believe AMD also 
ships CPUs that contain 2 NUMA domains within a single physical package.
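To put some (made-up) numbers on "multiple layers of cost": the ACPI SLIT
describes pairwise relative distances between proximity domains, and on
the systems above it is not a flat near/far split.  Purely as an
illustration (these values are invented, not measured on any real box):

#include <sys/types.h>

/*
 * Illustrative only: a SLIT-style relative distance matrix for a
 * hypothetical 4-domain system.  10 means local; note the two
 * distinct remote costs (21 and 31), i.e. more than a simple
 * near/far model.
 */
static const uint8_t numa_distance[4][4] = {
	{ 10, 21, 31, 21 },
	{ 21, 10, 21, 31 },
	{ 31, 21, 10, 21 },
	{ 21, 31, 21, 10 },
};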

Note that the I/O locality issue has become far more urgent on x86 in the 
past few years.  With Nehalem/Westmere, having I/O be remote or local didn't 
seem to matter very much (in my experience you could only measure very small 
differences in latency or throughput between the two cases).  On Romley 
(Sandy Bridge) and later it can make a very substantial difference in both 
latency and throughput.

-- 
John Baldwin

