Code review: groundwork for SMP

Fri Jan 29 17:03:03 UTC 2010

Greetings one and all.  Thanks for weighing in on this issue.

In general, I agree with Neel here.  But I also think we need to see
if we can be flexible and push this down into a per-cpu-type
decision (which differs slightly from a per-platform type because we
can have a CPU appearing in multiple platforms, or multiple CPUs
appearing within one platform).  If we make it a per-cpu-type
solution, we could have a sys/mips/mips/pcpu_machdep.c which does the
normal SMP stuff, as well as having sys/mips/xlr/pcpu_machdep.c which
does something optimized for the XLR.  Chances are good that different
CPUs will want to have different trade-offs here.  We'd also need some
way to encode this in an include file, so there's some work to do to
make the PCPU macros differ between CPUs...
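
To make the include-file idea a bit more concrete, here is a rough
sketch of how one header could pick a per-cpu-type lookup at compile
time while callers keep using the same accessors.  Every name in it
(SKETCH_CPU_INDEXED, struct pcpu_sketch, sketch_pcpu_ptr, the
SKETCH_PCPU_* macros) is made up for illustration and is not taken from
the tree or from Neel's patch:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down per-CPU structure; the real one is larger. */
struct pcpu_sketch {
	unsigned int	pc_cpuid;
	void		*pc_curthread;
};

#ifdef SKETCH_CPU_INDEXED
/* Variant for a CPU type that prefers indexing an array by CPU id. */
#define SKETCH_MAXCPU	4u
static struct pcpu_sketch sketch_pcpu_array[SKETCH_MAXCPU];
static unsigned int sketch_current_cpu;	/* stand-in for a hardware read */

static inline struct pcpu_sketch *
sketch_pcpu_ptr(void)
{
	assert(sketch_current_cpu < SKETCH_MAXCPU);
	return (&sketch_pcpu_array[sketch_current_cpu]);
}
#else
/* Default variant: one object standing in for a fixed per-CPU mapping. */
static struct pcpu_sketch sketch_pcpu_storage;

static inline struct pcpu_sketch *
sketch_pcpu_ptr(void)
{
	return (&sketch_pcpu_storage);
}
#endif

/* Callers use the same accessors whichever variant was compiled in. */
#define SKETCH_PCPU_GET(member)		(sketch_pcpu_ptr()->pc_##member)
#define SKETCH_PCPU_SET(member, val)	(sketch_pcpu_ptr()->pc_##member = (val))
```

The point is only that the choice lives behind one function, so a CPU
type with different trade-offs can swap its own version in without
touching the consumers.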

We could load $gp with the per-cpu info.  This loading is orthogonal
to the value we load.  With Neel's method, we could load it with the
same value on the trip into the kernel every time.  We need to load it
with some value, saving the old value, when we trap into the kernel,
because we can't trust the user not to do something like:
	or gp, zero, zero
or worse, causing security issues (panics, wrong data accessed, etc).
Even on the XLR, if we take lots of trips into the kernel, we're likely
to put a lot of pressure on the TLBs if we're constantly installing
an entry that isn't wired.  So even the pcpu[getcpuid()] method fails here.

The advantage of Neel's method is that it works well for a large class
of SMP machines, and the code path is no different for non-SMP
machines (since the non-SMP machines can just use an address in KSEG0
and not need any TLB entry).  It also scales well with the number of
CPUs, since each additional CPU just needs 2 pages of RAM and we don't
have to be limited by MAXCPU.  Many systems will benefit from this as
well, since, for example, the Octeon supports multiple executives
partitioning the cores.  In that scenario, FreeBSD may be given CPUs 1
and 15, leaving a rather large gap in CPU numbers (since the MIPS64r2
method of getting the core returns a raw number).  This gap will mean
a table lookup, or larger tables.  I don't know if all the assumptions
about contiguous CPU numbers are yet out of the kernel.
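
For what it's worth, the table lookup I have in mind for sparse core
numbers would look something like the sketch below.  It is only an
illustration: the bound of 16 raw ids and all of the sketch_* names are
assumptions of mine, not anything that exists in the kernel today.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_HW_ID	16u	/* assumed bound on raw core numbers */
#define SKETCH_NO_CPU		(-1)	/* raw id not owned by this kernel */

static int sketch_hw_to_logical[SKETCH_MAX_HW_ID];

/* Assign dense logical ids 0..n-1 to the raw hardware ids we were given. */
static void
sketch_build_cpu_map(const unsigned int *hw_ids, size_t n)
{
	size_t i;

	for (i = 0; i < SKETCH_MAX_HW_ID; i++)
		sketch_hw_to_logical[i] = SKETCH_NO_CPU;
	for (i = 0; i < n; i++) {
		assert(hw_ids[i] < SKETCH_MAX_HW_ID);
		sketch_hw_to_logical[hw_ids[i]] = (int)i;
	}
}

/* Translate a raw core number into a dense index, or SKETCH_NO_CPU. */
static int
sketch_logical_cpuid(unsigned int hw_id)
{
	if (hw_id >= SKETCH_MAX_HW_ID)
		return (SKETCH_NO_CPU);
	return (sketch_hw_to_logical[hw_id]);
}
```

With raw cores 1 and 15, the map hands back logical ids 0 and 1, so
anything that still assumes contiguous CPU numbers keeps working at the
cost of one extra memory reference per lookup.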

I also tend to agree with Neel that pcpu[getcpuid()] likely is going
to be expensive to compute for the trap and interrupt contexts we have
to run in. We should avoid that as much as possible.

One other nice side effect of Neel's scheme is that you can have MP
and !MP kernel modules that use the same method to get pcpu data.  But
that's a minor point at this stage of the game.

The XLR will have scheduler challenges as well.  It will push the
design assumptions of ULE beyond the breaking point, I fear.
Hyperthreading already exists on Intel, and ULE copes, a bit, with
it.  But with the high number of threads each CPU can have, we may
need something with a little more smarts.  Something that knows it
might be better to schedule two different processes on two different
cores, and leave some of the threads idle to reduce TLB pressure, for
example.

Per CPU scratch registers do not exist on MIPS, in general.  Some CPUs
have them, and many do not.  CP0 registers are plentiful in more
modern designs, and some of them may even be useful for our needs.
However, mfc0 and mtc0 often have pipeline hazards associated with
them which will trip up the unwary.  When reading the historical
errata for MIPS CPUs, we often find that this is where we need to do
the most workarounds.

I guess this is a long way to say "I think we should commit Neel's
patches.  We should work along two fronts: (1) implementing Juli's
idea of sharing kstack and pcpu data in one TLB entry and (2) making it so
that CPUs where this is sub-optimal can swap in their own
implementation."
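
As a rough sketch of (1), going by Juli's description quoted below: one
small structure sits at a fixed address behind the single wired entry
and holds two pointers, so both the kstack and the pcpu base are found
at offsets known at compile time.  The structure and function names are
hypothetical, and a plain pointer variable stands in for the fixed
virtual address so the example can be compiled outside a kernel:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down per-CPU structure. */
struct pcpu_sketch {
	unsigned int	pc_cpuid;
};

/* Assumed layout of the data behind the one wired TLB entry. */
struct percpu_window_sketch {
	void			*pw_kstack_top;	/* offset 0 */
	struct pcpu_sketch	*pw_pcpu;	/* offset sizeof(void *) */
};

/*
 * In a kernel this would be a constant virtual address that each CPU
 * maps to its own page.  Here it is an ordinary pointer that the
 * caller switches to simulate running on a different CPU.
 */
static struct percpu_window_sketch *sketch_window;

/* Fetch the top of the current CPU's kernel stack. */
static inline void *
sketch_kstack_top(void)
{
	assert(sketch_window != NULL);
	return (sketch_window->pw_kstack_top);
}

/* Fetch the current CPU's pcpu pointer; no cpuid, no array index. */
static inline struct pcpu_sketch *
sketch_pcpu(void)
{
	assert(sketch_window != NULL);
	return (sketch_window->pw_pcpu);
}
```

Neither accessor computes a CPU id or indexes a table; the cost is the
one extra load through the window that Juli mentions.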

Warner

In message: <dffe84831001290725g2ca2574ap22b82f2ad38af2d6 at mail.gmail.com>
            Neel Natu <neelnatu at gmail.com> writes:
: Thanks Juli, Randall and JC for the comments.
: 
: I think it is fair to ask that we don't burn another TLB entry to
: store the pcpu data. So maybe it might help if I went through what
: options I considered before settling on this one:
: 
: - One of the first things that I did investigate was using per-cpu
: scratch registers but the Sibyte did not have any and they are not
: part of the MIPS architecture.
: 
: - The second thing I considered was using a platform-specific
: getcpuid() to index into the struct pcpu pcpu[MAXCPU] array to compute
: the KSEG0 address of pcpu at runtime. However this turned out to be a
: bit messy because there are consumers of getcpuid() in exception
: context where we are restricted to using only k0 and k1 (and sometimes
: only one of them). Also, like Juli pointed out getcpuid() is slow on
: some cpus and I did not want to make the assumption that one could
: write getcpuid() using a single k0/k1 register.
: 
: So, having the pcpu pointer in a TLB entry divorces us from any
: assumptions about the CPU we are running on.
: 
: I think that there is a legitimate concern about this on the XLR - but
: given that you are sharing the TLB among 4 threads I think there is
: the bigger issue of the wired kstack entries that you need to solve
: before even thinking about pcpu mapping.
: 
: I did not consider the approach suggested by Juli where the pcpu and
: kstack pointers can be stashed in a single wired TLB entry. I need
: some time to chew on it and prototype it.
: 
: I would still like to commit this so as to keep making progress on the
: SMP support. This is a small piece of the bigger goal of getting SMP
: functional and can be replaced in the future if need be.
: 
: best
: Neel
: 
: On Thu, Jan 28, 2010 at 10:42 PM, Juli Mallett <jmallett at freebsd.org> wrote:
: > On Thu, Jan 28, 2010 at 21:28, Randall Stewart <rrs at lakerest.net> wrote:
: >>> [ Using a single wired TLB entry for kstack and pcpu ]
: >>
: >> Which means you have a big array that you are offsetting.
: >
: > Not really — you can have a structure at 0xc000000000000000u (or the
: > same >> 32) with two pointers in it, even, one to pcpu and one to
: > KSTACK_PAGES direct-mapped, contiguous pages.  Then you can load the
: > kstack address or the pcpu base very quickly.  Of course, you can even
: > have a single wired entry consisting of the pcpu data and then put a
: > pointer to the top of the kstack in it.  I don't think you can get by
: > with no wired TLB entries, but you also don't have to index a big
: > array.  The nice thing about setting up a per-CPU TLB entry (you have
: > to set up at least one, the kstack, in order to be able to handle
: > exceptions) is that then you need only access offsets into it that are
: > known at compile time and constant no matter what CPU you're running
: > on.  Load the kstack by doing "ld sp, 0(0xc...)" and load the pcpu
: > address by doing "ld t0, 8(0xc....)".  Two wired entries lets you get
: > rid of the indirection, but you can get by with one and still not have
: > to do (1) run-time computation of the index into some array (2)
: > possibly very expensive getting of the cpuid.
: >
: >> I was even thinking get a LARGE entry.. one that is say 8 Meg
: >> for the kernel.. covering all text/data etc... with this
: >> new super page stuff. of course I have never looked into how
: >> it's implemented..
: >
: > That would be easy to do, but what would be the benefits of accessing
: > that data through a wired TLB entry instead of the direct map?
: >
: >> Yes, you pay an index reference for every access .. or at
: >> least one to set up a pointer.. but I think that is much cheaper
: >> than a TLB miss is... (words for imp to think about)...
: >
: > Yes, TLB misses are very slow.  Your desire to avoid adding another
: > wired entry seems pretty reasonable.  I think that using a single
: > wired TLB entry for a mux or for both the kstack and pcpu is easy and
: > usable.  I feel like just wiring the kstack and putting a
: > direct-mapped, sometimes-recomputed pointer to the pcpu into gp is the
: > best combination in the long run — even just loading an immediate
: > 64-bit address is pretty slow wrt how often things in the PCPU are
: > accessed in SMP kernels.
: >
: > Juli.
: >