Code review: groundwork for SMP

Fri Jan 29 17:17:50 UTC 2010

In message: <5C3F269A-8E9A-4356-B1A1-3D503962F106 at lakerest.net>
            Randall Stewart <rrs at lakerest.net> writes:
: Neel
: 
: Comments in line
: 
: 
: On Jan 29, 2010, at 7:25 AM, Neel Natu wrote:
: 
: > Thanks Juli, Randall and JC for the comments.
: >
: > I think it is fair to ask that we don't burn another TLB entry to
: 
: (it's actually 4 if we turn on all 4 threads that are in each
:  core for RMI.. each thread has a complete register set etc so
:  it is a virtual cpu ;-o )

Yes.  That might be a problem.  You are modeling a partial CPU as a
full CPU...  That's one reason turning on hyperthreading for many
application work loads on older Intel CPUs produced worse results than
with it off...

: > store the pcpu data. So maybe it might help if I went through what
: > options I considered before settling on this one:
: >
: > - One of the first things that I did investigate was using per-cpu
: > scratch registers but the Sibyte did not have any and they are not
: > part of the MIPS architecture.
: 
: Which is a shame that they are not part of the arch... but are
: just showing up in recent cores... but oh well...

Yea.  We have to take our medicine :)

: > - The second thing I considered was using a platform-specific
: > getcpuid() to index into the struct pcpu pcpu[MAXCPU] array to compute
: > the KSEG0 address of pcpu at runtime. However this turned out to be a
: > bit messy because there are consumers of getcpuid() in exception
: > context where we are restricted to using only k0 and k1 (and sometimes
: > only one of them). Also, like Juli pointed out getcpuid() is slow on
: > some cpus and I did not want to make the assumption that one could
: > write getcpuid() using a single k0/k1 register.
: >
: Yep.. sounds reasonable... if getcpuid is slow this is an issue ;-0
: 
: > So, having the pcpu pointer in a TLB entry divorces us from any
: > assumptions about the CPU we are running on.
: 
: 
: True..
: >
: > I think that there is a legitimate concern about this on the XLR - but
: > given that you are sharing the TLB among 4 threads I think there is
: > the bigger issue of the wired kstack entries that you need to solve
: > before even thinking about pcpu mapping.
: >
: 
: Actually the 4 threads are not sharing, each thread gets its own
: pcpu.. which means I need 4 TLB entries.. sigh..

Maybe we need to have a better model for the threads then?  But that's
beyond the scope of this exercise :)

: > I did not consider the approach suggested by Juli where the pcpu and
: > kstack pointers can be stashed in a single wired TLB entry. I need
: > some time to chew on it and prototype it.
: 
: Yeah, I have not grokked this approach either.. I have to
: think about if it would work.. hmm but would you not have to
: still index which cpu you are on.. or am I missing something... maybe
: I need some more coffee ;-)

The idea here is that you need a kstack entry ANYWAY to do anything
useful with the thread in the kernel.  Just expand it a little to
include pcpu.
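
A rough sketch of that layout, in plain C so it can be run on a host.
Everything named sketch_* is made up for illustration and is not from
the tree.  The per-CPU array stands in for the TLB: on hardware each
CPU would wire its own page at the same virtual address, so the cpu
argument below would not exist and the offsets alone would be enough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NCPU 4	/* hypothetical CPU count for the simulation */

/*
 * Hypothetical contents of the single wired per-CPU page.  Fixed-width
 * fields keep the offsets at 0 and 8, matching "ld sp, 0(base)" and
 * "ld t0, 8(base)" from the thread.
 */
struct sketch_wired {
	uint64_t kstack_top;	/* offset 0 */
	uint64_t pcpu_base;	/* offset 8 */
};

/* One backing page per CPU; hardware would wire each at the same VA. */
struct sketch_wired sketch_pages[SKETCH_NCPU];

/* Stand-in for the TLB lookup: same "address", different page per CPU. */
struct sketch_wired *
sketch_wired_base(unsigned cpu)
{
	assert(cpu < SKETCH_NCPU);
	return (&sketch_pages[cpu]);
}

/* Done once per CPU at bring-up. */
void
sketch_cpu_init(unsigned cpu, uint64_t kstack_top, uint64_t pcpu_base)
{
	struct sketch_wired *w = sketch_wired_base(cpu);

	w->kstack_top = kstack_top;
	w->pcpu_base = pcpu_base;
}

/* Simulated "ld t0, 8(base)": a load at a compile-time offset. */
uint64_t
sketch_load_pcpu(unsigned cpu)
{
	return (sketch_wired_base(cpu)->pcpu_base);
}

/* Simulated "ld sp, 0(base)". */
uint64_t
sketch_load_kstack(unsigned cpu)
{
	return (sketch_wired_base(cpu)->kstack_top);
}
```

The point of the fixed offsets is that exception code never has to
compute an index or ask which CPU it is on; it only needs one register
to hold the base.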

: Juli's idea about using the badaddr pointer might actually have
: merit.. I know it sounds weird but one could harmonize this with
: the scratch register idea...

I think that this idea needs a lot of research.  It is unclear to me
the extent to which values that are written to this register persist.
Plus, moving to and from CP0 takes a bit of doing for the CPU, and
historically most of the errata for CPUs are in the CP0 handling.

: Basically have a macro that "updates" the pcpu pointer on
: any kernel entry. For systems that have a scratch register that
: is set up at cpu boot this is a nop. For systems that don't have
: a scratch register this does the calculation to get the pcpu[cpuid] and
: throws it in the badaddr variable.. of course this would have to
: be after pulling out the badaddr ;-)
: 
: Then we just use a macro to access the pcpu (I think this already
: exists).. on systems with a scratch register it refers to the
: proper register.. on systems without, it's the badaddr register.
: 
: Either way it's a single register access and no TLB entries burned...

Well, that's not entirely true.  The instant you touch the PCPU data,
you'll have to go through either a KSEG0 mapping, or a TLB entry.
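
To make the distinction concrete, here is a small sketch of the 32-bit
MIPS segment layout (64-bit kernels have more segments, which this
ignores).  The sketch_* names are hypothetical helpers written for this
note, not the macros from the tree.  It shows which kernel addresses
are translated by fixed hardware mapping and which need a TLB entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 32-bit MIPS virtual address segment bases. */
#define SKETCH_KSEG0_BASE	0x80000000u	/* unmapped, cached */
#define SKETCH_KSEG1_BASE	0xa0000000u	/* unmapped, uncached */
#define SKETCH_KSEG2_BASE	0xc0000000u	/* mapped through the TLB */
#define SKETCH_KSEG0_PHYS_MAX	0x1fffffffu	/* 512MB direct window */

/* Direct-mapped address for a physical address in the low 512MB. */
uint32_t
sketch_phys_to_kseg0(uint32_t pa)
{
	assert(pa <= SKETCH_KSEG0_PHYS_MAX);
	return (pa | SKETCH_KSEG0_BASE);
}

/* Inverse of the above; only valid for kseg0 addresses. */
uint32_t
sketch_kseg0_to_phys(uint32_t va)
{
	assert(va >= SKETCH_KSEG0_BASE && va < SKETCH_KSEG1_BASE);
	return (va & SKETCH_KSEG0_PHYS_MAX);
}

/* True if an access to va has to be translated by a TLB entry. */
bool
sketch_va_needs_tlb(uint32_t va)
{
	/* kuseg (below kseg0) and kseg2 and up are mapped segments. */
	return (va < SKETCH_KSEG0_BASE || va >= SKETCH_KSEG2_BASE);
}
```

So a pcpu pointer kept in a register only avoids the TLB if the value
in it is a kseg0 address; a pointer into kseg2 still costs an entry (or
a miss) on first touch.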

: And I do think we REALLY need to look into what it takes to
: do superpages. I know Kirk gave a presentation on them at
: EuroBSD and they're in the kernel.. and I know we need to
: do something to support them. If we did that it would free
: up loads of TLB entries since every time I am in the debugger and
: look at the TLB list I see tons of kernel addresses....

Alan Cox had a student interested in implementing them for MIPS before
Nova/BSD was cancelled.  It seemed to be good for a master's or PhD
thesis, from what I recall at the time...

: We could easily map one wired TLB entry that puts the entire
: kernel in some large block with one TLB.. And even on RMI
: this would be "good" since they (all 4 threads) could share that one
: TLB entry for the entire kernel ;-)

Well, the entire kernel CODE is already in one giant TLB entry that
doesn't burn a TLB entry: KSEG0 :).  The code, data and BSS entries in
that area are all referenced through kseg0.  A simple pcpu[] array
would live in kseg0.  But with the new dynamic pcpu stuff, we'd have
to be careful since that is malloced and doesn't live in kseg0 without
special magic.
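
For reference, the "simple pcpu[] array" case looks roughly like the
sketch below.  The struct and its fields are placeholders invented for
this note, not the real struct pcpu.  Because the array is a static
object it sits in the kernel's BSS, so on MIPS its address would be a
kseg0 address and the lookup is just base plus cpuid times the element
size.  A malloced pcpu gives no such guarantee.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAXCPU 4	/* hypothetical; stands in for MAXCPU */

/* Placeholder per-CPU structure; fields are illustrative only. */
struct sketch_pcpu {
	unsigned	 pc_cpuid;
	void		*pc_curthread;
};

/* Statically allocated, so it lives with the rest of the kernel BSS. */
struct sketch_pcpu sketch_pcpu_array[SKETCH_MAXCPU];

/* Fill in the per-CPU identity once at boot. */
void
sketch_pcpu_init_all(void)
{
	unsigned i;

	for (i = 0; i < SKETCH_MAXCPU; i++) {
		sketch_pcpu_array[i].pc_cpuid = i;
		sketch_pcpu_array[i].pc_curthread = NULL;
	}
}

/* Run-time index: needs the cpuid in hand before it can do anything. */
struct sketch_pcpu *
sketch_pcpu_lookup(unsigned cpuid)
{
	assert(cpuid < SKETCH_MAXCPU);
	return (&sketch_pcpu_array[cpuid]);
}
```

The cost discussed in the thread is the cpuid argument: getting it may
be slow on some CPUs, and the index arithmetic needs registers that
exception context may not have to spare.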

Warner

: Just some early morning thoughts ;-0
: 
: R
: 
: >
: > I would still like to commit this so as to keep making progress on the
: > SMP support. This is a small piece of the bigger goal of getting SMP
: > functional and can be replaced in the future if need be.
: >
: > best
: > Neel
: >
: > On Thu, Jan 28, 2010 at 10:42 PM, Juli Mallett <jmallett at freebsd.org>
: > wrote:
: >> On Thu, Jan 28, 2010 at 21:28, Randall Stewart <rrs at lakerest.net>
: >> wrote:
: >>>> [ Using a single wired TLB entry for kstack and pcpu ]
: >>>
: >>> Which means you have a big array that you are offsetting.
: >>
: >> Not really — you can have a structure at 0xc000000000000000u (or the
: >> same >> 32) with two pointers in it, even, one to pcpu and one to
: >> KSTACK_PAGES direct-mapped, contiguous pages.  Then you can load the
: >> kstack address or the pcpu base very quickly.  Of course, you can even
: >> have a single wired entry consisting of the pcpu data and then put a
: >> pointer to the top of the kstack in it.  I don't think you can get by
: >> with no wired TLB entries, but you also don't have to index a big
: >> array.  The nice thing about setting up a per-CPU TLB entry (you have
: >> to set up at least one, the kstack, in order to be able to handle
: >> exceptions) is that then you need only access offsets into it that are
: >> known at compile time and constant no matter what CPU you're running
: >> on.  Load the kstack by doing "ld sp, 0(0xc...)" and load the pcpu
: >> address by doing "ld t0, 8(0xc....)".  Two wired entries lets you get
: >> rid of the indirection, but you can get by with one and still not have
: >> to do (1) run-time computation of the index into some array (2)
: >> possibly very expensive getting of the cpuid.
: >>
: >>> I was even thinking get a LARGE entry.. one that is say 8 Meg
: >>> for the kernel.. covering all text/data etc... with this
: >>> new super page stuff. of course I have never looked into how
: >>> it's implemented..
: >>
: >> That would be easy to do, but what would be the benefits of accessing
: >> that data through a wired TLB entry instead of the direct map?
: >>
: >>> Yes, you pay an index reference for every access .. or at
: >>> least one to set up a pointer.. but I think that is much cheaper
: >>> than a TLB miss is... (words for imp to think about)...
: >>
: >> Yes, TLB misses are very slow.  Your desire to avoid adding another
: >> wired entry seems pretty reasonable.  I think that using a single
: >> wired TLB entry for a mux or for both the kstack and pcpu is easy and
: >> usable.  I feel like just wiring the kstack and putting a
: >> direct-mapped, sometimes-recomputed pointer to the pcpu into gp is the
: >> best combination in the long run — even just loading an immediate
: >> 64-bit address is pretty slow wrt how often things in the PCPU are
: >> accessed in SMP kernels.
: >>
: >> Juli.
: >>
: >
: 
: ------------------------------
: Randall Stewart
: 803-317-4952 (cell)
: 803-345-0391(direct)
: 