Code review: groundwork for SMP
rrs at lakerest.net
Fri Jan 29 16:32:08 UTC 2010
Comments inline.
On Jan 29, 2010, at 7:25 AM, Neel Natu wrote:
> Thanks Juli, Randall and JC for the comments.
> I think it is fair to ask that we don't burn another TLB entry to
(it's actually 4 if we turn on all 4 threads that are in each
core for RMI.. each thread has a complete register set etc., so
it is a virtual cpu ;-o )
> store the pcpu data. So maybe it might help if I went through what
> options I considered before settling on this one:
> - One of the first things that I did investigate was using per-cpu
> scratch registers but the Sibyte did not have any and they are not
> part of the MIPS architecture.
It is a shame that they are not part of the arch and are only
just showing up in recent cores... but oh well...
> - The second thing I considered was using a platform-specific
> getcpuid() to index into the struct pcpu pcpu[MAXCPU] array to compute
> the KSEG0 address of pcpu at runtime. However this turned out to be a
> bit messy because there are consumers of getcpuid() in exception
> context where we are restricted to using only k0 and k1 (and sometimes
> only one of them). Also, as Juli pointed out, getcpuid() is slow on
> some cpus and I did not want to make the assumption that one could
> write getcpuid() using a single k0/k1 register.
Yep.. sounds reasonable... if getcpuid is slow this is an issue ;-0
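
For reference, the second option (index pcpu[] by cpuid) boils down to
something like the sketch below. This is not code from the tree: the
cut-down struct pcpu, the MAXCPU value and the fake_cpuid variable that
stands in for the platform-specific register read are all made up for
illustration. The real problem described above is that in exception
context this lookup has to fit in k0/k1, which plain C cannot show.

```c
#include <assert.h>
#include <stddef.h>

#define	MAXCPU	8	/* assumption: illustrative value only */

/* Cut-down stand-in for the kernel's struct pcpu. */
struct pcpu {
	unsigned int	 pc_cpuid;
	void		*pc_curthread;
};

static struct pcpu pcpu[MAXCPU];

/* Hypothetical stand-in for whatever the platform reads the cpu id from. */
static unsigned int fake_cpuid;

static unsigned int
getcpuid(void)
{
	return (fake_cpuid);
}

/* Compute the address of this cpu's pcpu at run time. */
static struct pcpu *
pcpu_lookup(void)
{
	unsigned int id = getcpuid();

	assert(id < MAXCPU);
	return (&pcpu[id]);
}
```

Every access pays for getcpuid() plus the index computation, which is
exactly the cost being objected to when getcpuid() is slow.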
> So, having the pcpu pointer in a TLB entry divorces us from any
> assumptions about the CPU we are running on.
> I think that there is a legitimate concern about this on the XLR - but
> given that you are sharing the TLB among 4 threads I think there is
> the bigger issue of the wired kstack entries that you need to solve
> before even thinking about pcpu mapping.
Actually the 4 threads are not sharing, each thread gets its own
pcpu.. which means I need 4 TLB entries.. sigh..
> I did not consider the approach suggested by Juli where the pcpu and
> kstack pointers can be stashed in a single wired TLB entry. I need
> some time to chew on it and prototype it.
Yeah, I have not grokked this approach either.. I have to
think about whether it would work.. hmm, but would you not still
have to index by which cpu you are on.. or am I missing something...
maybe I need some more coffee ;-)
Juli's idea about using the badaddr pointer might actually have
merit.. I know it sounds weird but one could harmonize this with
the scratch register idea...
Basically have a macro that "updates" the pcpu pointer on
any kernel entry. For systems that have a scratch register that
is set up at cpu boot this is a nop. For systems that don't have a
scratch register this does the calculation to get the pcpu[cpuid] and
throws it in the badaddr variable.. of course this would have to
be after pulling out the badaddr ;-)
Then we just use a macro to access the pcpu (I think this already
exists).. on systems with a scratch register it refers to the
proper register.. on systems without one it's the badaddr register.
Either way it's a single register access and no TLB entries burned...
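
A rough sketch of the two macros, with the hardware modelled as a plain
struct so it can be compiled anywhere. All the names here (struct
fake_cpu, HAVE_SCRATCH_REG, PCPU_UPDATE, PCPU_PTR, kernel_entry,
cpu_boot) are hypothetical, and whether the bad-address register can
really be written on a given core is something this sketch simply
assumes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	MAXCPU	4	/* assumption: illustrative value only */

struct pcpu {
	unsigned int	pc_cpuid;
};

static struct pcpu pcpu[MAXCPU];

/* Simulated per-cpu hardware state (hypothetical). */
struct fake_cpu {
	unsigned int	cpuid;
	uintptr_t	scratch;	/* stands in for a scratch register */
	uintptr_t	badaddr;	/* stands in for the badaddr register */
};

#ifdef HAVE_SCRATCH_REG
/* Scratch register was loaded at cpu boot: nothing to do on entry. */
#define	PCPU_UPDATE(cpu)	((void)0)
#define	PCPU_PTR(cpu)		((struct pcpu *)(cpu)->scratch)
#else
/* No scratch register: recompute and stash the pointer in badaddr. */
#define	PCPU_UPDATE(cpu)	\
	((cpu)->badaddr = (uintptr_t)&pcpu[(cpu)->cpuid])
#define	PCPU_PTR(cpu)		((struct pcpu *)(cpu)->badaddr)
#endif

/* Done once per cpu at boot. */
static void
cpu_boot(struct fake_cpu *cpu, unsigned int id)
{
	assert(id < MAXCPU);
	cpu->cpuid = id;
	cpu->scratch = (uintptr_t)&pcpu[id];
	cpu->badaddr = 0;
}

/* Kernel entry: save the fault address first, then refresh the pointer. */
static uintptr_t
kernel_entry(struct fake_cpu *cpu)
{
	uintptr_t fault_addr = cpu->badaddr;

	PCPU_UPDATE(cpu);
	return (fault_addr);
}
```

After kernel_entry() returns, PCPU_PTR() is a single "register" read in
both configurations, and the saved fault address is still available to
the exception handler.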
And I do think we REALLY need to look into what it takes to
do superpages. I know Kirk gave a presentation on them at
EuroBSD and they're in the kernel.. and I know we need to
do something to support them. If we did that it would free
up loads of TLB entries since every time I am in the debugger and
look at the TLB list I see tons of kernel addresses....
We could easily set up one wired TLB entry that maps the entire
kernel as one large block.. And even on RMI this would be "good"
since they (all 4 threads) could share that one
TLB entry for the entire kernel ;-)
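
Back-of-the-envelope numbers, as a sketch only. The assumptions: the
"say 8 Meg" kernel size from the quoted text further down, a 4 KB base
page, each TLB entry mapping an even/odd pair of pages as on R4000-style
MIPS TLBs, and a core that supports a 4 MB page size. The helper name
tlb_entries_needed is made up.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of TLB entries needed to map `bytes`, when each entry maps
 * `pages_per_entry` pages of `page_size` bytes (rounded up).
 */
static size_t
tlb_entries_needed(size_t bytes, size_t page_size, size_t pages_per_entry)
{
	size_t span = page_size * pages_per_entry;

	assert(span != 0);
	return ((bytes + span - 1) / span);
}
```

Under those assumptions tlb_entries_needed(8u << 20, 4096, 2) comes to
1024 entries' worth of kernel mappings competing for the TLB, while
tlb_entries_needed(8u << 20, 4u << 20, 2) is a single entry.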
Just some early morning thoughts ;-0
> I would still like to commit this so as to keep making progress on the
> SMP support. This is a small piece of the bigger goal of getting SMP
> functional and can be replaced in the future if need be.
> On Thu, Jan 28, 2010 at 10:42 PM, Juli Mallett
> <jmallett at freebsd.org> wrote:
>> On Thu, Jan 28, 2010 at 21:28, Randall Stewart <rrs at lakerest.net> wrote:
>>>> [ Using a single wired TLB entry for kstack and pcpu ]
>>> Which means you have a big array that you are offsetting.
>> Not really; you can have a structure at 0xc000000000000000u (or the
>> same >> 32) with two pointers in it, even, one to pcpu and one to
>> KSTACK_PAGES direct-mapped, contiguous pages. Then you can load the
>> kstack address or the pcpu base very quickly. Of course, you can
>> have a single wired entry consisting of the pcpu data and then put a
>> pointer to the top of the kstack in it. I don't think you can get by
>> with no wired TLB entries, but you also don't have to index a big
>> array. The nice thing about setting up a per-CPU TLB entry (you have
>> to set up at least one, the kstack, in order to be able to handle
>> exceptions) is that then you need only access offsets into it that
>> are known at compile time and constant no matter what CPU you're running
>> on. Load the kstack by doing "ld sp, 0(0xc...)" and load the pcpu
>> address by doing "ld t0, 8(0xc....)". Two wired entries lets you get
>> rid of the indirection, but you can get by with one and still not
>> have to do (1) run-time computation of the index into some array or (2)
>> possibly very expensive getting of the cpuid.
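
The single-wired-entry layout described above can be sketched as
follows. This is a user-space model, not kernel code: struct wired_page,
wire_for_cpu(), load_kstack() and load_pcpu() are hypothetical names,
and the wired_va pointer stands in for the fixed virtual address
(0xc...) that each cpu's TLB maps to its own backing page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	MAXCPU	4	/* assumption: illustrative value only */

struct pcpu {
	unsigned int	pc_cpuid;
};

/* Hypothetical layout of the page behind the one wired entry. */
struct wired_page {
	uint64_t	wp_kstack;	/* offset 0: "ld sp, 0(0xc...)" */
	uint64_t	wp_pcpu;	/* offset 8: "ld t0, 8(0xc...)" */
};

/* The offsets are compile-time constants, the same on every cpu. */
_Static_assert(offsetof(struct wired_page, wp_kstack) == 0, "kstack at 0");
_Static_assert(offsetof(struct wired_page, wp_pcpu) == 8, "pcpu at 8");

static struct pcpu pcpu[MAXCPU];
static struct wired_page backing[MAXCPU];	/* one backing page per cpu */
static struct wired_page *wired_va;		/* models the fixed address */

/* Models installing this cpu's wired TLB entry. */
static void
wire_for_cpu(unsigned int id, uint64_t kstack)
{
	assert(id < MAXCPU);
	backing[id].wp_kstack = kstack;
	backing[id].wp_pcpu = (uint64_t)(uintptr_t)&pcpu[id];
	wired_va = &backing[id];
}

/* Neither load needs the cpu id or an array index. */
static uint64_t
load_kstack(void)
{
	return (wired_va->wp_kstack);
}

static struct pcpu *
load_pcpu(void)
{
	return ((struct pcpu *)(uintptr_t)wired_va->wp_pcpu);
}
```

The cpu id is only needed once, when the entry is wired; after that both
loads go through the same address with constant offsets.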
>>> I was even thinking get a LARGE entry.. one that is say 8 Meg
>>> for the kernel.. covering all text/data etc... with this
>>> new super page stuff. Of course I have never looked into how
>>> it's implemented..
>> That would be easy to do, but what would be the benefits of accessing
>> that data through a wired TLB entry instead of the direct map?
>>> Yes, you pay an index reference for every access .. or at
>>> least one to set up a pointer.. but I think that is much cheaper
>>> than a TLB miss is... (words for imp to think about)...
>> Yes, TLB misses are very slow. Your desire to avoid adding another
>> wired entry seems pretty reasonable. I think that using a single
>> wired TLB entry for a mux or for both the kstack and pcpu is easy and
>> usable. I feel like just wiring the kstack and putting a
>> direct-mapped, sometimes-recomputed pointer to the pcpu into gp is the
>> best combination in the long run; even just loading an immediate
>> 64-bit address is pretty slow wrt how often things in the PCPU are
>> accessed in SMP kernels.