seems I finally found what upset kqemu on amd64 SMP... shared gdt! (please test patch :)

Sat May 3 13:13:41 UTC 2008

On Thu, May 01, 2008 at 01:35:06PM -0400, John Baldwin wrote:
> On Thursday 01 May 2008 11:53:04 am Juergen Lock wrote:
> > On Thu, May 01, 2008 at 10:11:06AM -0400, John Baldwin wrote:
> > > On Thursday 01 May 2008 06:19:51 am Juergen Lock wrote:
> > > > On Wed, Apr 30, 2008 at 12:24:58AM +0200, Juergen Lock wrote:
> > > > > Yeah, the amd64 kernel reuses the same gdt to set up all cpus, causing
> > > > > kqemu to end up restoring the interrupt stack pointer (after running
> > > > > guest code using its own cpu state) from the tss of the last cpu,
> > > > > regardless of which cpu it happened to run on.  And that then causes
> > > > > the last cpu's (usually) idle thread's stack to get smashed and the
> > > > > host to panic multiple times...  (Which also explains why pinning qemu
> > > > > onto cpu 1 worked on a 2-way host.)
> > > >
> > > > Hmm, maybe the following is a little clearer:  kqemu sets up its own
> > > > cpu state and has to save and restore the original state because of
> > > > that, so among other things it does an str insn (store task register),
> > > > and later an ltr insn (load task register) using the value it got from
> > > > the first str insn.  That ltr insn loads the selector for the tss which
> > > > is stored in the gdt, and that entry in the gdt is set up differently
> > > > for each cpu, but since a single gdt was reused to set up the cpus at
> > > > boot (in init_secondary() in /sys/amd64/amd64/mp_machdep.c), it still
> > > > points to the tss for the last cpu, instead of to the right one for the
> > > > cpu the ltr insn gets executed on.
> > > > That is what the kqemu_tss_workaround() in the patch `fixes'...
> > > 
> > > Perhaps kqemu shouldn't be doing str/ltr on amd64 instead?  The thing
> > > i386 uses a separate tss for in the kernel (separate stack for double
> > > faults) is handled differently on amd64 (on amd64 we make the double
> > > fault handler use one of the IST stacks).
> > 
> > Well, kqemu uses its own gdt, tss and everything while running guest code
> > in its monitor, so it kinda has to do the str/ltr to set up its stuff, run
> > guest code, and then restore the original state of things.  (And `restore
> > original state of things' is what failed here.)
> >
> > Oh, and also the tss does seem to be used for the interrupt stack on
> > amd64 too, at least that's the one that ended up wrong and caused the
> > panics I saw...
> 
> The single TSS holds the IST pointers.  On i386 we use a separate TSS for
> double faults, but on amd64 a double fault stays on the same TSS and takes
> its stack from one of that TSS's IST pointers.  The TSS also holds the ring 0
> stack pointer used when syscalls, interrupts, and traps from userland cross
> from ring 3 to ring 0, which is probably why you got a panic.
> 
Yeah, that's where it happened.

> Because amd64 in normal operation never changes the task register (and the
> gdt isn't used quite the same way either; all the per-cpu stuff is via
> FSBASE and GSBASE), I don't expect the kernel to change to use a per-cpu
> gdt or the like.  I think you will need to keep the current approach of
> patching kqemu to fix up the tss/gdt when reloading the task register.  You
> might want to make it a regular part of the code rather than a workaround
> as a result.
> 
Hmm, okay, what would you call it then, kqemu_tss_fixup?
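
For illustration, here is a minimal user-space sketch of what such a fixup
might boil down to: point the shared gdt's tss descriptor back at the
current cpu's tss and mark it available again before the ltr.  This is not
the code from the patch.  The names struct tss_desc, tss_desc_set_base(),
tss_desc_get_base() and kqemu_tss_fixup_sketch() are made up for the sketch,
and how the caller finds the current cpu's tss address is assumed and left
out.  It only manipulates a descriptor in ordinary memory and never executes
str or ltr.

```c
#include <stdint.h>

/* Layout of a 16-byte amd64 TSS descriptor as it sits in the gdt. */
struct tss_desc {
    uint16_t limit_lo;       /* limit 15:0 */
    uint16_t base_lo;        /* base 15:0 */
    uint8_t  base_mid;       /* base 23:16 */
    uint8_t  type_attr;      /* type (low 4 bits), DPL, present bit */
    uint8_t  limit_hi_flags; /* limit 19:16 plus flag bits */
    uint8_t  base_hi;        /* base 31:24 */
    uint32_t base_upper;     /* base 63:32 */
    uint32_t reserved;
};

#define TSS_TYPE_MASK  0x0f
#define TSS_TYPE_AVAIL 0x09  /* available 64-bit TSS */
#define TSS_TYPE_BUSY  0x0b  /* busy 64-bit TSS; ltr on this would fault */

/* Scatter a 64-bit base address over the descriptor's four base fields. */
static void tss_desc_set_base(struct tss_desc *d, uint64_t base)
{
    d->base_lo    = (uint16_t)(base & 0xffff);
    d->base_mid   = (uint8_t)((base >> 16) & 0xff);
    d->base_hi    = (uint8_t)((base >> 24) & 0xff);
    d->base_upper = (uint32_t)(base >> 32);
}

/* Reassemble the 64-bit base address from the descriptor. */
static uint64_t tss_desc_get_base(const struct tss_desc *d)
{
    return (uint64_t)d->base_lo
         | ((uint64_t)d->base_mid << 16)
         | ((uint64_t)d->base_hi << 24)
         | ((uint64_t)d->base_upper << 32);
}

/*
 * Hypothetical fixup (assumption, not the actual patch): make the shared
 * gdt entry point at this cpu's tss and clear the busy state, so that a
 * following ltr would load the right tss instead of the last cpu's.
 */
static void kqemu_tss_fixup_sketch(struct tss_desc *d, uint64_t this_cpu_tss)
{
    tss_desc_set_base(d, this_cpu_tss);
    d->type_attr = (uint8_t)((d->type_attr & ~TSS_TYPE_MASK) | TSS_TYPE_AVAIL);
}
```

The type has to go back to "available" because ltr marks the descriptor
busy when it loads it and refuses to load one that is already busy.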

	Juergen