seems I finally found what upset kqemu on amd64 SMP... shared gdt! (please test patch :)

Thu May 1 18:52:24 UTC 2008

On Thursday 01 May 2008 11:53:04 am Juergen Lock wrote:
> On Thu, May 01, 2008 at 10:11:06AM -0400, John Baldwin wrote:
> > On Thursday 01 May 2008 06:19:51 am Juergen Lock wrote:
> > > On Wed, Apr 30, 2008 at 12:24:58AM +0200, Juergen Lock wrote:
> > > > Yeah, the amd64 kernel reuses the same gdt to setup all cpus, causing
> > > > kqemu to end up restoring the interrupt stackpointer (after running
> > > > guest code using its own cpu state) from the tss of the last cpu,
> > > > regardless which cpu it happened to run on.  And that then causes the
> > > > last cpu's (usually) idle thread's stack to get smashed and the host
> > > > doing multiple panics...  (Which also explains why pinning qemu onto
> > > > cpu 1 worked on a 2-way host.)
> > >
> > > Hmm maybe the following is a little more clear:  kqemu sets up its own
> > > cpu state and has to save and restore the original state because of
> > > that, so among other things it does an str insn (store task register),
> > > and later an ltr insn (load task register) using the value it got from
> > > the first str insn.  That ltr insn loads the selector for the tss which
> > > is stored in the gdt, and that entry in the gdt is different for each
> > > cpu, but since a single gdt was reused to setup the cpus at boot (in
> > > init_secondary() in /sys/amd64/amd64/mp_machdep.c), it still points to
> > > the tss for the last cpu, instead of to the right one for the cpu the
> > > ltr insn gets executed on.
> > > That is what the kqemu_tss_workaround() in the patch `fixes'...
> > 
> > Perhaps kqemu shouldn't be doing str/ltr on amd64 instead?  The things
> > i386 uses a separate tss for in the kernel (separate stack for double
> > faults) is handled differently on amd64 (on amd64 we make the double
> > fault handler use one of the IST stacks).
> 
> Well, kqemu uses its own gdt, tss and everything while running guest code
> in its monitor, so it kinda has to do the str/ltr.s to setup its stuff, run
> guest code, and then restore the original state of things.  (And `restore
> original state of things' is what failed here.)
> 
>  Oh and also the tss does seem to be used for the interrupt stack on
> amd64 too, at least thats the one that ended up wrong and caused the panics
> I saw...

The single TSS holds the IST pointers.  On i386 we use a separate TSS for
double faults, but on amd64 a double fault stays on the same TSS and just
switches to one of the IST stacks listed in it.  The TSS also holds the
ring 0 stack pointer that is used when syscalls, interrupts, and traps from
userland cross from ring 3 to ring 0, which is probably why you got a panic.
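
For reference, here is a rough sketch of what the 64-bit TSS looks like as
a C struct.  The struct and field names are made up for illustration (they
are not the kernel's own definitions), and the packed attribute assumes
gcc or clang; only the layout itself is architectural.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the layout follows the architectural 64-bit TSS. */
struct tss64_sketch {
    uint32_t reserved0;
    uint64_t rsp0;       /* stack pointer loaded on entry to ring 0 */
    uint64_t rsp1;
    uint64_t rsp2;
    uint64_t reserved1;
    uint64_t ist[7];     /* IST1..IST7, e.g. a double fault stack */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t iomap_base; /* offset of the I/O permission bitmap */
} __attribute__((packed));

/* The hardware-defined part of the 64-bit TSS is 104 bytes long. */
static_assert(sizeof(struct tss64_sketch) == 104, "unexpected TSS size");
```

If the task register points at another cpu's TSS, rsp0 and the IST slots
are read from that other cpu's copy, so the next ring 3 to ring 0
transition lands on the wrong stack.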

Because amd64 never changes the task register in normal operation (and the
gdt isn't used quite the same way either; all the per-cpu stuff goes via
FSBASE and GSBASE), I don't expect the kernel to change to use a per-cpu gdt
or the like.  I think you will need to keep the current approach of patching
kqemu to fix up the tss/gdt when reloading the task register.  As a result,
you might want to make it a regular part of the code rather than a
workaround.
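
To make that concrete, here is a minimal sketch of the kind of fixup I
mean.  This is not the code from your patch; all of the names are
hypothetical, and it only manipulates a 16-byte descriptor in memory (the
actual ltr is privileged, so it is left out).  It assumes the standard
64-bit TSS descriptor encoding, where ltr faults if the descriptor is
still marked busy.

```c
#include <assert.h>
#include <stdint.h>

#define TSS_TYPE_AVAIL 0x9u  /* available 64-bit TSS */
#define TSS_TYPE_BUSY  0xBu  /* busy 64-bit TSS */

/* Store a 64-bit base address into a 16-byte TSS descriptor. */
void tss_desc_set_base(uint8_t d[16], uint64_t base)
{
    d[2] = (uint8_t)(base);
    d[3] = (uint8_t)(base >> 8);
    d[4] = (uint8_t)(base >> 16);
    d[7] = (uint8_t)(base >> 24);
    for (int i = 0; i < 4; i++)
        d[8 + i] = (uint8_t)(base >> (32 + 8 * i));
}

/* Read the base address back out of the descriptor. */
uint64_t tss_desc_get_base(const uint8_t d[16])
{
    uint64_t base = (uint64_t)d[2] | (uint64_t)d[3] << 8 |
        (uint64_t)d[4] << 16 | (uint64_t)d[7] << 24;
    for (int i = 0; i < 4; i++)
        base |= (uint64_t)d[8 + i] << (32 + 8 * i);
    return base;
}

/* The descriptor type lives in the low 4 bits of byte 5. */
unsigned tss_desc_type(const uint8_t d[16])
{
    return d[5] & 0x0fu;
}

/*
 * Hypothetical fixup: repoint the shared gdt entry at this cpu's TSS and
 * clear the busy bit, so a following ltr loads the right TSS.
 */
void tss_fixup_before_ltr(uint8_t d[16], uint64_t this_cpu_tss)
{
    tss_desc_set_base(d, this_cpu_tss);
    d[5] = (uint8_t)((d[5] & 0xf0u) | TSS_TYPE_AVAIL);
    assert(tss_desc_get_base(d) == this_cpu_tss);
    /* real code would execute ltr with the saved selector here */
}
```

In the real thing this would have to run on the cpu that is about to do
the ltr, with nothing else able to touch the shared gdt entry in between.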

-- 
John Baldwin