KSE/ia64 broken

Fri Nov 21 08:26:17 PST 2003

On Fri, Nov 21, 2003 at 08:33:31PM +0800, David Xu wrote:
> >
> >Ok. More pieces of the puzzle. If I apply the attached patch (against
> >clean sources), I get the following:
> >
> >itanium% ./foo.bad
> >XXX:_thr_alloc: thread=200000000008a000, tcb=2000000000085000
> >XXX:_thr_alloc: thread=2000000000090000, tcb=2000000000090000
> >
> >The second _thr_alloc() is screwed up, in that malloc() returns
> >the same pointer twice. Hence thread->tcb points to thread itself
> >and we're clobbering our thread structure. 
> >
> I saw the same result.
> 
> >Since thr_spinlock.c
> >affects the locking of malloc(), we may have a race condition.
> >Note that forcing an upcall (by adding a _thread_printf() in the
> >code stream) seems to fix it. Does the UTS call malloc when first
> >invoked?
> >
> No, we never call malloc in such a case.  I suspect we do not
> fully restore the thread's context. In the kernel, I pass zero as
> the third parameter to get_mcontext(); is that enough for ia64?

Yes. The context is asynchronous. We save and restore all scratch
registers, including the high FP registers. Note that an incorrect
context restoration would very likely not have such a clean failure
mode.

The thing that bugs me is that if you add a _thread_printf() just
prior to the call to _thr_alloc(), you trigger an upcall. That
seems to make all the difference. It's as if we have to avoid the
UTS getting its first upcall with a spinlock held. What also
bugs me is that the second malloc happily returns the same address
as the malloc immediately prior to it. There's no indication of
corruption. It's as if the first malloc never happened or the
memory got freed in between. From a more context-oriented point
of view, it's as if the second malloc returns the result of the
first malloc, as though the context of the first (assuming it got
saved) is restored by the second. This could mean that, even if
the context switching itself works correctly, we missed saving a
context and are restoring a stale one.
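
To catch the duplicate-pointer symptom at the point where it happens,
rather than later when the thread structure gets clobbered, a debug
check along these lines might help. This is only a sketch: the names
regions_overlap() and check_alloc_pair() are made up for illustration
and are not part of the threading library, and the caller is assumed
to pass in the two sizes it handed to malloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return 1 if the byte ranges [a, a+alen) and
 * [b, b+blen) share at least one byte, 0 otherwise.
 */
static int
regions_overlap(const void *a, size_t alen, const void *b, size_t blen)
{
	uintptr_t as = (uintptr_t)a;
	uintptr_t bs = (uintptr_t)b;

	return (as < bs + blen && bs < as + alen);
}

/*
 * Hypothetical debug check: abort if two live allocations (for
 * example the thread structure and its tcb) alias each other.
 */
static void
check_alloc_pair(const void *thread, size_t thread_size,
    const void *tcb, size_t tcb_size)
{
	assert(thread != NULL && tcb != NULL);
	assert(!regions_overlap(thread, thread_size, tcb, tcb_size));
}
```

Calling check_alloc_pair() right after the second allocation would
abort on the thread == tcb case shown in the output above. Note that
adding any call that triggers an upcall may itself hide the problem.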

Anyway: upcalls play a key role.

BTW: Maybe an interesting experiment is to disable upcalls on
page faults on i386 and see if that makes a difference. We do
not have upcalls for page faults on ia64. There may be an upcall
on i386 that we do not get on ia64...

-- 
 Marcel Moolenaar	  USPA: A-39004		 marcel at xcllnt.net