'make -j16 universe' gives SIReset

Wed Jun 22 10:05:30 UTC 2011

On Mon, Jun 20, 2011 at 08:00:34AM +1000, Peter Jeremy wrote:
> On 2011-Jun-16 01:34:45 +0200, Marius Strobl <marius at alchemy.franken.de> wrote:
> >You could try whether the below patch sufficiently reduces the lock
> >coverage to avoid these. For stable/8 you'll probably need to apply
> >the second chunk by hand.
> 
> Well, it lasted through 30 hours of 'make -j32 universe' (on its 7th
> cycle) before panicking with a 'spin lock held too long' on sched lock.

Okay, given that it considerably improves the situation, I suspect
that the problem is that we instantly begin to fault on kernel
mappings once we flush all unlocked TLB entries in order to get rid
of the user mappings, which in the case of cpu_switch() is still
covered by sched_lock. That would mean that we should use a
fine-grained approach instead, as the current one doesn't
behave/scale well even if sched_lock weren't being (ab)used here.
Could you please give the following patch a try on top of what you
already have?
http://people.freebsd.org/~marius/sparc64_flush_user_no_sledgehammer.diff
Note that this version is incomplete in that it breaks compiling
the loader, so you won't be able to build world with it.
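
To illustrate the difference between the two approaches, here is a
toy model in plain C. It is only a sketch and is not the code in the
diff above or any real sparc64 pmap/TLB code: the toy_* names, the
slot count and the entry layout are all made up for illustration.
It just shows that dropping every unlocked entry also throws away
the unlocked kernel mappings, while a selective flush leaves them
in place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; every name here is hypothetical. */
#define TOY_TLB_SLOTS 8

struct toy_tlb_entry {
	bool valid;	/* entry currently holds a mapping */
	bool locked;	/* locked entries survive any flush */
	bool kernel;	/* true: kernel mapping, false: user mapping */
};

/* Fill the first half with kernel mappings (slot 0 locked), rest user. */
static void
toy_tlb_fill(struct toy_tlb_entry *tlb, size_t n)
{
	assert(tlb != NULL);
	for (size_t i = 0; i < n; i++) {
		tlb[i].valid = true;
		tlb[i].locked = (i == 0);
		tlb[i].kernel = (i < n / 2);
	}
}

/* "Sledgehammer": invalidate every unlocked entry, kernel ones too. */
static void
toy_flush_all_unlocked(struct toy_tlb_entry *tlb, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!tlb[i].locked)
			tlb[i].valid = false;
}

/* Fine-grained: invalidate only unlocked user entries. */
static void
toy_flush_user_only(struct toy_tlb_entry *tlb, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!tlb[i].locked && !tlb[i].kernel)
			tlb[i].valid = false;
}

/* Count valid entries of the requested kind (kernel or user). */
static size_t
toy_tlb_count_valid(const struct toy_tlb_entry *tlb, size_t n, bool kernel)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (tlb[i].valid && tlb[i].kernel == kernel)
			count++;
	return (count);
}
```

In this model, every kernel mapping that is no longer valid after
the flush stands for a mapping that has to be faulted back in, which
in the cpu_switch() case would happen while sched_lock is held.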

> Along the way, I got 4 isp_watchdog timeouts (and subsequent 'bad
> request handle' reports).
> 
> Do you have any ideas why the panics aren't dropping into DDB?
> 

No, not really. However, the remaining contenders are cpu_switch()
and the scheduler itself, and I'm not sure whether one can easily
panic when in there. It would be interesting to know whether you get
the "timeout stopping cpus" in generic_stop_cpus(); compiling the
whole kernel with DIAGNOSTIC is overkill, though. Unfortunately,
there's no way to "hard" stop Sun sun4u CPUs or emulate some NMI
short of triggering a red state exception, which is rather hairy
to recover from.

Marius