'make -j16 universe' gives SIReset

Mon Oct 24 20:52:19 UTC 2011

On Sat, Oct 22, 2011 at 09:17:05AM +1100, Peter Jeremy wrote:
> On 2011-Oct-18 19:27:18 +0200, Marius Strobl <marius at alchemy.franken.de> wrote:
> >On Tue, Oct 18, 2011 at 03:26:46PM +1100, Peter Jeremy wrote:
> >> On 2011-Oct-13 20:42:25 +0200, Marius Strobl <marius at alchemy.franken.de> wrote:
> >> >On Thu, Oct 13, 2011 at 02:56:48PM +1100, Peter Jeremy wrote:
> >> >> Unfortunately, I can't get a crashdump because dumpon(8) doesn't like
> >> >> my Solaris swap partitions:
> >> >> GEOM_PART: Partition 'da0b' not suitable for kernel dumps (wrong type?)
> >> >> GEOM_PART: Partition 'da6b' not suitable for kernel dumps (wrong type?)
> >> >> No suitable dump device was found.
> >> >> 
> >> >> I did write a patch for that but took it out during some earlier
> >> >> testing to get back to stock code.  It looks like I didn't PR it
> >> >> either so I will do that when I get some time.
> >> 
> >> I've resurrected that patch (and will send-pr it later).
> 
> Thanks for committing it.

Thanks for the patch!

> 
> >Hrm, AFAICT this would mean that the _mtx_obtain_lock(), which boils
> >down to an atomic_cmpset_acq_ptr(), in _mtx_trylock() didn't work as
> >expected. I currently can't think of a good reason why that could
> >happen, though. The assembly generated for that code also looks just
> >fine. Have you run the workload which is triggering this before? It
> >would be interesting to know whether it also happens with SCHED_4BSD
> >with current sources, pre-r226054 and pre-r225889 if the machine
> >previously survived that load.
> 
> It was running 6 parallel -j16 buildworlds.  I switched to SCHED_4BSD
> and haven't been able to reproduce it - even with a pile of added
> "sysctl sysctl vm.vmtotal".  I haven't tried rolling back to an
> earlier kernel.

Well, actually this is good as it means that this problem isn't a
regression of r225889 or r226054, which worried me the most.
Could you please test whether the following patch makes a difference
with SCHED_ULE?
http://people.freebsd.org/~marius/sparc64_curthread_preemption.diff
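
For readers following along, the trylock path mentioned in the quoted
text above can be pictured roughly as below. This is only a user-space
sketch using C11 atomics as a stand-in for atomic_cmpset_acq_ptr(); the
names sketch_mtx, sketch_trylock(), sketch_unlock() and SKETCH_UNOWNED
are made up for illustration and are not the kernel's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker for "nobody holds the lock". */
#define SKETCH_UNOWNED ((uintptr_t)0)

/* Hypothetical lock word: SKETCH_UNOWNED or the id of the owner. */
struct sketch_mtx {
	_Atomic uintptr_t owner;
};

/*
 * Try to take the lock without blocking.  The compare-and-set only
 * succeeds if the lock word is still SKETCH_UNOWNED, and on success
 * it has acquire semantics, so later loads and stores cannot be
 * reordered before it.
 */
static bool
sketch_trylock(struct sketch_mtx *m, uintptr_t self)
{
	uintptr_t expected = SKETCH_UNOWNED;

	return (atomic_compare_exchange_strong_explicit(&m->owner,
	    &expected, self, memory_order_acquire, memory_order_relaxed));
}

/* Drop the lock with release semantics; only the owner may do so. */
static void
sketch_unlock(struct sketch_mtx *m, uintptr_t self)
{
	assert(atomic_load_explicit(&m->owner,
	    memory_order_relaxed) == self);
	atomic_store_explicit(&m->owner, SKETCH_UNOWNED,
	    memory_order_release);
}
```

If the compare-and-set behaves as intended, two callers can never both
see sketch_trylock() succeed while the lock is held, which is why a
second owner showing up would point at either the atomic operation or
the hardware underneath it.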

> 
> >Have you enabled PREEMPTION by chance?
> 
> That was using GENERIC and only changing the scheduler.
> 
> >The other thing that worries me is that it could be a silicon bug,
> >especially since that machine also has that issue of issuing stale
> >vector interrupts along with a state in which it traps even on
> >locked TLB entries, which isn't mentioned in the public erratum ...
> 
> I've had a rummage around in the OpenSolaris sources and nothing
> jumps out at me.  (Actually, I can't find any special case code
> that looks like it addresses silicon bugs in Jaguar).

Well, as I've learnt the hard way, that doesn't mean much. The
OpenSolaris source doesn't resemble a complete OS in the first place,
as vital parts are missing. Workarounds for silicon bugs aren't
necessarily marked as such. Due to the nature of some CPU bugs, it's
more likely that they've been worked around in the compiler so that it
doesn't emit the offending instruction (sequence). And FreeBSD uses
some parts of the hardware differently than (Open)Solaris, so they
might never have hit the bug in the first place, etc.

> 
> One other thing is that I'm getting lots of isp watchdog timeouts:
> (da4:isp0:0:4:0): first watchdog (handle 0x5cf020f3) timed out- deferring for grace period
> (da4:isp0:0:4:0): first watchdog (handle 0x5cf1206d) timed out- deferring for grace period
> (da4:isp0:0:4:0): first watchdog (handle 0x5cf2203a) timed out- deferring for grace period
> isp0: isp_watchdog: timeout for handle 0x5cad2046
> (da4:isp0:0:4:0): FIN dl16384 resid 0 CDB=0x2a 0x00 0x0f 0xdd 0xe8 0xe0 0x00 0x00 0x20 0x00  STS 0x0 XS_ERR=0xb
> isp0: bad request handle 0x5cad2046 (iocb type 0x3)
> isp0: isp_watchdog: timeout for handle 0x5cdb20cb
> (da4:isp0:0:4:0): FIN dl16384 resid 0 CDB=0x2a 0x00 0x0f 0xe3 0xa8 0x00 0x00 0x00 0x20 0x00  STS 0x0 XS_ERR=0xb
> isp0: isp_watchdog: timeout for handle 0x5cdc2059
> (da4:isp0:0:4:0): FIN dl16384 resid 0 CDB=0x2a 0x00 0x0f 0xe3 0xa8 0x20 0x00 0x00 0x20 0x00  STS 0x0 XS_ERR=0xb
> isp0: isp_watchdog: timeout for handle 0x5cdd2020
> (da4:isp0:0:4:0): FIN dl16384 resid 0 CDB=0x2a 0x00 0x0f 0xe3 0xa8 0x40 0x00 0x00 0x20 0x00  STS 0x0 XS_ERR=0xb
> isp0: bad request handle 0x5cdb20cb (iocb type 0x3)
> isp0: bad request handle 0x5cdc2059 (iocb type 0x3)
> isp0: bad request handle 0x5cdd2020 (iocb type 0x3)
> (da4:isp0:0:4:0): first watchdog (handle 0x6b9520bb) timed out- deferring for grace period
> (da4:isp0:0:4:0): first watchdog (handle 0x6b96200e) timed out- deferring for grace period
> 
> Any ideas on that?
> 

Not really; I also see such timeouts on a recently acquired
Qlogic-HBA-based 280R when pushing it, but not when using mpt(4),
sym(4), etc., so I highly doubt that these are caused by an MD problem
that, for example, causes lost interrupts. Also, I can't remember
having seen such timeouts back when I was testing with B1K/B2K (which
basically use the same mainboard as the 280R), so this actually _could_
be a regression in isp(4). However, at least in my case no real problem
has arisen from the isp(4) timeouts so far.
You could try asking mjacob@ what he thinks about these. Unfortunately,
he no longer seems to maintain isp(4) that actively.
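
In case it helps when reading the quoted FIN lines: opcode 0x2a is a
SCSI WRITE(10), whose CDB carries a big-endian 32-bit LBA in bytes 2-5
and a big-endian 16-bit block count in bytes 7-8. A minimal decoder
sketch follows; struct rw10 and decode_write10() are names made up for
illustration and are not part of isp(4) or CAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCSI_WRITE_10 0x2a	/* WRITE(10) operation code */

/* Hypothetical holder for the decoded fields. */
struct rw10 {
	uint32_t lba;		/* starting logical block address */
	uint16_t nblocks;	/* transfer length in blocks */
};

/*
 * Decode a 10-byte WRITE(10) CDB.  Returns 0 on success and -1 if
 * the buffer is too short or the opcode is not WRITE(10).
 */
static int
decode_write10(const uint8_t *cdb, size_t len, struct rw10 *out)
{
	assert(cdb != NULL && out != NULL);
	if (len < 10 || cdb[0] != SCSI_WRITE_10)
		return (-1);
	/* Bytes 2..5: LBA, most significant byte first. */
	out->lba = ((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
	    ((uint32_t)cdb[4] << 8) | (uint32_t)cdb[5];
	/* Bytes 7..8: transfer length, most significant byte first. */
	out->nblocks = (uint16_t)(((unsigned)cdb[7] << 8) | cdb[8]);
	return (0);
}
```

Fed the first quoted CDB (0x2a 0x00 0x0f 0xdd 0xe8 0xe0 0x00 0x00 0x20
0x00), this yields LBA 0x0fdde8e0 and 32 blocks. Assuming 512-byte
sectors and that "dl" is the data length in bytes, 32 blocks is 16384
bytes, which agrees with the "dl16384" in that line.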

Marius