'make -j16 universe' gives SIReset

Fri Oct 21 22:17:15 UTC 2011

On 2011-Oct-18 19:27:18 +0200, Marius Strobl <marius at alchemy.franken.de> wrote:
>On Tue, Oct 18, 2011 at 03:26:46PM +1100, Peter Jeremy wrote:
>> On 2011-Oct-13 20:42:25 +0200, Marius Strobl <marius at alchemy.franken.de> wrote:
>> >On Thu, Oct 13, 2011 at 02:56:48PM +1100, Peter Jeremy wrote:
>> >> Unfortunately, I can't get a crashdump because dumpon(8) doesn't like
>> >> my Solaris swap partitions:
>> >> GEOM_PART: Partition 'da0b' not suitable for kernel dumps (wrong type?)
>> >> GEOM_PART: Partition 'da6b' not suitable for kernel dumps (wrong type?)
>> >> No suitable dump device was found.
>> >> 
>> >> I did write a patch for that but took it out during some earlier
>> >> testing to get back to stock code.  It looks like I didn't PR it
>> >> either so I will do that when I get some time.
>> 
>> I've resurrected that patch (and will send-pr it later).

Thanks for committing it.

>Hrm, AFAICT this would mean that the _mtx_obtain_lock(), which boils
>down to an atomic_cmpset_acq_ptr(), in _mtx_trylock() didn't work as
>expected. I currently can't think of a good reason why that could
>happen, though. The assembly generated for that code also looks just
>fine. Have you run the workload which is triggering this before? It
>would be interesting to know whether it also happens with SCHED_4BSD
>with current sources, pre-r226054 and pre-r225889 if the machine
>previously survived that load.
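
For readers following the quoted point about _mtx_trylock(): the idea of
a try-lock built on a single compare-and-set with acquire semantics can
be sketched in portable C11 as below.  This is only an illustration, not
the FreeBSD implementation; the names sketch_mtx, sketch_trylock,
sketch_unlock and LOCK_UNOWNED are hypothetical, and C11 <stdatomic.h>
stands in for the kernel's atomic_cmpset_acq_ptr().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "nobody holds the lock" value, for illustration only. */
#define LOCK_UNOWNED ((uintptr_t)0)

/* Hypothetical lock word: holds the owner's id, or LOCK_UNOWNED. */
struct sketch_mtx {
	_Atomic uintptr_t owner;
};

/*
 * Try to take the lock without blocking.  The compare-and-set succeeds
 * only if the lock word still reads LOCK_UNOWNED, so at most one caller
 * can win.  Acquire ordering on success keeps the critical section from
 * being reordered before the lock is held.
 */
static bool
sketch_trylock(struct sketch_mtx *m, uintptr_t tid)
{
	uintptr_t expected = LOCK_UNOWNED;

	return (atomic_compare_exchange_strong_explicit(&m->owner,
	    &expected, tid, memory_order_acquire, memory_order_relaxed));
}

/* Release the lock; release ordering publishes the critical section. */
static void
sketch_unlock(struct sketch_mtx *m)
{
	atomic_store_explicit(&m->owner, LOCK_UNOWNED, memory_order_release);
}
```

If two callers ever both saw sketch_trylock() return true without an
unlock in between, the compare-and-set itself would have to have
misbehaved, which is the scenario being considered above.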

It was running 6 parallel -j16 buildworlds.  I switched to SCHED_4BSD
and haven't been able to reproduce it - even with a pile of added
"sysctl vm.vmtotal" invocations.  I haven't tried rolling back to an
earlier kernel.

>Have you enabled PREEMPTION by chance?

That was using GENERIC and only changing the scheduler.

>The other thing that worries me is that it could be a silicon bug,
>especially since that machine also has that issue of issuing stale
>vector interrupts along with a state in which it traps even on
>locked TLB entries, which isn't mentioned in the public erratum ...

I've had a rummage around in the OpenSolaris sources and nothing
jumps out at me.  (Actually, I can't find any special case code
that looks like it addresses silicon bugs in Jaguar).

One other thing is that I'm getting lots of isp watchdog timeouts:
(da4:isp0:0:4:0): first watchdog (handle 0x5cf020f3) timed out- deferring for grace period
(da4:isp0:0:4:0): first watchdog (handle 0x5cf1206d) timed out- deferring for grace period
(da4:isp0:0:4:0): first watchdog (handle 0x5cf2203a) timed out- deferring for grace period
isp0: isp_watchdog: timeout for handle 0x5cad2046
(da4:isp0:0:4:0): FIN dl16384 resid 0 CDB=0x2a 0x00 0x0f 0xdd 0xe8 0xe0 0x00 0x00 0x20 0x00  STS 0x0 XS_ERR=0xb
isp0: bad request handle 0x5cad2046 (iocb type 0x3)
isp0: isp_watchdog: timeout for handle 0x5cdb20cb
(da4:isp0:0:4:0): FIN dl16384 resid 0 CDB=0x2a 0x00 0x0f 0xe3 0xa8 0x00 0x00 0x00 0x20 0x00  STS 0x0 XS_ERR=0xb
isp0: isp_watchdog: timeout for handle 0x5cdc2059
(da4:isp0:0:4:0): FIN dl16384 resid 0 CDB=0x2a 0x00 0x0f 0xe3 0xa8 0x20 0x00 0x00 0x20 0x00  STS 0x0 XS_ERR=0xb
isp0: isp_watchdog: timeout for handle 0x5cdd2020
(da4:isp0:0:4:0): FIN dl16384 resid 0 CDB=0x2a 0x00 0x0f 0xe3 0xa8 0x40 0x00 0x00 0x20 0x00  STS 0x0 XS_ERR=0xb
isp0: bad request handle 0x5cdb20cb (iocb type 0x3)
isp0: bad request handle 0x5cdc2059 (iocb type 0x3)
isp0: bad request handle 0x5cdd2020 (iocb type 0x3)
(da4:isp0:0:4:0): first watchdog (handle 0x6b9520bb) timed out- deferring for grace period
(da4:isp0:0:4:0): first watchdog (handle 0x6b96200e) timed out- deferring for grace period

Any ideas on that?
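
As an aid to reading the FIN lines above, the CDB bytes can be decoded
by hand.  Opcode 0x2a is SCSI WRITE(10), which carries a 32-bit
big-endian LBA in bytes 2-5 and a 16-bit big-endian transfer length in
bytes 7-8.  A minimal sketch follows; the struct and function names are
hypothetical, and the closing remark assumes 512-byte sectors and that
"dl" is the data length in bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCSI_WRITE_10	0x2a	/* WRITE(10) operation code */

/* Hypothetical container for the decoded fields. */
struct write10 {
	uint32_t lba;		/* starting logical block address */
	uint16_t blocks;	/* transfer length in logical blocks */
};

/*
 * Decode a 10-byte WRITE(10) CDB.  Returns 0 on success, or -1 if the
 * buffer is too short or the opcode is not WRITE(10).
 */
static int
decode_write10(const uint8_t *cdb, size_t len, struct write10 *out)
{
	assert(cdb != NULL && out != NULL);
	if (len < 10 || cdb[0] != SCSI_WRITE_10)
		return (-1);
	/* Bytes 2-5: LBA, most significant byte first. */
	out->lba = ((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
	    ((uint32_t)cdb[4] << 8) | (uint32_t)cdb[5];
	/* Bytes 7-8: transfer length, most significant byte first. */
	out->blocks = (uint16_t)(((unsigned)cdb[7] << 8) | cdb[8]);
	return (0);
}
```

Applied to the first FIN line, this gives LBA 0x0fdde8e0 and a length of
32 blocks; 32 * 512 = 16384, which agrees with dl16384 under the
assumptions above.  The last three FIN lines decode to LBAs 0x0fe3a800,
0x0fe3a820 and 0x0fe3a840, i.e. consecutive 32-block writes.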

-- 
Peter Jeremy