'make -j16 universe' gives SIReset
Marius Strobl
marius at alchemy.franken.de
Fri May 27 12:07:06 UTC 2011
On Fri, May 27, 2011 at 09:47:28AM +1000, Peter Jeremy wrote:
> I tried a "make -j16 universe" using a recent 8-stable on a 16-CPU
> V890 and after about 11 minutes, I got the following. This box
> had been running Solaris without problem for several years so I'm
> inclined to suspect a software issue.
It probably doesn't hurt to check the hardware with SunVTS though.
> Any suggestions?
>
> ERROR: CPU4 SIReset
>
>
> System State (CPU4 reporting)
>
> BBC Devices: 0000.0000.0000.000f 0000.0000.0000.000f
> BBC Arb: 0000.0000.0000.000f 0000.0000.0000.000f
> BBC Quiesce: 0000.0000.0000.0003 0000.0000.0000.0003
> BBC WDogAct: 0000.0000.0000.0000 0000.0000.0000.0000
> BBC POR Gen: 0000.0000.0000.0000 0000.0000.0000.0000
> BBC XIR Gen: 0000.0000.0000.0000 0000.0000.0000.0000
> BBC POR Src: 0000.0000.0000.0000 0000.0000.0000.0000
> BBC XIR Src: 0000.0000.0000.000f 0000.0000.0000.000f
> BBC EBus TC: 014f.99fd.a7e6.3f29 014f.99fd.a7e6.3f29
>
> CMP0 Core Config/Control registers:
>
> CoreAvail: 0000.0000.0000.0003 0 1
> CoreEnabled: 0000.0000.0000.0003 0 1
> CoreRunning: 0000.0000.0000.0003 0 1
> XIRSteering: 0000.0000.0000.0003 0 1
> ErrSteering: 0000.0000.0000.0000
>
> CPU0 Config/Control/Status registers:
>
> CPUVersion: 003e.0018.3100.0507
> SafConfig: 0caa.01bc.2000.8002 9:1 ID:0 HBM TOL:15
> SafBaseAdr: 0000.0400.0000.0000
> DispatchCtl: 0000.0000.0000.0009 MS SI
> DCacheCtl: 0000.0200.0000.0010 WE
> ECacheCtl: 0000.0000.01c5.5000 5:1 8MB mode=5-5-5(2) R/W-turn:2 Late-Sel ECC:off
> ErrorEnable: 0000.0000.0000.000b CEEN NCEEN UCEEN
>
> AFAR: 0000.0000.0000.0000
> AFSR: 0000.0000.0000.0000 (no errors set)
> AFAR 2: 0000.0000.8000.0000
> AFSR 2: 0000.0000.0000.0000 (no errors set)
>
> DMMU SFAR: 0000.0000.f3f8.c300
> DMMU SFSR: 0000.0000.0000.0000 (no status set)
> IMMU SFSR: 0000.0000.0080.8000 TM
>
This doesn't indicate much, especially not the address of the instruction
causing the SIR, except that there was an i-TLB miss, which seems innocuous.
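As a rough illustration of how the "IMMU SFSR: 0000.0000.0080.8000 TM" line can be read, here is a minimal decoding sketch. The bit positions (FV at bit 0, TM at bit 15, ASI in bits 23:16) are assumptions taken from memory of the UltraSPARC III register layout, not from the dump itself, so check them against the CPU manual before relying on them. The names sfsr_fields, sfsr_decode and sfsr_check_dump are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed UltraSPARC III SFSR layout; verify against the CPU manual. */
#define SFSR_FV        (UINT64_C(1) << 0)   /* fault valid (assumed bit 0) */
#define SFSR_TM        (UINT64_C(1) << 15)  /* TLB miss (assumed bit 15) */
#define SFSR_ASI_SHIFT 16                   /* ASI field assumed in bits 23:16 */
#define SFSR_ASI_MASK  UINT64_C(0xff)

/* Hypothetical container for the fields of interest. */
struct sfsr_fields {
    int fault_valid;   /* 1 if FV is set */
    int tlb_miss;      /* 1 if TM is set */
    unsigned asi;      /* ASI recorded with the event */
};

/* Extract the FV, TM and ASI fields from a raw SFSR value. */
struct sfsr_fields sfsr_decode(uint64_t sfsr)
{
    struct sfsr_fields f;

    f.fault_valid = (sfsr & SFSR_FV) != 0;
    f.tlb_miss = (sfsr & SFSR_TM) != 0;
    f.asi = (unsigned)((sfsr >> SFSR_ASI_SHIFT) & SFSR_ASI_MASK);
    return f;
}

/* Decode the IMMU SFSR value quoted in the dump above. */
void sfsr_check_dump(void)
{
    struct sfsr_fields f = sfsr_decode(UINT64_C(0x00808000));

    assert(f.tlb_miss == 1);      /* matches the "TM" annotation */
    assert(f.fault_valid == 0);   /* no valid fault recorded */
    assert(f.asi == 0x80);
}
```

Under these assumptions the value has TM set and FV clear, which fits reading it as a plain TLB miss rather than a recorded fault.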
Generally, FreeBSD only triggers a SIR when something really unexpected
happens in an environment where we can't, or at least can't easily, trigger
a panic. The only exception to this that is not really fatal from the
OS point of view is stray vector interrupts (IIRC even OpenSolaris just
ignores a certain amount of these). You could check whether the following
patch makes any difference to the SIR you're seeing:
http://people.freebsd.org/~marius/sparc64_intr_vector_stray.diff
Generally, both USIV and V880 with USIII (which should be quite close to
a V890) are rather quirky hardware; I've already hit two CPU bugs which
are not documented in the publicly available errata. Two other things
to try are to replace the following in cheetah.c:
val &= ~DCR_DTPE;
once with:
val &= ~(DCR_DTPE | DCR_ITPE);
and once with:
val &= ~DCR_SI;
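To make the difference between the three statements concrete, here is a small stand-alone sketch that applies each of them to a sample register value. It is not the code from cheetah.c: the bit positions of DCR_ITPE and DCR_DTPE are assumptions for illustration (the real definitions are in the FreeBSD sparc64 headers), while MS at bit 0 and SI at bit 3 are consistent with the "DispatchCtl: 0000.0000.0000.0009 MS SI" line in the dump. The enum and the function dcr_apply are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* MS and SI positions agree with the dump (0x9 decodes as "MS SI"). */
#define DCR_MS   (UINT64_C(1) << 0)
#define DCR_SI   (UINT64_C(1) << 3)
/* Assumed positions; take the real values from the sparc64 headers. */
#define DCR_ITPE (UINT64_C(1) << 16)
#define DCR_DTPE (UINT64_C(1) << 17)

/* Hypothetical selector for the statement being tried. */
enum dcr_variant {
    DCR_ORIGINAL,       /* val &= ~DCR_DTPE;              */
    DCR_NO_TLB_PARITY,  /* val &= ~(DCR_DTPE | DCR_ITPE); */
    DCR_NO_SI           /* val &= ~DCR_SI;                */
};

/* Return val with the bits cleared that the chosen statement clears. */
uint64_t dcr_apply(uint64_t val, enum dcr_variant v)
{
    switch (v) {
    case DCR_ORIGINAL:
        val &= ~DCR_DTPE;
        break;
    case DCR_NO_TLB_PARITY:
        val &= ~(DCR_DTPE | DCR_ITPE);
        break;
    case DCR_NO_SI:
        val &= ~DCR_SI;
        break;
    default:
        assert(0 && "unknown variant");
        break;
    }
    return val;
}
```

Starting from a value with MS, SI, ITPE and DTPE all set, the original statement leaves MS, SI and ITPE; the first replacement leaves only MS and SI; the second replacement, read literally as a substitute for the original line, leaves MS, ITPE and DTPE.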
Besides that, IIRC I haven't added a workaround for the USIV+ erratum #4
so far, though that seems unlikely to be the cause of this problem.
Marius