head -r347003 on 2-socket/2-cores-each G5 PowerMac11,2's: one type of boot-blocking context found

Tue May 7 19:04:20 UTC 2019

On 2019-May-7, at 11:06, Justin Hibbits <chmeeedalf at gmail.com> wrote:

> On Mon, 6 May 2019 22:43:36 -0700
> Mark Millard <marklmi at yahoo.com> wrote:
> 
>> Every example of boot failure during cpu_mp_unleash,
>> where I've had the tracking in place, has had 1 or more
>> examples of srr0<DMAP_BASE_ADDRESS (EXC_ISE) in
>> handle_kernel_slb_spill before cpu_mp_unleash tries to
>> start its first ap.
>> 
>> Every example of boot success, where I've had the tracking
>> in place, has had no examples of srr0<DMAP_BASE_ADDRESS
>> (EXC_ISE) in handle_kernel_slb_spill before the
>> cpu_mp_unleash finished. (Successful boots are rare
>> in my current test context, so there are fewer examples
>> of this.)
>> 
>> In other words: the original live-G5 information
>> for the segment was still present throughout that
>> time frame, thus avoiding a slbtrap for such a
>> fetch address over the time frame involved.
>> 
>> 
>> 
>> In the code:
>> 
>>        rstvec = rstvec_virtbase + reset;
>> printf("powermac_smp_start_cpu: about to use *rstvec==4\n");
>>        *rstvec = 4;
>>        powerpc_sync();
>>        (void)(*rstvec);
>>        powerpc_sync();
>>        DELAY(1);
>> printf("powermac_smp_start_cpu: about to use *rstvec==0\n");
>>        *rstvec = 0;
>>        powerpc_sync();
>>        (void)(*rstvec);
>>        powerpc_sync();
>> printf("powermac_smp_start_cpu: done using *rstvec==0\n");
>> 
>> Every boot failure has had the last line reported by
>> FireWire dcons use as the first of those 3 printf's,
>> for CPU 2 as the target (of 0-3).
>> 
>> The above code appears to me to execute with MSR.IR=1
>> on the bsp.
>> 
>> But, then, what would *rstvec do if there is no ESID=0
>> V=1 combination active for the live-G5 information at
>> the time? Does that block the exception code that
>> is in what would be ESID=0's address range, effectively
>> preventing slbtrap from being invoked to enable ESID=0?
>> 
>> In other words: when MSR.IR=1, does there always
>> need to be an ESID=0 V=1 entry? Is it appropriate
>> to reserve one for ESID=0 V=1 (after invalidating
>> any arbitrarily placed ESID=0 V=1 entry present
>> before the kernel even started)?
> 
> Hi Mark,
> 
> Thanks for continuing to look into this.  In this case you're
> presenting, an ISE shouldn't really matter, because the SLB miss handler
> is written to run entirely from real mode to handle the miss.  Can you
> determine what the addresses were that faulted in the failure cases?
> We shouldn't be touching anything below DMAP_BASE at this time, since
> we're not yet in userspace, and all mappings should be either KVA or
> DMAP.

I'll try to get examples of all of them based on
my current code.

But in an earlier message I reported several examples from
simply sticking a printf in handle_kernel_slb_spill and
later making it controllable to report at selective time
frames. (The printf's being there led to earlier hang-ups.
I was surprised I got anything.)

Remember that the number of handle_kernel_slb_spill
calls for srr0<DMAP_BASE_ADDRESS and dar<DMAP_BASE_ADDRESS
varies from boot to boot, so the places are not unique
overall.

Here is the core of those old reports for reference:

KDB: debugger backends: ddb
KDB: current backend: ddb
handle_kernel_slb_spill: type=0x380 dar=0x3d99348 srr0=0xa869bc
handle_kernel_slb_spill: type=0x380 dar=0x10000000 srr0=0xa869bc

Both seemed to involve the stbx instruction in:

0000000000a869bc <.memset+0x20> stbx    r4,r9,r3
0000000000a869c0 <.memset+0x24> addi    r9,r9,1
0000000000a869c4 <.memset+0x28> bdnz    0000000000a869bc <.memset+0x20>

The above was from the unconditional printf addition and, as I
remember, repeated for:

     #ifdef __powerpc64__
     i = 0;
     for (va = virtual_avail; va < virtual_end && i<(n_slbs-1)/2; va += SEGMENT_LENGTH, i++)
             moea64_bootstrap_slb_prefault(va, 0);
     #endif
enable_handle_kernel_slb_spill_reporting= 1;

(Note the (n_slbs-1)/2 that I was experimenting with at
the time.)
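
As a rough sketch of what that extra bound does: the loop
below only counts iterations, one per 256 MiB segment that
would be handed to moea64_bootstrap_slb_prefault. The function
name and the example values are hypothetical; n_slbs = 64 is an
assumption (the SLB size usually cited for the PowerPC 970),
which would cap the prefaulting at (64-1)/2 = 31 segments.

```c
#include <stdint.h>

#define SEGMENT_LENGTH UINT64_C(0x10000000)	/* 256 MiB per segment */

/*
 * Hypothetical helper: count how many segments the experimental
 * loop would prefault. It mirrors the loop's bounds but makes no
 * kernel calls.
 */
static inline int
count_prefaulted_segments(uint64_t virtual_avail, uint64_t virtual_end,
    int n_slbs)
{
	uint64_t va;
	int i = 0;

	for (va = virtual_avail;
	    va < virtual_end && i < (n_slbs - 1) / 2;
	    va += SEGMENT_LENGTH, i++) {
		/* The kernel loop calls moea64_bootstrap_slb_prefault(va, 0) here. */
	}
	return (i);
}
```

With a KVA range wider than 31 segments the count stops at 31;
with a narrower range it stops at the end of the range instead.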

The below was from instead enabling the reporting later:

enable_handle_kernel_slb_spill_reporting= 1;
     dpcpu_init(dpcpu, curcpu);

That got (eliminating an unrelated line that had a
truncated address showing):

KDB: debugger backends: ddb
KDB: current backend: ddb
handle_kernel_slb_spill: type=0x380 dar=0x22ef8 srr0=0xa86690
handle_kernel_slb_spill: type=0x480 dar=0x22ef8 srr0=0xa86690

Both seemed to involve the stdu instruction in:

0000000000a8668c <.memcpy+0x140> ldu     r0,-8(r9)
0000000000a86690 <.memcpy+0x144> stdu    r0,-8(r11)
0000000000a86694 <.memcpy+0x148> bdnz    0000000000a8668c <.memcpy+0x140>
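
For reading the report lines above, here is a minimal sketch
of decoding one of them. The struct and helper names are
hypothetical. The trap values 0x380 (EXC_DSE) and 0x480
(EXC_ISE) and the 28-bit shift for 256 MiB segments are
believed to match FreeBSD's powerpc headers; the direct-map
base used here is an assumption, since the thread does not
state its value. The address choice (srr0 for an ISE, dar
otherwise) is meant to follow what handle_kernel_slb_spill
does, but treat it as illustrative only.

```c
#include <stdint.h>

#define EXC_DSE   0x0380	/* data segment exception */
#define EXC_ISE   0x0480	/* instruction segment exception */
#define SEG_SHIFT 28		/* 256 MiB segments: ESID = addr >> 28 */

/* ASSUMPTION: powerpc64 direct-map base; not given in this thread. */
#define DMAP_BASE_ASSUMED UINT64_C(0xc000000000000000)

/* Hypothetical record for one "handle_kernel_slb_spill:" line. */
struct spill_report {
	unsigned type;
	uint64_t dar;
	uint64_t srr0;
};

/* The address that missed: srr0 for an ISE, dar for a DSE. */
static inline uint64_t
spill_addr(const struct spill_report *r)
{
	return ((r->type == EXC_ISE) ? r->srr0 : r->dar);
}

/* Effective segment ID of the address that missed. */
static inline uint64_t
spill_esid(const struct spill_report *r)
{
	return (spill_addr(r) >> SEG_SHIFT);
}

/* Nonzero when the miss was for an address below the direct map. */
static inline int
spill_below_dmap(const struct spill_report *r)
{
	return (spill_addr(r) < DMAP_BASE_ASSUMED);
}
```

Applied to the four lines quoted in this message, this gives
ESID 0 for dar=0x3d99348, ESID 1 for dar=0x10000000, ESID 0
for dar=0x22ef8, and ESID 0 for the type=0x480 line (whose
relevant address is srr0=0xa86690). All four are below the
assumed direct-map base.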

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)