lr=u_trap+0x10 and srr0=k_trap+0x28 for "stopped at 0 illegal instruction 0" before-copyright hang on PowerMac G5's

Mark Millard markmi at dsl-only.net
Thu Sep 25 10:46:25 UTC 2014


One source code oddity that I notice is the following mixed use of ofw_real_mode: always tested vs. never tested (#if 0 ... #endif) ...

> /*
>  * Saved SPRG0-3 from OpenFirmware. Will be restored prior to the callback.
>  */
> register_t      ofw_sprg0_save;
> 
> static __inline void
> ofw_sprg_prepare(void)
> {
>         if (ofw_real_mode)
>                 return;
> 
>         /*
>          * Assume that interrupt are disabled at this point, or
>          * SPRG1-3 could be trashed
>          */
>         __asm __volatile("mfsprg0 %0\n\t"
>                          "mtsprg0 %1\n\t"
>                          "mtsprg1 %2\n\t"
>                          "mtsprg2 %3\n\t"
>                          "mtsprg3 %4\n\t"
>                          : "=&r"(ofw_sprg0_save)
>                          : "r"(ofmsr[1]),
>                          "r"(ofmsr[2]),
>                          "r"(ofmsr[3]),
>                          "r"(ofmsr[4]));
> }
>         
> static __inline void
> ofw_sprg_restore(void)
> {
> #if 0
>         if (ofw_real_mode)
>                 return;
> #endif
> 
>         /*
>          * Note that SPRG1-3 contents are irrelevant. They are scratch
>          * registers used in the early portion of trap handling when
>          * interrupts are disabled.
>          *
>          * PCPU data cannot be used until this routine is called !
>          */
>         __asm __volatile("mtsprg0 %0" :: "r"(ofw_sprg0_save));
> }

It would seem that for ofw_real_mode != 0 that ofw_sprg_prepare would never set up ofw_sprg0_save (via mfsprg0) for the later ofw_sprg_restore's always-executed mtsprg0 that is based on ofw_sprg0_save.

register_t seems to trace back to __int64_t --and that would leave ofw_sprg0_save initialized to zero as a global and that would have to be okay as the SPRG0 value to restore in such a case. (I have not tracked down what any of the per-processor values for SPRG0 are/should-be.)






===
Mark Millard
markmi at dsl-only.net

On Sep 25, 2014, at 2:12 AM, Mark Millard <markmi at dsl-only.net> wrote:

The register dump that has no kernel stack addresses in any registers does have register contents suggesting a ofwcall use, matching up reasonably with the code I looked at that is related to ofwcall. ofwcall is only reached via openfirmware_core from what I can tell. (If there are other paths into openfirmware than via ofwcall then the register dump suggests that they are not in use around the crash.)

And openfirmware_core has logic for exception vector swapping, going both directions:

> static int
> openfirmware_core(void *args)
> {
>         int             result;
>         register_t      oldmsr;
>  
>         /*
>          * Turn off exceptions - we really don't want to end up
>          * anywhere unexpected with PCPU set to something strange
>          * or the stack pointer wrong.
>          */
>         oldmsr = intr_disable();
>  
>         ofw_sprg_prepare();
>  
>         /* Save trap vectors */
>         ofw_save_trap_vec(save_trap_of);
>  
>         /* Restore initially saved trap vectors */
>         ofw_restore_trap_vec(save_trap_init);
>  
> #if defined(AIM) && !defined(__powerpc64__)
>         /*
>          * Clear battable[] translations
>          */
>         if (!(cpu_features & PPC_FEATURE_64))
>                 __asm __volatile("mtdbatu 2, %0\n"
>                                  "mtdbatu 3, %0" : : "r" (0));
>         isync();
> #endif
> 
>         result = ofwcall(args);
> 
>         /* Restore trap vecotrs */
>         ofw_restore_trap_vec(save_trap_of);
> 
>         ofw_sprg_restore();
> 
>         intr_restore(oldmsr);
> 
>         return (result);
> }

In turn openfirmware_core is used only by ofw_rendezvous_dispatch and in turn that is used only by openfirmware: only PCPU_GET(cpuid) == 0 does the above. save_trap_init is initialized by powerpc_init using ofw_save_trap_vec.

[Note that ofw_restore_trap_vec uses __syncicache which does not use dcbf after the bcopy but instead uses dcbst: That is part of what lead my investigation into the distinction --and so to my more overall dcbst vs. dcbf use questions after proving dcbf would not be sufficient for a fix to the specific boot issue.]

Unless the initialization of save_trap_init ends up with the wrong contents for openfirmware it would appear that the exception vectors are kept tracking by the above code. But the above does assume that the openfirmware vectors are unchanged after save_trap_init is initialized: there is no attempt at tracking of any potential updates to the openfirmware exception vectors.

I would infer then that after ofw_restore_trap_vec(save_trap_of) is executed is when the exception that DDB reports happened: That is when FreeBSD's exception vectors are again in place. But a stack pointer into the kernel stack is not then in place in any register (based on DDB's register dump): stack handling is messed up already by the point of the reported exception. And that may actually be why an illegal instruction at address zero was reached: an incorrect stack context used to get an address to execute at.




===
Mark Millard
markmi at dsl-only.net

On Sep 24, 2014, at 8:36 AM, Nathan Whitehorn <nwhitehorn at freebsd.org> wrote:

There shouldn't be any exceptions at that point, nested or otherwise. What I suspect is happening is that Open Firmware has turned them on for some bizarre reason, taken one, and ended up in the kernel's handlers but with the Open Firmware environment. Saving and restoring the OF interrupt vectors would be a possible solution; flattening the device tree in loader so that the kernel doesn't call Open Firmware at all would be another. I think Justin may have tried the first at some point.
-Nathan

On 09/24/14 02:04, Mark Millard wrote:
> Now that I've had a kernel/boot crash with a successful DDB bt and show registers (a different submittal) it makes for a good comparison/contrast with what DDB reports for this "before copyright" crash.
> 
> Something unique to the "before copyright" context is...
> 
> No registers are reported to have values that point into the range between tmpstk and esym.
> 
> In other words: There is no valid stack pointer reported as far as I can tell. r1 has the value 0 instead of being a handling a valid stack address. tmpstk=0xbd7000 and esym=0xbdb000 (example for one of my WITH_DEBUG_FILES= and options DDB and GDB builds of 10.1-BETA2). That at least gives a ball park on the range to expect for pointing into the stack even with some build variation.
> 
> It leaves me wondering if the DDB report is for a nested exception handling. That could explain why lr points to u_trap+0x10 and srr0 points to k_trap+0x28 when normally srr0 would point to the the failing instruction (or the instruction after) and lr to where that routine would normally return to.
> 
> The register values that are reported for my 10.1-BETA2 builds that crash before the copyright notice are:
> 
> r0: 0
> r1: 0
> r2: 0xc81538 vop_unlock_desc
> r3: 0xd18868
> r4: 0x894b58
> r5: 0
> r6: 0xc1dee0 M_AUDITBSM
> r7: 0xe3f818 ofw_real_mode
> r8: 0x1
> r9: 0xe0f580 __pcpu
> r10: 0x1c35ec0
> r11: 0
> r12: 0x10000000
> r13: 0xdbb290 thread0 (Note: another submittal has this mistyped as 0xdbb290.)
> r14-r19: all 0
> r20: 0x10c1000
> r21: 0x4
> r22: 0x180abd4
> r23: 0x1803a28
> r24: 0xc000000000008760
> r25: 0xcc89b8 smp_no...
> r26: 0xcea108 ofw_rend...
> r27: 0x894b58 ofwcall+0xa8
> r28: 0x894b58 ofwcall+0xa8
> r29: 2400022
> r30: 9000000000001032
> r31: 0xbb7d38
> 
> srr0: 0x102720 k_trap+0x28
> srr1: 9000000000001032
> lr: 0x1026f0 u_trap+0x10
> ctr: 0xff846d78
> cr: 2000deb0
> xer: 0
> dar: f...d50 (lots of f's)
> dsisr: 42000000
> 
> 
> 
> 
> 
> 
> ===
> Mark Millard
> markmi at dsl-only.net
> 
> On Sep 20, 2014, at 3:42 PM, Mark Millard <markmi at dsl-only.net> wrote:
> 
> [I corrected the SSR0 in the subject to be SRR0.]
> 
> I did miss a register in my list (it matched the shown r30 value). And it turns out to probably be very important to interpreting what the "show registers" is reporting:
> 
> SRR1: 0x9000000000001032
> 
> But bits 43-46 of SRR1 are supposed to indicate which type of Program Exception, using a single binary 1 to so. No such 1's are present.
> 
> Illegal instruction would have been bit 44 being 1. (PowerPC has the upper bit numbered zero and increases from there.)
> 
> So the ddb "show registers" is apparently not reporting the status as of when the "stopped at 0 illegal instruction 0" happened. Thus other things are also likely not from that exact time frame.
> 
> 
> 
> And I misinterpreted the LR value status: The LR value was just left over from the restore_kernsrs returning when it finished. Execution then flowed into k_trap. Nothing unusual involved.
> 
> 
> 
> 
> 
> ===
> Mark Millard
> markmi at dsl-only.net
> 
> On Sep 18, 2014, at 8:57 PM, Mark Millard <markmi at dsl-only.net> wrote:
> 
> I modified DDB to automatically "show registers" even at the early "before Copyright" crash time. The end of this note will show the /usr/src/sys/ddb/db_script.c diff for the hack. While I also had DDB bt, the bt does not actually print a back trace for this context. (It might for others.)
> 
> The registers give interesting context despite the lack of a back trace. I do not know if it will be sufficient to be of much immediate help if someone used the information to start looking at the problem.
> 
> I'll start with register lr: 0x1026f0 u_trap+0x10.
> 
> /usr/src/sys/powerpc/aim/trap_subr64.S has:
> 
> s_trap:
>         bf      17,k_trap               /* branch if PSL_PR is false */
>         GET_CPUINFO(%r1)
> u_trap:
>         ld      %r1,PC_CURPCB(%r1)
>         mr      %r27,%r28               /* Save LR, r29 */
>         mtsprg2 %r29
>         bl      restore_kernsrs         /* enable kernel mapping */
>         mfsprg2 %r29
>         mr      %r28,%r27
> 
> /*
>  * Now the common trap catching code.
>  */
> k_trap:
>         FRAME_SETUP(PC_TEMPSAVE)
> /* Call C interrupt dispatcher: */
> trapagain:
> 
> and so this appears to indicate a pending return to execute the "mfsprg2 %r29" after "bl restore_kernsrs", which indicates that restore_kernsrs should be active.
> 
> But register srr0 indicates: 0x102720 k_trap+0x28. (So apparently in FRAME_SETUP(PC_TEMPSAVE) someplace.)
> 
> So it appears to me that the processor got to the k_trap code during the supposed restore_kernsrs time frame. (But I'm no expert at these sorts of things or for the processor.)
> 
> I'll list the other register values:
> 
> r0:  0
> r1:  0
> r2:  0xc1be80 M_AUDITBSM
> r3:  0xb16138
> r4:  0x8926e8 .ofwcall+0xa8
> r5:  0
> r6:  0xbb5f90
> r7:  0xe3d118 ofw_real_mode
> r8:  0x1
> r9:  0xe0ce80 __pcpu
> r10: 0x1c35ec9
> r11: 0
> r12: 0x10000000
> r13: db890    thread0
> r14-r19: all 0
> r20: 0x10bc000
> r21: 0x4
> r22: 0x1801db4
> r23: 0x1803a28
> r24: 0xc000000000008760
> r25: 0xcc6908 smp_no_rendevous_barrier
> r26: 0xec79e0 ofw_rendezvous_dispatch (yep one has v and the other zv)
> r27: 0x8926e8 .ofwcall+0xa8
> r28: 0x8926e8 .ofwcall+0xa8 (yep: same value)
> r29: 0x24000022
> r30: 0x9000000000001032
> r31: 0xc7f488 vop_unlock_desc
> 
> ctr: 0xff846d78
> cr:  0x2000d7b0
> xer: 0
> dar: 0xfffffffffffffd50
> dsisr: 0x42000000
> 
> (Hopefully this manual transcription from the screen display is complete --and also accurate for what it does present.)
> 
> 
> 
> 
> The personal HACK to /usr/src/sys/ddb/db_script.c's db_script_kdbenter(...) to have it show registers and try bt...
> 
> $ cd /usr/src/sys/ddb/
> $ svnlite diff .
> Index: db_script.c
> ===================================================================
> --- db_script.c (revision 271610)
> +++ db_script.c (working copy)
> @@ -319,10 +319,25 @@
>  {
>   char scriptname[DB_MAXSCRIPTNAME];
>  
> + /* HACK!!! : Additional lines to force a basic default script to exist.
> +  * Will dump information even if ddb input is not available for early crash.
> +  * Used to get more information about PowerMac G5 "before Copyright" hangs.
> +  */
> + struct ddb_script *dsp = db_script_lookup(DB_SCRIPT_KDBENTER_DEFAULT);
> + if (!dsp) db_script_set(DB_SCRIPT_KDBENTER_DEFAULT, "show registers; bt");
> +
>   snprintf(scriptname, sizeof(scriptname), "%s.%s",
>       DB_SCRIPT_KDBENTER_PREFIX, eventname);
>   if (db_script_exec(scriptname, 0) == ENOENT)
>   (void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0);
> +
> + /* HACK!!! : Additional lines to always use the default script,
> +  *           even if scriptname existed and was executed.
> +  * Will dump information even if ddb input is not available for early crash.
> +  * Used to get more information about PowerMac G5 "before Copyright" hangs.
> +  */
> + else
> + (void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0);
>  }
>  
>  /*-
> 
> 
> 
> ===
> Mark Millard
> markmi at dsl-only.net
> 
> On Sep 16, 2014, at 9:28 PM, Mark Millard <markmi at dsl-only.net> wrote:
> 
> In part I sent directly to you because of a past exchange (July-27) where you had written:
> 
>> Nathan and I both speculate that it's
>> dropping into Open Firmware (we make extensive use of OFW), and then
>> messing something up, taking a page fault or something.
> 
> The specific text that I report and its uniformity when it is produced seems to add a little information beyond a speculated "page fault or something" and so might eventually help a little. As I understand the text it is reporting execution reaching address zero without any prior un-handled exceptions or other such that would stop it. A corrupted stack (pointer) so a bad return address or some such? I'd guess there are no explicit jumps to address zero so I expect that indirection is likely involved, with the content for the indirection messed up.
> 
> I really wish that I had a logic analyzer configuration for this. I've not found a way to make the failing context visible so far and the extra way of looking at things might have helped.
> 
> 
> 
> 
> ===
> Mark Millard
> markmi at dsl-only.net
> 
> On Sep 16, 2014, at 8:28 PM, Justin Hibbits <chmeeedalf at gmail.com> wrote:
> 
> Hi mark,
> 
> I see this on my G5, and I think it's due to the amount of RAM in the machine. More than 4gb seems to confuse open firmware when called by FreeBSD. There is some effort to remove the need of the callbacks but thus far it's not far along. The good news is that after it boots it's solid except when switching vtys, buy earlier this year or last year I added a sysctl hack to disable the call into open firmware on vty switch (don't recall offhand and not at my computer right now, but if you grep the sysctl output for reset and ofw you can find it).
> 
> -Justin
> 
> On Sep 16, 2014 8:01 PM, "Mark Millard" <markmi at dsl-only.net> wrote:
> I've now spent time with rebooting and power-off/power-on for all 3 PowerMac G5's (one PowerMac7,2 and two PowerMac11,2's) and all 3 get the
> 
>> GDB: no debug ports present
>> KDB: debugger backends: DDB
>> KDB: current backend: DDB
>> [ thread pid -1 tid 1006665719 ]
>> Stopped at 0: illegal instruction 0
>> db>
> 
> when they fail just before the Copyright notice would normally be displayed. None fail any earlier. At that spot none have failed any other way. It is the same SSD in all 3. (Happens with other SSD's as well.) Overall there is a mix of Radeon and NVIDIA display boards. Besides the SSD use and RAM upgrades the rest is stock equipment. scons used, not vt. (I've yet to try vt.)
> 
> Seeing a failure after the Copyright notice as been fairly rare in all my experiments from when I started last April or so. The ones that I've noted had Data Storage Interrupt reported. So far no examples of the above have been reported after the Copyright notice. So I'd guess that they are separate issues. Of course it seems that only in the last few days would I have seen the above sort of thing if it did happen after the Copyright notice: The prior history does not count for judgements about that.
> 
> ===
> Mark Millard
> markmi at dsl-only.net
> 
> On Sep 16, 2014, at 8:15 AM, Mark Millard <markmi at dsl-only.net> wrote:
> 
> Using 10.1-BETA1 I added "options DDB" and "options GDB" to powerpc64's GENERIC64. (I also used WITH_DEBUG_FILES=, WITHOUT_CLANG=, and WITH_DEBUG= in /etc/make.conf.) So buildworld, kernel was basically just set up to have more of a debugging context around (including for any ports builds).
> 
> The result was new information about the PowerMac G5 boot hangups: The screen is no longer blank when the G5 is hung up without there being a Copyright notice yet. It says...
> 
>> GDB: no debug ports present
>> KDB: debugger backends: DDB
>> KDB: current backend: DDB
>> [ thread pid -1 tid 1006665719 ]
>> Stopped at 0: illegal instruction 0
>> db>
> 
> (I had no ability to input at that point.) Normally the Copyright notice would have displayed instead of "[...]" and what follows. (I do not claim to have all the spacing, capitalization, and such correct above.)
> 
> That text is constant from hang to hang when it hangs just before it would normally output the Copyright notice: The numbers do not vary, much less the other text. It has never failed until after the two KDB messages are present. So far I've only tested one PowerMac G5, booting over and over for a few hours.
> 
> 
> 
> (I do not claim to be set up for remote kernel debugging. I just decided to let GDB go along for the ride when I added DDB.)
> 
> ===
> Mark Millard
> markmi at dsl-only.net
> 
> 
> 
> 
> 
> 





More information about the freebsd-ppc mailing list