Re: ARM64 system error

From: John F Carr <jfc_at_mit.edu>
Date: Wed, 03 Aug 2022 16:50:49 UTC

> On Aug 3, 2022, at 12:28 , Andrew Turner <andrew@fubar.geek.nz> wrote:
> 
> 
>> On 31 Jul 2022, at 17:55, John F Carr <jfc@mit.edu> wrote:
>> 
>> My OverDrive 1000 (Cortex A57) running CURRENT just crashed with the unhelpful message "panic: Unhandled System Error".  Is there any way to get better information?  The ESR value bf000000 translates to "system error with implementation-defined code 0" so that's not much use.  The instruction associated with the interrupt can't fault ("subs w22, w22, #0x1") so it must be an asynchronous error.  On other systems I've seen bits you can test or registers you can read to get details.
> 
> By my reading of the Cortex-A57 documentation [1] I think the ESR value shows the exception can be attributed to the current core, is containable to a given code sequence, and is a decode error.
> 
> It’s likely due to msk_phy_readreg accessing the phy, but it doesn’t respond quickly enough.
> 
> Does an older kernel boot? If so, can you try bisecting to find which commit caused the panic?

Thanks, I missed that bit of documentation.
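
For anyone else reading along, here is a rough sketch of how the architectural part of that ESR value splits up. It only uses the generic ESR_ELx layout (EC in bits [31:26], IL in bit 25, ISS in bits [24:0]) and the IDS bit that applies to SError syndromes. The helper names are my own, not anything from the kernel, and the meaning of ISS[23:0] when IDS is set is implementation defined, so the Cortex-A57 manual is still needed for that part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper names; only the bit positions come from the architecture. */
#define EC_SERROR 0x2fu /* exception class: SError interrupt */

/* Exception class, ESR bits [31:26]. */
static inline uint32_t esr_ec(uint32_t esr) { return (esr >> 26) & 0x3fu; }

/* Instruction length bit, ESR bit 25. */
static inline uint32_t esr_il(uint32_t esr) { return (esr >> 25) & 0x1u; }

/* For an SError, ISS bit 24 (IDS) set means ISS[23:0] is implementation defined. */
static inline uint32_t esr_serror_ids(uint32_t esr) { return (esr >> 24) & 0x1u; }

/* The low 24 syndrome bits; their meaning depends on IDS and on the core. */
static inline uint32_t esr_serror_impl(uint32_t esr) { return esr & 0xffffffu; }

/* Format the decoded fields into buf; returns what snprintf returns. */
static inline int esr_describe(uint32_t esr, char *buf, size_t len)
{
    assert(buf != NULL);
    return snprintf(buf, len, "EC=0x%02x IL=%u IDS=%u ISS[23:0]=0x%06x",
        (unsigned)esr_ec(esr), (unsigned)esr_il(esr),
        (unsigned)esr_serror_ids(esr), (unsigned)esr_serror_impl(esr));
}
```

Feeding 0xbf000000 through these gives EC 0x2f (SError), IL 1, IDS 1, and an implementation-defined field of 0, which matches the "implementation-defined code 0" reading above.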

The same kernel worked after reboot with the same networking configuration.  The theory of a slow response from an I/O device sounds good.

Is there an easy way to trigger a system error to test error handling code?  For example, I once debugged a machine check handler (IBM lingo) by using a control/debug register that could intentionally write bad ECC to RAM.