Strange crash on wandboard

Wed Aug 7 03:57:22 UTC 2013

Okay, this is so strange I've just got to share it...  I've been having
trouble with Wandboard (Solo) bringup and have tracked the problem down
to the return from the first interrupt that happens.  (It's a clock
interrupt, but I don't think that's really germane.)

It's as if PULLFRAMEFROMSVCANDEXIT wasn't restoring the registers
correctly.  At first the corruption hit the PC, which is damn hard to
debug.  But after figuring out just where it was happening in the code
(spinlock_exit()) and inserting some extra debugging printfs, things
changed a bit and now a different register is getting blasted.

Here's what I get at runtime:

        clock intr exit
        returned: intr_event_handle

        vm_fault(0xc0cca000, e46ab000, 1, 0) -> 1
        Fatal kernel mode data abort: 'Translation Fault (S)'
        trapframe: 0xdd3ffe24
        FSR=00000005, FAR=e46abdc0, spsr=600de613
        r0 =600001d3, r1 =60000113, r2 =000000c0, r3 =e46abdc0
        r4 =c271f620, r5 =c271cbf0, r6 =00000000, r7 =dd3ffea8
        r8 =c08d08f4, r9 =00000000, r10=00000000, r11=dd3ffe80
        r12=dd3ffe70, ssp=dd3ffe70, slr=c0af2bb4, pc =c0af2be8

        [ thread pid 12 tid 100006 ]
        Stopped at      spinlock_exit+0x5c:     ldr     r1, [r3]
        db>  

Here's the asm code around the fault point:

        c0af2bd4:   e10f0000    mrs r0, CPSR
        c0af2bd8:   e1c01002    bic r1, r0, r2
        c0af2bdc:   e0211003    eor r1, r1, r3
        c0af2be0:   e121f001    msr CPSR_c, r1
        c0af2be4:   e59f3024    ldr r3, [pc, #36]   ; c0af2c10
        c0af2be8:   e5931000    ldr r1, [r3]
        c0af2bec:   e3510000    cmp r1, #0  ; 0x0
        ....
        c0af2c10:   c0bd6ae4    adcgts  r6, sp, r4, ror #21
        c0af2c14:   c0b4e0e8    adcgts  lr, r4, r8, ror #1
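
As a sanity check on the listing, the address that the "ldr r3, [pc, #36]"
reads from can be recomputed from the instruction word.  This is only a
sketch: it assumes ARM (A32) state, where pc reads as the instruction's
address plus 8, and the helper name literal_address is made up for
illustration.

```python
# Decode "ldr r3, [pc, #36]" (encoding 0xe59f3024) from the listing.
# Assumption: A32 state, so pc reads as the instruction address + 8.

INSN_ADDR = 0xC0AF2BE4   # address of the ldr, from the disassembly
INSN_WORD = 0xE59F3024   # its encoding, from the disassembly


def literal_address(insn_addr: int, insn_word: int) -> int:
    """Return the address an A32 'ldr Rt, [pc, #imm12]' reads from."""
    imm12 = insn_word & 0xFFF        # low 12 bits: unsigned offset
    add = (insn_word >> 23) & 1      # U bit: 1 = add, 0 = subtract
    base = insn_addr + 8             # pc as seen by the instruction
    return base + imm12 if add else base - imm12


addr = literal_address(INSN_ADDR, INSN_WORD)
print(hex(addr))  # 0xc0af2c10
```

That matches the "; c0af2c10" annotation, so r3 should end up holding
the word stored there, 0xc0bd6ae4.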

Okay, so the msr instruction re-enables interrupts, and the next
instruction loads r3 with the constant value 0xc0bd6ae4.  Then an
interrupt happens.  (Other instrumentation in PULLFRAMEFROMSVCANDEXIT on
previous runs shows that this is the case every time, 100% reproducible,
but that instrumentation destroys registers it shouldn't, so it's not
present in the run shown above.)  After the interrupt is handled,
control returns to the instruction at c0af2be8, which faults.
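
The "re-enables interrupts" part can be checked against the register
dump.  A rough sketch, assuming the dumped r0 and r2 still hold what the
mrs and bic saw (the faulting ldr never completed, so r1 should still be
the value written by the msr).  The I_BIT and F_BIT names are mine; the
bit positions are the standard ARM CPSR ones.

```python
# Recompute the CPSR value written by "msr CPSR_c, r1" from the dump.
I_BIT = 1 << 7   # IRQ mask bit in the ARM CPSR
F_BIT = 1 << 6   # FIQ mask bit in the ARM CPSR

r0 = 0x600001D3  # CPSR as read by "mrs r0, CPSR" (from the dump)
r2 = 0x000000C0  # mask operand of the bic (from the dump)

# bic r1, r0, r2: clear in r0 every bit that is set in r2.
r1 = r0 & ~r2 & 0xFFFFFFFF

print(hex(r1))                              # 0x60000113
print(bool(r0 & I_BIT), bool(r1 & I_BIT))   # True False
```

The result equals the r1 in the dump, which suggests the eor that
follows the bic changed nothing on this run.  That is an inference: the
r3 value it used was overwritten by the later ldr and isn't in the dump.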

Now here's the strange part.  Look at the fault-time r3 contents:
0xe46abdc0.  That's the byte-reverse of 0xc0bd6ae4, the value it should
have.  It's been restored wrong-endian.  And it's just that one
register, out of the whole set restored by a single
"ldmia sp, {r0-r14}^" instruction.
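
The byte-reversal is easy to confirm mechanically.  A minimal sketch,
using the literal from the listing and the r3/FAR value from the dump.
The bswap32 helper is just an illustrative name.

```python
expected = 0xC0BD6AE4   # literal stored at c0af2c10 (from the listing)
observed = 0xE46ABDC0   # r3 and FAR at fault time (from the dump)


def bswap32(value: int) -> int:
    """Reverse the byte order of a 32-bit value."""
    return int.from_bytes(value.to_bytes(4, "big"), "little")


print(hex(bswap32(expected)))          # 0xe46abdc0
print(bswap32(expected) == observed)   # True
```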

I don't know what to make of it.  It seems like a hardware error of some
sort.

-- Ian