Cubieboard: Spurious interrupt detected

Sat Sep 6 14:54:34 UTC 2014

On Fri, 2014-09-05 at 23:45 -0700, Adrian Chadd wrote:
> On 5 September 2014 21:54, John-Mark Gurney <jmg at funkthat.com> wrote:
> > Ian Lepore wrote this message on Fri, Sep 05, 2014 at 19:33 -0600:
> >> On Fri, 2014-09-05 at 18:15 -0700, John-Mark Gurney wrote:
> >> > Adrian Chadd wrote this message on Fri, Sep 05, 2014 at 17:44 -0700:
> >> > > On 5 September 2014 16:11, Ian Lepore <ian at freebsd.org> wrote:
> >> > > > On Fri, 2014-09-05 at 23:57 +0200, Bernd Walter wrote:
> >> > > >> On Sat, Sep 06, 2014 at 01:43:23AM +0400, Maxim V FIlimonov wrote:
> >> > > >> > And another problem: every now and then the kernel says something like this:
> >> > > >> > Sep  5 19:22:37  kernel: Spurious interrupt detected
> >> > > >> > Sep  5 19:22:37  kernel: Spurious interrupt detected
> >> > > >> > Sep  5 19:23:46  last message repeated 10 times
> >> > > >> >
> >> > > >> > I've heard that FreeBSD happens to do that on ARM devices. What could be the
> >> > > >> > problem here?
> >> > > >>
> >> > > >> Means something generates interrupts that are not handled by a driver.
> >> > > >> Could be the cause of your load problem too.
> >> > > >>
> >> > > >
> >> > > > No, that would be stray interrupts.  Spurious interrupts happen when an
> >> > > > interrupt is asserted, but by the time the processor asks the interrupt
> >> > > > controller for the current active interrupt, it is no longer active.
> >> > > >
> >> > > > One way it can happen is when an interrupt handler writes to a device to
> >> > > > clear a pending interrupt and that write takes a long time to complete
> >> > > > because the device is on a slow bus, and the interrupt controller is on
> >> > > > a faster bus.  The EOI to the controller outraces the device write that
> >> > > > would clear the pending interrupt condition, so the processor is
> >> > > > re-interrupted, but by the time it asks for the next active interrupt the
> >> > > > device write has finally completed and the interrupt is no longer
> >> > > > pending.
> >> > > >
> >> > > > That sequence used to happen a lot, and it was "fixed" by adding an
> >> > > > l2cache sync (basically a "drain write buffer") just before an EOI.  You
> >> > > > sometimes still see an occasional spurious interrupt, but it shouldn't
> >> > > > be happening multiple times per second as seen in the logging above.
> >> > >
> >> > > Hm, interesting. I remember your discussion about it on IRC. The
> >> > > atheros code ends up working around this in the driver by doing a read
> >> > > from the ISR after writing out bits to clear things, so the clear is
> >> > > flushed out.
> >> > >
> >> > > I wonder if we should be asking all device drivers to be doing their
> >> > > own ISR flushing before returning from their interrupt handlers.
> >> >
> >> > This is required on PCI (that you do a read to clear the posted/pending
> >> > write)...    So, IMO, yes, all device drivers should do the proper
> >> > clearing of their writes to the ISR...
> >> >
> >>
> >> But a driver can't assume that a read is sufficient on all architectures
> >> it may run on.  bus_space_barrier() is the right way.  Also, it's not
> >
> > Except that I don't think even on PCI a bus_space_barrier is sufficient...
> 
> It isn't.
> 
> The device itself may have FIFOs and internal busses that also need to
> be flushed.
> 

The question isn't whether or not it's sufficient, because it's
necessary.  The device driver knows what its hardware requirements are
and should meet them.  It doesn't know what its parent bus
requirements are, and that's why it must call bus_space_barrier() to
handle architecture needs above the level of the device itself.

> > I was just looking at i386's implementation of bus_space_barrier and
> > it just does a stack access...  This won't be sufficient to clear any
> > PCI bridges that may have the write still pending...
> 
> Right. The memory barrier semantics right now don't at all guarantee
> that bus and device FIFOs have actually been flushed.
> 

The fact that some architectures don't implement bus_space_barrier() in
a way that's useful for that architecture is just a bug.  It doesn't
change the fact that bus_space_barrier() is currently our only defined
MI interface to barriers in device space.

> So I don't think doing it using the existing bus space barrier
> semantics is 'right'. For interrupts, it's highly likely that we do
> actually need device drivers to read from their interrupt register to
> ensure the update has been posted before returning. That's better than
> causing entire L2 cache flushes.
> 

Where did you see code that does an "entire L2 cache flush"?  You
didn't, you just saw a function name and made assumptions about what it
does.  The fact is, it does what is necessary for the architecture.  It
also happens to do what a write-then-read does on armv6, but that's
exactly the sort of assumption that should NOT be written into MI
code.  

> Question is - can we expose this somehow as a generic device method,
> so the higher bus layers can actually do something with it, or should
> we just leave it to device drivers to correctly do?
> 

In what way is that not EXACTLY what bus_space_barrier() is defined to
do?

I've got to say, I don't understand all this pushback.  We have one
function defined already that's intended to be the machine-independent
way to ensure that any previous access to hardware using bus_space reads
and writes has completed, so why all this argument that it is not the
function to use?

-- Ian

> (Also - do any of the FreeBSD device driver books or the handbook mention this?)
> 
> 
> 
> -a