157k interrupts per second causing 60% CPU load on idle system

Sat Apr 7 05:08:46 UTC 2012

On 7 April 2012 14:31, Matt Thyer <matt.thyer at gmail.com> wrote:

> On 5 April 2012 01:18, Freddie Cash <fjwcash at gmail.com> wrote:
>
>> On Wed, Apr 4, 2012 at 5:19 AM, Matt Thyer <matt.thyer at gmail.com> wrote:
>> > So it seems that both the old and new mps driver have a problem with the
>> > Western Digital WD20EARX SATA 3 drive on a SuperMicro AOC-USAS2-L8i (SAS
>> > 6G) controller (flashed with -IT firmware).
>>
>> I wouldn't say the driver has a problem with that specific drive.
>> More that it might have a problem with a mixed SATA2/SATA3 setup.
>>
> Sorry, that's what I meant to say but it now seems that the 157K
> interrupts per second is probably not due to the SuperMicro AOC-USAS2-L8i.
>
> Since moving the SATA 3 disk to the onboard Intel SATA 2 controller I'm no
> longer having that disk evicted from the raidz2 pool with write errors and
> I thought that the high interrupt rate issue had also been solved but it's
> back again.
>
> This is on 8-STABLE at revision 230921 (before the new driver hit
> 8-STABLE).
>
> So now I need to go back to trying to determine what the cause is.
>
> I'll stop posting in this thread as I don't think it's anything to do with
> either the old or new version of this driver.
>

Oops... wrong thread; I thought I was replying in -CURRENT.

So on to the root cause.

vmstat -i has shown that the issue was on irq 16.
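
As far as I know, the "rate" column of vmstat -i is an average since boot, so an intermittent storm is easier to spot by differencing two samples of the "total" column. A minimal sketch, assuming the first field of each line is the source name followed by a colon (e.g. "irq16:") and the last two fields are total and rate; the helper names irq_total and irq_rate are made up for illustration:

```sh
#!/bin/sh
# Read `vmstat -i` output on stdin and print the running total for one
# interrupt source, e.g. "irq16". Assumes the first field is "<source>:" and
# the last two fields are the total and the since-boot average rate.
irq_total() {
    awk -v src="$1:" '$1 == src { print $(NF - 1) }'
}

# Average interrupts per second between two totals taken $3 seconds apart
# (integer division, so the result is rounded down).
irq_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Hypothetical usage on the affected machine:
#   before=$(vmstat -i | irq_total irq16); sleep 10
#   after=$(vmstat -i | irq_total irq16)
#   irq_rate "$before" "$after" 10
```

Running this in a loop while the box is otherwise idle should show when the rate on irq 16 jumps, which may help correlate it with whatever else was going on at the time.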

Unfortunately there seem to be a lot of things on irq 16:

$  dmesg | grep "irq 16"
pcib1: <PCI-PCI bridge> irq 16 at device 1.0 on pci0
mps0: <LSI SAS2008> port 0xee00-0xeeff mem 0xfbdfc000-0xfbdfffff,0xfbd80000-0xfbdbffff irq 16 at device 0.0 on pci1
vgapci0: <VGA-compatible display> port 0xff00-0xff07 mem 0xfb400000-0xfb7fffff,0xe0000000-0xefffffff irq 16 at device 2.0 on pci0
uhci0: <UHCI (generic) USB controller> port 0xfe00-0xfe1f irq 16 at device 26.0 on pci0
pcib2: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
pcib3: <ACPI PCI-PCI bridge> irq 16 at device 28.4 on pci0
atapci0: <JMicron JMB368 UDMA133 controller> port 0xdf00-0xdf07,0xde00-0xde03,0xdd00-0xdd07,0xdc00-0xdc03,0xdb00-0xdb0f irq 16 at device 0.0 on pci3

Any idea how to isolate which bit of hardware could be triggering the
interrupts?
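
To at least get a clean list of candidates, the dmesg output can be reduced to the device names sharing the line. A rough sketch, assuming unwrapped dmesg lines of the form "name: <description> ... irq N at device ..."; the function name devices_on_irq is hypothetical:

```sh
#!/bin/sh
# Read dmesg text on stdin and print the unique device names whose probe
# line mentions "irq <N>". Assumes each device is reported on a single,
# unwrapped line that starts with "<name>:".
devices_on_irq() {
    awk -v irq="$1" '
        {
            for (i = 1; i < NF; i++) {
                if ($i == "irq" && $(i + 1) == irq) {
                    sub(/:$/, "", $1)   # strip the trailing colon from the name
                    print $1
                }
            }
        }
    ' | sort -u
}

# Hypothetical usage:
#   dmesg | devices_on_irq 16
```

The sort -u also collapses repeats when the message buffer still holds the probe lines from an earlier boot. This only narrows down what shares the line; it does not show which device is actually raising the interrupts.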

Unfortunately, the only device I could remove is the SuperMicro
AOC-USAS2-L8i (so yes, I could eliminate that).

My biggest problem right now is not knowing how to trigger the issue.

At this stage I'm going to upgrade to 9-STABLE and see if it returns.