em interrupt storm
Scott Long
scottl at samsco.org
Thu Nov 24 04:52:16 GMT 2005
Michael Vince wrote:
> Scott Long wrote:
>
>> Michael Vince wrote:
>>
>>> Kris Kennaway wrote:
>>>
>>>> On Tue, Nov 22, 2005 at 08:54:49PM -0800, John Polstra wrote:
>>>>
>>>>
>>>>> On 23-Nov-2005 Kris Kennaway wrote:
>>>>>
>>>>>
>>>>>> I am seeing the em driver undergoing an interrupt storm whenever the
>>>>>> amr driver receives interrupts. In this case I was running newfs on
>>>>>> the amr array and em0 was not in use:
>>>>>>
>>>>>> 28 root 1 -68 -187 0K 8K CPU1 1 0:32 53.98%
>>>>>> irq16: em0
>>>>>> 36 root 1 -64 -183 0K 8K RUN 1 0:37 27.75%
>>>>>> irq24: amr0
>>>>>>
>>>>>> # vmstat -i
>>>>>> interrupt total rate
>>>>>> irq1: atkbd0 2 0
>>>>>> irq4: sio0 199 1
>>>>>> irq6: fdc0 32 0
>>>>>> irq13: npx0 1 0
>>>>>> irq14: ata0 47 0
>>>>>> irq15: ata1 931 5
>>>>>> irq16: em0 6321801 37187
>>>>>> irq24: amr0 28023 164
>>>>>> cpu0: timer 337533 1985
>>>>>> cpu1: timer 337285 1984
>>>>>> Total 7025854 41328
>>>>>>
>>>>>> When newfs finished (i.e. amr was idle), em0 stopped storming.
>>>>>>
>>>>>> MPTable: <INTEL SE7520BD22 >
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> This is the dreaded interrupt aliasing problem that several of us have
>>>>> experienced with this chipset. High-numbered interrupts alias down to
>>>>> interrupts in the range 16..19 (or maybe 16..23), a multiple of 8 less
>>>>> than the original interupt.
>>>>>
>>>>> Nobody knows what causes it, and nobody knows how to fix it.
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>> This would be good to document somewhere so that people don't either
>>>> accidentally buy this hardware, or know what to expect when they run
>>>> it.
>>>>
>>>> Kris
>>>>
>>>>
>>> This is Intels latest server chipset designs and Dell are putting
>>> that chipset in all their servers.
>>> Luckily I haven't not seen the problem on any of my Dell servers (as
>>> long as I am looking at this right).
>>>
>>> This server has been running for a long time.
>>> vmstat -i
>>> interrupt total rate
>>> irq1: atkbd0 6 0
>>> irq4: sio0 23433 0
>>> irq6: fdc0 10 0
>>> irq8: rtc 2631238611 128
>>> irq13: npx0 1 0
>>> irq14: ata0 99 0
>>> irq16: uhci0 1507608958 73
>>> irq18: uhci2 42005524 2
>>> irq19: uhci1 3 0
>>> irq23: atapci0 151 0
>>> irq46: amr0 41344088 2
>>> irq64: em0 1513106157 73
>>> irq0: clk 2055605782 99
>>> Total 7790932823 379
>>>
>>> This one just transfered over 8gigs of data in 77seconds with around
>>> 1000 simultaneous tcp connections under a load of 35. Both seem OK.
>>> vmstat -i
>>> interrupt total rate
>>> irq4: sio0 315 0
>>> irq13: npx0 1 0
>>> irq14: ata0 47 0
>>> irq16: uhci0 2894669 2
>>> irq18: uhci2 977413 0
>>> irq23: ehci0 3 0
>>> irq46: amr0 883138 0
>>> irq64: em0 2890414 2
>>> cpu0: timer 2763566717 1999
>>> cpu3: timer 2763797300 1999
>>> cpu1: timer 2763551479 1999
>>> cpu2: timer 2763797870 1999
>>> Total 11062359366 8004
>>>
>>> Mike
>>>
>>>
>>
>> Looks like at least some of your interrupts are being aliased to
>> irq16, which just happens to be USB(uhci) in this case. Note that the
>> rate is
>> the same between irq64 and irq16, and the totals are pretty close. If
>> you don't need USB, I'd suggest turning it off.
>>
>> Scott
>
>
> Most of my Dell servers occasionally use the USB ports to serial out via
> tip using a usb2serial cable with the uplcom driver and then into
> another servers real serial port (sio) so its not really an option to
> disable USB.
>
> How much do you think it affects performance if the USB device is
> actually rarely used.
>
> I also have a 6-stable machine and noticed that the vmstat -i output
> lists the em and usb together, but em0 isn't used at all, em2 and em3
> are the active ones, it doesn't seem reasonable that my usb serial usage
> would be that high for irq16 or could it be that em2 and em3 and also
> going through irq16
>
> vmstat -i
> interrupt total rate
> irq4: sio0 228 0
> irq14: ata0 47 0
> irq16: em0 uhci0 917039 11
> irq18: uhci2 54823 0
> irq23: ehci0 3 0
> irq46: amr0 45998 0
> irq64: em2 898628 11
> lapic0: timer 159140889 1999
> Total 161057655 2024
>
> Mike
>
You're seeing the same aliasing. Here's exactly what happens:
irq64 generates an interrupt and the APIC routes it to the CPU.
FreeBSD services the interrupt, masks irq64 in the APIC, and schedules
the interrupt thread for em2. When that ithread runs, the driver
will take action to turn off the interrupt at the card, and then
irq64 will be unmasked when the ithread completes. PCI interrupts are
level driven, so they stay active until the driver tells the hardware to
turn them off. However, as soon as the interrupt get masked, the APIC
re-routes the signal to irq16. Since irq16 is an interrupt that freebsd
cares about, it's enabled, so FreeBSD gets interrupted again. It then
masks irq16 and schedules the em0/uhci0 ithread.
Servicing irq16 doesn't take a whole lot of time, but running the
ithread for it does. The uhci driver has to grab the Giant lock, so
things like the bufdaemon get locked out while it runs. The em0 driver
also has to grab a lock, but according to you that interface isnt' doing
any work so it's likely to not be a big deal. But even so, these lock
operations consume thousands of CPU cycles, plus the drivers have to
read status either on the card or in memory, and that will also take
thousands of cycles, maybe more (especially for UHCI). Add that up,
and the CPU is spending a lot of time doing useless work. On a lightly
loaded system you won't notice a difference, but if you're trying to do
something like push lots of packets through, it quickly becomes
noticable.
Masking the interrupt in the APIC is vital so that the PCI interrupt can
be silenced while the ithread gets scheduled to run. We can't EOI the
APIC until it's silenced or else we'll get immediately re-interrupted,
and if we don't EOI then we can't get any other interrupts. This scheme
works fairly well for typical PC hardware, and these Intel chipsets seem
to be the exception, albeit a popular one. One solution that is being
worked on is to allow drivers to test and clear the hardware interrupt
registers directly in the low-level interrupt handler, before their
ithread runs. That'll eliminate the need to mask the APIC. This
probably won't work for all drivers, but it should work for drivers like
if_em. I've tried doing tricks like deferring the EOI and the masking,
but the OS quickly looses its mind because it startsmissing clock
interrupts.
Scott
More information about the freebsd-current
mailing list