can someone explain...[ PCI interrupts]
John Baldwin
jhb at FreeBSD.org
Tue Dec 6 21:40:36 PST 2005
On Dec 6, 2005, at 5:57 PM, Julian Elischer wrote:
> In short words for the likes of me,
> Can someone give a quick roundup on PCI routing in 4.x and -current.
My, what a set of questions. :) I'll do my best, but this will
probably be a long and perhaps wandering e-mail.
First off, interrupts for PCI devices are roughly split up into two
categories (currently): INTx interrupt lines and MSI interrupts. MSI
is relatively new and I won't cover it much here. No versions of
FreeBSD currently support MSI either (though it's on my todo list),
so I'll limit this discussion to INTx interrupts. For INTx
interrupts, each PCI device (or slot) has 4 interrupt lines: INTA,
INTB, INTC, and INTD. Thus, you can describe any individual PCI
interrupt as a tuple of (bus, slot, pin). For example, device 4's
INTA pin on pci bus 0 would be (0, 4, INTA). Each PCI function is
allowed to have one INTx interrupt. The bus and slot come from the
location of that function in the PCI hierarchy, and the pin comes
from the intpin PCI config register. Beyond the INTx pin, PCI doesn't
define how an interrupt is delivered to the CPU, etc. That is all
a property of the architecture, chipset, etc.
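To make the (bus, slot, pin) tuple concrete, here is a minimal sketch in C. The type and function names are made up for illustration and are not FreeBSD's. The one hard fact it relies on is that the PCI Interrupt Pin config register (offset 0x3d) reads 0 for "no INTx interrupt" and 1 through 4 for INTA through INTD.

```c
#include <assert.h>
#include <stdint.h>

/* Values of the PCI Interrupt Pin config register (offset 0x3d). */
enum pci_intpin {
    PCI_INTPIN_NONE = 0,    /* function does not use an INTx interrupt */
    PCI_INTA = 1,
    PCI_INTB = 2,
    PCI_INTC = 3,
    PCI_INTD = 4
};

/* A PCI INTx interrupt as a (bus, slot, pin) tuple (hypothetical type). */
struct pci_intr {
    uint8_t bus;
    uint8_t slot;           /* device number on the bus, 0-31 */
    enum pci_intpin pin;
};

/* Build the tuple from a function's location and its intpin register. */
struct pci_intr
pci_intr_make(uint8_t bus, uint8_t slot, uint8_t intpin_reg)
{
    struct pci_intr pi;

    assert(slot < 32 && intpin_reg <= PCI_INTD);
    pi.bus = bus;
    pi.slot = slot;
    pi.pin = (enum pci_intpin)intpin_reg;
    return (pi);
}

/* Human-readable pin name, e.g. for printing (0, 4, INTA). */
const char *
pci_intpin_name(enum pci_intpin pin)
{
    static const char *names[] = { "none", "INTA", "INTB", "INTC", "INTD" };

    return ((unsigned)pin <= PCI_INTD ? names[pin] : "invalid");
}
```

With this, pci_intr_make(0, 4, 1) describes the (0, 4, INTA) example above.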
On x86, there are two disparate sets of hardware for managing
interrupt signals. The first is the pair of 8259A interrupt
controllers found on all PC-AT compatible machines. The second set
of hardware is the APIC subsystem as it were. Each processor
contains a local APIC that can receive messages from other APICs and
send messages to other local APICs. In addition to the local APICs,
the chipset contains 1 or more I/O APICs. Each I/O APIC contains
anywhere from 4 to 32 individual interrupt pins. Common numbers are
4 (somewhat rare), 16, 24, and 32. Conceptually, on x86 a given
interrupt source can be described by the tuple (pic, pin).
Simply put, PCI interrupt routing is the mapping of (bus, slot, pin)
PCI interrupt tuples to (pic, pin) x86 interrupt tuples.
Now, before delving deeper into the specifics of routing on x86, let
me digress about IRQs on FreeBSD. Basically, an IRQ value is a
cookie useful for binding a device interrupt (such as a PCI (bus,
slot, pin) tuple or an ISA IRQ) to an x86 interrupt tuple (pic, pin).
BIOSes don't operate with APICs at all, at least not for handling
device interrupts. Thus, they all use a simple mapping where IRQs
0-7 correspond to pins 0-7 on the master 8259A, and IRQs 8-15 map to
pins 0-7 on the slave 8259A. All versions of FreeBSD use the same
mapping for IRQ cookie values when using the 8259As to route
interrupts. For the APIC case the mapping of IRQ cookies to (pic,
pin) tuples is slightly more complicated. First, the simple case.
FreeBSD 5.2 and later follow the ACPI model (even when not using
ACPI) where the IRQs 0-n correspond to the pins 0-n of the first I/O
APIC, IRQs n+1 to (n+1)+m map to pins 0 to m of the second I/O APIC,
etc. (There is one possible exception with ACPI I'll cover later.)
FreeBSD 4.x is more complicated. The reason is that due to cpl and
spl interrupt masks being 32-bit integers with 8 bits set aside for
software interrupts (SWIs), cpl only has 24 bits available for
hardware interrupts. Therefore, FreeBSD < 5.2 is limited to IRQ
values 0-23 and can't use the simple (and intuitive) model that
FreeBSD 5.2+ and ACPI use. What FreeBSD 4.x does is to map the ISA
interrupts attached to the first I/O APIC to IRQs 0, 1, and 3-15.
This just leaves IRQs 2 and 16-23 available for all the other APIC
interrupt pins. As each PCI device registers an interrupt handler
for a specific (apic, pin) tuple, that x86 interrupt is mapped to one
of the last set of IRQs. If all of them have been used already, then
the kernel starts assigning multiple (apic, pin) tuples to the same
IRQ resulting in interrupts being shared in software because of the
cpl limitation even though they aren't shared in hardware. This is
why your IRQ values are different on 4.x than on FreeBSD 5.2+ and
Linux which use the ACPI global interrupt number model.
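The three numbering schemes can be sketched roughly as follows. This is only a model: all names are hypothetical, and for the 4.x case the order in which the spare IRQs are handed out and then reused (here, round robin starting with IRQ 2) is an assumption, since only the pool itself is described above.

```c
#include <assert.h>

#define NIOAPIC_MAX 8

/* Hypothetical description of the I/O APICs in a system. */
struct ioapic_set {
    int count;                  /* number of I/O APICs */
    int npins[NIOAPIC_MAX];     /* pins on each I/O APIC, in order */
};

/*
 * 5.2+/ACPI model: IRQ = (pins on all earlier I/O APICs) + pin.
 * Returns -1 if (apic, pin) does not exist.
 */
int
irq_from_apic_pin(const struct ioapic_set *s, int apic, int pin)
{
    int base = 0;

    if (apic < 0 || apic >= s->count || pin < 0 || pin >= s->npins[apic])
        return (-1);
    for (int i = 0; i < apic; i++)
        base += s->npins[i];
    return (base + pin);
}

/* 8259A model: IRQs 0-7 are on the master, IRQs 8-15 on the slave. */
void
atpic_from_irq(int irq, int *slave, int *pin)
{
    assert(irq >= 0 && irq <= 15);
    *slave = irq >= 8;
    *pin = irq & 7;
}

/* 4.x model: remembers which IRQ each non-ISA (apic, pin) was given. */
struct irq4x_map {
    int nused;
    struct { int apic, pin, irq; } ent[64];
};

/*
 * Hand out IRQs 2 and 16-23 to non-ISA pins. Once all nine are taken,
 * start reusing them, so unrelated pins end up sharing an IRQ in software.
 */
int
irq4x_assign(struct irq4x_map *m, int apic, int pin)
{
    static const int pool[] = { 2, 16, 17, 18, 19, 20, 21, 22, 23 };
    const int npool = sizeof(pool) / sizeof(pool[0]);

    for (int i = 0; i < m->nused; i++)
        if (m->ent[i].apic == apic && m->ent[i].pin == pin)
            return (m->ent[i].irq);     /* already assigned */
    assert(m->nused < 64);
    m->ent[m->nused].apic = apic;
    m->ent[m->nused].pin = pin;
    m->ent[m->nused].irq = pool[m->nused % npool];
    return (m->ent[m->nused++].irq);
}
```

For example, with two 24-pin I/O APICs, pin 3 of the second one is IRQ 27 in the 5.2+ model, whereas the 4.x model has to squeeze it into one of only nine spare IRQ values.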
Now, back to how routing of PCI device interrupts on x86 actually
works. I'll cover non-ACPI first. There are two cases to consider.
First, the easy case is that a PCI device interrupt (bus, slot, pin)
is wired directly to an individual pin on a pic. This is often how
interrupts are wired when using APICs. If you look at the mptable
output and look at the interrupt section, this is fairly obvious as
you will see entries that map the interrupt for a given pci bus, slot
and pin to a given apic id and intpin on that apic. Thus, there is
the mapping for (bus, slot, pin) to (pic, pin) directly. The way
interrupt routing is implemented in this case is that when we go to
route an interrupt for a given PCI device, we search the mptable for
a matching entry. We then look up the associated apic via its apic
id, ask it for the specified pin, and then ask that pin for its IRQ
(via the pic_vector method of the ioapic interrupt source object that
describes the specific pin). When nexus(4) does bus_setup_intr(), it
passes that IRQ to the x86 intr_machdep code which uses the IRQ as an
index into its interrupt source array and ends up with the interrupt
source object for the (apic, pin) tuple being used. (Thus, IRQs are
just a cookie that is the index into the global array of interrupt
sources on x86.) Note that interrupts routed this way are hardwired
into the motherboard design. There's no chance for the OS to change
which (pic, pin) a PCI device interrupt is hooked up to.
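A rough sketch of that lookup in C is below. The structures are simplified stand-ins, not the real MP Table entry layout or the kernel's ioapic objects, and the irq_base field simply assumes the 5.2+ numbering where each I/O APIC's pins occupy a contiguous IRQ range.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified MP Table I/O interrupt entry (hypothetical layout). */
struct mpt_intr {
    int bus, slot, pin;     /* PCI source; pin is 1 (INTA) to 4 (INTD) */
    int apic_id;            /* destination I/O APIC id */
    int apic_pin;           /* destination intpin on that I/O APIC */
};

/* Simplified I/O APIC description (hypothetical). */
struct ioapic {
    int id;                 /* APIC id as used in the MP Table */
    int irq_base;           /* IRQ cookie of pin 0 */
    int npins;
};

/* Find the entry for a PCI interrupt tuple; NULL if the table has none. */
const struct mpt_intr *
mpt_lookup(const struct mpt_intr *tbl, size_t n, int bus, int slot, int pin)
{
    assert(pin >= 1 && pin <= 4);
    for (size_t i = 0; i < n; i++)
        if (tbl[i].bus == bus && tbl[i].slot == slot && tbl[i].pin == pin)
            return (&tbl[i]);
    return (NULL);
}

/* Resolve a PCI interrupt tuple to an IRQ cookie, or -1 if not routable. */
int
mpt_route(const struct mpt_intr *tbl, size_t n, const struct ioapic *apics,
    size_t napics, int bus, int slot, int pin)
{
    const struct mpt_intr *e = mpt_lookup(tbl, n, bus, slot, pin);

    if (e == NULL)
        return (-1);
    for (size_t j = 0; j < napics; j++)
        if (apics[j].id == e->apic_id && e->apic_pin >= 0 &&
            e->apic_pin < apics[j].npins)
            return (apics[j].irq_base + e->apic_pin);
    return (-1);
}
```

Note that nothing here is written back to the hardware. The table only describes wiring that already exists.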
For the non-APIC case (non-ACPI still), PCI device interrupts are
usually wired up to a pin on a programmable interrupt router. Each
of these pins is called a pci link device. Multiple PCI device
interrupts may be wired up to the same link device, and systems
typically have anywhere from 4 to 8 (sometimes even more) link
devices. Each link device can be independently routed to a given
(pic, pin) and it is limited to a fixed set of possible IRQs. If
multiple link devices are routed to the same IRQ, then all of the
devices attached to both link devices end up sharing the same IRQ
(and thus the same ithread, etc.). Because the link devices are
independently steerable, this is the one way in which the OS has
limited flexibility in routing interrupts. However, the way it works
is that you route the link devices, not individual PCI device
interrupts. The table the BIOS provides with the information about
the link devices is called the $PIR (since that's the 4 byte
signature you search for in RAM to find it). You can see it during a
verbose boot dmesg. It is a table that maps a given (bus, slot,
intpin) PCI tuple to a link index. Each entry also has a bitmask of
the valid ISA IRQs ($PIR only allows for the 16 ISA IRQs) that the
specified link index can be routed to. Thus, the way that interrupt
routing works with $PIR is that when a PCI interrupt is routed, you
search the table for a matching entry to get a link index. The $PIR
code in sys/i386/pci/pci_pir.c basically has a list of link objects
that maintain state about each link. The code finds the data
associated with the link index and sees if it has an IRQ routed
already. If so, that's the IRQ that that PCI device interrupt is
assigned to. If an IRQ isn't routed already, then it has to use an
algorithm to pick one, make a BIOS call to route the link to the
chosen IRQ, and then assign the PCI device interrupt to that IRQ.
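In outline, the $PIR flow looks something like the sketch below. This is not the code from pci_pir.c: the structures are simplified, pir_bios_route() is a placeholder for the real BIOS call, and "pick the lowest IRQ the mask allows" is an assumed policy standing in for whatever algorithm the kernel actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified $PIR entry (hypothetical layout, not the BIOS table format). */
struct pir_entry {
    int bus, slot, pin;     /* PCI source; pin is 1 (INTA) to 4 (INTD) */
    int link;               /* link index this interrupt is wired to */
    uint16_t irq_mask;      /* bit N set: link may be routed to ISA IRQ N */
};

/* Per-link state: the IRQ the link is routed to, or -1 if none yet. */
struct pir_link {
    int irq;
};

/* Placeholder for the BIOS call that programs the router; always succeeds. */
int
pir_bios_route(int link, int irq)
{
    (void)link;
    (void)irq;
    return (0);
}

/* Route (bus, slot, pin); returns the IRQ, or -1 on failure. */
int
pir_route(const struct pir_entry *tbl, size_t n, struct pir_link *links,
    int nlinks, int bus, int slot, int pin)
{
    for (size_t i = 0; i < n; i++) {
        const struct pir_entry *e = &tbl[i];

        if (e->bus != bus || e->slot != slot || e->pin != pin)
            continue;
        assert(e->link >= 0 && e->link < nlinks);
        if (links[e->link].irq >= 0)
            return (links[e->link].irq);    /* link already routed */
        /* Assumed policy: take the lowest IRQ the mask allows. */
        for (int irq = 0; irq < 16; irq++) {
            if ((e->irq_mask & (1u << irq)) == 0)
                continue;
            if (pir_bios_route(e->link, irq) != 0)
                return (-1);
            links[e->link].irq = irq;
            return (irq);
        }
        return (-1);        /* no valid IRQ for this link */
    }
    return (-1);            /* no matching $PIR entry */
}
```

Two devices whose entries name the same link index get the same IRQ back, which is the sharing described above.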
Now that you understand that, ACPI routing can make some sense. The
way that ACPI routing works is that each PCI bus in the ACPI
namespace has a _PRT method that returns a table of routing entries.
Each entry contains the slot and intpin that it handles as well as a
reference to a link device in the ACPI namespace and a source index.
The slot and intpin let you build the (bus, slot, intpin) PCI tuple;
the bus comes from the PCI bus device that the _PRT is a child of.
(In FreeBSD the _PRT is actually a child of the pcib(4) device that is
the bridge that is the parent of the PCI bus, but I digress.) If the
link device
reference is empty or NULL, then the interrupt is a hard-wired
interrupt such as the ones used with MP Table routing, and the source
index is the global interrupt number (==IRQ) that you use for this
interrupt and you are done. If the link device reference isn't
empty, then it is the name of an ACPI device object that manages a
single pci link device. Example names include \_SB_.PCI0.LPC0.LNKA.
Each link device object includes methods to query which IRQ it is
currently routed to (though in practice this is unreliable), get the
list of possible IRQs, disable the link device altogether, and route
the link device to a specified IRQ. This is similar to the link
objects we have in the $PIR code except that these end up being full
blown devices on the ACPI side. ACPI adds another twist in that the
BIOS is free to use link devices with APICs (MP Table has no way of
handling that), and in fact in practice there are some nvidia
chipsets for amd64 that do route some PCI device interrupts to link
devices that in APIC mode can be routed to any of the IRQs 20-23.
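The two kinds of _PRT entry can be modelled like this. It is a simplified sketch with hypothetical types, not the ACPI-CA or FreeBSD structures, and it skips the step of picking and programming an IRQ for a link device that has not been routed yet.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified ACPI link device (hypothetical). */
struct acpi_link {
    const char *name;       /* e.g. "\\_SB_.PCI0.LPC0.LNKA" */
    int irq;                /* IRQ the link is routed to, -1 if none */
};

/* Simplified _PRT entry for one PCI bus (hypothetical). */
struct prt_entry {
    int slot, pin;          /* pin is 1 (INTA) to 4 (INTD) */
    struct acpi_link *link; /* NULL means a hard-wired interrupt */
    int source_index;       /* global interrupt number when link is NULL */
};

/* Resolve (slot, pin) on this bus to an IRQ, or -1 if unknown/unrouted. */
int
prt_route(const struct prt_entry *prt, size_t n, int slot, int pin)
{
    assert(pin >= 1 && pin <= 4);
    for (size_t i = 0; i < n; i++) {
        if (prt[i].slot != slot || prt[i].pin != pin)
            continue;
        if (prt[i].link == NULL)
            return (prt[i].source_index);   /* hard-wired: done */
        return (prt[i].link->irq);          /* the link device decides */
    }
    return (-1);
}
```

The hard-wired branch corresponds to MP Table style routing, and the link branch corresponds to $PIR style routing.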
Now some of the minor trivia and exceptions. On the first I/O APIC,
IRQ0 is generally routed to intpin 2, not intpin 0 (though many
motherboards don't actually hook up the IRQ0 output from the ISA
timer to intpin 2 but do claim to do so in the MP Table and MADT).
Instead intpin 0 is a special ExtINT pin that listens to the 8259As
and can forward interrupts from the 8259As to one or more CPUs. This
is what "mixed mode" is, and on FreeBSD 4.x, if we discover via a
test that the motherboard did not wire IRQ0 up to intpin 2, we use
mixed mode to deliver it via the 8259A bounced through the ExtINT pin
0 on the first I/O APIC. Blech. Also, for ACPI, the SCI is
generally tied to IRQ 9. However, the SCI may be routed to another
intpin in APIC mode. Rather than change the IRQ value in the FADT
(or whichever table the SCI INT is in), ACPI will include an entry in
the MADT that maps IRQ 9 to some other intpin such as IRQ 13 or IRQ
20. If the new intpin is not an ISA IRQ (> 15) we use a backdoor to
override the IRQ ACPI uses. If the new intpin is an ISA IRQ though,
we actually rename the destination IRQ (such as IRQ 13 on one of my
boxes) to IRQ 9, and the original IRQ 9 becomes a "dead" interrupt
pin with no IRQ associated with it. Note that except for a few rare
and very old SMP boxes, no FreeBSD x86 machine has an IRQ 2. Another
odd case is that some very old SMP boxes did not route PCI device
interrupts to the APICs at all. Instead, they routed the outputs of
the link devices to the pins on the first I/O APIC corresponding to
the same IRQ as on the 8259A (the I/O APIC only had 16 pins). Thus,
on these boxes, PCI interrupts are still routed via link devices via
$PIR, and end up triggering IRQ X via intpin X on the first I/O
APIC. One final twist. If a PCI bus behind a PCI-PCI bridge is not
listed in a BIOS table ($PIR or MP Table) or does not have a _PRT in
ACPI, the interrupts are routed by applying the swizzle defined in
the PCI standard to route the interrupt via one of the four INTx pins
on the PCI-PCI bridge's parent PCI bus. The standard defines this
behavior for add-in cards, but some built in busses do this as well.
(I've seen several AGP busses that actually use this method to route
the VGA IRQ).
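The swizzle itself is a one-liner. A minimal sketch in C, with a hypothetical function name and pins numbered as in the intpin register (1 = INTA through 4 = INTD):

```c
#include <assert.h>

/*
 * Standard PCI-PCI bridge swizzle: a device in slot `slot` behind the
 * bridge using pin `pin` shows up on the bridge's parent bus as the
 * returned pin. Pins are 1 (INTA) to 4 (INTD).
 */
int
pci_swizzle(int slot, int pin)
{
    assert(slot >= 0 && slot < 32 && pin >= 1 && pin <= 4);
    return (((slot + (pin - 1)) % 4) + 1);
}
```

So INTA of slot 1 behind the bridge becomes INTB on the parent side, and the result is then routed like any other (bus, slot, pin) tuple using the bridge's own bus and slot.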
> Also, if the "boot interrupt" was previously set to 2, is that
> likely to have changed in -current?
> Am I now going to get clobbered on IRQ16? If yes, is this
> something that the BIOS writers
> decided, or something that the Motherboard designers decided?
The "boot interrupt" issue on some of the PXH's used for PCI-X and
PCI-e host bridges is an unpleasant mess. I think it comes from
Intel assuming all the world is Windows (imagine that) and ignoring
standards (such as MP Table and ACPI) that it helps to author. (Yay
Intel!) The issue there is that the PXH's include a dedicated I/O
APIC for each of the two busses the PXH serves, and the PCI device
interrupts are routed to intpins on those APICs. To handle the non-
APIC case, the PXH's forward any device interrupts to the INTx pins
on the parent side of the PCI-PCI bridge if the APIC is disabled.
The problem is that Intel chose a hack to figure out if the APIC was
disabled and that hack interacts badly with FreeBSD. Basically, if
the individual intpin is masked in the APIC, the PXH assumes you
aren't using the APIC to handle interrupts and so it forwards the
interrupt to the INTx pin on its bridge's parent side. The problem
is that after an interrupt comes in on 4.x and later, we mask the
interrupt in the APIC until we have run the interrupt handler. The
reason is that PCI interrupts are level triggered, so they won't
"shut up" until the ISR has run and pacified the PCI device. 4.x
masks the interrupt because it wants to not run ISRs with all
interrupts disabled, but at the same cpl that the interrupt was
registered at so that higher priority interrupts can still preempt an
ISR. 5.x and later need to mask the interrupt so that the processor
doesn't have to keep interrupts disabled until the ithread finishes.
Trying to do that would become complicated and quite painful since it
would also mean deferring the EOI to the lapic (which has to happen
on the same CPU that received the interrupt) and has other nastiness
since ithreads can block on locks, etc. Other OS's that use ithreads
such as BSD/OS and probably Solaris/x86 and Darwin/x86 probably have
the same issue. The sucky part is that Intel didn't have to do this
gross hack. ACPI requires that the OS call a method _PIC if it wants
to use APIC mode, and the _PIC method is free to write to registers,
PCI config space, etc., so Intel could have provided a register to
specify if the PXH's APIC was being used or not and included the code
to manage that in _PIC in their sample BIOS. But, they didn't.
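As a toy model of the behavior described above (this only restates the description, it is not taken from Intel documentation, and the names are invented):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Where the PXH sends a device interrupt. */
enum pxh_dest {
    PXH_NONE,           /* line not asserted, nothing to deliver */
    PXH_IOAPIC,         /* delivered via the PXH's own I/O APIC */
    PXH_PARENT_INTX     /* forwarded to INTx on the bridge's parent side */
};

/* State of one interrupt pin on the PXH's I/O APIC (hypothetical). */
struct pxh_pin {
    bool masked;        /* mask bit for this intpin in the I/O APIC */
    bool asserted;      /* level-triggered line from the PCI device */
};

/* The PXH's "is the APIC in use?" hack: masked means forward upstream. */
enum pxh_dest
pxh_deliver(const struct pxh_pin *p)
{
    assert(p != NULL);
    if (!p->asserted)
        return (PXH_NONE);
    return (p->masked ? PXH_PARENT_INTX : PXH_IOAPIC);
}
```

The troublesome sequence is then: the device asserts its line with the pin unmasked (PXH_IOAPIC), the kernel masks the pin until the handler has run, and because the level-triggered line is still asserted the same interrupt now also appears as PXH_PARENT_INTX on whatever IRQ the parent-side INTx pin is routed to.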
One possible workaround for this issue would be to provide a hacked
PCI-PCI bridge driver for the PXH's that hacked the PCI interrupt
routing such that the PCI device interrupts for child devices didn't
use the APICs in the PXH at all, but used the IRQs that they get
aliased to (such as IRQ16 on 5.2+). Getting that to work on 4.x might be
quite painful since 4.x PCI interrupt routing code is rather gross
and hacky already.
Hopefully this at least answers some questions and gives a good
overview of what PCI interrupt routing is and how it works, etc.
--
John Baldwin <jhb at FreeBSD.org> <>< http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve" = http://www.FreeBSD.org
More information about the freebsd-current
mailing list