can someone explain...[ PCI interrupts]
Julian Elischer
julian at elischer.org
Tue Dec 6 23:39:01 PST 2005
John Baldwin wrote:
>
> On Dec 6, 2005, at 5:57 PM, Julian Elischer wrote:
>
>> In short words for the likes of me,
>> Can someone give a quick roundup on PCI routing in 4.x and -current?
>
>
> My, what a set of questions. :) I'll do my best, but this will
> probably be a long and perhaps wandering e-mail.
>
> First off, interrupts for PCI devices are roughly split up into two
> categories (currently): INTx interrupt lines and MSI interrupts. MSI
> is relatively new and I won't cover it much here. No versions of
> FreeBSD currently support MSI either (though it's on my todo list), so
> I'll limit this discussion to INTx interrupts. For INTx interrupts,
> each PCI device (or slot) has 4 interrupt lines: INTA, INTB, INTC, and
> INTD. Thus, you can describe any individual PCI interrupt as a tuple
> of (bus, slot, pin). For example, device 4's INTA pin on pci bus 0
> would be (0, 4, INTA). Each PCI function is allowed to have one INTx
> interrupt. The bus and slot come from the location of that function in
> the PCI hierarchy, and the pin comes from the intpin PCI config
> register. PCI doesn't define beyond the INTx pin how an interrupt is
> delivered to the CPU, etc. That is all a property of the architecture,
> chipset, etc.
>
> On x86, there are two disparate sets of hardware for managing interrupt
> signals. The first is the pair of 8259A interrupt controllers found on
> all PC-AT compatible machines. The second set of hardware is the APIC
> subsystem as it were. Each processor contains a local APIC that can
> receive messages from other APICs and send messages to other local
> APICs. In addition to the local APICs, the chipset contains 1 or more
> I/O APICs. Each I/O APIC contains anywhere from 4 to 32 individual
> interrupt pins. Common numbers are 4 (somewhat rare), 16, 24, and 32.
> Conceptually, on x86 a given interrupt source can be described by the
> tuple (pic, pin).
>
> Simply put, PCI interrupt routing is the mapping of (bus, slot, pin)
> PCI interrupt tuples to (pic, pin) x86 interrupt tuples.
>
> Now, before delving deeper into the specifics of routing on x86, let me
> digress about IRQs on FreeBSD. Basically, an IRQ value is a cookie
> useful for binding a device interrupt (such as a PCI (bus, slot, pin)
> tuple or an ISA IRQ) to an x86 interrupt tuple (pic, pin). BIOSes don't
> operate with APICs at all, at least not for handling device
> interrupts. Thus, they all use a simple mapping where IRQs 0-7
> correspond to pins 0-7 on the master 8259A, and IRQs 8-15 map to pins
> 0-7 on the slave 8259A. All versions of FreeBSD use the same mapping
> for IRQ cookie values when using the 8259As to route interrupts. For
> the APIC case the mapping of IRQ cookies to (pic, pin) tuples is
> slightly more complicated. First, the simple case. FreeBSD 5.2 and
> later follow the ACPI model (even when not using ACPI) where the IRQs
> 0-n correspond to the pins 0-n of the first I/O APIC, IRQs n+1 to
> (n+1)+m map to pins 0 to m of the second I/O APIC, etc. (There is one
> possible exception with ACPI I'll cover later.) FreeBSD 4.x is more
> complicated. The reason is that due to cpl and spl interrupt masks
> being 32-bit integers with 8 bits set aside for software interrupts
> (SWIs), cpl only has 24 bits available for hardware interrupts.
> Therefore, FreeBSD 4.x is limited to IRQ values 0-23 and can't use
> the simple (and intuitive) model that FreeBSD 5.2+ and ACPI use. What
> FreeBSD 4.x does is to map the ISA interrupts attached to the first I/O
> APIC to IRQs 0, 1, and 3-15. This just leaves IRQs 2 and 16-23
> available for all the other APIC interrupt pins. As each PCI device
> registers an interrupt handler for a specific (apic, pin) tuple, that
> x86 interrupt is mapped to one of the last set of IRQs. If all of them
> have been used already, then the kernel starts assigning multiple
> (apic, pin) tuples to the same IRQ resulting in interrupts being shared
> in software because of the cpl limitation even though they aren't
> shared in hardware. This is why your IRQ values are different on 4.x
> than on FreeBSD 5.2+ and Linux which use the ACPI global interrupt
> number model.
but if I change the code that does this, I may be able to get my devices that
collide with the 'boot interrupt' to go elsewhere? That would be good..
>
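A minimal sketch of the two IRQ numbering schemes described above: the
8259A mapping and the global interrupt number model used by FreeBSD 5.2+
and ACPI. The function names and the pins_per_apic list are hypothetical,
made up for illustration. The sketch ignores the exceptions covered later
in this mail (IRQ0 on intpin 2, the SCI override) and does not model the
4.x scheme.

```python
from typing import List, Tuple


def irq_to_8259a(irq: int) -> Tuple[str, int]:
    """Map an ISA IRQ (0-15) to (controller, pin) on the 8259A pair."""
    if not 0 <= irq <= 15:
        raise ValueError("8259A IRQs are 0-15")
    # IRQs 0-7 are pins 0-7 on the master, IRQs 8-15 are pins 0-7 on the slave.
    return ("master", irq) if irq < 8 else ("slave", irq - 8)


def irq_to_ioapic(irq: int, pins_per_apic: List[int]) -> Tuple[int, int]:
    """Map a global interrupt number to (apic index, pin).

    pins_per_apic is a hypothetical list holding the pin count of each
    I/O APIC, in the order the APICs are numbered.
    """
    if irq < 0:
        raise ValueError("IRQ must be non-negative")
    base = 0  # first global interrupt number owned by the current APIC
    for apic, npins in enumerate(pins_per_apic):
        if irq < base + npins:
            return (apic, irq - base)
        base += npins
    raise ValueError("IRQ is beyond the last I/O APIC pin")
```

With two 24-pin I/O APICs, irq_to_ioapic(24, [24, 24]) lands on pin 0 of
the second APIC, which is why IRQ values above 23 show up on 5.2+ but
never on 4.x.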
> Now, back to how routing of PCI device interrupts on x86 actually
> works. I'll cover non-ACPI first. There are two cases to consider.
> First, the easy case is that a PCI device interrupt (bus, slot, pin)
> is wired directly to an individual pin on a pic. This is often how
> interrupts are wired when using APICs. If you look at the mptable
> output and look at the interrupt section, this is fairly obvious as you
> will see entries that map the interrupt for a given pci bus, slot and
> pin to a given apic id and intpin on that apic. Thus, there is the
> mapping for (bus, slot, pin) to (pic, pin) directly. The way interrupt
> routing is implemented in this case is that when we go to route an
> interrupt for a given PCI device, we search the mptable for a matching
> entry. We then look up the associated apic via its apic id, ask it for
> the specified pin, and then ask that pin for its IRQ (via the
> pic_vector method of the ioapic interrupt source object that describes
> the specific pin). When nexus(4) does bus_setup_intr(), it passes that
> IRQ to the x86 intr_machdep code which uses the IRQ as an index into
> its interrupt source array and ends up with the interrupt source object
> for the (apic, pin) tuple being used. (Thus, IRQs are just a cookie
> that is the index into the global array of interrupt sources on x86.)
> Note that interrupts routed this way are hardwired into the motherboard
> design. There's no chance for the OS to change which (pic, pin) a PCI
> device interrupt is hooked up to.
but from my memory, many PCI devices can select between A,B,C and D
so maybe by going to the device and selecting a different one of those
you can force it to go elsewhere...
>
> For the non-APIC case (non-ACPI still), PCI device interrupts are
> usually wired up to a pin on a programmable interrupt router. Each of
> these pins is called a pci link device. Multiple PCI device interrupts
> may be wired up to the same link device, and systems typically have
> anywhere from 4 to 8 (sometimes even more) link devices. Each link
> device can be independently routed to a given (pic, pin) and it is
> limited to a fixed set of possible IRQs. If multiple link devices are
> routed to the same IRQ, then all of the devices attached to both link
> devices end up sharing the same IRQ (and thus the same ithread, etc.).
> Because the link devices are independently steerable, this is the one
> way in which the OS has limited flexibility in routing interrupts.
> However, the way it works is that you route the link devices, not
> individual PCI device interrupts. The table the BIOS provides with the
> information about the link devices is called the $PIR (since that's the
> 4 byte signature you search for in RAM to find it). You can see it
> during a verbose boot dmesg. It is a table that maps a given (bus,
> slot, intpin) PCI tuple to a link index. Each entry also has a bitmask
> of the valid ISA IRQs ($PIR only allows for the 16 ISA IRQs) that the
> specified link index can be routed to. Thus, the way that interrupt
> routing works with $PIR is that when a PCI interrupt is routed, you
> search the table for a matching entry to get a link index. The $PIR
> code in sys/i386/pci/pci_pir.c basically has a list of link objects
> that maintain state about each link. The code finds the data
> associated with the link index and sees if it has an IRQ routed
> already. If so, that's the IRQ that that PCI device interrupt is
> assigned to. If an IRQ isn't routed already, then it has to use an
> algorithm to pick one, make a BIOS call to route the link to the chosen
> IRQ, and then assign the PCI device interrupt to that IRQ.
>
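The $PIR lookup described above can be sketched roughly as follows. This
is not the code in pci_pir.c; every name here (Link, pick_irq,
route_interrupt, bios_route) is hypothetical. In particular, the mail only
says "an algorithm" picks the IRQ, so the lowest-valid-IRQ policy below is
an assumption chosen to keep the example short, and the BIOS call is a
caller-supplied stub.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

PciIntr = Tuple[int, int, str]  # (bus, slot, pin), e.g. (0, 4, "INTA")


@dataclass
class Link:
    irq_mask: int              # bit n set => link may be routed to ISA IRQ n
    irq: Optional[int] = None  # IRQ the link is currently routed to, if any


def pick_irq(mask: int) -> int:
    """Assumed policy: choose the lowest ISA IRQ the mask allows."""
    for irq in range(16):
        if mask & (1 << irq):
            return irq
    raise ValueError("link has no valid IRQs")


def route_interrupt(
    table: Dict[PciIntr, int],
    links: Dict[int, Link],
    intr: PciIntr,
    bios_route: Callable[[int, int], None],
) -> int:
    """Return the IRQ for a PCI interrupt, routing its link if needed."""
    index = table[intr]  # $PIR entry: PCI tuple -> link index
    link = links[index]
    if link.irq is None:
        irq = pick_irq(link.irq_mask)
        bios_route(index, irq)  # stand-in for the BIOS routing call
        link.irq = irq
    return link.irq
```

Two PCI interrupts that share a link index get the same IRQ, and the BIOS
stub is called only once for that link, which matches the point that you
route link devices rather than individual PCI device interrupts.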
so, is a "link device" a physical piece of hardware or a software abstraction?
> Now that you understand that, ACPI routing can make some sense. The
> way that ACPI routing works is that each PCI bus in the ACPI namespace
> has a _PRT method that returns a table of routing entries. Each entry
> contains the slot and intpin that it handles (so that you can build the
> (bus, slot, intpin) PCI tuple (bus comes from the PCI bus device _PRT
> is a child of, in FreeBSD the _PRT is actually a child of the pcib(4)
> device that is the bridge that is the parent of the PCI bus, but I
> digress)) as well as a reference to a link device in the ACPI namespace
> and a source index. If the link device reference is empty or NULL,
> then the interrupt is a hard-wired interrupt such as the ones used with
> MP Table routing, and the source index is the global interrupt number
> (==IRQ) that you use for this interrupt and you are done. If the link
> device reference isn't empty, then it is the name of an ACPI device
> object that manages a single pci link device. Example names include
> \_SB_.PCI0.LPC0.LNKA. Each link device object includes methods to
> query which IRQ it is currently routed to (though in practice this is
> unreliable), get the list of possible IRQs, disable the link device
> altogether, and route the link device to a specified IRQ. This is
> similar to the link objects we have in the $PIR code except that these
> end up being full blown devices on the ACPI side. ACPI adds another
> twist in that the BIOS is free to use link devices with APICs (MP Table
> has no way of handling that), and in fact in practice there are some
> nvidia chipsets for amd64 that do route some PCI device interrupts to
> link devices that in APIC mode can be routed to any of the IRQs 20-23.
>
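As a rough illustration of the _PRT decision described above, assuming a
hypothetical PrtEntry record and a caller-supplied route_link callback
standing in for the link device's methods (neither is a real ACPI or
FreeBSD interface):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PrtEntry:
    slot: int
    pin: str             # "INTA" .. "INTD"
    link: Optional[str]  # ACPI link device name, or None if hard-wired
    source_index: int


def resolve_prt(
    entries: List[PrtEntry],
    slot: int,
    pin: str,
    route_link: Callable[[str], int],
) -> int:
    """Return the IRQ for (slot, pin) on the bus this _PRT belongs to."""
    for entry in entries:
        if entry.slot == slot and entry.pin == pin:
            if entry.link is None:
                # Hard-wired: the source index is the global interrupt number.
                return entry.source_index
            # Otherwise the named link device decides the IRQ.
            return route_link(entry.link)
    raise LookupError("no _PRT entry for this slot and pin")
```

The bus part of the PCI tuple is implicit here because each bus has its
own _PRT.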
> Now some of the minor trivia and exceptions. On the first I/O APIC,
> IRQ0 is generally routed to intpin 2, not intpin 0 (though many
> motherboards don't actually hook up the IRQ0 output from the ISA timer
> to intpin 2 but do claim to do so in the MP Table and MADT). Instead
> intpin 0 is a special ExtINT pin that listens to the 8259As and can
> forward interrupts from the 8259As to one or more CPUs. This is what
> "mixed mode" is, and on FreeBSD 4.x, if we discover via a test that the
> motherboard did not wire IRQ0 up to intpin 2, we use mixed mode to
> deliver it via the 8259A bounced through the ExtINT pin 0 on the first
> I/O APIC. Blech. Also, for ACPI, the SCI is generally tied to IRQ 9,
> however, the SCI may be routed to another intpin in APIC mode. Rather
> than change the IRQ value in the FADT (or whichever table the SCI INT
> is in), ACPI will include an entry in the MADT that maps IRQ 9 to some
> other intpin such as IRQ 13 or IRQ 20. If the new intpin is not an ISA
> IRQ (> 15) we use a backdoor to override the IRQ ACPI uses. If the new
> intpin is an ISA IRQ though, we actually rename the destination IRQ
> (such as IRQ 13 on one of my boxes) to IRQ 9, and the original IRQ 9
> becomes a "dead" interrupt pin with no IRQ associated with it. Note
> that except for a few rare and very old SMP boxes, no FreeBSD x86
> machine has an IRQ 2. Another odd case is that some very old SMP boxes
> did not route PCI device interrupts to the APICs at all. Instead, they
> routed the outputs of the link devices to the pins on the first I/O
> APIC corresponding to the same IRQ as on the 8259A (the I/O APIC only
> had 16 pins). Thus, on these boxes, PCI interrupts are still routed
> via link devices via $PIR, and end up triggering IRQ X via intpin X on
> the first I/O APIC. One final twist. If a PCI bus behind a PCI-PCI
> bridge is not listed in a BIOS table ($PIR or MP Table) or does not
> have a _PRT in ACPI, the interrupts are routed by applying the swizzle
> defined in the PCI standard to route the interrupt via one of the four
> INTx pins on the PCI-PCI bridge's parent PCI bus. The standard defines
> this behavior for add-in cards, but some built in busses do this as
> well. (I've seen several AGP busses that actually use this method to
> route the VGA IRQ).
>
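The swizzle mentioned above is not spelled out in this mail. As I
understand the PCI-PCI bridge convention, with INTA..INTD numbered 0..3,
the pin seen on the parent bus is (child pin + child slot) mod 4. A small
sketch under that assumption (the swizzle name and PINS list are made up
for illustration):

```python
PINS = ["INTA", "INTB", "INTC", "INTD"]


def swizzle(slot: int, pin: str) -> str:
    """Return the bridge pin on the parent bus used by a child device.

    slot is the child's slot number on the bus behind the bridge and
    pin is the child's own INTx pin.
    """
    return PINS[(PINS.index(pin) + slot) % 4]
```

The result is then routed as an interrupt of the bridge itself, that is
as (parent bus, bridge slot, swizzled pin), using whichever table covers
the parent bus.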
>> Also, if the "boot interrupt" was previously set to 2, is that likely
>> to have changed in -current?
>> Am I now going to get clobbered on IRQ16? If yes, is this something
>> that the BIOS writers
>> decided, or something that the Motherboard designers decided?
>
>
> The "boot interrupt" issue on some of the PXH's used for PCI-X and
> PCI-e host bridges is an unpleasant mess. I think it comes from Intel
> assuming all the world is windows (imagine that) and ignoring standards
> (such as MP Table and ACPI) that it helps to author. (Yay Intel!) The
> issue there is that the PXH's include a dedicated I/O APIC for each of
> the two busses the PXH serves, and the PCI device interrupts are routed
> to intpins on those APICs. To handle the non-APIC case, the PXH's
> forward any device interrupts to the INTx pins on the parent side of
> the PCI-PCI bridge if the APIC is disabled. The problem is that Intel
> chose a hack to figure out if the APIC was disabled and that hack
> interacts badly with FreeBSD. Basically, if the individual intpin is
> masked in the APIC, the PXH assumes you aren't using the APIC to handle
> interrupts and so it forwards the interrupt to the INTx pin on its
> bridge's parent side. The problem is that after an interrupt comes in
> on 4.x and later, we mask the interrupt in the APIC until we have run
> the interrupt handler. The reason is that PCI interrupts are level
> triggered, so they won't "shut up" until the ISR has run and pacified
> the PCI device. 4.x masks the interrupt because it doesn't want to run
> ISRs with all interrupts disabled; it runs them at the cpl the interrupt
> was registered at, so that higher priority interrupts can still
> preempt an ISR. 5.x and later need to mask the interrupt so that
> the processor doesn't have to keep interrupts disabled until the
> ithread finishes. Trying to do that would become complicated and quite
> painful since it would also mean deferring the EOI to the lapic (which
> has to happen on the same CPU that received the interrupt) and has
> other nastiness since ithreads can block on locks, etc. Other OS's
> that use ithreads such as BSD/OS and probably Solaris/x86 and
> Darwin/x86 probably have the same issue. The sucky part is that Intel
> didn't have to do this gross hack. ACPI requires that the OS call a
> method _PIC if it wants to use APIC mode, and the _PIC method is free
> to write to registers, PCI config space, etc., so Intel could have
> provided a register to specify if the PXH's APIC was being used or not
> and included the code to manage that in _PIC in their sample BIOS.
> But, they didn't.
>
> One possible workaround for this issue would be to provide a hacked
> PCI-PCI bridge driver for the PXH's that hacked the PCI interrupt
> routing such that the PCI device interrupts for child devices didn't
> use the APICs in the PXH at all, but used the IRQs that get aliased to
> (such as IRQ16 on 5.2+). Getting that to work on 4.x might be quite
> painful since 4.x PCI interrupt routing code is rather gross and hacky
> already.
>
> Hopefully this at least answers some questions and gives a good
> overview of what PCI interrupt routing is and how it works, etc.
My head hurts, but a lot makes more sense now.
I'll need to read this a few more times however.
if you made this into a web page, and added a few diagrams that would be
amazing.. also you use a few Acronyms without saying what they are..
Thanks!
>