can someone explain...[ PCI interrupts]

Tue Dec 6 23:39:01 PST 2005

John Baldwin wrote:
> 
> On Dec 6, 2005, at 5:57 PM, Julian Elischer wrote:
> 
>> In short words for the likes of me,
>> Can someone give a quick roundup on PCI routing in 4.x and -current?
> 
> 
> My, what a set of questions. :)  I'll do my best, but this will  
> probably be a long and perhaps wandering e-mail.
> 
> First off, interrupts for PCI devices are roughly split up into two  
> categories (currently): INTx interrupt lines and MSI interrupts.  MSI  
> is relatively new and I won't cover it much here.  No versions of  
> FreeBSD currently support MSI either (though it's on my todo list),  so 
> I'll limit this discussion to INTx interrupts.  For INTx  interrupts, 
> each PCI device (or slot) has 4 interrupt lines: INTA,  INTB, INTC, and 
> INTD.  Thus, you can describe any individual PCI  interrupt as a tuple 
> of (bus, slot, pin).  For example, device 4's  INTA pin on pci bus 0 
> would be (0, 4, INTA).  Each PCI function is  allowed to have one INTx 
> interrupt.  The bus and slot come from the  location of that function in 
> the PCI hierarchy, and the pin comes  from the intpin PCI config 
> register.  PCI doesn't define beyond the  INTx pin how an interrupt is 
> delivered to the CPU, etc.  That is all  a property of the architecture, 
> chipset, etc.
> 
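A small sketch of the tuple idea, in Python only for brevity. PciIntr and
decode_intpin are made-up names, not FreeBSD APIs. The one fact assumed beyond
the text above is the standard encoding of the intpin config register: 0 means
the function uses no INTx pin, and 1 through 4 mean INTA through INTD.

```python
from typing import NamedTuple, Optional

# Standard encoding of the PCI "interrupt pin" config register.
PIN_NAMES = {1: "INTA", 2: "INTB", 3: "INTC", 4: "INTD"}


class PciIntr(NamedTuple):
    """One PCI INTx interrupt, identified as (bus, slot, pin)."""
    bus: int
    slot: int
    pin: str


def decode_intpin(bus: int, slot: int, intpin: int) -> Optional[PciIntr]:
    """Build the tuple from a function's location and its intpin register."""
    if intpin == 0:
        return None  # this function does not use an INTx interrupt
    if intpin not in PIN_NAMES:
        raise ValueError(f"invalid intpin value {intpin}")
    return PciIntr(bus, slot, PIN_NAMES[intpin])
```

With this, decode_intpin(0, 4, 1) gives the (0, 4, INTA) tuple from the
example above.
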
> On x86, there are two disparate sets of hardware for managing  interrupt 
> signals.  The first is the pair of 8259A interrupt  controllers found on 
> all PC-AT compatible machines.   The second set  of hardware is the APIC 
> subsystem as it were.  Each processor  contains a local APIC that can 
> receive messages from other APICs and  send messages to other local 
> APICs.  In addition to the local APICs,  the chipset contains 1 or more 
> I/O APICs.  Each I/O APIC contains  anywhere from 4 to 32 individual 
> interrupt pins.  Common numbers are  4 (somewhat rare), 16, 24, and 32.  
> Conceptually, on x86 a given  interrupt source can be described by the 
> tuple (pic, pin).
> 
> Simply put, PCI interrupt routing is the mapping of (bus, slot, pin)  
> PCI interrupt tuples to (pic, pin) x86 interrupt tuples.
> 
> Now, before delving deeper into the specifics of routing on x86, let  me 
> digress about IRQs on FreeBSD.  Basically, an IRQ value is a  cookie 
> useful for binding a device interrupt (such as a PCI (bus,  slot, pin) 
> tuple or an ISA IRQ) to a x86 interrupt tuple (pic, pin).   BIOSes don't 
> operate with APICs at all, at least not for handling  device 
> interrupts.  Thus, they all use a simple mapping where IRQs  0-7 
> correspond to pins 0-7 on the master 8259A, and IRQs 8-15 map to  pins 
> 0-7 on the slave 8259A.  All versions of FreeBSD use the same  mapping 
> for IRQ cookie values when using the 8259As to route  interrupts.  For 
> the APIC case the mapping of IRQ cookies to (pic,  pin) tuples is 
> slightly more complicated.  First, the simple case.   FreeBSD 5.2 and 
> later follow the ACPI model (even when not using  ACPI) where the IRQs 
> 0-n correspond to the pins 0-n of the first I/O  APIC, IRQs n+1 to 
> (n+1)+m map to pins 0 to m of the second I/O APIC,  etc.  (There is one 
> possible exception with ACPI I'll cover later.)   FreeBSD 4.x is more 
> complicated.  The reason is that due to cpl and  spl interrupt masks 
> being 32-bit integers with 8 bits set aside for  software interrupts 
> (SWIs), cpl only has 24 bits available for  hardware interrupts.  
> Therefore, FreeBSD 4.x is limited to IRQ values 0-23 and can't use 
> the simple (and intuitive) model that  FreeBSD 5.2+ and ACPI use.  What 
> FreeBSD 4.x does is to map the ISA  interrupts attached to the first I/O 
> APIC to IRQs 0, 1, and 3-15.   This just leaves IRQs 2 and 16-23 
> available for all the other APIC  interrupt pins.  As each PCI device 
> registers an interrupt handler  for a specific (apic, pin) tuple, that 
> x86 interrupt is mapped to one  of the last set of IRQs.  If all of them 
> have been used already, then  the kernel starts assigning multiple 
> (apic, pin) tuples to the same  IRQ resulting in interrupts being shared 
> in software because of the  cpl limitation even though they aren't 
> shared in hardware.  This is  why your IRQ values are different on 4.x 
> than on FreeBSD 5.2+ and  Linux which use the ACPI global interrupt 
> number model.
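
To make the three numbering schemes concrete, here is a rough Python sketch.
All names are hypothetical. The 8259A and global-interrupt-number mappings
follow the description above directly. The 4.x allocator is only a guess at
the policy: the order in which IRQs 2 and 16-23 are handed out, and the
wrap-around once they run out, are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def irq_to_8259a(irq: int) -> Tuple[str, int]:
    """8259A mapping: IRQs 0-7 on the master, IRQs 8-15 on the slave."""
    if not 0 <= irq <= 15:
        raise ValueError("the 8259A pair only provides IRQs 0-15")
    return ("master", irq) if irq < 8 else ("slave", irq - 8)


def irq_to_ioapic(irq: int, pin_counts: List[int]) -> Tuple[int, int]:
    """Global interrupt number model: I/O APIC pins are numbered back to back."""
    if irq < 0:
        raise ValueError("IRQ must be non-negative")
    base = 0
    for apic, npins in enumerate(pin_counts):
        if irq < base + npins:
            return (apic, irq - base)
        base += npins
    raise ValueError(f"IRQ {irq} is beyond the last I/O APIC pin")


class Legacy4xAllocator:
    """4.x-style scheme: non-ISA (apic, pin) tuples compete for 9 IRQs."""

    POOL = [2] + list(range(16, 24))  # IRQs 2 and 16-23

    def __init__(self) -> None:
        self.assigned: Dict[Tuple[int, int], int] = {}

    def irq_for(self, apic: int, pin: int) -> int:
        key = (apic, pin)
        if key not in self.assigned:
            # Assumption: once the pool is used up, wrap around so that
            # several (apic, pin) tuples share one IRQ in software.
            slot = len(self.assigned) % len(self.POOL)
            self.assigned[key] = self.POOL[slot]
        return self.assigned[key]
```

With two 24-pin I/O APICs, irq_to_ioapic(24, [24, 24]) lands on pin 0 of the
second APIC, while the 4.x-style allocator starts doubling up as soon as a
tenth (apic, pin) tuple asks for an IRQ.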

but if I change the code that does this, I may be able to get my devices that
collide with the 'boot interrupt' to go elsewhere? That would be good..

> 
> Now, back to how routing of PCI device interrupts on x86 actually  
> works.  I'll cover non-ACPI first.  There are two cases to consider.   
> First, the easy case is that a PCI device interrupt (bus, slot, pin) 
> is wired directly to an individual pin on a pic.  This is often how  
> interrupts are wired when using APICs.  If you look at the mptable  
> output and look at the interrupt section, this is fairly obvious as  you 
> will see entries that map the interrupt for a given pci bus, slot  and 
> pin to a given apic id and intpin on that apic.  Thus, there is  the 
> mapping for (bus, slot, pin) to (pic, pin) directly.  The way  interrupt 
> routing is implemented in this case is that when we go to  route an 
> interrupt for a given PCI device, we search the mptable for  a matching 
> entry.  We then look up the associated apic via its apic  id, ask it for 
> the specified pin, and then ask that pin for its IRQ  (via the 
> pic_vector method of the ioapic interrupt source object that  describes 
> the specific pin).  When nexus(4) does bus_setup_intr(), it  passes that 
> IRQ to the x86 intr_machdep code which uses the IRQ as an  index into 
> its interrupt source array and ends up with the interrupt  source object 
> for the (apic, pin) tuple being used.  (Thus, IRQs are  just a cookie 
> that is the index into the global array of interrupt  sources on x86.)  
> Note that interrupts routed this way are hardwired  into the motherboard 
> design.  There's no chance for the OS to change  which (pic, pin) a PCI 
> device interrupt is hooked up to.
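
The MP Table lookup described above might be sketched like this in Python.
The table contents, the apic ids, and the names MpIntrEntry and mptable_route
are all invented for illustration; the real code walks the BIOS table and the
ioapic interrupt source objects. The IRQ is computed with the FreeBSD 5.2+
numbering, where each I/O APIC has a base IRQ and the pin is added to it.

```python
from typing import Dict, List, NamedTuple


class MpIntrEntry(NamedTuple):
    """One interrupt entry: PCI (bus, slot, pin) wired to (apic id, intpin)."""
    bus: int
    slot: int
    pin: str
    apic_id: int
    apic_pin: int


# Hypothetical table contents.
MPTABLE: List[MpIntrEntry] = [
    MpIntrEntry(0, 4, "INTA", apic_id=2, apic_pin=16),
    MpIntrEntry(1, 0, "INTA", apic_id=3, apic_pin=0),
]

# Hypothetical I/O APICs: apic id -> IRQ assigned to that APIC's pin 0.
IOAPIC_IRQ_BASE: Dict[int, int] = {2: 0, 3: 24}


def mptable_route(bus: int, slot: int, pin: str) -> int:
    """Return the IRQ cookie for a hard-wired PCI interrupt."""
    for entry in MPTABLE:
        if (entry.bus, entry.slot, entry.pin) == (bus, slot, pin):
            # Look up the APIC by id, then ask "that pin" for its IRQ.
            return IOAPIC_IRQ_BASE[entry.apic_id] + entry.apic_pin
    raise LookupError(f"no MP Table entry for ({bus}, {slot}, {pin})")
```

Nothing here can be changed by the OS: the function only reads the table.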

but from my memory, many PCI devices can select between A,B,C and D
so maybe by going to the device and selecting a different one of those
you can force it to go elsewhere...

> 
> For the non-APIC case (non-ACPI still), PCI device interrupts are  
> usually wired up to a pin on a programmable interrupt router.  Each  of 
> these pins is called a pci link device.  Multiple PCI device  interrupts 
> may be wired up to the same link device, and systems  typically have 
> anywhere from 4 to 8 (sometimes even more) link  devices.  Each link 
> device can be independently routed to a given  (pic, pin) and it is 
> limited to a fixed set of possible IRQs.  If  multiple link devices are 
> routed to the same IRQ, then all of the  devices attached to both link 
> devices end up sharing the same IRQ  (and thus the same ithread, etc.).  
> Because the link devices are  independently steerable, this is the one 
> way in which the OS has  limited flexibility in routing interrupts.  
> However, the way it works  is that you route the link devices, not 
> individual PCI device  interrupts.  The table the BIOS provides with the 
> information about  the link devices is called the $PIR (since that's the 
> 4 byte  signature you search for in RAM to find it).  You can see it 
> during a  verbose boot dmesg.  It is a table that maps a given (bus, 
> slot,  intpin) PCI tuple to a link index.  Each entry also has a bitmask 
> of  the valid ISA IRQs ($PIR only allows for the 16 ISA IRQs) that the  
> specified link index can be routed to.  Thus, the way that interrupt  
> routing works with $PIR is that when a PCI interrupt is routed, you  
> search the table for a matching entry to get a link index.  The $PIR  
> code in sys/i386/pci/pci_pir.c basically has a list of link objects  
> that maintain state about each link.  The code finds the data  
> associated with the link index and sees if it has an IRQ routed  
> already.  If so, that's the IRQ that that PCI device interrupt is  
> assigned to.  If an IRQ isn't routed already, then it has to use an  
> algorithm to pick one, make a BIOS call to route the link to the  chosen 
> IRQ, and then assign the PCI device interrupt to that IRQ.
> 
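The $PIR flow above, as a rough Python sketch. PirLink, pick_irq, and
pir_route are hypothetical names, not the ones in pci_pir.c. Two
simplifications are assumptions: the IRQ bitmask is stored on the link object
rather than on each table entry, and pick_irq just takes the lowest allowed
IRQ as a placeholder for whatever the real algorithm chooses. The BIOS call
is represented by a callback.

```python
from typing import Callable, Dict, Optional, Tuple

PciIntr = Tuple[int, int, str]  # (bus, slot, pin)


class PirLink:
    """State kept for one link: allowed ISA IRQs and the current routing."""

    def __init__(self, index: int, irq_mask: int, irq: Optional[int] = None):
        self.index = index
        self.irq_mask = irq_mask  # bit n set => ISA IRQ n is allowed
        self.irq = irq            # IRQ the link is routed to, if any


def pick_irq(link: PirLink) -> int:
    """Placeholder policy (an assumption): the lowest IRQ the mask allows."""
    for irq in range(16):
        if link.irq_mask & (1 << irq):
            return irq
    raise RuntimeError(f"link {link.index:#x} has no valid IRQs")


def pir_route(intr: PciIntr,
              table: Dict[PciIntr, int],
              links: Dict[int, PirLink],
              bios_route: Callable[[int, int], None]) -> int:
    """Route a PCI interrupt by routing the link it is wired to."""
    link = links[table[intr]]        # (bus, slot, pin) -> link index -> state
    if link.irq is None:
        irq = pick_irq(link)
        bios_route(link.index, irq)  # stand-in for the BIOS routing call
        link.irq = irq
    return link.irq
```

Two PCI interrupts wired to the same link get the same IRQ, and the BIOS is
only asked to route that link once.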

so, is a "link device" a physical piece of hardware or a software abstraction?

> Now that you understand that, ACPI routing can make some sense.  The  
> way that ACPI routing works is that each PCI bus in the ACPI  namespace 
> has a _PRT method that returns a table of routing entries.   Each entry 
> contains the slot and intpin that it handles (so that you  can build the 
> (bus, slot, intpin) PCI tuple (bus comes from the PCI  bus device _PRT 
> is a child of, in FreeBSD the _PRT is actually a  child of the pcib(4) 
> device that is the bridge that is the parent of  the PCI bus, but I 
> digress)) as well as a reference to a link device  in the ACPI namespace 
> and a source index.  If the link device  reference is empty or NULL, 
> then the interrupt is a hard-wired  interrupt such as the ones used with 
> MP Table routing, and the source  index is the global interrupt number 
> (==IRQ) that you use for this  interrupt and you are done.  If the link 
> device reference isn't empty, then it is the name of an ACPI device 
> object that manages a  single pci link device.  Example names include 
> \_SB_.PCI0.LPC0.LNKA.   Each link device object includes methods to 
> query which IRQ it is  currently routed to (though in practice this is 
> unreliable), get the  list of possible IRQs, disable the link device 
> altogether, and route  the link device to a specified IRQ.  This is 
> similar to the link  objects we have in the $PIR code except that these 
> end up being full  blown devices on the ACPI side.  ACPI adds another 
> twist in that the  BIOS is free to use link devices with APICs (MP Table 
> has no way of  handling that), and in fact in practice there are some 
> nvidia  chipsets for amd64 that do route some PCI device interrupts to 
> link  devices that in APIC mode can be routed to any of the IRQs 20-23.
> 
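A hedged Python sketch of the _PRT decision described above. PrtEntry,
AcpiLink, and prt_route are invented names, and the table contents are made
up. The link object's route() method simply takes the first possible IRQ,
which is a placeholder and not a claim about how the ACPI code chooses.

```python
from typing import Dict, List, NamedTuple, Optional

LNKA = r"\_SB_.PCI0.LPC0.LNKA"  # example link device name from the text


class PrtEntry(NamedTuple):
    """One _PRT routing entry for a PCI bus."""
    slot: int
    pin: str
    source: Optional[str]  # link device name, or None if hard-wired
    source_index: int      # global interrupt number when hard-wired


class AcpiLink:
    """Stand-in for an ACPI link device object."""

    def __init__(self, possible_irqs: List[int], irq: Optional[int] = None):
        self.possible_irqs = possible_irqs
        self.irq = irq

    def route(self) -> int:
        if self.irq is None:
            self.irq = self.possible_irqs[0]  # placeholder choice (assumption)
        return self.irq


def prt_route(slot: int, pin: str,
              prt: List[PrtEntry], links: Dict[str, AcpiLink]) -> int:
    """Return the IRQ for (slot, pin) on the bus this _PRT belongs to."""
    for entry in prt:
        if (entry.slot, entry.pin) == (slot, pin):
            if not entry.source:
                return entry.source_index     # hard-wired: index == IRQ
            return links[entry.source].route()  # routed via a link device
    raise LookupError(f"no _PRT entry for slot {slot} {pin}")
```

The hard-wired branch corresponds to the MP Table style of routing, and the
link branch to the $PIR style, with the difference that a link here may offer
APIC-mode IRQs such as 20-23.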
> Now some of the minor trivia and exceptions.  On the first I/O APIC,  
> IRQ0 is generally routed to intpin 2, not intpin 0 (though many  
> motherboards don't actually hook up the IRQ0 output from the ISA  timer 
> to intpin 2 but do claim to do so in the MP Table and MADT).   Instead 
> intpin 0 is a special ExtINT pin that listens to the 8259As  and can 
> forward interrupts from the 8259As to one or more CPUs.  This  is what 
> "mixed mode" is, and on FreeBSD 4.x, if we discover via a  test that the 
> motherboard did not wire IRQ0 up to intpin 2, we use  mixed mode to 
> deliver it via the 8259A bounced through the ExtINT pin  0 on the first 
> I/O APIC.  Blech.  Also, for ACPI, the SCI is  generally tied to IRQ 9, 
> however, the SCI may be routed to another  intpin in APIC mode.  Rather 
> than change the IRQ value in the FADT  (or whichever table the SCI INT 
> is in), ACPI will include an entry in  the MADT that maps IRQ 9 to some 
> other intpin such as IRQ 13 or IRQ  20.  If the new intpin is not an ISA 
> IRQ (> 15) we use a backdoor to  override the IRQ ACPI uses.  If the new 
> intpin is an ISA IRQ though,  we actually rename the destination IRQ 
> (such as IRQ 13 on one of my  boxes) to IRQ 9, and the original IRQ 9 
> becomes a "dead" interrupt  pin with no IRQ associated with it.  Note 
> that except for a few rare  and very old SMP boxes, no FreeBSD x86 
> machine has an IRQ 2.  Another  odd case is that some very old SMP boxes 
> did not route PCI device  interrupts to the APICs at all.  Instead, they 
> routed the outputs of  the link devices to the pins on the first I/O 
> APIC corresponding to  the same IRQ as on the 8259A (the I/O APIC only 
> had 16 pins).  Thus,  on these boxes, PCI interrupts are still routed 
> via link devices via  $PIR, and end up triggering IRQ X via intpin X on 
> the first I/O  APIC.  One final twist.  If a PCI bus behind a PCI-PCI 
> bridge is not  listed in a BIOS table ($PIR or MP Table) or does not 
> have a _PRT in  ACPI, the interrupts are routed by applying the swizzle 
> defined in  the PCI standard to route the interrupt via one of the four 
> INTx pins  on the PCI-PCI bridge's parent PCI bus.  The standard defines 
> this  behavior for add-in cards, but some built in busses do this as 
> well.   (I've seen several AGP busses that actually use this method to 
> route  the VGA IRQ).
> 
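The swizzle mentioned above is short enough to write out. This Python sketch
assumes the usual formula from the PCI-PCI bridge specification: number the
pins INTA=0 through INTD=3, add the device's slot number on the child bus, and
take the result modulo 4 to get the pin on the bridge's parent side. The
function name is made up.

```python
PINS = ["INTA", "INTB", "INTC", "INTD"]


def swizzle(slot: int, pin: str) -> str:
    """Map a child device's (slot, pin) to the bridge's pin on the parent bus."""
    return PINS[(slot + PINS.index(pin)) % 4]
```

For example, INTA on slot 1 behind the bridge shows up as INTB on the bridge
itself, and the bridge's own (bus, slot, INTB) tuple is then routed on the
parent bus in the normal way.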
>> Also, if the "boot interrupt" was previously set to 2, is that  likely 
>> to have changed in -current?
>> Am I now going to get clobbered on IRQ16?  If yes, is this  something 
>> that the BIOS writers
>> decided, or something that the Motherboard designers decided?
> 
> 
> The "boot interrupt" issue on some of the PXH's used for PCI-X and  
> PCI-e host bridges is an unpleasant mess.  I think it comes from  Intel 
> assuming all the world is windows (imagine that) and ignoring  standards 
> (such as MP Table and ACPI) that it helps to author.  (Yay  Intel!)  The 
> issue there is that the PXH's include a dedicated I/O  APIC for each of 
> the two busses the PXH serves, and the PCI device  interrupts are routed 
> to intpins on those APICs.  To handle the non- APIC case, the PXH's 
> forward any device interrupts to the INTx pins  on the parent side of 
> the PCI-PCI bridge if the APIC is disabled.   The problem is that Intel 
> chose a hack to figure out if the APIC was  disabled and that hack 
> interacts badly with FreeBSD.  Basically, if  the individual intpin is 
> masked in the APIC, the PXH assumes you  aren't using the APIC to handle 
> interrupts and so it forwards the  interrupt to the INTx pin on its 
> bridge's parent side.  The problem  is that after an interrupt comes in 
> on 4.x and later, we mask the  interrupt in the APIC until we have run 
> the interrupt handler.  The  reason is that PCI interrupts are level 
> triggered, so they won't  "shut up" until the ISR has run and pacified 
> the PCI device.  4.x  masks the interrupt because it wants to not run 
> ISRs with all  interrupts disabled, but at the same cpl that the 
> interrupt was  registered at so that higher priority interrupts can 
> still preempt an  ISR.  5.x and later need to mask the interrupt so that 
> the processor  doesn't have to keep interrupts disabled until the 
> ithread finishes.   Trying to do that would become complicated and quite 
> painful since it  would also mean deferring the EOI to the lapic (which 
> has to happen  on the same CPU that received the interrupt) and has 
> other nastiness  since ithreads can block on locks, etc.  Other OS's 
> that use ithreads  such as BSD/OS and probably Solaris/x86 and 
> Darwin/x86 probably have  the same issue.  The sucky part is that Intel 
> didn't have to do this  gross hack.  ACPI requires that the OS call a 
> method _PIC if it wants  to use APIC mode, and the _PIC method is free 
> to write to registers,  PCI config space, etc., so Intel could have 
> provided a register to  specify if the PXH's APIC was being used or not 
> and included the code  to manage that in _PIC in their sample BIOS.  
> But, they didn't.
> 
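A toy Python model of the interaction described above, only to show the
ordering problem. It is not based on PXH documentation beyond what the text
says: the function names and the string labels are invented, and the model
ignores everything except the "is the pin masked" test.

```python
from typing import List, Optional


def pxh_delivery(line_asserted: bool, pin_masked: bool) -> Optional[str]:
    """Where the PXH sends a child device interrupt, per the text above."""
    if not line_asserted:
        return None            # device is quiet, nothing to deliver
    if pin_masked:
        return "parent INTx"   # PXH assumes the APIC is not in use
    return "PXH I/O APIC"


def handle_one_interrupt() -> List[Optional[str]]:
    """Trace one level-triggered interrupt through a mask-then-handle kernel."""
    trace = []
    asserted, masked = True, False
    trace.append(pxh_delivery(asserted, masked))  # normal delivery
    masked = True                                 # kernel masks the pin
    trace.append(pxh_delivery(asserted, masked))  # line is still asserted
    asserted = False                              # handler quiets the device
    masked = False                                # kernel unmasks the pin
    trace.append(pxh_delivery(asserted, masked))
    return trace
```

The middle step is the spurious "boot interrupt": the same device interrupt
is seen a second time on the parent side while the handler is still pending.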
> One possible workaround for this issue would be to provide a hacked  
> PCI-PCI bridge driver for the PXH's that hacked the PCI interrupt  
> routing such that the PCI device interrupts for child devices didn't  
> use the APICs in the PXH at all, but used the IRQs that get aliased  to 
> (such as IRQ16 on 5.2+).  Getting that to work on 4.x might be  quite 
> painful since 4.x PCI interrupt routing code is rather gross  and hacky 
> already.
> 
> Hopefully this at least answers some questions and gives a good  
> overview of what PCI interrupt routing is and how it works, etc.

My head hurts, but a lot makes more sense now.
I'll need to read this a few more times however.
If you made this into a web page and added a few diagrams, that would be
amazing. Also, you use a few acronyms without saying what they are.

Thanks!
