Re: PCI topology-based hints
Date: Sat, 01 Mar 2025 21:23:43 UTC
On Fri, Feb 28, 2025 at 12:43 AM Ravi Pokala <rpokala@freebsd.org> wrote:

> Hi folks,
>
> Setting up device attachment hints based on PCI address is easy; it's
> right there in the manual (pci.4):
>
> | DEVICE WIRING
> |      You can wire the device unit at a given location with device.hints.
> |      Entries of the form hints.<name>.<unit>.at="pci<B>:<S>:<F>" or
> |      hints.<name>.<unit>.at="pci<D>:<B>:<S>:<F>" will force the driver
> |      name to probe and attach at unit unit for any PCI device found to
> |      match the specification, where:
> | ...
> | Examples
> |      Given the following lines in /boot/device.hints:
> |            hint.nvme.3.at="pci6:0:0"
> |            hint.igb.8.at="pci14:0:0"
> |      If there is a device that supports igb(4) at PCI bus 14 slot 0
> |      function 0, then it will be assigned igb8 for probe and attach.
> |      Likewise, if there is an nvme(4)
>
> That's all well and good in a world without pluggable and hot-swappable
> devices, but things get trickier when devices can appear and disappear.
>
> We have systems which have multiple U.2 bays, which take NVMe PCIe
> devices. Across multiple reboots, the <D, B, S, F> address assigned to the
> device in each of those bays was consistent. Great! We set up wiring hints
> for those devices, and confirmed that the wiring worked when devices were
> swapped ...
>
> ... until we added a NIC into the hot-swap OCP slot and rebooted.
>
> While things continued to work before the reboot, upon reboot, many
> addresses changed. It looks like the slot into which the NIC was installed
> is on the same segment of the bus as the U.2 bays. When that segment was
> enumerated, the addresses got shuffled to include the NIC.
>
> So, we can't necessarily rely on the PCI <D, B, S, F> address. But the
> PCIe topology is consistent, even when devices are added and removed --
> it's the physical wiring between the root complex, bridges, devices, and
> expansion slots.
>
> The `lspci' utility -- ubiquitous on Linux, and available via the
> "sysutils/pciutils" port on FreeBSD -- can show the topology. For example,
> consider three NVMe devices, reported by `pciconf', and by `lspci's tree
> view (device details redacted):
>
> | % pciconf -l | tr '@' ' ' | sort -V -k2 | grep nvme
> | nvme2 pci0:65:0:0:  ...
> | nvme0 pci0:133:0:0: ...
> | nvme1 pci0:137:0:0: ...
> | %
> | % lspci -vt | grep -C2 -E '^..-|NVMe'
> | -+-[0000:00]-+-00.0  Root Complex
> |  |           +-00.2  ...
> |  |           +-00.3  ...
> | --
> |  |           +-18.6  ...
> |  |           \-18.7  ...
> |  +-[0000:40]-+-00.0  Root Complex
> |  |           +-00.2  ...
> |  |           +-00.3  ...
> |  |           +-01.0  ...
> |  |           +-01.1-[41]----00.0  ${VENDOR} NVMe
> |  |           +-01.3-[42-43]--
> |  |           +-01.4-[44-45]--
> | --
> |  |           |       \-00.1  ...
> |  |           \-07.2  ...
> |  +-[0000:80]-+-00.0  Root Complex
> |  |           +-00.2  ...
> |  |           +-00.3  ...
> | --
> |  |           +-03.0  ...
> |  |           +-03.1-[83-84]--
> |  |           +-03.2-[85-86]----00.0  ${VENDOR} NVMe
> |  |           +-03.3-[87-88]--
> |  |           +-03.4-[89-8a]----00.0  ${VENDOR} NVMe
> |  |           +-04.0  ...
> |  |           +-05.0  ...
> | --
> |  |           |       \-00.1  ...
> |  |           \-07.2  ...
> |  \-[0000:c0]-+-00.0  Root Complex
> |              +-00.2  ...
> |              +-00.3  ...
>
> The first set of xdigits, "[0000:n0]", are a "domain" and "bus", which are
> only shown for the Root Complex devices. The second set of xdigits, "xy.z",
> are either an endpoint's "slot" and "function", or else a bridge device's
> (address?) and (slot?). If there is a bridge, there is a set of xdigits in
> brackets next to each (slot?), which becomes the "bus" of the attached
> endpoint, followed by "xy.z", which is the endpoint's "slot" and "function".
> Thus, we can see from the tree that the NVMe devices are "0000:41:00.0",
> "0000:85:00.0", and "0000:89:00.0". (Converted to decimal, those are the
> same addresses reported by `pciconf': "pci0:65:0:0", "pci0:133:0:0", and
> "pci0:137:0:0".) It is also apparent that the latter two devices are
> connected to the same bridge, which in turn is connected to a different
> root complex than the first device.
>
> The problem is, depending on what devices are connected to a given root
> complex, the "bus" component which is associated with a bridge slot can
> change. In the example above, with the current population of devices in
> the "0000:80" portion of the tree, the "bus" components associated with
> bridge "03" are "83", "85", "87", and "89". But add another device to
> "0000:80" and reboot, and the addresses associated with bridge "03"
> become "84", "86", "88", and "8a".
>
> The question is this: How do I indicate that I would like a certain
> device unit to be wired to a specific bridge device address and slot --
> which cannot change -- rather than to a specific <D, B, S, F>, where the
> "B" component can change?
>
> Any thoughts?

Yes. You can use what's already there, though it is undocumented, or at
the very least underdocumented: you can wire devices to the UEFI device
path, which is guaranteed to be unique and avoids all of these problems.

	hint.nvme.77.at="UEFI:PcieRoot(2)/Pci(0x1,0x1)/Pci(0x0,0x0)"

That wires the unit to PCIe root complex 2, then follows device 1
function 1 on that bus to device 0 function 0 on the bridge's secondary
bus. `devctl getpath UEFI nvme0` will do all the heavy lifting for you.
TaDa! No bus numbers.

I added this several years ago to solve exactly this problem, as well as
what happens when you lose a riser card, etc.

Warner
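
For completeness, here is a minimal sketch of generating such hints in
bulk. It assumes the nvme0 through nvme2 units from the example above,
and it assumes `devctl getpath UEFI <dev>` prints the bare path without
the "UEFI:" prefix; adjust if your output differs.

	#!/bin/sh
	# Sketch: emit a UEFI-path wiring hint for each nvme unit listed.
	# Assumptions: nvme0..nvme2 are the units from the example above,
	# and `devctl getpath UEFI <dev>` prints the bare device path
	# (no "UEFI:" prefix).
	for dev in nvme0 nvme1 nvme2; do
		path=$(devctl getpath UEFI "$dev") || continue  # skip absent units
		unit=${dev#nvme}                                # "nvme2" -> "2"
		printf 'hint.nvme.%s.at="UEFI:%s"\n' "$unit" "$path"
	done

Run it once while all the devices are attached and append the output to
/boot/device.hints; the wiring then holds across reboots even when bus
numbers shuffle, because the path records only the topology (root
complex, then device and function at each hop), never a bus number.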