Re: PCI topology-based hints
- In reply to: Ravi Pokala : "Re: PCI topology-based hints"
Date: Sat, 01 Mar 2025 22:25:04 UTC
On Sat, Mar 1, 2025 at 3:19 PM Ravi Pokala <rpokala@freebsd.org> wrote:

> > Yes. You can use what's already there, but maybe not documented, or at
> > the very least underdocumented. You can wire devices to the UEFI path,
> > which is guaranteed to be unique and avoids all these problems.
> >
> > hint.nvme.77.at="UEFI:PcieRoot(2)/Pci(0x1,0x1)/Pci(0x0,0x0)"
> >
> > Which is on pcie root complex 2, then follow device 1 function 1 on that
> > bus to device 0 function 0 on the bus behind it. `devctl getpath UEFI nvme0`
> > will do all the heavy lifting for you. TaDa! No bus numbers.
> >
> > I added this several years ago to solve exactly this problem, or what
> > happens when you lose a riser card, etc.
> >
> > Warner
>
> Sweet! Thanks Warner, that’s exactly what I’m looking for. :-)
>
> You’re right that it’s under-documented. I think it should be relatively
> easy to find a list of buses which support wiring; I think this search
> should find them:
>
> | grep -Erl 'DEVMETHOD.*hint' /usr/src/sys

No. This is kinda independent of buses, but I think only PCI supports them
now. I'd look for DEVMETHOD.*get_device_path, though, since that's required
to make this work. I think I only implemented PCI, though.
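Something along these lines ought to list the bus drivers that provide it
(untested, and the exact pattern may need a tweak); I'd expect only pci(4)
to show up today:

  grep -Erl 'DEVMETHOD.*get_device_path' /usr/src/sys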
> And then make sure that the bus’ manpage describes the hinting mechanism,
> and add cross-refs between the bus’ manpage and device.hints.5
>
> If that sounds right, I’ll see if I can find some time to do that in the
> near future

Yes. I think that's right. I have this review
https://reviews.freebsd.org/D49195 that I just whipped up (history suggests
my writing will need a lot of help).

Warner

> Thanks again!
>
> -Ravi (rpokala@)
>
> *From: *Warner Losh <imp@bsdimp.com>
> *Date: *Saturday, March 1, 2025 at 13:23
> *To: *Ravi Pokala <rpokala@freebsd.org>
> *Cc: *"freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
> *Subject: *Re: PCI topology-based hints
>
> On Fri, Feb 28, 2025 at 12:43 AM Ravi Pokala <rpokala@freebsd.org> wrote:
>
> Hi folks,
>
> Setting up device attachment hints based on PCI address is easy; it's
> right there in the manual (pci.4):
>
> | DEVICE WIRING
> |   You can wire the device unit at a given location with device.hints.
> |   Entries of the form hints.<name>.<unit>.at="pci<B>:<S>:<F>" or
> |   hints.<name>.<unit>.at="pci<D>:<B>:<S>:<F>" will force the driver name to
> |   probe and attach at unit unit for any PCI device found to match the
> |   specification, where:
> | ...
> | Examples
> |   Given the following lines in /boot/device.hints:
> |     hint.nvme.3.at="pci6:0:0"  hint.igb.8.at="pci14:0:0"
> |   If there is a device that supports igb(4) at PCI bus 14 slot 0 function 0,
> |   then it will be assigned igb8 for probe and attach. Likewise, if there is
> |   an nvme(4)
>
> That's all well and good in a world without pluggable and hot-swappable
> devices, but things get trickier when devices can appear and disappear.
>
> We have systems which have multiple U.2 bays, which take NVMe PCIe
> devices. Across multiple reboots, the <D, B, S, F> address assigned to the
> device in each of those bays was consistent. Great! We set up wiring hints
> for those devices, and confirmed that the wiring worked when devices were
> swapped ...
>
> ... until we added a NIC into the hot-swap OCP slot and rebooted.
>
> While things continued to work before the reboot, upon reboot, many
> addresses changed. It looks like the slot into which the NIC was installed
> is on the same segment of the bus as the U.2 bays. When that segment was
> enumerated, the addresses got shuffled to include the NIC.
>
> So, we can't necessarily rely on the PCI <D, B, S, F> address. But the
> PCIe topology is consistent, even when devices are added and removed --
> it's the physical wiring between the root complex, bridges, devices, and
> expansion slots.
>
> The `lspci' utility -- ubiquitous on Linux, and available via the
> "sysutils/pciutils" port on FreeBSD -- can show the topology. For example,
> consider three NVMe devices, reported by `pciconf', and by `lspci's tree
> view (device details redacted):
>
> | % pciconf -l | tr '@' ' ' | sort -V -k2 | grep nvme
> | nvme2 pci0:65:0:0:  ...
> | nvme0 pci0:133:0:0: ...
> | nvme1 pci0:137:0:0: ...
> | %
> | % lspci -vt | grep -C2 -E '^..-|NVMe'
> | -+-[0000:00]-+-00.0  Root Complex
> |  |           +-00.2  ...
> |  |           +-00.3  ...
> | --
> |  |           +-18.6  ...
> |  |           \-18.7  ...
> |  +-[0000:40]-+-00.0  Root Complex
> |  |           +-00.2  ...
> |  |           +-00.3  ...
> |  |           +-01.0  ...
> |  |           +-01.1-[41]----00.0  ${VENDOR} NVMe
> |  |           +-01.3-[42-43]--
> |  |           +-01.4-[44-45]--
> | --
> |  |           |                \-00.1  ...
> |  |           \-07.2  ...
> |  +-[0000:80]-+-00.0  Root Complex
> |  |           +-00.2  ...
> |  |           +-00.3  ...
> | --
> |  |           +-03.0  ...
> |  |           +-03.1-[83-84]--
> |  |           +-03.2-[85-86]----00.0  ${VENDOR} NVMe
> |  |           +-03.3-[87-88]--
> |  |           +-03.4-[89-8a]----00.0  ${VENDOR} NVMe
> |  |           +-04.0  ...
> |  |           +-05.0  ...
> | --
> |  |           |                \-00.1  ...
> |  |           \-07.2  ...
> |  \-[0000:c0]-+-00.0  Root Complex
> |              +-00.2  ...
> |              +-00.3  ...
>
> The first set of xdigits, "[0000:n0]", are a "domain" and "bus", which are
> only shown for the Root Complex devices. The second set of xdigits, "xy.z",
> are either an endpoint's "slot" and "function", or else a bridge device's
> (address?) and (slot?). If there is a bridge, there is a set of xdigits in
> brackets next to each (slot?), which becomes the "bus" of the attached
> endpoint, and then "xy.z", which is the endpoint's "slot" and "function".
>
> Thus, we can see from the tree that the NVMe devices are "0000:41:00.0",
> "0000:85:00.0", and "0000:89:00.0". (Which, if you convert to decimal, is
> the same as reported by `pciconf': "pci0:65:0:0", "pci0:133:0:0",
> "pci0:137:0:0".) It is also apparent that the latter two devices are
> connected to the same bridge, which in turn is connected to a different
> root complex than the first device.
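(As a quick sanity check of that conversion, printf will do the hex-to-decimal
mapping; these are just the bus numbers from the tree above:)

  % printf '%d\n' 0x41 0x85 0x89
  65
  133
  137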
> The problem is, depending on what devices are connected to a given root
> complex, the "bus" component which is associated with a bridge slot can
> change. In the example above, with the current population of devices in the
> "0000:80" portion of the tree, the "bus" components associated with bridge
> "03" are "83", "85", "87", and "89". But add another device to "0000:80"
> and reboot, and the addresses associated with bridge "03" become "84",
> "86", "88", and "8a".
>
> The question is this: How do I indicate that I would like a certain device
> unit to be wired to a specific bridge device address and slot -- which
> cannot change -- rather than to a specific <D, B, S, F>, where the "B"
> component can change?
>
> Any thoughts?
>
> Yes. You can use what's already there, but maybe not documented, or at the
> very least underdocumented. You can wire devices to the UEFI path, which is
> guaranteed to be unique and avoids all these problems.
>
> hint.nvme.77.at="UEFI:PcieRoot(2)/Pci(0x1,0x1)/Pci(0x0,0x0)"
>
> Which is on pcie root complex 2, then follow device 1 function 1 on that
> bus to device 0 function 0 on the bus behind it. `devctl getpath UEFI nvme0`
> will do all the heavy lifting for you. TaDa! No bus numbers.
>
> I added this several years ago to solve exactly this problem, or what
> happens when you lose a riser card, etc.
>
> Warner
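Putting the pieces together, a minimal sketch of the workflow, reusing the
example path from above (the unit number is illustrative, and the exact
output format of devctl isn't shown here):

  # Print the UEFI device path for a device as it is currently attached
  % devctl getpath UEFI nvme0

  # Then wire that unit to the path it reports, in /boot/device.hints
  hint.nvme.0.at="UEFI:PcieRoot(2)/Pci(0x1,0x1)/Pci(0x0,0x0)"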