Re: PCI topology-based hints

From: Warner Losh <imp_at_bsdimp.com>
Date: Sat, 01 Mar 2025 22:25:04 UTC
On Sat, Mar 1, 2025 at 3:19 PM Ravi Pokala <rpokala@freebsd.org> wrote:

> > Yes. You can use what's already there, though it may not be documented,
> > or is at the very least under-documented. You can wire devices to the
> > UEFI path, which is guaranteed to be unique and avoids all these problems.
>
> >
>
> > hint.nvme.77.at="UEFI:PcieRoot(2)/Pci(0x1,0x1)/Pci(0x0,0x0)"
>
> >
>
> > Which is on pcie root complex 2, then follow device 1 function 1 on that
> > bus to device 0 function 0 on the bus behind that bridge. `devctl getpath
> > UEFI nvme0` will do all the heavy lifting for you. TaDa! No bus numbers.
>
> >
>
> > I added this several years ago to solve exactly this problem, or what
> > happens when you lose a riser card, etc.
>
> >
>
> > Warner
>
>
>
> Sweet! Thanks Warner, that’s exactly what I’m looking for. :-)
>
>
>
> You’re right that it’s under-documented. I think it should be relatively
> easy to find a list of buses which support wiring; I think this search
> should find them:
>
>
>
> | grep -Erl 'DEVMETHOD.*hint' /usr/src/sys
>

No. This is kinda independent of buses, though I think only PCI supports it
now. I'd look for DEVMETHOD.*get_device_path instead, since that's the
method required to make this work; I believe I only implemented it for PCI.
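To make that search concrete, here is a sketch of the suggested grep, run
against a small scratch tree instead of a real /usr/src checkout (the file
names and method-table lines below are made-up stand-ins for the actual
driver sources, used only to show what the pattern matches):

```shell
# Build a tiny fake source tree: one driver whose method table provides
# a get_device_path implementation, and one that does not.
mkdir -p /tmp/src-scan/dev/pci /tmp/src-scan/dev/foo
printf '\tDEVMETHOD(bus_get_device_path, pci_get_device_path_method),\n' \
    > /tmp/src-scan/dev/pci/pci.c
printf '\tDEVMETHOD(bus_read_ivar, foo_read_ivar),\n' \
    > /tmp/src-scan/dev/foo/foo.c

# Only the file implementing get_device_path should be listed:
grep -Erl 'DEVMETHOD.*get_device_path' /tmp/src-scan
```

Against a real source tree, the same pattern would list the drivers whose
buses can resolve a device path, and therefore support this style of wiring.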


> And then make sure that the bus’ manpage describes the hinting mechanism,
> and add cross-refs between the bus’ manpage and device.hints.5
>
>
>
> If that sounds right, I’ll see if I can find some time to do that in the
> near future
>

Yes. I think that's right. I have this review
https://reviews.freebsd.org/D49195 that I just whipped up (history suggests
my writing will need a lot of help).

Warner
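
For anyone finding this thread in the archives, the workflow above boils
down to two steps; the sketch below reuses the UEFI path and unit number
from Warner's example (your own `devctl` output will differ):

```shell
# 1) With the device currently attached, ask the kernel for its stable
#    UEFI device path -- no bus numbers involved:
#
#        devctl getpath UEFI nvme0
#
# 2) Wire the unit in /boot/device.hints using the path it printed,
#    prefixed with "UEFI:".  Path and unit number here are from the
#    example earlier in the thread:
hint.nvme.77.at="UEFI:PcieRoot(2)/Pci(0x1,0x1)/Pci(0x0,0x0)"
```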


> Thanks again!
>
>
>
> -Ravi (rpokala@)
>
>
>
>
>
> *From: *Warner Losh <imp@bsdimp.com>
> *Date: *Saturday, March 1, 2025 at 13:23
> *To: *Ravi Pokala <rpokala@freebsd.org>
> *Cc: *"freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
> *Subject: *Re: PCI topology-based hints
>
>
>
>
>
>
>
> On Fri, Feb 28, 2025 at 12:43 AM Ravi Pokala <rpokala@freebsd.org> wrote:
>
> Hi folks,
>
> Setting up device attachment hints based on PCI address is easy; it's
> right there in the manual (pci.4):
>
> | DEVICE WIRING
> |      You can wire the device unit at a given location with device.hints.
> |      Entries of the form hints.<name>.<unit>.at="pci<B>:<S>:<F>" or
> |      hints.<name>.<unit>.at="pci<D>:<B>:<S>:<F>" will force the driver
> |      name to probe and attach at unit unit for any PCI device found to
> |      match the specification, where:
> | ...
> |    Examples
> |      Given the following lines in /boot/device.hints:
> |            hint.nvme.3.at="pci6:0:0"
> |            hint.igb.8.at="pci14:0:0"
> |      If there is a device that supports igb(4) at PCI bus 14 slot 0
> |      function 0, then it will be assigned igb8 for probe and attach.
> |      Likewise, if there is an nvme(4)
>
> That's all well and good in a world without pluggable and hot-swappable
> devices, but things get trickier when devices can appear and disappear.
>
> We have systems which have multiple U.2 bays, which take NVMe PCIe
> devices. Across multiple reboots, the <D, B, S, F> address assigned to the
> device in each of those bays was consistent. Great! We set up wiring hints
> for those devices, and confirmed that the wiring worked when devices were
> swapped ...
>
> ... until we added a NIC into the hot-swap OCP slot and rebooted.
>
> Everything kept working until the reboot, but afterwards many addresses
> had changed. It looks like the slot into which the NIC was installed is on
> the same segment of the bus as the U.2 bays. When that segment was
> enumerated, the addresses got shuffled to include the NIC.
>
> So, we can't necessarily rely on the PCI <D, B, S, F> address. But the
> PCIe topology is consistent, even when devices are added and removed --
> it's the physical wiring between the root complex, bridges, devices, and
> expansion slots.
>
> The `lspci' utility -- ubiquitous on Linux, and available via the
> "sysutils/pciutils" port on FreeBSD -- can show the topology. For example,
> consider three NVMe devices, reported by `pciconf', and by `lspci's tree
> view (device details redacted):
>
> | % pciconf -l | tr '@' ' ' | sort -V -k2 | grep nvme
> | nvme2 pci0:65:0:0: ...
> | nvme0 pci0:133:0:0: ...
> | nvme1 pci0:137:0:0: ...
> | %
> | % lspci -vt | grep -C2 -E '^..-|NVMe'
> | -+-[0000:00]-+-00.0  Root Complex
> |  |           +-00.2  ...
> |  |           +-00.3  ...
> | --
> |  |           +-18.6  ...
> |  |           \-18.7  ...
> |  +-[0000:40]-+-00.0  Root Complex
> |  |           +-00.2  ...
> |  |           +-00.3  ...
> |  |           +-01.0  ...
> |  |           +-01.1-[41]----00.0  ${VENDOR} NVMe
> |  |           +-01.3-[42-43]--
> |  |           +-01.4-[44-45]--
> | --
> |  |           |            \-00.1  ...
> |  |           \-07.2  ...
> |  +-[0000:80]-+-00.0  Root Complex
> |  |           +-00.2  ...
> |  |           +-00.3  ...
> | --
> |  |           +-03.0  ...
> |  |           +-03.1-[83-84]--
> |  |           +-03.2-[85-86]----00.0  ${VENDOR} NVMe
> |  |           +-03.3-[87-88]--
> |  |           +-03.4-[89-8a]----00.0  ${VENDOR} NVMe
> |  |           +-04.0  ...
> |  |           +-05.0  ...
> | --
> |  |           |            \-00.1  ...
> |  |           \-07.2  ...
> |  \-[0000:c0]-+-00.0  Root Complex
> |              +-00.2  ...
> |              +-00.3  ...
>
> The first set of xdigits, "[0000:n0]", is a "domain" and "bus", which are
> only shown for the Root Complex devices. The second set of xdigits, "xy.z",
> is a "slot" (device) and "function", for endpoints and bridges alike. If
> the device is a bridge, a set of xdigits in brackets follows it, giving
> the bus (or bus range) behind that bridge; that becomes the "bus" of any
> attached endpoint, whose own "slot" and "function" then appear as another
> "xy.z".
>
> Thus, we can see from the tree that the NVMe devices are "0000:41:00.0",
> "0000:85:00.0", and "0000:89:00.0". (Which, if you convert to decimal, is
> the same as reported by `pciconf': "pci0:65:0:0", "pci0:133:0:0",
> "pci0:137:0:0".) It is also apparent that the latter two devices are
> connected to the same bridge, which in turn is connected to a different
> root complex than the first device.
>
> The problem is, depending on what devices are connected to a given root
> complex, the "bus" component which is associated with a bridge slot can
> change. In the example above, with the current population of devices in the
> "0000:80" portion of the tree, the "bus" components associated with bridge
> "03" are "83", "85", "87", and "89". But add another device to "0000:80"
> and reboot, and the addresses associated with bridge "03" become "84",
> "86", "88", and "8a".
>
> The question is this: How do I indicate that I would like a certain device
> unit to be wired to a specific bridge device address and slot -- which
> cannot change -- rather than to a specific <D, B, S, F>, where the "B"
> component can change?
>
> Any thoughts?
>
>
>
> Yes. You can use what's already there, though it may not be documented, or
> is at the very least under-documented. You can wire devices to the UEFI
> path, which is guaranteed to be unique and avoids all these problems.
>
>
>
> hint.nvme.77.at="UEFI:PcieRoot(2)/Pci(0x1,0x1)/Pci(0x0,0x0)"
>
>
>
> Which is on pcie root complex 2, then follow device 1 function 1 on that
> bus to device 0 function 0 on the bus behind that bridge. `devctl getpath
> UEFI nvme0` will do all the heavy lifting for you. TaDa! No bus numbers.
>
>
>
> I added this several years ago to solve exactly this problem, or what
> happens when you lose a riser card, etc.
>
>
>
> Warner
>
>
>