Date: Thu, 27 Feb 2025 23:42:57 -0800
Subject: PCI topology-based hints
From: Ravi Pokala
To: "freebsd-hackers@freebsd.org"
Message-ID: <60B4BAAA-E333-4219-99BE-D6C1B198E0BD@freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

Hi folks,

Setting up device attachment hints based on PCI address is easy; it's right
there in the manual (pci.4):

| DEVICE WIRING
| You can wire the device unit at a given location with device.hints.
| Entries of the form hints.<name>.<unit>.at="pci<B>:<S>:<F>" or
| hints.<name>.<unit>.at="pci<D>:<B>:<S>:<F>" will force the driver name to
| probe and attach at unit unit for any PCI device found to match the
| specification, where:
| ...
| Examples
| Given the following lines in /boot/device.hints:
| hint.nvme.3.at="pci6:0:0" hint.igb.8.at="pci14:0:0" If there is a device
| that supports igb(4) at PCI bus 14 slot 0 function 0, then it will be
| assigned igb8 for probe and attach. Likewise, if there is an nvme(4)

That's all well and good in a world without pluggable and hot-swappable
devices, but things get trickier when devices can appear and disappear.

We have systems which have multiple U.2 bays, which take NVMe PCIe devices.
Across multiple reboots, the address assigned to the device in each of those
bays was consistent. Great! We set up wiring hints for those devices, and
confirmed that the wiring worked when devices were swapped ...

... until we added a NIC into the hot-swap OCP slot and rebooted. While things
continued to work before the reboot, upon reboot, many addresses changed. It
looks like the slot into which the NIC was installed is on the same segment of
the bus as the U.2 bays. When that segment was enumerated, the addresses got
shuffled to include the NIC.

So, we can't necessarily rely on the PCI address. But the PCIe topology is
consistent, even when devices are added and removed -- it's the physical wiring
between the root complex, bridges, devices, and expansion slots.

The `lspci' utility -- ubiquitous on Linux, and available via the
"sysutils/pciutils" port on FreeBSD -- can show the topology. For example,
consider three NVMe devices, reported by `pciconf', and by `lspci's tree view
(device details redacted):

| % pciconf -l | tr '@' ' ' | sort -V -k2 | grep nvme
| nvme2 pci0:65:0:0: ...
| nvme0 pci0:133:0:0: ...
| nvme1 pci0:137:0:0: ...
| %
| % lspci -vt | grep -C2 -E '^..-|NVMe'
| -+-[0000:00]-+-00.0 Root Complex
|  |           +-00.2 ...
|  |           +-00.3 ...
| --
|  |           +-18.6 ...
|  |           \-18.7 ...
|  +-[0000:40]-+-00.0 Root Complex
|  |           +-00.2 ...
|  |           +-00.3 ...
|  |           +-01.0 ...
|  |           +-01.1-[41]----00.0 ${VENDOR} NVMe
|  |           +-01.3-[42-43]--
|  |           +-01.4-[44-45]--
| --
|  |           |               \-00.1 ...
|  |           \-07.2 ...
|  +-[0000:80]-+-00.0 Root Complex
|  |           +-00.2 ...
|  |           +-00.3 ...
| --
|  |           +-03.0 ...
|  |           +-03.1-[83-84]--
|  |           +-03.2-[85-86]----00.0 ${VENDOR} NVMe
|  |           +-03.3-[87-88]--
|  |           +-03.4-[89-8a]----00.0 ${VENDOR} NVMe
|  |           +-04.0 ...
|  |           +-05.0 ...
| --
|  |           |               \-00.1 ...
|  |           \-07.2 ...
|  \-[0000:c0]-+-00.0 Root Complex
|              +-00.2 ...
|              +-00.3 ...

The first set of xdigits, "[0000:n0]", is a "domain" and "bus", and is only
shown for the Root Complex devices. The second set of xdigits, "xy.z", is
either an endpoint's "slot" and "function", or else a bridge device's
(address?) and (slot?). If there is a bridge, there is a set of xdigits in
brackets next to each (slot?), which becomes the "bus" of the attached
endpoint, followed by "xy.z", which is the endpoint's "slot" and "function".

Thus, we can see from the tree that the NVMe devices are "0000:41:00.0",
"0000:85:00.0", and "0000:89:00.0". (Which, if you convert the hex to decimal,
is the same as reported by `pciconf': "pci0:65:0:0", "pci0:133:0:0",
"pci0:137:0:0".) It is also apparent that the latter two devices are connected
to the same bridge, which in turn is connected to a different root complex
than the first device.

The problem is, depending on what devices are connected to a given root
complex, the "bus" component which is associated with a bridge slot can
change. In the example above, with the current population of devices in the
"0000:80" portion of the tree, the "bus" components associated with bridge
"03" are "83", "85", "87", and "89". But add another device to "0000:80" and
reboot, and the addresses associated with bridge "03" become "84", "86", "88",
and "8a".

The question is this: How do I indicate that I would like a certain device
unit to be wired to a specific bridge device address and slot -- which cannot
change -- rather than to a specific <D>:<B>:<S>:<F> address, where the "B"
component can change?

Any thoughts?

Thanks,

Ravi (rpokala@)
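
For reference, the hex-to-decimal correspondence between the `lspci' addresses and the `pciconf' selectors in the message above can be checked with a few lines of Python. This is only a minimal sketch: the function names `lspci_to_pciconf' and `hint_line' are hypothetical, and the hint format is assumed to be the "hint.<driver>.<unit>.at" form shown in the pci(4) example. It illustrates address-based wiring only; it does not provide the topology-based wiring the message asks about.

```python
import re

# lspci prints addresses as domain:bus:slot.function, with every field in hex.
_LSPCI_ADDR = re.compile(
    r"^([0-9a-fA-F]{4,}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def lspci_to_pciconf(addr: str) -> str:
    """Convert e.g. "0000:85:00.0" to the decimal selector "pci0:133:0:0"."""
    match = _LSPCI_ADDR.match(addr)
    if match is None:
        raise ValueError(f"not a domain:bus:slot.function address: {addr!r}")
    domain, bus, slot, func = (int(field, 16) for field in match.groups())
    return f"pci{domain}:{bus}:{slot}:{func}"


def hint_line(driver: str, unit: int, addr: str) -> str:
    """Build an address-based wiring hint (assumed format, per the pci(4) example)."""
    return f'hint.{driver}.{unit}.at="{lspci_to_pciconf(addr)}"'


if __name__ == "__main__":
    # The three NVMe addresses read off the lspci tree in the message.
    for addr in ("0000:41:00.0", "0000:85:00.0", "0000:89:00.0"):
        print(addr, "->", lspci_to_pciconf(addr))
```

Because the bus field is part of the generated selector, any hint built this way goes stale as soon as the buses behind a bridge are renumbered, which is exactly the limitation described in the message.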