git: 20e15e905c58 - main - mlx5: Decrease FW init timeout from 120 seconds to 5 seconds
Date: Sun, 29 Jun 2025 20:53:00 UTC
The branch main has been updated by gallatin:
URL: https://cgit.FreeBSD.org/src/commit/?id=20e15e905c58e9e2020b2c3e40caa2e8406e5827
commit 20e15e905c58e9e2020b2c3e40caa2e8406e5827
Author: Andrew Gallatin <gallatin@FreeBSD.org>
AuthorDate: 2025-06-29 20:51:50 +0000
Commit: Andrew Gallatin <gallatin@FreeBSD.org>
CommitDate: 2025-06-29 20:51:50 +0000
mlx5: Decrease FW init timeout from 120 seconds to 5 seconds
When encountering a failed NIC, the mlx5 driver will wait up to 120
seconds for the firmware to respond. This timeout is absurdly huge, and
leads to boot times of 40 minutes to over an hour on our servers when a
NIC fails. Because mlx5 remains the best match for the device, the
driver attempts to attach to the failed NIC multiple times (once for
each driver loaded after mlx5) and waits 2 minutes on each attempt.
The resulting delay triggers watchdog timeouts in our environment,
rendering servers with a failed NIC entirely unbootable without manual
intervention.
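As a rough illustration of the arithmetic above, the worst-case wait is
the number of attach attempts multiplied by the pre-init timeout. The
helper name and the figure of 20 attempts are assumptions for
illustration only (20 is back-derived from 40 minutes at 2 minutes per
attempt); the real count depends on how many drivers load after mlx5.

```c
/* Timeout values from the diff, in milliseconds. */
#define OLD_FW_PRE_INIT_TIMEOUT_MILI	120000
#define NEW_FW_PRE_INIT_TIMEOUT_MILI	5000

/*
 * Hypothetical helper, not part of the driver: worst-case time spent
 * waiting on a dead NIC when the driver attaches `attempts` times and
 * every attempt runs out the full pre-init timeout.
 */
static long
total_wait_ms(int attempts, long timeout_ms)
{
	return ((long)attempts * timeout_ms);
}
```

With the assumed 20 attempts, total_wait_ms(20, OLD_FW_PRE_INIT_TIMEOUT_MILI)
is 2,400,000 ms (40 minutes), while total_wait_ms(20,
NEW_FW_PRE_INIT_TIMEOUT_MILI) is 100,000 ms (100 seconds).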
Note that FW_INIT_WARN_MESSAGE_INTERVAL must also be decreased, as
it must be less than the init timeout.
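The relationship between the two constants can be sketched with a
simulated polling loop. This is a minimal sketch under stated
assumptions, not the driver's actual wait routine: sim_wait_fw_init(),
its simulated clock, and the compile-time check are hypothetical; only
the three macro values come from the diff.

```c
/* Values as they stand after this commit, in milliseconds. */
#define FW_INIT_WAIT_MS			2
#define FW_PRE_INIT_TIMEOUT_MILI	5000
#define FW_INIT_WARN_MESSAGE_INTERVAL	2000

/* Illustrative guard: the warn interval must fit inside the timeout. */
_Static_assert(FW_INIT_WARN_MESSAGE_INTERVAL < FW_PRE_INIT_TIMEOUT_MILI,
    "warn interval must be less than the init timeout");

/*
 * Hypothetical simulation: poll every FW_INIT_WAIT_MS on a fake clock
 * until the firmware is ready at ready_at_ms (negative means never).
 * Returns 0 on success, -1 on timeout; *warnings counts how many
 * periodic warnings would have been emitted.
 */
static int
sim_wait_fw_init(long ready_at_ms, int *warnings)
{
	long now = 0;
	long next_warn = FW_INIT_WARN_MESSAGE_INTERVAL;

	*warnings = 0;
	while (ready_at_ms < 0 || now < ready_at_ms) {
		if (now >= FW_PRE_INIT_TIMEOUT_MILI)
			return (-1);	/* give up: firmware never answered */
		if (now >= next_warn) {
			(*warnings)++;	/* a real driver would log here */
			next_warn += FW_INIT_WARN_MESSAGE_INTERVAL;
		}
		now += FW_INIT_WAIT_MS;	/* stands in for a sleep */
	}
	return (0);
}
```

In this model a dead NIC (ready_at_ms < 0) times out after two
warnings, at 2000 ms and 4000 ms. Had the warn interval stayed at
20000 ms, the 5000 ms timeout would expire before any warning was
issued.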
Reviewed by: kib (initial version, before reducing warn interval)
Sponsored by: Netflix
---
sys/dev/mlx5/device.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/sys/dev/mlx5/device.h b/sys/dev/mlx5/device.h
index e6d46507a5d2..3e2c4f15a5cc 100644
--- a/sys/dev/mlx5/device.h
+++ b/sys/dev/mlx5/device.h
@@ -32,8 +32,8 @@
#define FW_INIT_TIMEOUT_MILI 2000
#define FW_INIT_WAIT_MS 2
-#define FW_PRE_INIT_TIMEOUT_MILI 120000
-#define FW_INIT_WARN_MESSAGE_INTERVAL 20000
+#define FW_PRE_INIT_TIMEOUT_MILI 5000
+#define FW_INIT_WARN_MESSAGE_INTERVAL 2000
#if defined(__LITTLE_ENDIAN)
#define MLX5_SET_HOST_ENDIANNESS 0