Re: Intermittent failure of routing/gateway with ix(4) (x86_64)

From: R Tyler Croy <rtyler_at_brokenco.de>
Date: Sat, 13 Sep 2025 13:28:04 UTC
(replies inline)


On Saturday, August 30th, 2025 at 10:45 AM, Kevin Bowling <kevin.bowling@kev009.com> wrote:

> You've got an assortment of MAC level errors going on in the 'after':
> +dev.ix.1.mac_stats.checksum_errs: 137
> +dev.ix.1.mac_stats.rx_missed_packets: 930676
> +dev.ix.1.mac_stats.rx_errs: 930676
> +dev.ix.0.mac_stats.checksum_errs: 45
> +dev.ix.0.mac_stats.local_faults: 563829
> +dev.ix.0.mac_stats.short_discards: 6
> +dev.ix.0.mac_stats.byte_errs: 6
> +dev.ix.0.mac_stats.ill_errs: 6
>
> I would be surprised if it is not a hardware issue given the MAC
> errors on both ports.
>
> It's been a minute since I looked at this but IIRC the thermal diode
> is somehow botched somewhere in the ix family so we don't get a
> notification in software if the PHY over temps. But it's also
> possible yours is already cooked or some other issue. A potentially
> useful hint, the X550 uses a lot less power and produces less heat.


I have modified the case and added a lot more ventilation pushing air specifically over the two NICs, which seems to have helped, maybe? The issue is so intermittent that it's hard to tell whether anything has changed due to my actions, lunar alignment, relative humidity, etc.


Here's another before/after snippet from a stall that I observed overnight.

-dev.ix.1.mac_stats.checksum_errs: 0
+dev.ix.1.mac_stats.checksum_errs: 100
 dev.ix.1.mac_stats.rx_errs: 0
 dev.ix.1.queue1.interrupt_rate: 31250
 dev.ix.1.queue0.interrupt_rate: 31250
-dev.ix.0.mac_stats.checksum_errs: 1
+dev.ix.0.mac_stats.checksum_errs: 373
 dev.ix.0.mac_stats.rec_len_errs: 0
-dev.ix.0.mac_stats.byte_errs: 0
-dev.ix.0.mac_stats.ill_errs: 0
+dev.ix.0.mac_stats.byte_errs: 3
+dev.ix.0.mac_stats.ill_errs: 3
 dev.ix.0.mac_stats.crc_errs: 0
-dev.ix.0.mac_stats.rx_errs: 0
+dev.ix.0.mac_stats.rx_errs: 15284
 dev.ix.0.queue1.interrupt_rate: 31250
 dev.ix.0.queue0.interrupt_rate: 31250


With the "cooked" comment, I'm wondering whether you think these NICs are irreparably damaged, so that no amount of fan-fiddling will improve the situation, assuming the cause is thermal. The other interesting variable I have observed is that since making the cooling modifications and relocating the machine in the rack, the stalls _seem_ to occur overnight, now sometime between 3 and 5 am local time.


I have set up a periodic cron job to snapshot the output of `sysctl dev.ix` to see whether any other patterns show up in the data.
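
For comparing two of those snapshots, something like the following might be enough. It is only a rough sketch: it assumes each snapshot file holds plain `sysctl dev.ix` output in the `name: value` form shown in the snippets above, and the function names and file arguments are made up for illustration.

```python
import sys


def parse_snapshot(text):
    """Parse `name: value` lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if not sep:
            continue  # not a `name: value` line
        try:
            counters[name.strip()] = int(value.strip())
        except ValueError:
            continue  # skip non-numeric sysctls such as description strings
    return counters


def changed_counters(before, after):
    """Return {name: (old, new)} for every counter whose value differs.

    A counter missing from `before` is treated as 0 (an assumption).
    """
    return {
        name: (before.get(name, 0), new)
        for name, new in sorted(after.items())
        if before.get(name, 0) != new
    }


# Hypothetical usage: python3 ixdiff.py before.txt after.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f:
        before = parse_snapshot(f.read())
    with open(sys.argv[2]) as f:
        after = parse_snapshot(f.read())
    for name, (old, new) in changed_counters(before, after).items():
        print(f"{name}: {old} -> {new} ({new - old:+d})")
```

Unchanged values such as the `interrupt_rate` lines drop out, which leaves only the counters that moved between the two snapshots.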

I purchased these in a lot with two other NICs that are in other machines in the same rack, but those do _not_ act as gateways. Those two devices have performed to expectations, albeit in wildly different chassis. If 10GigE NICs were cheaper I would toss these in the bin and get some different cards, but I'm rather motivated to first make sure there's not a software/configuration issue here :)


Cheers