Re: igb(4) and VLAN issue?

From: Franco Fichtner <franco_at_lastsummer.de>
Date: Tue, 03 Aug 2021 15:27:51 UTC

Hi Kevin,

[RESENT TO MAILING LIST AS SUBSCRIBER]

> On 2. Aug 2021, at 7:51 PM, Kevin Bowling <kevin.bowling@kev009.com> wrote:
> 
> I caught wind that an igb(4) commit I've done to main and that has
> been in stable/12 for a few months seems to be causing a regression on
> opnsense.  The commit in question is
> https://cgit.freebsd.org/src/commit/?id=eea55de7b10808b86277d7fdbed2d05d3c6db1b2
> 
> The report is at:
> https://forum.opnsense.org/index.php?topic=23867.0

Looks like I spoke too soon earlier.  This is a weird one for sure.  :)

So first of all this causes an ifconfig hang for VLAN/LAGG combo creation,
but later reports were coming in about ahci errors and cam timeouts.
Some reported the instabilities start with using netmap, but later others
confirmed the same for high load scenarios without netmap in use.

This does not appear to happen when MSI-X is disabled, e.g.:

# sysctl -a | grep dev.igb | grep msix
dev.igb.5.iflib.disable_msix: 1
dev.igb.4.iflib.disable_msix: 1
dev.igb.3.iflib.disable_msix: 1
dev.igb.2.iflib.disable_msix: 1
dev.igb.1.iflib.disable_msix: 1
dev.igb.0.iflib.disable_msix: 1
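
For anyone comparing setups, here is a minimal sketch of a helper that
lists which igb units have the override set, given sysctl output in the
format shown above.  The function name msix_disabled_units is made up
for illustration.  The loader.conf hint in the comments is an
assumption on my side (the knob should be a boot-time tunable), so
please verify it on your own system.

```shell
#!/bin/sh
# Hypothetical helper: print the igb units whose iflib MSI-X override is 1.
# Reads lines in the form "dev.igb.N.iflib.disable_msix: V" on stdin.
#
# Intended use on a FreeBSD/OPNsense box:
#   sysctl dev.igb | msix_disabled_units
#
# Assumption: to disable MSI-X for a unit, the tunable is set at boot,
# e.g. in /boot/loader.conf:
#   dev.igb.0.iflib.disable_msix="1"
msix_disabled_units() {
    # Keep only matching lines with value 1 and rewrite them to "igbN".
    sed -n 's/^dev\.igb\.\([0-9][0-9]*\)\.iflib\.disable_msix: 1$/igb\1/p'
}
```

Units that print nothing here are still on the default interrupt mode.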

What's also being linked to this is some form of softraid misbehaving
and a general tendency for the problem to show up on cheaper hardware
with particular igb chipsets.

> I haven't heard of this issue elsewhere and cannot replicate it on my
> I210s running main.  I've gone over the code changes line by line
> several times and verified all the logic and register writes and it
> all looks correct to my understanding.  The only hypothesis I have at
> the moment is it may be some subtle timing issue since VLAN changes
> unnecessarily restart the interface on e1000 until I push in a work in
> progress to stop doing that.

I also have no way of reproducing this locally, but the community is
probably willing to give any kernel change a try that would address
the problem without having to back out the commit in question.

> I'd like to see the output of all the processes or at least the
> process configuring the VLANs to see where it is stuck.  Franco, do
> you have the ability to 'control+t' there or otherwise set up a break
> into a debugger?  Stacktraces would be a great start but a core and a
> kernel may be necessary if it isn't obvious.

Let me see if I can deliver on this easily.


Cheers,
Franco