Re: igb(4) and VLAN issue?

From: Kevin Bowling <kevin.bowling_at_kev009.com>
Date: Tue, 03 Aug 2021 15:50:48 UTC
On Tue, Aug 3, 2021 at 8:27 AM Franco Fichtner <franco@lastsummer.de> wrote:
>
> Hi Kevin,
>
> [RESENT TO MAILING LIST AS SUBSCRIBER]
>
> > On 2. Aug 2021, at 7:51 PM, Kevin Bowling <kevin.bowling@kev009.com> wrote:
> >
> > I caught wind that an igb(4) commit I've done to main and that has
> > been in stable/12 for a few months seems to be causing a regression on
> > opnsense.  The commit in question is
> > https://cgit.freebsd.org/src/commit/?id=eea55de7b10808b86277d7fdbed2d05d3c6db1b2
> >
> > The report is at:
> > https://forum.opnsense.org/index.php?topic=23867.0
>
> Looks like I spoke too soon earlier.  This is a weird one for sure.  :)
>
> So first of all this causes an ifconfig hang for VLAN/LAGG combo creation,
> but later reports were coming in about ahci errors and cam timeouts.
> Some reported the instabilities start with using netmap, but later others
> confirmed the same for high load scenarios without netmap in use.
>
> This does not appear to happen when MSIX is disabled, e.g.:
>
> # sysctl -a | grep dev.igb | grep msix
> dev.igb.5.iflib.disable_msix: 1
> dev.igb.4.iflib.disable_msix: 1
> dev.igb.3.iflib.disable_msix: 1
> dev.igb.2.iflib.disable_msix: 1
> dev.igb.1.iflib.disable_msix: 1
> dev.igb.0.iflib.disable_msix: 1
>
> What's also being linked to this is some form of softraid misbehaving
> and the general tendency for cheaper hardware with particular igb
> chipsets.

Hmm, there is so much that /could/ be going on that it's not easy to
pinpoint anything yet.  If nothing jumps out after getting more data,
it may be worth mitigating in your build by disabling MSI-X as above
and retrying once you have updated to FreeBSD 13.
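
As a rough sketch of that mitigation: the sysctl names in the output
quoted above suggest the knob is per device, and since it concerns how
interrupts are allocated at attach time it would presumably need to be
set as a loader tunable rather than at runtime.  The helper name and
the count of six ports are assumptions taken from that listing, not
from any actual OPNsense build.

```shell
#!/bin/sh
# Hypothetical helper: print loader.conf lines that disable MSI-X on
# igb0 .. igb(N-1).  The tunable name is copied from the sysctl output
# quoted above; the count is whatever the box actually has.
gen_disable_msix() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'dev.igb.%d.iflib.disable_msix="1"\n' "$i"
        i=$((i + 1))
    done
}

# Example: six ports, as in the report.  Review the output before
# appending it to /boot/loader.conf, then reboot.
gen_disable_msix 6
```

Generating the lines rather than hard-coding them keeps the list in
step with the number of ports on a given board.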

> > I haven't heard of this issue elsewhere and cannot replicate it on my
> > I210s running main.  I've gone over the code changes line by line
> > several times and verified all the logic and register writes and it
> > all looks correct to my understanding.  The only hypothesis I have at
> > the moment is it may be some subtle timing issue since VLAN changes
> > unnecessarily restart the interface on e1000 until I push in a work in
> > progress to stop doing that.
>
> I also have no way of reproducing this locally, but the community is
> probably willing to give any kernel change a try that would address
> the problem without having to back out the commit in question.

I need some more info before making any changes.  A full dmesg of the
older working version and a (partial?) dmesg of the broken would be
another useful data point to start out with; let's see if there is
something going on during MSI-X vector allocation etc.
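
For a quick side-by-side before reading the full logs, something like
the following could narrow a saved dmesg down to the igb and MSI
related lines.  This is only a sketch: the function name and file
paths are hypothetical, and the full dmesg is still what is being
asked for.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines that start with an igbN: prefix
# or mention MSI/MSI-X anywhere, from a saved dmesg file.
filter_igb_msix() {
    grep -E -i '^igb[0-9]+:|msi' "$1"
}

# On each kernel (working and broken), roughly:
#   dmesg > /tmp/dmesg.$(uname -r).txt
#   filter_igb_msix /tmp/dmesg.$(uname -r).txt
#   vmstat -i | grep igb    # interrupt counts per igb vector
```

Diffing the filtered output of the two kernels should show quickly
whether the number or type of interrupt vectors differs.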

> > I'd like to see the output of all the processes or at least the
> > process configuring the VLANs to see where it is stuck.  Franco, do
> > you have the ability to 'control+t' there or otherwise set up a break
> > into a debugger?  Stacktraces would be a great start but a core and a
> > kernel may be necessary if it isn't obvious.
>
> Let me see if I can deliver on this easily.
>
>
> Cheers,
> Franco
>