Re: igb(4) and VLAN issue?

From: Kevin Bowling <kevin.bowling_at_kev009.com>
Date: Thu, 23 Sep 2021 22:46:37 UTC

Franco,

I think I found it: https://reviews.freebsd.org/D32087

Regards,
Kevin

On Tue, Aug 3, 2021 at 8:50 AM Kevin Bowling <kevin.bowling@kev009.com> wrote:
>
> On Tue, Aug 3, 2021 at 8:27 AM Franco Fichtner <franco@lastsummer.de> wrote:
> >
> > Hi Kevin,
> >
> > [RESENT TO MAILING LIST AS SUBSCRIBER]
> >
> > > On 2. Aug 2021, at 7:51 PM, Kevin Bowling <kevin.bowling@kev009.com> wrote:
> > >
> > > I caught wind that an igb(4) commit I've done to main and that has
> > > been in stable/12 for a few months seems to be causing a regression on
> > > opnsense.  The commit in question is
> > > https://cgit.freebsd.org/src/commit/?id=eea55de7b10808b86277d7fdbed2d05d3c6db1b2
> > >
> > > The report is at:
> > > https://forum.opnsense.org/index.php?topic=23867.0
> >
> > Looks like I spoke too soon earlier.  This is a weird one for sure.  :)
> >
> > So first of all this causes an ifconfig hang for VLAN/LAGG combo creation,
> > but later reports were coming in about ahci errors and cam timeouts.
> > Some reported the instabilities start with using netmap, but later others
> > confirmed the same for high load scenarios without netmap in use.
> >
> > This does not appear to happen when MSIX is disabled, e.g.:
> >
> > # sysctl -a | grep dev.igb | grep msix
> > dev.igb.5.iflib.disable_msix: 1
> > dev.igb.4.iflib.disable_msix: 1
> > dev.igb.3.iflib.disable_msix: 1
> > dev.igb.2.iflib.disable_msix: 1
> > dev.igb.1.iflib.disable_msix: 1
> > dev.igb.0.iflib.disable_msix: 1
> >
> > What's also being linked to this is some form of softraid misbehaving
> > and a general tendency to affect cheaper hardware with particular igb
> > chipsets.
>
> Hmm, there is so much that /could/ be going on it's not easy to
> pinpoint anything yet.  If nothing jumps out after getting more data
> it may be worth mitigating in your build that way and retrying once
> you have updated to FreeBSD 13.
>
> > > I haven't heard of this issue elsewhere and cannot replicate it on my
> > > I210s running main.  I've gone over the code changes line by line
> > > several times and verified all the logic and register writes and it
> > > all looks correct to my understanding.  The only hypothesis I have at
> > > the moment is it may be some subtle timing issue since VLAN changes
> > > unnecessarily restart the interface on e1000 until I push in a work in
> > > progress to stop doing that.
> >
> > I also have no way of reproducing this locally, but the community is
> > probably willing to give any kernel change a try that would address
> > the problem without having to back out the commit in question.
>
> I need some more info before making any changes.  A full dmesg of the
> older working version and a (partial?) dmesg of the broken would be
> another useful data point to start out with.  Let's see if there is
> something going on during MSI-X vector allocation, etc.
>
> > > I'd like to see the output of all the processes or at least the
> > > process configuring the VLANs to see where it is stuck.  Franco, do
> > > you have the ability to 'control+t' there or otherwise set up a break
> > > into a debugger?  Stacktraces would be a great start but a core and a
> > > kernel may be necessary if it isn't obvious.
> >
> > Let me see if I can deliver on this easily.
> >
> >
> > Cheers,
> > Franco
> >