Re: epair and vnet jail loose connection.

From: Michael Gmelin <grembo_at_freebsd.org>
Date: Tue, 15 Mar 2022 17:48:34 UTC

On Tue, 15 Mar 2022 10:30:41 -0600
Kristof Provost <kp@FreeBSD.org> wrote:

> On 14 Mar 2022, at 18:02, Michael Gmelin wrote:
> > On Mon, 14 Mar 2022 09:09:49 -0600
> > Kristof Provost <kp@FreeBSD.org> wrote:
> >  
> >> On 14 Mar 2022, at 7:44, Michael Gmelin wrote:  
> >>> On Sun, 13 Mar 2022 17:53:44 +0000
> >>> "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net> wrote:
> >>>  
> >>>> On 13 Mar 2022, at 17:45, Michael Gmelin wrote:
> >>>>  
> >>>>>> On 13. Mar 2022, at 18:16, Bjoern A. Zeeb
> >>>>>> <bzeeb-lists@lists.zabbadoz.net> wrote:
> >>>>>>
> >>>>>> On 13 Mar 2022, at 16:33, Michael Gmelin wrote:  
> >>>>>>> It's important to point out that this only happens with  
> >>>>>>> kern.ncpu>1. With kern.ncpu==1 nothing gets stuck.  
> >>>>>>>
> >>>>>>> This perfectly fits into the picture, since, as pointed out by
> >>>>>>> Johan,
> >>>>>>> the first commit that is affected[0] is about multicore
> >>>>>>> support.  
> >>>>>>
> >>>>>> Ignore my ignorance, what is the default of net.isr.maxthreads
> >>>>>> and net.isr.bindthreads (in stable/13) these days?
> >>>>>>  
> >>>>>
> >>>>> My tests were on CURRENT and I’m afk, but according to
> >>>>> cgit[0][1], max is 1 and bind is 0.
> >>>>>
> >>>>> Would it make sense to repeat the test with max=-1?  
> >>>>
> >>>> I’d say yes, I’d also bind, but that’s just me.
> >>>>
> >>>> I would almost assume Kristof is running with -1 by default
> >>>> (but he can chime in on that).  
> >>>
> >>> I tried various configuration permutations, all with ncpu=2:
> >>>
> >>> - 14.0-CURRENT #0 main-n253697-f1d450ddee6
> >>> - 13.1-BETA1 #0 releng/13.1-n249974-ad329796bdb
> >>> - net.isr.maxthreads: -1 (which results in 2 threads), 1, 2
> >>> - net.isr.bindthreads: -1, 0, 1, 2
> >>> - net.isr.dispatch: direct, deferred
> >>>
> >>> All resulting in the same behavior (hang after a few seconds).
> >>> They all work ok when running on a single-core instance
> >>> (threads=1 in this case).
> >>>
> >>> I also ran the same test on 13.0-RELEASE-p7 for
> >>> comparison (unsurprisingly, it's ok).
> >>>
> >>> I placed the script to reproduce the issue on freefall for your
> >>> convenience, so running it is as simple as:
> >>>
> >>>     fetch https://people.freebsd.org/~grembo/hang_epair.sh
> >>>     # inspect content
> >>>     sh hang_epair.sh
> >>>
> >>> or, if you feel lucky
> >>>
> >>>     fetch -o - https://people.freebsd.org/~grembo/hang_epair.sh | sh
> >> With that script I can also reproduce the problem.
> >>
> >> I’ve experimented with this hack:
> >>
> >> 	diff --git a/sys/net/if_epair.c b/sys/net/if_epair.c
> >> 	index c39434b31b9f..1e6bb07ccc4e 100644
> >> 	--- a/sys/net/if_epair.c
> >> 	+++ b/sys/net/if_epair.c
> >> 	@@ -415,7 +415,10 @@ epair_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data)
> >> 	
> >> 	        case SIOCSIFMEDIA:
> >> 	        case SIOCGIFMEDIA:
> >> 	+               printf("KP: %s() SIOCGIFMEDIA\n", __func__);
> >> 	                sc = ifp->if_softc;
> >> 	+               taskqueue_enqueue(epair_tasks.tq[0], &sc->queues[0].tx_task);
> >> 	+
> >> 	                error = ifmedia_ioctl(ifp, ifr, &sc->media, cmd);
> >> 	                break;
> >>
> >> That kicks the receive code whenever I `ifconfig epair0a`, and I
> >> see a little more traffic every time I do so.
> >> That suggests pretty strongly that there’s an issue with how we
> >> dispatch work to the handler thread. So presumably there’s a race
> >> between epair_menq() and epair_tx_start_deferred().
> >>
> >> epair_menq() tries to only enqueue the receive work if there’s
> >> nothing in the buf_ring, on the grounds that if there is, the
> >> previous packet already scheduled the work. Clearly there’s an
> >> issue there.
> >>
> >> I’ll try to dig into that in the next few days.
> >>  
> >
> > Hi Kristof,
> >
> > This sounds plausible. I spent a few hours getting familiar with the
> > epair code and came up with a patch that seems to fix the issue at
> > hand (both with and without RSS). I'm not certain that it is a good
> > solution, especially in terms of performance, but I wanted to share
> > it with you anyway, maybe it helps:
> > https://people.freebsd.org/~grembo/epair.patch
> >  
> That seems to be working, and at first glance doesn’t look like it’d
> hurt performance too badly.
> 
> Can you write up a commit message and post it on phabricator?
> 

Please see https://reviews.freebsd.org/D34569

Best
Michael

-- 
Michael Gmelin