Re: epair and vnet jail lose connection.

From: Johan Hendriks <joh.hendriks_at_gmail.com>
Date: Mon, 14 Mar 2022 15:38:16 UTC
On 14/03/2022 16:09, Kristof Provost wrote:
>
> On 14 Mar 2022, at 7:44, Michael Gmelin wrote:
>
>     On Sun, 13 Mar 2022 17:53:44 +0000
>     "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net> wrote:
>
>         On 13 Mar 2022, at 17:45, Michael Gmelin wrote:
>
>                 On 13. Mar 2022, at 18:16, Bjoern A. Zeeb
>                 <bzeeb-lists@lists.zabbadoz.net> wrote:
>
>                 On 13 Mar 2022, at 16:33, Michael Gmelin wrote:
>
>                     It's important to point out that this only
>                     happens with kern.ncpu>1. With kern.ncpu==1
>                     nothing gets stuck.
>
>                     This fits the picture perfectly since, as Johan
>                     pointed out, the first affected commit[0] is
>                     about multicore support.
>
>                 Ignore my ignorance, what is the default of
>                 net.isr.maxthreads and
>                 net.isr.bindthreads (in stable/13) these days?
>
>             My tests were on CURRENT and I’m afk, but according to
>             cgit[0][1],
>             max is 1 and bind is 0.
>
>             Would it make sense to repeat the test with max=-1?
>
>         I’d say yes, I’d also bind, but that’s just me.
>
>         I would almost assume Kristof is running with -1 by default
>         (but he can chime in on that).
>
>     I tried various configuration permutations, all with ncpu=2:
>
>     - 14.0-CURRENT #0 main-n253697-f1d450ddee6
>     - 13.1-BETA1 #0 releng/13.1-n249974-ad329796bdb
>     - net.isr.maxthreads: -1 (which results in 2 threads), 1, 2
>     - net.isr.bindthreads: -1, 0, 1, 2
>     - net.isr.dispatch: direct, deferred
>
>     All of them result in the same behavior (a hang after a few
>     seconds). They all work fine when running on a single-core
>     instance (threads=1 in this case).
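>
>     For reference, net.isr.maxthreads and net.isr.bindthreads are
>     boot-time tunables (read-only sysctls), so switching them between
>     runs means setting them in /boot/loader.conf, e.g.:
>
>     net.isr.maxthreads="-1"
>     net.isr.bindthreads="1"
>
>     net.isr.dispatch, on the other hand, can be changed at runtime
>     with sysctl. The values above are just an example, not a
>     recommendation.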
>
>     I also ran the same test on 13.0-RELEASE-p7 for
>     comparison (unsurprisingly, it's ok).
>
>     I placed the script to reproduce the issue on freefall for your
>     convenience, so running it is as simple as:
>
>     fetch https://people.freebsd.org/~grembo/hang_epair.sh
>     # inspect content
>     sh hang_epair.sh
>
>     or, if you feel lucky
>
>     fetch -o - https://people.freebsd.org/~grembo/hang_epair.sh | sh
>
>
> With that script I can also reproduce the problem.
>
> I’ve experimented with this hack:
>
> diff --git a/sys/net/if_epair.c b/sys/net/if_epair.c
> index c39434b31b9f..1e6bb07ccc4e 100644
> --- a/sys/net/if_epair.c
> +++ b/sys/net/if_epair.c
> @@ -415,7 +415,10 @@ epair_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data)
>  	case SIOCSIFMEDIA:
>  	case SIOCGIFMEDIA:
> +		printf("KP: %s() SIOCGIFMEDIA\n", __func__);
>  		sc = ifp->if_softc;
> +		taskqueue_enqueue(epair_tasks.tq[0], &sc->queues[0].tx_task);
> +
>  		error = ifmedia_ioctl(ifp, ifr, &sc->media, cmd);
>  		break;
>
> That kicks the receive code whenever I run ifconfig epair0a, and I
> see a little more traffic every time I do so.
> That suggests pretty strongly that there’s an issue with how we 
> dispatch work to the handler thread. So presumably there’s a race 
> between epair_menq() and epair_tx_start_deferred().
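>
> To make that concrete, the handler side boils down to a plain
> buf_ring consumer, roughly like this (a simplified sketch with
> approximated names and types, not the actual driver code):
>
>     #include <sys/param.h>
>     #include <sys/errno.h>
>     #include <sys/mbuf.h>
>     #include <sys/buf_ring.h>
>     #include <sys/taskqueue.h>
>
>     /* Hypothetical slice of the per-queue state in if_epair.c. */
>     struct epair_queue_sketch {
>             struct buf_ring *ring;
>             struct taskqueue *tq;
>             struct task      tx_task;
>     };
>
>     /*
>      * Consumer, approximating epair_tx_start_deferred(): drain
>      * everything visible, then return and wait for the next kick.
>      * If a producer enqueues an mbuf after the final dequeue and
>      * never kicks the task again, that mbuf is stranded.
>      */
>     static void
>     epair_drain_sketch(void *arg, int pending __unused)
>     {
>             struct epair_queue_sketch *q = arg;
>             struct mbuf *m;
>
>             while ((m = buf_ring_dequeue_sc(q->ring)) != NULL)
>                     m_freem(m);     /* stand-in for delivery to the peer */
>     }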
>
> epair_menq() tries to only enqueue the receive work if there’s nothing
> in the buf_ring, on the grounds that, if there is, the previous packet
> already scheduled the work. Clearly there’s an issue there.
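>
> In that producer, reusing the sketch types above, the window would be
> roughly here (again an approximation, not the actual code):
>
>     /*
>      * Producer, approximating the epair_menq() logic described
>      * above: only kick the task when the ring was empty.
>      */
>     static int
>     epair_menq_sketch(struct epair_queue_sketch *q, struct mbuf *m)
>     {
>             bool was_empty;
>
>             was_empty = (buf_ring_empty(q->ring) != 0);
>             if (buf_ring_enqueue(q->ring, m) != 0)
>                     return (ENOBUFS);
>             /*
>              * Lost-wakeup window: if the consumer drained the ring
>              * and returned between the emptiness check and the
>              * enqueue, was_empty is false, the kick is skipped, and
>              * m sits in the ring until something else (like the
>              * SIOCGIFMEDIA hack above) kicks the task.
>              */
>             if (was_empty)
>                     taskqueue_enqueue(q->tq, &q->tx_task);
>             return (0);
>     }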
>
> I’ll try to dig into that in the next few days.
>
> Kristof
>
I am willing to try patches, so if you have them I can try them on
14-HEAD, 13-STABLE and 13.1-PRERELEASE.