Re: epair and vnet jail loose connection.
Date: Mon, 14 Mar 2022 15:09:49 UTC
On 14 Mar 2022, at 7:44, Michael Gmelin wrote:
> On Sun, 13 Mar 2022 17:53:44 +0000
> "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net> wrote:
>
>> On 13 Mar 2022, at 17:45, Michael Gmelin wrote:
>>
>>>> On 13. Mar 2022, at 18:16, Bjoern A. Zeeb
>>>> <bzeeb-lists@lists.zabbadoz.net> wrote:
>>>>
>>>> On 13 Mar 2022, at 16:33, Michael Gmelin wrote:
>>>>> It's important to point out that this only happens with
>>>>> kern.ncpu>1. With kern.ncpu==1 nothing gets stuck.
>>>>>
>>>>> This perfectly fits into the picture, since, as pointed out by
>>>>> Johan,
>>>>> the first commit that is affected[0] is about multicore support.
>>>>
>>>> Ignore my ignorance, what is the default of net.isr.maxthreads and
>>>> net.isr.bindthreads (in stable/13) these days?
>>>>
>>>
>>> My tests were on CURRENT and I’m afk, but according to cgit[0][1],
>>> max is 1 and bind is 0.
>>>
>>> Would it make sense to repeat the test with max=-1?
>>
>> I’d say yes, I’d also bind, but that’s just me.
>>
>> I would almost assume Kristof running with -1 by default (but he can
>> chime in on that).
>
> I tried various configuration permutations, all with ncpu=2:
>
> - 14.0-CURRENT #0 main-n253697-f1d450ddee6
> - 13.1-BETA1 #0 releng/13.1-n249974-ad329796bdb
> - net.isr.maxthreads: -1 (which results in 2 threads), 1, 2
> - net.isr.bindthreads: -1, 0, 1, 2
> - net.isr.dispatch: direct, deferred
>
> All resulting in the same behavior (hang after a few seconds). They
> all
> work ok when running on a single core instance (threads=1 in this
> case).
>
> I also ran the same test on 13.0-RELEASE-p7 for
> comparison (unsurprisingly, it's ok).
>
> I placed the script to reproduce the issue on freefall for your
> convenience, so running it is as simple as:
>
> fetch https://people.freebsd.org/~grembo/hang_epair.sh
> # inspect content
> sh hang_epair.sh
>
> or, if you feel lucky
>
> fetch -o - https://people.freebsd.org/~grembo/hang_epair.sh | sh
>
With that script I can also reproduce the problem.
I’ve experimented with this hack:
diff --git a/sys/net/if_epair.c b/sys/net/if_epair.c
index c39434b31b9f..1e6bb07ccc4e 100644
--- a/sys/net/if_epair.c
+++ b/sys/net/if_epair.c
@@ -415,7 +415,10 @@ epair_ioctl(struct ifnet *ifp, u_long cmd, caddr_t
data)
case SIOCSIFMEDIA:
case SIOCGIFMEDIA:
+ printf("KP: %s() SIOCGIFMEDIA\n", __func__);
sc = ifp->if_softc;
+ taskqueue_enqueue(epair_tasks.tq[0],
&sc->queues[0].tx_task);
+
error = ifmedia_ioctl(ifp, ifr, &sc->media, cmd);
break;
That kicks the receive code whenever I `ifconfig epair0a`, and I see a
little more traffic every time I do so.
That suggests pretty strongly that there’s an issue with how we
dispatch work to the handler thread. So presumably there’s a race
between epair_menq() and epair_tx_start_deferred().
epair_menq() tries to only enqueue the receive work if there’s nothing
in the buf_ring, on the grounds that if there is the previous packet
scheduled the work. Clearly there’s an issue there.
I’ll try to dig into that in the next few days.
Kristof