Re: epair and vnet jail loose connection.

From: Johan Hendriks <joh.hendriks_at_gmail.com>
Date: Thu, 10 Mar 2022 14:31:33 UTC
On 10/03/2022 13:37, Wolfgang Zenker wrote:
> Hi Kristof,
>
> Am Thu, Mar 10, 2022 at 12:44:00PM +0100 schrieb Kristof Provost:
>> On 10 Mar 2022, at 10:13, Johan Hendriks wrote:
>>> On 10/03/2022 08:54, Patrick M. Hausen wrote:
>>>> Hi Johan,
>>>>
>>>> we are experiencing the same on 13.1-PRERELEASE. We are currently trying to collect some evidence
>>>> (dtrace) to send to Kristof Provost, who was kind enough to assist. We are hit by the problem
>>>> in production at 12-24 hour intervals. We have not done any artificial load tests yet.
>>>>
>>>> May I ask you to run this dtrace script while at least one jail is disconnected and while
>>>> traffic is present that is trying to reach the jail? If you can afford to do that in production (?)
>>>> that would be great. Forward to Kristof (kp@), please.
>>>>
>>>> Thanks and kind regards
>>>> Patrick
>>>> ----------
>>>> #!/usr/sbin/dtrace -s
>>>>
>>>> BEGIN
>>>> {
>>>>      self->in_menq = 0;
>>>> }
>>>>
>>>> fbt:if_epair:epair_menq:entry
>>>> {
>>>>      self->in_menq = 1;
>>>>      printf("In epair_menq");
>>>> }
>>>>
>>>> fbt:if_epair:epair_menq:return
>>>> / self->in_menq == 1 /
>>>> {
>>>>      self->in_menq = 0;
>>>>      printf("Leave epair_menq");
>>>> }
>>>>
>>>> fbt:kernel:taskqueue_enqueue:entry
>>>> / self->in_menq == 1 /
>>>> {
>>>>      printf("Enqueue task");
>>>>
>>>> }
>>>>
>>>> fbt:if_epair:epair_tx_start_deferred:entry
>>>> {
>>>>      printf("epair_tx_start_deferred");
>>>> }
>>>> ----------
>>>>
>>> I was asked the above, so here is the output of that command.
>>> I ran hey -h2 -n 10 -c 10 -z 60s https://wp.test.nl against that machine, and within the 60 seconds the jail became unresponsive. I then ran the dtrace.sh script above like so: /root/bin/dtrace.sh > /root/dtrace_output
>>>
>>> I hope this helps. If you need anything else, please let me know. Root access is also possible if you want, so that you do not have to create a test environment.
>> Were there other epair interfaces running at this time, with active traffic?
>> The dtrace output appears to show that the appropriate callouts (to epair_tx_start_deferred()) are getting through, so I’d expect traffic to be flowing.
> There is a second jail using epair on that system, using the same
> bridge as well. This second jail is a low-traffic system; it is unlikely
> but possible that there was some traffic during that time.
> In all previous cases this second jail continued to be reachable all
> the time.
>
> Wolfgang
>
I am running 13-STABLE from 01-02-2022 and cannot replicate this.
I will step ahead a week, rebuild, and try again.
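
For anyone comparing captures like /root/dtrace_output, here is a minimal sketch of how the four probe messages from the dtrace script quoted above could be tallied. It is not part of the original exchange. The function name count_probe_messages is made up for illustration, and the sketch assumes the capture is plain text with one probe firing per line.

```shell
#!/bin/sh
# Tally the probe messages printed by the dtrace script quoted in this thread.
# Assumptions (not confirmed in the thread): the capture is plain text and
# each probe firing occupies one line containing one of the four printf
# strings. count_probe_messages is a hypothetical helper name.

count_probe_messages() {
    # $1: path to the saved dtrace output
    for msg in "In epair_menq" "Leave epair_menq" "Enqueue task" "epair_tx_start_deferred"; do
        # -c counts matching lines, -F treats the message as a fixed string.
        # grep exits non-zero when the count is 0, hence the "|| true".
        n=$(grep -cF -- "$msg" "$1" || true)
        printf '%s: %s\n' "$msg" "$n"
    done
}

# Only run when a capture file is passed as the first argument.
if [ -n "${1:-}" ]; then
    count_probe_messages "$1"
fi
```

Saved under any name, it would be run with the capture path as its only argument. If a capture showed "Enqueue task" lines but no epair_tx_start_deferred lines, that might suggest the deferred task is not running. That is only a guess at how to read the numbers, not something established in this thread.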