[vnet] [epair] epair interface stops working after some time

Kristof Provost kristof at sigsegv.be
Tue Mar 27 18:34:18 UTC 2018



On 27 Mar 2018, at 16:48, Bjoern A. Zeeb wrote:

> On 27 Mar 2018, at 14:40, Kristof Provost wrote:
>
>> (Re-cc freebsd-net, because this is useful information)
>>
>> On 27 Mar 2018, at 13:07, Reshad Patuck wrote:
>>> The epair crash occurred again today running the epair module code 
>>> with the added dtrace sdt providers.
>>>
>>> Running the same command as last time, 'dtrace -n ::epair\*:' 
>>> returns the following:
>>> ```
>>> CPU     ID                    FUNCTION:NAME
>>>   0  66499   epair_transmit_locked:enqueued
>>> ```
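
(For context, the added probe points presumably look something like the sketch below; the provider layout and argument list are my guesses, since the actual patch wasn’t posted to the list.)

```c
/*
 * Guessed shape of the added SDT probe points in if_epair.c; the
 * function:name pair matches the dtrace output above, everything
 * else is an assumption.
 */
#include <sys/param.h>
#include <sys/sdt.h>

SDT_PROVIDER_DEFINE(epair);
SDT_PROBE_DEFINE3(epair, , epair_transmit_locked, enqueued,
    "struct ifnet *", "struct mbuf *", "int");

/* Fired in epair_transmit_locked() right after IFQ_ENQUEUE(): */
/* SDT_PROBE3(epair, , epair_transmit_locked, enqueued, ifp, m, error); */
```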
>>
>>> Looks like it's filled up a queue somewhere and is dropping 
>>> connections after that.
>>>
>>> The value of 'error' is 55. I can see both the ifp and m structs 
>>> but don't know what to look for in them.
>>>
>> That’s useful. Error 55 is ENOBUFS, which in IFQ_ENQUEUE() means 
>> we’re hitting _IF_QFULL().
>> There don’t seem to be counters for that drop though, so that makes 
>> it hard to diagnose without these extra probe points.
>> It also explains why you don’t really see any drop counters 
>> incrementing.
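
For reference, the non-ALTQ branch of IFQ_ENQUEUE() boils down to roughly the following (paraphrased from memory, locking elided):

```c
/* Paraphrase of IFQ_ENQUEUE(ifq, m, err), non-ALTQ path, no locking. */
if (_IF_QFULL(ifq)) {		/* queue length has hit ifq_maxlen */
	m_freem(m);		/* the packet is dropped on the floor... */
	err = ENOBUFS;		/* ...and 55 is what the caller sees */
} else {
	_IF_ENQUEUE(ifq, m);
	err = 0;
}
```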
>>
>> The fact that this queue is full presumably means that the other side 
>> is not reading packets off it any more.
>> That’s supposed to happen in epair_start_locked() (look for the 
>> IFQ_DEQUEUE() calls).
>>
>> It’s not at all clear to me how, but it looks like the receive side 
>> is not doing its work.
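
Schematically the drain side looks like this (heavily simplified; counters, BPF taps and refcounting omitted):

```c
/* Heavily simplified shape of epair_start_locked(): drain the local
 * send queue and push each packet to netisr for the peer interface. */
static void
epair_start_locked(struct ifnet *ifp)
{
	struct mbuf *m;

	for (;;) {
		IFQ_DEQUEUE(&ifp->if_snd, m);
		if (m == NULL)
			break;
		/* netisr_queue() consumes the mbuf either way; if it
		 * fails we stop, and something must guarantee this
		 * drain runs again later or the queue stays full. */
		if (netisr_queue(NETISR_EPAIR, m) != 0)
			break;
	}
}
```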
>>
>> It looks like the IFQ code is already a fallback for when the netisr 
>> queue is full.
>> That code might be broken, or there might be a different issue that 
>> means you’ll always end up in the same situation, regardless of 
>> queue size.
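
In other words, the transmit path is structured roughly like this; this is my reading of it, not the literal code:

```c
/*
 * My reading of the transmit path, schematic only: packets normally
 * bypass if_snd and go straight to netisr.  Once netisr reports its
 * queue full, an "overflow" flag (IFF_DRV_OACTIVE in the real code)
 * diverts later packets onto if_snd, and a later drain is supposed
 * to clear the flag and empty if_snd again.
 */
if ((ifp->if_drv_flags & IFF_DRV_OACTIVE) == 0) {
	error = netisr_queue(NETISR_EPAIR, m);	/* 0 on success */
	if (error != 0)		/* netisr full; the mbuf was freed for us */
		ifp->if_drv_flags |= IFF_DRV_OACTIVE;
} else {
	IFQ_ENQUEUE(&ifp->if_snd, m, error);	/* probe saw error == 55 here */
}
```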
>>
>> It’s probably worth playing with 
>> ‘net.link.epair.netisr_maxqlen’. I’d recommend *lowering* it, to see 
>> if the problem happens more frequently that way. If it does, that’ll 
>> be helpful in reproducing and trying to fix this. If it doesn’t, the 
>> full queue is probably a consequence rather than a cause/trigger.
>> (Of course, once you’ve confirmed that lowering netisr_maxqlen 
>> makes the problem more frequent, go ahead and increase it again.)
>
> netstat -Q will be useful

Reshad included that in his e-mail to me:

> On the system with the bug 'netstat -Q' seems to have queue drops for 
> epair.
> ```
> # netstat -Q
> Configuration:
> Setting                        Current   Limit
> Thread count                         1       1
> Default queue limit                256   10240
> Dispatch policy                 direct     n/a
> Threads bound to CPUs         disabled     n/a
>
> Protocols:
> Name   Proto QLimit Policy Dispatch Flags
> ip         1    256   flow  default   ---
> igmp       2    256 source  default   ---
> rtsock     3    256 source  default   ---
> arp        4    256 source  default   ---
> ether      5    256 source   direct   ---
> ip6        6    256   flow  default   ---
> epair      8   2100    cpu  default   CD-
>
> Workstreams:
> WSID CPU   Name     Len WMark    Disp'd HDisp'd  QDrops    Queued   Handled
>    0   0   ip         0    30  11150458       0       0  13092275  24242558
>    0   0   igmp       0     0         0       0       0         0         0
>    0   0   rtsock     0     1         0       0       0        42        42
>    0   0   arp        0     0  56380919       0       0         0  56380919
>    0   0   ether      0     0 108761357       0       0         0 108761357
>    0   0   ip6        0    10  34999359       0       0   4091259  39090613
>    0   0   epair      0  2100         0       0  210972 303785724 303785724
> ```
>
> I also noticed that the values for 'epair' in the 'Workstreams' 
> section, including the drop counter, do not change, while all the 
> others keep increasing over time.
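
The frozen QDrops number is counted inside netisr itself; paraphrasing the per-protocol workstream enqueue from sys/net/netisr.c (simplified, from memory):

```c
/*
 * Paraphrased (not verbatim) from netisr's workstream enqueue: when a
 * protocol's per-CPU queue is at its limit, the mbuf is freed and
 * nw_qdrops -- the QDrops column above -- is incremented.
 */
if (npwp->nw_len < npwp->nw_qlimit) {
	/* append m to the workstream, bump nw_len and nw_queued */
	return (0);
}
m_freem(m);
npwp->nw_qdrops++;
return (ENOBUFS);
```

So once nothing feeds the epair workstream any more, none of its counters can move, which matches what Reshad sees.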

I think I’ve triggered this problem by setting 
net.link.epair.netisr_maxqlen to an absurdly low value (2 in my case).
It looks like there’s an issue with the handling of an overflow of 
the “hardware” queue, but I don’t really understand that code.
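
For reference, as far as I can tell that sysctl simply controls the queue limit the epair netisr handler is registered with, which is why an absurdly low value makes netisr_queue() push back almost immediately. Roughly (handler and callback names are assumed; details from memory):

```c
/* Sketch of how net.link.epair.netisr_maxqlen reaches netisr;
 * simplified from if_epair.c, details from memory. */
SYSCTL_DECL(_net_link_epair);		/* node set up elsewhere in the driver */

static int epair_netisr_maxqlen = 2100;	/* 2100 is what netstat -Q showed */
SYSCTL_INT(_net_link_epair, OID_AUTO, netisr_maxqlen, CTLFLAG_RW,
    &epair_netisr_maxqlen, 0, "Maximum netisr queue length for epair");

static struct netisr_handler epair_nh = {
	.nh_name	= "epair",
	.nh_proto	= NETISR_EPAIR,
	.nh_policy	= NETISR_POLICY_CPU,	/* the "cpu" policy in netstat -Q */
	.nh_handler	= epair_nh_sintr,	/* name assumed */
	.nh_m2cpuid	= epair_nh_m2cpuid,	/* the "C" flag; name assumed */
	.nh_drainedcpu	= epair_nh_drainedcpu,	/* the "D" flag; name assumed */
};

/* At registration time (runtime changes would go through
 * netisr_setqlimit() instead): */
epair_nh.nh_qlimit = epair_netisr_maxqlen;
netisr_register(&epair_nh);
```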

Regards,
Kristof

