epair failure in production on 11.1-STABLE (r328930) ? weird!

Mon Jul 2 21:11:40 UTC 2018

We’re experiencing a strange failure in production with epair (which we’re using to connect our VIMAGE jails).

FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb  6 16:05:59 GMT 2018     root at s5:/usr/obj/usr/src/sys/TRUESPEED  amd64

It looks like epair has suddenly stopped forwarding packets between the pair interfaces. Our server has been up for 82 days and it’s been working fine, but packets have now stopped being forwarded between epairs across the entire system. (We’ve got around 30 epairs on the host.) The result is a sudden ARP resolution failure which is affecting all services. :(

Here’s the test. On a working machine this works fine:

	# Create an epair and put an IP address on it, so we can generate ARP traffic with ping.
	root at magnesium:/usr/home/systems # ifconfig epair create
	epair7a
	root at magnesium:/usr/home/systems # ifconfig epair7a up
	root at magnesium:/usr/home/systems # ifconfig epair7b up
	root at magnesium:/usr/home/systems # ifconfig epair7a inet 10.140.0.1/30

	# Generate ARP traffic over the epair… should see arp requests on epair7b.
	root at magnesium:/usr/home/systems # ping 10.140.0.2
	PING 10.140.0.2 (10.140.0.2): 56 data bytes

	# Watch traffic coming in from the epair
	root at magnesium:/usr/home/systems # tcpdump -i epair7b
	10:22:27.446651 ARP, Request who-has 10.140.0.2 tell 10.140.0.1, length 28
	10:22:28.475086 ARP, Request who-has 10.140.0.2 tell 10.140.0.1, length 28
	^C
	2 packets captured
	2 packets received by filter
	0 packets dropped by kernel

Works fine.

However, on the failing machine no packets are forwarded any more (remember, it had been working fine for a few months and then suddenly stopped :( ).

	root at s5:/usr/home/systems # ifconfig epair create
	epair19a
	root at s5:/usr/home/systems # ifconfig epair19a up
	root at s5:/usr/home/systems # ifconfig epair19b up
	root at s5:/usr/home/systems # ifconfig epair19a inet 10.140.0.1/30

	root at s5:/usr/home/systems # ping 10.140.0.2
	PING 10.140.0.2 (10.140.0.2): 56 data bytes

	root at s5:/usr/home/systems # tcpdump -ni epair19a
	09:24:20.396384 ARP, Request who-has 10.130.0.2 tell 10.130.0.1, length 28
	09:24:21.404737 ARP, Request who-has 10.130.0.2 tell 10.130.0.1, length 28
	^C 

	root at s5:/usr/home/systems # tcpdump -ni epair19b
	[Tumble weed - no traffic seen]
	^C
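
For anyone who wants to repeat the comparison on their own box, here is a rough sketch of a helper that counts the ARP requests in saved tcpdump text output, so the "a" and "b" sides can be compared as numbers. The function name count_arp_requests and the capture file names in the comments are made up for illustration; nothing here is specific to epair.

```shell
#!/bin/sh
# Count the "ARP, Request" lines in tcpdump text output read from stdin.
# Prints 0 (instead of failing) when no ARP requests were captured.
count_arp_requests() {
	grep -c 'ARP, Request' || true
}

# Example with hypothetical capture files, one per side of the pair:
#   tcpdump -ni epair19a > a.txt    (stop with ^C after a few seconds)
#   tcpdump -ni epair19b > b.txt
#   count_arp_requests < a.txt
#   count_arp_requests < b.txt
```

On a healthy pair the two counts should be roughly equal; on the failing machine the "b" side would be expected to report 0.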

Has anyone seen this before? We’re going to reboot and see if that fixes the problem.

The failing kernel in question is:

FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb  6 16:05:59 GMT 2018     root at s5:/usr/obj/usr/src/sys/TRUESPEED  amd64

Break break. We’ve just seen Bugzilla report 22710, which reports that epair fails when the queue limit (net.link.epair.netisr_maxqlen) is hit. We’ve just introduced a high-bandwidth service on this machine, so that is probably what caused the issue.

We’ve currently got a value of:

	net.link.epair.netisr_maxqlen: 2100

	root at s5:/usr/home/systems # netstat -Q
	Configuration:
	Setting                        Current        Limit
	Thread count                         1            1
	Default queue limit                256        10240
	Dispatch policy                 direct          n/a
	Threads bound to CPUs         disabled          n/a

	Protocols:
	Name   Proto QLimit Policy Dispatch Flags
	ip         1    256   flow  default   ---
	igmp       2    256 source  default   ---
	rtsock     3    256 source  default   ---
	arp        4    256 source  default   ---
	ether      5    256 source   direct   ---
	ip6        6    256   flow  default   ---
	epair      8   2100    cpu  default   CD-

	Workstreams:
	WSID CPU   Name     Len WMark   Disp'd  HDisp'd   QDrops   Queued  Handled
	   0   0   ip         0   253 385468689        0        0 49360754 434829441
	   0   0   igmp       0     0        0        0        0        0        0
	   0   0   rtsock     0     5        0        0        0     1144     1144
	   0   0   arp        0     0  5573045        0        0        0  5573045
	   0   0   ether      0     0 1125223166        0        0        0 1125223166
	   0   0   ip6        0     4       90        0        0  1220274  1220364
	   0   0   epair      0  2100        0        0      214 4994675481 4994675481

But we can’t see how much of the queue is currently being used, or what size we need to set it to.
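
The closest thing to a usage figure may be the Len and WMark columns in the Workstreams table. As a rough sketch, and assuming Len is the current queue length and WMark the highest length seen so far (that is my reading of the columns, not something confirmed here), the following pulls out one protocol's numbers next to its QLimit. The function name netisr_queue_summary is made up for illustration.

```shell
#!/bin/sh
# Summarise one protocol's netisr queue usage from `netstat -Q` text on stdin.
# Usage: netstat -Q | netisr_queue_summary epair
netisr_queue_summary() {
	awk -v proto="${1:-epair}" '
		# Track which table of the netstat -Q output we are in.
		$1 == "Protocols:"   { section = "proto"; next }
		$1 == "Workstreams:" { section = "ws"; next }
		# Protocols table: Name Proto QLimit Policy Dispatch Flags
		section == "proto" && $1 == proto { qlimit = $3 }
		# Workstreams table: WSID CPU Name Len WMark Dispd HDispd QDrops ...
		section == "ws" && $3 == proto {
			if ($5 + 0 > wmark) wmark = $5 + 0   # highest watermark across workstreams
			len += $4                            # total currently queued
			qdrops += $8                         # total drops
		}
		END {
			if (qlimit == "") { print proto ": not found"; exit 1 }
			printf "%s: qlimit=%d len=%d wmark=%d qdrops=%d\n", proto, qlimit, len, wmark, qdrops
		}
	'
}
```

Fed the output above, this reports wmark equal to qlimit (2100) with 214 drops. If the column reading is right, that would mean the queue has been completely full at least once.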

But why has hitting the queue limit broken it entirely?

Help!

Cheers,
Joe
— 
Dr Josef Karthauser
Chief Technical Officer
(01225) 300371 / (07703) 596893
www.truespeed.com <http://www.truespeed.com/>