svn commit: r301198 - head/sys/dev/xen/netfront

Wed May 3 05:13:47 UTC 2017

On 06/02/16 04:16, Roger Pau Monné wrote:
> Author: royger
> Date: Thu Jun  2 11:16:35 2016
> New Revision: 301198
> URL: https://svnweb.freebsd.org/changeset/base/301198

I think this commit is responsible for panics I'm seeing in EC2 on T2 family
instances.  Every time a DHCP request is made, we call into xn_ifinit_locked
(not sure why -- something to do with making the interface promiscuous?) and
hit this code:

> @@ -1760,7 +1715,7 @@ xn_ifinit_locked(struct netfront_info *n
>  		xn_alloc_rx_buffers(rxq);
>  		rxq->ring.sring->rsp_event = rxq->ring.rsp_cons + 1;
>  		if (RING_HAS_UNCONSUMED_RESPONSES(&rxq->ring))
> -			taskqueue_enqueue(rxq->tq, &rxq->intrtask);
> +			xn_rxeof(rxq);
>  		XN_RX_UNLOCK(rxq);
>  	}

but under high traffic volumes I think a separate thread can already be
running in xn_rxeof, having dropped the RX lock while it passes a packet
up the stack.  This would result in two different threads trying to process
the same set of responses from the ring, with (unsurprisingly) bad results.
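
To make the suspected interleaving concrete, here is a toy model in C.  It is
not the netfront code: toy_ring, consumer_pick, consumer_finish and
replay_race are hypothetical names, and the model assumes the consumer
publishes its new rsp_cons only after it has retaken the lock, which may not
match what xn_rxeof really does.  The two "threads" are replayed in a fixed
order on one thread so the outcome is deterministic.

```c
#include <assert.h>

#define RING_SIZE 4

/* Toy stand-in for the shared RX ring state; NOT the real netfront structs. */
struct toy_ring {
	unsigned rsp_cons;              /* next response to consume */
	unsigned rsp_prod;              /* one past the last response produced */
	int      delivered[RING_SIZE];  /* times each response went up the stack */
};

/* Phase 1 (lock held): pick the next response, then "drop the lock". */
static unsigned
consumer_pick(const struct toy_ring *r)
{
	return (r->rsp_cons);
}

/* Phase 2 (lock retaken): delivery done, publish the new consumer index. */
static void
consumer_finish(struct toy_ring *r, unsigned idx)
{
	assert(idx < r->rsp_prod);      /* only produced responses are consumed */
	r->delivered[idx % RING_SIZE]++;
	r->rsp_cons = idx + 1;
}

/*
 * Replay one bad interleaving: A picks a response and drops the lock, B
 * (the xn_ifinit_locked path in this model) runs to completion, then A
 * resumes.  Returns how many times response 0 was delivered.
 */
static int
replay_race(void)
{
	struct toy_ring r = { .rsp_cons = 0, .rsp_prod = 1 };
	unsigned a, b;

	a = consumer_pick(&r);    /* thread A: sees response 0, unlocks */
	b = consumer_pick(&r);    /* thread B: rsp_cons not yet advanced */
	consumer_finish(&r, b);   /* thread B: delivers response 0 */
	consumer_finish(&r, a);   /* thread A: delivers response 0 again */
	return (r.delivered[0]);
}
```

In this model replay_race() returns 2: the single response in the ring is
handed up the stack twice, which is the kind of double processing described
above.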

I'm not 100% sure that this is what's causing the panic, but it's definitely
happening under high traffic conditions immediately after xn_ifinit_locked is
called, so I think my speculation is well-founded.

There are a few things I don't understand here:
1. Why DHCP requests are resulting in calls into xn_ifinit_locked.
2. Why the calls into xn_ifinit_locked are only happening on T2 instances
and not on any of the other EC2 instances I've tried.
3. Why xn_ifinit_locked is consuming ring responses.

So I'm not sure what the solution is, but hopefully someone who knows this
code better will be able to help...

-- 
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid