kern/176446: [netinet] [patch] Concurrency in ixgbe driving out-of-order packet process and spurious RST

Thu Mar 14 15:00:01 UTC 2013

The following reply was made to PR kern/176446; it has been noted by GNATS.

From: John Baldwin <jhb at freebsd.org>
To: "Charbon, Julien" <jcharbon at verisign.com>
Cc: bug-followup at freebsd.org,
 "De La Gueronniere, Marc" <mdelagueronniere at verisign.com>,
 jfv at freebsd.org
Subject: Re: kern/176446: [netinet] [patch] Concurrency in ixgbe driving out-of-order packet process and spurious RST
Date: Thu, 14 Mar 2013 09:34:18 -0400

 On Thursday, March 07, 2013 5:11:25 am Charbon, Julien wrote:
 > On 2/28/13 8:10 PM, Charbon, Julien wrote:
 > > On 2/28/13 4:57 PM, John Baldwin wrote:
 > >> Can you try the fixes from 
 http://svnweb.freebsd.org/base?view=revision&revision=240968?
 > >
 > >    Actually, Marc (I CC'ed him) did find the r240968 fix for concurrency
 > > between ixgbe_msix_que() and ixgbe_handle_que(), and made a backport for
 > > release-8.3.0 (see patch [1] below).  However, the issue was still
 > > reproducible, then Marc found another place for concurrency from
 > > ixgbe_local_timer() and fix it (see patch [2]).  But it was still not
 > > enough, and he found a last place for concurrency due to
 > > ixgbe_rearm_queues() call (see patch [3]).  We all these patches
 > > applied, we were not able to reproduce this issue.
 > 
 >   Just for the record:  As expected this issue is reproducible on 
 > 9.1-RELEASE:
 > 
 > # uname -a
 > FreeBSD atlas 9.1-RELEASE FreeBSD 9.1-RELEASE #1 r247851M: Wed Mar  6 
 > 11:17:43 UTC 2013 
 > jcharbon at atlas:/usr/obj/app/jcharbon/9.1.0/sys/GENERIC  amd64
 > 
 >   Enable TCP debug log:
 > 
 > # sysctl net.inet.tcp.log_debug=1
 > 
 >   Load enough a TCP service and due to ixgbe race conditions between 
 > ixgbe_msix_que() and ixgbe_handle_que(), you will get:
 > 
 > Mar  7 10:01:04 atlas kernel: TCP: [192.168.100.21]:12918 to 
 > [192.168.100.152]:8080; syncache_socket: in_pcbconnect failed with error 48
 > Mar  7 10:01:04 atlas kernel: TCP: [192.168.100.21]:12918 to 
 > [192.168.100.152]:8080 tcpflags 0x10<ACK>; tcp_input: Listen socket: 
 > Socket allocation failed due to limits or memory shortage, sending RST
 > Mar  7 10:01:04 atlas kernel: TCP: [192.168.100.21]:12918 to 
 > [192.168.100.152]:8080 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST 
 > without matching syncache entry (possibly syncookie only), segment ignored
 > 
 >   We will provide our current fix patch for 9.1-RELEASE.

 The place you noticed in 2) is broken, though your fix isn't quite correct.  
 I've been hesitant to reply yet as it requires a long reply.  The short 
 version is that the task to handle rx/tx processing should never be queued by 
 anything other than an interrupt handler or itself (when it reschedules 
 itself).  Anything else that schedules it is going to result in lock 
 contention and out-of-order packet delivery.

 Your 3rd case is also correct.  We should not re-enable interrupts on every 
 timer tick since the rx/tx task might already be running.  Similarly, re-
 enabling all queues anytime one queue processes RX interrupts can trigger an 
 interrupt on another queue while it's rx/tx task is already running.  Both of 
 these are pointless as each queue will rearm itself when the rx/tx task finds 
 no more pending RX packets to process.

 Now, some more details on the 2nd one which is due to watchdog handling which 
 is broken in both igb and ixgbe.  First, some background on how watchdog 
 handling works in nearly all other drivers (and specifically in single-queue 
 drivers):

 First, each device maintains a 'timer' field in the softc which is a count of 
 seconds until the transmit watchdog should expire.  Whenever a packet is 
 queued for transmit in the descriptor ring, it is set to the 'N' seconds (e.g. 
 5).  Whenever the transmit completion interrupt fully drains the descriptor 
 ring such that the ring is idle the timer is set to 0.

 Second, each device runs a periodic stats timer that fires once a second while 
 the interface is "up" (so it is started in the foo_init() routine and stopped 
 in foo_stop()).  Part of this timer's job is to check the transmit watchdog.  
 It uses logic like this to do so:

    if (timer > 0) {
        timer--;
        if (timer == 0) {
            /* watchdog expired */
        }
    }

 The typical implementation for the watchdog expiring is to just reset the chip 
 by doing 'foo_stop()' followed by 'foo_init_locked()'.  However, if you have a 
 NIC whose hardware is known to have a quirk where it can lose interrupts, then 
 a driver can decide to scan the TX ring to see if it makes any progress.  It 
 should do this synchronously from the timer, not by scheduling another task.  
 Also, if you do make progress, then you should reset the watchdog timer if 
 there are still any pending transmits.  In this case I would suggest only 
 setting it to '1' so you check it on the next second.  The psuedo-code for 
 this would look something like:

     if (timer > 0) {
         timer--;
         if (timer == 0) {
             /* Have this return true if it finds any completions */
             if (foo_txintr()) {
                 if (tx ring is not empty)
                     timer = 1;
                 return;
             }
             foo_stop();
             foo_init_locked();
     }

 However, most drivers don't need that sort of complication at all as working 
 hardware shouldn't be regularly failing to schedule interrupts.

 The one wrinkle that multiple queues throw into this is that it is hard to 
 know when your transmit interrupt routine can clear 'timer' to 0 to disable it 
 as you should only do so if all transmit queues are empty.  One easy solution 
 is to simply make the 'timer' field per transmit queue and have your stats 
 timer check each queue.  The remaining question then is when you do actually
 reset the chip.  I think you should do it as soon as one queue becomes "hung" 
 vs waiting for all of the queues to be hung.

 The issues I see with the igb/ixgbe driver's watchdog handling:

 1) Rather than managing 'timer' as described above, they store the current
    walltime when a packet is transmitted and then try to decide in the stats
    timer if too much time has gone by.  This makes things more complicated as
    you have to deal with 'ticks' rolling over, etc.  Also, igb tries to
    determine this in the tx interrupt handler rather than doing it directly
    in the stats timer.

 2) At least igb(4) tries to set IFF_DRV_OACTIVE when it thinks the chip is in
    the hung state.  This is not what OACTIVE means and shouldn't be there at
    all.  OACTIVE should be set in the if_start() case if the transmit ring is
    full and should be cleared either when the chip is reset or if a transmit
    completion interrupt frees up TX descriptors.

 3) The watchdog handler queues a RX/TX task on every stats timer tick if
    there are any pending TX frames.  a) It should only do this if the
    queue is hung, and b) it should only do TX processing, and c) it
    doesn't need to schedule a task.

 4) At least igb(4) attempts to maintain a separate transmit queue status.
    I find that this actually makes things more complex and harder to
    understand and that is simpler to check the relevant flag instead.

 My patch for igb(4) to fix this watchdog handling is below.  The fixes to 
 ixgbe are probably similar since these drivers share many algorithms:

 Index: if_igb.c
 ===================================================================
 --- if_igb.c	(revision 248162)
 +++ if_igb.c	(working copy)
 @@ -856,7 +856,8 @@ igb_resume(device_t dev)
  			    !drbr_empty(ifp, txr->br))
  				igb_mq_start_locked(ifp, txr);
  #else
 -			if (!IFQ_DRV_IS_EMPTY(&ifp->if_snd))
 +			if (((txr->queue_status & IGB_QUEUE_DEPLETED) == 0) &&
 +			    !IFQ_DRV_IS_EMPTY(&ifp->if_snd))
  				igb_start_locked(txr, ifp);
  #endif
  			IGB_TX_UNLOCK(txr);
 @@ -913,8 +914,10 @@ igb_start_locked(struct tx_ring *txr, struct ifnet
  		if (igb_xmit(txr, &m_head)) {
  			if (m_head != NULL)
  				IFQ_DRV_PREPEND(&ifp->if_snd, m_head);
 -			if (txr->tx_avail <= IGB_MAX_SCATTER)
 +			if (txr->tx_avail <= IGB_MAX_SCATTER) {
  				txr->queue_status |= IGB_QUEUE_DEPLETED;
 +				ifp->if_drv_flags |= IFF_DRV_OACTIVE;
 +			}
  			break;
  		}

 @@ -922,7 +925,7 @@ igb_start_locked(struct tx_ring *txr, struct ifnet
  		ETHER_BPF_MTAP(ifp, m_head);

  		/* Set watchdog on */
 -		txr->watchdog_time = ticks;
 +		txr->watchdog_time = IGB_WATCHDOG;
  		txr->queue_status |= IGB_QUEUE_WORKING;
  	}
  }
 @@ -1018,7 +1021,7 @@ igb_mq_start_locked(struct ifnet *ifp, struct tx_r
  	if (enq > 0) {
  		/* Set the watchdog */
  		txr->queue_status |= IGB_QUEUE_WORKING;
 -		txr->watchdog_time = ticks;
 +		txr->watchdog_time = IGB_WATCHDOG;
  	}
  	if (txr->tx_avail <= IGB_TX_CLEANUP_THRESHOLD)
  		igb_txeof(txr);
 @@ -1393,7 +1396,8 @@ igb_handle_que(void *context, int pending)
  		    !drbr_empty(ifp, txr->br))
  			igb_mq_start_locked(ifp, txr);
  #else
 -		if (!IFQ_DRV_IS_EMPTY(&ifp->if_snd))
 +		if (((txr->queue_status & IGB_QUEUE_DEPLETED) == 0) &&
 +		    !IFQ_DRV_IS_EMPTY(&ifp->if_snd))
  			igb_start_locked(txr, ifp);
  #endif
  		IGB_TX_UNLOCK(txr);
 @@ -1444,7 +1448,8 @@ igb_handle_link_locked(struct adapter *adapter)
  			    !drbr_empty(ifp, txr->br))
  				igb_mq_start_locked(ifp, txr);
  #else
 -			if (!IFQ_DRV_IS_EMPTY(&ifp->if_snd))
 +			if (((txr->queue_status & IGB_QUEUE_DEPLETED) == 0) &&
 +			    !IFQ_DRV_IS_EMPTY(&ifp->if_snd))
  				igb_start_locked(txr, ifp);
  #endif
  			IGB_TX_UNLOCK(txr);
 @@ -1581,7 +1586,8 @@ igb_msix_que(void *arg)
  	    !drbr_empty(ifp, txr->br))
  		igb_mq_start_locked(ifp, txr);
  #else
 -	if (!IFQ_DRV_IS_EMPTY(&ifp->if_snd))
 +	if (((txr->queue_status & IGB_QUEUE_DEPLETED) == 0) &&
 +	    !IFQ_DRV_IS_EMPTY(&ifp->if_snd))
  		igb_start_locked(txr, ifp);
  #endif
  	IGB_TX_UNLOCK(txr);
 @@ -2055,7 +2061,7 @@ retry:
  	tx_buffer = &txr->tx_buffers[first];
  	tx_buffer->next_eop = last;
  	/* Update the watchdog time early and often */
 -	txr->watchdog_time = ticks;
 +	txr->watchdog_time = IGB_WATCHDOG;

  	/*
  	 * Advance the Transmit Descriptor Tail (TDT), this tells the E1000
 @@ -2174,10 +2180,9 @@ igb_local_timer(void *arg)
  {
  	struct adapter		*adapter = arg;
  	device_t		dev = adapter->dev;
 -	struct ifnet		*ifp = adapter->ifp;
  	struct tx_ring		*txr = adapter->tx_rings;
  	struct igb_queue	*que = adapter->queues;
 -	int			hung = 0, busy = 0;
 +	int			hung = 0;

  	IGB_CORE_LOCK_ASSERT(adapter);
 @@ -2185,28 +2190,28 @@ igb_local_timer(void *arg)
  	igb_update_link_status(adapter);
  	igb_update_stats_counters(adapter);

 +	/*
 +	 * Don't check for any TX timeouts if the adapter received
 +	 * pause frames since the last tick or if the link is down.
 +	 */
 +	if (adapter->pause_frames != 0 || adapter->link_active == 0)
 +		goto out;
 +
          /*
          ** Check the TX queues status
 -	**	- central locked handling of OACTIVE
 -	**	- watchdog only if all queues show hung
 +	**	- watchdog if any queue hangs
          */
  	for (int i = 0; i < adapter->num_queues; i++, que++, txr++) {
 -		if ((txr->queue_status & IGB_QUEUE_HUNG) &&
 -		    (adapter->pause_frames == 0))
 -			++hung;
 -		if (txr->queue_status & IGB_QUEUE_DEPLETED)
 -			++busy;
 -		if ((txr->queue_status & IGB_QUEUE_IDLE) == 0)
 -			taskqueue_enqueue(que->tq, &que->que_task);
 +		IGB_TX_LOCK(txr);
 +		if (txr->watchdog_time >= 0)
 +			if (--txr->watchdog_time == 0)
 +				++hung;
 +		IGB_TX_UNLOCK(txr);
  	}
 -	if (hung == adapter->num_queues)
 +	if (hung != 0)
  		goto timeout;
 -	if (busy == adapter->num_queues)
 -		ifp->if_drv_flags |= IFF_DRV_OACTIVE;
 -	else if ((ifp->if_drv_flags & IFF_DRV_OACTIVE) &&
 -	    (busy < adapter->num_queues))
 -		ifp->if_drv_flags &= ~IFF_DRV_OACTIVE;

 +out:
  	adapter->pause_frames = 0;
  	callout_reset(&adapter->timer, hz, igb_local_timer, adapter);
  #ifndef DEVICE_POLLING
 @@ -2349,13 +2354,13 @@ igb_stop(void *arg)
  	callout_stop(&adapter->timer);

  	/* Tell the stack that the interface is no longer active */
 -	ifp->if_drv_flags &= ~IFF_DRV_RUNNING;
 -	ifp->if_drv_flags |= IFF_DRV_OACTIVE;
 +	ifp->if_drv_flags &= ~(IFF_DRV_RUNNING | IFF_DRV_OACTIVE);

  	/* Disarm watchdog timer. */
  	for (int i = 0; i < adapter->num_queues; i++, txr++) {
  		IGB_TX_LOCK(txr);
  		txr->queue_status = IGB_QUEUE_IDLE;
 +		txr->watchdog_time = 0;
  		IGB_TX_UNLOCK(txr);
  	}

 @@ -3566,6 +3571,7 @@ igb_initialize_transmit_units(struct adapter *adap
  		    E1000_READ_REG(hw, E1000_TDBAL(i)),
  		    E1000_READ_REG(hw, E1000_TDLEN(i)));

 +		txr->watchdog_time = 0;
  		txr->queue_status = IGB_QUEUE_IDLE;

  		txdctl |= IGB_TX_PTHRESH;
 @@ -3930,7 +3936,7 @@ igb_txeof(struct tx_ring *txr)
                          	tx_buffer->m_head = NULL;
                  	}
  			tx_buffer->next_eop = -1;
 -			txr->watchdog_time = ticks;
 +			txr->watchdog_time = IGB_WATCHDOG;

  	                if (++first == adapter->num_tx_desc)
  				first = 0;
 @@ -3955,24 +3961,20 @@ igb_txeof(struct tx_ring *txr)

          txr->next_to_clean = first;

 -	/*
 -	** Watchdog calculation, we know there's
 -	** work outstanding or the first return
 -	** would have been taken, so none processed
 -	** for too long indicates a hang.
 -	*/
 -	if ((!processed) && ((ticks - txr->watchdog_time) > IGB_WATCHDOG))
 -		txr->queue_status |= IGB_QUEUE_HUNG;
          /*
           * If we have a minimum free,
           * clear depleted state bit
           */
 -        if (txr->tx_avail >= IGB_QUEUE_THRESHOLD)          
 +        if (txr->tx_avail >= IGB_QUEUE_THRESHOLD) {
                  txr->queue_status &= ~IGB_QUEUE_DEPLETED;
 +#if __FreeBSD_version >= 800000
 +		ifp->if_drv_flags &= ~IFF_DRV_OACTIVE;
 +#endif
 +	}

  	/* All clean, turn off the watchdog */
  	if (txr->tx_avail == adapter->num_tx_desc) {
 -		txr->queue_status = IGB_QUEUE_IDLE;
 +		txr->watchdog_time = 0;
  		return (FALSE);
          }

 -- 
 John Baldwin