IF_HANDOFF vs. IFQ_HANDOFF

Mon Jun 19 12:28:16 UTC 2006

On Mon, Jun 19, 2006 at 06:04:26PM +1000, Bruce Evans wrote:
 > On Sun, 18 Jun 2006, John-Mark Gurney wrote:
 > 
 > >John Polstra wrote this message on Thu, Jun 15, 2006 at 09:18 -0700:
 > >>in the HW but have not yet completed.  When the completion interrupt
 > >>comes in, the driver is supposed to check the if_snd queue for more
 > >>mbufs and process them.  Only when the transmit side of the HW goes
 > >>totally idle should IFF_OACTIVE be cleared again.  Most of our drivers
 > >>set the flag only when they run out of transmit descriptors (i.e.,
 > >>practically never), which is just plain wrong.
 > >
 > >But the problem is that for small packets, this can mean that there
 > >will be a delay in handling the ring if we wait to process packets
 > >once the tx ring is empty.. if we ever want to max out gige w/ 64byte
 > >packets, we have to clear OACTIVE whenever tx approaches running out
 > >of packets before we can send this.. In most cases we don't know how
 > >long that is (since we don't keep track of packet sizes, etc), so it's
 > >easiest/best to clear it whenever the tx ring is not full...
 > 
 > To max out the link without unmaxing CPU for other uses, you do have
 > to know when the tx approaches running out of packets.  This is best
 > done using watermark stuff.  There should be a nearly-complete interrupt
 > at low water, and (only after low water is reached and the interrupt
 > handler doesn't refill the tx ring to be above low water again) a
 > completion interrupt at actual completion.  My version of the sk driver
 > does this.  It arranges for the nearly-complete interrupt at about 32
 > fragments (min 128 uS) before the tx runs dry, and no other tx interrupts
 > unless the queue length stays below 32, while the -current driver gets
 > an interrupt after every packet.  It does this mainly to reduce the
 > tx interrupt load from 1 per packet to (under load) 1 per 480 fragments.
 > The correct handling of OACTIVE is obtained as a side effect almost
 > automatically.  It must be decided when to interrupt (sk hardware
 > allows interrupting or not interrupting after every fragment), and it
 > would be obviously wrong to interrupt only after the last fragment in
 > the ring since the tx might run dry then (even if the tx interrupt
 > occurs when the last fragment is removed by the hardware from the ring
 > but before it is sent, it only takes a few uS to send it so the tx
 > would often run dry due to software latency).
 > 
 > I'm not very familiar with NIC hardware and don't know how other NICs
 > support timing of tx interrupts, but watermark stuff like the above
 > is routine for serial devices/drivers.  sk's support for interrupting
 > on any fragment is too flexible to be good (it is painful to program,
 > and there doesn't seem to be a good way to time out if there is no
 > good fragment to interrupt on or when you program the interruption on
 > a wrong fragment).
 > 
 > Related serial device programming: 8250-16650 UARTs interrupt when the
 > last character is removed from the tx "ring".  This is not programmable,
 > but the delay is long enough at low speeds (87 uS at 115200 bps).  The
 > 16950 UART has a programmable tx interrupt trigger level which defaults
 > to 1 character time.  The delay from this is too short at higher speeds
 > (11 uS at 921600 bps...).  I use 16.  The "tx" ring size of a 16950
 > is 128 characters.  Timing for characters in a UART at 921600 bps is
 > similar to timing for normal packets in 1G bps ethernet (1G/921600 ~=
 > 1K ~= 1500+ normal ethernet packet size), so similar ring sizes and
 > trigger levels are good (smaller ones would be better for smaller
 > packets).  Strangely, at 921600 bps, the tx trigger levels become more
 > critical for maxing out the device than the rx trigger levels, since
 > rx is forced to keep up by the external device (provided that maxes
 > out the connection and it is possible to keep up), while poorly chosen
 > tx trigger levels ensure significant dead time when the tx runs dry.
 > 
 > BTW, I can't see any significant effect (good or bad) from sk's
 > interrupt moderation, at least with tx changed as above.  sk's interrupt
 > moderation is very primitive compared with that of some NICs (it's
 > just a single timer for tx and rx).  Interrupting on every packet gives
 > too many interrupts, and my changes fix this much better than any
 > simple timeout-based moderation could do.  My changes don't help at
 > all for rx, and interrupt moderation doesn't seem to help either.
 > OTOH, fxp's interrupt moderation works well in practice (I don't know
 > how) and em's interrupt moderation works well in theory (I understand
 > its documentation but haven't used any em devices).  em has several
 > independent trigger levels and timeouts, and the problem of using them
 > effectively for rx is one of predicting future traffic.  IIRC, em has
 > sysctls to move this problem to the user.
 > 
 > In the current sk driver, I think keeping IFF_OACTIVE set for longer
 > would work, and you can also keep track of the queue length, because
 > of the excessive interrupts -- you get an interrupt after every packet
 > (modulo interrupt moderation), not just on completion, and the interrupt
 > handler can both keep the h/w queue full while IFF_OACTIVE is set and
 > keep track of the queue length as needed for deciding when to set
 > IFF_OACTIVE.  The CPU usage is thus large no matter whether IFF_OACTIVE
 > is set correctly.  Interrupt moderation complicates things and unmaxes
 > the link.  The interrupt moderation timeout is normally set to 100 uS.
 > This allows significant tx-dry times (the worst case (if IFF_OACTIVE
 > is not set incorrectly) is sending a tinygram in 4uS, idling for ~96
 > uS, ...) but isn't very moderate since sending or receiving a normal
 > packet takes about 15uS.  I think the interrupt moderation timeout for
 > sk is purely periodic, while for better hardware (even 16550 UARTs!)
 > at least rx timeouts only occur after the device (in the relevant
 > direction) has been idle for some time.
 > 

AFAIK SK GENESIS has no programming interface for a watermark.
Some advanced hardware provides a way to interrupt when it reaches
a programmed threshold, but SK does not. It only lets the driver
choose, per Tx descriptor, whether the hardware should raise an
interrupt. By keeping a count of queued descriptors it's possible to
generate an interrupt for every N frames instead of every frame
(1 <= N <= max. number of Tx descriptors).
We may also need to add a routine to reclaim pending Tx descriptors
before sending frames in sk_start if the number of available Tx
descriptors is less than a threshold.
However, I don't know how the driver should handle transmit errors
that occur between interrupt-less Tx operations. Just flushing all
committed frames would result in poor TCP performance.
The differences between Yukon and SK hardware also make it hard to
implement the above interrupt-less Tx operations. There is no publicly
available documentation for Yukon adapters, and Yukon seems to use
completely different registers for FIFO handling and flow control.
This is one of the main reasons why I couldn't implement polling(4)
for sk(4). I also know that Yukon adapters have a bug which loses
Tx completion interrupts under certain conditions.
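
To make the every-N-frames idea concrete, here is a rough standalone
sketch in C. It is a simulation, not sk(4) code: the ring layout, the
names (tx_ring, tx_enqueue, tx_reclaim, hw_drain, demo_interrupts) and
the constants (ring size 512, N = 32, reclaim threshold 64) are all
made up for illustration, and it ignores the transmit error case.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TX_RING_CNT    512  /* hypothetical ring size */
#define TX_INTR_EVERY   32  /* N: request an interrupt on every Nth frame */
#define TX_RECLAIM_LO   64  /* reclaim from the start path below this */

struct tx_desc {
	bool done;       /* set by the (simulated) hardware on completion */
	bool want_intr;  /* driver asked for an interrupt on this descriptor */
};

struct tx_ring {
	struct tx_desc desc[TX_RING_CNT];
	int prod;        /* next descriptor the driver fills */
	int hw;          /* next descriptor the hardware sends */
	int cons;        /* next descriptor the driver reclaims */
	int nfree;       /* descriptors available to the driver */
	int inflight;    /* queued but not yet sent by the hardware */
	int since_intr;  /* frames queued since the last interrupt request */
};

static void
tx_ring_init(struct tx_ring *r)
{
	memset(r, 0, sizeof(*r));
	r->nfree = TX_RING_CNT;
}

/* Free every descriptor the hardware has finished with. */
static int
tx_reclaim(struct tx_ring *r)
{
	int n = 0;

	while (r->nfree < TX_RING_CNT && r->desc[r->cons].done) {
		r->desc[r->cons].done = false;
		r->desc[r->cons].want_intr = false;
		r->cons = (r->cons + 1) % TX_RING_CNT;
		r->nfree++;
		n++;
	}
	assert(r->nfree <= TX_RING_CNT);
	return (n);
}

/* Start path: queue one frame, reclaiming first if descriptors run low. */
static bool
tx_enqueue(struct tx_ring *r)
{
	struct tx_desc *d;

	if (r->nfree < TX_RECLAIM_LO)
		(void)tx_reclaim(r);
	if (r->nfree == 0)
		return (false);
	d = &r->desc[r->prod];
	d->done = false;
	d->want_intr = false;
	if (++r->since_intr == TX_INTR_EVERY) {
		d->want_intr = true;    /* only every Nth frame interrupts */
		r->since_intr = 0;
	}
	r->prod = (r->prod + 1) % TX_RING_CNT;
	r->nfree--;
	r->inflight++;
	return (true);
}

/* Simulated hardware: send everything queued, count interrupts raised. */
static int
hw_drain(struct tx_ring *r)
{
	int intrs = 0;

	while (r->inflight > 0) {
		r->desc[r->hw].done = true;
		if (r->desc[r->hw].want_intr)
			intrs++;
		r->hw = (r->hw + 1) % TX_RING_CNT;
		r->inflight--;
	}
	return (intrs);
}

/* Queue nframes, let the hardware drain them, return interrupts raised. */
int
demo_interrupts(int nframes)
{
	struct tx_ring r;
	int i, intrs = 0;

	tx_ring_init(&r);
	for (i = 0; i < nframes; i++) {
		if (!tx_enqueue(&r)) {
			/* Ring full: let the hardware catch up, then retry. */
			intrs += hw_drain(&r);
			(void)tx_reclaim(&r);
			(void)tx_enqueue(&r);
		}
	}
	intrs += hw_drain(&r);
	return (intrs);
}
```

Note that a burst shorter than N frames raises no interrupt at all in
this sketch, so its descriptors stay unreclaimed until the next call
into the start path. That is why the reclaim in tx_enqueue is there.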

BTW, as SK adapters have no limit on the number of Tx/Rx descriptors,
how about increasing the number of Tx descriptors (e.g. to 1024 or
2048) to reduce the chance of running out of Tx descriptors?
It does not decrease the number of interrupts generated, but it would
help to push the hardware to the limit without much overhead, I guess.

-- 
Regards,
Pyun YongHyeon