bge dropping packets issue

Bruce Evans brde at optusnet.com.au
Fri Apr 18 01:13:45 UTC 2008


On Thu, 17 Apr 2008, Alexander Sack wrote:

> On Wed, Apr 16, 2008 at 10:53 PM, Bruce Evans <brde at optusnet.com.au> wrote:
>> On Wed, 16 Apr 2008, Alexander Sack wrote:

>[DEVICE_POLLING]
> But why was it added to begin with if standard interrupt-driven I/O is
> faster?  (Was it the fact that historically hardware didn't do
> interrupt coalescing initially?)

See Robert's reply.

>>> However, my point still stands:
>>>
>>> #define TG3_RX_RCB_RING_SIZE(tp) ((tp->tg3_flags2 &
>>> TG3_FLG2_5705_PLUS) ?  512 : 1024)
>>>
>>> Even the Linux driver uses a higher number of RX descriptors than
>>> FreeBSD's static 256.  I think minimally making this tunable is a fair
>>> approach.
>>>
>>> If not, no biggie, but I think it's worth it.
>>
>>  I use a fixed value of 512 (jkim gave a pointer to old mail containing
>>  a fairly up to date version of my patches for this and more important
>>  things).  This should only make a difference with DEVICE_POLLING.
>
> Then minimally 512 could be used if DEVICE_POLLING is enabled and my
> point still stands.  Though in light of the other statistics you cited
> I understand now that this may not make that big of an impact.

em uses only 256 too (I misread it as using 2048).  Someone reported that
increasing this to 4096 reduced packet loss with polling.
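
For reference, if the ring size were made tunable the natural shape
would be a loader tunable fetched before the rings are set up.  A
minimal sketch, assuming a hypothetical hw.bge.rx_ring_cnt knob and
helper (the stock driver sizes the ring with a compile-time constant):

%%%
/*
 * Sketch only: the tunable name, helper and limits below are invented
 * for illustration; they are not in the stock driver.
 */
#include <sys/param.h>
#include <sys/kernel.h>

static int bge_rx_ring_cnt = 256;	/* current default */

static void
bge_fetch_tunables(void)
{
	TUNABLE_INT_FETCH("hw.bge.rx_ring_cnt", &bge_rx_ring_cnt);
	/* Keep it within what the ring macros and hardware allow. */
	if (bge_rx_ring_cnt < 256)
		bge_rx_ring_cnt = 256;
	else if (bge_rx_ring_cnt > 512)
		bge_rx_ring_cnt = 512;
}
%%%

With polling enabled, 512 (or whatever the hardware's standard ring
limit is) could then be selected at boot without a rebuild.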

>>  Without DEVICE_POLLING, device interrupts normally occur every 150 usec
>
> Is that the coal ticks value you are referring to?  Sorry, this is my
> first time looking at this driver!

Yes, the driver normally configures the coalescing ticks value
(rx_coal_ticks) to 150 usec.  This is a good default.
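
For a rough feel of what that interval asks of the RX ring, here is a
back-of-the-envelope check (throwaway userland C, using the 608 nsec
minimum tiny-frame time worked out further down in this mail):

%%%
/* How many descriptors can worst-case tiny frames consume while the
 * chip sits out one 150 usec coalescing interval? */
#include <stdio.h>

int
main(void)
{
	double coal_usec = 150.0;	/* default rx coalescing interval */
	double frame_usec = 0.608;	/* minimum tiny-frame time at 1Gbps */

	printf("descriptors: %.0f\n", coal_usec / frame_usec);
	/* ~247, which is why 256 is marginal and 512 is comfortable. */
	return (0);
}
%%%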

>>  or even more frequently (too frequently if the average is much lower
>>  than 150 usec), so 512 descriptors is more than enough for 1Gbps ethernet
>>  (the minimum possible inter-descriptor time for tiny packets is about 0.6
>>  usec,
>
> How do you measure this number?

0.6 usec is the theoretical minimum.  I actually measure a minimum of
about 1.5 usec for my hardware (5701 PCI/X on plain PCI) by making
timestamps in bge_rxeof() and bge_txeof().  (1.5 usec is the average
for a ring full of descriptors.)
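
The instrumentation amounts to little more than stamping each call.  A
minimal sketch of the idea (not the actual patch -- the helper name and
static state are invented for the example):

%%%
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/time.h>

static struct timespec bge_prev_rxeof;	/* illustration only */

static void
bge_rxeof_stamp(int ndescs)
{
	struct timespec now;
	long nsec;

	nanouptime(&now);
	nsec = (now.tv_sec - bge_prev_rxeof.tv_sec) * 1000000000L +
	    (now.tv_nsec - bge_prev_rxeof.tv_nsec);
	bge_prev_rxeof = now;
	printf("bge_rxeof: %ld nsec since previous call, %d descriptors\n",
	    nsec, ndescs);
	/*
	 * With the ring kept full, nsec / ndescs approximates the
	 * per-descriptor time (~1.5 usec here vs the 0.6 usec floor).
	 */
}
%%%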

> I'm assuming when you say "inter-descriptor time" you mean the time it
> takes the card to fill a RX descriptor on receipt of a packet (really
> the firmware latency?).

No, it follows from the Ethernet spec: the inter-frame gap is 96 bit
times for all speeds of Ethernet IIRC (so in absolute time it is much
shorter than it was for original Ethernet), and at 1Gbps the minimum
frame plus gap works out to about 0.6 usec.  At least my hardware takes
significantly longer than this (1.5 - 0.6 usec = 900 nsec extra!).  It
is unclear where the extra time is spent, but presumably the hardware
implements the Ethernet spec and is limited mainly by the bus speed (if
the bus is plain PCI; otherwise DMA speed might be the limit), so if
packets arrived every 0.6 usec then it would buffer many of them in
fast device memory and then be forced to drop 9 in every 15 on average.
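
The 9-in-15 figure is just rate arithmetic; a throwaway userland check,
assuming the 0.6 usec wire minimum and the 1.5 usec per-descriptor
service time measured above:

%%%
#include <stdio.h>

int
main(void)
{
	double arrive_usec = 0.6;	/* minimum inter-packet time on the wire */
	double drain_usec = 1.5;	/* measured per-descriptor service time */

	/* In the 9 usec it takes to drain 6 packets, 15 arrive. */
	printf("dropped fraction: %.2f (9 in every 15)\n",
	    1.0 - arrive_usec / drain_usec);
	return (0);
}
%%%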

>>  For timeouts instead of device polls, at least on old systems it was
>>  quite common for timeouts at a frequency of HZ not actually being
>>  delivered, even when HZ was only 100, because some timeouts run for
>>  too long (several msec each, possibly combining to > 10 msec occasionally).
>>  Device polls are at a lower level, so they have a better chance of
>>  actually keeping up with HZ.  Now the main source of timeouts that run
>>  for too long is probably mii tick routines.  These won't combine, at
>>  least for MPSAFE drivers, but they will block both interrupts and
>>  device polls for their own device.  So the rx ring size needs to be
>>  large enough to cover max(150 usec or whatever interrupt moderation time,
>>  mii tick time) of latency plus any other latencies due to interrupt
>>  handling or polling for other devices.  Latencies due to interrupts
>>  on other devices are only certain to be significant if the other
>>  devices have higher or the same priority.
>
> You described what I'm seeing.  The fact that the driver uses one mtx
> for everything doesn't help either.  I'm pretty sure I'm running into
> RX descriptor starvation despite the fact that, statistically speaking,
> 256 descriptors is enough for 1Gbps (I'm talking 100MBps and the
> firmware is dropping packets).  The problem gets worse if I add some
> kind of I/O workload on the system (my load happens to be a gzip of a
> large log file in /tmp).

I haven't found the mii tick latency to be a problem in practice, though
I once suspected it.  Oh, I just remembered that this requires working
PREEMPTION so that lower-priority interrupt handlers like ata and sc get
preempted.  PREEMPTION wasn't the default and didn't work very well until
relatively recently.  But I think it works in 7.0.

> I noticed that if I put ANY kind of debugging messages in bge_tick()
> the drop gets much worse (for example, just printing out the number of
> dropped packets read from bge_stats_update() when a drop occurs causes
> EVEN more drops to occur, and if I had to guess it's the printf just
> using up more cycles, which delays the drain of the RX chain and causes
> a longer time to recover - this is a constant stream from a traffic
> generator).

Delays while holding the lock will cause problems of course.  Hmm,
bge_tick() is a callout, so it may itself be delayed or preempted.
Delaying it shouldn't matter, and latency from preempting it is
supposed to be handled by priority propagation:

 	callout ithread runs
 	calls bge_tick()
 	acquires device mutex
 	...
 		preempted by unrelated ithread
 		...
 			preempted by bge ithread
 			tries to acquire device mutex; blocks
 			bge ithread priority is propagated to callout ithread
 		preempted by callout ithread
 	... // now it is high priority; should be more careful not to take long
 	releases device mutex; loses its propagated priority
 			preempted by bge ithread
 			acquires device mutex
 			...


>>  Some numbers for [1 Gbps] ethernet:
>>
>>  minimum frame size = 64 bytes =    512 bits
>>  minimum inter-frame gap =           96 bits
>>  minimum total frame time =         608 nsec (may be off by 64)
>>  bge descriptors per tiny frame   = 1 (1 for mbuf)
>>  buffering provided by 256 descriptors = 256 * 608 = 155.648 usec (marginal)
>
> So as I read this, it takes 155 usec to fill up the entire RX chain of
> rx_bd's if it's just small packets, correct?

At least that long, depending on bus and DMA speeds.

>>  normal frame size = 1518 bytes = 12144 bits
>>  normal total frame time =        12240 nsec
>>  bge descriptors per normal frame = 2 (1 for mbuf and 1 for mbuf cluster)
>>  buffering provided by 256 descriptors = 256/2 * 12240 = 1566.720 usec
>> (plenty)
>
> Is this based again on your own instrumentation based on the last
> patch?  (just curious, I believe you, I just wanted to know if this
> was an artifact of you doing some tuning research or something else)

This is a theoretical minimum too, but in practice even a PCI bus can
almost keep up with 1Gbps ethernet in 1 direction, so I've measured
average packet rates of > 81 kpps for normal frames (81 kpps = 12345
nsec per packet).  Timestamps made in bge_rxeof() at a rate of only
62.7 kpps (since my em card can't go faster than this) look like this:

%%%
   97 1208479322.632804  13   0   7 1208479322.632804   6
  104 1208479322.632908  11   0   6 1208479322.632908   5
  105 1208479322.633013   9   1   5 1208479322.633014   4
   64 1208479322.633078  10   0   4 1208479322.633078   6
   95 1208479322.633173  11   1   6 1208479322.633174   5
%%%

Here the columns give:
1st: time in usec between bge_rxeof() calls
4th: time in usec taken by this call
5th: number of descriptors processed by this call
other: raw timestamps and ring indexes

The inter-rxeof time is ~100 usec since rx_coal_ticks is configured to 100.
Thus there are only a few packets per interrupt at the "low" rate of 62.7 kpps.
There are no latency problems in sight in this truncated output.
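
As a sanity check, the 5th column matches what those two numbers
predict (throwaway userland arithmetic):

%%%
#include <stdio.h>

int
main(void)
{
	double pps = 62700.0;		/* measured packet rate */
	double coal = 100e-6;		/* rx_coal_ticks, in seconds */

	printf("expected descriptors per call: %.1f\n", pps * coal);
	/* ~6.3, in line with the 4-7 seen above. */
	return (0);
}
%%%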

This output is inconsistent with what I said above -- there is no sign of
the factor of 2 for the mbuf+cluster split.  I now think that that split
only affects the transmit side.

> So the million dollar question:  Do you believe that if I disable
> DEVICE_POLLING and use interrupt-driven I/O, I could achieve zero
> packet loss over a 1Gbps link?  This is the main issue I need to solve
> (solve means either no, it's not really achievable without a heavy
> rewrite of the driver OR yes, it is with some tuning).  If the answer
> is yes, then I have to understand the impact on the system in general.
> I just want to be sure I'm on a viable path through the BGE maze!

I think you can get close enough if the bus and memory and CPU(s)
permit and you don't need to get too close to the theoretical limits.

Bruce

