Performance Intel Pro 1000 MT (PWLA8490MT)

Bosko Milekic bmilekic at technokratis.com
Wed Apr 20 07:53:50 PDT 2005


On Wed, Apr 20, 2005 at 01:19:44PM +1000, Bruce Evans wrote:
> On Tue, 19 Apr 2005, Bosko Milekic wrote:
> 
> > My experience with 6.0-CURRENT has been that I am able to push at
> > least about 400kpps INTO THE KERNEL from a gigE em card on its own
> > 64-bit PCI-X 133MHz bus (i.e., the bus is uncontested) and that's
> 
> A 64-bit bus doesn't seem to be essential for reasonable performance.
> 
> I get about 210 kpps (receive) for a bge card on an old Athlon system
> with a 32-bit PCI 33MHz bus.  Overclocking this bus speeds up at least
> sending almost proportionally to the overclocking :-).  This is with
> my version of an old version of -current, with no mpsafenet, no driver
> tuning, and no mistuning (no INVARIANTS, etc., no POLLING, no HZ > 100).
> Sending goes slightly slower (about 200 kpps).

  That is still about half of what I get on the faster bus.
  Unfortunately, we are comparing apples with oranges, since I don't
  know what "my version of an old version of -current" refers to. :-)

> I get about 220 kpps (send) for a much-maligned (last year) sk non-card
> on a much-maligned newer nForce2 Athlon system with a 32-bit
> PCI 33MHz bus.  This is with a similar setup but with sending in the
> driver changed to not use the braindamaged sk interrupt moderation.
> The changes don't improve the throughput significantly since it is
> limited by the sk or bus to 4 us per packet, but they reduce interrupt
> overhead.
> 
> > basically out of the box GENERIC on a dual-CPU box with HTT disabled
> > and no debugging options, with small 50-60 byte UDP packets.
> 
> I used an old version of ttcp for testing.  A small packet for me is
> 5 bytes UDP data since that is the minimum that ttcp will send, but
> I repeated the tests with a packet size of 50 for comparison.  For
> the sk, the throughput with a packet size of 5 is only slightly larger
> (240 kpps).
>
> There are some kernel deficiencies which at best break testing using
> simple programs like ttcp and at worst reduce throughput:
> - when the tx queue fills up, the application should stop sending, at
>   least in the udp case, but there is no way for userland to tell
>   when the queue becomes non-full so that it is useful to try to add
>   to it -- select() doesn't work for this.  Applications either have
>   to waste cycles by retrying immediately or waste send slots by
>   retrying after a short sleep (see the sketch after this list).
> 
>   The old version of ttcp that I use uses the latter method, with a
>   sleep interval of 1000 usec.  This works poorly, especially with HZ
>   = 100 (which gives an actual sleep interval of 10000 to 20000 usec),
>   or with devices that have a smaller tx queue than sk (511).  The tx
>   queue always fills up when blasted with packets; it becomes non-full
>   a few usec later after a tx interrupt, and it becomes empty a few
>   usec or msec later, and then the transmitter is idle while ttcp
>   sleeps.  With sk and HZ = 100, throughput is reduced to approximately
>   511 * (1000000 / 15000) = 34066 pps.  HZ = 1000 is just large enough
>   for the sleep to always be shorter than the tx draining time (2/HZ
>   seconds = 2 msec < 4 * 511 usec = 2.044 msec), so transmission can
>   stream.
> 
>   Newer versions of ttcp like the one in ports are aware of this problem
>   but can't fix it since it is in the kernel.  tools/netrate is less
>   explicitly aware of this problem and can't fix it...  However, if
>   you don't care about using the sender for anything else and don't
>   want to measure efficiency of sending, then retrying immediately can
>   be used to generate almost the maximum pps.  Parts of netrate do this.
> 
> - the tx queue length is too small for all drivers, so the tx queue fills
>   up too often.  It defaults to IFQ_MAXLEN = 50.  This may be right for
>   1 Mbps ethernet or even for 10 Mbps ethernet, but it is too small for
>   100 Mbps ethernet and far too small for 1000 Mbps ethernet.  Drivers
>   with a larger hardware tx queue length all bump it up to their tx
>   queue length (often, bogusly, less 1), but it needs to be larger for
>   transmission to stream.  I use (SK_TX_RING_CNT + imax(2*tick, 10000) / 4)
>   for sk.
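
  To make the two userland retry strategies in the first item concrete,
  here is a minimal sketch of a UDP blaster (a hypothetical program
  using plain BSD sockets; the sink address and port are made up, and
  the only kernel behaviour assumed is the ENOBUFS return when the tx
  queue overflows):

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    struct sockaddr_in sin;
    char buf[50];               /* 50-byte UDP payload, as in the tests */
    int s;

    if ((s = socket(AF_INET, SOCK_DGRAM, 0)) < 0)
        err(1, "socket");
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(5001);                   /* made-up sink port */
    sin.sin_addr.s_addr = inet_addr("10.0.0.2");  /* made-up sink host */
    memset(buf, 0, sizeof(buf));

    for (;;) {
        if (sendto(s, buf, sizeof(buf), 0,
            (struct sockaddr *)&sin, sizeof(sin)) < 0) {
            if (errno == ENOBUFS) {
                /*
                 * Strategy 1: retry immediately.  Maximum pps, but
                 * burns 100% CPU (parts of netrate do this).
                 * Strategy 2 would be usleep(1000) here instead,
                 * which wastes tx slots whenever the 1-2 tick sleep
                 * exceeds the queue drain time, as described above.
                 */
                continue;
            }
            err(1, "sendto");
        }
    }
}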

  Yes, I think em bumps it up.  FWIW, I use ng_source(4) with a custom
  packet crafting tool to craft arbitrary packets and sequences
  thereof and feed them into the ng_source node.  The ng_source node
  blasts out as much as the tx queue can handle at every clock tick.
  It runs in-kernel and is very fast.
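
  For reference, feeding ng_source(4) from userland looks roughly like
  the sketch below.  It is not a drop-in tool: it assumes the graph was
  wired up beforehand with ngctl(8) (the comment shows one plausible
  wiring), that the socket node's local hook is named "input", that the
  ng_source node was named "src", and that the "start" ASCII control
  message takes a packet count, per ng_source(4).  Link with -lnetgraph.

#include <sys/types.h>
#include <netgraph.h>
#include <err.h>
#include <string.h>

int
main(void)
{
    u_char pkt[60];     /* one crafted ethernet frame (placeholder) */
    int cs, ds;

    /* Create a socket node so we can talk to the graph. */
    if (NgMkSockNode("blaster", &cs, &ds) < 0)
        err(1, "NgMkSockNode");

    /*
     * Assumed wiring, done beforehand, e.g.:
     *   ngctl mkpeer em0: source orphans output
     *   ngctl name em0:orphans src
     *   ngctl connect blaster: src: input input
     */
    memset(pkt, 0, sizeof(pkt));

    /* Queue one packet inside the ng_source node... */
    if (NgSendData(ds, "input", pkt, sizeof(pkt)) < 0)
        err(1, "NgSendData");

    /* ...and ask the node to blast out one million copies. */
    if (NgSendAsciiMsg(cs, "src:", "start %d", 1000000) < 0)
        err(1, "NgSendAsciiMsg");

    return (0);
}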

> > My tests were done without polling so with very high interrupt load
> > and that also sucks when you have a high-traffic scenario.
> 
> Interrupt load isn't necessarily very high, relevant or reduced by
> polling.  For transmission, with non-broken hardware and software,
> there should be not many more than (pps / <size of hardware tx queue>)
> tx interrupts per second, and <size of hardware tx queue> should be
> large enough that there aren't many txintrs/sec.  For sk, this gives
> 240000 / 511 = 470.  After reprogramming sk's interrupt handling, I
> get 539.
> The standard driver used to get 7000+ with the old interrupt moderation
> timeout of 200 usec (actually 137 usec for Yukon, 200 for Genesis),
> and now 14000+ with an interrupt moderation timeout of 200 (68.5)
> usec.  The interrupt load for 539 txintrs/sec and 240 kpps is 10% on an
> AthlonXP2600 (Barton) overclocked.  Very little of this is related to
> interrupts, so the term "interrupt load" is misleading.  About 480
> packets are handled for every tx interrupt (512 less 32 for watermark
> stuff).  Much more than 90% of the handling is useful work and would
> have to be done somewhere; it just happens to be done in the interrupt
> handler, and that is the best place to do it.  With polling, it would
> take longer to do it and the load is poorly reported so it is hard to see.
> The system load for 539 txintrs/sec and 240 kpps is much larger.  It
> is about 45% (up from 25% in RELENG_4 :-().

   Mainly, I was talking about receiver performance and how interrupt
   load is obviously reduced by polling.  That is a fact of polling.
   Whether this results in better performance or not is the subject of
   various papers, notably Luigi's original paper on his original
   4.x-based polling implementation.  It turns out that in many cases
   the polling model is better, not because it can pull more packets
   out of the card (that is less relevant), but because it allows
   other things to happen and mitigates the live-lock scenario.  You
   note this very fact in your reply below.
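
   To see why polling bounds the work: a 5.x-era polling handler is
   handed a count that caps how many packets it may process per call,
   and that cap is what prevents rx live-lock.  A sketch of the shape
   (the xx_* driver functions are hypothetical; the handler signature
   and poll_cmd values are from the DEVICE_POLLING framework of that
   era):

/*
 * Kernel-context fragment (not a standalone program).  In the 5.x era
 * enum poll_cmd and poll_handler_t come from <sys/kernel.h>.
 */
#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/socket.h>
#include <net/if.h>

struct xx_softc;                               /* hypothetical driver */
static void xx_rxeof(struct xx_softc *, int);
static void xx_txeof(struct xx_softc *);
static void xx_enable_intr(struct xx_softc *);
static void xx_check_link(struct xx_softc *);

static void
xx_poll(struct ifnet *ifp, enum poll_cmd cmd, int count)
{
    struct xx_softc *sc = ifp->if_softc;

    if (cmd == POLL_DEREGISTER) {
        /* Final call: hand the device back to interrupt mode. */
        xx_enable_intr(sc);
        return;
    }
    xx_rxeof(sc, count);        /* process at most `count` rx packets */
    xx_txeof(sc);               /* reclaim completed tx descriptors */
    if (cmd == POLL_AND_CHECK_STATUS)
        xx_check_link(sc);      /* occasional slow-path work */
}

   (The handler would be registered with ether_poll_register().)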

> [Context almost lost to top posting.]
> 
> >>>>On 4/19/2005 1:32 PM, Eivind Hestnes wrote:
> >>>>
> >>>>>I have an Intel Pro 1000 MT (PWLA8490MT) NIC (em(4) driver 1.7.35)
> >>>>>installed in a Pentium III 500 MHz with 512 MB RAM (100 MHz) running
> >>>>>FreeBSD 5.4-RC3.  The machine is routing traffic between multiple
> >>>>>VLANs.  Recently I did a benchmark with/without device polling
> >>>>>enabled.  Without device polling I was able to transfer roughly
> >>>>>180 Mbit/s.  The router however was suffering when doing this
> >>>>>benchmark.  Interrupt load was peaking 100% - overall the system
> >>>>>itself was quite unusable (_very_ high system load).
> 
> I think it is CPU-bound.  My Athlon2600 (overclocked) is many times
> faster than your P3/500 (5-10 times?), but it doesn't have much CPU
> left over (sending 240000 5-byte udp packets per second from sk takes
> 60% of the CPU, and sending 53000 1500-byte udp packets per second
> takes 30% of the CPU; sending tcp packets takes less CPU but goes
> slower).  Apparently 2 or 3 P3/500's worth of CPU is needed just to
> keep up with the transmitter (with 100% of the CPU used but no
> transmission slots missed).  RELENG_4 has lower overheads so it might
> need only 1 or 2 P3/500's worth of CPU to keep up.
> 
> >>>>>With device polling enabled the interrupt load stayed stable around
> >>>>>40-50% and the max transfer rate was nearly 70 Mbit/s.  Not very
> >>>>>scientific tests, but they gave me a rough indication.
> 
> I don't believe in device polling.  It's not surprising that it reduces
> throughput for a device that has large enough hardware queues.  It just
> lets a machine that is too slow to handle 1Gbps ethernet (at least under
> FreeBSD) sort of work by not using the hardware to its full potential.
> 70 Mbit/s is still bad -- it's easy to get more than that with a 100Mbps
> NIC.
> 
> >>>>>eivind at core-gw:~$ sysctl -a | grep kern.polling
> >>>>>...
> >>>>>kern.polling.idle_poll: 0
> 
> Setting this should increase throughput when the system is idle by taking
> 100% of the CPU then.  With just polling every 1 msec (from HZ = 1000),
> there are the same problems as with ttcp retrying every 10-20 msec, but
> scaled down by a factor of 10-20.  For my ttcp example, the transmitter
> runs dry every 2.044 msec so the polling interval must be shorter than
> 2.044 msec, but this is with a full hardware tx queue (511 entries) on
> a not very fast NIC.  If the hardware is just twice as fast or the tx
> queue is just half as large or half as full, then the hardware tx queue
> will run dry when polled every 1 msec and hardware capability will be
> wasted.  This problem can be reduced by increasing HZ some more, but I
> don't believe in increasing it beyond 100, since only software that
> does too much polling would notice it being larger.
> 
> Bruce

  This last point brings up a whole flurry of thoughts, albeit
  seemingly unrelated: have you thought about routing all interrupts
  for a particular network device from the IO APIC to the _same_ Local
  APIC, always?  I don't see an advantage in round-robining them.  Do
  you?
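
  A footnote to the HZ discussion above: Bruce's drain-time arithmetic
  reduces to an easy rule of thumb.  A trivial calculator with the sk
  numbers from this thread plugged in (the numbers are illustrations,
  not new measurements; link with -lm):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double ring = 511.0;        /* hardware tx ring entries (sk) */
    double usec_per_pkt = 4.0;  /* sk/bus limit per packet */

    /* A full ring drains in ring * usec_per_pkt microseconds... */
    double drain_ms = ring * usec_per_pkt / 1000.0;     /* 2.044 ms */

    /* ...so the 1000/HZ msec polling period must stay below that. */
    double min_hz = 1000.0 / drain_ms;                  /* ~489 */

    printf("full ring drains in %.3f ms; HZ must exceed %.0f to stream\n",
        drain_ms, ceil(min_hz));
    return (0);
}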

--
Bosko Milekic
bmilekic at technokratis.com
bmilekic at FreeBSD.org

