Performance Intel Pro 1000 MT (PWLA8490MT)

Bruce Evans bde at zeta.org.au
Tue Apr 19 20:19:52 PDT 2005


On Tue, 19 Apr 2005, Bosko Milekic wrote:

>  My experience with 6.0-CURRENT has been that I am able to push at
>  least about 400kpps INTO THE KERNEL from a gigE em card on its own
>  64-bit PCI-X 133MHz bus (i.e., the bus is uncontested) and that's

A 64-bit bus doesn't seem to be essential for reasonable performance.

I get about 210 kpps (receive) for a bge card on an old Athlon system
with a 32-bit PCI 33MHz bus.  Overclocking this bus speeds up at least
sending almost proportionally to the overclocking :-).  This is with
my version of an old version of -current, with no mpsafenet, no driver
tuning, and no mistuning (no INVARIANTS, etc., no POLLING, no HZ > 100).
Sending goes slightly slower (about 200 kpps).

I get about 220 kpps (send) for a much-maligned (last year) sk non-card
on a much-maligned newer Athlon nForce2 system with a 32-bit
PCI 33MHz bus.  This is with a similar setup but with sending in the
driver changed to not use the braindamaged sk interrupt moderation.
The changes don't improve the throughput significantly since it is
limited by the sk or bus to 4 us per packet, but they reduce interrupt
overhead.

>  basically out of the box GENERIC on a dual-CPU box with HTT disabled
>  and no debugging options, with small 50-60 byte UDP packets.

I used an old version of ttcp for testing.  A small packet for me is
5 bytes UDP data since that is the minimum that ttcp will send, but
I repeated the tests with a packet size of 50 for comparison.  For
the sk, the throughput with a packet size of 5 is only slightly larger
(240 kpps).

There are some kernel deficiencies which at best break testing using
simple programs like ttcp and at worst reduce throughput:
- when the tx queue fills up, the application should stop sending, at
   least in the udp case, but there is no way for userland to tell
   when the queue becomes non-full so that it is useful to try to add
   to it -- select() doesn't work for this.  Applications either have
   to waste cycles by retrying immediately or waste send slots by
   retrying after a short sleep (see the sketch after this list).

   The old version of ttcp that I use uses the latter method, with a
   sleep interval of 1000 usec.  This works poorly, especially with HZ
   = 100 (which gives an actual sleep interval of 10000 to 20000 usec),
   or with devices that have a smaller tx queue than sk (511).  The tx
   queue always fills up when blasted with packets; it becomes non-full
   a few usec later after a tx interrupt, and it becomes empty a few
   usec or msec later, and then the transmitter is idle while ttcp
   sleeps.  With sk and HZ = 100, throughput is reduced to approximately
   511 * (1000000 / 15000) = 34066 pps.  HZ = 1000 is just large enough
   for the sleep to always be shorter than the tx draining time (2/HZ
   seconds = 2 msec < 4 * 511 usec = 2.044 msec), so transmission can
   stream.

   Newer versions of ttcp like the one in ports are aware of this problem
   but can't fix it since it is in the kernel.  tools/netrate is less
   explicitly aware of this problem and can't fix it...  However, if
   you don't care about using the sender for anything else and don't
   want to measure efficiency of sending, then retrying immediately can
   be used to generate almost the maximum pps.  Parts of netrate do this.

- the tx queue length is too small for all drivers, so the tx queue fills
   up too often.  It defaults to IFQ_MAXLEN = 50.  This may be right for
   1 Mbps ethernet or even for 10 Mbps ethernet, but it is too small for
   100 Mbps ethernet and far too small for 1000 Mbps ethernet.  Drivers
   with a larger hardware tx queue length all bump it up to their tx
   queue length (often, bogusly, less 1), but it needs to be larger for
   transmission to stream.  I use (SK_TX_RING_CNT + imax(2*tick, 10000) / 4)
   for sk.
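
   For scale, a small standalone calculation (not driver code;
   SK_TX_RING_CNT = 512 and the kernel's tick = 1000000 / HZ are filled
   in by hand) of what that sk queue length works out to, next to the
   IFQ_MAXLEN default of 50:

       #include <stdio.h>

       #define SK_TX_RING_CNT  512             /* sk(4) hardware tx ring size */

       static int
       imax(int a, int b)
       {
           return (a > b ? a : b);
       }

       int
       main(void)
       {
           int hz[] = { 100, 1000 };

           for (int i = 0; i < 2; i++) {
               int tick = 1000000 / hz[i];     /* usec per clock tick */
               int qlen = SK_TX_RING_CNT + imax(2 * tick, 10000) / 4;

               /* HZ = 100 -> 512 + 20000/4 = 5512; HZ = 1000 -> 512 + 10000/4 = 3012 */
               printf("HZ = %4d: ifq_maxlen = %d (default IFQ_MAXLEN is 50)\n",
                   hz[i], qlen);
           }
           return (0);
       }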
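
   Returning to the first deficiency (no way to wait for the tx queue to
   become non-full): a minimal sketch of the two userland workarounds,
   busy retry versus a ttcp-style usleep().  This is not code from ttcp
   or netrate; it assumes "fd" is an already-connect()ed UDP socket, and
   that a full tx queue shows up to the sender as ENOBUFS.

       #include <sys/types.h>
       #include <sys/socket.h>
       #include <errno.h>
       #include <unistd.h>

       static void
       blast(int fd, const char *buf, size_t len, int busy_retry)
       {
           for (;;) {
               if (send(fd, buf, len, 0) != -1)
                   continue;
               if (errno != ENOBUFS)           /* full tx queue -> ENOBUFS */
                   break;
               if (busy_retry)
                   continue;   /* burn CPU, but don't miss tx slots */
               usleep(1000);   /* ttcp-style; with HZ = 100 this really
                                * sleeps 10000-20000 usec, so the tx
                                * queue drains and the NIC goes idle */
           }
       }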

>  My tests were done without polling so with very high interrupt load
>  and that also sucks when you have a high-traffic scenario.

Interrupt load isn't necessarily very high, relevant or reduced by
polling.  For transmission, with non-broken hardware and software,
there should be not many more than (pps / <size of hardware tx queue>)
tx interrupts per second, and <size of hardware tx queue> should be
large so that there aren't many txintrs/sec.  For sk, this gives 240000
/ 511 = about 470.  After reprogramming sk's interrupt handling, I get 539.
The standard driver used to get 7000+ with the old interrupt moderation
timeout of 200 usec (actually 137 usec for Yukon, 200 for Genesis),
and now 14000+ with an interrupt moderation timeout of 100 (68.5)
usec.  The interrupt load for 539 txintrs/sec and 240 kpps is 10% on an
AthlonXP2600 (Barton) overclocked.  Very little of this is related to
interrupts, so the term "interrupt load" is misleading.  About 480
packets are handled for every tx interrupt (512 less 32 for watermark
stuff).  Much more than 90% of the handling is useful work and would
have to be done somewhere; it just happens to be done in the interrupt
handler, and that is the best place to do it.  With polling, it would
take longer to do it and the load is poorly reported so it is hard to see.
The system load for 539 txintrs/sec and 240 kpps is much larger.  It
is about 45% (up from 25% in RELENG_4 :-().
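
The "watermark stuff" is a generic technique rather than anything
sk-specific.  A sketch of the idea (the descriptor layout and flag names
here are hypothetical, not sk's real ones): ask the NIC for a
tx-completion interrupt only on every Nth descriptor, so a sustained
blast completes with roughly one interrupt per 480 packets instead of
one per packet.

    #include <stdint.h>

    #define TX_RING_CNT   512
    #define TX_INTR_EVERY (TX_RING_CNT - 32)    /* "512 less 32" */

    struct tx_desc {                    /* hypothetical descriptor */
        uint32_t flags;
        /* ... buffer address, length, ... */
    };
    #define TXF_OWN   0x80000000u       /* NIC owns this descriptor */
    #define TXF_INTR  0x40000000u       /* interrupt when it completes */

    static void
    tx_enqueue(struct tx_desc *ring, int *prod, int *since_intr)
    {
        struct tx_desc *d = &ring[*prod];

        d->flags = TXF_OWN;
        if (++*since_intr >= TX_INTR_EVERY) {
            d->flags |= TXF_INTR;       /* one txintr per ~480 packets */
            *since_intr = 0;
        }
        *prod = (*prod + 1) % TX_RING_CNT;
    }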

[Context almost lost to top posting.]

>>>> On 4/19/2005 1:32 PM, Eivind Hestnes wrote:
>>>>
>>>>> I have an Intel Pro 1000 MT (PWLA8490MT) NIC (em(4) driver 1.7.35)
>>>>> installed in a Pentium III 500 Mhz with 512 MB RAM (100 Mhz) running
>>>>> FreeBSD 5.4-RC3.  The machine is routing traffic between multiple
>>>>> VLANs.  Recently I did a benchmark with/without device polling
>>>>> enabled.  Without device polling I was able to transfer roughly 180
>>>>> Mbit/s.  The router however was suffering when doing this benchmark.
>>>>> Interrupt load was peaking 100% - overall the system itself was
>>>>> quite unusable (_very_ high system load).

I think it is CPU-bound.  My Athlon2600 (overclocked) is many times
faster than your P3/500 (5-10 times?), but it doesn't have much CPU
left over (sending 240000 5-byte udp packets per second from sk takes
60% of the CPU, and sending 53000 1500-byte udp packets per second
takes 30% of the CPU; sending tcp packets takes less CPU but goes
slower).  Apparently 2 or 3 P3/500's worth of CPU is needed just to
keep up with the transmitter (with 100% of the CPU used but no
transmission slots missed).  RELENG_4 has lower overheads so it might
need only 1 or 2 P3/500's worth of CPU to keep up.

>>>>> With device polling enabled the interrupt kept stable around 40-50%
>>>>> and max transfer rate was nearly 70 Mbit/s.  Not very scientific
>>>>> tests, but it gave me a pin point.

I don't believe in device polling.  It's not surprising that it reduces
throughput for a device that has large enough hardware queues.  It just
lets a machine that is too slow to handle 1Gbps ethernet (at least under
FreeBSD) sort of work by not using the hardware to its full potential.
70 Mbit/s is still bad -- it's easy to get more than that with a 100Mbps
NIC.

>>>>> eivind at core-gw:~$ sysctl -a | grep kern.polling
>>>>> ...
>>>>> kern.polling.idle_poll: 0

Setting this should increase throughput when the system is idle by taking
100% of the CPU then.  With just polling every 1 msec (from HZ = 1000),
there are the same problems as with ttcp retrying every 10-20 msec, but
scaled down by a factor of 10-20.  For my ttcp example, the transmitter
runs dry every 2.044 msec so the polling interval must be shorter than
2.044 msec, but this is with a full hardware tx queue (511 entries) on
a not very fast NIC.  If the hardware is just twice as fast or the tx
queue is just half as large or half as full, then the hardware tx queue
will run dry when polled every 1 msec and hardware capability will be
wasted.  This problem can be reduced by increasing HZ some more, but I
don't believe in increasing it beyond 100, since only software that
does too much polling would notice it being larger.
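
To put numbers on this: the transmitter stays busy only if the polling
interval is shorter than the queue's drain time (entries times the
per-packet time).  A small standalone calculation (not kernel code; the
4 usec per packet figure is the sk/bus limit mentioned earlier):

    #include <stdio.h>

    int
    main(void)
    {
        double usec_per_pkt = 4.0;          /* sk/bus limit: ~4 usec per packet */
        int    entries[] = { 511, 255 };    /* full queue; half as large/full */

        for (int i = 0; i < 2; i++) {
            double drain = entries[i] * usec_per_pkt;
            printf("%3d entries drain in %4.0f usec; "
                "the polling interval must stay under that "
                "(HZ = 1000 polls every 1000 usec)\n", entries[i], drain);
        }
        return (0);
    }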

Bruce

