FreeBSD IP forwarding performance (question, and some info) [7-stable, current, em, smp]
andre at freebsd.org
Mon Jul 7 09:02:08 UTC 2008
Ingo Flaschberger wrote:
> Dear Paul,
>> I tried all of this :/ still, 256/512 descriptors seem to work the best.
>> Happy to let you log into the machine and fiddle around if you want :)
> yes, but I'm sure I will also not be able to achieve much more pps.
> As it seems that you are hitting hardware/software-level barriers, my only
> idea is to test DragonFly BSD, which seems to have less software overhead.
I tested DragonFly some time ago with an Agilent N2X tester and it
was by far the slowest of the pack.
> I don't think you will be able to route 64byte packets at 1gbit
> wirespeed (2Mpps) with a current x86 platform.
You have to take the inter-frame gap and other per-frame overheads into
account too. That gives about 1.488Mpps max on a 1GigE interface.
In general the chipsets and buses are able to transfer quite a bit of
data. On a dual-opteron 848 I was able to sink 2.5Mpps into the machine
with "ifconfig em monitor" without hitting the cpu ceiling. This
means that the bus and interrupt handling are not where most of the time
is spent. When I did my profiling, the limiting factor was the cache miss
penalty for accessing the packet headers. At the saturation point about
50% of the
time was spent waiting for the memory to make its way into the CPU.
> I hoped to reach 1Mpps with the hardware I mentioned some mails before,
> but 2Mpps is far far away.
> Currently I get 160kpps via a 32-bit/33MHz PCI bus on a 1.2GHz mobile Pentium.
This is more or less expected. PCI32 is not able to sustain high
packet rates. The bus setup times kill the speed. For larger packets
the ratio gets much better and some reasonable throughput can be achieved.
> Perhaps you have some better luck at some different hardware systems
> (ppc, mips, ...?) or use FreeBSD only for routing-table updates and
> special network-cards (netfpga) for real routing.
NetFPGA doesn't have enough TCAM space to be useful for real routing
(as in Internet sized routing table). The trick many embedded networking
CPUs use is cache prefetching that is integrated with the network
controller. The first 64-128 bytes of every packet are transferred
automatically into the L2 cache by the hardware. This allows relatively
slow CPUs (the 700MHz Broadcom BCM1250 in the Cisco NPE-G1, or the 1.67GHz
Freescale 7448 in the NPE-G2) to exceed 1Mpps. Until something like this is
possible on Intel or AMD x86 CPUs we have a ceiling limited by RAM speed.