dummynet, em driver, device polling issues :-((

Wed Oct 5 05:51:57 PDT 2005

On Oct 5, 2005, at 7:21 AM, Ferdinand Goldmann wrote:
>
>> In one case, we had a system acting as a router. It was a Dell  
>> PowerEdge 2650, with two dual "server" adapters, each on a  
>> separate PCI bus. Three were "lan" links, and one was a "wan" link.  
>> The lan links were receiving about 300mbps each, all going out the  
>> "wan" link at near 900mbps at peak. We were never able to get  
>> above 944mbps, but I never cared enough to figure out where the  
>> bottleneck was there.
>>
>
> 944mbps is a very good value, anyway. What we see in our setup are  
> throughput rates around 300mbps or below. When testing with tcpspray,  
> throughput hardly exceeded 13MB/s.
>
> Are you running vlans on your interface? Our em0-card connects  
> several sites together, which are all sitting on separate vlan  
> interfaces for which the em0 acts as parent interface.
>

Two of the interfaces had vlans, two didn't.

>
>> This was with PCI-X, and a pretty stripped config on the server side.
>>
>
> Maybe this makes a difference, too. We only have a quite old  
> xSeries 330 with PCI and a 1.2GHz CPU.
>

I think that's a really important key. If you're running a "normal"  
32 bit 33MHz PCI bus, the math just doesn't work for high speeds. The  
entire bandwidth of the bus is just a tad over 1gbps. Every forwarded  
packet crosses the bus twice (once coming in, once going out), so even  
at 100% efficiency (you receive a packet, then turn around and resend  
it immediately) you'll only be able to reach 500mbps. When you add in  
the overhead of each PCI transaction, the fact that the CPU can't  
send a packet back out the instant it has been received, and other  
inefficiencies, you will probably only see something in the  
250-300mbps range at MOST, if that.

I believe the xSeries 330 uses 64 bit 33MHz slots though. That gives  
you double the bandwidth to play with. But, I'm still not convinced  
that the CPU isn't the bottleneck there. If you know you're running  
64/33 in the slot you have the card in, I'd be willing to say you  
could do 500mbps or so at peak. A bunch of IPFW rules, the CPU just  
not being able to keep up, other activity on the system, or a complex  
routing table will reduce that.

Just to sum up:

A 64/33MHz bus has the theoretical speed of 2gbps. If you're  
forwarding packets in one interface then out another, you have to cut  
that in half. PCI is half duplex; you can't receive and send at the  
same time. This leaves 1gbps. PCI itself isn't 100% efficient.  
You burn cycles setting up each PCI transaction. When the card  
busmasters to dump the packet into RAM, it frequently will have to  
wait for the memory controller to proceed. The ethernet card itself  
requires some PCI bandwidth to operate - the kernel needs to check  
its registers, the card has to update pointers in RAM for the  
busmaster circular buffer, etc. All those things take time on the PCI  
bus, leaving maybe 750-800mbps left for actual data.

The rest of the system isn't 100% efficient either. The CPU, kernel,  
etc. can't turn a packet around and send it out the instant  
it's received, further lowering your overall bandwidth limit.

I've done a lot of work on custom ethernet interfaces both in FreeBSD  
and in custom embedded OS projects. The safe bet is to assume that  
you can route/forward 250mbps on 32/33 and 500mbps on 64/33 if you  
have enough CPU efficiency to fill the bus.
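
To make the arithmetic above concrete, here is a rough sketch in Python. The function names and the `efficiency` parameter are made up for illustration; the 0.5 figure is only an assumption picked to line up with the 250/500mbps rule of thumb, not a measured value. It also uses the nominal 33MHz clock and assumes one transfer per clock.

```python
# Back-of-the-envelope PCI forwarding budget (illustrative only).

def pci_raw_mbps(width_bits, clock_mhz):
    """Theoretical peak bus bandwidth in Mbit/s: bus width times clock."""
    return width_bits * clock_mhz


def forwarding_ceiling_mbps(width_bits, clock_mhz, efficiency=1.0):
    """Forwarding ceiling in Mbit/s.

    Each forwarded packet crosses the shared, half-duplex bus twice
    (once in, once out), so the raw figure is halved. 'efficiency' is
    a hypothetical fudge factor for PCI transaction overhead, memory
    controller waits, and CPU turnaround time.
    """
    return pci_raw_mbps(width_bits, clock_mhz) * efficiency / 2


if __name__ == "__main__":
    for width, clock in [(32, 33), (64, 33)]:
        raw = pci_raw_mbps(width, clock)
        ideal = forwarding_ceiling_mbps(width, clock)
        # 0.5 is an assumed overall efficiency, not a measurement.
        rough = forwarding_ceiling_mbps(width, clock, efficiency=0.5)
        print(f"{width}/{clock}: raw {raw} Mbit/s, "
              f"ideal forwarding {ideal:.0f} Mbit/s, "
              f"rough estimate {rough:.0f} Mbit/s")
```

If both NICs sit on separate buses the picture changes, since each bus then only carries one direction of the forwarded traffic.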

>
>> Nothing fancy on polling, I think we set HZ to 10000
>>
>
> Ten-thousand? Or is this a typo, and did you mean thousand?
>
> This is weird. :-( Please, is there any good documentation on  
> tuning device polling? The man page does not give any useful  
> pointers about values to use for Gbit cards. I have already read  
> things about people using 2000, 4000HZ ... Gaaah!
>
> I tried with 1000 and 2000 so far, without good results. It seems  
> like everybody makes wild assumptions on what values to use for  
> polling.
>

We arrived at 10000 by experimentation. A large number of interfaces,  
a ton of traffic... I'm not sure of the complete reasons why it helped,  
but it did.

>
>> , turned on idle_poll, and set user_frac to 10 because we had some  
>> cpu hungry tasks that were not a high priority.
>>
>
> I think I read somewhere about problems with idle_poll. How high is  
> your burst_max value? Are you seeing a lot of ierrs?
>

No ierrs at all, and we never touched burst_max.
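
One possible piece of the HZ puzzle, sketched in Python: polling(4) notes that with polling enabled an interface can receive at most HZ * burst_max packets per second unless spare idle-loop cycles are available. The burst_max value of 150 below is an assumption (the documented default in polling(4) of that era, as far as I recall), the function names are made up, and the line-rate figure ignores preamble and inter-frame gap. This is not a claim about why 10000 helped in our case, especially since we had idle_poll on.

```python
# Rough packets-per-second budget under device polling (illustrative).

def polled_pps_ceiling(hz, burst_max):
    """Max packets/s per interface from timer-driven polling alone."""
    return hz * burst_max


def line_rate_pps(mbps, frame_bytes):
    """Approximate packets/s needed to carry 'mbps' at a given frame size.

    Ignores Ethernet preamble and inter-frame gap, so it slightly
    overstates the packet rate a real link would deliver.
    """
    return mbps * 1_000_000 / (frame_bytes * 8)


if __name__ == "__main__":
    BURST_MAX = 150  # assumed default; check your own sysctl value
    for hz in (1000, 2000, 10000):
        print(f"HZ={hz}: ceiling {polled_pps_ceiling(hz, BURST_MAX)} pkt/s")
    for size in (1500, 64):
        print(f"900 Mbit/s at {size}-byte frames needs about "
              f"{line_rate_pps(900, size):.0f} pkt/s")
```

With large frames even HZ=1000 leaves headroom on paper; with small packets the required rate climbs well past the HZ=1000 ceiling, which is where raising HZ or burst_max starts to matter.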

In the end, if you're getting "Receive No Buffers" incrementing, that  
means just what it says. The ethernet chip received a packet  
and was out of room to store it because the CPU hadn't dumped the  
receive buffers from previous packets yet. Either the CPU is too busy  
and can't keep up, the PCI bus is being saturated and the ethernet  
chip can't move packets out of its tiny internal memory fast enough,  
or there is some polling problem that's being hit here.

If I had to bet, what I think is happening is that you've got a  
bottleneck somewhere (ipfw rules, not enough CPU, too much PCI  
activity, or you're in a 32/33 PCI slot) and it can't keep up with  
what you're doing. Turning polling on is exposing different symptoms  
of this than having polling off. Polling may be increasing your  
overall speed enough that instead of having packets getting backed up  
in the kernel and stopping you from going higher than XXmbps, with  
polling you're getting to XXXmbps and seeing a new symptom of a  
bottleneck.

> Forgot to ask - do you have fastforwarding enabled in your sysctl?

No. But we were running either 2.8 or 3.2GHz P4 Xeons, so we had the  
CPU to burn.