FreeBSD IP Forwarding performance (question, and some info) [7-stable, current, em, smp]

Mon Jul 7 09:11:39 UTC 2008

Robert Watson wrote:
> 
> On Mon, 7 Jul 2008, Andre Oppermann wrote:
> 
>> Robert Watson wrote:
>>> Experience suggests that forwarding workloads see significant lock 
>>> contention in the routing and transmit queue code.  The former needs 
>>> some kernel hacking to address in order to improve parallelism for 
>>> routing lookups.  The latter is harder to address given the hardware 
>>> you're using: modern 10gbps cards frequently offer multiple transmit 
>>> queues that can be used independently (which our cxgb driver 
>>> supports), but 1gbps cards generally don't.
>>
>> Actually the routing code is not contended.  The workload in a router is 
>> mostly serialized without much opportunity for contention.  With many 
>> interfaces and any-to-any traffic patterns it may get some 
>> contention.  The locking overhead per packet is always there and has 
>> some impact though.
> 
> Yes, I don't see any real sources of contention until we reach the 
> output code, which will run in the input if_em taskqueue threads, as the 
> input path generates little or no contention if the packets are not 
> destined for local delivery.  I was a little concerned about mention of 

The interface output was the second largest block after the cache misses
IIRC.  The output path seems to have received much less attention and
detailed performance analysis than the interface input path.  Most
network drivers do a write to the hardware for every packet sent, in
addition to whatever other overhead is necessary for their transmit DMA
rings.  That adds significant overhead compared to the RX path, where
those costs are amortized over a larger number of packets.
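
To make the amortization point concrete, here is a minimal user-space
sketch, not taken from if_em or any real driver.  The names tx_ring,
ring_doorbell, tx_per_packet and tx_batched are hypothetical, and the
counter merely stands in for the uncached write to the NIC's tail
register.  It only counts how many such writes each strategy issues.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 256   /* hypothetical descriptor ring size */

struct tx_ring {
    unsigned tail;            /* next descriptor slot to fill */
    unsigned doorbell_writes; /* stand-in for writes to the tail register */
};

/* Stand-in for the expensive hardware write that tells the NIC to transmit. */
static void ring_doorbell(struct tx_ring *r)
{
    r->doorbell_writes++;
}

/* Fill one descriptor slot without notifying the hardware. */
static void ring_enqueue(struct tx_ring *r)
{
    r->tail = (r->tail + 1) % RING_SIZE;
}

/* One hardware write per packet, as described for most drivers' TX path. */
static void tx_per_packet(struct tx_ring *r, unsigned npkts)
{
    for (unsigned i = 0; i < npkts; i++) {
        ring_enqueue(r);
        ring_doorbell(r);
    }
}

/* One hardware write per batch of up to `batch` packets. */
static void tx_batched(struct tx_ring *r, unsigned npkts, unsigned batch)
{
    unsigned pending = 0;

    assert(batch > 0);
    for (unsigned i = 0; i < npkts; i++) {
        ring_enqueue(r);
        if (++pending == batch) {
            ring_doorbell(r);
            pending = 0;
        }
    }
    if (pending > 0)        /* flush a partial final batch */
        ring_doorbell(r);
}
```

With 1000 packets, tx_per_packet issues 1000 doorbell writes, while
tx_batched with a batch size of 32 issues 32.  Whether a real driver can
batch like this depends on how packets are handed to it, which this
sketch does not model.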

> degrading performance as firewall complexity grows -- I suspect there's 
> a nice project for someone to do looking at why this is the case.  I was 
> under the impression that, in 7.x and later, we use rwlocks to protect 
> firewall state, and that unless stateful firewall rules are used, these 
> are locked read-only rather than writable...

Just looking at the packet (twice) in ipfw or any other firewall
package is a huge overhead.  The main loop of ipfw is a very large
block of code.  Unless one implements compilation of firewall rules to
native machine code there is not much that can be done.  With LLVM we
will see some very interesting opportunities in that area.  Other than
that the ipfw instruction overhead per rule seems to be quite close to
the optimum.  I'm not saying one shouldn't take a close look with a
profiler to verify this is actually the case.
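
As a rough illustration of why an interpreted ruleset costs something on
every packet, here is a minimal sketch of a rule interpreter loop.  It
is not ipfw's actual code or data layout: the opcodes, struct insn,
struct rule and ruleset_eval are all hypothetical, and the fallback
action when no rule matches is an assumption made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes; real ipfw has a far larger instruction set. */
enum op { OP_PROTO, OP_DST_PORT, OP_ANY };
enum action { ACT_ACCEPT, ACT_DENY };

struct insn { enum op op; uint16_t arg; };

struct rule {
    struct insn match[2];   /* all instructions must match */
    size_t n_insn;
    enum action action;
};

struct pkt { uint8_t proto; uint16_t dst_port; };

/* Evaluate a single match instruction against a packet. */
static int insn_match(const struct insn *i, const struct pkt *p)
{
    switch (i->op) {
    case OP_PROTO:    return p->proto == i->arg;
    case OP_DST_PORT: return p->dst_port == i->arg;
    case OP_ANY:      return 1;
    }
    return 0;
}

/*
 * Walk the rules in order and return the action of the first rule whose
 * instructions all match.  *insns_run counts the instructions evaluated,
 * i.e. the per-packet interpretation work.
 */
static enum action ruleset_eval(const struct rule *rules, size_t n_rules,
                                const struct pkt *p, unsigned *insns_run)
{
    for (size_t r = 0; r < n_rules; r++) {
        int matched = 1;

        for (size_t j = 0; j < rules[r].n_insn && matched; j++) {
            (*insns_run)++;
            matched = insn_match(&rules[r].match[j], p);
        }
        if (matched)
            return rules[r].action;
    }
    return ACT_DENY;    /* assumed default when nothing matches */
}
```

Every packet pays for the dispatch in insn_match once per instruction
until some rule matches, so the cost grows with the number of rules
walked.  Compiling the ruleset to native code would remove the dispatch
but not the comparisons themselves.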

-- 
Andre