cvs commit: src/sys/dev/bge if_bge.c

Sun Dec 24 01:01:10 PST 2006

On Sun, 24 Dec 2006, Scott Long wrote:

>> I try this experiment every few years, and generally don't measure much 
>> improvement.  I'll try it again with 10gbps early next year once back in 
>> the office again.  The more interesting transition is between the link 
>> layer and the network layer, which is high on my list of topics to look 
>> into in the next few weeks.  In particular, reworking the ifqueue handoff. 
>> The tricky bit is balancing latency, overhead, and concurrency...
>> 
>> FYI, there are several sets of patches floating around to modify if_em to 
>> hand off queues of packets to the link layer, etc.  They probably need 
>> updating, of course, since if_em has changed quite a bit in the last year. 
>> In my implementation, I add a new input routine that accepts mbuf packet 
>> queues.
>
> Have you tested this with more than just your simple netblast and netperf 
> tests?  Have you measured CPU usage during your tests?  With 10Gb coming, 
> pipelined processing of RX packets is becoming an interesting topic for all 
> OSes from a number of companies.  I understand your feeling about the 
> bottleneck being higher up than at just if_input. We'll see how this holds 
> up.

In my previous test runs, I was generally testing three scenarios:

(1) Local sink - sinking small and large packet sizes to a single socket at a
     high rate.

(2) Local source - sourcing small and large packet sizes via a single socket
     at a high rate.

(3) IP forwarding - both unidirectional and bidirectional packet streams
     across an IP forwarding host with small and large packet sizes.

From the perspective of optimizing these particular paths, small packet sizes 
best reveal processing overhead up to about the TCP/socket buffer layer on 
modern hardware (DMA, etc).  The uni/bidirectional axis is interesting because 
it helps reveal the impact of the direct dispatch vs. netisr dispatch choice 
for the IP layer with respect to exercising parallelism.  I didn't explicitly 
measure CPU usage, but as the configurations max out the CPUs in my test bed, 
any significant CPU reduction typically shows up as a measurable improvement 
in throughput.  For example, I was easily able to measure the CPU reduction in 
switching from using the socket reference to the file descriptor reference in 
sosend() on small packet transmit, which was a relatively minor functional 
change in locking and reference counting.

I have tentative plans to explicitly measure cycle counts between context 
switches and during dispatches, but have not yet implemented that in the new 
setup.  I expect to have a chance to set up these new test runs and get back 
into experimenting with the dispatch model between the device driver, link 
layer, and network layer sometime in mid-January.  As the test runs are very 
time-consuming, I'd welcome suggestions on the testing before, rather than 
after, I run them. :-)

Robert N M Watson
Computer Laboratory
University of Cambridge