Routing benchmarks

Sam Leffler sam at freebsd.org
Tue Sep 9 15:33:08 UTC 2008


Jacques Fourie wrote:
> On Tue, Sep 9, 2008 at 5:02 PM, Sam Leffler <sam at freebsd.org> wrote:
>   
>> Jacques Fourie wrote:
>>     
>>> On Tue, Sep 9, 2008 at 3:55 PM, Stanislav Sedov <stas at freebsd.org> wrote:
>>>
>>>> On Tue, 9 Sep 2008 15:33:30 +0200
>>>> "Jacques Fourie" <jacques.fourie at gmail.com> mentioned:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've performed some benchmark tests on my Gumstix Connex 400 (Intel
>>>>> XScale PXA255 CPU clocked at 400MHz) with a netDuo expansion board.
>>>>> This board has two smc network interfaces. I configure the gumstix as
>>>>> a router and measure network throughput with netperf running on
>>>>> separate boxes on either side of the gumstix. My initial tests showed
>>>>> a TCP throughput of 2Mbit/s. After adapting the smc driver to use DMA
>>>>> this figure went up to 7Mbit/s. Although this is a significant
>>>>> improvement, it still seems to be a bit slow. Does anyone have any
>>>>> tips on how to go about figuring out where the bottleneck
>>>>> lies?  Initial profiling showed that a significant amount of time was
>>>>> spent doing memory-to-memory copies of data, but after the DMA change
>>>>> profiling does not show any obvious culprits.
>>>>>
>>>> Have you tried checking the speed of the interfaces themselves,
>>>> without routing involved? Could it be the interfaces that are this slow?
>>>>
>>>> --
>>>> Stanislav Sedov
>>>> ST4096-RIPE
>>>>
>>> Running netserver on the gumstix shows a throughput of 2.4Mbit/s. At
>>> the moment I can't get if_bridge to work - will try to figure out what
>>> is going on. A bridging benchmark may be more informative.
>>>
>> You said you did profiling but you didn't provide the data to inspect.  It's
>> possible kernel profiling has never been tried on your platform; did you
>> sanity check the results?  (e.g. run a known test load and check results;
>> verify all routines that should execute appear in the profile).  Also, if
>> copy overhead shows up as significant, look at why those copies are being
>> done; it's often possible to avoid a copy.
>>
>> My experience in working with architectures like this is that cache handling
>> can be a significant cost that doesn't always show up on a profile.
>>
>> Also, you may find useful information by timestamping mbufs with the h/w
>> clock at important places along the "fast path", then checking whether the
>> overhead of each step is reasonable.  I did this for bridged traffic by
>> forcing the rx dma to go to an mbuf+cluster, then used the free storage in
>> the mbuf header to store timestamps.  At the end of the processing path I
>> sorted the data into buckets by sample point and added a sysctl to dump
>> the histogram to see min/max/avg.
>>
>>   Sam
>>
>
> Thanks for the nice idea - will try something similar. At the moment
> I'm also suspecting that cache handling has a lot to do with the
> performance figures that I'm seeing. The PXA255 has a 32KB data cache
> and a 32KB instruction cache.
>   
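
For concreteness, the timestamping idea quoted above might look roughly
like the sketch below.  It uses an mbuf tag instead of the spare space in
the mbuf header (a tag is simpler to show, but costs an allocation per
packet), and FASTPATH_COOKIE, fastpath_stamp_rx(), fastpath_stamp_tx()
and the debug.fastpath_hist sysctl are all invented names for this
example, not existing code.

/*
 * Rough sketch: per-packet latency measurement along the forwarding path.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/mbuf.h>
#include <sys/sysctl.h>
#include <machine/cpu.h>		/* get_cyclecount() */

#define	FASTPATH_COOKIE		0x46505448	/* arbitrary tag cookie */
#define	FASTPATH_TAGTYPE	1
#define	FASTPATH_NBUCKETS	32

struct fastpath_stamp {
	uint64_t	rx_cycles;	/* cycle counter sampled at receive */
};

/*
 * log2 histogram of rx->tx cycle deltas; updates are not atomic, which
 * is good enough for a rough measurement.
 */
static uint64_t fastpath_hist[FASTPATH_NBUCKETS];

SYSCTL_OPAQUE(_debug, OID_AUTO, fastpath_hist, CTLFLAG_RD | CTLFLAG_MPSAFE,
    fastpath_hist, sizeof(fastpath_hist), "",
    "log2 histogram of rx->tx cycle counts (dump raw, e.g. sysctl -x)");

/* Call as early as possible in the receive path. */
void
fastpath_stamp_rx(struct mbuf *m)
{
	struct m_tag *mt;
	struct fastpath_stamp *fs;

	mt = m_tag_alloc(FASTPATH_COOKIE, FASTPATH_TAGTYPE, sizeof(*fs),
	    M_NOWAIT);
	if (mt == NULL)
		return;			/* measurement is best effort */
	fs = (struct fastpath_stamp *)(mt + 1);
	fs->rx_cycles = get_cyclecount();
	m_tag_prepend(m, mt);
}

/* Call just before the packet is handed to the outbound driver. */
void
fastpath_stamp_tx(struct mbuf *m)
{
	struct m_tag *mt;
	struct fastpath_stamp *fs;
	uint64_t delta;
	int bucket;

	mt = m_tag_locate(m, FASTPATH_COOKIE, FASTPATH_TAGTYPE, NULL);
	if (mt == NULL)
		return;
	fs = (struct fastpath_stamp *)(mt + 1);
	delta = get_cyclecount() - fs->rx_cycles;

	/* Bucket by log2 of the cycle count. */
	for (bucket = 0; delta > 1 && bucket < FASTPATH_NBUCKETS - 1; bucket++)
		delta >>= 1;
	fastpath_hist[bucket]++;

	m_tag_delete(m, mt);
}

Pushing traffic through the box and then reading debug.fastpath_hist
gives a rough picture of the per-packet cost; with more sample points
(one per interesting stage) the same trick shows where the time goes.
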
On the cache question: I was thinking more of cases where you must
flush the d-cache because a memory object is treated r/w (e.g. packet
data).  bus_dmamap_sync ops can do cache flushes, and those flushes may
not be required or may be overly expensive.  Also, sometimes you can
get away with treating objects as read-only and avoid the cache flush
entirely.
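
To make the sync cost concrete, here is a hypothetical forwarding
fragment marking where bus_dmamap_sync() turns into d-cache maintenance
on a write-back cache such as the PXA255's; fwd_one_packet() and its
arguments are invented names for illustration, not the real smc(4)
driver.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>
#include <machine/bus.h>

void
fwd_one_packet(bus_dma_tag_t dmat, bus_dmamap_t rx_map, bus_dmamap_t tx_map,
    struct mbuf *m)
{
	/*
	 * RX completion: invalidate the CPU's stale cache lines so it
	 * sees what the device wrote.  On this class of hardware that
	 * part is hard to avoid.
	 */
	bus_dmamap_sync(dmat, rx_map, BUS_DMASYNC_POSTREAD);

	/* ... route lookup, header rewrite, hand m to the tx side ... */

	/*
	 * TX setup: PREWRITE writes dirty lines back so the device reads
	 * current data.  If the CPU never wrote to the payload (the
	 * packet is forwarded essentially unmodified), much of this
	 * writeback may be avoidable; whether it actually is depends on
	 * the busdma implementation, so measure before relying on it.
	 */
	bus_dmamap_sync(dmat, tx_map, BUS_DMASYNC_PREWRITE);
}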

    Sam


