any place to look at for PCI-express performance issues ?
jilles at stack.nl
Sat Jun 11 23:02:50 UTC 2011
On Sat, Jun 11, 2011 at 02:41:50AM +0200, Luigi Rizzo wrote:
> just for the records: the AMD motherboard works fine and can reach
> 14.88Mpps, i was just doing a couple of mistakes in my AMD tests,
> including the use of a slot with 16x form factor but only 4 lanes
> This said, the i7-870 is about twice as fast as the Athlon II X4-635
> in generating packets for the same clock speed.
> I think the different cache size might have some impact on the
> result given the Athlon has no L3 cache and the test program surely
> overflows the 512k L2 cache (i am using a total of 8k packet buffers,
> touching 64 bytes each for the payload, plus 24 bytes each for
> Unfortunately at these speeds even small things matter a lot!
It may help to use non-temporal stores to fill the packet buffers.
Because this data will never be read again by the CPU, caching it is
useless. Also, non-temporal stores may help avoid reading a cache line
only to overwrite it completely.
With SSE, this could be done with a loop of four MOVUPS and four MOVNTPS
instructions, transferring 64 bytes per iteration, and an SFENCE at the
end (or the corresponding intrinsics from <xmmintrin.h>, _mm_loadu_ps(),
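A rough sketch of such a loop with the <xmmintrin.h> intrinsics might look like the following. The function name fill_payload_nt() and its arguments are hypothetical and not taken from the test program. It assumes the destination is 16-byte aligned (MOVNTPS faults otherwise) and that the length is a multiple of 64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_stream_ps, _mm_sfence */

/*
 * Hypothetical helper: copy len bytes from src to dst using
 * non-temporal stores, 64 bytes per iteration.
 * Assumptions: dst is 16-byte aligned, len is a multiple of 64.
 * src may be unaligned because it is read with MOVUPS.
 */
static void
fill_payload_nt(void *dst, const void *src, size_t len)
{
    float *d = dst;
    const float *s = src;
    size_t i;

    assert(((uintptr_t)dst & 15) == 0);
    assert((len & 63) == 0);

    for (i = 0; i < len; i += 64) {
        /* Four MOVUPS loads: 4 x 16 bytes from the template. */
        __m128 a = _mm_loadu_ps(s);
        __m128 b = _mm_loadu_ps(s + 4);
        __m128 c = _mm_loadu_ps(s + 8);
        __m128 e = _mm_loadu_ps(s + 12);

        /* Four MOVNTPS stores: write one full 64-byte line. */
        _mm_stream_ps(d, a);
        _mm_stream_ps(d + 4, b);
        _mm_stream_ps(d + 8, c);
        _mm_stream_ps(d + 12, e);

        s += 16;    /* 16 floats = 64 bytes */
        d += 16;
    }

    /* Order the non-temporal stores before any later stores. */
    _mm_sfence();
}
```

Whether this is a win in practice would have to be measured. The SFENCE only needs to be issued once per batch, before the ring is handed to the NIC.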
For the receive side, there are also various non-temporal loads and
On the other hand, because generating small packets only writes to 64
bytes of each 2048 byte aligned block, only a small portion of the cache
will be polluted. This is because caches are usually not fully
associative. This small portion could contain other important data,
however. When generating full 1500 byte packets, most of the cache will
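To put numbers on the set-associativity argument, here is a small sketch that counts how many cache sets are touched by buffers laid out at a fixed stride. The function name count_sets_touched() is made up for illustration. The cache geometry suggested below (512 KB, 16-way, 64-byte lines) is an assumption about the Athlon's L2, not something measured, and the model also assumes the buffers are contiguous in whatever address space indexes the cache.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical model: count the distinct cache sets touched when the
 * first 'touched' bytes of each of 'nbufs' buffers are written, with
 * the buffers laid out 'stride' bytes apart starting at address 0.
 * Returns 0 if the scratch allocation fails.
 */
static size_t
count_sets_touched(size_t cache_bytes, size_t ways, size_t line_bytes,
    size_t stride, size_t nbufs, size_t touched)
{
    size_t nsets = cache_bytes / (ways * line_bytes);
    unsigned char *seen = calloc(nsets, 1);
    size_t count = 0;

    if (seen == NULL)
        return 0;

    for (size_t b = 0; b < nbufs; b++) {
        /* First and last cache line covered by this buffer's writes. */
        size_t first = (b * stride) / line_bytes;
        size_t last = (b * stride + touched - 1) / line_bytes;

        for (size_t line = first; line <= last; line++) {
            size_t set = line % nsets;  /* set index of this line */

            if (!seen[set]) {
                seen[set] = 1;
                count++;
            }
        }
    }

    free(seen);
    return count;
}
```

Under those assumptions the cache has 512 sets. Calling count_sets_touched(512 * 1024, 16, 64, 2048, 8192, 64) gives 16 sets, i.e. 1/32 of the cache, while passing 1500 as the last argument gives 384 sets, i.e. three quarters of it.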
Because caching is not useful for the ring buffers, it is probably not a
problem that they are laid out in such a way that they cannot be cached
More information about the freebsd-current