4.7 vs 5.2.1 SMP/UP bridging performance
Gerrit Nagelhout
gnagelhout at sandvine.com
Tue May 4 15:17:41 PDT 2004
>>>>I would like to move to CURRENT for new hardware support, and the
>>>>ability to properly use multi-threading in user-space, but can't do
>>>>this until the performance bottlenecks are solved. I realize that
>>>>5.x is still a work in progress and hasn't been tuned as well as 4.7
>>>>yet, but are there any plans for optimizations in this area? Does
>>>>anyone have any suggestions on what else I can try?
>>>
>>>
>>>Try rwatson's netperf patches:
>>>
>>> http://www.watson.org/~robert/freebsd/netperf/
>>>
>>>There is at least one outstanding panic condition known, but more
>>>testing will be a great help.
>>>
>>>Kris
>>>
>>>P.S. You didn't mention the status of WITNESS, but I'm assuming you
>>>read the docs and disabled it since it's a huge performance killer.
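>>>
>>>Concretely (a sketch; these are the standard option names from the
>>>kernel config files), make sure lines like these are commented out
>>>or removed before building the kernel:
>>>
>>>	options WITNESS
>>>	options WITNESS_SKIPSPIN
>>>	options INVARIANTS
>>>	options INVARIANT_SUPPORT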
>
>
>>WITNESS and INVARIANTS are turned off for the 5.2.1 release bits.
>>However, the debug.mpsafenet sysctl is also turned off. Turning this
>>on might give a significant performance boost for bridging.
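>>
>>(As far as I recall it is a read-only tunable rather than a runtime
>>sysctl, so it has to be set from the loader, e.g. in
>>/boot/loader.conf:
>>
>> debug.mpsafenet="1"
>>
>>and takes effect on the next boot.)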
>
>
>>Scott
>
>
> Thanks for all the responses so far. WITNESS is definitely disabled,
> as are the other INVARIANTS. I had a look through the netperf patches,
> but I don't think they will affect bridging very much. They seem to
> be directed more towards the socket layer and above.
>
> I still think that one of the bigger bottlenecks is the cost of all
> the mutexes in SMP mode, and some of the new bus_dma and mbuf code that
> was introduced.
>
> With previous platforms I have worked on (vxWorks), we had similar
> issues, and ended up pushing buckets of packets through the data path,
> so each mutex was only taken once for every 10-100 packets.
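>
> To make this concrete, here is roughly the shape of it as an untested
> user-space sketch (pthread mutexes standing in for kernel mutexes;
> all the names here are made up for illustration):
>
> #include <pthread.h>
> #include <stddef.h>
>
> #define BATCH 32 /* packets drained per lock acquisition */
>
> struct pkt { struct pkt *next; };
>
> struct pktq {
> 	pthread_mutex_t mtx;
> 	struct pkt *head;
> };
>
> /* Take the queue lock once, detach up to BATCH packets, and then
>  * process them with the lock released, so the mutex cost is
>  * amortized across the whole bucket instead of paid per packet. */
> static void process_bucket(struct pktq *q)
> {
> 	struct pkt *list, **tail;
> 	int n;
>
> 	pthread_mutex_lock(&q->mtx); /* one lock for the whole batch */
> 	list = q->head;
> 	tail = &q->head;
> 	for (n = 0; *tail != NULL && n < BATCH; n++)
> 		tail = &(*tail)->next;
> 	q->head = *tail; /* remainder stays queued */
> 	*tail = NULL;    /* terminate the detached bucket */
> 	pthread_mutex_unlock(&q->mtx);
>
> 	while (list != NULL) { /* no lock held while forwarding */
> 		struct pkt *p = list;
> 		list = p->next;
> 		/* bridge/forward p here */
> 	}
> }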
>
> Also, polling is currently done by only one CPU at a time. If this
> were changed so that multiple threads poll multiple devices at the
> same time, performance should improve considerably.
>
> Thanks,
>
> Gerrit
>You are correct about the netperf patches being directed towards the
>socket layer. The IP stack and below was locked for 5.2, but the
>benefits won't be seen unless you turn on debug.mpsafenet. During
>the 5.2 development cycle I believe that benchmarking was done that
>showed that mpsafenet bridging was significantly faster than non-
>mpsafenet, and nearly as fast as 4.x if not a little faster.
>I'd be interested to know more about your comments about polling from
>multiple CPUs. Did you have a thread bound to each CPU, and did
>each thread poll every interface, or only an exclusive subset of the
>interfaces?
>Scott
>I tried enabling debug.mpsafenet, but it didn't make any difference.
>Which parts of the bridging path do you think should be faster with
>that enabled?
>I haven't actually tried implementing polling from multiple CPUs, but
>suggested it because I think it would help performance for certain
>applications (such as bridging). What I would probably do
>(without having given this a great deal of thought) is to:
>1) Have a variable controlling how many threads to use for polling
>2) Either lock an interface to a thread, or have interfaces switch
> between threads depending on their load dynamically.
>One obvious problem with this approach will be mutex contention
>between threads. Even though the source interface would be owned
>by a thread, the destination would likely be owned by a different
>thread. I'm assuming that with the current mutex setup, only one
>thread can receive from or transmit to an interface at a time.
>Before this becomes feasible, though, the cost of the mutexes should
>be addressed first (assuming that is the current bottleneck for SMP).
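>
>As a very rough user-space sketch of what I mean (pthreads standing in
>for kernel threads, a static interface assignment for simplicity, and
>poll_iface() just a made-up stub):
>
>#include <pthread.h>
>
>#define NPOLLERS 2 /* the tunable from 1) above */
>#define NIFACES  4
>
>/* Stub for polling one interface's receive ring. */
>static void poll_iface(int ifidx) { (void)ifidx; }
>
>/* Each thread polls an exclusive subset of the interfaces,
> * i.e. the "lock an interface to a thread" option from 2). */
>static void *poller(void *arg)
>{
>	int id = (int)(long)arg;
>	int i;
>
>	for (;;) {
>		for (i = id; i < NIFACES; i += NPOLLERS)
>			poll_iface(i);
>		/* a real version would back off when idle */
>	}
>	return NULL;
>}
>
>int main(void)
>{
>	pthread_t tid[NPOLLERS];
>	long i;
>
>	for (i = 0; i < NPOLLERS; i++)
>		pthread_create(&tid[i], NULL, poller, (void *)i);
>	for (i = 0; i < NPOLLERS; i++)
>		pthread_join(tid[i], NULL);
>	return 0;
>}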
>Gerrit
I ran the following fragment of code to determine the cost of a LOCK &
UNLOCK on both UP and SMP:
#define EM_LOCK(_sc)	mtx_lock(&(_sc)->mtx)
#define EM_UNLOCK(_sc)	mtx_unlock(&(_sc)->mtx)

uint64_t startTime, endTime, delta;
int i;

startTime = rdtsc();			/* read the CPU cycle counter */
for (i = 0; i < 100; i++) {
	EM_LOCK(adapter);		/* uncontested acquire */
	EM_UNLOCK(adapter);		/* immediate release */
}
endTime = rdtsc();
delta = endTime - startTime;		/* cycles for 100 lock/unlock pairs */
printf("delta %llu start %llu end %llu\n",
    (unsigned long long)delta, (unsigned long long)startTime,
    (unsigned long long)endTime);
On a single hyperthreaded 2.8 GHz Xeon, a LOCK & UNLOCK pair took ~30
cycles under UP and ~300 cycles under SMP (total delta divided by 100).
Assuming 10 locks for every packet (which is conservative), at 500 kpps
this accounts for:

300 * 10 * 500,000 = 1.5 billion cycles/sec (out of 2.8 billion available)

Put another way, at 500 kpps each packet has a budget of only
2,800,000,000 / 500,000 = 5600 cycles, and 3000 of those would go to
locking alone.
Any comments?
Thanks,
Gerrit