Scaling and performance issues with FreeBSD 9 (& 10) on 4 socket systems

Fri Jun 14 11:02:25 UTC 2013

On 06/14/13 04:05, David Xu wrote:
> On 2013/06/13 20:01, Remy Nonnenmacher wrote:
>>
>> On 06/13/13 13:32, Mark Felder wrote:
>>> On Wed, 12 Jun 2013 17:58:49 -0500, David O'Brien <obrien at freebsd.org>
>>> wrote:
>>>
>>>> We found FreeBSD 8.4 to perform better than FreeBSD 9.1, and Linux
>>>> considerably better than both on the same machine.
>>>
>>> http://svnweb.freebsd.org/base?view=revision&revision=241246
>>>
>>> The above link is likely why 8.4 is better than 9.1 on the same machine.
>>>
>>>> We've tried various things and haven't been able to explain why FreeBSD
>>>> isn't scaling on the new hardware.  Nor why it performs so much worse
>>>> than FreeBSD on the older "M2" machines.
>>>
>>> The CPUs between those machines are quite different. I'm sure we're
>>> looking at different cache sizes, different behavior for the
>>> hyperthreading, etc. I'm sure others would be greatly interested in you
>>> providing the same benchmark results for a recent snapshot of HEAD as
>>> well.
>>
>> We had the same problem on 4x12-core (AMD) machines. After investigating
>> with hwpmc, it appeared that performance was being killed by a scheduler
>> function that looks for the "least used CPU" and unfortunately works on
>> contended structures (i.e. lots of cores fighting to get work). A
>> solution was found by using an artificially long queue of stuck processes
>> (steal_thresh bumped to over 8) and by crafting CPU affinity.
>>
>> This was a year ago and is from memory. You may want to give it a try to
>> see if it helps.
>>
>> Disregard if a scheduler specialist contradicts this.
>>
>> Thanks.
>>
>
> AMD's cache is very different from Intel's. AFAIK, before Bulldozer,
> AMD's L3 was an exclusive cache; as of Bulldozer, AMD describes the L3
> cache as a “non-inclusive victim cache”. It is still different from
> Intel's, which is inclusive.
>
> "- In sched_pickcpu() change general logic of CPU selection. First
> look for idle CPU, sharing last level cache with previously used one,
> skipping SMT CPU groups. If none found, search all CPUs for the least
> loaded one, where the thread with its priority can run now. If none
> found, search just for the least loaded CPU."
>
> With an exclusive cache, the L3 holds second-hand data, not hot data, so
> migrating a thread has a negative effect: its hot data is lost.
> I'd prefer to search for an idle CPU sharing L2 first, then L3.
>
>

The problem was not really the excellent job done on cache locality via
CPU detection. It was more a scaling problem with the number of cores,
which exacerbates contention when trying to steal work from other
queues. Basically, what happened (I say happened because I've not
retested recently) is that you may have 1 core running and 47 others
fighting in a loop where there is one winner and 46 losers, all of them
playing with locks and O(N=48) loops. All in all, you see degraded
performance with little indication of a cause. This is where hwpmc is a
wonderful tool...

Bumping steal_thresh up changes the pattern. If it works for you,
then the cause is probably the same.
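
To make the pattern above a bit more concrete, here is a toy model in
Python. It is only a sketch and is not the ULE scheduler code: the
function name steal_round, the single-round "snapshot" behaviour and the
re-check under the lock are all assumptions made up for illustration.
Each idle CPU scans every other queue, picks the busiest one whose load
reaches the threshold, and then tries to lock it.

```python
def steal_round(loads, steal_thresh):
    """One round of a toy work-stealing model (NOT the real ULE code).

    loads[i] is the number of runnable threads queued on CPU i.
    Every idle CPU scans all other queues from the same snapshot, then
    tries to lock the busiest queue whose load is >= steal_thresh.
    Returns (queue_reads, lock_attempts, steals).
    """
    snapshot = list(loads)   # what every idle CPU sees when it scans
    remaining = list(loads)  # actual loads, updated as steals succeed
    queue_reads = lock_attempts = steals = 0

    for cpu, load in enumerate(snapshot):
        if load != 0:
            continue  # only idle CPUs try to steal

        # O(N) scan over all other queues, looking for the busiest victim.
        victim = None
        for other, other_load in enumerate(snapshot):
            if other == cpu:
                continue
            queue_reads += 1
            if other_load >= steal_thresh and (
                victim is None or other_load > snapshot[victim]
            ):
                victim = other

        if victim is None:
            continue  # nothing worth stealing, so no lock is touched

        # All idle CPUs picked the same victim from the same snapshot,
        # so they all contend for the same lock.
        lock_attempts += 1
        if remaining[victim] >= steal_thresh:  # re-check under the lock
            remaining[victim] -= 1
            steals += 1

    return queue_reads, lock_attempts, steals


if __name__ == "__main__":
    loads = [2] + [0] * 47  # 1 busy core, 47 idle ones
    print(steal_round(loads, steal_thresh=2))
    print(steal_round(loads, steal_thresh=8))
```

In this model, a low threshold has all 47 idle CPUs going for the same
lock with a single winner, while a threshold above the victim's load
makes them give up after the scan without touching any lock. The scan
itself is still O(N) per idle CPU either way.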