[RFT][patch] Scheduling for HTT and not only

David Xu listlog2011 at gmail.com
Mon Feb 6 07:59:49 UTC 2012


On 2012/2/6 15:44, Alexander Motin wrote:
> On 06.02.2012 09:40, David Xu wrote:
>> On 2012/2/6 15:04, Alexander Motin wrote:
>>> Hi.
>>>
>>> I've analyzed the scheduler behavior and think I've found a problem with
>>> HTT. SCHED_ULE knows about HTT and, when doing its load balancing once a
>>> second, it does the right things. Unfortunately, if some other thread gets
>>> in the way, a process can easily be pushed out to another CPU, where it
>>> will stay for up to another second because of CPU affinity, possibly
>>> sharing a physical core with something else without need.
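
To spell out that stickiness for readers here: the last CPU keeps winning
while the thread's affinity window has not expired, even if another physical
core sits idle. A toy model of just that rule, with made-up names and
constants rather than the real sched_ule.c code:

/* Toy model of the affinity window; names and numbers are invented. */
#include <stdio.h>

#define AFFINITY_WINDOW 1000    /* invented value standing in for the ~1s window */

struct toy_thread {
    int last_cpu;       /* CPU the thread last ran on */
    int last_run_tick;  /* when it last ran there */
};

/*
 * While the window has not expired the last CPU wins, even if it now shares
 * a physical core with another busy thread -- that is the stickiness the
 * patch relaxes when HTT is present.
 */
static int
toy_pickcpu(const struct toy_thread *td, int now, int least_loaded_cpu)
{
    if (now - td->last_run_tick < AFFINITY_WINDOW)
        return (td->last_cpu);
    return (least_loaded_cpu);
}

int
main(void)
{
    struct toy_thread td = { .last_cpu = 3, .last_run_tick = 100 };

    printf("tick 600: CPU %d\n", toy_pickcpu(&td, 600, 0));   /* sticky: 3 */
    printf("tick 1200: CPU %d\n", toy_pickcpu(&td, 1200, 0)); /* moves: 0 */
    return (0);
}
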
>>>
>>> I've made a patch, reworking the SCHED_ULE affinity code, to fix that:
>>> http://people.freebsd.org/~mav/sched.htt.patch
>>>
>>> This patch does three things:
>>> - Disables the strict affinity optimization when HTT is detected, to let
>>> the more sophisticated code take into account the load of the other
>>> logical core(s).
>> Yes, the HTT level should be skipped first, looking up to the upper layer
>> to find a more idle physical core. At least, if the system is a dual-core,
>> 4-thread CPU and there are two busy threads, they should run on different
>> physical cores.
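
To make that concrete, here is roughly what I mean as a standalone sketch
(invented names and a hard-coded topology, not the kernel's cpu_group walk):
with HTT, compare load at the physical-core level first by summing the
logical siblings, and only then pick a logical CPU inside the chosen core.

/* Toy HTT-aware CPU selection; names and the topology are invented. */
#include <stdio.h>

#define NCORES   2
#define NTHREADS 2              /* logical CPUs per physical core */

static int load[NCORES][NTHREADS] = {
    { 1, 0 },                   /* core 0: one busy thread */
    { 0, 0 },                   /* core 1: idle */
};

static int
toy_pick_logical_cpu(void)
{
    int best_core = 0, best_sum = -1, best_thr = 0;

    /* Pick the physical core with the lowest total load of its siblings. */
    for (int c = 0; c < NCORES; c++) {
        int sum = 0;

        for (int t = 0; t < NTHREADS; t++)
            sum += load[c][t];
        if (best_sum < 0 || sum < best_sum) {
            best_sum = sum;
            best_core = c;
        }
    }
    /* Then take the idlest logical CPU inside that core. */
    for (int t = 1; t < NTHREADS; t++)
        if (load[best_core][t] < load[best_core][best_thr])
            best_thr = t;
    return (best_core * NTHREADS + best_thr);
}

int
main(void)
{
    /*
     * With one busy thread on core 0, a second busy thread should land on
     * core 1 rather than on core 0's free hardware thread.
     */
    printf("second thread goes to logical CPU %d\n", toy_pick_logical_cpu());
    return (0);
}
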
>>
>>> - Adds affinity support to the sched_lowest() function so that it prefers
>>> the specified (last used) CPU (and the CPU groups it belongs to) in case
>>> of equal load. The previous code always selected the first valid CPU among
>>> equals, which caused threads to migrate to lower-numbered CPUs without
>>> need.
>>
>> Even some level of imbalance can be tolerated until it exceeds a
>> threshold; that at least does not trash another CPU's cache, while pushing
>> a new thread to another CPU does trash its cache. The CPUs and groups
>> could be arranged in a circular list, so that the search for the
>> lowest-loaded CPU always starts from the right neighbor and runs to the
>> tail, then wraps around from the head to the left neighbor.
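
The tie-breaking half could look roughly like this toy version (made-up
names, not the actual sched_lowest()): on equal load the CPU the thread last
ran on wins, so ties no longer drag threads toward the lowest-numbered CPU.

/* Toy lowest-load search that prefers the previously used CPU on ties. */
#include <stdio.h>

#define NCPUS 4

static int
toy_lowest(const int load[NCPUS], int prefer)
{
    int best = prefer;              /* start from the last used CPU */

    for (int c = 0; c < NCPUS; c++)
        if (load[c] < load[best])   /* strictly lower only */
            best = c;
    return (best);
}

int
main(void)
{
    int load[NCPUS] = { 0, 1, 0, 0 };

    /*
     * CPUs 0, 2 and 3 are equally idle; a thread that last ran on CPU 2
     * stays there instead of being migrated down to CPU 0.
     */
    printf("chosen CPU: %d\n", toy_lowest(load, 2));
    return (0);
}
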
>>
>>> - If the current CPU group has no CPU where the process, at its priority,
>>> can run now, sequentially check the parent CPU groups before doing a
>>> global search. That should improve affinity for the next cache levels.
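
If I read the widening search right, it has roughly this shape (invented
structures, not the real cpu_group code): try the smallest group containing
the last CPU first, then its parent, and only fall back to a global search at
the root.

/* Toy widening search over a topology tree; structures are invented. */
#include <stdio.h>

struct toy_group {
    const char *name;
    struct toy_group *parent;
};

/*
 * Hypothetical predicate: does this group contain a CPU where the thread,
 * at its priority, could run right now?  In this demo only the root does.
 */
static int
toy_group_has_runnable_cpu(const struct toy_group *g)
{
    return (g->parent == NULL);
}

static const struct toy_group *
toy_widening_search(const struct toy_group *g)
{
    for (; g != NULL; g = g->parent)
        if (toy_group_has_runnable_cpu(g))
            return (g);
    return (NULL);
}

int
main(void)
{
    struct toy_group machine = { "whole machine", NULL };
    struct toy_group l2      = { "L2 group", &machine };
    struct toy_group core    = { "core (HTT pair)", &l2 };

    printf("search satisfied at: %s\n", toy_widening_search(&core)->name);
    return (0);
}
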
>>>
>>> I've run several different benchmarks to test it, and so far the results
>>> look promising:
>>> - On an Atom D525 (2 physical cores + HTT) I've tested HTTP receive with
>>> fetch and FTP transmit with ftpd. On receive I've got 103MB/s on the
>>> interface; on transmit somewhat less -- about 85MB/s. In both cases the
>>> scheduler kept the interrupt thread and the application on different
>>> physical cores. Without the patch, receive speed fluctuates between
>>> 80-103MB/s and transmit stays at about 85MB/s.
>>> - On the same Atom I've tested TCP speed with iperf and got mostly the
>>> same results:
>>> - receive to Atom with patch -- 755-765Mbit/s, without patch --
>>> 531-765Mbit/s;
>>> - transmit from Atom -- 679Mbit/s in both cases.
>>> I think the fluctuating receive behavior in both tests can be explained by
>>> some heavy callout handled by the swi4:clock process, running on receive
>>> (seen in top and schedgraph) but not on transmit. Maybe it is specific to
>>> the Realtek NIC driver.
>>>
>>> - On the same Atom I've tested 512-byte reads from an SSD with dd in 1
>>> and 32 streams. I found no regressions, but no benefits either, since with
>>> one stream there is no congestion and with multiple streams all cores are
>>> congested.
>>>
>>> - On a Core i7-2600K (4 physical cores + HTT) I've run more than 20
>>> `make buildworld`s with different -j values (1, 2, 4, 6, 8, 12, 16) for
>>> both the original and the patched kernel. I've found no performance
>>> regressions, while for -j4 I've got a 10% improvement:
>>> # ministat -w 65 res4A res4B
>>> x res4A
>>> + res4B
>>> [ministat distribution graph]
>>>     N           Min           Max        Median           Avg        Stddev
>>> x   3       1554.86       1617.43       1571.62     1581.3033     32.389449
>>> +   3       1420.69        1423.1       1421.36     1421.7167     1.2439587
>>> Difference at 95.0% confidence
>>> -159.587 ± 51.9496
>>> -10.0921% ± 3.28524%
>>> (Student's t, pooled s = 22.9197)
>>> , and for -j6 a 3.6% improvement:
>>> # ministat -w 65 res6A res6B
>>> x res6A
>>> + res6B
>>> [ministat distribution graph]
>>>     N           Min           Max        Median           Avg        Stddev
>>> x   3       1381.17       1402.94        1400.3     1394.8033     11.880372
>>> +   3        1340.4       1349.34       1341.23     1343.6567     4.9393758
>>> Difference at 95.0% confidence
>>> -51.1467 ± 20.6211
>>> -3.66694% ± 1.47842%
>>> (Student's t, pooled s = 9.09782)
>>>
>>> Who wants to do independent testing to verify my results or do some
>>> more interesting benchmarks? :)
>>>
>>> PS: Sponsored by iXsystems, Inc.
>>>
>> The benchmarks are incomplete; a complete benchmark should at least
>> include CPU-intensive applications. Testing with real-world databases,
>> web servers and other important applications is needed.
>
> I plan to do this, but you may help. ;)
>
Thanks, I need to find time. I have cc'ed hackers@; my first mail seems to
have missed it. I think designing an SMP scheduler is dirty work: lots of
testing and refining, and still you may end up with an imperfect result. ;-)
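
For the CPU-bound side, even something as simple as the sketch below would
already show whether two busy threads end up on separate physical cores --
run it and watch the placement with top -P or schedgraph. It is only a
hypothetical helper I am sketching here, not part of any existing test suite.

/* spin.c -- start N CPU-bound threads and report how much work each did.
 * Build with: cc -O2 -o spin spin.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define RUN_SECONDS 10
#define MAX_THREADS 64

static volatile int stop;
static unsigned long counts[MAX_THREADS];

static void *
spin(void *arg)
{
    unsigned long *count = arg;

    while (!stop)
        (*count)++;
    return (NULL);
}

int
main(int argc, char **argv)
{
    int nthreads = (argc > 1) ? atoi(argv[1]) : 2;
    pthread_t tid[MAX_THREADS];

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return (1);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, spin, &counts[i]);
    sleep(RUN_SECONDS);
    stop = 1;
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        printf("thread %d: %lu iterations\n", i, counts[i]);
    }
    return (0);
}
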

Regards,
David Xu


