Heavy I/O blocks FreeBSD box for several seconds
Andriy Gapon
avg at FreeBSD.org
Thu Jul 7 19:42:48 UTC 2011
on 07/07/2011 18:14 Steve Kargl said the following:
> On Thu, Jul 07, 2011 at 10:27:53AM +0300, Andriy Gapon wrote:
>> on 06/07/2011 21:11 Nathan Whitehorn said the following:
>>> On 07/06/11 13:00, Steve Kargl wrote:
>>>> AFAICT, it is a cpu affinity issue. If I launch n+1 MPI images
>>>> on a system with n cpus/cores, then 2 (and sometimes 3) images
>>>> are stuck on a cpu and those 2 (or 3) images ping-pong on that
>>>> cpu. I recall trying to use renice(8) to force some load
>>>> balancing, but vaguely remember that it did not help.
>>>
>>> I've seen exactly this problem with multi-threaded math libraries, as well.
>>
>> Exactly the same? Let's see.
>>
>>> Using parallel GotoBLAS on FreeBSD gives terrible performance because the
>>> threads keep migrating between CPUs, causing frequent cache misses.
[*]-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> So Steve reports that if he has Nthr > Ncpu, then some threads are "over-glued"
>> to a particular CPU, which results in sub-optimal scheduling for those threads.
>> I have to guess that Steve would want to see the threads being shuffled between
>> CPUs to produce more even CPU load.
>
> I'm using OpenMPI. These are N > Ncpu processes not threads,
I used 'thread' in the sense of a kernel thread. It shouldn't actually matter if
it's a process or a thread in userland in this context.
> and without
> loss of generality let N = Ncpu + 1. It is a classic master-slave
> situation where 1 process initializes all others. The N - 1 slave processes
> are then independent of each other. After 20 minutes or so of number
> crunching, each slave sends a few 10s of KB of data to the master. The
> master collects all the data, writes it to disk, and then sends the
> slaves the next set of computations to do. The computations are nearly
> identical, so each slave finishes its task in the same amount of time. The
> problem appears to be that 2 slaves are bound to the same cpu and the
> remaining N - 3 slaves are bound to a specific cpu. The N - 3 slaves
> finish their task, send data to the master, and then spin (chewing up
> nearly 100% cpu) waiting for the 2 ping-ponging slaves to finish.
> This causes a stall in the computation. When a complete computation
> takes days to complete, these stalls become problematic. So, yes, I
> want the processes to get a more uniform access to cpus via migration
> to other cpus. This is what 4BSD appears to do.
I would imagine that periodic rebalancing would take care of this, but probably
the ULE rebalancing algorithm is not perfect.
There was a suggestion on performance@ to try to use a lower value for
kern.sched.steal_thresh, a value of 1 was recommended:
http://article.gmane.org/gmane.os.freebsd.performance/3459
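
For a rough sense of what the stall described above costs per round, here is a
back-of-the-envelope sketch in Python. It is a toy model only, not the actual
ULE or 4BSD algorithm: it assumes every process (including the spinning master)
is always runnable, that processes sharing a CPU split it equally, and that
"ideal" migration gives every process an equal share of all CPUs. The function
names and the 8-CPU machine are made up for illustration; the 20 minutes per
task comes from Steve's description.

```python
from fractions import Fraction


def pinned_round_time(procs_per_cpu, work):
    """Round time when processes never migrate.

    procs_per_cpu[i] is the number of busy processes stuck on CPU i.
    Processes sharing a CPU each get an equal slice of it, so the round
    ends when the most crowded CPU has finished all of its tasks.
    """
    return max(procs_per_cpu) * work


def fair_round_time(procs_per_cpu, work):
    """Round time if the scheduler spread all processes evenly over all CPUs.

    Each of the N busy processes gets Ncpu/N of a CPU, so a task needing
    `work` units of CPU time takes work * N / Ncpu of wall-clock time.
    """
    n_procs = sum(procs_per_cpu)
    n_cpus = len(procs_per_cpu)
    return Fraction(n_procs * work, n_cpus)


if __name__ == "__main__":
    # Hypothetical 8-CPU box with N = Ncpu + 1 = 9 busy processes:
    # seven CPUs hold one process each, the last CPU holds two slaves.
    layout = [1] * 7 + [2]
    work_minutes = 20  # "20 minutes or so" of number crunching per task

    print("pinned:", pinned_round_time(layout, work_minutes), "min per round")
    print("fair:  ", float(fair_round_time(layout, work_minutes)), "min per round")
```

Under these assumptions a round takes 40 minutes when two slaves are stuck on
one CPU, versus 22.5 minutes with perfectly even sharing, so the gap grows with
every round of a multi-day run.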
>> On the other hand, you report that your threads keep being shuffled between CPUs
>> (I presume for Nthr == Ncpu case, where Nthr is a count of the number-crunching
>> threads). And I guess that you want them to stay glued to particular CPUs.
>>
>> So how is this the same problem? In fact, it sounds like somewhat the opposite.
>> The only thing in common is that you both don't like how ULE works.
>
> Well, it may be similar in that N - 2 threads are bound to N - 2
> cpus, and the remaining 2 threads are ping ponging on the last
It could be, but Nathan has never said this [*] and I also have never seen this
in my very limited experiments with GotoBLAS.
> remaining cpu. I suspect that GotoBLAS has a large amount of
> communication between threads, and once again the computation
> stalls waiting for the 2 threads to either finish battling for the
> 1 cpu or perhaps the process uses pthread_yield() in some clever
> way to try to get load balancing.
>
--
Andriy Gapon
More information about the freebsd-current
mailing list