Heavy I/O blocks FreeBSD box for several seconds

Thu Jul 7 19:42:48 UTC 2011

on 07/07/2011 18:14 Steve Kargl said the following:
> On Thu, Jul 07, 2011 at 10:27:53AM +0300, Andriy Gapon wrote:
>> on 06/07/2011 21:11 Nathan Whitehorn said the following:
>>> On 07/06/11 13:00, Steve Kargl wrote:
>>>> AFAICT, it is a cpu affinity issue.  If I launch n+1 MPI images
>>>> on a system with n cpus/cores, then 2 (and sometimes 3) images
>>>> are stuck on a cpu and those 2 (or 3) images ping-pong on that
>>>> cpu.  I recall trying to use renice(8) to force some load
>>>> balancing, but vaguely remember that it did not help.
>>>
>>> I've seen exactly this problem with multi-threaded math libraries, as well.
>>
>> Exactly the same?  Let's see.
>>
>>> Using parallel GotoBLAS on FreeBSD gives terrible performance because the
>>> threads keep migrating between CPUs, causing frequent cache misses.
[*]-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>> So Steve reports that if he has Nthr > Ncpu, then some threads are "over-glued"
>> to a particular CPU, which results in sub-optimal scheduling for those threads.
>>  I have to guess that Steve would want to see the threads being shuffled between
>> CPUs to produce more even CPU load.
> 
> I'm using OpenMPI.  These are N > Ncpu processes not threads,

I used 'thread' in the sense of a kernel thread.  It shouldn't actually matter
whether it's a process or a thread in userland in this context.

> and without
> loss of generality let N = Ncpu + 1.  It is a classic master-slave
> situation where 1 process initializes all others.  The N - 1 slave processes
> are then independent of each other.  After 20 minutes or so of number
> crunching, each slave sends a few 10s of KB of data to the master.  The
> master collects all the data, writes it to disk, and then sends the
> slaves the next set of computations to do.  The computations are nearly 
> identical, so each slave finishes its task in the same amount of time. The
> problem appears to be that 2 slaves are bound to the same cpu and the
> remaining N - 3 slaves are each bound to their own cpu.  The N - 3 slaves
> finish their tasks, send data to the master, and then spin (chewing up
> nearly 100% cpu) waiting for the 2 ping-ponging slaves to finish.
> This causes a stall in the computation.  When a complete computation
> takes days to complete, these stalls become problematic.  So, yes, I
> want the processes to get a more uniform access to cpus via migration
> to other cpus.  This is what 4BSD appears to do.
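
To put rough numbers on the stall described above, here is a minimal sketch
in Python.  It is not from the original thread and makes several simplifying
assumptions: the scheduler time-slices perfectly fairly, no process migrates
during a round, and the master is treated as blocked (using no CPU) while the
slaves compute.  The function names and the 4-CPU example are hypothetical.

```python
def round_time(task_minutes, cpu_load):
    """Wall-clock length of one round; the master waits for the slowest slave.

    cpu_load[i] is the number of equally sized slave tasks sharing CPU i.
    Assumes fair time slicing and no migration within the round.
    """
    return task_minutes * max(cpu_load)


def spin_minutes(task_minutes, cpu_load):
    """CPU-minutes spent spinning by slaves that finished before the round ended.

    A CPU with n > 0 slaves does useful work for n * task_minutes and then
    spins until the round ends; a CPU with no slave is idle, not spinning.
    """
    total = round_time(task_minutes, cpu_load)
    return sum(total - task_minutes * n for n in cpu_load if n > 0)


if __name__ == "__main__":
    task = 20  # minutes of number crunching per slave, as in the report

    # Hypothetical 4-CPU box, so N = 5: one master plus 4 slaves.
    balanced = [1, 1, 1, 1]  # every slave has its own CPU
    stuck = [2, 1, 1, 0]     # 2 slaves ping-pong on one CPU, one CPU unused

    for name, load in (("balanced", balanced), ("stuck", stuck)):
        print(name, round_time(task, load), spin_minutes(task, load))
```

Under these assumptions the round length doubles when two slaves share a CPU,
and the N - 3 slaves that finished on time burn CPU for the whole extra task
period.  With migration the extra work could in principle be spread over the
unused CPU, which is the behaviour Steve is asking for.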

I would imagine that periodic rebalancing would take care of this, but probably
the ULE rebalancing algorithm is not perfect.
There was a suggestion on performance@ to try to use a lower value for
kern.sched.steal_thresh, a value of 1 was recommended:
http://article.gmane.org/gmane.os.freebsd.performance/3459

>> On the other hand, you report that your threads keep being shuffled between CPUs
>> (I presume for Nthr == Ncpu case, where Nthr is a count of the number-crunching
>> threads).  And I guess that you want them to stay glued to particular CPUs.
>>
>> So how is this the same problem?  In fact, it sounds like somewhat the opposite.
>> The only thing in common is that you both don't like how ULE works.
> 
> Well, it may be similar in that N - 2 threads are bound to N - 2
> cpus, and the remaining 2 threads are ping ponging on the last 

It could be, but Nathan has never said this [*] and I also have never seen this
in my very limited experiments with GotoBLAS.

> remaining cpu.  I suspect that GotoBLAS has a large amount of
> communication between threads, and once again the computation
> stalls waiting for the 2 threads to either finish battling for the
> 1 cpu or perhaps the process uses pthread_yield() in some clever
> way to try to get load balancing.
> 

-- 
Andriy Gapon