HPC with ULE vs 4BSD

Fri Oct 10 22:29:07 UTC 2008

On Fri, Oct 10, 2008 at 02:30:42PM -0700, Steve Kargl wrote:
> Yes, this is a long email.
> 
> In working with a colleague to diagnose poor performance of
> his MPI code, we've discovered that ULE is drastically inferior
> to 4BSD in utilizing a system with 2 physical cpus (opteron) and
> a total of 8 cores.  We have observed this problem with the Open MPI
> implementation of MPI and with the MPICH2 implementation.
> 
> Note, I am using the exact same hardware and FreeBSD-current
> code dated Sep 22, 2008.  The only difference in the kernel
> config file is whether ULE or 4BSD is used.
> 
> Using the following command,
> 
> % time /OpenMPI/mpiexec -machinefile mf -n 8 ./Test_mpi |& tee sgk.log
> 
> we have 
> 
> ULE -->  546.99 real    0.02 user      0.03 sys
> 4BSD ->  218.96 real    0.03 user      0.02 sys
> 
> where the machinefile simply tells Open MPI to launch 8 jobs on the
> local node.  Test_mpi uses MPI's scatter, gather, and all_to_all
> functions to transmit various arrays between the 8 jobs.  To get
> meaningful numbers, a number of iterations are done in a tight loop.
> 
> With ULE, a snapshot of top(1) shows
> 
> last pid: 33765;  load averages:  7.98,  7.51,  5.63   up 10+03:20:30  13:13:56
> 43 processes:  9 running, 34 sleeping
> CPU: 68.6% user,  0.0% nice, 18.9% system,  0.0% interrupt, 12.5% idle
> Mem: 296M Active, 20M Inact, 192M Wired, 1112K Cache, 132M Buf, 31G Free
> Swap: 4096M Total, 4096M Free
> 
>   PID USERNAME    THR PRI NICE   SIZE    RES STATE  C   TIME    CPU COMMAND
> 33743 kargl         1 118    0   300M 22788K CPU7   7   4:48 100.00% Test_mpi
> 33747 kargl         1 118    0   300M 22820K CPU3   3   4:43 100.00% Test_mpi
> 33742 kargl         1 118    0   300M 22692K CPU5   5   4:42 100.00% Test_mpi
> 33744 kargl         1 117    0   300M 22752K CPU6   6   4:29 100.00% Test_mpi
> 33748 kargl         1 117    0   300M 22768K CPU2   2   4:31 96.39% Test_mpi
> 33741 kargl         1 112    0   299M 43628K CPU1   1   4:40 80.08% Test_mpi
> 33745 kargl         1 113    0   300M 44272K RUN    0   4:27 76.17% Test_mpi
> 33746 kargl         1 109    0   300M 22740K RUN    0   4:25 57.86% Test_mpi
> 33749 kargl         1  44    0  8196K  2280K CPU4   4   0:00  0.20% top
> 
> while with 4BSD, a snapshot of top(1) shows
> 
> last pid:  1019;  load averages:  7.24,  3.05,  1.25    up 0+00:04:40  13:27:09
> 43 processes:  9 running, 34 sleeping
> CPU: 45.4% user,  0.0% nice, 54.5% system,  0.1% interrupt,  0.0% idle
> Mem: 329M Active, 33M Inact, 107M Wired, 104K Cache, 14M Buf, 31G Free
> Swap: 4096M Total, 4096M Free
> 
>   PID USERNAME    THR PRI NICE   SIZE    RES STATE  C   TIME    CPU COMMAND
>  1012 kargl         1 126    0   300M 44744K CPU6   6   2:16 99.07% Test_mpi
>  1016 kargl         1 126    0   314M 59256K RUN    4   2:16 99.02% Test_mpi
>  1011 kargl         1 126    0   300M 44652K CPU5   5   2:16 99.02% Test_mpi
>  1013 kargl         1 126    0   300M 44680K CPU2   2   2:16 99.02% Test_mpi
>  1010 kargl         1 126    0   300M 44740K CPU7   7   2:16 99.02% Test_mpi
>  1009 kargl         1 126    0   299M 43884K CPU0   0   2:16 98.97% Test_mpi
>  1014 kargl         1 126    0   300M 44664K CPU1   1   2:16 98.97% Test_mpi
>  1015 kargl         1 126    0   300M 44620K CPU3   3   2:16 98.93% Test_mpi
>   989 kargl         1  96    0  8196K  2460K CPU4   4   0:00  0.10% top
> 
> Notice the interesting, or even perhaps odd, scheduling with ULE that results
> in a 23 second gap between the "fastest" job (4:48) and the "slowest" (4:25).
> With ULE, 2 Test_mpi jobs are always scheduled on the same core while one
> core remains idle.  Also, note the difference in the reported load averages.
> 
> Various stats are generated by and collected from executing the MPI program.
> With ULE, the numbers are
> 
> Procs  Array size   Kb   Iters  Function   Bandwidth(Mbs)   Time(s) 
>   8     800000     3125   100     scatter    12.58386     0.24251367
>   8     800000     3125   100  all_to_all    17.24503     0.17696444
>   8     800000     3125   100      gather    14.82058     0.20591355
>  
>   8    1600000     6250   100     scatter    28.25922     0.21598316
>   8    1600000     6250   100  all_to_all  1985.74915     0.00307366
>   8    1600000     6250   100      gather    30.42038     0.20063902
> 
>   8    2400000     9375   100     scatter    44.65615     0.20501709
>   8    2400000     9375   100  all_to_all    16.09386     0.56886748
>   8    2400000     9375   100      gather    44.38801     0.20625555
>  
>   8    3200000    12500   100     scatter    60.04160     0.20330956
>   8    3200000    12500   100  all_to_all  2157.10010     0.00565900
>   8    3200000    12500   100      gather    59.72242     0.20439614
>  
>   8    4000000    15625   100     scatter    86.65769     0.17608117
>   8    4000000    15625   100  all_to_all  2081.25195     0.00733154
>   8    4000000    15625   100      gather    27.47257     0.55541896
>  
>   8    4800000    18750   100     scatter    33.02306     0.55447768
>   8    4800000    18750   100  all_to_all   200.09908     0.09150740
>   8    4800000    18750   100      gather    91.08742     0.20102168
>  
>   8    5600000    21875   100     scatter   109.82005     0.19452098
>   8    5600000    21875   100  all_to_all    76.87574     0.27788095
>   8    5600000    21875   100      gather    41.67106     0.51264128
>  
>   8    6400000    25000   100     scatter    26.92482     0.90674917
>   8    6400000    25000   100  all_to_all    64.74528     0.37707868
>   8    6400000    25000   100      gather    41.29724     0.59117904
>  
> and with 4BSD, the numbers are
> 
> Procs  Array size   Kb     Iters  Function    Bandwidth(Mbs)  Time(s) 
>   8      800000     3125   100      scatter      21.33697    0.14302677
>   8      800000     3125   100   all_to_all    3941.39624    0.00077428
>   8      800000     3125   100       gather      24.75520    0.12327747
> 
>   8     1600000     6250   100      scatter      45.20134    0.13502954
>   8     1600000     6250   100   all_to_all    1987.94348    0.00307027
>   8     1600000     6250   100       gather      42.02498    0.14523541
> 
>   8     2400000     9375   100      scatter      63.03553    0.14523989
>   8     2400000     9375   100   all_to_all    2015.19580    0.00454312
>   8     2400000     9375   100       gather      66.72807    0.13720272
>  
>   8     3200000    12500   100      scatter      91.90541    0.13282169
>   8     3200000    12500   100   all_to_all    2029.62622    0.00601442
>   8     3200000    12500   100       gather      87.99693    0.13872112
> 
>   8     4000000    15625   100      scatter     107.48991    0.14195556
>   8     4000000    15625   100   all_to_all    1970.66907    0.00774295
>   8     4000000    15625   100       gather     110.70226    0.13783630
> 
>   8     4800000    18750   100      scatter     140.39014    0.13042616
>   8     4800000    18750   100   all_to_all    2401.80054    0.00762367
>   8     4800000    18750   100       gather     134.60948    0.13602717
>  
>   8     5600000    21875   100      scatter     152.31958    0.14024661
>   8     5600000    21875   100   all_to_all    2379.12207    0.00897907
>   8     5600000    21875   100       gather     154.60051    0.13817745
>  
>   8     6400000    25000   100      scatter     190.03561    0.12847099
>   8     6400000    25000   100   all_to_all    2661.36963    0.00917350
>   8     6400000    25000   100       gather     183.08250    0.13335006
>  
> Noting that all communication is over the memory bus, a comparison of
> the Bandwidth columns suggests that ULE is causing the MPI jobs to stall
> waiting for data.  This has a potentially serious negative impact on
> clusters used for HPC.
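
For anyone comparing the two tables: the Bandwidth column looks like it
is simply the Kb column converted to MiB and divided by Time(s), and the
wall-clock times put ULE at roughly 2.5 times slower than 4BSD.  Below is
a quick sanity check in Python.  It is not part of the original test; the
formula and the function names are assumptions, and the numbers are
copied from the quoted output.

```python
# Sanity check of the quoted numbers.  The bandwidth formula is an
# assumption inferred from the tables, not taken from the Test_mpi source.

# Wall-clock seconds from the time(1) output in the quoted message.
ULE_REAL = 546.99
BSD_REAL = 218.96


def slowdown(slow_seconds, fast_seconds):
    """Ratio of two wall-clock times."""
    return slow_seconds / fast_seconds


def bandwidth_mib_per_s(size_kib, seconds):
    """Assumed formula: (size in KiB / 1024) / time in seconds."""
    return size_kib / 1024.0 / seconds


# (scheduler, size in KiB, time in seconds, reported bandwidth) for a few
# scatter rows copied from the tables.
ROWS = [
    ("ULE", 3125, 0.24251367, 12.58386),
    ("4BSD", 3125, 0.14302677, 21.33697),
    ("4BSD", 25000, 0.12847099, 190.03561),
]

if __name__ == "__main__":
    print("ULE/4BSD wall-clock ratio: %.2f" % slowdown(ULE_REAL, BSD_REAL))
    for sched, kib, secs, reported in ROWS:
        computed = bandwidth_mib_per_s(kib, secs)
        print("%-4s %6d KiB  computed %10.5f  reported %10.5f"
              % (sched, kib, computed, reported))
```

If the assumed formula is right, the computed and reported columns should
agree to within rounding for every row.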

What surprises me is that you didn't CC the individual who wrote ULE:
Jeff Roberson.  :-)  I've CC'd him here.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |