HPC with ULE vs 4BSD

Steve Kargl sgk at troutmask.apl.washington.edu
Fri Oct 10 21:30:54 UTC 2008


Yes, this is a long email.

In working with a colleague to diagnose the poor performance of
his MPI code, we've discovered that ULE is drastically inferior
to 4BSD at utilizing a system with 2 physical CPUs (Opterons) and
a total of 8 cores.  We have observed the problem with both the
Open MPI and the MPICH2 implementations of MPI.

Note that I am using the exact same hardware and FreeBSD-current
code dated Sep 22, 2008.  The only difference between the two kernel
config files is whether ULE or 4BSD is selected.
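
That is, the only line that changes between the two configs is the
scheduler option, i.e. one of

options         SCHED_ULE               # ULE scheduler
options         SCHED_4BSD              # traditional 4BSD scheduler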

Using the following command,

% time /OpenMPI/mpiexec -machinefile mf -n 8 ./Test_mpi |& tee sgk.log

we have 

ULE  -->  546.99 real    0.02 user      0.03 sys
4BSD -->  218.96 real    0.03 user      0.02 sys

where the machinefile simply tells Open MPI to launch 8 jobs on the
local node.  Test_mpi uses MPI's scatter, gather, and all_to_all
functions to transmit various arrays between the 8 jobs; to get
meaningful numbers, each operation is repeated in a tight loop (a rough
sketch of the pattern follows below).
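
For reference, the machinefile "mf" does nothing more than place all 8
ranks on the local node; in Open MPI hostfile syntax that is something
like (hostname is of course whatever the node calls itself):

node21 slots=8

The code below is not the actual Test_mpi source, but it illustrates the
communication pattern being timed: each collective runs in a tight loop
and the average per-iteration time is reported.  Array size, element
type, and output format here are placeholders.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NELEM 800000            /* total elements, cf. first "Array size" row */
#define ITERS 100

int main(int argc, char **argv)
{
    int rank, nprocs, chunk, i;
    float *sendbuf, *recvbuf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    chunk = NELEM / nprocs;             /* elements per rank per collective */

    sendbuf = malloc(NELEM * sizeof(float));
    recvbuf = malloc(NELEM * sizeof(float));
    for (i = 0; i < NELEM; i++)
        sendbuf[i] = (float)i;

    /* Root scatters "chunk" elements to every rank. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < ITERS; i++)
        MPI_Scatter(sendbuf, chunk, MPI_FLOAT,
                    recvbuf, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("scatter     %12.8f s/iter\n", (t1 - t0) / ITERS);

    /* Every rank exchanges "chunk" elements with every other rank. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < ITERS; i++)
        MPI_Alltoall(sendbuf, chunk, MPI_FLOAT,
                     recvbuf, chunk, MPI_FLOAT, MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("all_to_all  %12.8f s/iter\n", (t1 - t0) / ITERS);

    /* Root gathers "chunk" elements from every rank. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < ITERS; i++)
        MPI_Gather(recvbuf, chunk, MPI_FLOAT,
                   sendbuf, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("gather      %12.8f s/iter\n", (t1 - t0) / ITERS);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}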

With ULE, a snapshot of top(1) shows

last pid: 33765;  load averages:  7.98,  7.51,  5.63   up 10+03:20:30  13:13:56
43 processes:  9 running, 34 sleeping
CPU: 68.6% user,  0.0% nice, 18.9% system,  0.0% interrupt, 12.5% idle
Mem: 296M Active, 20M Inact, 192M Wired, 1112K Cache, 132M Buf, 31G Free
Swap: 4096M Total, 4096M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE  C   TIME    CPU COMMAND
33743 kargl         1 118    0   300M 22788K CPU7   7   4:48 100.00% Test_mpi
33747 kargl         1 118    0   300M 22820K CPU3   3   4:43 100.00% Test_mpi
33742 kargl         1 118    0   300M 22692K CPU5   5   4:42 100.00% Test_mpi
33744 kargl         1 117    0   300M 22752K CPU6   6   4:29 100.00% Test_mpi
33748 kargl         1 117    0   300M 22768K CPU2   2   4:31 96.39% Test_mpi
33741 kargl         1 112    0   299M 43628K CPU1   1   4:40 80.08% Test_mpi
33745 kargl         1 113    0   300M 44272K RUN    0   4:27 76.17% Test_mpi
33746 kargl         1 109    0   300M 22740K RUN    0   4:25 57.86% Test_mpi
33749 kargl         1  44    0  8196K  2280K CPU4   4   0:00  0.20% top

while with 4BSD, a snapshot of top(1) shows

last pid:  1019;  load averages:  7.24,  3.05,  1.25    up 0+00:04:40  13:27:09
43 processes:  9 running, 34 sleeping
CPU: 45.4% user,  0.0% nice, 54.5% system,  0.1% interrupt,  0.0% idle
Mem: 329M Active, 33M Inact, 107M Wired, 104K Cache, 14M Buf, 31G Free
Swap: 4096M Total, 4096M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE  C   TIME    CPU COMMAND
 1012 kargl         1 126    0   300M 44744K CPU6   6   2:16 99.07% Test_mpi
 1016 kargl         1 126    0   314M 59256K RUN    4   2:16 99.02% Test_mpi
 1011 kargl         1 126    0   300M 44652K CPU5   5   2:16 99.02% Test_mpi
 1013 kargl         1 126    0   300M 44680K CPU2   2   2:16 99.02% Test_mpi
 1010 kargl         1 126    0   300M 44740K CPU7   7   2:16 99.02% Test_mpi
 1009 kargl         1 126    0   299M 43884K CPU0   0   2:16 98.97% Test_mpi
 1014 kargl         1 126    0   300M 44664K CPU1   1   2:16 98.97% Test_mpi
 1015 kargl         1 126    0   300M 44620K CPU3   3   2:16 98.93% Test_mpi
  989 kargl         1  96    0  8196K  2460K CPU4   4   0:00  0.10% top

Notice the interesting, or perhaps even odd, scheduling with ULE, which
results in a gap of more than 20 seconds of accumulated CPU time between the
"fastest" job (4:48) and the "slowest" (4:25).  With ULE, two Test_mpi jobs
are always scheduled on the same core (the two RUN entries on CPU 0 above)
while one core sits idle (the 12.5% idle in the CPU line).  Also, note the
difference in the reported load averages.

Various stats are generated by and collected from executing the MPI program.
With ULE, the numbers are

Procs  Array size   KB   Iters  Function   Bandwidth(MB/s)  Time(s) 
  8     800000     3125   100     scatter    12.58386     0.24251367
  8     800000     3125   100  all_to_all    17.24503     0.17696444
  8     800000     3125   100      gather    14.82058     0.20591355
 
  8    1600000     6250   100     scatter    28.25922     0.21598316
  8    1600000     6250   100  all_to_all  1985.74915     0.00307366
  8    1600000     6250   100      gather    30.42038     0.20063902

  8    2400000     9375   100     scatter    44.65615     0.20501709
  8    2400000     9375   100  all_to_all    16.09386     0.56886748
  8    2400000     9375   100      gather    44.38801     0.20625555
 
  8    3200000    12500   100     scatter    60.04160     0.20330956
  8    3200000    12500   100  all_to_all  2157.10010     0.00565900
  8    3200000    12500   100      gather    59.72242     0.20439614
 
  8    4000000    15625   100     scatter    86.65769     0.17608117
  8    4000000    15625   100  all_to_all  2081.25195     0.00733154
  8    4000000    15625   100      gather    27.47257     0.55541896
 
  8    4800000    18750   100     scatter    33.02306     0.55447768
  8    4800000    18750   100  all_to_all   200.09908     0.09150740
  8    4800000    18750   100      gather    91.08742     0.20102168
 
  8    5600000    21875   100     scatter   109.82005     0.19452098
  8    5600000    21875   100  all_to_all    76.87574     0.27788095
  8    5600000    21875   100      gather    41.67106     0.51264128
 
  8    6400000    25000   100     scatter    26.92482     0.90674917
  8    6400000    25000   100  all_to_all    64.74528     0.37707868
  8    6400000    25000   100      gather    41.29724     0.59117904
 
and with 4BSD, the numbers are

Procs  Array size   KB     Iters  Function    Bandwidth(MB/s) Time(s) 
  8      800000     3125   100      scatter      21.33697    0.14302677
  8      800000     3125   100   all_to_all    3941.39624    0.00077428
  8      800000     3125   100       gather      24.75520    0.12327747

  8     1600000     6250   100      scatter      45.20134    0.13502954
  8     1600000     6250   100   all_to_all    1987.94348    0.00307027
  8     1600000     6250   100       gather      42.02498    0.14523541

  8     2400000     9375   100      scatter      63.03553    0.14523989
  8     2400000     9375   100   all_to_all    2015.19580    0.00454312
  8     2400000     9375   100       gather      66.72807    0.13720272
 
  8     3200000    12500   100      scatter      91.90541    0.13282169
  8     3200000    12500   100   all_to_all    2029.62622    0.00601442
  8     3200000    12500   100       gather      87.99693    0.13872112

  8     4000000    15625   100      scatter     107.48991    0.14195556
  8     4000000    15625   100   all_to_all    1970.66907    0.00774295
  8     4000000    15625   100       gather     110.70226    0.13783630

  8     4800000    18750   100      scatter     140.39014    0.13042616
  8     4800000    18750   100   all_to_all    2401.80054    0.00762367
  8     4800000    18750   100       gather     134.60948    0.13602717
 
  8     5600000    21875   100      scatter     152.31958    0.14024661
  8     5600000    21875   100   all_to_all    2379.12207    0.00897907
  8     5600000    21875   100       gather     154.60051    0.13817745
 
  8     6400000    25000   100      scatter     190.03561    0.12847099
  8     6400000    25000   100   all_to_all    2661.36963    0.00917350
  8     6400000    25000   100       gather     183.08250    0.13335006
 
Given that all communication here is over the memory bus (all 8 ranks run
on a single node), a comparison of the Bandwidth columns suggests that under
ULE the MPI jobs stall waiting for data.  This has a potentially serious
negative impact on clusters used for HPC.
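
(For what it's worth, the Bandwidth column in both tables is consistent with
the per-iteration transfer size divided by the per-iteration time, i.e.
roughly

    Bandwidth (MB/s) = (KB / 1024) / Time(s)

e.g., for the first ULE scatter row: 3125 KB / 1024 = 3.05 MB, and
3.05 MB / 0.2425 s = 12.58 MB/s, which matches the reported 12.58386.)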

-- 
Steve

