HPC with ULE vs 4BSD
Jeremy Chadwick
koitsu at FreeBSD.org
Fri Oct 10 22:29:07 UTC 2008
On Fri, Oct 10, 2008 at 02:30:42PM -0700, Steve Kargl wrote:
> Yes, this is a long email.
>
> In working with a colleague to diagnose poor performance of
> his MPI code, we've discovered that ULE is drastically inferior
> to 4BSD in utilizing a system with 2 physical cpus (opteron) and
> a total of 8 cores. We have observed this problem with the Open MPI
> implementation of MPI and with the MPICH2 implementation.
>
> Note, I am using the exact same hardware and FreeBSD-current
> code dated Sep 22, 2008. The only difference in the kernel
> config file is whether ULE or 4BSD is used.
>
> Using the following command,
>
> % time /OpenMPI/mpiexec -machinefile mf -n 8 ./Test_mpi |& tee sgk.log
>
> we have
>
> ULE  --> 546.99 real  0.02 user  0.03 sys
> 4BSD --> 218.96 real  0.03 user  0.02 sys
>
> where the machinefile simply tells Open MPI to launch 8 jobs on the
> local node. Test_mpi uses MPI's scatter, gather, and all_to_all
> functions to transmit various arrays between the 8 jobs. To get
> meaningful numbers, a number of iterations are done in a tight loop.
>
> With ULE, a snapshot of top(1) shows
>
> last pid: 33765; load averages: 7.98, 7.51, 5.63 up 10+03:20:30 13:13:56
> 43 processes: 9 running, 34 sleeping
> CPU: 68.6% user, 0.0% nice, 18.9% system, 0.0% interrupt, 12.5% idle
> Mem: 296M Active, 20M Inact, 192M Wired, 1112K Cache, 132M Buf, 31G Free
> Swap: 4096M Total, 4096M Free
>
> PID USERNAME THR PRI NICE SIZE RES STATE C TIME CPU COMMAND
> 33743 kargl 1 118 0 300M 22788K CPU7 7 4:48 100.00% Test_mpi
> 33747 kargl 1 118 0 300M 22820K CPU3 3 4:43 100.00% Test_mpi
> 33742 kargl 1 118 0 300M 22692K CPU5 5 4:42 100.00% Test_mpi
> 33744 kargl 1 117 0 300M 22752K CPU6 6 4:29 100.00% Test_mpi
> 33748 kargl 1 117 0 300M 22768K CPU2 2 4:31 96.39% Test_mpi
> 33741 kargl 1 112 0 299M 43628K CPU1 1 4:40 80.08% Test_mpi
> 33745 kargl 1 113 0 300M 44272K RUN 0 4:27 76.17% Test_mpi
> 33746 kargl 1 109 0 300M 22740K RUN 0 4:25 57.86% Test_mpi
> 33749 kargl 1 44 0 8196K 2280K CPU4 4 0:00 0.20% top
>
> while with 4BSD, a snapshot of top(1) shows
>
> last pid: 1019; load averages: 7.24, 3.05, 1.25 up 0+00:04:40 13:27:09
> 43 processes: 9 running, 34 sleeping
> CPU: 45.4% user, 0.0% nice, 54.5% system, 0.1% interrupt, 0.0% idle
> Mem: 329M Active, 33M Inact, 107M Wired, 104K Cache, 14M Buf, 31G Free
> Swap: 4096M Total, 4096M Free
>
> PID USERNAME THR PRI NICE SIZE RES STATE C TIME CPU COMMAND
> 1012 kargl 1 126 0 300M 44744K CPU6 6 2:16 99.07% Test_mpi
> 1016 kargl 1 126 0 314M 59256K RUN 4 2:16 99.02% Test_mpi
> 1011 kargl 1 126 0 300M 44652K CPU5 5 2:16 99.02% Test_mpi
> 1013 kargl 1 126 0 300M 44680K CPU2 2 2:16 99.02% Test_mpi
> 1010 kargl 1 126 0 300M 44740K CPU7 7 2:16 99.02% Test_mpi
> 1009 kargl 1 126 0 299M 43884K CPU0 0 2:16 98.97% Test_mpi
> 1014 kargl 1 126 0 300M 44664K CPU1 1 2:16 98.97% Test_mpi
> 1015 kargl 1 126 0 300M 44620K CPU3 3 2:16 98.93% Test_mpi
> 989 kargl 1 96 0 8196K 2460K CPU4 4 0:00 0.10% top
>
> Notice the interesting, or perhaps even odd, scheduling with ULE that results
> in a 23-second gap between the "fastest" job (4:48) and the "slowest" (4:25).
> With ULE, 2 Test_mpi jobs are always scheduled on the same core while one
> core remains idle. Also, note the difference in the reported load averages.
>
> Various stats are generated by and collected from executing the MPI program.
> With ULE, the numbers are
>
> Procs Array size Kb Iters Function Bandwidth(Mbs) Time(s)
> 8 800000 3125 100 scatter 12.58386 0.24251367
> 8 800000 3125 100 all_to_all 17.24503 0.17696444
> 8 800000 3125 100 gather 14.82058 0.20591355
>
> 8 1600000 6250 100 scatter 28.25922 0.21598316
> 8 1600000 6250 100 all_to_all 1985.74915 0.00307366
> 8 1600000 6250 100 gather 30.42038 0.20063902
>
> 8 2400000 9375 100 scatter 44.65615 0.20501709
> 8 2400000 9375 100 all_to_all 16.09386 0.56886748
> 8 2400000 9375 100 gather 44.38801 0.20625555
>
> 8 3200000 12500 100 scatter 60.04160 0.20330956
> 8 3200000 12500 100 all_to_all 2157.10010 0.00565900
> 8 3200000 12500 100 gather 59.72242 0.20439614
>
> 8 4000000 15625 100 scatter 86.65769 0.17608117
> 8 4000000 15625 100 all_to_all 2081.25195 0.00733154
> 8 4000000 15625 100 gather 27.47257 0.55541896
>
> 8 4800000 18750 100 scatter 33.02306 0.55447768
> 8 4800000 18750 100 all_to_all 200.09908 0.09150740
> 8 4800000 18750 100 gather 91.08742 0.20102168
>
> 8 5600000 21875 100 scatter 109.82005 0.19452098
> 8 5600000 21875 100 all_to_all 76.87574 0.27788095
> 8 5600000 21875 100 gather 41.67106 0.51264128
>
> 8 6400000 25000 100 scatter 26.92482 0.90674917
> 8 6400000 25000 100 all_to_all 64.74528 0.37707868
> 8 6400000 25000 100 gather 41.29724 0.59117904
>
> and with 4BSD, the numbers are
>
> Procs Array size Kb Iters Function Bandwidth(Mbs) Time(s)
> 8 800000 3125 100 scatter 21.33697 0.14302677
> 8 800000 3125 100 all_to_all 3941.39624 0.00077428
> 8 800000 3125 100 gather 24.75520 0.12327747
>
> 8 1600000 6250 100 scatter 45.20134 0.13502954
> 8 1600000 6250 100 all_to_all 1987.94348 0.00307027
> 8 1600000 6250 100 gather 42.02498 0.14523541
>
> 8 2400000 9375 100 scatter 63.03553 0.14523989
> 8 2400000 9375 100 all_to_all 2015.19580 0.00454312
> 8 2400000 9375 100 gather 66.72807 0.13720272
>
> 8 3200000 12500 100 scatter 91.90541 0.13282169
> 8 3200000 12500 100 all_to_all 2029.62622 0.00601442
> 8 3200000 12500 100 gather 87.99693 0.13872112
>
> 8 4000000 15625 100 scatter 107.48991 0.14195556
> 8 4000000 15625 100 all_to_all 1970.66907 0.00774295
> 8 4000000 15625 100 gather 110.70226 0.13783630
>
> 8 4800000 18750 100 scatter 140.39014 0.13042616
> 8 4800000 18750 100 all_to_all 2401.80054 0.00762367
> 8 4800000 18750 100 gather 134.60948 0.13602717
>
> 8 5600000 21875 100 scatter 152.31958 0.14024661
> 8 5600000 21875 100 all_to_all 2379.12207 0.00897907
> 8 5600000 21875 100 gather 154.60051 0.13817745
>
> 8 6400000 25000 100 scatter 190.03561 0.12847099
> 8 6400000 25000 100 all_to_all 2661.36963 0.00917350
> 8 6400000 25000 100 gather 183.08250 0.13335006
>
> Noting that all communication is over the memory bus, a comparison of
> the Bandwidth columns suggests that ULE is causing the MPI jobs to stall
> waiting for data. This has a potentially serious negative impact on
> clusters used for HPC.
What surprises me is that you didn't CC the individual who wrote ULE:
Jeff Roberson. :-) I've CC'd him here.
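
For anyone cross-checking the quoted figures, here is a minimal Python
sketch. It assumes, based only on the numbers themselves, that the
Bandwidth(Mbs) column is the Kb column divided by 1024 and then by Time(s),
and that the Kb column corresponds to 4-byte array elements. Neither is
stated in the original post, and the function names are hypothetical.

```python
# Sanity check of the quoted tables. All helper names are hypothetical;
# the formulas are assumptions inferred from the numbers, not from Test_mpi.

BYTES_PER_ELEMENT = 4  # assumption: 800000 elements * 4 bytes / 1024 = 3125 Kb


def size_kb(elements: int) -> float:
    """Array size in Kb, assuming 4-byte elements."""
    return elements * BYTES_PER_ELEMENT / 1024


def bandwidth_mbs(kb: float, seconds: float) -> float:
    """Bandwidth as the tables appear to report it: (Kb / 1024) / Time(s)."""
    return kb / 1024 / seconds


def slowdown(ule_real: float, bsd_real: float) -> float:
    """Ratio of wall-clock times (ULE over 4BSD)."""
    return ule_real / bsd_real


# (label, Kb, Time(s), reported Bandwidth(Mbs)) copied from the quoted tables
rows = [
    ("ULE  scatter,    800000", 3125, 0.24251367, 12.58386),
    ("4BSD all_to_all, 800000", 3125, 0.00077428, 3941.39624),
]

for label, kb, seconds, reported in rows:
    computed = bandwidth_mbs(kb, seconds)
    print(f"{label}: computed {computed:.2f}, reported {reported:.2f}")

print(f"Wall-clock ratio ULE/4BSD: {slowdown(546.99, 218.96):.2f}")
```

If those assumptions hold, the same helpers can be pointed at any other row
of either table to see where the two schedulers diverge most.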
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |