KSE and SMP problem in FreeBSD/amd64 5.3-BETA3: KSE doesn't make use of SMP

NAKATA Maho chat95 at mac.com
Sat Sep 11 20:08:07 PDT 2004


Dear amd64 freaks, I noticed that there seems to be a bug
in KSE on SMP configurations.

Here I describe my problem in detail.

The math/atlas port uses SMP through threading: if you have 2 processors
you can get nearly double the performance, so KSE is the key technology
for SMP. On amd64, however, KSE doesn't use the second CPU at all.

My machine is:
Tyan S2885
Opteron 1.6GHz x 2
2G bytes of memory

I confirmed that:
o FreeBSD/amd64 5.2.1-RELEASE with KSE doesn't work at all (it dumps
core or gets a memory fault), while without KSE (switched via libmap.conf;
not shown here) it works well but without any performance gain.

o FreeBSD/amd64 5.3-BETA3 with KSE at least works, but
doesn't utilize SMP.

o FreeBSD/i386 5.2.1-RELEASE and 5.3-BETA3 work well.

How to reproduce:
(it takes many hours to build math/atlas, so I put the work directories
online; see the URLs near the end of this mail)

CVSup your ports tree; please use:
# $FreeBSD: ports/math/atlas/Makefile,v 1.27 2004/09/02 00:25:45 maho Exp $

0a. prepare an Opteron SMP machine, and install FreeBSD/amd64 5.3-BETA3.
1a. cd /usr/ports/math/atlas
2a. make
3a. wait for a long time
4a. cd /usr/ports/math/atlas/work/ATLAS/bin/THREADED
5a. make xdlutst (it takes only seconds)
6a. make xdlutst_pt (it takes only seconds)
7a. type ./xdlutst -N 1000 2000 200  (this doesn't use SMP or KSE)
NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP     RESID
=====  =====  =====  =====  =====  =====  ========  ========  ========
    0  Col     1000   1000   1000    995     0.301  2210.755 3.821e-02
    0  Col     1200   1200   1200   1194     0.504  2282.569 3.793e-02
    0  Col     1400   1400   1400   1395     0.794  2303.707 2.843e-02
    0  Col     1600   1600   1600   1595     1.156  2360.557 2.893e-02
    0  Col     1800   1800   1800   1793     1.637  2374.130 2.803e-02
    0  Col     2000   2000   2000   1990     2.192  2431.838 2.744e-02

6 cases ran, 6 cases passed


8a. type ./xdlutst_pt -N 2000 3000 200
 ./xdlutst_pt -N 2000 3000 200
NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP     RESID
=====  =====  =====  =====  =====  =====  ========  ========  ========
    0  Col     2000   2000   2000   1990     2.286  2332.527 2.744e-02
    0  Col     2200   2200   2200   2194     2.764  2567.795 2.639e-02
    0  Col     2400   2400   2400   2394     3.766  2446.449 2.721e-02
    0  Col     2600   2600   2600   2593     4.722  2480.761 2.472e-02
    0  Col     2800   2800   2800   2795     5.855  2499.038 2.441e-02
    0  Col     3000   3000   3000   2992     7.302  2464.553 2.442e-02

6 cases ran, 6 cases passed

Please see the MFLOP column, which shows the speed of the calculation
in MFLOPS. A single 1.6GHz Opteron delivers about 2.4 GFLOPS for LU
decomposition, and as you can see, the threaded run shows no performance gain :(
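
As a rough cross-check of the MFLOP column, here is a small Python sketch. It assumes the tester counts about (2/3)*N^3 floating-point operations for an N x N LU decomposition (the usual leading-order count; the exact formula used by xdlutst is not shown in this mail). The function name lu_mflops is made up for illustration.

```python
def lu_mflops(n, seconds):
    """Approximate MFLOPS for an n x n LU factorization.

    Assumes the leading-order flop count (2/3) * n**3; the exact count
    used by the ATLAS tester may include lower-order terms.
    """
    flops = (2.0 / 3.0) * n ** 3
    return flops / seconds / 1.0e6


# (N, TIME) pairs copied from the amd64 xdlutst table in step 7a.
for n, t in [(1000, 0.301), (2000, 2.192)]:
    print(n, round(lu_mflops(n, t), 1))
```

Under that assumption the computed values land within about one percent of the MFLOP column, so the column is consistent with the N and TIME columns.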

Typical output of top looks like this:

  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
  716 root     134    0   185M   179M CPU0   0   1:05 21.09% 21.09% xdlutst_pt
  716 root     134    0   185M   179M RUN    0   1:05 19.53% 19.53% xdlutst_pt
  716 root      20    0   185M   179M kserel 1   1:05  0.00%  0.00% xdlutst_pt
  716 root      20    0   185M   179M ksesig 1   1:05  0.00%  0.00% xdlutst_pt
  716 root      20    0   185M   179M kserel 0   1:05  0.00%  0.00% xdlutst_pt

The two busy threads of xdlutst_pt are always running on *ONLY ONE CPU*
(both on CPU0, or both on CPU1).
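
To make the same check mechanically, a minimal Python sketch can read the C (CPU) column for the threads that are on a CPU or runnable. It assumes the 12-column top layout shown above; the function name running_cpus is hypothetical, and a single top snapshot is of course only an indication.

```python
def running_cpus(top_lines):
    """Return the CPU numbers (column C) of threads that are running or runnable."""
    cpus = []
    for line in top_lines:
        fields = line.split()
        # Columns: PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU CPU COMMAND
        if len(fields) != 12 or not fields[0].isdigit():
            continue  # skip the header line and anything malformed
        state, cpu = fields[6], fields[7]
        if state == "RUN" or state.startswith("CPU"):
            cpus.append(int(cpu))
    return cpus


# The amd64 top snapshot from this mail.
amd64_top = """\
  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
  716 root     134    0   185M   179M CPU0   0   1:05 21.09% 21.09% xdlutst_pt
  716 root     134    0   185M   179M RUN    0   1:05 19.53% 19.53% xdlutst_pt
  716 root      20    0   185M   179M kserel 1   1:05  0.00%  0.00% xdlutst_pt
  716 root      20    0   185M   179M ksesig 1   1:05  0.00%  0.00% xdlutst_pt
  716 root      20    0   185M   179M kserel 0   1:05  0.00%  0.00% xdlutst_pt
""".splitlines()

busy = running_cpus(amd64_top)
print("busy threads:", len(busy), "distinct CPUs:", len(set(busy)))
```

For this snapshot there are two busy threads but only one distinct CPU, which is the symptom described here.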
--------------------------------------------------------------------
Next, I tried the i386 version.

0i. prepare the same Opteron SMP machine as above, and install FreeBSD/i386
5.3-BETA3.
CVSup your ports tree.

1i. cd /usr/ports/math/atlas
2i. make
3i. wait for a long time
4i. cd /usr/ports/math/atlas/work/ATLAS/bin/THREADED
5i. make xdlutst (it takes only seconds)
6i. make xdlutst_pt (it takes only seconds)
7i. type ./xdlutst -N 1000 2000 200  (this doesn't use SMP or KSE)
./xdlutst -N 1000 2000 200
NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP     RESID
=====  =====  =====  =====  =====  =====  ========  ========  ========
    0  Col     1000   1000   1000    995     0.307  2170.617 3.437e-02
    0  Col     1200   1200   1200   1194     0.522  2204.335 3.482e-02
    0  Col     1400   1400   1400   1395     0.799  2286.888 4.150e-02
    0  Col     1600   1600   1600   1595     1.164  2345.104 3.598e-02
    0  Col     1800   1800   1800   1793     1.616  2405.542 3.601e-02
    0  Col     2000   2000   2000   1990     2.218  2403.157 3.436e-02

6 cases ran, 6 cases passed

8i. type ./xdlutst_pt -N 3000 4000 200 (this uses KSE and so makes
full use of SMP)
./xdlutst_pt -N 3000 4000 200
NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP     RESID
=====  =====  =====  =====  =====  =====  ========  ========  ========
    0  Col     3000   3000   3000   2992     7.157  2514.351 3.650e-02
    0  Col     3200   3200   3200   3186     5.127  4259.986 3.207e-02
    0  Col     3400   3400   3400   3392     5.867  4465.006 3.528e-02
    0  Col     3600   3600   3600   3589     6.791  4579.468 3.519e-02
    0  Col     3800   3800   3800   3791     8.510  4297.730 3.285e-02
    0  Col     4000   4000   4000   3995     9.207  4633.234 3.218e-02

6 cases ran, 6 cases passed

Yes, there is a performance gain from utilizing SMP.

Typical output of top looks like this:

  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
  714 root     139    0   301M   300M CPU1   1   2:16 66.41% 66.41% xdlutst_pt
  714 root     139    0   301M   300M RUN    0   2:16 66.41% 66.41% xdlutst_pt
  714 root      20    0   301M   300M kserel 1   2:16  0.00%  0.00% xdlutst_pt
  714 root      20    0   301M   300M kserel 0   2:16  0.00%  0.00% xdlutst_pt
  714 root      20    0   301M   300M ksesig 0   2:16  0.00%  0.00% xdlutst_pt

Summary:
The differences between 8a and 8i are:
o there is no performance gain in 8a, whereas 8i gains nearly double.
o the output of top indicates that with KSE on amd64 the two threads are
created correctly, but the scheduling is somewhat odd: the two busy threads
run on the same processor, even though the threads of the process as a
whole appear to be spread over both processors.
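
To put a number on "nearly double", the following Python sketch compares the peak MFLOP values of the threaded and serial runs, copied from the four tables in this mail. The problem sizes of the serial and threaded runs differ, so the ratios are only indicative; the helper name peak_speedup is made up for illustration.

```python
# MFLOP columns copied from the tables in steps 7a, 8a, 7i and 8i.
amd64_serial = [2210.755, 2282.569, 2303.707, 2360.557, 2374.130, 2431.838]
amd64_threaded = [2332.527, 2567.795, 2446.449, 2480.761, 2499.038, 2464.553]
i386_serial = [2170.617, 2204.335, 2286.888, 2345.104, 2405.542, 2403.157]
i386_threaded = [2514.351, 4259.986, 4465.006, 4579.468, 4297.730, 4633.234]


def peak_speedup(serial, threaded):
    """Ratio of the best threaded MFLOP value to the best serial one."""
    return max(threaded) / max(serial)


amd64_ratio = peak_speedup(amd64_serial, amd64_threaded)
i386_ratio = peak_speedup(i386_serial, i386_threaded)
print("amd64 peak speedup: %.2f" % amd64_ratio)
print("i386  peak speedup: %.2f" % i386_ratio)
```

With these numbers the ratio is about 1.06 on amd64 and about 1.93 on i386, on the same hardware.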

You can try this easily; the work directories of these two builds are available at:
http://people.freebsd.org/~maho/atlas/atlas-work-opteron_dual-amd64.tar.bz 
http://people.freebsd.org/~maho/atlas/atlas-work-opteron_dual-i386.tar.bz

MD5 (atlas-work-opteron_dual-amd64.tar.bz) = 9d9d7e8b00b34a783b7d2172bc404e23
MD5 (atlas-work-opteron_dual-i386.tar.bz) = 8076a753c7b3edaea7bd446c6473f120

Can anybody fix it?

Best regards,
--nakata maho



