LAM MPI on dual processor opteron box sees only one cpu...

Jeffrey Racine jracine at maxwell.syr.edu
Sat Apr 10 15:21:27 PDT 2004


Hi.

I am close to getting a new dual Opteron box running. I am now setting
up and testing LAM MPI, but the OS is not farming out the job as
expected and appears to use only one processor.

The same setup runs fine on RH 7.3 and RH 9.0, both on a cluster and on
a dual processor PIV desktop. Here I am running FreeBSD 5-current.
Basically, mpirun -np 1 binaryfile has the same runtime as mpirun -np 2
binaryfile, while on the dual PIV box -np 2 runs in half the time. When
I check top, both processes started by mpirun -np 2 run on CPU 0. Here
is the relevant portion from top with -np 2:

9306 jracine    4    0  7188K  2448K sbwait 0   0:03 19.53% 19.53% n_lam
29307 jracine  119    0  7148K  2372K CPU0   0   0:03 19.53% 19.53% n_lam
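
A minimal sketch of how the CPU column in those two rows could be checked
mechanically. It assumes the FreeBSD SMP top layout (PID USERNAME PRI NICE
SIZE RES STATE C TIME WCPU CPU COMMAND); the function name pids_by_cpu is
made up for illustration, and the rows are the ones pasted above with the
wrapped command name rejoined.

```python
from collections import defaultdict

# Rows copied from the top output above (the header row was not included there).
TOP_ROWS = """\
9306 jracine    4    0  7188K  2448K sbwait 0   0:03 19.53% 19.53% n_lam
29307 jracine  119    0  7148K  2372K CPU0   0   0:03 19.53% 19.53% n_lam
"""

def pids_by_cpu(rows, command):
    """Map CPU number -> list of PIDs for rows whose COMMAND equals `command`.

    Assumed column order (FreeBSD SMP top):
    PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU CPU COMMAND
    """
    result = defaultdict(list)
    for line in rows.splitlines():
        fields = line.split()
        # Skip anything that is not a complete 12-column process row.
        if len(fields) != 12 or fields[-1] != command:
            continue
        result[int(fields[7])].append(int(fields[0]))  # fields[7] is the C column
    return dict(result)

print(pids_by_cpu(TOP_ROWS, "n_lam"))  # {0: [9306, 29307]}
```

Both PIDs landing under key 0 is the same observation as above: neither
process was seen on CPU 1 in that snapshot.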

I include output from laminfo, dmesg (CPU-relevant info), and lamboot -d
bhost.lam. Any suggestions are most appreciated, and thanks in advance!

-- laminfo

           LAM/MPI: 7.0.4
            Prefix: /usr/local
      Architecture: amd64-unknown-freebsd5.2
     Configured by: root
     Configured on: Sat Apr 10 11:22:02 EDT 2004
    Configure host: jracine.maxwell.syr.edu
        C bindings: yes
      C++ bindings: yes
  Fortran bindings: yes
       C profiling: yes
     C++ profiling: yes
 Fortran profiling: yes
     ROMIO support: yes
      IMPI support: no
     Debug support: no
      Purify clean: no
          SSI boot: globus (Module v0.5)
          SSI boot: rsh (Module v1.0)
          SSI coll: lam_basic (Module v7.0)
          SSI coll: smp (Module v1.0)
           SSI rpi: crtcp (Module v1.0.1)
           SSI rpi: lamd (Module v7.0)
           SSI rpi: sysv (Module v7.0)
           SSI rpi: tcp (Module v7.0)
           SSI rpi: usysv (Module v7.0)

-- dmesg sees two cpus...

CPU: AMD Opteron(tm) Processor 248 (2205.02-MHz K8-class CPU)
  Origin = "AuthenticAMD"  Id = 0xf58  Stepping = 8

Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
  AMD Features=0xe0500800<SYSCALL,NX,MMX+,LM,3DNow!+,3DNow!>
real memory  = 3623813120 (3455 MB)
avail memory = 3494363136 (3332 MB)
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
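
Since dmesg reports two CPUs, one way to separate a scheduler problem from
a LAM problem might be to time CPU-bound work without MPI at all. The
sketch below is only an assumption about how such a check could look: it
starts one and then two busy-loop child processes and compares wall time.
Note that, unlike the mpirun comparison, each child here does the full
amount of work, so a ratio near 1.0 suggests both CPUs were used and a
ratio near 2.0 suggests the children were serialized on one CPU. The names
wall_time and scaling_ratio and the iteration count are arbitrary.

```python
import subprocess
import sys
import time

# CPU-bound busy loop executed by each child interpreter.
BUSY_LOOP = "x = 0\nfor i in range({n}):\n    x += i * i\n"

def wall_time(nprocs, iterations=5_000_000):
    """Start nprocs CPU-bound children at once; return elapsed wall-clock seconds."""
    code = BUSY_LOOP.format(n=iterations)
    start = time.perf_counter()
    children = [subprocess.Popen([sys.executable, "-c", code])
                for _ in range(nprocs)]
    for child in children:
        child.wait()
    return time.perf_counter() - start

def scaling_ratio(iterations=5_000_000):
    """Two-process wall time divided by one-process wall time."""
    one = wall_time(1, iterations)
    two = wall_time(2, iterations)
    return two / one

if __name__ == "__main__":
    print(f"two-process / one-process wall time: {scaling_ratio():.2f}")
```

Interpreter startup adds a fixed cost to every run, so a larger iteration
count gives a cleaner ratio.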

-- bhost has the requisite information

128.230.130.10 cpu=2 user=jracine
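
To make the fields in that line explicit, here is a small sketch that
splits a boot schema line into the host and its key=value options. It
assumes "#" starts a comment and that a missing cpu key means one CPU;
parse_boot_schema_line is a hypothetical helper, not part of LAM.

```python
def parse_boot_schema_line(line):
    """Split one boot schema line into (host, {key: value}).

    Returns None for blank or comment-only lines.
    """
    text = line.split("#", 1)[0].strip()  # drop any trailing comment
    if not text:
        return None
    host, *pairs = text.split()
    options = dict(pair.split("=", 1) for pair in pairs if "=" in pair)
    return host, options

host, options = parse_boot_schema_line("128.230.130.10 cpu=2 user=jracine")
# Assumed default of 1 when no cpu= key is present.
print(host, int(options.get("cpu", 1)))  # 128.230.130.10 2
```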

-- Here are the results from lamboot -d bhost.lam

-bash-2.05b$ lamboot -d ~/bhost.lam
n0<29283> ssi:boot: Opening
n0<29283> ssi:boot: opening module globus
n0<29283> ssi:boot: initializing module globus
n0<29283> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<29283> ssi:boot: module not available: globus
n0<29283> ssi:boot: opening module rsh
n0<29283> ssi:boot: initializing module rsh
n0<29283> ssi:boot:rsh: module initializing
n0<29283> ssi:boot:rsh:agent: rsh
n0<29283> ssi:boot:rsh:username: <same>
n0<29283> ssi:boot:rsh:verbose: 1000
n0<29283> ssi:boot:rsh:algorithm: linear
n0<29283> ssi:boot:rsh:priority: 10
n0<29283> ssi:boot: module available: rsh, priority: 10
n0<29283> ssi:boot: finalizing module globus
n0<29283> ssi:boot:globus: finalizing
n0<29283> ssi:boot: closing module globus
n0<29283> ssi:boot: Selected boot module rsh
 
LAM 7.0.4/MPI 2 C++/ROMIO - Indiana University
 
n0<29283> ssi:boot:base: looking for boot schema in following directories:
n0<29283> ssi:boot:base:   <current directory>
n0<29283> ssi:boot:base:   $TROLLIUSHOME/etc
n0<29283> ssi:boot:base:   $LAMHOME/etc
n0<29283> ssi:boot:base:   /usr/local/etc
n0<29283> ssi:boot:base: looking for boot schema file:
n0<29283> ssi:boot:base:   /home/jracine/bhost.lam
n0<29283> ssi:boot:base: found boot schema: /home/jracine/bhost.lam
n0<29283> ssi:boot:rsh: found the following hosts:
n0<29283> ssi:boot:rsh:   n0 jracine.maxwell.syr.edu (cpu=2)
n0<29283> ssi:boot:rsh: resolved hosts:
n0<29283> ssi:boot:rsh:   n0 jracine.maxwell.syr.edu --> 128.230.130.10 (origin)
n0<29283> ssi:boot:rsh: starting RTE procs
n0<29283> ssi:boot:base:linear: starting
n0<29283> ssi:boot:base:server: opening server TCP socket
n0<29283> ssi:boot:base:server: opened port 49832
n0<29283> ssi:boot:base:linear: booting n0 (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:rsh: starting lamd on (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:rsh: starting on n0 (jracine.maxwell.syr.edu): hboot -t -c lam-conf.lamd -d -I -H 128.230.130.10 -P 49832 -n 0 -o 0
n0<29283> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-io-socket
tkill: f_kill = "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
tkill: nothing to kill: "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
hboot: booting...
hboot: fork /usr/local/bin/lamd
[1]  29286 lamd -H 128.230.130.10 -P 49832 -n 0 -o 0 -d
n0<29283> ssi:boot:rsh: successfully launched on n0 (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:base:server: expecting connection from finite list
hboot: attempting to execute
n-1<29286> ssi:boot: Opening
n-1<29286> ssi:boot: opening module globus
n-1<29286> ssi:boot: initializing module globus
n-1<29286> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<29286> ssi:boot: module not available: globus
n-1<29286> ssi:boot: opening module rsh
n-1<29286> ssi:boot: initializing module rsh
n-1<29286> ssi:boot:rsh: module initializing
n-1<29286> ssi:boot:rsh:agent: rsh
n-1<29286> ssi:boot:rsh:username: <same>
n-1<29286> ssi:boot:rsh:verbose: 1000
n-1<29286> ssi:boot:rsh:algorithm: linear
n-1<29286> ssi:boot:rsh:priority: 10
n-1<29286> ssi:boot: module available: rsh, priority: 10
n-1<29286> ssi:boot: finalizing module globus
n-1<29286> ssi:boot:globus: finalizing
n-1<29286> ssi:boot: closing module globus
n-1<29286> ssi:boot: Selected boot module rsh
n0<29283> ssi:boot:base:server: got connection from 128.230.130.10
n0<29283> ssi:boot:base:server: this connection is expected (n0)
n0<29283> ssi:boot:base:server: remote lamd is at 128.230.130.10:50206
n0<29283> ssi:boot:base:server: closing server socket
n0<29283> ssi:boot:base:server: connecting to lamd at 128.230.130.10:49833
n0<29283> ssi:boot:base:server: connected
n0<29283> ssi:boot:base:server: sending number of links (1)
n0<29283> ssi:boot:base:server: sending info: n0 (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:base:server: finished sending
n0<29283> ssi:boot:base:server: disconnected from 128.230.130.10:49833
n0<29283> ssi:boot:base:linear: finished
n0<29283> ssi:boot:rsh: all RTE procs started
n0<29283> ssi:boot:rsh: finalizing
n0<29283> ssi:boot: Closing
n-1<29286> ssi:boot:rsh: finalizing
n-1<29286> ssi:boot: Closing
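
The one line in that log that bears on the CPU count is the "found the
following hosts" entry. As a rough sketch, assuming the debug line format
shown above stays the same, the node, hostname, and CPU count can be pulled
out with a regular expression; hosts_found is a hypothetical helper name.

```python
import re

# Matches entries such as:
#   n0<29283> ssi:boot:rsh:   n0 jracine.maxwell.syr.edu (cpu=2)
HOST_LINE = re.compile(r"ssi:boot:rsh:\s+(n\d+) (\S+) \(cpu=(\d+)\)")

def hosts_found(log_text):
    """Return [(node, hostname, cpu_count)] for each host entry in the log."""
    return [(m.group(1), m.group(2), int(m.group(3)))
            for m in HOST_LINE.finditer(log_text)]

sample = "n0<29283> ssi:boot:rsh:   n0 jracine.maxwell.syr.edu (cpu=2)"
print(hosts_found(sample))  # [('n0', 'jracine.maxwell.syr.edu', 2)]
```

Here the result confirms that lamboot itself registered cpu=2 for n0, so
the boot schema was read as intended.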




