IBM SP Systems Overview Exercise


  1. Log in to the workshop machine

    Workshops differ in how this is done. The instructor will go over this beforehand.

  2. Copy the example files

    1. In your home directory, create a subdirectory for the POE example codes and then cd to it.

      mkdir ~/poe
      cd ~/poe

    2. Copy either the Fortran or the C version of the exercise files to your poe subdirectory:

      C: cp /usr/local/spclass/blaise/poe/samples/C/*    ~/poe
      Fortran: cp /usr/local/spclass/blaise/poe/samples/Fortran/*    ~/poe

  3. List the contents of your poe subdirectory

    You should have the following files:

    C Files            Fortran Files      Description
    poe_hello.c        poe_hello.f        Simple MPI program which prints a task's rank and hostname.
    poe_bandwidth.c    poe_bandwidth.f    An MPI communications bandwidth test between two tasks only.
    smp_bandwidth.c    smp_bandwidth.f    An MPI communications bandwidth test between any even number of tasks.
    prog1, prog2,      prog1, prog2,      Simple shell scripts used for MPMD mode
    prog3, prog4       prog3, prog4
  4. Understand your system configuration

    1. Display the pool configuration for the workshop machine:

      js

      Questions:

      • Which pool number has been configured?
      • What is the name of the pool?
      • How many nodes are in the pool?
      • What are the names of the nodes in the pool?

      Make a note of the workshop pool and node names. You'll need them in later steps.

    2. Try the following commands, which also display information about running jobs:

      ju
      spjstat

      Note: since the workshop machine is reserved for the class, you probably won't see much. On a production machine, such as white or frost, you would see more.

  5. Authentication

    1. LLNL has already taken care of this step for you. You do not need to do anything.

    2. You can verify (if you want) that LLNL has authorized you to use these nodes. Check the /etc/hosts.equiv file. It should contain the names of all nodes in the system.

    Note: Not all SP sites use this method for authentication.

  6. Compile the poe_hello program

    Depending upon your language preference, use one of the IBM parallel compilers to compile the poe_hello program.

    C:
    mpcc -o poe_hello poe_hello.c
    Fortran:
    mpxlf -o poe_hello poe_hello.f 

  7. Set up your POE environment

    In this step you'll set a few POE environment variables, specifically those that answer three questions:

    • How many tasks/nodes do I need?
    • How will nodes be allocated?
    • How will communications be conducted (protocol and network)?

    Depending upon your shell, set the following environment variables as shown:

    Environment Variable   Setting   Description
    MP_PROCS               4         Request 4 MPI tasks (processes)
    MP_RESD                yes       Non-specific allocation (let the Resource Manager decide which nodes to use)
    MP_RMPOOL              0         This is the node pool number. Set it to the number zero.
    MP_EUIDEVICE           css0      Use the SP switch network interface
    MP_EUILIB              us        User Space protocol
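
The table gives only variable names and values; the command that sets them depends on your shell. As a minimal sketch, assuming a Bourne-family shell (sh, ksh, or bash), the settings could be applied as follows. The csh/tcsh form, which later steps of this exercise use, is noted in the comments.

```shell
# POE settings from the table above, in sh/ksh/bash syntax.
# csh/tcsh equivalent: setenv MP_PROCS 4   (and likewise for the rest)
export MP_PROCS=4          # request 4 MPI tasks
export MP_RESD=yes         # let the Resource Manager choose the nodes
export MP_RMPOOL=0         # node pool number
export MP_EUIDEVICE=css0   # SP switch network interface
export MP_EUILIB=us        # User Space protocol

# List every MP_ variable now set so typos are easy to spot.
env | grep '^MP_' | sort
```

Listing the variables before running is a cheap check, since a misspelled variable name is silently ignored by the shell.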

  8. Run your poe_hello executable

    1. This is the simple part. Just issue the command:

      poe_hello

    2. Provided that everything is working and set up correctly, you should receive output that looks something like the following (your node names may vary, of course):
      0:Total number of tasks = 4 
      0:Hello! From task 0 on host berg05.pacific.llnl.gov
      1:Hello! From task 1 on host berg06.pacific.llnl.gov
      2:Hello! From task 2 on host berg07.pacific.llnl.gov
      3:Hello! From task 3 on host berg08.pacific.llnl.gov
      

  9. Maximize your use of all 4 CPUs on a node

    The previous step was the most "wasteful" way to run a POE program, since by default, POE will load only one task on a node. To make better use of the SMP nodes, try the following:

    1. Run four poe_hello tasks on each of 2 nodes. Three different ways to do this are shown below, all of which use command line flags. The corresponding environment variables could be used instead. See the POE man page for details.

      Method 1: Specify POE flags for number of nodes and number of tasks:

      poe_hello -nodes 2 -procs 8

      Method 2: Specify POE flags for number of tasks per node and number of tasks:

      poe_hello -tasks_per_node 4 -procs 8

      Method 3: Specify POE flags for number of nodes and number of tasks per node:

      unsetenv MP_PROCS
      poe_hello -nodes 2 -tasks_per_node 4
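
The same layout can be requested through environment variables instead of flags. The sketch below assumes a Bourne-family shell, and assumes that -tasks_per_node corresponds to a variable named MP_TASKS_PER_NODE in the same way that -nodes corresponds to MP_NODES; check the POE man page on your system before relying on these names. It mirrors Method 3, so MP_PROCS is left unset and the total task count follows from the other two values.

```shell
# Environment-variable form of Method 3 (sh/ksh/bash syntax).
unset MP_PROCS                 # let POE derive the task count
export MP_NODES=2              # assumed equivalent of -nodes 2
export MP_TASKS_PER_NODE=4     # assumed equivalent of -tasks_per_node 4

# Total tasks that should start: nodes x tasks per node.
total_tasks=$((MP_NODES * MP_TASKS_PER_NODE))
echo "expecting $total_tasks tasks"

# Then run the program with no flags:
#   poe_hello
```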

  10. Try the poe_bandwidth exercise code

    1. Depending upon your language preference, compile the poe_bandwidth source file as shown:

      C:
      mpcc -o poe_bandwidth poe_bandwidth.c
      Fortran:
      mpxlf -o poe_bandwidth poe_bandwidth.f 

    2. Change a couple of environment variables:

      setenv MP_PROCS 2
      setenv MP_EUILIB ip

    3. Run the executable:

      poe_bandwidth

      As the program runs, it will display the effective communications bandwidth between two nodes using Internet Protocol (IP) over the SP switch.

      Sample output from poe_bandwidth using IP communications
      
         0:
         0:****** MPI/POE Bandwidth Test ******
         0:Message start size= 100000 bytes
         0:Message finish size= 1000000 bytes
         0:Incremented by 100000 bytes per iteration
         0:Roundtrips per iteration= 10
         0:Task 0 running on: berg05.pacific.llnl.gov
         0:Task 1 running on: berg06.pacific.llnl.gov
         0:
         0:Message Size   Bandwidth (bytes/sec)
         0:   100000        131277120
         0:   200000        160914005
         0:   300000        155605856
         0:   400000        159537625
         0:   500000        169940602
         0:   600000        192885165
         0:   700000        193172080
         0:   800000        222912305
         0:   900000        223830179
         0:  1000000        224911334
      

    4. Now, try running the executable again, but this time use a command line flag to specify the User Space communications protocol. The command line flag overrides whatever the MP_EUILIB environment variable is set to, which ensures that User Space is used.

      poe_bandwidth -euilib us

      Note: When you try this step, you may well get an error message that looks something like one of the following:

      ERROR: 0031-124 Less than XX nodes available from pool N

      - or -

      ERROR: 0031-365 LoadLeveler unable to run job, reason:
      LoadL_negotiator: 2544-870 Step blue199.pacific.llnl.gov.11575.0 was not
      considered to be run in this scheduling cycle due to its relatively low
      priority or because there are not enough free resources.

      This is because there may be others in the workshop using nodes in User Space mode at the same time as you. Recall that only one user at a time may run US tasks on a node. If you get this error message, just try running again in a few seconds/minutes.

    5. Notice the output. You should see a significant increase in bandwidth.

      Sample output from poe_bandwidth using US communications
      
         0:
         0:****** MPI/POE Bandwidth Test ******
         0:Message start size= 100000 bytes
         0:Message finish size= 1000000 bytes
         0:Incremented by 100000 bytes per iteration
         0:Roundtrips per iteration= 10
         0:Task 0 running on: berg05.pacific.llnl.gov
         0:Task 1 running on: berg06.pacific.llnl.gov
         0:
         0:Message Size   Bandwidth (bytes/sec)
         0:   100000        393351214
         0:   200000        437020474
         0:   300000        450266125
         0:   400000        456386278
         0:   500000        480783135
         0:   600000        991952069
         0:   700000        985073913
         0:   800000        985199935
         0:   900000        983501016
         0:  1000000        968929957
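
To put a number on the difference, the two sample runs can be compared directly. This sketch uses only the 1000000-byte figures from the two sample outputs in this step (your own measurements will differ) and relies on awk for the floating-point arithmetic.

```shell
# Bandwidth at the 1000000-byte message size, copied from the two sample runs.
ip_bw=224911334   # bytes/sec with MP_EUILIB set to ip
us_bw=968929957   # bytes/sec with -euilib us

# Convert to MB/s (10^6 bytes) and compute the US-to-IP ratio.
summary=$(awk -v ip="$ip_bw" -v us="$us_bw" 'BEGIN {
    printf "IP %.1f MB/s, US %.1f MB/s, ratio %.1f", ip / 1e6, us / 1e6, us / ip
}')
echo "$summary"
# prints: IP 224.9 MB/s, US 968.9 MB/s, ratio 4.3
```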
      

  11. Determine per-task communication bandwidth behavior

    In this exercise, pairs of tasks, located on two different nodes, will communicate with each other.

    1. First, make sure that the User Space protocol is used for communications:

      setenv MP_EUILIB us

    2. Compile the code:

      C:
      mpcc -o smp_bandwidth smp_bandwidth.c 
      Fortran:
      mpxlf -o smp_bandwidth smp_bandwidth.f  

    3. Then use the smp_bandwidth code to determine per-task bandwidth characteristics on an SMP node:

      smp_bandwidth -nodes 2 -procs 2
      smp_bandwidth -nodes 2 -procs 4
      smp_bandwidth -nodes 2 -procs 8
      smp_bandwidth -nodes 2 -procs 16

      What happens to the per-task bandwidth as the number of tasks increases?

  12. Optimize intra-node communication bandwidth

    When all of the task communications occur "on-node", it is possible to optimize the effective per-task bandwidth by utilizing shared memory instead of the network.

    1. First use shared memory and note the per-task bandwidth:

      setenv MP_SHARED_MEMORY yes
      smp_bandwidth -nodes 1 -procs 4
      smp_bandwidth -nodes 1 -procs 8

    2. Now try it without shared memory (using the network):

      setenv MP_SHARED_MEMORY no
      smp_bandwidth -nodes 1 -procs 4
      smp_bandwidth -nodes 1 -procs 8

      What differences do you notice?

  13. Try using POE's Multiple Program Multiple Data (MPMD) mode

    POE allows you to load and run different executables on different nodes. This is controlled by the MP_PGMMODEL environment variable.

    1. First, set some environment variables:

      Environment Variable   Setting   Description
      MP_PGMMODEL            mpmd      Specify MPMD mode
      MP_PROCS               4         Use 4 tasks again
      MP_NODES               1         Use one node for all four tasks
      MP_STDOUTMODE          ordered   Sort the output by task

    2. Then, simply issue the poe command.

    3. After a moment, you will be prompted to enter your executables one at a time. Notice that the machine name where the executable will run is displayed as part of the prompt. In any order you choose, enter these four program names, one per prompt:

      prog1 prog2 prog3 prog4

      Note: these four programs are just simple shell scripts used to demonstrate how to use the MPMD programming model.

    4. After the last program name is entered, POE will run all four executables. Observe their different outputs.
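
Typing program names at the prompts works for a demonstration but is awkward for repeated runs. POE can also read the program names from a commands file named by the MP_CMDFILE environment variable. The sketch below assumes that mechanism is available on your system, assumes a Bourne-family shell, and uses a hypothetical file name, mpmd.cmds. It gives one line per task, in task order.

```shell
# Write one program name per task (4 tasks, so 4 lines).
# mpmd.cmds is a hypothetical file name chosen for this sketch.
cat > mpmd.cmds <<'EOF'
prog1
prog2
prog3
prog4
EOF

export MP_PGMMODEL=mpmd        # MPMD mode, as in the table above
export MP_PROCS=4              # four tasks, one per line of the file
export MP_CMDFILE=mpmd.cmds    # assumption: POE reads program names from here

# Then start the job without interactive prompts:
#   poe
```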

  14. Try specific node allocation using a host list file

    Generally speaking, there aren't many cases where you'll need to "manually" select which nodes should be used to run your POE job. This step will demonstrate how to do it though, should you ever have the need.

    1. First, use your favorite UNIX editor to create a file named hostfile in your POE executables directory. As its contents, enter 4 different node names from the workshop node pool, one node name per line. The js command from step 4 shows the nodes in the workshop node pool.

    2. Set the appropriate POE environment variables which specify specific node allocation:

      Environment Variable   Setting      Description
      MP_RESD                no           Turn off selection by the Resource Manager - just to be sure
      MP_HOSTFILE            hostfile     Specify the host file you created
      MP_SAVEHOSTFILE        hosts_used   Save the names of the hosts used to run your program
      MP_EUILIB              ip           Required protocol for specific node allocation
      MP_PGMMODEL            spmd         Reset from mpmd used in the previous step

    3. Run the poe_hello executable again and observe the output. Does it match what you specified in your host list file?

    4. Check your hosts_used file, which was created when your program ran. Do the names match the ones specified in your host list file?
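
The first two parts of this step can be sketched as a short script. It assumes a Bourne-family shell and borrows the node names from the sample poe_hello output earlier in this exercise purely as placeholders; substitute four nodes from your own workshop pool.

```shell
# Create the host list file: one node name per line.
# These names are placeholders taken from the sample poe_hello output.
cat > hostfile <<'EOF'
berg05.pacific.llnl.gov
berg06.pacific.llnl.gov
berg07.pacific.llnl.gov
berg08.pacific.llnl.gov
EOF

# Settings for specific node allocation, as listed in the table above.
export MP_RESD=no
export MP_HOSTFILE=hostfile
export MP_SAVEHOSTFILE=hosts_used
export MP_EUILIB=ip
export MP_PGMMODEL=spmd

# After running poe_hello, inspect the hosts that were actually used:
#   cat hosts_used
```

Keeping the host list in a script makes it easy to rerun the experiment with a different set of nodes.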

  15. Review relevant LC documentation (or at least know where to find it):
    • LC Home Page - especially note the Machine Configurations link in the Machine Info section. Try the OCF Machine Status link also.
    • news job.limits command
    • news job.lim.blue command
    • news job.lim.frost command
    • Review the /etc/environment file
    • ASCI Blue web pages - see the "Running Jobs" section
    • ASCI White web pages - see the "Running Jobs" section

This concludes the POE exercise.

