IBM SP Systems Overview


Table of Contents

  1. Evolution of IBM's POWER Architectures
  2. LC IBM SP Systems
  3. SP Hardware Overview
    1. SP System Components
    2. SP Frames
    3. SP Nodes
    4. Switch Network
    5. Control Workstation
  4. Software and Development Environment
  5. Parallel Operating Environment (POE)
    1. Overview
    2. POE Definitions
  6. Compilers
  7. MPI
  8. Running on IBM SP Systems
    1. Understanding Your System Configuration
    2. Establishing Authorization
    3. Setting Up the Execution Environment
    4. Invoking the Executable
    5. Monitoring Job Status
    6. Interactive Job Specifics
    7. Batch Job Specifics
    8. Optimizing CPU Usage
  9. Debugging
  10. LC Specific Information
  11. Miscellaneous
    1. Parallel Environment Limits
    2. Run-time Analysis Tools
    3. Parallel File Copy Utilities
  12. References and More Information
  13. Exercise


Evolution of IBM's POWER Architectures

POWER1:
  • 1990: IBM announces the RISC System/6000 (RS/6000) family of superscalar workstations and servers based upon its new POWER architecture:
    • RISC = Reduced Instruction Set Computer
    • Superscalar = Multiple chip units (floating point unit, fixed point unit, load/store unit, etc.) execute instructions simultaneously with every clock cycle
    • POWER = Performance Optimized With Enhanced RISC

  • Initial configurations had a clock speed of 25 MHz, single floating point and fixed point units, and a peak performance of 50 MFLOPS.

  • Clusters are not new: networked configurations of POWER machines became common as distributed memory parallel computing started to become popular.
[Image: IBM RS/6000 processors]

SP1:
  • IBM's first SP (Scalable POWERparallel) system was the SP1. It was the logical evolution of clustered POWER1 computing. It was also short-lived, serving as a foot in the door to the rapidly growing distributed computing market; the SP2, which followed shortly after, was IBM's real entry point.

  • The key innovations of the SP1 included:
    • Reduced footprint: all of those real-estate consuming stand-alone POWER1 machines were put into a rack
    • Reduced maintenance: new software and hardware made it possible for a system administrator to manage many machines from a single console
    • High-performance interprocessor communications over an internal switch network
    • Parallel Environment software made it much easier to develop and run distributed memory parallel programs

  • The SP1 POWER1 processor had a 62.5 MHz clock with a peak performance of 125 MFLOPS.
[Image: IBM SP]

POWER2 and SP2:

P2SC:

PowerPC:

POWER3:
  • 1999: The POWER3 SMP architecture is announced. POWER3 represents a merging of the POWER2 uniprocessor architecture and the PowerPC SMP architecture.

  • Key improvements:
    • 64-bit architecture
    • Increased clock speeds
    • Increased memory, cache, disk, I/O slots, memory bandwidth....
    • Increased number of SMP processors

  • Several varieties were produced with very different specs. At the time they were made available, they were known as Winterhawk-1, Winterhawk-2, Nighthawk-1 and Nighthawk-2 nodes.

  • ASC White is based upon the POWER3-II (Nighthawk-2) processor.
[Image: IBM POWER3]

POWER4:
  • 2001: IBM introduces its latest 64-bit architecture, the POWER4. It is very different from its POWER3 predecessor.

  • The basic building block is a two processor SMP chip. Four chips can be joined to make an 8-way SMP "module". Combining modules creates 16, 24 and 32-way SMP machines.

  • Key improvements over POWER3 include:
    • Increased CPUs - up to 32 per node
    • Faster clock speeds - over 1 GHz
    • Increased memory, L2 cache, disk, I/O slots, memory bandwidth....
    • New L3 cache - shared between modules
[Image: IBM pSeries]

POWER5:

Future Generations:



LC IBM SP Systems

OCF

FROST: ASC White - Frost

BERG, NEWBERG:

UV: ASC UV


SCF

WHITE: ASC White

ICE:

UM: ASC UM

TEMPEST:

ASC PURPLE:



SP Hardware

SP System Components

There are five basic physical components of an SP system:

  • Frame - A containment unit consisting of a rack to hold computers, together with supporting hardware, including power supplies, cooling equipment and communication media.

  • Nodes - AIX RS/6000 workstations packaged to fit in the SP frame. A node has no display or keyboard, so user interaction must be done remotely.

  • Switch - The internal network medium that allows high-speed communication between nodes. Each frame has a switch board that connects its own nodes and also connects to switch boards in other frames.

  • Switch Adapter - Physically connects each node to the switch network.

  • Control Workstation (CWS) - A stand-alone AIX workstation with a display and keyboard. It possesses the hardware and software required to monitor and control the frames, nodes and switches of an entire SP system.
[Image: SP system components]



SP Frames

Frame Characteristics:


SP Nodes

Node Characteristics:

Node Types:

POWER3 Characteristics:

POWER4 Characteristics:

POWER Node Comparisons:

POWER Benchmarks:




Switch Network

Topology:

Switch Network Characteristics:

Hardware:

  • Two basic hardware elements comprise the SP switch network:

    Switch board - One switch board per SP frame. Contains 8 logical switch chips (16 physical chips, for reliability reasons). The 8 logical chips are wired as a bidirectional 4-way to 4-way crossbar.

    Communications adapter - Every node that is connected to the switch must have a switch adapter, which occupies one of the node's I/O expansion slots. The node's adapter is directly cabled into a corresponding port on the switch board. Future plans call for up to two adapters per node.

  • Switch boards are interconnected according to the number of nodes that comprise an SP system. Any node can talk to any other node by multiple paths. A sample 64-node switch configuration is shown below.
[Image: SP Switch]

Switch Communication Protocols:

Switch Application Performance:




Control Workstation

Single Point of Control:
  • Serves as the single point of control for System Support Programs used by System Administrators for system monitoring, maintenance and control.

  • Separate machine - not part of the SP frame

  • Must be a RISC System/6000

  • Connects to each frame with:
    • an RS-232 control line used to monitor frame, node and switch hardware
    • an external Ethernet LAN

  • Acts as an install server for the SP nodes

  • May also act as a file server (not recommended)

Single Point of Failure:

  • Because the Control Workstation can be a "single point of failure" for an entire SP system, IBM offers a High Availability Control Workstation (HACWS) configuration option.

  • Includes a duplicate/backup control workstation and software to handle automatic failover.
[Image: Control Workstation HACWS schematic]


Software and Development Environment


The software and development environment for the IBM SPs is similar to what is described in the Introduction to LC Resources tutorial. Items specific to the IBM SPs are discussed below.

AIX Operating System:

Parallel Environment:

Compilers:

Math Libraries Specific to IBM SPs:

Batch System:

User Filesystems:



Parallel Operating Environment (POE)

Overview

IBM Parallel Environment Software:

Parallel Operating Environment:

Types of Parallelism Supported:

Interactive and Batch Use:

Typical Usage Progression:




POE Definitions

Before learning how to use POE, it may be useful to understand some basic definitions.

SP / POWER / AIX / Linux Clusters
SP = Scalable POWERparallel Systems. Refers to IBM's product line of POWER architecture machines. All SP systems run under IBM's AIX operating system. There are numerous models and configurations, which change frequently. More recently, IBM has also been offering parallel Linux cluster systems; POE does not currently run on these systems.

Node
Within POE, a node refers to a single machine running its own copy of the AIX operating system. A node has a unique network name/address. Within a parallel POE system, a node can be booted and configured independently, or cooperatively with other nodes. All current IBM SP nodes are SMPs (see next definition).

SMP
Symmetric Multi-Processor. A computer (single machine/node) that has multiple CPUs configured to share, and arbitrate access to, a common memory. A single copy of the operating system serves all CPUs. IBM SP nodes vary in the number of CPUs they contain (2, 4, 8, 16, 32...).

Job
A job refers to the entire parallel application and typically consists of multiple processes/tasks.

Process / Task
Under POE, a task is an executable (a.out) that may be scheduled by AIX to run as a UNIX process on any available physical processor. Task and process are synonymous. For MPI applications, each MPI process is referred to as a "task" with a unique identifier, ranging from zero to the number of processes minus one.
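
As a sketch only (assuming a standard MPI installation and a C compiler), each task in a job can query its own identifier and the total number of tasks:

    #include <stdio.h>
    #include <mpi.h>

    /* Every task runs this same a.out; MPI assigns each task a unique
       rank from 0 to (number of tasks - 1). */
    int main(int argc, char *argv[])
    {
        int rank, ntasks;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's identifier */
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks); /* total number of tasks  */

        printf("Task %d of %d is running\n", rank, ntasks);

        MPI_Finalize();
        return 0;
    }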

Interprocess
Between different processes/tasks. For example, interprocess communications can refer to the exchange of data between different MPI tasks executing on different physical processors. The processors can be on the same node (SMP) or on different nodes.
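
As a minimal sketch (again assuming standard MPI, and a job with at least two tasks), task 0 sends an integer to task 1, which may be running on a different processor or node:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, value = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* task 0 sends one integer to task 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* task 1 receives the integer from task 0 */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("Task 1 received %d from task 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }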

Thread
A thread is an independently schedulable stream of instructions that exists within, and uses the resources of, an existing UNIX process/task. In the simplest sense, a thread can be thought of as a subroutine that runs independently of, and concurrently with, other subroutines in the same a.out. A task can have multiple threads, each of which may be scheduled to run on the multiple physical processors of an SMP node.
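
As a sketch only (using standard POSIX threads, one way of creating threads within a task), a single process can start several threads that may run concurrently on the CPUs of an SMP node:

    #include <stdio.h>
    #include <pthread.h>

    #define NTHREADS 4

    /* Each thread runs this routine independently, inside the same process.
       Link against the system's POSIX threads library when compiling. */
    void *work(void *arg)
    {
        long id = (long)arg;
        printf("Thread %ld running within one task\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTHREADS];
        long i;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, work, (void *)i);

        for (i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);

        return 0;
    }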

Pool
A pool is an arbitrary collection of nodes assigned by the system managers of an SP system. Pools are typically used to separate nodes into disjoint groups, each of which is used for specific purposes. For example, on a given system, some nodes may be designated as "login" nodes, while others are reserved for "batch" or "testing" use only.

[Image: Example SP system pools]

Partition
The group of nodes used to run your parallel job is called your partition. Across an SP system, there is usually one discrete partition for each user's job. Typically, the nodes in your partition are used exclusively by you for the duration of your job. After your job completes, the nodes may be allocated to other users' partitions. Under POE, a single Job Manager daemon manages the entire SP system and all user partitions.

Job Manager
The Job Manager in POE version 2.4+ is a function provided by LoadLeveler, IBM's batch scheduling software. When you request nodes to run your parallel job, LoadLeveler will find and allocate nodes for your use. LoadLeveler also enables user jobs to take advantage of multiple CPU SMP nodes, and keeps track of how the communications fabric (switch) is used.

Home Node / Remote Node
Your home node is the node where your parallel job is initiated. For interactive sessions it is the node where you are logged on. The home node may or may not be considered part of your partition depending upon how the SP system is configured. A Remote Node is any other non-home node in your partition.

Partition Manager
The Partition Manager, also known as the poe process, is a daemon process that is automatically started for you whenever you run a parallel job. There is one Partition Manager process for your entire parallel job. The Partition Manager process resides on your home node, and is responsible for overseeing the parallel execution of your POE job. The Partition Manager generally operates transparently to the user, and is responsible for performing many of the tasks associated with POE.

User Space Protocol
Often referred to simply as US protocol. The fastest method for MPI communications between tasks on different nodes. Only one user may use US communications on a node at any given time. Can only be conducted over the SP switch.

Internet Protocol
Often referred to simply as IP protocol. A slower, but more flexible method for MPI communications. Multiple users can all use IP communications on a node at the same time. Can be used with other network adapters besides the SP switch.

Non-Specific Node Allocation
Refers to the Job Manager (LoadLeveler) automatically selecting which nodes will be used to run your parallel job. Non-specific node allocation is usually the recommended (and default) method of node allocation. For batch jobs, this is typically the only method of node allocation available.

Specific Node Allocation
Enables the user to explicitly choose which nodes will be used to run a POE job. Requires the use of a "host list file", which contains the actual names of the nodes that must be used. Specific node allocation is only for interactive use, and recommended only when there is a reason for selecting specific nodes.


Compilers


Compilers and Compiler Scripts:

Compiler Syntax:

Compiler Invocation Commands:

Compiler Options:



MPI


Implementations:

Notes:



Running on IBM SP Systems

Understanding Your System Configuration

First Things First:

Some Hints for Understanding Your System:




Establishing Authorization

NOTE: This section can be omitted by LC users, as authorization is automatic.

About SP Authorization:

AIX Authorization:




Setting Up the Execution Environment

Setting POE Environment Variables:

Note: Different versions of POE software do not support identical sets of environment variables.

Basic POE Environment Variables

Example Basic Environment Variable Settings

Miscellaneous POE Environment Variables




Invoking the Executable

Syntax:

Multiple Program Multiple Data (MPMD) Programs:

Using POE with Serial Programs:

POE Error Messages:




Monitoring Job Status




Interactive Job Specifics

Running Interactive Jobs:

Insufficient Resources:

Killing Interactive Jobs:


Batch Job Specifics

On LC systems, LCRM (DPCS) is used for running batch jobs. LCRM is covered in detail in the LCRM (DPCS) Tutorial. This section only provides a quick summary of LCRM usage on the IBM SPs.

Submitting Batch Jobs:

Quick Summary of Common Batch Commands:

Batch Jobs and POE Environment Variables:

Killing Batch Jobs:




Optimizing CPU Usage

SMP Nodes:

Effectively Using Available CPUs:

When Not to Use All CPUs:



Debugging


Available Debugging Tools:

Using TotalView on the IBM SPs:

  1. Be sure to compile your program with the -g option

  2. When starting TotalView, specify poe as the process to debug, and then use TotalView's -a option to pass your program and any other arguments (including POE arguments). For example:
    totalview poe -a -procs 4 myprog
  3. TotalView will then load the poe process. Use the Go command to start poe with your executable.

  4. After you successfully start TotalView with your parallel job, you will be prompted about stopping the parallel job (below). In most cases, answering yes is the right thing to do.

    [Image: TotalView prompt]

  5. You will then be ready to begin debugging your parallel program.

  6. For debugging in batch, see Batch System Debugging in LC's TotalView tutorial.


LC Specific Information

Preset POE Environment Variables:

Additional Local Information:



Miscellaneous

Parallel Environment Limits:

Several of the more important POE environment limits are listed below. Others may be found in the Parallel Environment MPI Programming Guide manual (see the Appendix on Limits).

Run-time Analysis Tools:

Parallel File Copy Utilities:


This completes the tutorial.




References and More Information