IBM SP Systems Overview
Table of Contents
- Evolution of IBM's POWER Architectures
- LC IBM SP Systems
- SP Hardware Overview
- SP System Components
- SP Frames
- SP Nodes
- Switch Network
- Control Workstation
- Software and Development Environment
- Parallel Operating Environment (POE)
- Overview
- POE Definitions
- Compilers
- MPI
- Running on IBM SP Systems
- Understanding Your System Configuration
- Establishing Authorization
- Setting Up the Execution Environment
- Invoking the Executable
- Monitoring Job Status
- Interactive Job Specifics
- Batch Job Specifics
- Optimizing CPU Usage
- Debugging
- LC Specific Information
- Miscellaneous
- Parallel Environment Limits
- Run-time Analysis Tools
- Parallel File Copy Utilities
- References and More Information
- Exercise
Evolution of IBM's POWER Architectures
POWER1:
- 1990: IBM announces the RISC System/6000 (RS/6000)
family of superscalar workstations and servers based
upon its new POWER architecture:
- RISC = Reduced Instruction Set Computer
- Superscalar = Multiple chip units (floating point unit, fixed
point unit, load/store unit, etc.) execute instructions simultaneously
with every clock cycle
- POWER = Performance Optimized With Enhanced RISC
- Initial configurations had a clock speed of 25 MHz, single floating point
and fixed point units, and a peak performance of 50 MFLOPS.
- Clusters are not new: networked configurations of POWER machines became
common as distributed memory parallel computing started to become popular.
SP1:
- IBM's first SP (Scalable POWERparallel) system was the SP1. It was
the logical evolution of clustered POWER1 computing. It was also
short-lived, serving as a foot-in-the-door to the rapidly growing
market of distributed computing. The SP2 (shortly after) was IBM's
real entry point into distributed computing.
- The key innovations of the SP1 included:
- Reduced footprint: all of those real-estate consuming stand-alone
POWER1 machines were put into a rack
- Reduced maintenance: new software and hardware made it possible
for a system administrator to manage many machines from a
single console
- High-performance interprocessor communications over an internal switch
network
- Parallel Environment software made it much easier to develop and run
distributed memory parallel programs
- The SP1 POWER1 processor had a 62.5 MHz clock with peak performance of
125 MFLOPS
POWER2 and SP2:
- 1993: Continued improvements in the POWER1 processor architecture led to
the POWER2 processor. Some of the POWER2 processor improvements included:
- Floating point and fixed point units increased to two each
- Increased data cache size
- Increased memory to cache bandwidth
- Clock speed of 66.5 MHz with peak performance of 254 MFLOPS
- Improved instruction set (quad-word load/store, zero-cycle branches,
hardware square root, etc)
- Lessons learned from the SP1 led to the SP2, which incorporated the
improved POWER2 processor.
- SP2 improvements were directed at greater scalability and included:
- Better system and system management software
- Improved Parallel Environment software for users
- Higher bandwidth internal switch network
P2SC:
- 1996: The P2SC (POWER2 Super Chip) debuted. The P2SC was an improved
POWER2 processor with a clock speed of 160 MHz. This
effectively doubled the performance of POWER2 systems.
- Otherwise, it was virtually identical to the POWER2 architecture that
it replaced.
PowerPC:
- Introduced in 1993 as the result of a partnership between IBM, Apple and
Motorola, the PowerPC processor included most of the POWER
instructions. New instructions and features were added to support
SMPs.
- The PowerPC line had several iterations, finally ending with the 604e.
Its primary advantages over the POWER2 line were:
- Multiple CPUs
- Faster clock speeds
- Introduction of an L2 cache
- Increased memory, disk, I/O slots, memory bandwidth....
- Not much was heard in SP circles about PowerPC until the 604e,
  a 4-way SMP with a 332 MHz clock. The ASC Blue-Pacific
  system, at one time ranked as the most powerful
  computer on the planet, was based upon the 604e processor.
- The PowerPC architecture was IBM's entry into the SMP world and
eventually replaced all previous, uniprocessor architectures in the
SP evolution.
POWER3:
- 1999: The POWER3 SMP architecture is announced. POWER3 represents a
merging of the POWER2 uniprocessor architecture and the PowerPC
SMP architecture.
- Key improvements:
- 64-bit architecture
- Increased clock speeds
- Increased memory, cache, disk, I/O slots, memory bandwidth....
- Increased number of SMP processors
- Several varieties were produced with very different specs. At the
time they were made available, they were known as
Winterhawk-1, Winterhawk-2, Nighthawk-1 and Nighthawk-2 nodes.
- ASC White is based upon the POWER3-II (Nighthawk-2) processor.
POWER4:
- 2001: IBM introduces its latest 64-bit architecture, the POWER4.
It is very different from its POWER3 predecessor.
- The basic building block is a two processor SMP chip.
Four chips can be joined to make an 8-way SMP "module".
Combining modules creates 16, 24 and 32-way SMP machines.
- Key improvements over POWER3 include:
- Increased CPUs - up to 32 per node
- Faster clock speeds - over 1 GHz
- Increased memory, L2 cache, disk, I/O slots, memory bandwidth....
- New L3 cache - shared between modules
POWER5 and Future Generations:
- POWERX: Continued evolution of the POWER line (POWER6, POWERn....).
Faster clocks, faster switch, more storage, more I/O, etc.
- Linux Clusters and POWER evolve together:
  - IBM offers clustered Linux solutions that combine IBM hardware
    and software with other vendor products such as Intel processors,
    Myrinet networks, and Red Hat Linux software.
- Clustered Linux systems are being marketed alongside POWER
systems as IBM's high-performance pSeries.
- AIX is becoming increasingly Linux friendly
- BLUE GENE: - Completely new IBM architecture that has nothing to do with
SPs.
- A couple quotes from IBM:
"at least 15 times faster, 15 times more power efficient and consume
about 50 times less space per computation
than today's fastest supercomputers."
"Blue Gene will consist of more than one million processors, each
capable of one billion operations per second (1 gigaflop).
Thirty-two of these ultra-fast processors will be placed on a
single chip (32 gigaflops). A compact two-foot by two-foot board
containing 64 of these chips will be capable of 2 teraflops,
making it as powerful as the 8000-square foot ASCI computers.
Eight of these boards will be placed in 6-foot-high racks
(16 teraflops), and the final machine (less than 2000 sq. ft.)
will consist of 64 racks linked together to achieve the one
petaflop performance."
- More information from IBM:
www.research.ibm.com/bluegene
- Blue Gene/L is a collaboration between IBM and LLNL. For more
information see:
www.llnl.gov/asc/platforms/bluegenel
OCF
FROST:
- 68 nodes
- 16 processors per node
- Processor type: POWER3 "Nighthawk-2" at 375 MHz,
- 1.6 TFLOPS system peak performance
- 16 GB memory per node
- 1088 GB of RAM total
- 20.6 TB global disk (GPFS)
- Colony switch
BERG, NEWBERG:
- BERG: Single 32-processor machine configured into 3 logical nodes
- NEWBERG: Single 32-processor machine configured into 4 logical nodes
- Processor type: POWER4 at 1.3 GHz
- 166 GFLOPS peak performance per 32-processor machine
- Memory per node varies
- Federation switch interconnect for NEWBERG nodes
- No GPFS
- Primarily for prototyping, testing, porting for POWER4
architecture
UV:
- Part of ASC Purple early delivery. Limited Availability (LA) status while on
the OCF. Will move to SCF in 1/05 and then become Generally Available (GA)
- 128 nodes
- 8 processors per node
- Processor type: Power4 p655 at 1.5 GHz
- 16 GB memory per node
- 2.048 TB of RAM total
- 68 TB global disk (GPFS)
- Federation switch
SCF
WHITE:
- 512 nodes
- 16 processors per node
- Processor type: POWER3 "Nighthawk-2" at 375 MHz,
- 12.3 TFLOPS system peak performance
- 16 GB memory per node
- 8192 GB of RAM total
- 109 TB global disk (GPFS)
- Colony switch
ICE:
- 27 nodes
- 16 processors per node
- Processor type: POWER3 "Nighthawk-2" at 375 MHz,
- 0.7 TFLOPS system peak performance
- 16 GB memory per node
- 448 GB of RAM total
- 5.7 TB global disk (GPFS)
- Colony switch
UM:
- Part of ASC Purple early delivery
- 128 nodes
- 8 processors per node
- Processor type: Power4 p655 at 1.5 GHz
- 16 GB memory per node
- 2.048 TB of RAM total
- 68 TB global disk (GPFS)
- Federation switch
TEMPEST:
- Serial/single-node computing resource
- 12 nodes total
9 nodes with 4 processors per node
3 nodes with 16 processors per node
- Processor type: Power5 at 1.65 GHz
- 9 nodes with 32 GB memory per node
3 nodes with 64 GB memory per node
- 480 GB of RAM total
- No switch, no GPFS
ASC PURPLE:
- 100 TFLOPS system peak performance
- 1528 nodes
- 8 processors per node
- Processor type: Power5
- Federation switch
- Expected availability in mid-2005
SP System Components
There are five basic physical components of an SP system:
- Frame - A containment unit consisting of a rack to hold
computers, together with supporting hardware, including power supplies,
cooling equipment and communication media.
- Nodes - AIX RS/6000 workstations packaged to fit in the
  SP frame. A node has no display head or keyboard, so user
  interaction must be done remotely.
- Switch - The internal network medium that allows high-speed
  communication between nodes. Each frame has a switch board that connects
  its own nodes and also connects to switch boards in other frames.
- Switch Adapter - Physically connects each node to the switch
network.
- Control Workstation (CWS) - A stand-alone AIX workstation
  with a display and keyboard. It possesses the hardware and software required to
  monitor and control the frames, nodes and switches of an entire SP system.
SP Frames
Frame Characteristics:
- An SP system is composed of 1 or more SP frames. Typically, a frame
houses multiple processor nodes, the switch network, and power supply
hardware.
- Internal high speed switch network
- Redundant frame power
- Air cooled
- Concurrent node maintenance
- 1 - 16 nodes per frame - depends on node types (thin/wide/high).
Nodes fit into designated frame drawers / slots
- Different node types can be mixed within a frame.
- For larger systems (over 80 nodes), frames are used to house
intermediate switch hardware components
- POWER4 frames are different from those used for previous POWER
  architectures. Nodes, power supply, media drawers, I/O drawers
  and/or switch hardware can be mixed and matched in a number of ways.
SP Nodes
Node Characteristics:
- A node, in terms of the SP, is defined as a single,
  stand-alone machine self-contained in a "box" which is mounted in an
  SP frame slot/drawer. SP systems follow the "shared nothing" model.
- Each node is built around one of the POWER family of processors, such as a
  604e, POWER3 or POWER4 processor. Even though a node may be an SMP with
  multiple CPUs, it is still considered a single node.
- Every node has its own independent hardware resources, such as:
- I/O hardware, including disk drives.
- Network adapters, including an adapter for the
internal switch network.
- Memory resources, including memory cards and caches.
- Power and cooling equipment.
- One copy of the AIX operating system per node. For SMP nodes, a
single copy of the operating system is shared by all CPUs.
Node Types:
- Prior to POWER4, SP nodes came in three different "packages":
- Thin - occupies one "slot" in the frame.
- Wide - occupies two horizontally adjacent slots in the
frame. "Wideness" is due to hardware and space allowed for extra
I/O adapter expansion slots and disk/media bays.
- High - occupies four adjacent (2x2) slots in the frame.
- The POWER4 node packaging introduced a 32-way SMP node (among other
  types) that occupies one-half of the POWER4 frame. POWER4 nodes with
  fewer CPUs occupy less space.
- The most common types of SP nodes found "in the field" currently are:
- POWER3 thin, wide and high nodes
- POWER4
POWER3 Characteristics:
- SMP - 2-16 CPUs per node
- 64-bit architecture and address space
- There are 4 different types of POWER3 nodes, referred to as Winterhawk-1,
Winterhawk-2, Nighthawk-1 and Nighthawk-2.
- Clock speeds range from 200 - 450 MHz
- Memory/Cache:
- Separate 64 KB data and 32 KB instruction caches per CPU
- 4 or 8 MB L2 cache per CPU
- Up to 64 GB shared memory
- L2 cache has its own bus - can be accessed simultaneously with main
memory
- Separate data and address buses.
- I/O & switch bus operates concurrently and independently from
memory-cache bus
- Superscalar - 8 execution units:
- 2 floating point units
- 3 fixed point units
- 2 load/store units
- Branch unit
- Condition Register Unit
POWER4 Characteristics:
- SMP - 8-32 CPUs per node
- 64-bit architecture and address space
- Clock speeds range from 1 - 1.9 GHz
- The basic building block is a "module" comprised of 4 chips with each
chip having 2 CPUs. 4 modules can be combined to form a 32-way SMP.
- Memory/Cache:
- Separate 64 KB data and 32 KB instruction caches per CPU
- 1.5 MB L2 cache shared per chip
- 32 MB of L3 cache per chip
- L2 and L3 cache are logically shared by all chips on a module
- Chip-to-chip bandwidth is 35 GB/sec
- Up to 1024 GB shared memory on a 32-way node
- Fast I/O interface (GXX bus) onto chip - ~1.7 TB/sec
- Superscalar - 8 execution units:
- 2 floating point units
- 2 fixed point units
- 2 load/store units
- Branch resolution unit
- Condition Register Unit
POWER Node Comparisons:
| NODE TYPE                     | POWER3 Winterhawk-1 | POWER3 Winterhawk-2 | POWER3 Nighthawk-1 | POWER3 Nighthawk-2 | POWER4 p690 model |
| Packaging                     | Thin/Wide           | Thin/Wide           | High               | High               | 2X High           |
| CPUs per node                 | 1/2                 | 2/4                 | 2/4/6/8            | 4/8/12/16          | 8/16/24/32        |
| Clock (MHz)                   | 200                 | 375                 | 222                | 375                | 1900              |
| Max. Mflops/CPU               | 800                 | 1500                | 888                | 1500               | 7600              |
| Memory (GB) min-max           | 0.256-4             | 0.256-16            | 1-16               | 1-64               | 8-1024            |
| L1 Cache data/instruc (KB)    | 64/32               | 64/32               | 64/32              | 64/32              | 64/32             |
| L2 Cache per CPU              | 4 MB                | 8 MB                | 4 MB               | 8 MB               | 1.44 MB*          |
| L3 Cache                      | n/a                 | n/a                 | n/a                | n/a                | 32 MB*            |
| Memory-CPU Bandwidth (GB/sec) | 1.6                 | 6.0                 | 14.5               | 16                 | n/a               |
| L2 Bandwidth (GB/sec)         | 6.4                 | 12                  | 7.1                | 12                 | 124*              |
| L3 Bandwidth (GB/sec)         | n/a                 | n/a                 | n/a                | n/a                | 11*               |
| Max. I/O Slots (PCI)          | 2 thin / 10 wide    | 2 thin / 10 wide    | 53                 | 53                 | 160               |
| Max. Disk/Media Bays          | 2 thin / 4 wide     | 2 thin / 4 wide     | 26                 | 26                 | 128               |
| Max. Disk (GB)                | 36 thin / 109 wide  | 36 thin / 109 wide  | 946                | 946                | 9300              |
* = per 2-CPU chip
POWER Benchmarks:
Switch Network
Topology:
- The SP switch network provides the internal, high
performance message passing fabric that connects all of the
SP processors together into a single system.
- The SP switch network is classed as a bidirectional, multistage
  interconnection network (MIN).
- Bidirectional:
Any-to-any internode connection allows all
processors to send messages simultaneously. Each point to
point connection between nodes is comprised of two channels
(full duplex) that can carry data in opposite directions
simultaneously.
- Multistage Interconnection:
On larger systems (over 80 nodes), additional intermediate switches
are added as the system is scaled upward.
Switch Network Characteristics:
- Packet-switched network (versus circuit-switched)
- Support for multi-user environment - multiple jobs may run
simultaneously over the switch (one user does not monopolize switch)
- Path redundancy - multiple routings between any two nodes, with a
minimum of four paths between any pair of nodes. Permits
routes to be generated even when there are faulty components in
the system.
- Built-in error detection
- Hardware redundancy for reliability
- Architected for expansion to 1000s of ports
- Hardware latency: under 300 ns up to 80 nodes. Slightly higher if
intermediate switch boards are traversed.
Hardware:
- Two basic hardware elements comprise the SP switch network:
  - Switch board - One switch board per SP frame. Contains 8 logical
    switch chips with 16 physical chips for reliability reasons. The
    8 logical chips are wired as a bidirectional 4-way to 4-way crossbar.
  - Communications adapter - Every node that is connected to the
    switch must have a switch adapter, which occupies one of the node's I/O
    expansion slots. The node's adapter is directly cabled into a
    corresponding port on the switch board.
    Future plans call for up to two adapters per node.
- Switch boards are interconnected according to the number of nodes that
  comprise an SP system. Any node can talk to any other node by multiple
  paths.
- There are different types of switches and adapters to match different
types of nodes. The two most common types of switches currently are:
- SP2 (Colony) Switch: POWER3 nodes
- Federation Switch: POWER4 nodes
Switch Communication Protocols:
- Applications can use one of two available communications protocols, either
US or IP
- US - User Space Protocol. Preferred protocol due to performance.
- IP - Internet Protocol. Slower protocol. Used for communications
by jobs that span multiple IBM SP systems.
- Usage details are covered later in this tutorial.
Switch Application Performance:
- An application's communication performance over the SP switch is
dependent upon several factors:
- Node type
- Switch and switch adapter type
- Communications protocol used
- On-node vs. off-node proximity
- Application specific characteristics
- Network tuning parameters
- Competing network traffic
- Theoretical peak bi-directional performance:
- SP2 (Colony) Switch: 1 GB/sec
- Federation Switch: 4 GB/sec
- The table below shows performance metrics for a two task MPI point-to-point
message passing program under different configurations and conditions.
Executions were performed on LC production systems in "batch" mode.
| Switch Type / Node Type           | Protocol | Task Proximity  | Latency (usec) | Pt to Pt Bandwidth (MB/sec) |
| SP2 Switch, POWER3 NH-2 375 MHz   | IP       | Between 2 nodes | 105            | 77   |
| SP2 Switch, POWER3 NH-2 375 MHz   | IP       | On same node    | 70             | 82   |
| SP2 Switch, POWER3 NH-2 375 MHz   | US       | Between 2 nodes | 20             | 390  |
| SP2 Switch, POWER3 NH-2 375 MHz   | US       | On same node    | 20             | 334  |
| Federation Switch, POWER4 1.5 GHz | IP       | Between 2 nodes | 32             | 318  |
| Federation Switch, POWER4 1.5 GHz | IP       | On same node    | 21             | 335  |
| Federation Switch, POWER4 1.5 GHz | US       | Between 2 nodes | 6              | 1542 |
| Federation Switch, POWER4 1.5 GHz | US       | On same node    | 6              | 1201 |
Source code for the test program: switchtests.c
Control Workstation
Single Point of Control:
- Serves as the single point of control for System Support Programs
used by System Administrators for system monitoring, maintenance and
control.
- Separate machine - not part of the SP frame
- Must be a RISC System/6000
- Connects to each frame with
- RS-232 control line used to monitor frame, node and switch hardware
- external ethernet LAN
- Acts as install server for other SP nodes
- May also act as a file server (not recommended)
Single Point of Failure:
- Because the Control Workstation can be a "single point of failure" for
an entire SP system, IBM offers a High Availability Control Workstation
(HACWS) configuration option
- Includes a duplicate/backup
control workstation and software to handle automatic failover.
Software and Development Environment
The software and development environment for the IBM SPs is similar to what is
described in the Introduction to LC Resources tutorial. Items specific
to the IBM SPs are discussed below.
AIX Operating System:
- AIX is built upon UNIX System V and Berkeley Software
Distribution 4.3 (4.3 BSD), with conformance to the Portable
Operating System Interface for Computer Environments (POSIX)
IEEE 1003.1-1990.
- Beginning with AIX 5, IBM is moving in the direction of "Linux Affinity"
which will allow Linux applications to be recompiled and run under AIX.
- Every SP node runs its own copy of the AIX operating system. SMP nodes
have a single copy of the operating system which is threaded for all
CPUs.
- AIX product information is available from IBM on the WWW.
Parallel Environment:
- IBM's Parallel Environment software provides the parallel user environment
on IBM SP systems. It encompasses a collection of software tools and
libraries designed for developing, executing, debugging and profiling
parallel C, C++ and Fortran programs.
- Complete documentation for all components of the Parallel Environment
software is available online in IBM's Parallel
Environment Manuals. Parallel Environment topics are also discussed
in the POE section below.
Compilers:
- IBM - C/C++ and Fortran compilers. Covered
later.
- Guide - KAI OpenMP C, C++ and Fortran compilers
- gcc, g77 - GNU C, C++ and Fortran compilers
Math Libraries Specific to IBM SPs:
- ESSL - IBM's Engineering Scientific Subroutine Library.
See IBM's online
ESSL Manuals for information.
- PESSL - IBM's Parallel Engineering Scientific Subroutine Library.
A subset of ESSL that has been parallelized. Documentation is located with
ESSL documentation mentioned above.
- MASS - Math Acceleration Subsystem. High performance versions of
most math intrinsic functions. Scalar versions and vector versions.
Batch System:
- LoadLeveler - IBM's native batch system for the IBM SP. At LC,
users generally do not interact directly with LoadLeveler. Instead,
LCRM (DPCS) is used.
- LCRM (DPCS) - LC's batch system. Covered in depth in the
LCRM (DPCS) Tutorial.
User Filesystems:
- As usual - home directories, /nfs/tmp, /var/tmp, /tmp,
/usr/gapps, archival storage. For more information see the
Introduction to LC Resources tutorial.
- General Parallel File System (GPFS) - IBM's parallel filesystem
available on all of LC's IBM SPs. GPFS is discussed in the
Parallel File Systems section of the Introduction
to Livermore Computing Resources tutorial.
Parallel Operating Environment (POE)
Overview
IBM Parallel Environment Software:
- IBM's Parallel Environment (PE) software product encompasses a
collection of software tools designed to provide an
environment for developing, executing, debugging and profiling
parallel C, C++ and Fortran programs.
- Some of the Parallel Environment's key components include:
- Parallel compiler scripts
- Facilities to manage your parallel execution environment (environment
variables and command line flags)
- Message Passing Interface (MPI) library
- Most of MPI-2
- Low-level Application Programming Interface (LAPI)
- Parallel file copy utilities
- Authentication utilities
- Parallel debugger
- Parallel profiling tools
- Dynamic probe class library (DPCL) - an API for parallel tool
development (Note that DPCL has been open sourced but is still
distributed with PE).
- Much of what the Parallel Environment does is designed to be
transparent to the user. Some of these tasks include:
- Linking to the necessary parallel libraries during compilation (via
parallel compiler scripts)
- Finding and acquiring machine resources for your parallel job
- Loading your executable onto each processor
- Handling all stdin, stderr and stdout for each processor of your
parallel job
- Signal handling
- Providing parallel communications support
- Managing the use of processor and network adapter resources
- Retrieving system and job status information when requested
- Error detection and reporting
- Providing support for run-time profiling and analysis tools
Parallel Operating Environment:
- Technically, the Parallel Operating Environment (POE) is just one
part of IBM's Parallel Environment (PE) software product.
However, for discussion purposes and from a user perspective, this
tutorial will consider the two synonymous.
- POE was originally designed for a distributed memory, message passing based
system (early SPs). It has been continually updated to keep pace with
IBM's hardware progression - SMPs with an increasing number of CPUs and
a clustered-SMP environment.
- POE is unique to the IBM AIX environment. It runs only on
the IBM POWER platforms.
- POE can also be used to run serial jobs and shell commands concurrently
across a network of machines.
Types of Parallelism Supported:
- POE is primarily designed for process level (MPI) parallelism, but fully
supports threaded and hybrid (MPI + threads) parallel programs also.
- Process level MPI parallelism is directly managed by POE from compilation
through execution.
- Thread level parallelism is "handed off" to the compiler, threads library
and OS. For single-node threaded applications, POE is not actually needed
at all and is in fact not recommended.
- For hybrid programs, POE manages the MPI tasks, and
lets the compiler, threads library and OS manage the threads.
- POE fully supports the Single Program Multiple Data (SPMD) and
  Multiple Program Multiple Data (MPMD) models for parallel programming.
- For more information about parallel programming, MPI, OpenMP and
POSIX threads, see the tutorials listed on the
LC Training web page.
Interactive and Batch Use:
- POE can be used both interactively and within a batch (scheduler)
system to compile, load and run parallel jobs.
- There are many similarities between interactive and batch POE usage.
There are also important differences. These will
be pointed out later as appropriate.
Typical Usage Progression:
- The typical progression of steps for POE usage is outlined below,
and discussed in more detail in following sections.
  1. Understand your system's configuration
  2. Establish authorization on all nodes that you will use
  3. Compile and link the program using one of the POE parallel
     compiler scripts
  4. Set up your execution environment by setting the necessary POE
     environment variables. Create a host list file if using specific node
     allocation.
  5. Start any run-time analysis tools (optional)
  6. Invoke the executable
- Note: not all of these steps need to be performed every time you use
  POE - in particular, steps 2 and 5 are often omitted.
Parallel Operating Environment (POE)
POE Definitions
Before learning how to use POE, understanding some basic definitions may be
useful.
- SP / POWER / AIX / Linux Clusters
- SP = Scalable POWERparallel Systems. Refers to IBM's product line
of POWER architecture machines. All SP systems run under IBM's AIX
operating system. There are numerous models and configurations, always
changing.
More recently, IBM is also offering parallel Linux cluster systems.
POE does not currently run on these systems.
- Node
- Within POE, a node refers to a single machine, running
its own copy of the AIX operating system. A node has a unique network
name/address. Within a parallel POE system, a node can be booted and
configured independently, or cooperatively with other nodes.
All current IBM SP nodes are SMPs (next).
- SMP
- Symmetric Multi-Processor. A computer (single machine/node) that has
  multiple CPUs which share and arbitrate access to a common memory.
A single copy of the operating system serves all CPUs. IBM SP nodes
vary in the number of CPUs they contain (2, 4, 8, 16, 32...)
- Job
- A job refers to the entire parallel application and typically consists
of multiple processes/tasks.
- Process / Task
- Under POE, an executable (a.out) that may be scheduled to run
by AIX on any available physical processor as a UNIX process is
considered a task. Task and process are synonymous. For MPI applications,
each MPI process is referred to as a "task" with a unique identifier
starting at zero up to the number of processes minus one.
- Interprocess
- Between different processes/tasks. For example, interprocess
communications can refer to the exchange of data between different
MPI tasks executing on different physical processors. The processors
can be on the same node (SMP) or on different nodes.
- Thread
- A thread is an independently schedulable stream of instructions that
exists within, and uses the resources of an existing UNIX process/task.
In the simplest sense, the idea of a subroutine that can be scheduled
to run independently from, and concurrently with other subroutines in the
same a.out describes a thread. A task can have multiple threads, each of
which may be scheduled to run on the multiple physical processors of an
SMP node.
- Pool
- A pool is an arbitrary collection of nodes assigned by the
system managers of an SP system. Pools are typically used to
separate nodes into disjoint groups, each of which is used for specific
purposes. For example, on a given system, some nodes may be designated
as "login" nodes, while others are reserved for "batch" or "testing"
use only.
Example SP System Pools
- Partition
- The group of nodes used to run your parallel job is
called your partition. Across an SP system, there is usually one
discrete partition for each user's job. Typically, the nodes in your
partition are used exclusively by you for the
duration of your job. After your job completes, the nodes may be
allocated for other users' partitions. Under POE, a single Job Manager
daemon manages the entire SP system and all user partitions.
- Job Manager
- The Job Manager in POE version 2.4+ is a function provided by LoadLeveler,
IBM's batch scheduling software. When you request nodes to run your
parallel job, LoadLeveler will find and allocate nodes for your use.
LoadLeveler also enables user jobs to take advantage of multiple CPU
SMP nodes, and keeps track of how the communications fabric (switch)
is used.
- Home Node / Remote Node
- Your home node is the node where your parallel job is initiated. For
interactive sessions it is the node where you are logged on.
The home node may or may not be considered part of your partition
depending upon how the SP system is configured.
A Remote Node is any other non-home node in your partition.
- Partition Manager
- The Partition Manager, also known as the poe process,
is a daemon process that is automatically started for you whenever
you run a parallel job. There is one Partition Manager process for your
entire parallel job.
The Partition Manager process resides on your home node, and is
responsible for overseeing the parallel execution of your POE job.
The Partition Manager generally operates transparently to the user, and
is responsible for performing many of the tasks associated with POE:
- Obtains the nodes for your parallel job through communications
with the Job Manager.
- Establishes socket connections with each of the tasks in your
partition.
- Sets up your user environment and validation for each task
in your partition.
- Starts a pmd daemon process for each parallel task.
This process serves as the point of contact between
the Partition Manager and each task - it actually becomes the parent
process of your executable.
- Dynamically links the specified communications library to your
executable.
- Manages stdin, stderr and stdout for all of your parallel tasks.
- Exchanges control information (signaling, synchronization, exit
status) with the pmd process on each node.
- User Space Protocol
- Often referred to simply as US protocol.
The fastest method for MPI communications between tasks on different nodes.
Only one user may use US communications on a node at any given time.
Can only be conducted over the SP switch.
- Internet Protocol
- Often referred to simply as IP protocol.
A slower, but more flexible method for MPI communications.
Multiple users can all use IP communications on a node at the same time.
Can be used with other network adapters besides the SP switch.
- Non-Specific Node Allocation
- Refers to the Job Manager (LoadLeveler) automatically selecting which nodes
will be used to run your parallel job. Non-specific node allocation is
usually the recommended (and default) method of node allocation. For
batch jobs, this is typically the only method of node allocation
available.
- Specific Node Allocation
- Enables the user to explicitly choose which nodes will be used to run
a POE job. Requires the use of a
"host list file",
which contains the actual names of the nodes that must be used.
Specific node allocation is only for interactive use, and recommended
only when there is a reason for selecting specific nodes.
Compilers and Compiler Scripts:
- In IBM's Parallel Environment, there are a number of compiler invocation
commands, depending upon what you want to do. However, underlying all
of these commands are the same AIX C/C++ and Fortran compilers.
- The POE parallel compiler commands are actually scripts that
automatically link in the necessary Parallel Environment libraries,
include files, etc. and then call the appropriate native AIX compiler.
- For the most part, the native IBM compilers and their parallel
compiler scripts support a common command line syntax.
Most compiler "flavors" also support common options.
- See the References and More Information section
for links to IBM compiler documentation.
Compiler Syntax:
[compiler] [options] [source_files]
For example:
mpxlf -g -O3 -qlist -o myprog mprog.f
Compiler Invocation Commands:
- Compiler invocation commands: the table below summarizes which
compiler "flavors" are typically available - although some may be
dependent upon your site's license situation and software level.
- Note that not all of the IBM compiler invocation commands are shown.
Other compiler commands are available to select IBM compiler extensions
and features. Consult the appropriate IBM compiler man page and
User's Guide for details.
IBM Compiler Invocation Commands

Serial:
| xlc / cc  | ANSI C compiler / Extended C compiler (not strict ANSI) |
| xlC       | C++ compiler |
| xlf / f77 | Fortran 77 compatible; subset of Fortran 90 compiler |
| xlf90     | Full Fortran 90 with IBM extensions |

Threads (OpenMP, Pthreads, IBM threads):
| xlc_r   | xlc for use with threaded programs |
| xlC_r   | xlC for use with threaded programs |
| xlf_r   | xlf for use with threaded programs |
| xlf90_r | xlf90 for use with threaded programs |

MPI:
| mpcc    | Compiler script for parallel C programs using MPI |
| mpCC    | Compiler script for parallel C++ programs using MPI |
| mpxlf   | Compiler script for parallel Fortran 77 programs using MPI |
| mpxlf90 | Compiler script for parallel Fortran 90 programs using MPI |

MPI with Threads (OpenMP, Pthreads, IBM threads):
| mpcc_r    | Parallel C compiler script for hybrid MPI/threads programs |
| mpCC_r    | Parallel C++ compiler script for hybrid MPI/threads programs |
| mpxlf_r   | Parallel Fortran 77 compiler script for hybrid MPI/threads programs |
| mpxlf90_r | Parallel Fortran 90 compiler script for hybrid MPI/threads programs |
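As a hedged illustration (the program and source file names are hypothetical),
a few typical invocations of these commands might look like the following:

    # serial Fortran 90
    xlf90 -O3 -o ser_prog ser_prog.f

    # OpenMP threaded C
    xlc_r -qsmp=omp -O3 -o omp_prog omp_prog.c

    # MPI Fortran 77
    mpxlf -g -O3 -o mpi_prog mpi_prog.f

    # hybrid MPI + OpenMP C
    mpcc_r -qsmp=omp -O3 -o hybrid_prog hybrid_prog.c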
Compiler Options:
- Compiler options: all of the IBM compilers include many options - too
numerous to be covered here. Many of these options are listed in the
man pages hyperlinked to the compiler commands in the above table.
- For a full discussion, users are advised to consult the IBM documentation
  for details. An abbreviated summary of some of the more useful
  options is listed in the table below.
| Option | Description |
| -bmaxdata:bytes -bmaxstack:bytes | Required for large memory use on 32-bit architectures. Default data and stack (combined) size is 256 MB. |
| -c | Compile only, producing a ".o" file. Does not link object files. |
| -g | Produce information required by debuggers and some profiler tools. |
| -I (upper case i) | Names directories for additional include files. |
| -L | Specifies pathnames where additional libraries reside; directories will be searched in the order of their occurrence on the command line. |
| -l (lower case L) | Names additional libraries to be searched. |
| -O -O2 -O3 -O4 -O5 | Various levels of optimization. |
| -o | Specifies the name of the executable (a.out by default). |
| -p -pg | Generate profiling support code. |
| -q32, -q64 | Specifies generation of 32-bit or 64-bit objects. |
| -qhot | Determines whether or not to perform high-order transformations on loops and array language during optimization, and whether or not to pad array dimensions and data objects to avoid cache misses. Fortran only. |
| -qipa | Specifies interprocedural analysis optimizations. |
| -qarch=arch -qtune=arch | Permits maximum optimization for the SP processor architecture being used. Can significantly improve performance at the expense of portability. |
| -qautodbl=setting | Automatic conversion of single precision to double precision, or double precision to extended precision. |
| -qsmp=omp | Specifies OpenMP compilation. |
| -qlist -qsource -qxref | Compiler listing/reporting options. |
- 32-bit versus 64-bit
- Default compilation is to produce 32-bit objects
- -q64 must be used to specify 64-bit object creation
- 32-bit and 64-bit components cannot coexist - an application must
be entirely one or the other
- There are some important issues and considerations related to this
topic - see the compiler User's Guide.
- Optimization
- Default is no optimization
- Without the correct -O option specified, the defaults for
-qarch and -qtune are not optimal!
Only -O4 and -O5 automatically select the best
architecture related optimizations.
- Options other than those above are also available
- See the respective man page and compiler User's Guide for details.
- Note: All of the IBM compiler commands have default options, which can
  be configured by a site's system administrators. It may be useful to
  review the files /etc/*cfg* to learn exactly what the
defaults are for the system you're using:
- Linking: POE executables which use MPI are
dynamically linked with the appropriate communications library at
run time. It is possible, but not recommended, to create statically
linked modules. Please consult the Parallel Environment
Operation and Use Volume 1
manual for details.
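As a hedged sketch of combining several of these options (the executable and
source names are hypothetical, and the architecture values assume a POWER3
node), a 64-bit optimized build and a 32-bit large-memory build might look
like:

    # 64-bit, aggressively optimized for POWER3
    mpxlf90_r -q64 -O3 -qarch=pwr3 -qtune=pwr3 -o bigsim bigsim.f

    # 32-bit build that needs more than the default 256 MB data segment
    mpcc_r -q32 -O2 -bmaxdata:0x80000000 -o bigmem bigmem.c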
Implementations:
- There are three implementations of MPI on LC's IBM SPs:
- IBM MPI - threaded library (recommended)
- IBM MPI - signal (non-threaded) library
- MPICH - non-threaded
- The implementation of choice is determined by the compiler command you
use.
- IBM MPI threaded library
- mpcc_r
- mpxlc_r
- mpguidec
- mpCC_r
- mpxlC_r
- mpguidec++
- mpxlf_r
- mpguidef77
- mpxlf90_r
- mpguidef90
- mpxlf95_r
- Note that the Guide compiler scripts use IBM's threaded MPI library
- Note also that LC has aliased the signal library compile commands
(below) to the threaded commands above. This is not true on
non-LC systems!
- IBM MPI - signal (non-threaded) library
- First setenv LLNL_COMPILE_SINGLE_THREADED TRUE. If you forget
to do this at LC, you will get the threaded library instead.
- mpcc
- mpxlc
- mpCC
- mpxlC
- mpxlf
- mpxlf90
- mpxlf95
- MPICH - non-threaded
- mpicc
- mpiCC
- mpif77
- mpif90
Notes:
- All MPI compiler commands are actually "scripts" that automatically link
in the necessary MPI libraries, include files, etc. and then call the
appropriate native AIX compiler.
- See the Programming Considerations section
below for important details regarding the IBM MPI implementations.
- Documentation for the IBM implementation is available
from IBM.
- LC's MPI tutorial describes
how to create MPI programs.
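As a hedged illustration (the source file name is hypothetical), the same MPI
source could be built against each of the three implementations at LC:

    # IBM threaded MPI library (recommended)
    mpcc_r -O2 -o myprog myprog.c

    # IBM signal (non-threaded) library - the environment variable below
    # must be set at LC or the threaded library is linked instead
    setenv LLNL_COMPILE_SINGLE_THREADED TRUE
    mpcc -O2 -o myprog myprog.c

    # MPICH
    mpicc -O2 -o myprog myprog.c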
Running on IBM SP Systems
Understanding Your System Configuration
First Things First:
- Before running your parallel job, it is important to know a few details
regarding the SP system you intend to use. This is especially important
if you use multiple SP systems, as they will almost certainly be configured
differently.
- Some of the questions you might ask:
- How many nodes are available?
- How many physical processors are there on each node?
- Are the nodes configured into separate pools?
- Has the system changed since I last used it?
- What other jobs are running?
- Are there sufficient machine resources (memory, disk) on the
nodes I will be using?
- Where can I run interactively? In batch?
- How are other users using the system resources?
Some Hints for Understanding Your System:
Running on IBM SP Systems
Establishing Authorization
NOTE: This section can be omitted by LC users, as authorization is automatic.
About SP Authorization:
AIX Authorization:
- AIX authorization can be set up by system administrators (root) in the
  /etc/hosts.equiv file, or by the user in a .rhosts
  file.
- If your system administrators have set up the /etc/hosts.equiv
  file, you can view it to determine which machines may be used, and possibly
  also, specifications on which users are permitted/denied access to
  which machines.
- If your system is not using the /etc/hosts.equiv file for
  authentication, then you will need to set up your own authentication
  file using the standard UNIX "trusted host" file, called
  .rhosts.
- Creating a .rhosts File
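A minimal sketch of a .rhosts file (the host and user names are hypothetical);
each line names a trusted host and, optionally, the remote user allowed in
from it:

    # ~/.rhosts - should be readable/writable by the owner only (chmod 600)
    spnode01.example.gov  jsmith
    spnode02.example.gov  jsmith
    spnode03.example.gov  jsmith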
Running on IBM SP Systems
Setting Up the Execution Environment
Setting POE Environment Variables:
Note: Different versions of POE software are not identical in the environment
variables they support.
Basic POE Environment Variables
- Although there are many POE environment variables, you really only need
to be familiar with a few of them. Specifically, those that answer
the three basic questions:
QUESTION 1: How many nodes and how many tasks do I require?
- MP_PROCS
- The total number of MPI processes/tasks for your parallel job.
May be used alone or
in conjunction with MP_NODES and/or MP_TASKS_PER_NODE to specify how many
tasks are loaded onto a physical SP node. The maximum value for
MP_PROCS is dependent upon the version of PE software installed
(currently ranges from 128 to 2048). If not set, the default is 1.
- MP_NODES
- Specifies the number of physical nodes on which to run the parallel
tasks. May be used alone or in conjunction with MP_TASKS_PER_NODE
and/or MP_PROCS.
- MP_TASKS_PER_NODE
- Specifies the number of tasks to be run on each of the physical nodes.
May be used in conjunction with MP_NODES and/or MP_PROCS.
QUESTION 2: How will nodes be allocated - should I choose them myself or let
the LoadLeveler automatically choose them for me?
- MP_RESD
- Specifies whether or not LoadLeveler should be used
to allocate nodes. Valid values are either "yes" (non-specific node
allocation) or "no" (specific node allocation). If not set, the default
value is context sensitive to other POE variables. Batch systems
typically override/ignore user settings for this environment variable.
- MP_RMPOOL
- Specifies the SP system pool number that should be used
for non-specific node allocation. This is only valid if you
are using the LoadLeveler for non-specific node allocation
(from a single pool) without a
host list file.
Batch systems typically override/ignore user settings for this environment
variable.
- MP_HOSTFILE
- This environment variable is used only if you wish to explicitly select
which nodes will be allocated for your POE job (specific node allocation).
If you prefer to let LoadLeveler automatically allocate nodes then this
variable should be set to NULL or "".
If used, this variable specifies the name of a file which contains the
actual machine (domain) names of nodes you wish to use. It can also
be used to specify which pools should be used.
The default filename is "host.list" in the current directory.
You do not need to set this variable if your host list file is the
default "host.list".
Batch systems typically override/ignore user settings for this environment
variable. Additional details on using the host list file are available
here.
QUESTION 3: Which communications protocol and network interface should I use?
- MP_EUILIB
- Specifies which of two protocols should be used for task communications.
Valid values are either "ip" for Internet Protocol or "us" for
User Space protocol. The default is "ip", while "us" is faster.
Note: batch systems may differ in the default setting.
- MP_EUIDEVICE
- A node may be physically connected to different networks. This environment
variable is used to specify which network adapter should be used for
communications.
Valid values are: en0 (ethernet), fi0 (FDDI),
tr0 (token-ring), css0 (SP switch) or csss
(SP switch double adapter).
Note that valid values will also depend upon the actual physical
network configuration of the node. For example, specifying "fi0" when
a node does not physically have a FDDI adapter is an error.
Recommendation: if all of your communication is between SP nodes, be
sure to set MP_EUIDEVICE to css0 or csss (double adapters).
Example Basic Environment Variable Settings
- The following examples demonstrate how to set the basic POE
environment variables for three different situations.
- Case 1:
Non-specific node allocation of 4 tasks from a single pool with
User Space protocol communications over the SP switch (single adapter).
csh / tcsh:
    setenv MP_PROCS 4
    setenv MP_RMPOOL 0
    setenv MP_RESD YES
    setenv MP_HOSTFILE "NULL"
    setenv MP_EUILIB us
    setenv MP_EUIDEVICE css0

ksh / bsh:
    export MP_PROCS=4
    export MP_RMPOOL=0
    export MP_RESD=YES
    export MP_HOSTFILE="NULL"
    export MP_EUILIB=us
    export MP_EUIDEVICE=css0
- Case 2:
Non-specific node allocation using 4 physical nodes from a
single pool. Each node will execute 4 tasks (for a total of 16 tasks),
using User Space protocol communications over the SP switch
(double adapter).
csh / tcsh:
    setenv MP_NODES 4
    setenv MP_TASKS_PER_NODE 4
    setenv MP_RMPOOL 2
    setenv MP_RESD YES
    setenv MP_HOSTFILE "NULL"
    setenv MP_EUILIB us
    setenv MP_EUIDEVICE csss

ksh / bsh:
    export MP_NODES=4
    export MP_TASKS_PER_NODE=4
    export MP_RMPOOL=2
    export MP_RESD=YES
    export MP_HOSTFILE="NULL"
    export MP_EUILIB=us
    export MP_EUIDEVICE=csss
- Case 3:
Specific node allocation of 32 tasks (possibly from multiple
pools) with Internet protocol communications over the SP switch
(double adapter). Assumes the existence of a file called
"myhosts" containing the desired hostnames.
csh / tcsh:
    setenv MP_PROCS 32
    unsetenv MP_RMPOOL
    setenv MP_RESD no
    setenv MP_HOSTFILE myhosts
    setenv MP_EUILIB ip
    setenv MP_EUIDEVICE csss

ksh / bsh:
    export MP_PROCS=32
    unset MP_RMPOOL
    export MP_RESD=no
    export MP_HOSTFILE=myhosts
    export MP_EUILIB=ip
    export MP_EUIDEVICE=csss
Miscellaneous POE Environment Variables
A list of some commonly used, or potentially useful, POE environment variables
appears below. A complete list of the POE environment variables can be
viewed quickly in the POE
man page. A much fuller discussion is available in the
Parallel Environment Operation and Use Volume 1
manual.
- MP_SHARED_MEMORY
- Allows MPI programs with more than one task on a node to use shared
  memory versions of MPI message passing calls. Can significantly improve
  communication bandwidth. Only available with the latest SP
  hardware/software. Also requires a 256 MB memory segment. Valid values
  are "yes" and "no".
- MP_STDOUTMODE
- Enables you to manage the STDOUT from your parallel tasks. If set to
  "unordered", all tasks write output data to STDOUT asynchronously.
  If set to "ordered", output data from each parallel task is written to
  its own buffer. Later, all buffers are flushed, in task order, to
  stdout. If a task id is specified, only the task indicated writes
  output data to stdout. The default is unordered. Warning: use
  "unordered" if your interactive program prompts for input - otherwise
  your prompts may not appear.
- MP_SAVEHOSTFILE
- The name of an output host list file to be generated by the Partition
  Manager. Can be used to "save" the names of the hosts used by your
  POE job.
- MP_NEWJOB
- By default, the Partition Manager releases your partition when your program
  completes its run. In order to preserve your partition between multiple runs
  or running multiple jobs in sequence, set this environment variable to
  "yes".
- MP_LABELIO
- Determines whether or not output from the parallel tasks is labeled
  by task id. Valid values are yes or no. If not set, the default is no.
- MP_CHILD
- An undocumented, "read-only" variable set by POE. Each task will
  have this variable set to its unique task id (0 thru MP_PROCS-1).
  Can be queried in scripts or batch jobs to determine "who I am".
- MP_FENCE and MP_NOARGLIST
- The default behavior of POE is to parse your command line and extract all
  the arguments it recognizes, and then pass the remaining arguments to your
  program. In cases where arguments to your program conflict with the
  predefined POE command line flags (such as -procs), these environment
  variables instruct POE on how to parse your command line. Setting
  MP_NOARGLIST to "yes" causes POE to pass all arguments to your program.
  MP_FENCE is used to selectively specify which arguments go to your program.
- MP_PGMMODEL
- Determines the programming model you are using. Valid values are spmd
  or mpmd. If not set, the default is spmd. If set to "mpmd" you will
  be enabled to load different executables individually on the nodes of your
  partition.
- MP_CMDFILE
- Determines the name of a POE commands file used to load the nodes of
  your partition. If set, POE will read the commands file rather than
  STDIN. Valid values are any file specifier. Generally used only
  when MP_PGMMODEL=mpmd.
- MP_RETRY and MP_RETRYCOUNT
- Interactive use only. MP_RETRY specifies the period (in seconds)
  between processor node allocation retries if there are not enough
  processor nodes immediately available. MP_RETRYCOUNT specifies the
  number of times that the Partition Manager should attempt to allocate
  processor nodes before returning without running your program.
- MP_INFOLEVEL
- Determines the level of message reporting. Default is 1. Valid values are:
  0 = error
  1 = warning and error
  2 = informational, warning, and error
  3 = informational, warning, and error, plus diagnostic
      messages for use by the IBM Support Center
  4, 5, 6 = informational, warning, and error, plus high- and low-level
      diagnostic messages for use by the IBM Support Center
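As a hedged illustration of MP_CHILD (the script name is hypothetical), a
small csh script launched under poe could branch on its task id:

    #!/bin/csh
    # pertask.csh - run as: poe ./pertask.csh -procs 4
    echo "hello from task $MP_CHILD"
    if ($MP_CHILD == 0) then
        # only task 0 performs the setup work
        echo "task 0 doing setup"
    endif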
Running on IBM SP Systems
Invoking the Executable
Syntax:
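The general form (a minimal sketch; the program name and option values are
placeholders) is to prefix your executable with the poe command. POE settings
are taken from the environment variables described earlier, or from the
equivalent command line flags:

    # environment variables (MP_PROCS, MP_RMPOOL, etc.) already set
    poe myprog

    # or, overriding settings on the command line
    poe myprog -procs 4 -rmpool 0 -labelio yes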
Multiple Program Multiple Data (MPMD) Programs:
- By default, POE follows the Single Program Multiple Data parallel
programming model: all parallel tasks execute the same program but may use
different data.
- For some applications, parallel tasks may need to run different programs
as well as use different data. This parallel programming model is
called Multiple Program Multiple Data (MPMD).
- For MPMD programs, the following steps must be performed:
Interactive:
- Set the MP_PGMMODEL environment variable to "mpmd". For example:
setenv MP_PGMMODEL mpmd
export MP_PGMMODEL=mpmd
- Enter poe at the Unix prompt. You will then
be prompted to enter the executable which should be loaded on each
node. The example below
loads a "master" program on the first node and 4 "worker" tasks
on the remaining four nodes.
0:node1> master
1:node2> worker
2:node3> worker
3:node4> worker
4:node5> worker
- Execution starts automatically after the last node has been loaded.
Batch:
- Create a file which contains a list of the
program names (one per line) that must be loaded onto your nodes.
- Set the MP_PGMMODEL environment variable to "mpmd" - usually done in
your batch submission script
- Set the environment variable MP_CMDFILE to the name of the file
you created in step 1 above - usually done in your batch submission
script also.
- When your application is invoked within the batch system, POE will
automatically load the nodes as specified by your file.
- Execution starts automatically after the last node has been loaded.
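A minimal sketch of a batch MPMD setup (the file and program names are
hypothetical), assuming one "master" task and four "worker" tasks:

    # contents of the commands file "prog.list" - one program name per line
    master
    worker
    worker
    worker
    worker

    # in the batch submission script
    setenv MP_PGMMODEL mpmd
    setenv MP_CMDFILE prog.list
    poe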
Using POE with Serial Programs:
POE Error Messages:
Running on IBM SP Systems
Monitoring Job Status
- LC's ju and spjstat commands allow you to
monitor your running job's status. Examples are shown below.
frost067% ju
Pool total down used avail cap Jobs
0 pdebug.general 1 0 0 1 0%
1 pbatch.batch 63 0 61 2 96% ccals-4, ggarrids-4, ggarrids-4, jjkokks-2, jjkokks-2,
jjkokks-8, aads-1, rragils1-4, rrands-1, kkus-4, eess-1, aanas2-1, ggitss-1, ggitss-1, ggitss-1, ggitss-1,
eeittes-8, cchlutes-5, uuhons-8
frost067% spjstat
Scheduling pool data:
--------------------------------------------------------
Pool Memory Cpus Nodes Usable Free Other traits
--------------------------------------------------------
pbatch 16384Mb 16 63 63 2 pbatch
pdebug 16384Mb 16 1 1 1 pdebug
Running job data:
---------------------------------------------------------------
LL Batch ID User Nodes Pool Class Status Master
Name Used Node
---------------------------------------------------------------
frost067.994.0 eess 1 pbatch normal R frost005
frost067.976.0 ggitss 1 pbatch normal R frost001
frost067.954.0 uuhons 8 pbatch normal R frost008
frost067.1110.0 ggarrids 4 pbatch normal R frost016
frost067.1109.0 ggarrids 4 pbatch normal R frost012
frost067.1104.0 aads 1 pbatch normal R frost049
frost067.1102.0 wwals 4 pbatch normal R frost050
frost067.1098.0 rragils1 4 pbatch normal R frost023
frost067.1095.0 aanas2 1 pbatch normal R frost006
frost067.1093.0 jjkokks 2 pbatch normal R frost061
frost067.1092.0 jjkokks 8 pbatch normal R frost002
frost067.1089.0 jjkokks 2 pbatch normal R frost047
frost067.1082.0 cchlutes 5 pbatch normal R frost026
frost067.1074.0 ggitss 1 pbatch normal R frost024
frost067.1073.0 ggitss 1 pbatch normal R frost018
frost067.1050.0 kkus 4 pbatch normal R frost017
frost067.1049.0 eeittes 8 pbatch normal R frost004
frost067.1048.0 ggitss 1 pbatch normal R frost011
frost067.1027.0 rrands 1 pbatch normal R frost040
- For monitoring all batch jobs, both running and queued, the
pstat command can be used. See the man page or
LCRM (DPCS) Tutorial for details.
Running on IBM SP Systems
Interactive Job Specifics
Running Interactive Jobs:
Insufficient Resources:
- Interactive pools are usually small and must be shared by all users on
the system. It is easy to exhaust the available nodes. When this happens,
POE will give you an error message:
ERROR: 0031-124 Less than XX nodes available from pool N
- or -
ERROR: 0031-365 LoadLeveler unable to run job, reason:
LoadL_negotiator: 2544-870 Step frost032.pacific.llnl.gov.11575.0 was not
considered to be run in this scheduling cycle due to its relatively low
priority or because there are not enough free resources.
- If this happens, you can keep checking the pdebug queue with the
ju command until nodes are freed up, and then try running your
job.
- Alternately, you can use the MP_RETRY and
MP_RETRYCOUNT environment variables to automatically
keep trying to run your job.
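For example (the values are arbitrary), to retry the allocation every 60
seconds, up to 30 times, before giving up:

    setenv MP_RETRY 60
    setenv MP_RETRYCOUNT 30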
Killing Interactive Jobs:
Running on IBM SP Systems
Batch Job Specifics
On LC systems, LCRM (DPCS) is used for running batch jobs. LCRM is
covered in detail in the LCRM
(DPCS) Tutorial. This section only provides a quick summary of LCRM
usage on the IBM SPs.
Submitting Batch Jobs:
Quick Summary of Common Batch Commands:
| Command | Description |
| psub    | Submits a job to LCRM |
| pstat   | LCRM job status command |
| prm     | Remove a running or queued job |
| phold   | Place a queued job on hold |
| prel    | Release a held job |
| palter  | Modify job attributes (limited subset) |
| lrmmgr  | Show host configuration information |
| pshare  | Queries the LCRM database for bank share allocations, usage statistics, and priorities |
| defbank | Set default bank for interactive sessions |
| newbank | Change interactive session bank |
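As a hedged sketch (the script and executable names are hypothetical, and any
psub options you need are site-specific), a simple batch job command script
submitted with psub might look like:

    #!/bin/csh
    # jobscript.csh - submitted with: psub jobscript.csh
    # POE variables honored in batch can be set here
    setenv MP_INFOLEVEL 2
    setenv MP_LABELIO yes
    # launch the parallel executable on the nodes allocated by the scheduler
    poe ./myprog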
Batch Jobs and POE Environment Variables:
- Certain POE environment variables will affect batch jobs just as they
  do interactive jobs. For example, MP_INFOLEVEL and
  MP_PGMMODEL. These can be placed in your batch job command
  script.
- However, other POE environment variables are ignored by the batch
scheduler for obvious reasons. The following POE variables will have no
effect if used in a batch job command file:
    MP_PROCS
    MP_NODES
    MP_TASKS_PER_NODE
    MP_RMPOOL
    MP_HOSTFILE
    MP_PMDSUFFIX
    MP_RESD
    MP_RETRY
    MP_RETRYCOUNT
    MP_ADAPTER_USE
    MP_CPU_USE
- Be aware that POE environment variables in your .login, .cshrc,
.profile, etc. files may also affect your batch job.
Killing Batch Jobs:
- The LCRM prm command can be used to remove queued jobs
or terminate a running job.
- In the unlikely event that a terminated parallel job leaves behind
running processes, you can use the poekill command,
as previously described in the Interactive Job
Specifics section.
Running on IBM SP Systems
Optimizing CPU Usage
SMP Nodes:
- All IBM SP nodes are shared memory SMPs. Each SMP node has multiple CPUs,
and is thus capable of running multiple tasks simultaneously.
- The number of CPUs on a node depends upon the type of node. For
example:
- 604e - 4 CPUs per node
- POWER3-II - 16 CPUs per node
Effectively Using Available CPUs:
When Not to Use All CPUs:
- For MPI codes that use OpenMP or Pthread threads, you probably do not
want to place an MPI task on each CPU, as the threads will need someplace
to run.
- For tasks that require most of the node's memory, you may likewise not
want to put a task on every CPU if it will lead to memory exhaustion.
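As a hedged sketch (the right balance of tasks and threads depends on your
code), a hybrid MPI/OpenMP job on 16-CPU POWER3 nodes might place 4 MPI tasks
on each node with 4 OpenMP threads per task, so that tasks and threads
together fill the CPUs without oversubscribing them:

    # interactive example; in batch, node and task counts are
    # typically determined by the scheduler instead
    setenv MP_NODES 2
    setenv MP_TASKS_PER_NODE 4
    setenv OMP_NUM_THREADS 4
    poe ./hybrid_prog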
Available Debugging Tools:
- TotalView - Multi-process, multi-thread, Xwindows based debugger
from Etnus. Recommended. See the TotalView tutorial for details.
- dbx - command line debugger for serial codes. See man page for
details.
- pdbx - command line parallel version of dbx provided by POE. See
the man page and IBM POE documentation for
details.
- decor - A tool for displaying stack traces from multiple
lightweight core files. Run the command without arguments for usage
information.
- Zerofault - Memory management tool from The Kernel Group. See
http://www.llnl.gov/icc/lc/DEG/zerofault for details.
- Great Circle - Parallel memory management tool from Geodesic. See
http://www.llnl.gov/icc/lc/DEG/GreatCircle-DEG3.htm
for details.
- Insure++ - Memory management tool and runtime debugger from
ParaSoft. See /usr/local/parasoft/manuals/index.htm for details.
Using TotalView on the IBM SPs:
- Using TotalView is covered in great detail in the
Totalview tutorial. Only the basics of
starting TotalView are covered here.
- Be sure to compile your program with the -g option
- When starting TotalView, specify the poe process and then
use TotalView's -a option for your program and any other
arguments (including POE arguments). For example:
totalview poe -a -procs 4 myprog
- TotalView will then load the poe process. Use the Go command to start
poe with your executable.
- After you successfully start TotalView with your parallel job, you
  will be prompted about stopping the parallel job. In most cases,
  answering yes is the right thing to do.
- You will then be ready to begin debugging your parallel program.
- For debugging in batch, see
Batch System
Debugging in LC's TotalView tutorial.
Preset POE Environment Variables:
- LC automatically sets several of the POE environment variables for all
  users. For the most current settings, check the global
/etc/environment file on any ASC machine.
An example (not necessarily the most recent) is shown below.
- Note that in most cases, these are the "best" settings.
# POE default environment variables
MP_CPU_USE=unique
MP_EUILIB=us
MP_HOSTFILE=NULL
MP_INFOLEVEL=1
MP_LABELIO=yes
MP_RESD=yes
MP_SHARED_MEMORY=yes
MP_SYNC_ON_CONNECT=no
MP_TMPDIR=/var/tmp
# cluster-specific environment variables
RPC_UNSUPPORTED_NETIFS=en0:en1:en2:en3:en4:css0:css1
# Set Poe Environment Variables
MP_COREFILE_FORMAT=core.light
MP_EUIDEVICE=csss
Additional Local Information:
- POE/AIX Environment Variables at
LLNL - table of environment variables with notations on LLNL
specific settings.
- ASC White Machine Usage
- Livermore Computing:
- http://www.llnl.gov/computing - Links to a full range of
  local documentation, tutorials, policies, latest news, machine
  status, announcements, etc.
- On all ASC machines:
- /usr/local/docs - text, PostScript and HTML documents
- news items
- MOTD
Parallel Environment Limits:
Several of the more important POE environment limits are listed below.
Others may be found in the Parallel Environment
MPI Programming Guide manual (see the Appendix
on Limits).
- The maximum number of tasks that a job may have:
- Up to 4096 User Space tasks
- Up to 2048 IP tasks
- Maximum buffer size for any MPI communication is 2 GB.
- Maximum "eager limit" message size is 64 KB. Refers to messages sent
without handshaking from a sending task to a receiving task. Requires
receiving task to have sufficient early arrival buffer.
- Maximum early arrival buffer (for messages sent in "eager" mode) size
is 64 MB. If this is exceeded the program will fail/terminate.
- Maximum aggregate unsent data, per task: no specific limit
- Maximum number of communicators, file handles, and windows:
approximately 2000
- Maximum number of committed data types: depends on MP_BUFFER_MEM
- Maximum number of distinct tags: all non-negative integers less than 2**32-1
Run-time Analysis Tools:
- The Parallel Environment software provides a couple
tools which can be used to analyze your parallel program's execution
behavior. These tools are briefly described below. Please consult the
Parallel Environment Operation and Use Volume 2
manual for detailed information.
- Xprofiler
- This tool is a graphical version of the parallel gprof
command.
- Permits profiling at the source line level
- Provides all of the "flat" (non-graphical) gprof reports
- Invoked with the
xprofiler command.
- PE Benchmarker
- New with AIX 5 and version 3.2 of the Parallel Environment software.
- PE Benchmarker is a suite of applications and utilities that provide
for dynamic instrumentation/tracing of MPI, user, hardware and operating
system events, and visualization of these events.
Parallel File Copy Utilities:
- POE provides the following utilities which may be used to copy file(s) to
and from a number of nodes.
- These utilities are actually message
passing applications designed for efficiency.
- See the associated hyperlinked man page for examples of each
utility's use.
| Utility | Description |
| mcp     | Copies a single file from the home node to a number of remote nodes. |
| mcpscat | Copies a number of files from task 0 and scatters them in sequence to all tasks, in a round robin order. |
| mcpgath | Copies a number of files from all tasks back to task 0. |
This completes the tutorial.
References and More Information
- IBM Parallel Environment for AIX manuals can be found at:
www-1.ibm.com/servers/eserver/pseries/library/sp_books
- Operation and Use Volume 1: Using the Parallel Operating Environment
- Operation and Use Volume 2: Tools Reference
- Hitchhiker's Guide
- MPI Programming Guide
- MPI Subroutine Reference
- Installation
- Messages
- GPFS
- ESSL, PESSL
- IBM Compiler Documentation:
- "ASCI Blue Pacific and Beyond". Presentation by Geert Wenes,
IBM Enterprise Systems Group. 2000.
- "Parallel Operating Environment (POE)" tutorial from the Maui High
Performance Computing Center.
Permission granted to reproduce copyrighted material in whole or in part
for United States Government or educational purposes.
www.mhpcc.edu.