Article

Profiling MPI Applications

By Yukon Maruyama and Marty Itzkowitz, updated January 2011

 

This article describes the profiling of Message Passing Interface (MPI) applications with the Oracle Solaris Studio Performance Tools. It starts with an overview of MPI performance data, explains how to profile MPI applications, shows examples from the analysis of performance data, and finishes with a discussion of supported MPI implementations.

The Oracle Solaris Studio Performance Tools require a supported version of Oracle Solaris or Linux and a supported version of Java. You can verify that all the appropriate components and patches are installed by running the collect command with no arguments.

This document assumes you are familiar with MPI. A description of the MPI standard API and runtime system can be found in Wikipedia or any number of reference books on MPI.

This article is based on the Oracle Solaris Studio 12.2 software. The examples are based on the Oracle Message Passing Toolkit (previously called “ClusterTools”) MPI implementation.

Overview of MPI Performance Data

MPI applications consist of processes that use MPI to synchronize and communicate. The processes may be instantiations of a single executable, or of multiple executables. An application is launched with an agent, typically mpirun or mpiexec, that assigns the processes to nodes of an MPI cluster. Each process is assigned a sequential identifier known as its global rank. The process uses MPI API calls to exchange data, combine partial results of computations, or synchronize processes.

MPI performance issues can be grouped into two categories: problems in communication and synchronization between processes, and problems in the computation performed within each process. The two types of data are addressed with different kinds of data collection. Communication data is collected by tracing MPI API calls. Computation data is collected with clock profiling and, optionally, CPU hardware-counter profiling.

Communication Profiling

Communication profiling data can be used to show an exact sequence of MPI API calls. It can also be used to identify imbalances, outliers, and communication patterns. (Examples of analysis screen shots can be found in MPI Timeline and MPI Charts.) This data is gathered by tracing MPI API calls using a technique called interposition.

When the application calls the MPI_Init function, the data collector initializes MPI tracing. This initialization includes the task of determining the time offset between nodes.

During the application's run, the data collector records the name of each call, the number of messages sent, the sent-byte count, the number of messages received, and the received-byte count. Each trace event also includes entry and exit timestamps.

During MPI_Finalize, the collector again measures time offsets for each node. The offsets determined during MPI_Init and MPI_Finalize are used to adjust each node's timestamps. In addition, messages are matched to the MPI API calls associated with their transfer. For example, if an MPI_Irecv call relies on a subsequent MPI_Wait to complete a message transfer, the message's reception is attributed to MPI_Wait, not MPI_Irecv. Communication profiling imposes significant overhead on MPI_Init and MPI_Finalize, but much less overhead on the remaining MPI API calls.

Computation Profiling

Computation profiling data can show which of an application's functions and source lines are consuming system resources or causing program stalls. (Examples of analysis screen shots can be found in Examining Computation Profiling).

Computation data is based on clock profiling, and optionally, CPU hardware-counter profiling. Data consists of performance metrics and stack information that is recorded each time a clock-profile or CPU hardware-counter event occurs.

Clock Profiling and MPI State Data

Clock profiling is enabled by default by the collect command, with a period of approximately 10 milliseconds. The profiling rate can be adjusted, or clock profiling disabled, with the -p argument. At the end of each period, stack information and metrics are recorded. On Linux, User CPU Time is recorded. On Solaris, User CPU Time, System CPU Time, and several other metrics are recorded.
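
For example, assuming a.out stands for your executable, the rate can be raised or lowered as follows (see the collect(1) man page for the full set of -p values):

  collect -p high ./a.out
  collect -p low ./a.out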

Two additional performance metrics, known collectively as MPI State Data, are obtained from the MPI runtime library whenever clock profiling is performed on Oracle Message Passing Toolkit (ClusterTools) 8.1 or later. The two metrics are MPI-Work Time, which accumulates when the process is doing work inside the MPI Library, and MPI-Wait Time, which accumulates when the process is busy-waiting or sleep-waiting. MPI State Data can be used to identify the nature of MPI-related stalls.

Hardware-Counter Profiling Data

Hardware-counter profiling works on a principle similar to clock profiling, but uses CPU hardware-counter events, rather than clock ticks, to trigger the recording of samples. The events are CPU-specific, but they typically measure the use of resources such as CPU cycles, caches, and TLBs. Hardware-counter profiling is available on Solaris systems as well as on supported Linux systems that have the perfctr patch installed.
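
For example, a sketch of adding hardware-counter profiling to a collection; the counter name cycles is an assumption that varies by CPU, so check the collect(1) man page for the counters available on your system:

  collect -h cycles,on ./a.out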

Comparing MPI API Tracing and Clock-Profiling Data

MPI API trace data and MPI State data are collected by different mechanisms, but both get their information from the MPI runtime, so there are correlations between them. Specifically, the MPI Time metric measured by tracing represents the real time spent in the MPI runtime, summed across processes, while MPI-Wait and MPI-Work are statistical measures that approximately sum to MPI Time.

Scalability Differences between API Tracing and Clock-Profiling

MPI can be used for very large scale jobs, using hundreds, if not thousands, of processors and MPI processes. In such cases, the scalability of the performance measurements becomes significant.

The data volume of MPI API tracing is approximately proportional to the number of MPI API calls and messages. As a result, data volume can only be reduced by limiting the application's runtime or process count. Selective profiling of a subset of ranks is not possible because the collector would not have enough information to match message sends and receives.

On the other hand, the data volume from clock-profiling depends on the number of samples taken. The frequency can be managed by choosing a lower profiling rate. For example, -p low will reduce the data by approximately a factor of ten. Selective profiling of a subset of ranks, described later in this document, can also reduce data volume.

How to Profile MPI Applications

This section discusses how to compile and launch your application for performance profiling.

Compiling MPI Applications

You can compile your application with either the Solaris Studio compilers or the GNU compilers. Most MPI implementations provide wrapper scripts for the C, C++ and Fortran compilers to ensure that the proper include files and libraries are found. For C++ applications, be sure to use a version of the MPI implementation compiled with the same compiler as your application. Some Fortran compilers have similar restrictions.

If you are going to profile your application, specify -g to ensure that symbolic information and the mapping from assembly instructions to source lines are included in the executable. Note that -g does not affect the optimizations performed by the Oracle Solaris Studio compilers.
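
For example, assuming an Open MPI-based toolkit that provides the mpicc and mpif90 wrappers (wrapper names vary by implementation), and a hypothetical source file my_app.c or my_app.f90:

  mpicc  -O -g -o a.out my_app.c
  mpif90 -O -g -o a.out my_app.f90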

Some MPI distributions are available with either static linking or shared-object linking. You must use a version built with shared-object linking in order to use the Oracle Solaris Studio Performance Tools.

Launching MPI Performance Data-Collection

Use the collect command to profile an MPI application. You can collect both communication and computation profile information in a single run. Specify the -M version argument to name the version of MPI you are using and specify mpirun as the target of collect. For more information on formatting the command, see the collect(1) man page.

MPI API trace data can only be collected for applications that are run on supported MPIs. We recommend using Oracle Message Passing Toolkit 8.1 or later for MPI profiling. It is based on Open MPI and is tuned for Oracle Solaris platforms. In addition, it has a profiling feature that no other MPI implementation provides: MPI State data, which is collected automatically with clock profiling.

If you run an MPI job using Oracle Message Passing Toolkit (ClusterTools) 8.2 with:

  mpirun -np 4 -- a.out 

you can collect MPI performance data on it with:

  collect -M OMPT mpirun -np 4 -- a.out 

Note the -- argument, which is optional for mpirun but required for collect to parse the command properly.

For MPI applications that run longer than several minutes, you may want to reduce the frequency of statistical samples with the -p option, for example:

  collect -M OMPT -p low mpirun -np 4 -- a.out 

Invoking the collect command produces a single MPI experiment, which contains a log file and a subexperiment for each rank.
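
For example, assuming the default experiment name test.1.er, you can open the experiment in the Analyzer, or print a function list from the command line with er_print:

  analyzer test.1.er
  er_print -functions test.1.er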

Although many users employ a script to launch their MPI a.out executables, MPI profiling of scripts is not supported. The a.out binary must be a native executable.

Disabling MPI API Tracing

MPI API tracing is enabled by default with -M. You can, however, collect computation data without communication data by specifying the -m off option. For example,

  collect -M OMPT -m off mpirun -np 4 -- a.out 

produces an MPI Experiment with only computation profiling data and

  collect -M OMPT -m off -p low mpirun -np 4 -- a.out 

produces an MPI Experiment with computation profiling data recorded at a reduced rate.

Selective Computation Profiling of MPI Jobs

In addition to disabling MPI API tracing, you can further reduce the data volume of an MPI profile by using selective collection on a subset of ranks. Instead of using the collect -M option, invoke mpirun directly on a start-up script that enables profiling on some, but not all, ranks. As stated above, selective MPI API tracing cannot be done.

Replace the computation-only MPI profiling command:

  mpirun -np 4 -- collect -p on a.out

with

  mpirun -np 4 -- start_script ./a.out 

where start_script is a shell script that either runs ./a.out directly, or invokes collect on it.

The following simple script profiles the first two ranks, but not any others, with Oracle Message Passing Toolkit 8.1:

#!/bin/ksh
#
# Assumes the arguments passed to the script are
# the name of the a.out and its arguments

#
# Find the rank of the process from
# the MPI rank environment variable
# which depends on the MPI implementation.

rank=${OMPI_COMM_WORLD_RANK}
          
# Set the collect command and arguments 
# (Or use a full path to collect)

COLLECT="collect -p on" 
          
# Select which ranks to profile

if ["rank" <= "1" ]; then

    exec $COLLECT $*
else
    exec $*
fi 

Some MPI implementations use a different environment variable to specify the MPI rank, and some pass the rank as an argument to the target. You can adapt the script, and its invocation in the mpirun command, as needed for the particular MPI you are using.
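
For example, a hedged variant of the rank lookup that falls back across several of the environment variables listed in Tables 1 and 2 (adjust the list for the MPI you are using):

# Try several rank variables; use whichever one is set.
rank=${OMPI_COMM_WORLD_RANK:-${OMPI_MCA_ns_nds_vpid:-${PMI_RANK:-}}}

if [ -z "$rank" ]; then
    # No recognized rank variable; run unprofiled rather than guess.
    exec "$@"
fi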

You should exercise care in selecting the MPI ranks for which you want to collect data. Rank zero is usually different from all the other ranks, and it may or may not be particularly interesting. You should choose whichever ranks are relevant to your performance problems.

Analysis of Performance Data

Performance data can be examined with the analyzer application. MPI API tracing data can be examined in the MPI Timeline and in a graphing facility called MPI Charts. In addition, metrics from tracing and profiling can be browsed in displays that identify the functions, source lines, and calling sequences that contributed to performance metrics.

The following sections show a few examples of these analysis views.

MPI Timeline and MPI Charts

The MPI Timeline and MPI Charts tabs can be used to explore the MPI API trace data.

MPI Timeline

The MPI Timeline graphically displays the MPI activity that occurred during an application's run. For each MPI process you can look horizontally to see what the process is doing as a function of elapsed time. Figure 1 is a screen shot of the Oracle Solaris Studio Analyzer window which shows processes P0 to P24 over a timespan of 560 milliseconds.

Figure 1: MPI Timeline Displays MPI Activity

The Absolute Time, measured in seconds, is shown at the top along the horizontal axis. The Relative Time, measured in milliseconds, is shown at the bottom along the horizontal axis. At the left, the MPI process ranks from P0 to P24 are listed. At this level of zoom, the names of MPI API functions are not visible.

Figure 2 shows a zoomed-in view of the data shown above, with the MPI_Waitall function selected. Details about this function call are shown on the right.

Figure 2: MPI Timeline With MPI_Waitall Function

The Messages slider, on the right, controls the number of message lines displayed on the screen. You can adjust the message volume so the screen is readable and the tool remains responsive. If fewer than 100% of the messages are shown, the visible messages are those that are most "costly" in terms of the total time used in the send and receive functions of that message.

MPI Charts

The MPI Charts tab generates scatter plots and histograms that visualize the MPI API trace data. MPI Charts can be used to identify communication patterns, imbalances, and outliers.

The charting facility is a general one: you can specify which data types and metrics to plot. The initial view, shown in Figure 3, helps you visualize the ratio of Application time to time spent in various MPI API functions:

Figure 3: Sum of Function Duration: Application Time

MPI Charts can also be used to show communication patterns. Figure 4 shows the data volume sent to and from each rank:

Figure 4: Sum of Message Bytes

One way to identify outliers is to use scatter plots. For example, Figure 5 shows a distribution of average function duration. The function Entry Times, measured in seconds, are shown along the X axis, and the average function duration for each time period is shown along the Y axis. Colors near the red end of the scale identify periods when the average function duration is higher than normal. There are two outliers in this screen shot, one with an Entry Time near 20 seconds and another near 40 seconds.

Figure 5: Average of Function Duration

For more information on MPI Timeline, MPI Charts, and MPI Filtering, please see the MPI Analyzer Tutorial.

Examining Computation Profiling

You can identify the functions, source lines, and calling sequences that contribute the most to each performance metric by browsing the Functions, Source, and Callers-Callees tabs, respectively.

Functions Tab

The Functions Tab is used to identify the functions that consumed the most resources. It consists of a table of functions and the associated metrics for each function. Any column can be sorted by clicking on the column header. Functions can be selected in order to see more details, and the source can be viewed for a selected function.

Figure 6 shows the Functions tab with selected metrics on the left. Details for the selected function, y_solve_, are listed on the right.

Figure 6: Function Metrics

Of the metrics shown on the right, clock profiling provides the information from User CPU down to Other Wait. The metrics supplied by MPI API tracing follow:

  • MPI Bytes Sent: count of bytes sent

  • MPI Sends: number of messages sent

  • MPI Bytes Received: count of bytes received

  • MPI Receives: number of messages received

  • Other MPI Events: MPI calls that neither sent nor received messages

  • MPI Time: wall-clock time spent in each call

At the end of the list are the metrics supplied by Oracle Message Passing Toolkit (ClusterTools) 8.1 or later with clock profiling:

  • MPI-Work Time: real time that accumulates while the MPI Library is doing work

  • MPI-Wait Time: real time that accumulates while the application is blocked because the MPI Library is busy-waiting or sleep-waiting for resources or messages

Source Tab

The Source Tab shows source code annotated with performance metrics.

Clicking on the Source Tab after selecting the y_solve_ function shows the source code for that function. Figure 7 shows the y_solve_ source code.

Figure 7: y_solve_ Source Code

 

The text in blue is commentary supplied by the Oracle Solaris Studio compiler. On the left, the MPI Work and MPI Wait metrics are shown. These metrics show that the call to mpi_wait accounts for 3.4 seconds of MPI Work time and 48.601 seconds of MPI Wait time.

There are a number of other Analyzer tabs available. For further information, see the analyzer(1) man page or the online documentation for the Analyzer.

Supported MPI Implementations

MPI Implementations Recognized for MPI API Tracing

Table 1 lists the MPI Implementations which are recognized for MPI API Tracing. You must specify the implementation as the parameter to the collect -M argument; it affects how the arguments are passed to the command to spawn the MPI job, and which version of the libcollector MPI interposition library is LD_PRELOADed into each target MPI process.

Implementation                    Tested Version   MPI Rank Environment Variable   -M Parameter
Oracle Message Passing Toolkit    8.2.1c           OMPI_COMM_WORLD_RANK            OMPT
ClusterTools                      8.1              OMPI_COMM_WORLD_RANK            CT
ClusterTools                      8.0              OMPI_COMM_WORLD_RANK            CT
ClusterTools                      7.1              OMPI_MCA_ns_nds_vpid            CT
ClusterTools                      7.0              OMPI_MCA_ns_nds_vpid            CT
Open MPI                          1.2.7            OMPI_MCA_ns_nds_vpid            OPENMPI
MPICH2                            1.0.7            PMI_RANK                        MPICH2
MVAPICH2                          1.2rc2           PMI_RANK                        MVAPICH2

Table 1. MPI Implementations recognized for MPI API Tracing
Note: "Sun ClusterTools" is now "Oracle Message Passing Toolkit" (OMPT).

Additional MPI Implementations Recognized for Computation-Only Profiling

While you cannot collect communication profile data for an unsupported version of MPI, you can collect computation profile data. For example, you can collect a computation profile using the command:

  mpirun -np 4 -- collect -p on a.out 

(Note that some MPI implementations use a differently-named command for mpirun.)

The command collects computation profiles in a separate experiment for each MPI rank. If the MPI rank environment variable is recognized, the experiments are named by rank. If the rank is specified by a variable that is not recognized, the experiments are named in the order created.

You can bring up the Analyzer on a single experiment, or on all of the experiments to see the data aggregated across ranks. Specifying the -g name.erg parameter to collect creates a group file naming all of the experiments; you can bring up the Analyzer on that group file.
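
For example, a sketch using a hypothetical group-file name ranks.erg:

  mpirun -np 4 -- collect -p on -g ranks.erg a.out
  analyzer ranks.erg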

Table 2 lists additional MPI implementations that are not recognized for MPI API tracing. If the environment variable specifying the rank is recognized, the experiments will be named by rank.


Implementation   MPI Rank Environment Variable
Sun HPC          SUNHPC_PROC_RANK
LAM              LAMRANK, LAM_MPI_RANK
LSF              LSF_PM_COMM_RANK
Intel            MPD_JRANK
MPICH            MPI_RUNRANK, MPICH_IPROC
MVAPICH          MPI_RUNRANK, MPICH_IPROC
?                MP_RANK

Table 2. MPI Implementations not recognized for MPI API Tracing

The Authors

Yukon Maruyama received a degree in electrical engineering from Stanford. He has worked on various software and hardware projects at Siemens AG, DIBA Inc., and Sun Microsystems Laboratories. In 2004, Yukon joined the Oracle Solaris Studio Performance Tools group.

Marty Itzkowitz received an A.B. degree from Columbia College and a Ph.D. in chemistry and physics from Caltech. After a postdoctoral fellowship at the University of California, Berkeley, he worked on operating systems and distributed services at Lawrence Berkeley Laboratory. He was head of Operating Systems at Vitesse Electronics, and then worked on operating system performance and performance tools at Sun Microsystems and later at Silicon Graphics. He returned to Sun in 1998 as project lead for the Oracle Solaris Studio Performance Tools. His interests include operating system design and performance, multiprocessor performance, performance tools, and scientific visualization. He is an avid handball player and cook.