By Terry Dontje, Karen Norteman, Rolf vandeVaart, August 2007; Updated November 2008
This article describes how to use the Solaris Dynamic Tracing (DTrace) utility with Open MPI. DTrace is a comprehensive dynamic tracing utility that you can use to monitor the behavior of application programs as well as the operating system itself. You can use DTrace on live production systems to understand those systems' behavior and to track down any problems that might be occurring.
The D language is the programming language used to create the source code for DTrace programs.
The content of this article assumes knowledge of the D language and how to use DTrace.
mpirun Privileges
mpiperuse Provider
mpiperuse Probes
mpiperuse Probe in a D Script
mpiperuse Probes to See Message Queues
For more information about the D language and DTrace, refer to the Solaris Dynamic Tracing Guide. This guide is part of the Solaris 10 OS Software Developer Collection.
Note – The programs and script mentioned in the sections that follow are located at:
/opt/SUNWhpc/examples/mpi/dtrace
mpirun Privileges
Before you run a program under DTrace, you need to make sure that you have the correct mpirun privileges.
To run the script under mpirun, make sure that you have dtrace_proc and dtrace_user privileges. Otherwise, DTrace returns the following error because it does not have sufficient privileges.
dtrace: failed to initialize dtrace: DTrace requires additional privileges
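To determine whether you have the required privileges, use a small shell script such as mpppriv.sh:
#!/bin/sh
# mpppriv.sh - run ppriv under a shell so you can get the privileges
# of the process that mpirun creates
ppriv $$
Run the script under mpirun, replacing host1 and host2 with the names of hosts in your cluster:
mpirun -np 2 --host host1,host2 mpppriv.sh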
If the output of ppriv shows that the E privilege set has the dtrace privileges, then you will be able to run dtrace under mpirun (see the following two examples). Otherwise, you must adjust your system to get dtrace access.
The following example shows the output from ppriv when the privileges have not been set.
% ppriv $$
4084: -csh
flags = <none>
E: basic
I: basic
P: basic
L: all
This example shows ppriv output when the privileges have been set.
% ppriv $$
2075: tcsh
flags = <none>
E: basic,dtrace_proc,dtrace_user
I: basic,dtrace_proc,dtrace_user
P: basic,dtrace_proc,dtrace_user
L: all
Note – To update your privileges, ask your system administrator to add the dtrace_user and dtrace_proc privileges to your account in the /etc/user_attr file.
After the privileges have been changed, you can use the ppriv command to view the changed privileges.
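The privilege check described in this section can also be scripted. The following is a minimal sketch rather than part of the ClusterTools examples: has_dtrace_privs is a hypothetical helper name, and the sketch assumes ppriv output in the format shown in the preceding examples. It also assumes that an effective set of "all" should count as including both DTrace privileges.

```shell
#!/bin/sh
# has_dtrace_privs: hypothetical helper, not part of the ClusterTools examples.
# Reads ppriv output on stdin and succeeds (exit status 0) only when the
# effective (E) privilege set contains both dtrace_proc and dtrace_user.
# Assumption: an E set of "all" is treated as containing both privileges.
has_dtrace_privs() {
    awk '
        $1 ~ /^E:/ {
            sub(/^[^:]*:/, "")            # drop the "E:" label
            n = split($1, privs, ",")     # the set is a comma-separated list
            for (i = 1; i <= n; i++) {
                if (privs[i] == "dtrace_proc" || privs[i] == "all") has_proc = 1
                if (privs[i] == "dtrace_user" || privs[i] == "all") has_user = 1
            }
        }
        END { exit !(has_proc && has_user) }
    '
}
```

A script such as mpppriv.sh could then pipe its output through the helper, for example `ppriv $$ | has_dtrace_privs || echo "missing DTrace privileges"`.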
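There are two ways to use dynamic tracing with MPI programs: run the MPI program directly under DTrace, or attach DTrace to an MPI program that is already running.
For illustration purposes, assume you have a program named mpiapp.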
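To run mpiapp under DTrace from the start of the job using the mpitrace.d script, type a command similar to the following:
% mpirun -np 4 dtrace -s mpitrace.d -c mpiapp
With this command, the trace output from all four processes goes to the same place, standard output (stdout). One way around this problem is to create a script similar to the script in the following section.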
Create a shell script (named partrace.sh in this example) similar to the following:
#!/bin/sh
# partrace.sh - a helper script to dtrace Open MPI jobs from the
# start of the job.
dtrace -s $1 -c $2 -o $2.$OMPI_COMM_WORLD_RANK.trace
Then run the partrace.sh shell script:
% mpirun -np 4 partrace.sh mpitrace.d mpiapp
This command runs mpiapp under DTrace using the mpitrace.d script. The script saves the trace output for each process in a job under a separate file name, based on the program name and rank of the process. Note that subsequent runs append the data into the existing trace files.
Note – The OMPI_COMM_WORLD_RANK variable is unstable and subject to change. Use this variable with caution.
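Because the rank variable might change or be absent, a helper script can guard against it being unset. The following is a minimal sketch under that assumption: trace_file_name is a hypothetical function, and the "pid" fallback naming is an illustration only, not something the ClusterTools examples define.

```shell
#!/bin/sh
# trace_file_name PROGRAM: hypothetical helper for a partrace.sh-style script.
# Prints PROGRAM.<rank>.trace when OMPI_COMM_WORLD_RANK is set. Otherwise it
# falls back to PROGRAM.pid<PID>.trace (an assumed naming choice) so that
# processes do not all write to a file named "PROGRAM..trace".
trace_file_name() {
    echo "$1.${OMPI_COMM_WORLD_RANK:-pid$$}.trace"
}
```

In a helper script, the dtrace line would then become `dtrace -s "$1" -c "$2" -o "$(trace_file_name "$2")"`.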
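The second way to use dtrace with Open MPI is to attach dtrace to a running MPI program.
Log in to the node of interest and find the process IDs of the running MPI processes:
% prstat 0 1 | grep mpiapp
24768 joeuser 526M 3492K sleep 59 0 0:00:08 0.1% mpiapp/1
24770 joeuser 518M 3228K sleep 59 0 0:00:08 0.1% mpiapp/1
Decide which process you want to attach to with dtrace.
Then attach to that process (process ID 24770 in this example) and run the DTrace script mpitrace.d:
dtrace -p 24770 -s mpitrace.d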
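The process-lookup step can be sketched as a small filter. This is an illustration only: mpi_pids is a hypothetical helper, and it assumes prstat lines in the format shown above, where the first column is the process ID and the last column is the program name followed by a slash and a thread count.

```shell
#!/bin/sh
# mpi_pids PROGRAM: hypothetical helper. Reads prstat output on stdin and
# prints, in ascending order, the PIDs of lines whose last column names PROGRAM.
mpi_pids() {
    awk -v prog="$1" '
        {
            split($NF, last, "/")          # "mpiapp/1" -> "mpiapp", "1"
            if (last[1] == prog) print $1  # first column is the PID
        }
    ' | sort -n
}
```

For example, `prstat 0 1 | mpi_pids mpiapp` would list the candidate process IDs to pass to dtrace -p.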
DTrace enables you to easily trace programs. When used in conjunction with MPI and the more than 200 functions defined in the MPI standard, DTrace provides an easy way to determine which functions might be in error during the debugging process, or those functions that might be of interest. After you determine the function showing the error, it is easy to locate the desired job, process, and rank on which to run your scripts. As demonstrated previously, DTrace allows you to perform these determinations while the program is running.
Although the MPI standard provides the MPI profiling interface, using DTrace does provide a number of advantages. For example, you can attach DTrace to a job that is already running.
The following example shows a simple script that traces the entry and exit into all the MPI API calls.
mpitrace.d:
/* Fires on entry to every MPI_* function in libmpi. */
pid$target:libmpi:MPI_*:entry
{
    printf("Entered %s...", probefunc);
}

/* Fires on return; in pid return probes, arg1 holds the return value. */
pid$target:libmpi:MPI_*:return
{
    printf("exiting, return value = %d\n", arg1);
}
When you use this example script to attach DTrace to a job that performs send and recv operations, the output looks similar to the following.
% dtrace -q -p 24770 -s mpitrace.d
Entered MPI_Send...exiting, return value = 0
Entered MPI_Recv...exiting, return value = 0
Entered MPI_Send...exiting, return value = 0
Entered MPI_Recv...exiting, return value = 0
Entered MPI_Send...exiting, return value = 0
...
You can easily modify the mpitrace.d script to include an argument list. The resulting output resembles truss output. For example:
mpitruss.d:
/* Print the first six arguments of each matched send or receive call. */
pid$target:libmpi:MPI_Send:entry,
pid$target:libmpi:MPI_*send:entry,
pid$target:libmpi:MPI_Recv:entry,
pid$target:libmpi:MPI_*recv:entry
{
    printf("%s(0x%x, %d, 0x%x, %d, %d, 0x%x)", probefunc, arg0, arg1, arg2, arg3, arg4, arg5);
}

/* arg1 is the return value of the traced function. */
pid$target:libmpi:MPI_Send:return,
pid$target:libmpi:MPI_*send:return,
pid$target:libmpi:MPI_Recv:return,
pid$target:libmpi:MPI_*recv:return
{
    printf("\t\t = %d\n", arg1);
}
The mpitruss.d script shows how you can specify wildcard names to match the functions. Both probes match all send and receive type function calls in the MPI library. The first probe shows the usage of the built-in arg variables to print out the argument list of the function being traced.
Take care when wildcarding the entry point and formatting the argument output, because you could end up printing either too many arguments, or not enough arguments, for certain functions. For example, in the preceding case, the MPI_Irecv and MPI_Isend functions will not have their Request handle parameters printed.
The following example shows a sample output of the mpitruss.d script.
% dtrace -q -p 24770 -s mpitruss.d
MPI_Send(0x80470b0, 1, 0x8060f48, 0, 1, 0x8060d48) = 0
MPI_Recv(0x80470a8, 1, 0x8060f48, 0, 0, 0x8060d48) = 0
MPI_Send(0x80470b0, 1, 0x8060f48, 0, 1, 0x8060d48) = 0
MPI_Recv(0x80470a8, 1, 0x8060f48, 0, 0, 0x8060d48) = 0
...
One of the biggest issues with programming is the unintentional leaking of resources, such as memory. With MPI, tracking and repairing resource leaks can be somewhat more challenging because the objects being leaked are in the middleware, and thus are not easily detected by the use of memory checkers.
DTrace helps with debugging such problems using variables, the profile provider, and a call stack function. The mpicommcheck.d script (shown in the following example) probes for all the MPI communicator calls that allocate and deallocate communicators, and it keeps track of the stack each time the function is called. Every 10 seconds the script dumps out the current count of MPI communicator calls and the total calls for the allocation and deallocation of communicators. When the dtrace session ends (usually by pressing Ctrl-C, if you attached to a running MPI program), the script prints the totals and all the different stack traces, as well as the number of times those stack traces were reached.
To perform these tasks, the script uses DTrace features such as variables, associative arrays, built-in functions (count, ustack) and the predefined variable probefunc.
The following example shows the mpicommcheck.d script.
mpicommcheck.d:
BEGIN
{
    allocations = 0;
    deallocations = 0;
    prcnt = 0;
}

/* Calls that allocate a communicator. */
pid$target:libmpi:MPI_Comm_create:entry,
pid$target:libmpi:MPI_Comm_dup:entry,
pid$target:libmpi:MPI_Comm_split:entry
{
    ++allocations;
    @counts[probefunc] = count();
    @stacks[ustack()] = count();
}

/* Calls that deallocate a communicator. */
pid$target:libmpi:MPI_Comm_free:entry
{
    ++deallocations;
    @counts[probefunc] = count();
    @stacks[ustack()] = count();
}

/* Print the running totals on every tenth one-second tick. */
profile:::tick-1sec
/++prcnt >= 10/
{
    printf("=====================================================================");
    printa(@counts);
    printf("Communicator Allocations = %d\n", allocations);
    printf("Communicator Deallocations = %d\n", deallocations);
    prcnt = 0;
}

END
{
    printf("Communicator Allocations = %d, Communicator Deallocations = %d\n",
        allocations, deallocations);
}
Use this script to watch a suspect section of code in your program (that is, a section of code that might contain a resource leak). If, while the script is running, you see that the printed totals for allocations and deallocations are starting to steadily diverge, you might have a resource leak. Depending on how your program is designed, it might take some time and observation of the allocation and deallocation totals to definitively determine that the code contains a resource leak. Once you do determine that a resource leak is definitely occurring, you can press Ctrl-C to break out of the dtrace session. Next, using the stack traces dumped, you can try to determine where the issue might be occurring.
The following example shows the output that the mpicommcheck.d script displays for a program that contains a resource leak.
The sample MPI program that contains the resource leak is named mpicommleak. This program performs three MPI_Comm_dup operations and two MPI_Comm_free operations. The program thus "leaks" one communicator with each iteration of a loop.
When you attach dtrace to mpicommleak using the mpicommcheck.d script, you see a 10-second periodic output. This output shows that the count of the allocated communicators is growing faster than the count of deallocations.
When you finally end the dtrace session by pressing Ctrl-C, the session outputs a total of five stack traces, showing the three distinct MPI_Comm_dup and two distinct MPI_Comm_free call stacks, as well as the number of times each call stack was encountered.
For example:
% prstat 0 1 | grep mpicommleak
24952 joeuser 518M 3212K sleep 59 0 0:00:01 1.8% mpicommleak/1
24950 joeuser 518M 3212K sleep 59 0 0:00:00 0.2% mpicommleak/1
% dtrace -q -p 24952 -s mpicommcheck.d
=====================================================================
MPI_Comm_free 4
MPI_Comm_dup 6
Communicator Allocations = 6
Communicator Deallocations = 4
=====================================================================
MPI_Comm_free 8
MPI_Comm_dup 12
Communicator Allocations = 12
Communicator Deallocations = 8
=====================================================================
MPI_Comm_free 12
MPI_Comm_dup 18
Communicator Allocations = 18
Communicator Deallocations = 12
^C
Communicator Allocations = 21, Communicator Deallocations = 14
libmpi.so.0.0.0'MPI_Comm_free
mpicommleak'deallocate_comms+0x19
mpicommleak'main+0x6d
mpicommleak'0x805081a
7
libmpi.so.0.0.0'MPI_Comm_free
mpicommleak'deallocate_comms+0x26
mpicommleak'main+0x6d
mpicommleak'0x805081a
7
libmpi.so.0.0.0'MPI_Comm_dup
mpicommleak'allocate_comms+0x1e
mpicommleak'main+0x5b
mpicommleak'0x805081a
7
libmpi.so.0.0.0'MPI_Comm_dup
mpicommleak'allocate_comms+0x30
mpicommleak'main+0x5b
mpicommleak'0x805081a
7
libmpi.so.0.0.0'MPI_Comm_dup
mpicommleak'allocate_comms+0x42
mpicommleak'main+0x5b
mpicommleak'0x805081a
7
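When the output of mpicommcheck.d is saved to a file, the final totals can be compared automatically. The following is a minimal sketch, not part of the ClusterTools examples: comm_leak_count is a hypothetical helper, and it assumes the END summary line has exactly the format printed by the script above.

```shell
#!/bin/sh
# comm_leak_count: hypothetical helper. Reads mpicommcheck.d output on stdin
# and prints allocations minus deallocations, taken from the last summary line
# of the form "Communicator Allocations = N, Communicator Deallocations = M".
# Prints nothing and returns nonzero if no such line is present.
comm_leak_count() {
    sed -n 's/^Communicator Allocations = \([0-9][0-9]*\), Communicator Deallocations = \([0-9][0-9]*\)$/\1 \2/p' |
        tail -n 1 |
        { read -r allocs deallocs && echo $((allocs - deallocs)); }
}
```

For the session shown above, the summary line reports 21 allocations and 14 deallocations, so the helper would print 7, which matches one leaked communicator for each of the seven loop iterations recorded in the stack counts.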
mpiperuse Provider
PERUSE is an MPI interface that allows you to obtain detailed information about the performance and interactions of processes, software, and MPI. PERUSE provides a greater level of detail about process performance than does the standard MPI profiling interface (PMPI).
For more information about PERUSE and the current PERUSE specification, see MPI PERUSE.
Open MPI includes a DTrace provider named mpiperuse. This provider enables you to configure Open MPI to support DTrace probes into the Open MPI shared library libmpi.
In Sun HPC ClusterTools 8.1 software, there are preconfigured executables and libraries with the mpiperuse provider probes built in. They are located in the /opt/SUNWhpc/HPC8.1/sun/instrument directory. Use the wrappers and utilities located in this directory to access the mpiperuse provider.
Note – No recompilation is necessary to use the mpiperuse provider. Just run the application to be traced with DTrace using /opt/SUNWhpc/HPC8.1/sun/instrument/bin/mpirun.
mpiperuse Probes
The DTrace mpiperuse probes expose the events specified in the current PERUSE specification. These events track the life cycle of requests within the MPI library. For more information about this life cycle and the actual events provided by PERUSE, see Section 4 of the PERUSE Specification.
Sections 4.3.1 and 4.4 of the PERUSE Specification list and describe the individual events exposed by PERUSE.
The mpiperuse provider makes these events available to DTrace. The probe names correspond to the event names listed in Sections 4.3.1 and 4.4 of the PERUSE Specification. For each event, the corresponding probe name is similar, except that the leading PERUSE is removed, the probe name is all lowercase, and underscores are replaced with hyphens. For example, the probe for PERUSE_COMM_MSG_ARRIVED is comm-msg-arrived.
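The naming rule above is mechanical, so it can be expressed as a short shell function. This is a sketch for illustration: peruse_to_probe is a hypothetical helper name, and the function only applies the three transformations described in the text without checking that the event actually exists.

```shell
#!/bin/sh
# peruse_to_probe EVENT: hypothetical helper that converts a PERUSE event name
# into the corresponding mpiperuse probe name by removing the leading
# "PERUSE_", lowercasing the name, and replacing underscores with hyphens.
peruse_to_probe() {
    printf '%s\n' "$1" | sed 's/^PERUSE_//' | tr '[:upper:]' '[:lower:]' | tr '_' '-'
}
```

For example, `peruse_to_probe PERUSE_COMM_MSG_ARRIVED` prints comm-msg-arrived.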
All of the probes are classified under the mpiperuse provider. This means that to find the probe names, you look under the mpiperuse name. It also means that when you make a DTrace statement, you can include a wildcard for all probes by using the mpiperuse classification.
mpiperuse Probe in a D Script
In the D scripting language, specifying an mpiperuse provider takes the following form.
mpiperuse$target:::probe-name
where probe-name is the name of the mpiperuse probe you want to use.
For example, to specify a probe to capture a PERUSE_COMM_REQ_ACTIVATE event, add the following line to a D script.
mpiperuse$target:::comm-req-activate
This alerts DTrace that you want to use the mpiperuse provider to capture the PERUSE_COMM_REQ_ACTIVATE event. In this example, the optional object and function fields in the probe description are omitted. This directs DTrace to find all occurrences of the comm-req-activate probes in the MPI library and its plugins instead of a specific probe. This is necessary because certain probes can appear in multiple places in the MPI library.
For more information about the D language and its syntax, refer to the Solaris Dynamic Tracing Guide. This guide is part of the Solaris 10 OS Software Developer Collection.
All of the mpiperuse probes receive the following arguments.
Table 1 Available mpiperuse Arguments
mpiperuse Probes to See Message Queues
To use the mpiperuse provider, make reference to the appropriate mpiperuse provider probes and arguments in a DTrace script, as you would for any other provider (such as the pid provider).
The procedure for running scripts with mpiperuse probes follows the same steps as those shown in Running an MPI Program Under DTrace and Attaching DTrace to a Running MPI Program, except that you must edit the partrace.sh script before you run it.
Change partrace.sh to include a -Z switch after the dtrace command, as shown in the following example.
#!/bin/sh
# partrace.sh - a helper script to dtrace Open MPI jobs from the
# start of the job.
dtrace -Z -s $1 -c $2 -o $2.$OMPI_COMM_WORLD_RANK.trace
This change allows probes that do not exist at initial load time to be used in a script (that is, the probes are in plugins that have not been opened with dlopen).
The following example shows how to use the mpiperuse probes when running a DTrace script. Use the example script provided in /opt/SUNWhpc/HPC8.1/sun/examples/dtrace/mpistat.d.
mpiperuse Probes to See Message Queues
Compile and run your program with the instrumented wrappers. In this example, the program source file is dtest.c. Substitute the name and path of your own program in place of dtest.c:
% /opt/SUNWhpc/HPC8.1/sun/instrument/bin/mpicc ~myhomedir/scraps/usdt/examples/dtest.c -o dtest
% /opt/SUNWhpc/HPC8.1/sun/instrument/bin/mpirun -np 2 dtest
Initing MPI...
Initing MPI...
Do communications...
Do communications...
attach to pid 13371 to test tracing.
Attach dtrace to the reported process and run the mpistat.d script:
% dtrace -q -p 13371 -s /opt/SUNWhpc/HPC8.1/sun/examples/dtrace/mpistat.d
input(Total) Q-sizes Q-Matches output
bytes active posted unexp posted unexp bytes active
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 1 0 5 0
0 0 0 0 1 0 5 0
0 0 0 0 1 0 5 0
0 0 0 0 1 0 5 0
0 0 0 0 1 0 5 0
0 0 0 0 1 0 5 0
0 0 0 0 1 0 5 0
0 0 0 0 1 0 5 0
0 0 0 0 2 0 10 0
0 0 0 0 2 0 10 0
0 0 0 0 2 0 10 0
0 0 0 0 2 0 10 0
0 0 0 0 2 0 10 0
0 0 0 0 2 0 10 0
0 0 0 0 2 0 10 0
0 0 0 0 2 0 10 0
0 0 0 0 3 0 15 0
0 0 0 0 3 0 15 0
0 0 0 0 3 0 15 0
0 0 0 0 3 0 15 0
0 0 0 0 3 0 15 0
0 0 0 0 3 0 15 0
0 0 0 0 3 0 15 0
mpiperuse Usage Examples
The examples in this section show how to perform the described DTrace operations from the command line.
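The following command counts comm-req-xfer-end events by remote host. Replace pid with the ID of the process to trace:
dtrace -p pid -n 'mpiperuse$target:::comm-req-xfer-end { @[args[0]->ci_remote] = count(); }'
In the following example, the process ID is 25428 and the remote host is joe-users-host2.
% dtrace -p 25428 -n 'mpiperuse$target:::comm-req-xfer-end {@[args[0]->ci_remote] = count();}'
dtrace: description 'mpiperuse$target:::comm-req-xfer-end ' matched 17 probes
^C
joe-users-host2 recv 3
joe-users-host2 send 3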
The following command counts comm-req-xfer-end events by the protocol used for the transfer (ci_protocol):
dtrace -p pid -n 'mpiperuse$target:::comm-req-xfer-end { @[args[0]->ci_protocol] = count(); }'
For example:
% dtrace -p 25445 -n 'mpiperuse$target:::comm-req-xfer-end {@[args[0]->ci_protocol] = count();}'
dtrace: description 'mpiperuse$target:::comm-req-xfer-end ' matched 17 probes
^C
sm 60
The following command shows the distribution of the mcs_count value of each transfer, grouped by remote host:
dtrace -p pid -n 'mpiperuse$target:::comm-req-xfer-end { @[args[0]->ci_remote] = quantize(args[3]->mcs_count); }'
For example:
% dtrace -p 25445 -n 'mpiperuse$target:::comm-req-xfer-end {@[args[0]->ci_remote] = quantize(args[3]->mcs_count);}'
dtrace: description 'mpiperuse$target:::comm-req-xfer-end ' matched 17 probes
^C
myhost
value ------------- Distribution ------------- count
2 | 0
4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
8 | 0
The following command shows the same distribution grouped by communicator (mcs_comm), peer (mcs_peer), and operation (mcs_op):
dtrace -p pid -n 'mpiperuse$target:::comm-req-xfer-end { @[args[3]->mcs_comm, args[3]->mcs_peer, args[3]->mcs_op] = quantize(args[3]->mcs_count); }'
For example:
% dtrace -p 24937 -n 'mpiperuse$target:::comm-req-xfer-end {@[args[3]->mcs_comm, args[3]->mcs_peer, args[3]->mcs_op] = quantize(args[3]->mcs_count);}'
dtrace: description 'mpiperuse$target:::comm-req-xfer-end ' matched 19 probes
^C
134614864 1 recv
value ------------- Distribution ------------- count
2 | 0
4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 9
8 | 0
134614864 1 send
value ------------- Distribution ------------- count
2 | 0
4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 9
8 | 0
Terry Dontje is a Senior Staff Engineer and Technical Lead in the Cluster Tools group who has spent over 20 years in the parallel computing industry. He has spent over 10 years working on MPI implementations, including being a vendor representative to the MPI-2 forum and co-designer and implementor of Sun's various MPI implementations starting with Cluster Tools 3 through Cluster Tools 8. He's done work around interfacing MPI implementations to many different user by-pass network protocols. He currently is an active member and Sun's technical representative to the Open MPI community. Terry enjoys interruptions from his daughter and prechewed crackers in his hands. At least the former is true.
Karen Norteman is a software technical writer for the Systems Group at Sun Microsystems. She is the lead writer for the Sun HPC ClusterTools project, based in Burlington, MA. Karen lives in rural southern Maine with her other half, Greg Hall, three bearded collies, and one disgruntled cat.
Rolf vandeVaart is a software developer who has spent many years working on MPI. Initially, he worked on Sun HPC ClusterTools 3, 4, 5, 6, and 7, and he now works on Sun ClusterTools 8, which is based upon Open MPI. He is also an active member of the Open MPI community. He is based at Sun's Burlington, MA campus.
