Introducing OpenMP: A Portable, Parallel Programming API for Shared Memory Multiprocessors

   
By Nawal Copty, Sun Microsystems, January 2007

Sun Studio compilers (C/C++/Fortran 95) support OpenMP parallelization natively. OpenMP is an emerging standard model for parallel programming in a shared memory environment. It provides a set of pragmas, runtime routines, and environment variables that programmers can use to easily parallelize their code. This article provides a brief introduction to OpenMP and to OpenMP support in Sun Studio compilers. It is of particular interest to programmers who are new to OpenMP and parallel programming in Fortran, C, or C++.

What Is OpenMP?

OpenMP is a specification for parallelizing programs in a shared memory environment. OpenMP provides a set of pragmas, runtime routines, and environment variables that programmers can use to specify shared-memory parallelism in Fortran, C, and C++ programs.

When OpenMP pragmas are used in a program, they direct an OpenMP-aware compiler to generate an executable that will run in parallel using multiple threads. Few source code modifications are necessary (other than some fine-tuning to get maximum performance). OpenMP pragmas give you an elegant, uniform, and portable interface for parallelizing programs on various architectures and systems. OpenMP is a widely accepted specification, and vendors like Sun, Intel, IBM, and SGI support it. (See the Resources section below for a link to the OpenMP website, where you can find the latest OpenMP specification document.)

OpenMP takes parallel programming to the next level by creating and managing threads for you. All you need to do is insert appropriate pragmas in the source program, and then compile the program with a compiler supporting OpenMP and with the appropriate compiler option (Sun Studio uses the -xopenmp compiler option). The compiler interprets the pragmas and parallelizes the code. When using compilers that are not OpenMP-aware, the OpenMP pragmas are silently ignored.
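
For example, with the Sun Studio C compiler you might compile an OpenMP program roughly as follows; the file name prog.c is just a placeholder, and -xO3 simply makes the optimization level explicit:

    % cc -xO3 -xopenmp -o prog prog.c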

While this article gives examples of using OpenMP with C and C++ programs, equivalent pragmas exist for Fortran 95 as well. See the OpenMP User's Guide for details.

OpenMP Pragmas

The OpenMP specification defines a set of pragmas. A pragma is a compiler directive that tells the compiler how to process the block of code that follows it. The most basic pragma is #pragma omp parallel, which denotes a parallel region.

OpenMP uses the fork-join model of parallel execution. An OpenMP program begins as a single thread of execution, called the initial thread. When a thread encounters a parallel construct, it creates a new team of threads composed of itself and zero or more additional threads, and becomes the master of the new team. All members of the new team (including the master) execute the code inside the parallel construct. There is an implicit barrier at the end of the parallel construct. Only the master thread continues execution of user code beyond the end of the parallel construct.
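
The following minimal program sketches the fork-join model; omp_get_thread_num() returns the calling thread's number within the team:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("Before the parallel region: one thread\n");

        #pragma omp parallel
        {
            /* Every thread in the team, including the master, runs this block. */
            printf("Hello from thread %d\n", omp_get_thread_num());
        }   /* Implicit barrier; only the master continues past this point. */

        printf("After the parallel region: one thread again\n");
        return 0;
    }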

The number of threads in the team executing a parallel region can be controlled in several ways. One way is to use the environment variable OMP_NUM_THREADS. Another way is to call the runtime routine omp_set_num_threads(). Yet another way is to use the num_threads clause in conjunction with the parallel pragma.
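
For example, the following sketch requests a team size in two of these ways (the value 4 is arbitrary):

    #include <omp.h>

    void example(void)
    {
        omp_set_num_threads(4);       /* default team size for later parallel regions */

        #pragma omp parallel
        {
            /* executed by a team of (at most) 4 threads */
        }

        #pragma omp parallel num_threads(2)   /* overrides the team size for this region */
        {
            /* executed by a team of (at most) 2 threads */
        }
    }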

OpenMP supports two basic kinds of work-sharing constructs to specify that the work in a parallel region is to be divided among the threads in the team. These work-sharing constructs are loops and sections. The #pragma omp for pragma is used for loops, and the #pragma omp sections pragma is used for sections: blocks of code that can be executed in parallel.
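
The following fragment sketches both constructs inside one parallel region; process(), init_a(), and init_b() are hypothetical functions, and ii and n are assumed to be declared earlier:

    #pragma omp parallel
    {
        /* Loop work-sharing: the iterations are divided among the threads. */
        #pragma omp for
        for (ii = 0; ii < n; ii++) {
            process(ii);
        }

        /* Section work-sharing: each section is executed by one thread. */
        #pragma omp sections
        {
            #pragma omp section
            init_a();

            #pragma omp section
            init_b();
        }
    }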

The #pragma omp barrier instructs all threads in the team to wait for each other before they continue execution beyond the barrier. There is an implicit barrier at the end of a parallel region. The #pragma omp master instructs the compiler that the following block of code is to be executed by the master thread only. The #pragma omp single indicates that only one thread in the team should execute the following block of code; this thread may not necessarily be the master thread. You can use the #pragma omp critical pragma to protect a block of code that should be executed by a single thread at a time. Of course, all of these pragmas make sense only in the context of a parallel pragma (parallel region).
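
The following fragment sketches how these pragmas might be combined; read_input() and partial_work() are hypothetical functions, and local and total are doubles declared earlier (total initialized to zero):

    #pragma omp parallel private(local)
    {
        /* Exactly one thread (not necessarily the master) reads the input. */
        #pragma omp single
        read_input();               /* implicit barrier at the end of single */

        local = partial_work();     /* every thread does its share of the work */

        /* Only one thread at a time updates the shared total. */
        #pragma omp critical
        total = total + local;

        /* Wait until every thread has contributed. */
        #pragma omp barrier

        /* Only the master thread reports the result. */
        #pragma omp master
        printf("total = %f\n", total);
    }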

OpenMP Runtime Routines

OpenMP provides a number of runtime routines that can be used to obtain information about the threads in the program. These include omp_get_num_threads(), omp_set_num_threads(), omp_get_max_threads(), omp_in_parallel(), and others. In addition, OpenMP provides a number of lock routines that can be used for thread synchronization.
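
A brief sketch that uses a few of these routines, including a simple lock to serialize output:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_lock_t lock;
        omp_init_lock(&lock);

        #pragma omp parallel
        {
            if (omp_in_parallel()) {
                omp_set_lock(&lock);        /* serialize access to stdout */
                printf("thread %d of %d\n",
                       omp_get_thread_num(), omp_get_num_threads());
                omp_unset_lock(&lock);
            }
        }

        omp_destroy_lock(&lock);
        return 0;
    }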

OpenMP Environment Variables

OpenMP provides several environment variables that can be used to control the behavior of the OpenMP program.

An important environment variable is OMP_NUM_THREADS, which specifies the number of threads in the team used to execute a parallel region (including the master thread of the team). Another widely used environment variable is OMP_DYNAMIC. Set this environment variable to FALSE to disable dynamic adjustment of the number of threads at runtime by the implementation. The general rule is to make the number of threads no larger than the number of cores in the system.
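
For example, under a csh-style shell you might set these variables and run an OpenMP executable as follows (a.out is a placeholder):

    % setenv OMP_NUM_THREADS 4
    % setenv OMP_DYNAMIC FALSE
    % ./a.out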

In addition to the standard OpenMP environment variables, Sun Studio compilers provide an added set of Sun-specific environment variables that offer more control of the runtime environment. These are described in the OpenMP User's Guide.

Examples

A simple matrix multiplication program shows how OpenMP can be used to parallelize a program. Consider the following small code fragment, which multiplies two matrices, a and b, and stores the result in array. This is a very simple example; if you really want a good matrix multiply routine, you will have to consider cache effects or use a better algorithm (Strassen's, Coppersmith and Winograd's, and so on).

for (ii = 0; ii < nrows; ii++) {
    for (jj = 0; jj < ncols; jj++) {
        array[ii][jj] = 0;
        for (kk = 0; kk < nrows; kk++) {
            array[ii][jj] += a[ii][kk] * b[kk][jj];
        }
    }
}

Note that the iterations of the ii and jj loops above can be executed independently of each other, since each (ii, jj) iteration computes a distinct element of array. So parallelizing the above code segment is straightforward: insert the #pragma omp parallel for pragma before the outermost loop (the ii loop). It is beneficial to insert the pragma at the outermost loop, since this gives the most performance gain. In the parallelized loop, the variables array, a, b, ncols, and nrows are shared among the threads, while the variables ii, jj, and kk are private to each thread. The preceding code now becomes:

#pragma omp parallel for shared(array, a, b, ncols, nrows) private(ii, jj, kk)
for (ii = 0; ii < nrows; ii++) {
    for (jj = 0; jj < ncols; jj++) {
        array[ii][jj] = 0;
        for (kk = 0; kk < nrows; kk++) {
            array[ii][jj] += a[ii][kk] * b[kk][jj];
        }
    }
}

As another example, consider the following code fragment, which computes the sum of some_complex_long_function(a[ii]) for 0 <= ii < n.

for (ii = 0; ii < n; ii++) {
    sum = sum + some_complex_long_function(a[ii]);
}

To parallelize the above fragment, the first step could be:

#pragma omp parallel for shared(sum, a, n) private(ii, value)
for (ii = 0; ii < n; ii++) {
    value = some_complex_long_function(a[ii]);

    #pragma omp critical
    sum = sum + value;
}

or, better, you can use the reduction clause to get:

#pragma omp parallel for shared(a, n) private(ii) reduction(+: sum)
for (ii = 0; ii < n; ii++) {
    sum = sum + some_complex_long_function(a[ii]);
}

How To Begin

There are several ways to parallelize a program. First, determine whether you need parallelization at all: some algorithms are not suitable for it, and if you are starting a new project, you can choose an algorithm that parallelizes well. It is very important to make sure that the code is correct (serially) before trying to parallelize it. Also keep timings of your serial runs, so that you can decide whether parallelization is worthwhile.

Compile the serial version with several optimization options. The compiler can generally perform many more optimizations than you can.
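
For example, with Sun Studio you might compare serial timings at a couple of optimization levels; -xO3 and -fast are common Sun Studio optimization flags, timex reports elapsed time on Solaris, and prog.c is a placeholder:

    % cc -xO3 -o prog prog.c
    % timex ./prog
    % cc -fast -o prog prog.c
    % timex ./prog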

When you are ready to parallelize your program, there are a number of features and tools in Sun Studio that can help you achieve that goal. They are briefly described below.

Automatic Parallelization

Try using the automatic parallelization option of the compiler ( -xautopar). Delegating parallelization to the compiler allows you to parallelize a program without any effort on your part. The autoparallelizer can also help you identify pieces of code that can be parallelized using OpenMP pragmas, or point out things in the code that could prevent parallelization (for example, inter-loop dependencies). You can view compiler commentary by compiling your program with the -g flag and using the er_src(1) utility in Sun Studio, as follows.

             

% cc -g -xautopar -c source.c
% er_src source.o

Autoscoping

One common type of error in OpenMP programming is a scoping error, where a variable is erroneously scoped as shared when it should be private, or vice versa. The autoscoping feature in the Sun Studio compilers relieves you of the task of determining the scopes of variables. Two extensions to OpenMP are supported: the __auto clause and the default(__auto) clause. Refer to the OpenMP User's Guide for details.
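
For example, with default(__auto) you can let the compiler determine the scopes of the variables in the region. In a simple loop like the following sketch (arrays a, b, c and length n assumed declared earlier), the compiler should scope ii as private and the arrays as shared:

    #pragma omp parallel for default(__auto)
    for (ii = 0; ii < n; ii++) {
        c[ii] = a[ii] + b[ii];
    }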

dbx Debugger

The Sun Studio debugger, dbx, is a thread-aware debugger that can help you debug your OpenMP program. To debug your OpenMP program, first compile the program without any optimization by using the -xopenmp=noopt -g compiler options, and then run dbx on the resulting executable. With dbx, you can set breakpoints inside a parallel region and step through the code of a parallel region, examine variables that are private to a thread, and so on.
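
For example (prog.c and prog are placeholder names):

    % cc -xopenmp=noopt -g -o prog prog.c
    % dbx prog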

Performance Analyzer

Identify bottlenecks in your program by using the Sun Studio Performance Analyzer. This tool can help you identify hot routines in your program where a large amount of time is spent. The tool also provides work and wait metrics, attributed to functions, source lines, and instructions, that can help you identify bottlenecks in an OpenMP program.
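
A typical workflow, sketched below, is to collect an experiment with the collect(1) command and then browse it with the Analyzer; test.1.er is the default experiment name, and prog is a placeholder:

    % cc -xopenmp -g -o prog prog.c
    % collect ./prog
    % analyzer test.1.er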

 

Mixing OpenMP With MPI

MPI (Message Passing Interface) is another model for parallel programming. Unlike OpenMP, MPI spawns multiple processes that communicate by passing messages (over TCP/IP, for example, or a cluster interconnect). Since these processes do not share the same address space, they can run on remote machines (or a cluster of machines). It is difficult to say whether OpenMP or MPI is better; they both have their advantages and disadvantages. What is more interesting is that OpenMP can be used with MPI. Typically, you would use MPI to coarsely distribute work among several machines, and then use OpenMP to parallelize at a finer level on a single machine.
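
A hybrid program might look roughly like the following sketch, in which each MPI process works on its own block of data and OpenMP parallelizes the local loop; LOCAL_N and work() are hypothetical placeholders for the real decomposition and computation:

    #include <stdio.h>
    #include <mpi.h>

    #define LOCAL_N 1000   /* illustrative size of each process's block */

    /* Hypothetical per-element computation. */
    static double work(int ii, int rank) { return (double)(ii + rank); }

    int main(int argc, char *argv[])
    {
        int rank, nprocs, ii;
        double local_sum = 0.0, global_sum = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* OpenMP parallelizes the loop within each MPI process. */
        #pragma omp parallel for reduction(+: local_sum)
        for (ii = 0; ii < LOCAL_N; ii++) {
            local_sum += work(ii, rank);
        }

        /* MPI combines the partial results across the processes. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d processes = %f\n", nprocs, global_sum);

        MPI_Finalize();
        return 0;
    }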

In summary, Sun Studio compilers and tools support OpenMP natively and have many useful features that can help you parallelize your program. For more information, see the Resources section below.

 

Resources
