By Susan Morgan, Revised April 2011
This article explains common parallel and multithreading concepts, and differentiates between the hardware and software aspects of parallel processing. It briefly explains the hardware architectures that make parallel processing possible. The article describes several popular parallel programming models. It also makes connections between parallel processing concepts and related Oracle hardware and software offerings.
The terms parallel computing, parallel processing, and parallel programming are sometimes used in ambiguous ways, or are not clearly defined and differentiated. Parallel computing is a term that encompasses all the technologies used in running multiple tasks simultaneously on multiple processors. Parallel processing, or parallelism, is accomplished by dividing one single runtime task into multiple, independent, smaller tasks. The tasks can execute simultaneously when more than one processor is available. If only one processor is available, the tasks execute sequentially. On a modern high-speed single processor, the tasks might appear to run at the same time, but in reality they cannot be executed simultaneously on a single processor.
Parallel programming, or multithreaded programming, is the software methodology used to implement parallel processing. The program must include instructions to inform the runtime system which parts of the application can be executed simultaneously. The program is then said to be parallelized. Parallel programming is a performance optimization technique that attempts to reduce the “wall clock” runtime of an application by enabling the program to handle many activities simultaneously.
Parallel programming can be implemented using several different software interfaces, or parallel programming models.
This article explains the common parallel and multithreading concepts, and the differences between the hardware and software aspects of parallel processing. It briefly describes the hardware architectures that make parallel processing possible, and presents several popular parallel programming models. Pointers to other locations where you can read more about specific topics are included.
Parallel processing is a general term for the process of dividing tasks into multiple subtasks that can execute at the same time. These subtasks are known as threads, which are runtime entities that are able to independently execute a stream of instructions. Parallel processing can occur at the hardware level and at the software level. Distinguishing between these types of parallel processing is important. At the software level, an application might be rewritten to take advantage of parallelism in the code. With the right hardware support, such as a multiprocessing system, the threads can then execute simultaneously at runtime. If not enough processors or cores are available for all the threads to run simultaneously, certain tasks might still execute one after the other. The common way to describe such non-parallel execution is to say these tasks execute sequentially or serially.
Execution of a parallel application is dependent on hardware design. However, even when the system is capable of parallel execution, the software must still divide, schedule, and manage the tasks.
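As a rough illustration of the relationship between threads and available processors, the following C sketch asks the operating system how many processors are online and estimates how many back-to-back rounds a set of equally sized threads would need. The function names online_processors and rounds_needed are hypothetical and were chosen for this example. The _SC_NPROCESSORS_ONLN query is widely supported (Oracle Solaris, Linux, and others) but has historically been an extension rather than a guaranteed part of every system, and the round count is a simplification: a real scheduler time-slices threads instead of running them in neat rounds.

```c
#include <assert.h>
#include <unistd.h>

/* Number of processors currently online, or 1 if the query fails.
 * _SC_NPROCESSORS_ONLN is assumed to be available on the target system. */
long online_processors(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n >= 1 ? n : 1;
}

/* Simplified model: if every thread does the same amount of work and each
 * processor runs one thread at a time, this many rounds are needed.
 * One round means fully simultaneous execution; more than one round means
 * that some threads wait and run after the others. */
long rounds_needed(long threads, long processors) {
    assert(threads >= 0 && processors >= 1);
    return (threads + processors - 1) / processors;  /* ceiling division */
}
```

Under this model, calling rounds_needed(8, online_processors()) on a four-processor system gives 2: the eight threads cannot all run simultaneously, so part of the work is effectively serialized.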
Multiprocessors – More than one processor can be active simultaneously. The processors use shared memory to communicate and share data. The allocation of tasks between the processors is handled by the operating system, so the system is able to execute multiple jobs simultaneously. The simultaneous execution improves the overall throughput, and for a given workload, reduces the turnaround time for the applications when compared to a system with a single processor. In certain cases, this reduction might not be sufficient because executing a single application still takes too long. At that point, parallel programming might be considered as a way to address the problem. The application developer needs to select a suitable parallel programming model, such as POSIX Threads or OpenMP, to implement the parallelism. Most Oracle Sun hardware is available in multiprocessor configurations, with a few entry-level servers having one processor. See the Sun Servers page for more information about servers.
Multicore processors – More than one core, or processing unit, in a single chip can be active simultaneously. Multicore processing is sometimes called chip-level multiprocessing (CMP) because multiple processors are on a single chip. The cores use shared memory, a shared system bus, and, in some cases, shared caches, to communicate and share data with each other. The cores generally have their own processing units and registers. The architecture of each core varies with different processor implementations. The operating system views each core as a processor, and handles the allocation of tasks between the cores. A multicore processor is like a multiprocessor system implemented on a single chip. Although differences exist, especially with respect to the sharing of resources, from an application point of view multiple processors and multicore processors are effectively the same. Therefore, with single-threaded applications running on a multicore processor, the throughput of a multijob workload is increased by executing more than one application simultaneously. For example, on a dual-core processor, two programs can run at the same time. For a parallel application, the independent tasks can be scheduled onto the various cores of the processor. In both cases, however, the performance might not be as good as on a true multiprocessor design. The performance largely depends on the multicore implementation, and how many shared resources are needed by the applications that are running simultaneously. Sun servers are available with multicore AMD Opteron processors, multicore Intel Xeon processors, and with multicore SPARC processors.
The Oracle Solaris Studio Topics: Multicore Programming page provides links to many sources of information about programming for multicore processors.
Multithreaded processors – These processors contain a number of multithreaded cores, which switch between a number of active threads. Some processor cores implement vertical multithreading (VMT), which enables the core to execute multiple threads in an interleaved fashion. If one software thread stalls waiting for a resource (data to come from memory, input/output, and so on), another thread immediately takes over execution. When the second thread stalls, the first thread or another waiting thread takes over. VMT enables processor cycles to be used more efficiently. Early VMT designs suffered from too much resource sharing, and in many cases, overall performance could be improved by disabling resource sharing. Current processors that use vertical threads include the SPARC64 VI and SPARC64 VII/VII+ processors developed by Fujitsu. The Oracle Sun SPARC Enterprise M-series servers use the SPARC64 VI and SPARC64 VII/VII+ processors.
A further refinement of hardware multithreading technology is called simultaneous multithreading (SMT). A truly multithreaded processor, specifically designed for SMT, does not have a resource sharing problem. Sun introduced the term chip multithreading (CMT) for a processor design with multiple cores in which each core is multithreaded. The Oracle SPARC T-Series servers use the UltraSPARC T2 and SPARC T3 processors, which implement the CMT design. See the SPARC T-Series product page for more information.
The Oracle Solaris 10 operating system is optimized for running on chip-multithreaded SPARC processors. For more information about chip-multithreading, see the following Oracle publications:
Cluster computing – A cluster is a group of computers, generally called nodes, working together as a single system. Often the nodes are the same type of computer, running the same operating system, and belonging to the same administrative domain. Special cluster software running on the nodes and a high-speed network connecting the nodes enable rapid communication between them. Clusters can be configured to provide high availability (HA), for situations where the hardware and software must always be up and running. Hardware and software failures in a node do not cause the cluster to fail because built-in redundancy in HA configurations enables other nodes to pick up the tasks of a failed node while the cluster continues to run. Examples of environments requiring high availability are online reservation or ordering systems.
A cluster of systems can also be used as a large parallel computer, useful for high performance computing (HPC). Clusters configured for HPC might be used to run parallelized scientific applications, for example. Usually the HPC and HA uses of a cluster are not combined. When used for HPC environments, all of a cluster's available resources are used for the tasks at hand. If a failure occurs, the hardware or software is fixed and restarted.
To fully realize the multiprocessing benefits of running on a cluster, applications should be parallelized using one of the software parallel programming models.
Grid computing – This term refers to a heterogeneous mix of networked computers working together, similar to a cluster but potentially working across administrative domains or organizations. The nodes on a grid can range from a small group of systems located in the same room to a large set of networked computers installed around the world. Even a cluster can be a node in a grid. Each node in the grid runs special software that enables it to make optimal use of the available resources like CPU cycles and storage that are contributed by the nodes on the grid. Often, the grid software can be configured so that any possible spare CPU cycle is used to run applications. This technique enables optimal use of the system. Originally, grids were used to run scientific applications. More recently, grid use has extended to other environments, including environments where clusters have traditionally been used. As a result, the difference between a cluster and a grid is not always very clear. The system software is often the main differentiator.
The Oracle Solaris kernel and most Oracle Solaris services have been multithreaded and optimized for many years in order to take advantage of multiprocessor architectures. Oracle continues to invest in parallelizing and optimizing Oracle Solaris software to fully support emerging parallel architectures. For a single application to benefit from a multiprocessor architecture including clusters and grids, the program should be parallelized using one of the parallel programming models. In all cases, the application's use of parallelism must improve performance enough to surpass the processing overhead that comes with the programming model. The creation and management of threads are examples of processing overhead.
The programming model used in any application depends on the underlying hardware architecture of the system on which the application is expected to run. Specifically, the developer must distinguish between a shared memory system and a distributed memory system. In a shared memory architecture, the application can transparently access any memory location. A multicore processor is an example of a shared memory system. In a distributed memory environment, the application can only transparently access the memory of the node it is running on. Access to the memory on another node has to be explicitly arranged within the application. Clusters and grids are examples of distributed memory systems.
For more information about parallel computing software models, see the Oracle white paper Developing Parallel Programs — A Discussion of Popular Models.
Shared memory, or multithreaded, programming is sometimes also called threaded programming. In this context, threads are lightweight units of execution that exist within a single operating system process. Threads share the memory address space and state information of the process that contains them. The containing process is sometimes also called the parent process. The shared memory model is supported on computers that have multiple processors or cores, where each core or processor has access to the same shared memory. Such a system has a single address space. Communication and data exchange between the threads takes place through shared memory.
Parallel programming can be implemented for shared memory systems using any of the following models.
Automatic parallelization – When the program is compiled, the compiler tries to identify the parallelism in the application. The focus is on loops, either a single loop or a set of nested loops, as this area is typically where most of the execution time is spent. Through a dependence analysis, the compiler determines whether parallelizing a loop is safe. If it is safe, the compiler generates the right parallel infrastructure for parallel execution at runtime. The developer merely has to use the appropriate option on the compiler to activate this feature. With the Oracle Solaris Studio compilers, this option is the -xautopar option. The -xloopinfo option, which displays parallelization messages, is also highly recommended.
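To make the idea of dependence analysis more concrete, the following C sketch contrasts two loops. The function names scale and running_total are hypothetical and exist only for this illustration. In the first loop every iteration touches its own array elements, so a compiler would typically be able to prove that running the iterations in parallel is safe. In the second loop each iteration reads a value written by the previous one, which is a loop-carried dependence that prevents straightforward parallelization. Whether a particular compiler actually parallelizes a given loop depends on the compiler and its options, which is why messages such as those produced by -xloopinfo are useful.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: a[i] depends only on b[i].
 * The restrict qualifiers promise the compiler that a and b do not overlap,
 * which helps the dependence analysis conclude that the loop is safe. */
void scale(double *restrict a, const double *restrict b, size_t n, double factor) {
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = factor * b[i];
}

/* Loop-carried dependence: iteration i reads a[i - 1], which iteration
 * i - 1 has just written. The iterations must run in order. */
void running_total(double *a, size_t n) {
    assert(a != NULL);
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```

With the Oracle Solaris Studio compilers, a file containing these functions would be compiled with -xautopar and -xloopinfo as described above. Other options, such as an optimization level, might also be required, so consult the compiler documentation.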
POSIX threads and Solaris threads – Oracle Solaris supports two shared-memory threading models. The standard POSIX threads API, usually abbreviated as Pthreads, is available for applications written in C. The older Solaris threads API, which predates the Pthreads standard, is also supported. The POSIX threads API is the standard supported on many UNIX-based operating systems. Use of this standard increases portability. Both libraries are included in the standard C library libc in the Oracle Solaris operating system. See the pthreads(5) man page for a comparison of both APIs.
For condensed information about Pthreads programming, see the POSIX Threads Programming tutorial at www.llnl.gov. For a more comprehensive understanding of programming with POSIX threads, you might read the books Programming with POSIX Threads by David R. Butenhof and Programming with Threads by Steve Kleiman, Devang Shah, and Bart Smaalders.
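The following C sketch shows what explicit thread management with Pthreads can look like for a simple task: summing an array. It is a minimal example under stated assumptions, not a recommended design. The names slice_t, sum_slice, parallel_sum, and MAX_THREADS are hypothetical. Each thread sums its own slice of the shared array and stores the result in its own slot, so no locking is needed. The calling thread then joins the threads and combines the partial sums.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16  /* arbitrary limit chosen for this sketch */

typedef struct {
    const int *data;    /* shared input array, read-only */
    size_t begin, end;  /* half-open range [begin, end) owned by this thread */
    long partial;       /* result slot written only by this thread */
} slice_t;

/* Thread start routine: sums one slice of the shared array. */
static void *sum_slice(void *arg) {
    slice_t *s = arg;
    long acc = 0;
    for (size_t i = s->begin; i < s->end; i++)
        acc += s->data[i];
    s->partial = acc;
    return NULL;
}

long parallel_sum(const int *data, size_t n, int nthreads) {
    assert(data != NULL && nthreads >= 1 && nthreads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    slice_t slice[MAX_THREADS];
    int created[MAX_THREADS];

    for (int t = 0; t < nthreads; t++) {
        slice[t].data = data;
        slice[t].begin = n * (size_t)t / (size_t)nthreads;
        slice[t].end = n * (size_t)(t + 1) / (size_t)nthreads;
        slice[t].partial = 0;
        created[t] = (pthread_create(&tid[t], NULL, sum_slice, &slice[t]) == 0);
        if (!created[t])
            sum_slice(&slice[t]);  /* creation failed: do this slice here */
    }

    long total = 0;
    for (int t = 0; t < nthreads; t++) {
        if (created[t])
            pthread_join(tid[t], NULL);  /* wait until the slot is written */
        total += slice[t].partial;
    }
    return total;
}
```

Because all threads live in the same process, they read the input array directly through shared memory. The developer is responsible for dividing the work, creating and joining the threads, and avoiding conflicting writes. Some toolchains require a threading option at compile and link time, for example -pthread with GCC or Clang.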
OpenMP – This API specification is for implementing parallel programming on a shared memory system. OpenMP offers a higher level model than POSIX threads and also provides additional functionality. In many cases, an OpenMP implementation is built on top of a native threading model like POSIX threads. OpenMP consists of a set of compiler directives, runtime functions, and environment variables. Fortran, C and C++ are supported.
The compiler directive plays a key role in OpenMP. By inserting directives in the source, the developer specifies what parts of the program can be executed in parallel. The compiler transforms these specified parts of the program into the appropriate infrastructure, such as a function call to an underlying multitasking library. OpenMP has four main advantages over other programming models:
Portability – Although OpenMP is not an official standard, a program using OpenMP is portable to another OpenMP compiler or environment.
Ease of use – The developer does not have to create and manage threads at the level of POSIX threads, for example. Thread management is handled by the compiler and underlying multitasking library.
The application can be parallelized step by step – The developer specifies the sections that can be executed in parallel, and can thus incrementally parallelize the application as necessary.
The sequential version of the program is preserved – If the program is not compiled with the compiler option for OpenMP, the directives in the code are ignored. This behavior effectively disables parallel execution for that source and the program runs sequentially again.
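The points above can be sketched with a small C example. The function names built_with_openmp and sum_of_squares are hypothetical. A single directive marks the loop as parallel, and the reduction clause tells the compiler to give each thread a private copy of sum and to combine the copies at the end. The _OPENMP macro is defined by the OpenMP specification when OpenMP support is enabled, which makes it possible to check how the file was compiled.

```c
#include <assert.h>

/* Returns 1 when the file was compiled with OpenMP enabled, 0 otherwise. */
int built_with_openmp(void) {
#ifdef _OPENMP
    return 1;
#else
    return 0;
#endif
}

/* Sum of i*i for i = 1..n. With OpenMP enabled, the iterations are divided
 * among the threads. Without it, the directive is ignored and the loop
 * runs sequentially, producing the same result. */
long sum_of_squares(int n) {
    assert(n >= 0);
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += (long)i * i;
    return sum;
}
```

The same source file therefore serves as both the sequential and the parallel version of the program. No thread creation or joining code appears in the source, in contrast to a Pthreads implementation of a similar loop.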
The OpenMP specification is available at http://www.openmp.org/. Many articles about OpenMP are available on the Oracle Solaris Studio Topics: Multicore Programming page.
The Oracle Solaris Studio 12.2 documentation library includes the OpenMP API User’s Guide, which describes issues specific to the Oracle Solaris Studio implementation of the OpenMP API.
Developers can implement the parallelism in an application by using a very low-level communication interface, such as sockets, between networked computers. However, using such a method is the equivalent of using assembly language programming for applications: very powerful, but also very minimal. As a result, an application parallelized using such an API might be hard to maintain and expand.
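To show how minimal a socket-level interface is, the following C sketch splits a sum between two processes on one machine and passes the partial result over a socket pair. It is an illustration only, and the names sum_range and split_sum are hypothetical. A real distributed application would use network sockets between separate computers and would need a defined message format, because the raw bytes of a long are meaningful here only when both ends run on the same system.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static long sum_range(const int *d, size_t lo, size_t hi) {
    long acc = 0;
    for (size_t i = lo; i < hi; i++)
        acc += d[i];
    return acc;
}

/* Sums data[0..n) using a child process for the second half.
 * Returns 0 and stores the sum in *total on success, -1 on failure. */
int split_sum(const int *data, size_t n, long *total) {
    assert(data != NULL && total != NULL);
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    size_t mid = n / 2;

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: has its own copy of the address space, so the result
         * must be sent back explicitly as a message. */
        close(sv[0]);
        long part = sum_range(data, mid, n);
        ssize_t w = write(sv[1], &part, sizeof part);
        close(sv[1]);
        _exit(w == (ssize_t)sizeof part ? 0 : 1);
    }

    /* Parent: compute the first half, then receive the child's result. */
    close(sv[1]);
    long mine = sum_range(data, 0, mid);
    long theirs = 0;
    ssize_t r = read(sv[0], &theirs, sizeof theirs);
    close(sv[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (r != (ssize_t)sizeof theirs)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    *total = mine + theirs;
    return 0;
}
```

Even for this trivial exchange, the program has to handle connection setup, message size, error checking, and cleanup by hand, which is the kind of burden the paragraph above describes.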
The Message Passing Interface (MPI) model is commonly used to parallelize applications for a cluster of computers, or a grid. Like OpenMP, this interface is an additional software layer on top of basic OS functionality. MPI is built on top of a software networking interface, such as sockets, with a protocol such as TCP/IP. MPI provides a rich set of communication routines, and is widely available.
An MPI program is a sequential C, C++, or Fortran program that runs on a subset of the processors or cores in the cluster, or on all of them. The programmer implements the distribution of the tasks and the communication between them, and decides how the work is allocated to the various processes. To this end, the program needs to be augmented with calls to MPI library functions, for example, to send information to and receive information from other processes.
MPI is a very explicit programming model. Although some convenience functionality is provided, such as a global broadcast operation, the developer has to specifically design the parallel application for this programming model. Many low-level details also need to be handled explicitly.
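One of the details the developer typically handles explicitly is deciding which part of the work each process performs. The following C sketch shows one common approach, a block decomposition by process number. The function name block_range is hypothetical, and the code deliberately contains no MPI calls so that it compiles without an MPI installation. In an MPI program, the rank and size values would normally be obtained from MPI_Comm_rank and MPI_Comm_size, and the partial results would be combined with a routine such as MPI_Reduce.

```c
#include <assert.h>

/* Assigns the half-open index range [*lo, *hi) of n work items to the
 * process with the given rank, out of size processes in total.
 * When n is not a multiple of size, the first n % size ranks each
 * receive one extra item, so the ranges cover all items without overlap. */
void block_range(int rank, int size, long n, long *lo, long *hi) {
    assert(size >= 1 && rank >= 0 && rank < size && n >= 0);
    assert(lo != 0 && hi != 0);
    long base = n / size;   /* items every rank receives */
    long extra = n % size;  /* leftover items for the lowest ranks */
    long shift = rank < extra ? rank : extra;
    *lo = (long)rank * base + shift;
    *hi = *lo + base + (rank < extra ? 1 : 0);
}
```

For 10 items on 4 processes, ranks 0 through 3 receive the ranges [0, 3), [3, 6), [6, 8), and [8, 10). Every process runs the same program and uses its own rank to select its share of the work.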
The advantage of MPI is that an application can run on any type of cluster that has the software to support the MPI programming model. Although originally MPI programs mainly ran on clusters of single-processor workstations or PCs, running an MPI application on one or more shared memory computers is now common. An optimized MPI implementation can then also take advantage of the faster communication over shared memory for those processes executing in the same system.
The following resources provide more information about MPI:
The MPI specification is available from Argonne National Laboratory at http://www.mcs.anl.gov/mpi/.
Open MPI is an open-source effort by a consortium of research, academic, and industry partners to build an MPI library that combines technologies and resources from several MPI projects. Open MPI is the basis for the Oracle Message Passing Toolkit, formerly known as Sun HPC ClusterTools. You can download this software for free from the Oracle Message Passing Toolkit product page.
For a detailed overview of MPI, see the Message Passing Interface (MPI) Tutorial at www.llnl.gov.
Additional online tutorial material about MPI is available at http://www.mcs.anl.gov/mpi/tutorial/.
With the emergence of multicore systems, an increasing number of clusters and grids are parallel systems with two layers. Within a single node, fast communication through shared memory can be exploited, and a networking protocol can be used to communicate across the nodes. Programs can take advantage of both shared memory and distributed memory.
The MPI model can be used to run parallel applications on clusters of multicore systems. MPI applications run across the nodes as well as within each node, so both parallelization layers, shared and distributed, could be used through MPI. In certain situations, however, adding the finer-grained parallelization offered by a shared memory programming model such as Pthreads or OpenMP is more efficient. Typically, parallel execution over the nodes is achieved through MPI. Within one node, Pthreads or OpenMP is used. When two programming models are used in one application, the application is said to be parallelized with a hybrid or mixed-mode programming model.
Another hybrid programming model that is sometimes used combines Pthreads and OpenMP. An application of this type runs only in one shared-memory system. Each POSIX thread is further parallelized using OpenMP, taking advantage of any additional parallelism available within the work assigned to that thread.
Oracle offers software products to support the technologies discussed in this article.
Oracle software for shared memory systems includes:
Threads – POSIX threads and Solaris threads libraries are both included in the Oracle Solaris libc library. Documentation is in the Multithreaded Programming Guide in the Oracle Solaris 10 Software Developer Collection.
OpenMP – An implementation of OpenMP for C, C++ and Fortran is included in Oracle Solaris Studio, which is free to download. The -xopenmp compile and link-time option instructs the Oracle Solaris Studio compilers to recognize OpenMP directives and runtime functions in a program. The OpenMP runtime support library, libmtsk, provides support for thread management, synchronization, and scheduling of work. The library is implemented on top of the POSIX threads library.
An implementation of MPI is included in the Oracle Message Passing Toolkit. This product also includes driver compile scripts and tools to query and manage the jobs at runtime. See the Oracle Message Passing Toolkit Documentation for complete information.
Oracle products for implementing and managing clusters include:
Oracle Solaris Cluster, which provides high availability for Oracle Solaris environments.
Oracle Enterprise Manager for managing distributed systems.
The Oracle grid computing product is Oracle Grid Engine, which enables your organization to implement a grid with distributed resource management.
Additional articles on parallel programming with the Solaris Studio tools can be found on the Parallel Programming Topics page.