An OpenMP Runtime API for Profiling

By Marty Itzkowitz, Oleg Mazurov, Nawal Copty, Yuan Lin, Sun Microsystems, February, 2007  

This paper describes a runtime API for use in profiling OpenMP applications. The API is designed to provide data that enables performance tools to present measurements in the user-model of OpenMP programming. The authors give the motivation and design of the API, an example of its use, and a description as to how the data provided by the API can be used to represent performance of OpenMP programs.

The API was primarily designed to support statistical sampling of callstacks, but it also contains extensions to support tracing. It addresses two difficult problems in OpenMP performance measurement: the presentation of performance data in the user model, and the understanding of the behavior of the OpenMP runtime. The API is implemented entirely within the OpenMP runtime, and requires no compiler support. It is described generally here, and its detailed technical description is provided as an accompanying include file at the end of the article.

This paper has been adopted by the OpenMP Architecture Review Board as an official ARB White Paper.

1. Introduction

OpenMP has become a very successful user-model for developing parallel applications, widely adopted and supported. It does not, however, provide a standardized interface that can be used for profiling. This paper describes such an interface, designed and implemented at Sun Microsystems, and the paper is being proposed for adoption by the OpenMP Architecture Review Board as an official ARB White Paper.

OpenMP programs present a challenge for performance measurement tools because the OpenMP specification is in terms of a user-model of the behavior of a conforming application and implementation, but the specification does not specify any implementation details. Performance measurement tools typically record data in the implementation-model, and have to map that behavior back to the user-model.

There are two main technologies used for performance measurement on OpenMP programs: tracing and statistical profiling. Each has its strengths and weaknesses.

Tracing technologies have the advantage that they can capture all the relevant events describing the runtime behavior of the application, but the overhead of tracing can be substantial. Some operations to be traced, a loop iteration, for example, can take less time than recording the traced event. Furthermore, compiling the program to make some of the desired events traceable can impede optimization: for example, if an event is to be generated for a loop iteration, unrolling, and optimizing the resultant instruction stream by interleaving instructions from different iterations cannot be performed. Traces scale poorly with increasing runtime, with no obvious mechanism of throttling the data volume, and without capturing all function call and return points, the dynamic program graph can not be generated.

Many implementations of tracing methodologies have been developed, along with a rich set of data-reduction and visualization tools. See, for example, Paraver, KOJAK, CATCH, and TAU, among others. One tracing methodology, POMP, was proposed as a standard for OpenMP performance measurement, but it was not adopted.

On the other hand, statistical profiles have the advantage that they can scale very nicely to long-running programs, simply by throttling the profiling rate. By capturing complete callstacks with each event, the dynamic callgraph behavior becomes obvious. As will be discussed below, capturing callstacks for user-model OpenMP programs is a challenge. Statistical profiling has one major disadvantage: it is statistical in nature, and may miss some rare events; furthermore, if the application behavior is correlated with the profiling clock, the measurements may be misleading.

The API proposed here was primarily intended to support statistical profiling, but, with the input from various people on the OpenMP Tools Email alias, it has been extended to optionally support tracing of many events. It mandates functionality in the OpenMP runtime, but requires no compiler support.

The next section of this paper, section 2, describes a typical implementation of the OpenMP compiler and runtime, and describes the difficulties of mapping the implementation behavior to the user-model. Section 3 describes the techniques used for performance measurement of OpenMP programs; section 4 describes the proposed API defining a very narrow interface between the performance measurement collection and the OpenMP runtime, and section 5 describes how the data is captured using that API. Section 6 describes using the data to construct a user-model profile for OpenMP applications. The final section, 7, presents our conclusions.

2. How OpenMP Applications Run

The user-level execution behavior of OpenMP applications is described in the OpenMP specifications (See, for example, OpenMP Application Program Interface, Version 2.5, section 1.3.) The specification does not constrain many implementation details, but the behavior of the implementation may be very important for the user to understand the performance and scalability of the application.

While performance measurement in general, and statistical profiling in particular, collects data in the implementation-model, that is, as the instructions are executed in the machine, the performance tools would like to present that information to the user in the context of the user-model. This section will describe the differences between the models.

The discussion is based on the implementation details specific to the Sun TM Studio Compilers and Tools, but the general architecture is similar to many other implementations, for example, the Intel, IBM, and SGI implementations. In the discussion below, generic terms are used for the system-startup routine ( <start>), the thread-startup routine ( <thread_start>), and the OpenMP runtime library ( <omp-rt>).

As any single-threaded program runs, its callstack shows its current location, and a trace of how it got there, starting from the beginning instructions in a system-specific startup routine, which calls main, which then proceeds and calls various routines within the program. When a routine contains a loop, the program executes the code inside the loop repeatedly, until the loop exit criterion is reached. It then proceeds on to the next sequence of code, etc.

When the program is parallelized with OpenMP (or by autoparallelization), the behavior is different. The user-model of that behavior has the main, or master, thread executing just as a single-threaded program. When it reaches a parallel loop or parallel region, additional slave threads appear, each appearing to be a clone of the master thread, with all of them executing the contents of the loop or parallel region, in parallel, each for different chunks of work. When all chunks of work are completed, all the threads are synchronized, the slave threads disappear, and the master thread proceeds.

The implementation behavior is not so straightforward. The compiler typically extracts the code to be executed inside a parallel region or loop (or any other OpenMP construct), and makes the extracted code into an independent function, called (in Sun's implementation) an mfunction. (It may also be referred to as an outlined function, or a loop-body-function.) The body of the parallel region is replaced by a call to the OpenMP runtime, passing in the address of the mfunction. The runtime then arranges for the mfunction to be called in the master and the slave threads. This difference is exposed by the tools, which will show the mfunction called from the introduced call site in the routine from which it is abstracted. In the discussion below, names of mfunctions extracted from an OpenMP construct from function foo are of the form foo-OMPa, foo-OMPb, etc..

When a parallel region is entered (a fork event), the user-model implies that new threads magically appear, and on exit from the region (a join event), they disappear. While a runtime could very well explicitly use the UNIX fork() system call to do that work, overhead would be prohibitive. Most implementations rely on the OpenMP runtime to create and to manage the additional threads, and when those threads run, their callstacks, as seen from a profiler do not match the user-model.

Consider a simple OpenMP program, with a main program, calling function foo, containing a single OpenMP parallel construct, executing with four threads. Before the first entry into the parallel region, the program has only one thread, the master thread, and that thread has the same callstack in both the user-model, and the implementation-model:



After the entry into the parallel region, there are four threads, and in the user-model, their callstacks would look like:


Master Slave 1 Slave 2 Slave 3
foo-OMPa foo-OMPa foo-OMPa foo-OMPa
foo foo foo foo
main main main main
<start> <start> <start> <start>

All the leaf functions are shown as the function foo-OMPa, but the actual leaf program counter (PC) would typically be different in each thread. In the implementation-model, their callstacks would look like:


Master Slave 1 Slave 2 Slave 3
foo foo-OMPa foo-OMPa foo-OMPa
main <omp-rt> <omp-rt> <omp-rt>
<start> <thread_start> <thread_start> <thread_start>

In this representation, the function <omp-rt> represents one or more frames of execution within the OpenMP runtime library. Note that in the slave threads, there is no connection between main and, for that matter, foo, and the actual executing function, foo-OMPa.

The user-model callstacks, and the actual implementation callstacks around fork and join events, and scheduling and synchronization events may be even more confusing. Frames representing user code may be entirely absent, or may show additional calls into the OpenMP runtime.

The OpenMP runtime is responsible for many operations performed by the process, acquiring locks, executing forks and joins, scheduling iterations, synchronizing at barriers, performing reductions, and so forth. The specification does not constrain any specific implementation for these functions, neither as an algorithm, nor as a program structure. However, their performance can be important to understanding the behavior and scalability of the OpenMP application, so some visibility into their behavior is needed.

The next section of this paper describes the methodology of performance tools for OpenMP programs, the challenges to building such tools, and the data necessary to meet those challenges.

3. Understanding OpenMP Performance

There are two challenges to building a performance tool to help users understand OpenMP performance. The first is to reconstruct the user-model callstacks from the measured implementation-model callstacks. The second is to understand the behavior within the OpenMP runtime when it is performing operations, such as synchronizations or reductions. This section will propose solutions to those two problems, and describe the data necessary to implement the solutions.

When an OpenMP program executes, it will generate many entries into parallel regions; each entry into a parallel region is a separate instance of that region. Knowing which instance of which parallel region a thread is executing, and knowing which thread generated the fork event marking the entry into the region, is sufficient to reconstruct user call stacks.

Specifically, the OpenMP runtime must support that information by assigning a unique ID to each parallel region instance, and delivering Dan event, as requested, to the performance tool telling it the new ID and which thread created the instance. The tool can record the callstack of the thread generating the fork event at the time of the fork. In addition, the OpenMP runtime must support a query that provides the current parallel region ID of the calling thread, and the performance tool records that information in every profile packet. For fork events, that ID reports nested parallelism. User-model callstacks can be synthesized from that data, as described in section 6, below.

To solve the second challenge, the OpenMP runtime must maintain a state for each thread and must support a query that returns the current state of the calling thread.

The functionality described above is sufficient for a statistical-sampling based profiler, but trace-based profilers need more events. The mechanism used above for the fork event can simply be used to provide other events. In collaboration with the members of the OpenMP Tools Mailing List, events were defined for join, and for entry and exit to many of the OpenMP constructs. Those events may come at significant cost in overhead and data volume, and their implementation is optional. Trace-based support will not be further described here.

4. The Collector/OpenMP Runtime API

The API presented here is intended to allow a performance measurement tool to garner enough information to meet those challenges. The interface is intended for the OpenMP runtime to know little about the collector, and for the collector to know little about the OpenMP runtime, giving each the freedom to evolve independently. The description here is somewhat schematic; the precise implementation specification for the API is given in the accompanying include file.

The first problem to resolve is that of rendezvous, that is, how the OpenMP runtime and the collector can find each other. Since an OpenMP program usually runs without a collector, and a collector can run on non-OpenMP programs, each must work in the absence of the other. Rendezvous is achieved actively by the collector, and passively by the OpenMP runtime. The OpenMP runtime exports a symbol, __omp_collector_api, and the collector asks the dynamic linker if that symbol is present. If so, it determines its address, and makes an OMP_REQ_START request via that API. The OpenMP runtime does not need to do anything until and unless that request is made. It may choose to maintain states and region IDs, but it need not. All requests from the collector to the OpenMP runtime are made as subsequent requests to that routine.

The API is prototyped as:

int __omp_collector_api( void *arg );

One of the requests to that routine is an OMP_REQ_REGISTER request, that specifies an event of interest, and a callback for when that event occurs. If no register request is made, the OpenMP runtime will never generate events to the collector.

Other requests are available for OMP_REQ_UNREGISTER, OMP_REQ_STATE, OMP_REQ_CURRENT_PRID, OMP_REQ_PARENT_PRID, OMP_REQ_STOP, OMP_REQ_PAUSE and OMP_REQ_RESUME. Some requests are often made from signal handlers, and those requests must be ensured to be safe from a signal context. The include file details which ones are safe to be called from an asynchronous signal handler. Since not all requests are required, one of the possible error codes returned is that the request is not implemented.

In the next section we will show how these requests, the data provided through them, and the generated events can be used to collect data to support constructing a complete user-model of an OpenMP application's behavior.


5. Using the API to Collect Data

The above section described the API between the collector and the OpenMP runtime. This section describes how Sun's collector is implemented to use that API to record data. This represents only one use of the API; many others are possible. We will not describe here how the profiling triggers and events are handled, nor how the signal-handlers are implented, nor how the data is externalized; such issues belong to the tool developer, and not the OpenMP API it is using.

The collector in Sun's performance tools, is a shared object that is LD_PRELOAD'ed into the target's address space. It has a init section, and that section asks the runtime linker if the OpenMP API symbol is present. If it is, the API is called with an OMP_REQ_START request, and an OMP_REQ_REGISTER request for fork events. (in this usage, only the fork events are traced.)

The remainder of initialization of the collector sets up profiling triggers, either clock- or hardware-counter-based, and then allows the application to proceed, just as it would for any program, OpenMP or not.

When a profile trigger fires, a signal is synchronously delivered to each appropriate thread. (Clock-profile signals are delivered at approximately the same time to each active thread, while HWC-profile signals can be completely uncorrelated across threads.) From the signal handler, if the application is an OpenMP application, the collector running on each thread obtains that thread's current state and parallel region ID by issuing OMP_REQ_STATE and OMP_REQ_CURRENT_PRID requests.

When the OpenMP runtime delivers a fork event, the collector issues OMP_REQ_CURRENT_PRID and OMP_REQ_PARENT_PRID requests of the runtime, and records the current callstack of the thread (master thread), and the newly-created parallel region ID. The current parallel region ID of the master is also recorded to support measurements on nested parallelism.

This API could equally well support an alternate tracing-based collector, although this would require additional implementation in the OpenMP runtime. If supported, it could request events for all possible events in the life of the process and record those events with their thread ID and timestamps. They could also record parallel region IDs and any other data appropriate to those events and the particular objectives of the tracing tool.

6. Using The Data To Construct A User-Model Callstack Profile

This section describes the data recorded and how it is processed to produce performance measurements presented in the user-model of OpenMP programs.

Using the API as described in the previous section, two streams of data are produced, with specific information about the OpenMP behavior of the application. One stream, generated in response to the delivered fork events, consists of a time-stamped trace of fork events, each with the current implementation-model callstack of the thread generating the event, the parallel-region ID of that thread before the fork, and the ID of the newly created region. The second stream is a trace of timestamped profile packets. Each packet has normal performance data, augmented with the current parallel-region ID within which the thread is executing and the OpenMP runtime state for that thread. The tools derives OpenMP metrics from these streams, and synthesizes the user-model callstack from the implementation model data in the streams.

In the performance tools each profile packet contains data specific to the profile being collected, which is used to construct metrics of performance. Those metrics are attributed to user-code by mapping each PC from the callstacks for the packet to functions, source-lines, callers and callees. For clock-profiling, the metrics include Total-time, User-CPU-Time, System-CPU-Time, and a number of others. When OpenMP performance data is recorded, additional metrics are computed, and the attribution of data is done to a synthesized user-model callstack, not to the implementation-model callstack as recorded.



  2. OpenMP Metrics
  3. When recorded with clock-based profiles, the OpenMP state information is used to construct two additional metrics of performance, "OpenMP Work" and "OpenMP Wait". Together they add up to the total elapsed time (integrated over threads).

    "OpenMP Work" represents time executing whether serial or parallel, as well as OpenMP library time spent doing reductions. It is computed by summing across the work states reported by the OpenMP runtime: THR_NO_STATE, THR_WORK_STATE, THR_SERL_STATE, THR_RDUC_STATE, THR_MSTR_STATE. THR_SNGL_STATE, and THR_ORDD_STATE.

    "OpenMP Wait" represents the total time waiting, and includes lock and barrier waits, ordered-waits, and OpenMP library overhead, such as scheduling, thread-management, and so forth. It accumulates either when the thread is using CPU time to wait (busy wait), or when the thread is sleeping (sleep-wait). It is computed by summing across the non-work states reported by the OpenMP runtime: THR_WAIT_STATE, THR_IBAR_STATE, THR_EBAR_STATE, THR_LKWT_STATE, THR_CTWT_STATE, THR_ODWT_STATE, THR_OVHD_STATE, and THR_ATWT_STATE.


  4. User-Model Callstack Synthesis
  5. When profile data is processed, the callstack as would be shown in the user-model is synthesized for each event, and the performance data is processed as if the synthesized callstack was the actual one captured.

    Before further processing the performance data, the record of fork events is processed, and a table giving the parallel-region ID and master-callstack-preface is constructed. For simplicity, we'll first consider the case where there is no nested parallelism.

    When a fork event is generated, its recorded callstack would be:



    where <omp-rt> refers to one or more frames from within the OpenMP runtime. Those frames are recognized because they come from the OpenMP runtime shared object whose name is known. They are deleted, and the resultant master-callstack-preface would be:



    Profile packet implementation-model callstacks are transformed into user-model callstacks as follows. First, if the OpenMP state reported in a profile event shows the thread as waiting for work, a single-frame callstack showing only <OMP-idle> is used to replace the entire callstack for the profile event. Otherwise, any frames at the top (leaf) of the stack coming from the OpenMP runtime can be removed, as they should not appear in the user-model. Some of those frames represent thread management activities from the library, while others represent user API calls. But, in either case, they can be removed, and replaced with pseudo-functions, based on the OpenMP state, representing a user-model of what the OpenMP runtime is doing on behalf of the process. If the state at the time the packet was recorded was OpenMP work, no pseudo-function is appended. Otherwise, the pseudo-function is determined from the OpenMP state of the thread, and is one of: <OMP-overhead>, <OMP-implicit_barrier>, <OMP-explicit_barrier>, <OMP-lock_wait>, <OMP-critical_section_wait>, <OMP-ordered_section_wait>, or <OMP-atomic_section_wait>.

    If the packet is not within a parallel region, no further processing is needed; if it is, the next step is to determine where that parallel region is calling back into user code. Profile packets from the master or the slave look a bit different:

    Master Slave
    foo foo-OMPa
    main <omp-rt>
    <start> <thread_start>

    In either case, the call from the parallel region is determined as the first frame before the OpenMP runtime in the stack; all frames below that are removed:

    Master Slave
    foo-OMPa foo-OMPa

    and then from the parallel region ID, the master-callstack-preface is determined and used as the base of the stack, to yield the same resultant callstack for master and slave:

    Master Slave
    foo-OMPa foo-OMPa
    foo foo
    main main
    <start> <start>

    The above description explicitly made the assumption that the application only had a single level of parallelism. Nested parallelism is accommodated straightforwardly. It is recognized when a fork event shows that it came from a thread with a non-zero parent-parallel-region-ID. In that case, the master-callstack-preface is synthesized from its recorded callstack, and the master-callstack-preface of its parent, much as user-model profile callstacks are synthesized above.

    The Sun Studio Performance Tools implemented with a prototype of this API in the OpenMP runtime were used in the paper by Gabriele Jost, Oleg Mazurov and Dieter an Mey, Adding New Dimensions to Performance Analysis through User-Defined Objects, IWOMP 2006, Reims, France, June 12-15, 2006.


7. Conclusions

In summary, we have described an efficient, lightweight API to be exported from the OpenMP runtime providing the support to allow performance tools to present measurements in the user-model of OpenMP programming, independent of the implementation details of the runtime support.

The omp_collector_api.h Header File

 * Copyright (c) 2006, 2007, OpenMP Architecture Review Board.
 * Permission is hereby granted, free of charge, to any person
 * obtaining a copy of this software and associated documentation
 * files (the "Software"), to deal in the Software without
 * restriction, including without limitation the rights to use, copy,
 * modify, merge, publish, distribute, sublicense, and/or sell copies
 * of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 * The above copyright notice and this permission notice shall be
 * included in all copies or substantial portions of the Software.


#ifdef __cplusplus
extern "C" {

 *      This file defines the communication between the OpenMP runtime
 *      library and a "collector", assumed to be a separately compiled
 *      shared object, in the address space of the target and the
 *      OpenMP runtime library
 *    Rendezvous strategy:
 *      The collector asks the runtime linker for the address
 *      corresponding the the string defined as OMP_COLLECTOR_API
 *      (see #define below).  Once it has that address, it calls
 *      that function as its only API.
 *    Initialization:
 *      The first request made by the collector must be a OMP_REQ_START 
 *      request.  It may be issued before or after the OpenMP runtime 
 *      .init section has been called.  If it is issued before, the runtime 
 *      library must be able to initialize itself and handle the request.
 *    Threads
 *      The initial thread is created when the process is instantiated, and
 *      it is referred to as the "initial thread."
 *      Subsequent threads created by the OpenMP runtime are referred to
 *      as OpenMP threads.  Subsequent threads created by non-OpenMP APIs
 *      are referred to as "non-OpenMP threads."  The term "master thread"
 *      refers to the thread that encounters a parallel region and creates
 *      a team of threads to work on that parallel region.  The term "slave
 *      thread" refers to any thread within a team other than the
 *      master thread of that team.
 *    Event notification
 *      One of the requests (OMP_REQ_REGISTER) asks for notification of 
 *      specific events (one of the enum OMP_COLLECTORAPI_EVENT values).  
 *      That callback will be made with the event enum as an argument 
 *      when each event occurs.  The first notification may occur before 
 *      the request to __omp_collector_api completes, and the collector 
 *      must be able to cope with that.
 *    Statistical Profiling
 *      For statistical profiling, the collector establishes whatever
 *      trigger it wants, presumed to be a signal generated to the
 *      target.  The collector's signal handler can make calls to determine
 *      the OpenMP runtime state of the calling thread, the parallel
 *      region in which it is executing, and the lock on which it is
 *      waiting.  The state and lock id are returned from a single
 *      request; the calls are async-signal-safe.
 *    Callstack reconciliation
 *      Typical OpenMP implementations do not execute code such
 *      that the callstack unwound in a thread matches the user model
 *      of execution.  User-model callstacks can be synthesized by
 *      tracing OMP_EVENT_FORK events, and recording the region ID,
 *      and the master-thread callstack at the time of that event.
 *      The callstack as unwound by a thread in that region can be
 *      appended to the master callstack on entry to that region
 *      to synthesize a user-model callstack.  One issue to be resolved
 *      is the elimination of OpenMP runtime frames from the synthesized
 *      callstack, since the runtime does not really appear in the user model.
 *      The current API provides no help in doing that synthesis.
 *    Termination
 *      The collector can turn off all interactions with the runtime
 *      by sending a OMP_REQ_STOP request.  All callbacks will be 
 *      cancelled, and any subsequent requests will get a 
 *      OMP_ERRCODE_SEQUENCE_ERR response.  A new session may be 
 *      started by reissuing a OMP_REQ_START request, and a conforming
 *      implementation must support that request.

/* Interface function in the OpenMP Runtime Library  -- string provides name */

#define OMP_COLLECTOR_API       "__omp_collector_api"

 * int __omp_collector_api( void *arg );
 * The interface for the OpenMP runtime library to enable OpenMP
 *      processing support for a collector, to register and unregister
 *      for event notification, and to inquire for various parameters.
 * The API may be called from any thread, whether the initial thread,
 *      an OpenMP thread, or a non-OpenMP thread.  Some requests will
 *      return an error status or value if called by non-OpenMP threads.
 * Returns 0 if the entire request has been successfully processed
 *      (even if some queries returned error codes),
 * otherwise -1 is returned.
 * Consumer: A generic "collector" recording performance data
 * Producer: the OpenMP Runtime library
 * Consumer allocates memory for 'arg' and specifies in it a request
 * for some producer's properties. Producer uses the same memory to record
 * the result of each entry in the request. The request layout is:
 * arg 
 *  |
 *  v
 *  +-----+----+----+-----+-----+...+----+----+----+-----+-----+...+----+
 *  | sz  | r# | ec | rsz | mem |   | sz | r# | ec | rsz | mem |   | 0  |
 *  +-----+----+----+-----+-----+...+----+----+----+-----+-----+...+----+
 * int sz  [4 bytes] : Entry length in bytes (including the 16-byte header).
 * int r#  [4 bytes] : Request number -- one of enum OMP_COLLECTOR_REQUEST.
 * int ec  [4 bytes] : Error code -- one of enum OMP_COLLECTOR_EC.
 * int rsz [4 bytes] : return size in bytes; if OMP_ERRCODE_MEM_TOO_SMALL
 *      is returned, rsz is the size needed for "mem" to hold the response.
 *      If no error is returned, rsz is the size of the returned data in "mem".
 * char mem[]             : Input parameters and/or return value place holders.
 * sz == 0 implies the end of the list of requests
 * The consumer sets sz. If sz is not a multiple of 4 or less than 16, or
 *      not properly aligned the remainder of the request is ignored, and
 *      -1 is returned.
 *      The request is one of the values of OMP_COLLECTORAPI_REQUEST.
 * The producer also returns a return code in the error code field 'ec'.
 *      The error code is one of the values of OMP_COLLECTORAPI_EC.
 * For known requests, the producer knows the types and sizes for the return.
 *      The size field may be greater than the actual request requires.
 *      Producer converts the pointer to the mem field to the appropriate
 *      type to record a return value. If the size of mem is smaller
 *      than required, the error code OMP_ERRCODE_TOO_SMALL is returned.
 * Producer checks for alignment for robustness before recording 
 *      a property value and ignores entries if not properly aligned.
 *      In such case, producer will return OMP_ERRCODE_ERROR.
 * It's the consumer's responsibility to provide correct property size and
 *      alignment. It's also the consumer's responsibility to know which
 *      requests are async-signal-safe and could be issued from a
 *      signal handler.
 * The request, error code, event, and thread-state enums have explicit
 *      numerical values that will be persistent across releases.

 *  Request Codes

typedef enum {
        OMP_REQ_START = 0,
        OMP_REQ_REGISTER = 1,
        OMP_REQ_STATE = 3,
        OMP_REQ_PARENT_PRID = 5,
        OMP_REQ_STOP = 6,
        OMP_REQ_PAUSE = 7,
        OMP_REQ_RESUME = 8,

 * Requests may be batched in a single call; they will be processed 
 * sequentially, and will behave identically to the same requests made 
 * in independent calls.
 * Specific Requests:
 *              Must be the first request; if any other request comes
 *              prior to this one, it will get an OMP_ERRCODE_SEQUENCE_ERR 
 *              return.  The request should be issued as early as possible, 
 *              it tells the runtime to allocate and necessary space for 
 *              keeping track of thread state, region IDs, callbacks, etc.
 *              If the OMP_REQ_START arrives too late for such initialization,
 *              it will get a OMP_ERRCODE_SEQUENCE_ERR return.
 *              An OMP_REQ_START may not be issued when the API is already
 *              started -- it will get a OMP_ERRCODE_SEQUENCE_ERR return.
 *              An OMP_REQ_START may be issued after an OMP_REQ_STOP, and
 *              it is implementation-dependent as to whether it will work
 *              or return OMP_ERRCODE_SEQUENCE_ERR.
 *              OMP_REQ_START may be issued by any thread.
 *         Input parameters: none
 *         Return Value: none
 *         Async-signal-safe: no
 *              Passes in an event type (enum OMP_COLLECTORAPI_EVENT), and 
 *              a (void *) callback.  The callback routine will be passed in 
 *              the event type, so that a single callback routine could
 *              be defined for all events, or separate callbacks can be used.
 *              The callback function prototype is:
 *                  void        callback(enum OMP_COLLECTORAPI_EVENT);
 *              The callback will be called synchronously by the thread
 *              generating the event, not by the thread registering the callback.
 *              OMP_REQ_REGISTER may be issued by any thread.
 *              If an event is not known, OMP_ERRCODE_UNKNOWN will be returned.
 *              If an event is known but not supported for callbacks,
 *              OMP_ERRCODE_UNSUPPORTED will be returned. If a 
 *              previous registration was done, the new callback replaces it.
 *              It is the responsibility of the collector, not the OpenMP
 *              runtime, to support multiple callbacks.
 *         Input parameters: {OMP_COLLECTORAPI_EVENT, void *callback}
 *         Return Value: none
 *         Async-signal-safe: no
 *              Passes in an event type.  If there was no previous registry 
 *              for that event, the request is ignored.
 *              OMP_REQ_UNREGISTER may be called by any thread
 *         Input parameters: OMP_COLLECTORAPI_EVENT
 *         Return Value: none
 *         Async-signal-safe: no
 *              Returns a value defined in the enum OMP_COLLECTOR_API_THR_STATE, below
 *              and, if applicable, a (void *) wait_id.  The wait_id is a lock
 *              ID for state THR_LKWT_STATE, THR_CTWT_STATE, and THR_ATWT_STATE;
 *              a barrier ID for states THR_IBAR_STATE and THR_EBAR_STATE; 
 *              an ordered-region ID for state THR_ODWT_STATE.
 *              It is undefined for other states.
 *              It may be called from any thread, and will return
 *              THR_SERIAL_STATE for the initial thread or any non-OpenMP thread that is
 *              not in a parallel region.
 *         Input parameters: none
 *         Return Value: {OMP_COLLECTOR_API_THR_STATE, wait_id}
 *         Async-signal-safe: yes
 *              Returns the ID of the current parallel region within which
 *              the calling thread is executing.
 *              If the calling thread is not within a parallel region either because
 *              the thread is in state THR_SERIAL_STATE or THR_IDLE_STATE,
 *              it will return 0.
 *              Calling thread may be any thread.
 *         Input parameters: none
 *         Return Value: uint64_t region_id
 *         Async-signal-safe: yes
 *              Returns the ID of the parallel region in which the current patallel region
 *              is nested.  Returns 0 for an outer (non-nested) parallel region.
 *              Only meaningful when the thread is the master thread of the team.
 *              If the calling thread is an OpenMP thread, but not the master thread
 *              of a team within a parallel region, or a non-OpenMP thread,
 *              it will return OMP_ERRCODE_THREAD_ERR
 *              This call is normally made only from a OMP_EVENT_FORK event callback.
 *         Input parameters: none
 *         Return Value: uint64_t region_id
 *         Async-signal-safe: no
 *              Temporarily disables the generation of any events
 *              Ignored if events are currently already disabled
 *              PAUSE/RESUME are orthogonal to REGISTER/UNREGISTER, so that
 *              a REGISTER issued while event-generaion is disabled will
 *              still be processed, and events generated after a RESUME is issued.
 *              Events occuring while PAUSE is in effect are not sent, and not remembered.
 *         Input parameters: none
 *         Return Value: none
 *         Async-signal-safe: no
 *              Re-enables the generation of any events
 *              Ignored if events are currently not disabled.
 *         Input parameters: none
 *         Return Value: none
 *         Async-signal-safe: no

 *  Return Codes in the 'ec' Field

typedef enum {
        OMP_ERRCODE_OK = 0,             /* successful */ 
        OMP_ERRCODE_ERROR = 1,          /* size is not right or mem is not aligned */
        OMP_ERRCODE_UNKNOWN = 2,        /* request unknown */
        OMP_ERRCODE_UNSUPPORTED = 3,    /* request is known, but not supported */
        OMP_ERRCODE_SEQUENCE_ERR = 4,   /* request was made at the wrong time */
        OMP_ERRCODE_OBSOLETE = 5,       /* request is obsolete, and no longer supported */
        OMP_ERRCODE_THREAD_ERR = 6,     /* request was made from an inappropriate thread */
        OMP_ERRCODE_MEM_TOO_SMALL = 7,  /* return data will not fit in the allocated space */

 *  Events Subject to Notification

typedef enum {
        OMP_EVENT_FORK = 1,
        OMP_EVENT_JOIN = 2,
        OMP_EVENT_THR_END_LKWT = 10,
        OMP_EVENT_THR_END_CTWT = 12,
        OMP_EVENT_THR_END_ODWT = 14,
        OMP_EVENT_THR_END_ATWT = 22,
             * implementation-specific events may be defined with any enum value
             * equal to or greater than OMP_EVENT_THR_RESERVED_IMPL

 *  All events are generated synchronously to the thread in which they occur
 *      The callback routine is passed a single parameter, the event type
 *  Mandatory event -- must be supported in any conforming implementation
 *              The master thread is about to fork a team of worker threads
 *  Optional events -- may or may not be supported
 *      The cost of generating these events may be significant, and distort
 *      the performance of the application.
 *              All threads in a parallel region have completed their
 *              work, and the master has returned to its serial processing.
 *              For nested parallelism, the "serial" processing may be as
 *              a slave thread to an outer parallel region.
 *              A thread enters the THR_IDLE_STATE; a slave thread finishes
 *              work in a parallel region, and rejoins the pool of available
 *              threads.
 *              A thread leaves the THR_IDLE_STATE
 *              A thread enters the THR_IBAR_STATE
 *              A thread leaves the THR_IBAR_STATE
 *              A thread enters the THR_EBAR_STATE
 *              A thread leaves the THR_EBAR_STATE
 *              A thread enters the THR_LKWT_STATE
 *              A thread leaves the THR_LKWT_STATE
 *              A thread enters the THR_CTWT_STATE
 *              A thread leaves the THR_CTWT_STATE
 *              A thread enters the THR_ODWT_STATE
 *              A thread leaves the THR_ODWT_STATE
 *              A thread enters a MASTER region
 *              A thread leaves a MASTER region
 *              A thread enters a SINGLE region
 *              A thread leaves a SINGLE region
 *              A thread enters an ORDERED region
 *              A thread leaves an ORDERED region
 *              A thread begins waiting to enter an atomic region
 *              A thread ends waiting to enter an atomic region

 *  Thread States

typedef enum {
    THR_OVHD_STATE = 1,         /* Overhead */
    THR_WORK_STATE = 2,         /* Useful work, excluding reduction, master, single, critical */
    THR_IBAR_STATE = 3,         /* In an implicit barrier */
    THR_EBAR_STATE = 4,         /* In an explicit barrier */
    THR_IDLE_STATE = 5,         /* Slave waiting for work */
    THR_SERIAL_STATE = 6,       /* thread not in any OMP parallel region (initial thread only) */
    THR_REDUC_STATE = 7,        /* Reduction */
    THR_LKWT_STATE = 8,         /* Waiting for lock */
    THR_CTWT_STATE = 9,         /* Waiting to enter critical region */
    THR_ODWT_STATE = 10,        /* Waiting to execute an ordered region */
    THR_ATWT_STATE = 11,        /* Waiting to enter an atomic region */


 * The "main" thread is the initial thread starting the user code.
 * "user thread" refers to either the initial thread, or a non-OpenMP thread
 *      created explicitly by the user using either the POSIX thread or
 *      Solaris thread thread-creation interface.
 * "OpenMP thread" refers to a thread created by the OpenMP runtime library.
 *      A thread (user or OpenMP) is inside an OMP parallel region and is:
 *        . preparing for the parallel region; or
 *        . preparing for a new worksharing region; or
 *        . computing loop iterations; or
 *        . doing some other overhead work inside a parallel region
 *      A thread is inside an OMP parallel region and doing
 *        "useful" work.  Useful work does *not* include
 *         working on combining reduction results.
 *      A user thread is outside any OMP parallel region.
 *      A non-user thread is waiting to work on an OMP parallel region.
 *      A thread is waiting at an implicit barrier at the end 
 *        of an OMP parallel region, a worksharing region, or a single
 *        region.
 *      A thread is waiting in an explicit barrier. 
 *      This state is entered when the thread encounters a barrier
 *      region.
 *      A thread is working on combining reduction results.
 *      A thread is waiting for an OMP lock.
 *      This state is entered when the thread encounters a call to 
 *      omp_set_lock() or omp_set_nest_lock(), and the lock it is trying 
 *      to set is already locked.
 *      A thread is waiting to entering an OMP critical region.
 *      This state is entered when the thread encounters a critical
 *      region, and the lock associated with that critical region is
 *      already locked.
 *      A thread is waiting to execute an ordered region.
 *      This state is entered when the thread encounters an ordered
 *      region and it cannot execute the ordered region, because some
 *      previous iteration that contains an ordered region has not 
 *      completed yet.
 *      A thread is waiting to enter an OMP atomic region
 *      This state is entered when the thread encounters an
 *      atomic region, and the lock associated with that region (if any)
 *      is held by another thread.
 * There is no state corresponding to being in a critical section,
 *      holding a lock, being in a MASTER section, or a SINGLE
 *      section, or an ORDERED section.  These constructs can be
 *      nested, so a single state is inappropriate.

#ifdef __cplusplus

#endif  /* _OMP_COLLECTOR_API_H */

Rate and Review
Tell us what you think of the content of this page.
Excellent   Good   Fair   Poor  
Your email address (no reply is possible without an address):
Sun Privacy Policy

Note: We are not able to respond to all submitted comments.