By Marty Itzkowitz, Project Lead, Sun Studio Performance Tools, Sun Microsystems Inc., November 10, 2005
What kind of questions do users have about performance, and what is the user model for the Sun Studio performance tools?
For most applications, the basic question users have about performance is "What can I change to improve performance?" That question translates into knowing what resources are being used, where in the program they are being used, and how the program got to that point.
For most programs, including the kernel, the most important information is how much CPU time is being spent in which parts of the program. For many executables, memory and floating-point performance is an issue. In such cases, users want to know the costs of the various components in the memory hierarchy, and the efficiency of floating-point computations.
For Java programs, users are interested in knowing which methods are expensive, how their performance is affected by HotSpot compilation, and how their memory allocation and garbage collection affect behavior.
For multithreaded executables, users need information about how the threads are being used, and what costs are involved in synchronizing them. For parallel applications, users want to know which loops are parallel, and which are serial, and what the obstacles to parallelization of the serial loops are. Users are also interested in the efficiency of load-balancing in parallel regions.
The Sun Studio performance tools can provide the answers. The design goal of the tools was to minimize the number of mouse clicks needed to get to the "A-ha" point.
The user model for the tools is to build the application exactly as one would for production, including any optimization and parallelization. Then the desired performance data is collected during one or more runs of the program. After the runs, the data can be examined, either with the Performance Analyzer GUI or with the command-line er_print utility.
The following three sections describe each of these steps.
In this section we discuss the issues concerning compiling the target executable, and the various programming models used to develop applications.
The performance tools are shipped with the Sun Studio compilers, and we normally expect that the target programs for data collection will be compiled with those compilers. The tools also work on codes compiled with the GNU compilers, and, to some extent, with other third-party compilers. How well they work depends on the accuracy and completeness of the symbolic information generated by the compiler, but function-level data is almost always correct.
The Sun Studio compilers provide significantly more information than other compilers to help users understand their code. With optimization, the compilers insert commentary describing the transformations performed on the code, any obstacles to parallelization, operation counts for loop iterations, and so forth. The C compiler also supports generation of branch-target tables and load/store structure tagging to support dataspace profiling, described below. The provided information is shown in annotated source and disassembly listings from the tools. The parallel runtime library, libmtsk.so, has support to allow the tools to show loop imbalance.
Users may use any level of optimization and parallelization in compiling their codes. In order to get source-level data, and for the support for parallel Fortran and OpenMP to work fully, the target should be compiled with the -g option ( -g0 for optimized C++). In the current Sun Studio compilers, those options have minimal impact on the generated code at -O2 and -O3, and no effect at all at either higher or lower optimization.
Most of the data recorded includes call stacks, and unwinding the call stack is critical for data collection. On SPARC® based systems, unwinding is mostly straightforward. ABI conformance is necessary but not sufficient; additional assumptions about code behavior are used to do the unwind, as described below.
On x86 based systems and x64 based systems, where frame pointers might not be available, unwinding is far more difficult. Even though compiling with options to preserve the frame pointer deoptimizes the program somewhat, we recommend that users do so.
With the Sun Studio compilers, frame pointers are preserved if the code is compiled with the -xregs=no%frameptr option. With the GNU compilers, frame pointers are preserved if the code is compiled with the -fno-omit-frame-pointer option.
All data is recorded against the actual program instructions (PCs). Depending on which languages and programming models the target code uses, it is important to relate the data back to the user's functions and their source code.
The tools understand both POSIX and Solaris™ threads, and capture data on a per-thread basis. Synchronization tracing is provided to inform the user about the costs of synchronization among threads.
OpenMP (and autoparallel) codes run with a master thread and a number of slave threads. The single-threaded part of execution is executed by the master thread, and the parallel part is executed by both the master and the slave threads. The latest release of Sun Studio software supports nested parallelization, in which one slave thread can be the master for additional threads.
The code generated by the Sun Studio compilers for OpenMP constructs has the body of a loop or worksharing construct abstracted into a separate function, which may be called from either the master or the slave threads. A naming convention allows the tools to associate those functions with the original function from which the construct was extracted. Additional support in the OpenMP runtime library, libmtsk.so, allows the tools to attribute time spent in the barrier code at the end of a worksharing construct to that construct, even though the actual runtime behavior does not have that function on the stack.
MPI programs operate by running multiple processes, on the same or different nodes, and having those processes communicate with messages passed among them.
The tools support MPI profiling by collecting data on each of the processes, and aggregating that data into an experiment group. They also can support tracing of API calls into the MPI library.
An increasingly important programming model is a hybrid of MPI and OpenMP. In this model, the processes that communicate via MPI are also multithreaded using OpenMP. The tools support such targets, since data collection on MPI processes is exactly as it is for any other process, including OpenMP support.
Java programs present quite a different challenge for performance measurement. While pure Java programs (containing only interpreted methods) can be profiled using the facilities provided within the Java™ Virtual Machine (JVM) software, many Java programs are not pure Java. They may be linked with C or C++ libraries (native methods), and may also contain dynamically-generated functions (compiled methods). Furthermore, a single Java method might be interpreted part of the time and compiled part of the time, and might even have several different compiled versions.
At any instant during the execution of a Java program, there are two different call stacks that make sense. The first call stack is the Java stack, where the interpreted methods and compiled methods are shown, but no native methods appear. The second call stack is the machine stack. In that call stack, the interpreted methods do not appear; rather the JVM functions used to interpret them are shown. Native and compiled methods appear as such. Neither stack is complete; the tools capture both stacks and reconcile them to produce a Java stack that includes any calls into native code.
The tools support three representations of the program: a User representation, showing only data for the user Java threads, with all other data aggregated into a pseudo-function; an Expert-User representation, showing Java stacks where available and machine (that is, JVM) stacks where they are not; and a Machine Representation, showing only the machine stacks.
In the User representation, native C/C++ code is shown intermixed with Java methods, and all versions of a Java method are aggregated. Source is the user's Java or native source code, and disassembly is byte-code for Java, and native code for the other languages. Only user-Java threads are shown. In the Expert-User representation, Java call stacks are shown for Java threads, and native call stacks are shown for all other threads. All threads are shown.
In the Machine Representation, the different compiled versions of a Java method are separated, and the disassembly shown for dynamically-compiled methods is native code. Interpreted versions of a method are not shown; the JVM interpreter is shown instead. All threads are shown.
In some sense, the kernel itself is just a program. The Solaris™ 10 Operating System (Solaris OS) provides a very powerful and flexible instrumentation mechanism, DTrace, which can be used to understand Solaris kernel behavior. The Sun Studio tools use the DTrace mechanism to record kernel experiments in a format analogous to user experiments. From the data, all the power of the Performance Analyzer can be used to explore the behavior of the kernel.
Source-level data is not available for kernel experiments because kernel modules do not contain line-number information.
There are several ways to collect data from a target code. This section describes the types of data that can be recorded, the various data collection commands, the collector API, and the collector runtime implementation.
All types of data are recorded as a series of events, each with event-specific data, a call stack, thread ids, LWP ids, CPU ids, and a high-resolution timestamp. The event-specific data is used to compute metrics of performance, which can be attributed to functions, source lines, and disassembly instructions. Resources used within a given function (exclusive metrics) are attributed to the leaf PC of the call stack. Resources consumed by a function and all the other functions it calls (inclusive metrics) are attributed to all PCs in the call stack. Care is taken to properly account for direct and indirect recursion. The call stack information can also be used to show how the inclusive metrics in a function are accumulated from its callees, and how they are propagated up to its callers.
Two of the main types of data collected, clock-based profiles and hardware counter overflow profiles, are statistical in nature. The longer the run, or the finer the profiling interval, the more accurate the data will be. On the other hand, the finer the profiling interval, the more profiling may distort the behavior, and the higher the volume of data recorded.
The fundamental assumption in statistical profiling is that the profiling events arrive sufficiently well-separated so that any perturbations in the machine state (cache effects, for example) from one event are dampened out before the next event. If the assumption is true, the recorded data does reflect a valid statistic of the program's unmeasured behavior. At very high profile rates, this assumption may break down, and the data may be suspect.
Clock-based profiling can be used to identify those parts of the program that use the most User CPU time. It has been available since the earliest days of UNIX®, and is the basis for the traditional prof and gprof performance tools.
On the Solaris OS, clock-based profiling is based on the kernel accounting microstates, and produces data to support a User CPU Time metric, as well as the following additional metrics: Wall Clock Time, Total LWP Time, System CPU Time, Wait CPU Time, Text Page Fault Time, Data Page Fault Time, and Other Wait Time. These additional metrics allow the user to identify those pieces of the code that consume resources other than User CPU Time.
Clock-based profiles can be collected at normal resolution (~10 ms.), high resolution (~1 ms.), low resolution (~100 ms.) or at a custom resolution, specified in milliseconds or microseconds. The maximum resolution depends on the system. The profiling clock is asynchronous with respect to the system clock, to avoid the possibility of correlated behavior.
For multithreaded experiments, all of the metrics are summed across all LWPs in the process. Total LWP Time adds up to the wall clock time multiplied by the average number of LWPs in the process.
On Linux systems, the profiling infrastructure is active only when the target is running on the CPU, and it produces data for a User CPU Time metric only. In addition, the profiling clock may be correlated with the system clock, which can lead to misleading profiles.
Hardware counter overflow profiling is implemented on the UltraSPARC-III® and later SPARC chips, and on various x86 and AMD Opteron chips. The set of counters available is different on each chip, and the tools use the exported interfaces from libcpc.so on the Solaris OS to generate the data. Invoking the collect command with no arguments lists all the hardware counters available on that system. Profiling can be done on however many simultaneous counters the hardware supports. On current SPARC systems, a maximum of two counters is available.
The tools have two levels of hardware counter support, one for standard counters, and one for custom counters. Standard counters are those that we think are the most important ones to use. They have predefined values for normal-, high-, and low-resolution profiling, or users may select a custom overflow value. The standard counters for UltraSPARC-III include Cycles, Instructions, D$ read misses, E$ references, D$ and E$ stall cycles, StoreQ stall cycles, FP Additions, FP Multiplies, and many others.
For those counters that count in cycles, the default metrics are inclusive and exclusive times, obtained by dividing the cycles by the machine clock frequency. For those counters that do not count in cycles, the default metrics are counts of the inclusive and exclusive named events.
A custom setting allows the user to specify any available counter, as listed by cputrack or by the collect command with no arguments. Metrics for custom hardware counters are inclusive and exclusive events, with numerical values equal to the overflow interval multiplied by the number of overflows.
A unique feature of the Sun Studio performance tools is that they support an extension to hardware counter overflow profiling that allows the generation of metrics against the dataspace of the target, not just the instruction space. It enables users to determine which data structures are responsible for the cache misses, not just which instructions.
The hardware counter overflow profiling interrupt is delivered with an interrupt PC, and the call stack is unwound from there. On current SPARC systems, unfortunately, the interrupt is not synchronous; it is delivered a variable number of instructions after the overflow is triggered, so the profile PCs are not, in general, the instructions that actually caused the counter to overflow. Furthermore, the operating system and chip do not provide the virtual address of a miss. In order to provide data that has the actual instruction address, and the virtual and physical addresses of a memory reference, we have implemented a backtracking scheme. It is invoked by prepending a + to the name of a memory-related counter used for hardware counter overflow profiling.
When backtracking is specified, the collector runtime walks back in address order from the PC delivered in the context of the profiling signal, and attempts to find both the PC of the instruction causing the miss (trigger PC), and the virtual address being referenced. If it gets a virtual address, it asks the OS for the corresponding physical address. If the registers used for the address in the triggering instruction are changed between it and the interrupt PC's instruction, no virtual address can be recorded. Furthermore, the backtracking might have traversed a branch target, in which case the data might not be valid. We therefore verify the backtracking at analysis time. With that information, and symbolic information from the compiler, we can show cache performance metrics against data objects.
Thread synchronization delay tracing is used to help users understand the synchronization costs of their multithreaded applications. It can be used with either POSIX threads or Solaris threads, and collects a trace of all synchronization events that exceed an adjustable threshold. The data is collected by interposing on the synchronization functions in the library.
By default, the threshold is set to "Calibrate", in which case the code repeatedly acquires an uncontested mutex lock, and sets the threshold to five times the average time to acquire the lock. Alternatively, it can be set to "All" (which traces all calls, independent of the delays), or to a custom value.
Two metrics are available for thread synchronization delay data: Sync Wait Count and Sync Wait Time.
Heap tracing is used to identify memory leaks or inefficient allocations in a user code. It collects a trace of all allocation and deallocation events, and matches them to determine leaks. The data is collected by interposing on the various entry points in the library.
Heap tracing generates four metrics, Leaks, Bytes-leaked, Allocations, and Bytes-allocated. In addition, the data is processed to aggregate leaks and allocations by common callstack to make it easier to plug the leaks, and understand if the allocations can be reduced.
MPI tracing records calls into the MPI library. Six metrics are available: MPI Time, MPI Sends, MPI Bytes Sent, MPI Receives, MPI Bytes Received, and Other MPI Calls.
Samples are always recorded as the target process starts and as it exits. Samples may also be triggered by a specified signal, or taken periodically, or generated by an API call. For dbx experiments (see below), a sample is recorded whenever the process stops at a breakpoint, and, if so specified, periodically during the run.
Execution statistics consist of real high-resolution time stamps and virtual high-resolution time stamps, microstate accounting data, and various kernel counters monitoring the resources consumed by the target.
The primary interface for recording data is the collect command, but other methods are available. They are described below.
The collect command provides the simplest interface for collecting data, and is the one recommended for most applications. It is used just as /bin/time would be used, with the target name and arguments as the last command-line arguments to collect. The command allows the user to specify the name of the experiment and the directory in which to store it. It can also take an optional experiment-group name, used to allow the Analyzer to treat multiple experiments as a single entity. An experiment group can be used to aggregate the data from all the processes in a single MPI job.
By default, the collect command runs a clock-profiling experiment. It can take optional arguments to specify other types of data to be collected, for example hardware counter overflow profiling, or heap or MPI tracing. It also accepts arguments for adding notes to an experiment, and for specifying signal-based controls for sampling, and for enabling or disabling data collection during the run.
In more complex applications, the target may spawn additional processes. The collect command has an option for specifying whether or not data should be collected for such descendant processes. If specified, the data on these descendants are recorded as sub-experiments within the original (or founder) experiment.
The collect command sets up the environment for data collection, including setting the LD_PRELOAD environment variable to ensure that the runtime library, libcollector.so, is in the address space of the target, and setting two environment variables, one to specify the data to be collected, and the second naming the experiment. It then uses exec() to execute the target on top of itself. The initialization section in the runtime makes all the necessary calls to turn on the data requested, and then the target proceeds with its execution. As the target exits, the experiment is terminated.
The dbx command-line debugger provides a second means for collecting data. It supports the same data collection as the collect command, using a command-line interface. Its primary use for performance measurement is to attach to a running process (a long-running server, for example) and collect data from it.
When data collection is enabled after an attach, dbx first makes an interactive function call to ensure that the collector runtime is in the target address space. It then calls the initialization routine in the runtime to set up the experiment and to start recording data.
If the runtime was loaded with LD_PRELOAD, all functionality for data collection is available; if not, clock-based profiling and hardware counter overflow profiling work, but the collection of trace-based data, which relies on function interposition to work, does not.
The Analyzer GUI has a Collect window that allows the user to specify the parameters for collecting data, and launch a collect run. It may be used either from the standalone Analyzer, or from the Analyzer tab in the Sun Studio IDE.
The Debugging tab in the Sun Studio IDE provides another way to collect data. It offers a Collector window for determining the data to be collected, very much like the Analyzer Collect window, and runs the target under dbx.
Kernel profiling uses the er_kernel command, which is layered on top of the DTrace subsystem of the Solaris 10 OS. The command accepts arguments in the same form as the collect command, but has additional support for data collection from specific process/thread pairs, or for specific kernel threads.
When invoked, the er_kernel command sets up the necessary DTrace probes and makes them active. As the DTrace events come in, they are buffered and written to the experiment. The er_kernel command allows the user to specify an arbitrary load to be run during the experiment.
The only data currently supported by the er_kernel command is clock-based profiling. Unlike user-level clock-based profiling, which has ticks for each of the kernel accounting microstates the process might be in, kernel profiles have only one state, and the recorded data produces only one metric, KCPU Cycles. As soon as the hardware counter overflow profile provider becomes available in DTrace, we will extend the er_kernel command to use it.
An API is available allowing the user to make calls to generate sample points in the experiment, to enable and disable data-recording, either globally or on a per-thread basis, and to describe dynamically-generated code. The API is available for C, C++, Fortran, and the Java programming language. Calls may be left compiled into the code; if no experiment is being recorded, they have no effect other than the trivial cost of an if-test.
The runtime code for the collector is in a single shared object, libcollector.so. Initialization is done by code in the .init section of the library. That code reads two environment variables, one specifying the name of the experiment to be recorded, and the second specifying the data collection options and other options as requested by the user.
The collector installs a signal handler for SIGPROF, which is used for both clock-based profiling and general housekeeping, and installs an interval timer. If hardware counter profile data is requested, the collector installs a handler for SIGEMT and initializes the profiling with calls to libcpc.so.
The collector interposes on various library functions. It interposes on the flavors of fork() and exec() so that it can properly handle descendant processes, for which the user may specify recording data or not. It interposes on the entry points for memory allocation and deallocation to support heap tracing, and on synchronization primitives to support synchronization tracing. It also interposes on the MPI library entry points for MPI tracing.
Programs that dynamically open and close shared objects present a challenge: at different times within a run, a given function may be at a different address, and a particular address may map to different functions. The collector tracks the loading and unloading of all shared objects, which allows the tools to map a PC, with its corresponding timestamp, to the correct function.
One of the key functions in the collector is to capture a call stack in real time. In most cases, on SPARC based machines, stack unwind is straightforward. If a function has not done a save, %o7 contains the return address. If a function has done a save, the return address is on the stack. The collector always records %o7, and then marches up the stack frames. The determination as to whether or not %o7 is really the return address is made while processing the data.
Complications arise when the code uses %o7 for some other purpose, as when compiler-generated PIC code moves %o7 to some other register, and then executes a call .+2 in order to determine an origin to reference global data in a shared object. Other complications can arise when a function return does a restore any place earlier than the delay slot of the ret instruction (as GNU-compiled code does). In that case, %o7 is ignored although it is the actual return address.
Tail-call optimization is another complicating factor. The effect of this optimization is to have the call stack omit some frames, leading to behavior where, for example, the tools show a function, a, calling function c, but the source code shows that a really calls b which calls c.
There are still other end cases that make stack unwind difficult. In 64-bit execution, the trap handler for misaligned data references is called with a non-standard sequence, with the return address in %l6; and in the libmtsk.so barrier code, a special convention is used to inform the collector of the previous loop function, so that it can be inserted into the recorded stack, even though the real return sequence has nothing to do with the previous loop.
On x86 based systems and x64 based systems, stack unwind is much more complicated. If the codes are compiled to use frame pointers, unwind is straightforward; if not, however, various heuristics are used to try to find the return addresses. Unlike on SPARC based systems, the collector cannot walk back in address space, so dataspace profiling cannot be supported.
While the machine, or native, stack is unwound directly, the Java stack is obtained from the JVM, using a special interface function, AsyncGetCallTrace. That function is, as the name implies, async-signal-safe, so it can be called from the signal handlers for SIGPROF or SIGEMT.
Under some circumstances, the JVM cannot safely unwind the stack from a signal handler. In those cases, it returns an empty stack, and metrics are reported against the pseudo function, <no Java callstack recorded>. Typically, an empty stack is returned only a small percentage of the time; this is a reasonable compromise to avoid the cost of synchronizing the JVM for each unwind.
Data files in a performance experiment are written as memory-mapped files, in a self-describing format based on records, each with a type and size. When a file is created or extended, the new space is filled with fairly large blocks, each representing a single padding record. The blocks are placed on a queue, and any thread can request a block from the queue, write a record to it, and then release it back to the queue. Blocks do not necessarily appear in timestamp-order in the file.
Records are written by first putting the data after the type and size fields in the current padding record, adding a new padding record covering the rest of the block, and finally updating the size and then the type of the current record. The file is thus always readable, since the record updating is atomic (only the size change need be atomic). Therefore, the Analyzer can be used to look at experiments even while they are being written.
The implementation of I/O blocking and allocation is mostly lock-free. A lock is needed only when the file needs to be extended. The implementation is robust and scalable.
The two most significant potential pitfalls of profiling are dilation and distortion. Dilation arises when the program, run under data collection, takes longer than it would without data collection. For CPU-bound programs, dilation, even by a significant amount, is not serious. The runs might take longer, but as long as the dilation is uniform, the data accurately reflects the performance that would obtain in the absence of data collection.
Distortion is more serious. Distortion arises when the effects of measurement change the relative costs of different parts of the program. For example, if function counting is implemented, each function pays a price for every entry. For small functions, that price is proportionally higher than for large functions, which may be misleading.
The Sun Studio performance tools show approximately 5-7% dilation for a default profiling experiment, and very little distortion. Some distortion arises because the cost of unwinding a deep stack is greater than that for a shorter stack, and the cost of writing a data record also depends on the stack depth.
The Analyzer can be invoked on one or more experiments or experiment groups. It is used to navigate through the structure of the target, informed by the measured performance data.
From the data, the Analyzer can present a list of functions, ordered by performance metric, and the callers and callees of any function, with their performance metrics attributed to the callers and callees. The Analyzer can also show annotated source displays, with per-line performance data, and annotated disassembly displays, with per-instruction data. In addition, the source and disassembly displays show commentary from the Sun Studio compilers.
In all cases, the Analyzer attempts to show data relating to the user's particular programming model. For C++ codes or Java codes, names are demangled. For parallel codes, the parallel body routines, as extracted by the compiler, are tied back to the original routine from which they were extracted, and special hooks are used to associate time spent in a barrier with the loop invoking the barrier. For outline functions (generated for little-used code, as identified from compiler feedback), the function is shown as if it were called from the original function. For MPI codes, an experiment group is recorded comprising all the jobs in a single run, and experiment names default to match the MPI rank of each process.
The Analyzer can process multiple experiments and aggregate their data. If the experiments were recorded with identical data types, but the target was run with different data sets, the tools show average behavior over the data sets. If the experiments were recorded with disjoint data types, but the same executable and data set, the tools show the data as if it had all been recorded in a single run, circumventing the limitation on the number of simultaneous hardware counters that can be used for profiling.
The first step in processing the data is to reduce each data file to a vector of events, in timestamp order, with the actual call stack of the event replaced by a pointer to a leaf-node in a balanced-tree structure of the entire dynamic call graph of the program. The nodes in that graph correspond to PCs, and each is mapped to a shared object, and an offset within that object.
To support the views for text profiling, that is, for reporting metrics against elements in the hierarchical structure of the instruction space, the preprocessed data is converted into a structure called the PathTree, which is similar to the call stack tree used in the raw data. Each node in the tree corresponds to a PC, and the processing of the PathTree attributes metrics of performance to the nodes in the tree.
Any element in the program structure (function, source-line, instruction, and so forth) can be represented as a set of nodes in the PathTree, and the metrics for those structure elements can be computed by aggregating the node metrics over all nodes that refer to that element. If the node is a leaf-node, it contributes to both inclusive and exclusive metrics for the program element. If it is not a leaf-node, it contributes only to inclusive metrics.
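The inclusive/exclusive distinction can be sketched in C. This is a hypothetical miniature of PathTree-style aggregation, not the Analyzer's actual data structures; the node layout and names are invented for illustration, and recursion (which would double-count inclusive time) is ignored.

```c
#include <stddef.h>

/* Hypothetical sketch: each node is one PC in a call-stack tree, and a
 * profiling event deposits its metric on the leaf node of its recorded
 * call stack, so 'events' is nonzero only where stacks actually ended. */
typedef struct Node {
    int          pc_id;        /* which program element this PC maps to */
    double       events;       /* metric recorded with this node as leaf */
    struct Node *child;        /* first child */
    struct Node *sibling;      /* next sibling */
} Node;

/* Sum of events in the subtree rooted at n: its inclusive contribution. */
static double subtree_events(const Node *n) {
    double total = n->events;
    for (const Node *c = n->child; c != NULL; c = c->sibling)
        total += subtree_events(c);
    return total;
}

/* Aggregate metrics for one program element over all nodes that refer to
 * it: every node contributes its subtree to the inclusive metric, but
 * only events recorded at the node itself count as exclusive. */
static void aggregate(const Node *n, int pc_id,
                      double *incl, double *excl) {
    if (n->pc_id == pc_id) {
        *incl += subtree_events(n);
        *excl += n->events;    /* leaf-node events only */
    }
    for (const Node *c = n->child; c != NULL; c = c->sibling)
        aggregate(c, pc_id, incl, excl);
}
```

An interior node whose stacks always continued deeper has `events == 0`, so it contributes only inclusive metrics, matching the rule above.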
To support the views for dataspace profiling, the hardware counter events are processed, and the trigger PC is used to determine the name of the object and field being referenced from the symbol tables. That name is looked up in a table, and the metrics accumulated to the field specified, and to the structure as a whole.
Memory object profiling is similar to dataspace profiling, and is based on the same raw data. The data collected for dataspace profiling includes a virtual address and physical address for each event. Those addresses can be mapped to specific elements in the memory hierarchy of the machine, including L1, L2, and (where appropriate) L3 cache lines, memory banks, memory controllers, virtual pages or physical pages, and so forth. An index for each object of each type is simply computed from one or the other of the referenced addresses. Mapping metrics to those objects allows the user to see which cache lines are hot, which memory banks are getting the most traffic, and so forth.
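The index computation lends itself to a short sketch. The geometry below is invented for illustration (64-byte lines, 512 cache sets, 4 memory banks, 8-KB pages); the real mappings depend on the machine being profiled.

```c
#include <stdint.h>

/* Hypothetical memory-object index functions: each maps a referenced
 * virtual or physical address to the index of one object of a given
 * type.  Sizes are illustrative, not any particular machine's. */
#define LINE_BYTES  64
#define L1_SETS     512
#define MEM_BANKS   4
#define PAGE_BYTES  8192

static uint64_t l1_line_index(uint64_t vaddr) {
    return (vaddr / LINE_BYTES) % L1_SETS;     /* which L1 line (set) */
}

static uint64_t vpage_index(uint64_t vaddr) {
    return vaddr / PAGE_BYTES;                 /* which virtual page */
}

static uint64_t mem_bank_index(uint64_t paddr) {
    return (paddr / LINE_BYTES) % MEM_BANKS;   /* which memory bank */
}
```

Accumulating a metric into a table slot per index is then all that is needed to show, say, which cache lines are hot.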
Filtering based on the memory objects would further allow the user to determine which instructions and functions are referencing the hot cache lines, and which data structures are mapped to those lines.
The Analyzer GUI can be run either in standalone mode, or as a tab in the Sun Studio IDE. In either form, it consists of two main panes: the left pane has a series of tabs for functions, callers and callees, and so forth; the right pane has a series of tabs showing data for any selected object or event. The main tabs are listed and described below.
There are four types of tabs in the left panel: those that refer to text profiling, those that report on dataspace profiling, the Timeline, and a few miscellaneous tabs. The user can customize which tabs are visible and which are not.
The text profiling tabs are intended to show the user the most important functions, and the most important lines and instructions from those functions.
The Functions tab is the initial screen presented, and shows the user the most expensive functions in the code. It shows a table of performance metrics for each function, sorted by the most important metric. Users can sort by any metric, or specify their preference for sorting or metric display in a .er.rc file. The topmost function is selected, and it defines the context for navigating to source or disassembly. Any function can be selected from the Functions Tab, or from any of the other tabs. A Find field allows searching for any string.
The Source tab and Disassembly tab are used to drill down further into a selected function. They show an annotated listing of the source or disassembly for the selected function, and allow the user to see which pieces of code from a function are the most costly.
The source is annotated with per-line metrics, using the same set of metrics as in the function list. In addition, the source contains interleaved compiler commentary, such as loop parallelization information, software pipelining information, and Fortran 95 copyin/copyout messages.
The disassembly shows an annotated copy of the disassembly listing of the selected function. It is also annotated with per-instruction metrics; in addition, it shows interleaved source and compiler commentary.
In both source and disassembly, the Analyzer highlights all lines in the file with high metric values. A search field allows searching for the high-metric lines (or arbitrary strings), and a column next to the scrollbar shows tick marks corresponding to the high-metric lines. Clicking on a tick mark navigates directly to the corresponding line. The threshold for high metric values, and preferences over which compiler commentary to show, can be specified by the user.
For functions that have instructions that do not map to the source file (such as code from #include files, or from inlined functions), a hyperlink line for each alternate source file is shown, with metrics aggregated over all instructions in that function that come from the alternate context. Double-clicking on the link shows the alternate source file, annotated with per-line performance data relative to the selected function.
If the module has been compiled for dataspace profiling, the disassembly of load and store instructions is shown with the name of the structure and field referenced by the instruction. In addition, special lines indicating branch targets are inserted. Those lines have metrics when the dataspace profiling backtracking traversed the target.
The Callers-Callees tab shows the user how the program got to a selected function. It shows that function in the center of the screen, with its callers in a panel above and its callees in a panel below. By clicking on a caller or callee, the display is redrawn centered on the newly-selected function.
In addition to showing the metric values for each function, the tab also shows "Attributed" metrics. For the center function, the attributed metric represents the exclusive metric for that function, while for the callees below, it represents the proportion of the callee's metric that is attributable to calls from the center function. For the callers above, the attributed metrics represent the inclusive metric value for the center function, as attributed to its callers.
Just as the Functions tab shows data for functions, the Lines tab shows data for individual source lines, and the PCs tab shows data for individual instructions. Each is sorted by the selected metric. They are particularly useful when the users are trying to understand the performance of a code they did not write or do not know very well.
Dataspace profiling tabs show performance metrics relating to memory performance against the structures and fields being referenced. They allow the user to understand the layout of the fields, and how they relate to cache behavior.
The DataObjects tab is shown only when dataspace profiling data is recorded in the experiments. It shows a list of structures, and fields within the structures, sorted by one of the dataspace metric values.
The DataLayout tab is similar to the DataObjects tab, but is sorted differently. Each structure is shown, with the fields within it laid out below the structure line. Cacheline boundaries are indicated within each structure, and both the structure and field lines show the dataspace metric values.
Memory Object tabs are similar to the DataObjects Tab, but show different memory objects rather than data structures. A number of predefined memory objects are available, or the user can define a custom memory object, giving it a name and defining the mappings from virtual or physical address to the index of the specific object.
The Timeline tab shows the individual recorded events in time order. It is intended to exploit the world's most powerful pattern recognition device: the user's mind. It allows the user to recognize temporal patterns in the behavior of their application that might not otherwise be evident.
Each type of profiling data is represented by a set of bars, one for each thread, LWP, or CPU, depending on the user preference. Each event is drawn in time order on the Timeline. The call stack for an event is shown as a rectangle with colored segments, each segment representing a frame within the call stack. The user can select any event and see the details of its event-specific data and its call stack, and can warp to the Source tab, Disassembly tab, or Callers-Callees tab from any frame of the call stack. The Timeline tab has an additional bar for the global sample data for each experiment.
The GUI provides powerful hooks for managing the function coloring, which can be used to tease out a pattern. For example, all functions can be colored with one color, and then all functions whose names match some particular pattern can be colored with a different color.
The Timeline tab also allows navigation from events on one bar to the closest event on another, as well as controls over which bars should be shown and the depth of the call stack displayed.
The Leaklist tab provides a graphical representation of the memory leaks or allocations as recorded with heap tracing. It shows two bars, one for leaks and one for allocations, aggregated by common call stack. Each entry has wings above and below showing the byte count and event count, respectively. The call stack is shown in the middle, in the same format as in the Timeline tab.
The Statistics tab shows totals for various system statistics summed over the selected samples.
The Experiments tab shows the experiments currently loaded, with information about the data collected, the target and its architecture, and the machine and OS on which it was collected. It also contains a Notes field, which can be preloaded with user information from the collect command's -C option, and which can be edited and saved with the experiment for future reference.
The detail tabs are those on the right side of the Analyzer main window. They provide the detailed information for a selected item from one of the left-side tabs.
The Summary tab contains all the data known for the object that has been selected. Selections can be made from any of the text profiling tabs or dataspace profiling tabs, and can be of functions, source lines, or instructions, or of data structures and fields.
The Event tab contains the detailed information for the event selected in the Timeline tab. It shows the event-specific data and the call stack for that event. Selecting any frame in that call stack selects the function from which it came.
The Legend tab shows the colormap of functions as used by the Timeline tab. It can also be used to control the color for a function.
The Leak tab contains the detailed information for the current leak or allocation selected from the Leaklist tab.
Filtering is a very powerful technique to obtain detailed information about part of the program. By default, all the data for the experiment or experiments are shown. The user can choose to filter the data, however, to show only a subset of threads, LWPs, or CPUs, or a subset of samples. If the Analyzer was invoked on multiple experiments, or experiment groups, it can be set to show data for a subset of experiments, too.
Sample filtering is used to show data for a selected interval or intervals from the time course of the experiment. It is implemented by allowing the user to specify a set of samples to be included. Each sample corresponds to a part of the execution. By default samples are recorded every second, but they can also be generated by sending a signal to the process, or by having the process make API calls.
One application of sample filtering could be the investigation of different phases of a compiler; another could be the performance cost of some particular user action in a GUI application. For the former, API calls can be placed at the boundaries of the phases. For the latter, an API call to record a sample point can be inserted in the event handler for a mouse click to bring up a menu, and a second sample point generated by sending a signal to the target when the screen is painted. By filtering the data to include only the events between the two points defining a phase or user action, a profile specific to that phase or action is shown.
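The compiler-phase scenario can be sketched in C. The call name collector_sample follows the libcollector API used with the collect command, but here it is a local stub so the sketch stands alone; under the real tools, the library routine would record the sample point in the experiment.

```c
#include <stdio.h>

/* Stub standing in for the libcollector routine of the same name, so
 * this sketch is self-contained; the real one records a sample point. */
static int samples_recorded = 0;
static void collector_sample(const char *name) {
    samples_recorded++;
    printf("sample point: %s\n", name);
}

/* Simulated compiler phase doing some work. */
static long parse_phase(long n) {
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Bracket each phase with a sample point, so filtering on the samples
 * between two points yields a profile of just that phase. */
static void run_phases(void) {
    collector_sample("parse_start");
    (void)parse_phase(1000);
    collector_sample("optimize_start");   /* ends parse, starts optimize */
    (void)parse_phase(2000);
    collector_sample("done");
}
```

Filtering to the samples between "parse_start" and "optimize_start" would then show a profile specific to the parse phase.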
Filtering can also be applied to look at data for a subset of all threads (or LWPs, on the Solaris 8 OS under the default thread library, T1, where threads and LWPs are different entities) or of all CPUs. Comparison of different subsets can reveal anomalies in behavior for specific threads, or for threads running on specific CPUs.
A more general filtering mechanism is available, allowing complex expressions to define whether or not a particular record in the data should be counted. Filtering can be on threads, LWPs, or CPUs, or by functions as either the leaf function, or anywhere in the stack, or by data structures or memory objects being referenced in the packet. It can also be by time, or by arbitrary combinations of any of the above. Filtering is defined by an expression in C-like syntax, that evaluates to either true or false.
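As a hypothetical example of such an expression, the following keeps only events from one thread, excludes one CPU, and restricts to a time window. The keyword names are illustrative and may not match the names the tools actually accept.

```
/* keep thread 3's events, not on CPU 2, inside a time window */
THRID == 3 && CPUID != 2 && TSTAMP >= 2000000000 && TSTAMP < 5000000000
```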
All of the data can be accessed from a command-line program, the er_print utility, which will produce ASCII reports that match any of the current displays, except for the Timeline tab. The same printed output can be obtained from a Print menu item on the Analyzer.
To show the utility of the tools, we present several simple examples: a simple C program, a simple Java program, a memory-bound and floating-point-bound Fortran program, an explicitly multithreaded C program, one of the SPEC 2000 programs, mcf, and an er_kernel experiment. The first four examples are based on codes shipped as demos for the product.
Click on the screenshot images in these examples to open a larger and more readable image in another window.
synprog is a simple C program, written as part of our test suite, that shows a wide variety of typical programming constructs. We will explore one such construct, gpf, designed to show the so-called gprof fallacy.
gpf is a routine that calls two other routines, gpf_a and gpf_b. Both of these call a third routine, gpf_work.
The data shown is based on a default experiment, clock-based profiling with a ten millisecond interval. Figure 1 shows the Functions Tab, with gpf_work shown as the third most important function. Figure 2 shows the Callers-Callees tab, centered on that function to show which of the callers uses the most CPU time.
Figure 1. The Functions tab for synprog
Figure 2. The Callers-Callees tab for synprog
The attributed time from gpf_work is shown as primarily coming from gpf_b. If the traditional gprof tool had been used on this code, it would show quite the opposite. It would show the bulk of the time in gpf_work as coming from gpf_a, which is, in fact, incorrect. gprof attributes time based on the number of calls, while the Analyzer attributes time based on the actual time spent, as shown by the statistical sampling of call stacks.
The gpf_work routine takes its argument, squares it, and uses that value as a loop counter. Hence, a single call with an argument of 10 does 100 units of work, while ten calls with an argument of 1 do 10 units of work. This example is artificial, but many real functions, such as matrix-multiply or list-sorting, have their costs strongly dependent on their input arguments.
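A minimal sketch of the gpf construct, with invented bodies that count abstract work units rather than burn CPU time (the real demo spins in loops):

```c
/* Hypothetical miniature of the gprof-fallacy demo: the cost of
 * gpf_work() is quadratic in its argument, so attributing time by call
 * count, as gprof does, points at the wrong caller. */
static long work_units = 0;

static void gpf_work(int amt) {
    work_units += (long)amt * amt;    /* cost grows as the square */
}

static void gpf_a(void) {             /* ten cheap calls: 10 units total */
    for (int i = 0; i < 10; i++)
        gpf_work(1);
}

static void gpf_b(void) {             /* one expensive call: 100 units */
    gpf_work(10);
}
```

gpf_a makes ten calls but accounts for only 10 units of work, while gpf_b makes one call costing 100 units; call-count-based attribution charges gpf_a with ten-elevenths of the time even though gpf_b consumes ten times more.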
Figure 3. The Functions tab for jsynprog
The Functions tab shows a mixture of methods, some from Java source, and others from C++ source. The Callers-Callees tab would show a seamless transition between the Java and native code.
The same code is a good example of the use of the timeline. In the Java representation, the Timeline Tab is shown in Figure 4. The second bar represents the single user thread in this code. In Figure 5, the Timeline tab is shown in Expert-User mode (with some threads edited out).
Figure 4. The Timeline tab for jsynprog, User representation
Figure 5. The Timeline tab for jsynprog, Expert-User representation
In this figure, there are three profile bars shown, the main user Java thread, the Garbage collector thread, and the HotSpot compiler thread. Note that the Garbage collector thread is quite active during two of the tasks in the user thread. These two tasks are those in which a lot of garbage is, in fact, generated. In the first, the code allocates an array of fairly large arrays in a loop, with each iteration of the loop overwriting the pointers to the previous iteration's allocation. By selecting an event in the GC thread, and then using the up-arrow button to move to the nearest event in the user Java thread, the user can quickly identify the problematic method in the code, and can navigate to its source from there.
The third bar in that Timeline represents the HotSpot compiler thread. Most of the methods in the program are fairly CPU-bound, and the JVM recognizes it as such, and attempts to compile each method as it is invoked. The events on that bar directly correlate with changes in pattern in the user Java thread.
The third example is a simple Fortran program that does a vector-matrix multiply. The code is written with two copies of the dgemv function in each of four source files. The two copies differ only in the nesting of the loops. Each of the four source files is compiled with a different optimization level giving a total of eight versions of the loop.
For this example, we have run two experiments: the first with clock-based profiling, and hardware counter overflow profiling using I$ stall cycles, and D$ and E$ stall cycles; and the second with hardware counter overflow profiling, using cycles and E$ stall cycles, and no clock-based profiling.
The eight versions of the loop are named with suffixes _g1 and _g2, for the unoptimized versions of the two loop orders, _o1 and _o2 for the optimized (-O) versions, _hi1 and _hi2 for the highly-optimized (-fast) versions, and _p1 and _p2 for the parallelized versions. They show very different performance characteristics. The function list for this code is shown in Figure 6, below.
Figure 6. The Functions tab for cachetest, memory operations
The dgemv_g1 version takes approximately 12 seconds based on clock-based profiling, and 7.2 seconds based on hardware counter-cycle profiling. These two metrics represent time in which the kernel thinks the target is using User CPU time, and time in which the chip is in user mode, respectively. The difference between the two is time during which the User CPU clock is running but the chip is in kernel mode, that is, time spent in fast-trap handling in the kernel. In this example, such time is entirely spent handling TLB misses for the user. It also shows about 4 seconds in D-cache and E-cache stalls, and slightly less time in pure E-cache stalls. On the other hand, the dgemv_g2 version takes only about 4 seconds in both User CPU Time and CPU Cycles, implying a significant difference in TLB behavior. It also shows less than 1 second in D- and E-cache stalls.
For the versions compiled with -O, the two versions are each faster, but still show the significant difference in performance between the two loop orders. On the other hand, the versions compiled with -fast show that both versions take approximately the same amount of time. The compiler is smart enough to recognize that the order of the two loops can be safely inverted, and that the second order will provide better performance. Compiler commentary in the source or disassembly indicates these loop transformations.
Another application of hardware counter overflow profiling is to measure floating-point performance. Figure 7 shows the Functions tab from an experiment on this same program with clock-based profiling and hardware counter overflow profiling for FP Adds and FP Multiplies. Note that the operation counts are approximately the same for each variant of the dgemv routines. Despite the significant difference in performance, they do the same amount of work.
Figure 7. Functions tab for cachetest, floating point operations
Our fourth example is another one of the codes from our test suite, mttest. This code mimics a typical server-client code. It queues up a number of requests, and then spawns a number of threads. Each thread fetches a request from the queue, does the work for that request, and then fetches the next request from the queue. When a thread finds the queue empty, it exits.
The program sets up the queues and threads a number of times, each using a different synchronization primitive. We use this code to show two things: first, the difference between global and local locks, and second, the effect of false sharing of data between threads.
For this example, we use a single experiment, run with both clock-based profiling and synchronization delay tracing. The code uses four requests and four threads, and was run on a four CPU machine. The function list from the run is shown in Figure 8 below.
Figure 8. Functions tab for mttest
In the figure, notice the two functions, lock_global and lock_local. The two functions each use a lock to synchronize access to the individual request being processed, but in the global case, there is one lock for the entire program, while in the local case, there is a lock in each request block.
The User CPU time spent in the two functions is about the same (approximately four seconds), but the synchronization wait time is very different. In the local case, there is zero synchronization wait time, but in the global case, it amounts to about six seconds. In the local case, each thread acquires its uncontended lock, and proceeds with its roughly one second of work to process the request. In the global case, the first thread acquires the lock, but the other three threads must wait. The second thread waits about one second for the first, the third thread waits about two seconds for the first two, and the last thread waits about three seconds for the other three, adding up to the recorded six seconds of synchronization wait total for the task.
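The six-second figure can be checked with a line of arithmetic: under a single global lock, thread i must wait for the i threads ahead of it. A small helper (names invented) makes the sum explicit:

```c
/* Expected total synchronization wait under one global lock, assuming
 * each request takes 'work' seconds and threads queue up behind it:
 * thread i waits for the i threads ahead of it. */
static double global_lock_wait(int nthreads, double work) {
    double total = 0.0;
    for (int i = 0; i < nthreads; i++)
        total += i * work;     /* 0 + 1 + 2 + 3 = 6 for the example */
    return total;
}
```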
This code shows another interesting aspect of multithreaded programs: false sharing of cache lines. False sharing occurs when two threads refer to different data items, but those items share a single cache line.
For each of the request blocks processed, after synchronization, the worker thread calls a compute function to do the simulated work to process the request. These functions are named compute, computeA, computeB, and so forth. An excerpt from the annotated source for the program is shown in Figure 9 below.
Figure 9. Annotated source for mttest compute functions
Note that each of the compute functions consists of exactly the same source code. Most of the functions take approximately 4 seconds on the single line comprising the body of the function. But computeB takes 24 seconds. How can that be?
The difference between them is in the nature of the argument passed in. For all of the functions except computeB, the argument is a double word within the request block. For computeB, however, the argument is a double word from an array indexed by the index of the request. Those words occupy consecutive locations in memory, and consequently share a cache line. Whenever one thread updates its word from one CPU, the store causes any copies of the cache line in the other CPUs to be invalidated. Therefore, the next time the word is used by the other CPU, the cache line must be reloaded, and its store invalidates the other copies of the cache line. This false sharing is responsible for the significant degradation in performance of the one routine compared to the others which have the identical source.
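The layout difference behind the slowdown can be sketched directly, assuming 64-byte cache lines; the struct and array names are invented for illustration, not those of mttest.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE 64    /* assumed cache-line size */

/* One request block per thread: padding keeps each thread's hot double
 * on its own cache line, so stores by one thread do not invalidate the
 * lines the other threads are using. */
struct request {
    double sum;                          /* the hot word */
    char   pad[LINE - sizeof(double)];   /* pad block to a full line */
};

struct request blocks[4];                /* computeA-style: no sharing */
_Alignas(LINE) double shared[4];         /* computeB-style: one line   */

/* 1 if two addresses fall within the same cache line. */
static int same_line(const void *a, const void *b) {
    return ((uintptr_t)a / LINE) == ((uintptr_t)b / LINE);
}
```

The four doubles of `shared` occupy 32 consecutive bytes of one line, so every store by one thread invalidates the other CPUs' copies; the padded `blocks` avoid that entirely.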
Another example is mcf, one of the SPEC 2000 benchmarks. This code is a single-depot vehicle scheduler, single-threaded, and written in C. We compiled the code for dataspace profiling, and ran two experiments collecting dataspace data. Figure 10 shows the Functions tab from the combination of the two experiments, with only exclusive metrics displayed.
Figure 10. Functions tab for mcf
The function list shows that running this code takes approximately 549 seconds, of which almost 300 seconds is E$ stall cycles. Furthermore, the code shows a large TLB miss rate.
Figure 11 shows the DataObjects tab from those experiments. It shows which structures and fields are the ones against which the E$ stall cycles are accumulated. The data object appearing about two-thirds of the way down the screen represents events for which the backtracking failed. It is treated as a structure, with fields corresponding to each of the ways in which the backtracking can fail.
Figure 11. DataObjects tab for mcf
Figure 12 shows the DataLayout Tab for the same experiments. In that display, the structures and their fields are ordered, and cacheline boundaries are indicated. The data shows that the node structure is almost, but not quite, one 128-byte E$ line, and that the hot fields in that structure, and in the arc structure, are spread across multiple cache lines. Padding the node structure to be a full E$ line, and rearranging the fields in the node and arc structures, along with specifying large pages for the heap, combine to yield an improvement of more than 20% in the performance of this code.
Figure 12. DataLayout tab for mcf
The final example is taken from a kernel profile. The experiment is a nine-second TTCP-Receive run on a four-CPU V440, running the Solaris 10 OS. The data was recorded using the DTrace driver in the er_kernel utility.
Figure 13 shows the Functions tab from that experiment. The time shown for the <Total> function is about 36 seconds, representing 9 seconds on each of the four CPUs. About 24 seconds of that time is attributed to <IDLE>, while only 0.2 seconds is attributable to <USER_MODE>. Of the kernel CPU time of about 12 seconds, 3.4 seconds, or about one-half of one CPU, is attributed to intr_thread, the kernel code that processes interrupts.
Figure 13. Functions tab from the TTCP_Receive er_kernel experiment
Figure 14 shows the Timeline tab from that experiment, ordered by CPU. The data shows the clear partitioning of functionality among the CPUs.
Of the three busy CPUs, the first, CPU 0, is processing the user calls, for which most of the work is in copying the data from the kernel receive buffers to the user code. CPU 1 is handling the input stream, while CPU 3 is handling all interrupts for the IO. CPU 2 is idle. The pattern of usage is quite evident from the different coloring of events on each of the four CPU bars.
Figure 14. Timeline tab from the TTCP_Receive er_kernel experiment
One of the difficulties in testing performance tools is that they are based on statistical sampling, and thus are not repeatable from run to run. To verify the correctness of the collector's statistical sampling, we implemented a set of test codes, each of which writes an accounting file, with gethrtime and gethrvtime deltas for each of several tasks. An experiment is run on the code, and the Function list output is compared with the accounting file.
For clock-based profiles on the Solaris OS, the comparison is done for both Total Time and User CPU Time. On Linux, only the User CPU Time is used. For hardware counter experiments, the comparison is between the accounting file, and the data for Cycle-based profiles.
A test passes if all functions are within 3% of the accounting file number for Total Time, and 5% for User CPU Time, and the absolute difference does not exceed a threshold of 300 ms for Total Time and 500 ms for User CPU Time.
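Read literally, the criterion requires both the relative and the absolute bound to hold. As a predicate (names invented for illustration):

```c
/* Hypothetical check of one function's measured time against the
 * accounting-file value: pass only if the difference is within pct
 * percent of the expected value AND under abs_ms milliseconds. */
static int within_tolerance(double measured_ms, double expected_ms,
                            double pct, double abs_ms) {
    double diff = measured_ms - expected_ms;
    if (diff < 0)
        diff = -diff;
    return diff <= expected_ms * pct / 100.0 && diff <= abs_ms;
}
```

For Total Time the call would be within_tolerance(measured, expected, 3, 300); for User CPU Time, within_tolerance(measured, expected, 5, 500).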
There are many circumstances under which one or more functions may show a discrepancy, and the test harness provides a mechanism for tagging such functions as exceptions. The harness also provides for compiling each of the test codes with a wide variety of compiler options to ensure that the tools can correctly measure optimized codes.
The test suite is run every night on the nightly build of the tools on each of 25 machines, each with a different chip or OS flavor.
The tools currently support measurements on user programs written in C, C++, Fortran, the Java programming language, or a mixture of them, either single-threaded or multithreaded. We are intending to extend them to provide new ways of navigating through the data, to cover other programming models, and to record other kinds of data.
One of the most frequently requested enhancements to the tools is to provide a way to compare two experiments, or two threads, CPUs, or time-regions from a single experiment. For experiments on exactly the same executable, the comparison is straightforward, but comparing experiments on two different versions of the target can be problematic. To successfully do such comparisons, the tools must determine when two functions are really the same, which source lines can be matched and which cannot.
The current Callers-Callees tab shows the callers and callees of a selected function, and allows the user to navigate up and down the program hierarchy. Users have requested a more graphical display showing a tree of the program structure, and providing for expanding and collapsing any nodes in the tree.
The current MPI support only traces the MPI library APIs. Other tools, VAMPIR among them, collect information on the MPI messages, and have a graphical display connecting the sends and receives. We would like to extend the tools to provide the equivalent. We would also like to extend the tools to collect aggregate statistics, equivalent to the mpprof functionality in the Sun HPC ClusterTools software.
For OpenMP programs, we are considering adding a loop-level to the hierarchy of program structure (functions, lines, instructions), and capturing a block of data for each execution of a parallel region, which would include data on thread balance and iteration counts.
We would also like to develop a way to report call stacks as they would be thought of in the user model, rather than as they are executed.
We could wrap any I/O system calls, and record trace events for those, too. And we could provide a mechanism for user-defined traces.
Marty Itzkowitz received an A.B. degree from Columbia College and a Ph.D. in Chemistry and Physics from CalTech. After a post-doctoral fellowship at the University of California, Berkeley, he worked on operating systems and distributed services at Lawrence Berkeley Laboratory. He was head of Operating Systems at Vitesse Electronics, and then worked on operating system performance and performance tools at Sun Microsystems and then at Silicon Graphics. He returned to Sun in 1998 as project lead for the Sun ONE Studio Tools. His interests include operating system design and performance, multiprocessor performance, performance tools and scientific visualization. He is an avid handball player and cook.
Marty has written and/or presented papers at various conferences, including: The Second International Conference on Distributed Computing Systems, Paris, 1981; AFUU Convention Unix 90, Paris, 1990; Supercomputing Debugging Workshop, Albuquerque, 1991; Workshop on Debugging and Performance Tuning for Parallel Computing Systems, Cape Cod, 1994; SuperComputing '96, Pittsburgh, 1996; WOMPAT 2000, San Diego 2000; and SuperComputing '03, Phoenix, 2003.