Profiling Java Applications with Sun Studio Performance Tools

By Marty Itzkowitz, Project Lead, Sun Studio Performance Tools, Sun Microsystems, August 25, 2006  
This article describes how to profile Java applications using the Sun Studio Performance Tools.


How Java Applications Run

To the typical Java developer, a Java application runs just as any other program runs. The program begins at a main entry point, typically named class.main . That method may call other methods, which may call other methods, just as a C or C++ application does.

To the operating system, a Java application, either pure Java or mixed Java/C/C++, runs as a process instantiating the Java Virtual Machine (JVM). The JVM is compiled from C++ sources and starts execution at _start, which calls main, and so forth. It reads a Java byte-code program from .class and/or .jar files, and performs the operations specified in that program. Among the operations that can be specified is the dynamic loading of a shared object, and calls into various functions or methods contained in that object.

During execution of a Java application, most Java methods are interpreted by the JVM; these methods are referred to below as interpreted methods. Other java methods may be dynamically compiled using HotSpot, and are referred to as compiled methods. Dynamically compiled methods are loaded into the data space of the application, and may be unloaded later. For any particular Java method, there will be an interpreted version, and there may also be one or more compiled versions. Java code may also call directly into compiled code, either C, C++, or native-compiled (NJC) Java; the targets of such calls are referred to as native methods.

The JVM does a number of things that are typically not done by applications written in traditional languages. When the JVM starts, it creates a number of regions of dynamically-generated code in its data space. One of these is the actual interpreter code used to process the applications Java byte-code methods.

During the interpretive execution, the HotSpot JVM monitors performance, and may decide to take one or more methods that it has been interpreting, and generate machine code for them, and execute the more-efficient machine code version, rather than interpret the original version. That generated machine code is also in the data space of the process. In addition, other code is generated in the data space to execute the transitions between interpreted and compiled code.

A Java program is inherently multithreaded, and has one JVM thread for each thread in the user's Java program. It also has several housekeeping threads used for signal handling, memory management (garbage-collection), HotSpot compilation, and other things. The main thread is used for Java initialization, and then becomes the user's java-program main thread; additional user threads are created as needed. Java always uses bound threads, so that there is a one-to-one correspondence between threads and LWPs.

Java programs allocate memory for instantiating classes and storing data, but, unlike C and C++ applications, they do not explicitly deallocate the memory. Rather memory is managed by a so-called garbage-collector. That code, a part of the JVM, periodically scans memory to find allocated areas that are no longer pointed to elsewhere in the program, and it reclaims the memory, deallocating it and making it available for other uses.


Java Callstacks

Java callstacks, as reconciled, are shown in the Java Representation, and in the Expert Java Representations if they were captured for the event. Native callstacks are shown in the Machine representation, and in the Expert Java Representation if a Java callstack was not captured for the event.


Clock- and HW-counter profiling


Synchronization Tracing


Runtime Performance Distortion and Dilation

For clock- and HWC-profiling, the data collection process makes various calls into the JVM, and handles profiling events in signal handlers. The overhead of these routines, and the cost of writing the experiments to disk will dilate the runtime of the Java program. For the most part, we estimate that dilation as less than 10%.

In addition, although the default garbage collector supports JVMPI/JVMTI, there are other garbage collectors that do not. Any data-collection run specifying such a garbage collector will get a fatal error.

For synchronization tracing, data collection uses other JVMPI/JVMTI events, which will cause dilation in proportion to the amount of monitor-contention in the application.

Collecting the Data

Profiling a Java application is done by collecting data on the JVM as it runs the Java application.

In the current release, clock- and hw-counter-profiling, as well as synchronization tracing are supported.

  • Using the collect Command on the JVM
    The collect command recognizes that the target supplied is a Java class or jar file, and automatically invokes the JVM with appropriate flags for data collection. Alternatively, the user may explicitly use a named JVM version as the target, and supply a -j on flag, so that the collect command can interpose the necessary arguments for data capture to the JVM argument list. For 64-bit JVM runs, the 64-bit JVM must explicitly be listed as the target, and the -j on flag supplied.
  • Under dbx
    Data collection for Java is not yet supported under dbx.
  • From Applications that dlopen a JVM
    A non-Java application can dynamically open and pass it a Java class. A workaround to support profiling for such an application, is to not set -j on for the initial collect invocation, and to ensure that the -Xruncollector option is passed to the invoked JVM.
  • From Descendant Processes that run JVMs
    Applications can also create descendant processes that run a JVM. Java profiling is not yet supported for this usage.

Browsing the Data

Browsing the data can be done with the GUI Analyzer or the command-line er_print.

There are three modes of looking at the data from Java programs: the user representation, the expert representation, and the machine representation. By default the user representation is shown. The Analyzer has a radio button in the Data Presentation Dialog for turning view mode to user, expert, or machine. er_print has a command to set viewmode user, viewmode expert, or viewmode machine.


The User Representation

A PC for a Java method in the Java representation corresponds to the method-id and a byte-code index into that method; a PC for a native function correspond to a machine PC. The callstack for a Java thread may have a mixture of Java PCs and machine PCs. It will not have any frames corresponding to Java housekeeping code, which does not have a Java representation. Under some circumstances, the JVM can not unwind the Java stack, and a single frame with the special function, <no Java callstack recorded>, is returned. Typically, it will amount to no more than 5-10% of the the total time.

For the housekeeping threads, only a machine callstack is obtained, and, in the Java representation, data for those threads is attributed to the special function, <JVM-System>.

The function list in the Java representation shows metrics against the Java methods, and any native methods called. It also will show the pseudo-functions from JVM overhead threads. The caller-callee panel shows the calling relationships in the Java representation.

Source for a Java method corresponds to the source code in the .java file from which it was compiled, with metrics on each source line. The disassembly of any Java method shows the byte-code generated for it, with metrics against each byte-code, and interleaved Java source, where available.

The Timeline in the Java representation shows all Java threads. Each Java thread's event callstack is shown with its Java methods.

Java programs may have explicit synchronization, usually performed by calling the monitor-enter routine.

Synchronization-delay tracing in the Java representation is based on the JVMPI/JVMTI synchronization events. Data from the native synchronization tracing is not shown in the Java representation.


The Expert Representation

<no Java callstack recorded>

All threads, including the JVM system threads are shown in the Timeline.


The Machine Representation

In the machine representation, multiple HotSpot compilations of a given method will be shown as completely independent functions, although the functions will all have the same name. All functions from the JVM will be shown as such.

Source from compiled methods will be shown against the Java source; the data will represent the specific instance of the compiled-method selected. Disassembly for compiled methods will show the generated machine assembler code, not the Java byte-code. Caller-callee relationships will show all overhead frames, and all frames representing the transitions between interpreted, compiled, and native methods.

The Timeline in the machine representation will show bars for all threads, LWPs, or CPUs, and the callstack in each will be the machine-representation callstack.

In the machine representation, memory is allocated and deallocated by the JVM, typically in very large chunks. Memory allocation from the Java code is handled entirely by the JVM and its garbage-collector.

In the machine representation, thread synchronization devolves into calls to _lwp_mutex_lock, and no synchronization data is shown, since these calls are not traced.

Advanced Topics

Timeline-controls for Java Programs

(Thanks to Po-Ting Wu for developing and implementing these features)

The Analyzer Timeline has two features that can be used to help understand Java programs in particular. They are:

  • Control over coloring of classes and methods
  • Control over non-CPU-time profile displays

Control over Coloring of Classes and Methods

The color for any method can be set individually, or a color can be selected, and then applied to a set of methods. The set can be all those methods whose name starts with a given string, or ends with a given string, or contains a given string, or which matches a regular expression.

To use this functionality, bring up the Analyzer on an experiment, select the Timeline, and then bring up the color chooser. Initially, set all functions to some arbitrary color, say gray. Then, select a color, and then select the algorithm to be used from the option menu initially labeled "Starts with". Then enter a string or regular expression in the text field to the right of the option menu, and click on the "Set Functions" button to the left. All methods whose name matches the algorithm chosen will be colored with the selected color.

One particularly useful strategy to use is to set it to match all methods in a class or set of classes. To do so, select "Starts with" and enter The leading parts of a extended class name. For example org.openide.loaders will select the NetBeans data loader classes. You can set all its methods to one color, all the methods from to another, all the methods in org.netbeans.core.modules to a third, etc. The color patterns will enable your to recognize their patterns of use in the timeline.


Control over Non-CPU-Time Profile Displays

Signal-control of Data Collection

For the mechanism to work properly, the signal specified must be one that the user code does not use, and does not explicitly ignore, or set a signal handler for. If the target sets a signal-handler, the collector will put a warning message in the experiment, but the data collection controls will not work properly.


Signal Controls


For programs with a very long startup that is not interesting, pause and resume is usually preferable. For programs which don't run more than ten or fifteen minutes total, sample-marking is usually preferable.

collect-y SIG SIG ,r

If pause and resume of data collection is used, the total data volume will be lower, but you will need to record a new experiment if you want to look at a part of the run during which data collection was paused.

Sample Markers
collect-l SIG SIG collect-Soff

If sample markers are used, the data volume will be higher, but you can filter by sample to look at any part of a run, even parts you thought would be uninteresting when the data was recorded. The filtering can be changed at any time, and data is there for the entire run.


Profiling Startup Time of an Application

To use pause and resume, invoke collect with -y SIG,r where SIG is the name or number of the signal chosen. When the target has completed its startup, send the signal to it, and then let the target run to completion, or terminate it.

The data in the experiment will only cover the startup interval.

To use the sample mechanism, use -l instead of -y, and proceed as above, sending the signal to the target at the end of startup. When you look at the experiment, the first sample will contain the data for the startup.


Profiling Steady-state Behavior of an Application

To use pause and resume, invoke collect with -y SIG where SIG is the name or number of the signal chosen. Data collection will start paused, and when the target reaches steady-state, send it the signal, to start data collection. When you are ready to terminate the run, send the signal again, which will disable data recording.

The data in the experiment will cover the interval between the two signals, which should be the steady-state behavior.

To use the sample mechanism, again use -l instead of -y, and send the signals at the same times. The data in the experiment will show startup in the first sample, steady-state in the second, and shutdown in the third.

API-control for Measuring GUI Operations

(Thanks to Po-Ting Wu for deriving these techniques)

In order to measure the performance of GUI operations, you must have some way of marking the data so that you can selectively look at the behavior of the application in performing the operation.

If you have an application which sits in an idle state until the user requests an operation, then performs that operation, and then returns to the idle state, you can use the Signal-control of data collection techniques discussed above. If your application is more complex, and you don't want to rely on your own guess as to when the application is in its idle state, you can modify various classes to use the collector API to give more precise notification.

To do so, you need to ensure that the collector API is loaded into your application, and modify a few Swing classes to make API calls at key places. The steps to be taken are:

  1. Modifying the classes
  2. Compiling the changed classes
  3. Packing the new .class files into a new SE.jar file
  4. Ensuring the new SE.jar file and libcollectorAPI.jar is Loaded
  5. Collecting the data


Modifying the Classes

  • the AWT EventQueue class -- to set samples at mouse and key events;
  • the Swing JMenu class -- to set a sample when a menu selection is changed; and
  • the Swing JMenuItem class -- to set a sample when a submenu selection is changed.

The changes to each are shown below.

Modifying .../java/awt/
The EventQueue class contains a method, getNextEvent() , which is invoked when a the mouse button is pressed or released, or when a key is pressed. When the method is called, the java.awt.AWTEvent.getID() returns 501 for mouse-down, 502 for mouse-up, and 401 for key-pressed.

The code is modified to call the collector API to record a sample for each of those events. The relevant part of the file is shown below, with added code is indicated red.

public AWTEvent getNextEvent() throws InterruptedException {


do {

 * SunToolkit.flushPendingEvents must be called outside
 * of the synchronized block to avoid deadlock when
 * event queues are nested with push()/pop().
synchronized (this) {

for (int i = NUM_PRIORITIES - 1; i >= 0; i--) {

if (queues[i].head != null) {

EventQueueItem eqi = queues[i].head;
queues[i].head =;
if ( == null) {

queues[i].tail = null;

// (Inserted code below)
switch (eqi.event.getID()) {
case 501:

CollectorAPI.sample("Mouse Down");

case 502:

CollectorAPI.sample("Mouse Up");

case 401:

CollectorAPI.sample("Key Pressed");

// (End inserted code)
return eqi.event;




} while(true);


Modifying .../javax/swing/
A menu selection is another important event, but it may be triggered just by moving the mouse, not pressing the mouse button or a key. To record a sample at those events, the JMenu's method, menuSelectionChanged() , is modified to call the collector API to record a sample for each of those menu-selection events The relevant part of the file is shown below, with added code is indicated red.

public void menuSelectionChanged(boolean isIncluded) {


// (Inserted code below)
CollectorAPI.sample("Menu Selected");
// (End inserted code)
if (DEBUG) {

System.out.println("In JMenu.menuSelectionChanged to" + isIncluded);



Modifying .../javax/swing/
A similar change must be made in JMenuItem's menuSelectionChanged() method, since JMenu does not call super. The relevant part of the file is shown below, with added code is indicated red.

public void menuSelectionChanged(boolean isIncluded) {


// (Inserted code below)
CollectorAPI.sample("Menu Selected");
// (End inserted code)



Compiling and Using the New Classes


% javac -classpath $COLLECT_HOME/lib/collector.jar java/awt/
% javac -classpath $COLLECT_HOME/lib/collector.jar javax/swing/
% javac -classpath $COLLECT_HOME/lib/collector.jar javax/swing/

The newly compiled .class files should be packed into SE.jarfile, along with the other unmodified classes, maintaining its package structure.

% jar cvf SE.jar java/awt/*.class javax/swing/*.class


The JVM run should be invoked to use the new SE.jar file and the collector.jar, by adding the following options to the JVM invocation:

-Xbootclasspath/p:.../SE.jar:$COLLECT_HOME/lib/collector.jar -Dsun.boot.library.path=$JDK_HOME/jre/lib/sparc:$COLLECT_HOME/lib


Finally, the data collection run can be made, using whatever data collection options you'd like. It is a good idea to disable the default periodic sampling, to make it easier to spot the events you care about. Do so by adding the following option to the collect command: -S off


Examining Experiments Recorded with the API

An example of the Timeline from such an experiment is:

In this example, the sample numbers, and the legend superimposed on the timeline were added to the image by hand. In addition, the Timeline coloring techniques described above were used to color the callstacks by the classes to which each function belongs.

Script-launched Pure Java Applications

A script-launched pure Java application is one that is launched by a shell script, RunMyApp, which computes various arguments to be passed to the JVM or to the user's Java code, and then exec's a version of the binary java on the user's .jar or .class file.

The crucial line in such scripts is one that contains something like:

        eval {JAVA_LOCATION}/bin/java . . .

The modify such scripts to collect data, the simplest trick is to preface the eval with a "preface" command, defined by some environment variable, for example, MY_SCRIPT_PREFACE, and also add another environment variable, MY_SCRIPT_JAVA_ARGS that allows you to pass additional arguments to the Java command appropriate to the MY_SCRIPT_PREFACE command.

To collect performance data, all you then need to do is:

setenv MY_SCRIPT_PREFACE "collect"

The result of that run would be an experiment, named by default, in the directory from which the script was launched, or in whatever directory the script cd'd to before running the Java code.

If you want to record collecting data other than the default experiment, you can, for example:

setenv MY_SCRIPT_PREFACE "collect -p on -h cycles -d /tmp"

The result of that run would be an experiment containing both clock-profiling and HW-counter cycle profiling, stored in the directory /tmp, and again, named by default,

You can also use signal-controls, discussed above, for marking samples, or for enabling and disabling data collection during the run, in order to have finer control on the performance data. To use those controls, you could set:

setenv MY_SCRIPT_PREFACE "collect -S off -l USR1 -y USR2 -d /tmp"
setenv MY_SCRIPT_JAVA_ARGS " -Xusealtsigs "

The -S argument to collect turns periodic sampling off; the -l argument sets SIGUSR1 as a signal to record a sample, and -y argument sets SIGUSR2 as a signal to toggle recording data from paused to resumed.

With these settings, the experiment starts with data collection paused; if the -y argument were written as -y USR2,r, the experiment would start with data collection enabled.

As soon as the code is launched, you can determine its process id, pid. Whenever you want to put a sample in the experiment, enter the command:

         kill -USR1 pid

Whenever you want to toggle recording data in the experiment, enter the command:

         kill -USR2 pid

Because of the way Java uses signals internally, it must be passed the -Xusealtsigs flag, telling it not to use SIGUSR1 and SIGUSR2.

Profiling NetBeans

The "NetBeans Performance Page" contains quite a few HOWTOs about what the NetBeans team did in this area recently. Extensive improvements in performance were achieved for S1S 5, and most of the information on that page relate to the S1S 5 effort.

NetBeans is a Script-launched Pure Java Application, and can be profiled with the Sun Studio tools, as described above. The NetBeans startup script uses an environment variable, _NB_PROFILE_CMD, in lieu of the generic MY_SCRIPT_PREFACE described above. It also supports one or more -J arguments that can be used in lieu of the generic MY_SCRIPT_JAVA_ARGS described above.

See Also:

Rate and Review
Tell us what you think of the content of this page.
Excellent   Good   Fair   Poor  
Your email address (no reply is possible without an address):
Sun Privacy Policy

Note: We are not able to respond to all submitted comments.