Java 2 Platform, Standard Edition (J2SE Platform),
version 1.4.2
Performance White Paper

Java Platform Performance Engineering

Sun Microsystems, Inc.

Contents

Introduction
New Performance and Scalability Features
   Garbage Collection Improvements
      Throughput Collector
      Concurrent Low Pause Collector
      AggressiveHeap - Server Performance Option
   SSE and SSE2 Instruction Sets for Floating Point Computation
JVM Runtime Optimizations
Light Weight Performance Monitoring
New Platform Support
   IA64 64-bit for Windows and Linux
   Linux Thread Optimizations
Client-side Performance Improvements
   Start-up Performance
Appendix
   Systems Under Test
      x86 Test System
      Test System for the SPARC® Architecture
   Testing Methodology
      Sample Size
      Calculate the Confidence Interval
   Benchmark Disclosure
      SPECjbb2000
      SciMark 2.0
      Instant Messaging Benchmark
      Startup Time Benchmark

Introduction

Similar to the release of Java 2 Platform, Standard Edition (J2SE) version 1.4, a design center for the release of J2SE platform, version 1.4.2 was to improve the performance and scalability of the Java platform. In order to do that, the team at Sun Microsystems, Inc. put in place a rigorous program to drive these improvements, working closely with customers and partners to determine key areas where performance improvements would have the most impact. Sun Microsystems, Inc. is also driving performance improvements through the use of various industry standard and internally developed benchmarks. These improvements span areas important to both server-side and client-side Java programs.

This guide gives an overview of the performance and scalability improvements made in the J2SE version 1.4.2 release. This includes the results of various benchmarks to demonstrate improvements in existing APIs as well as an overview of key new technologies included in J2SE for the first time. Version 1.4.2 gives you the infrastructure you need for your application to:

  • Utilize machines with larger numbers of CPUs
  • Run applications requiring low pause times
  • Perform on many new hardware and software platforms

This guide contains sample performance results from a range of applications running on different operating systems and hardware, and helps to illustrate that the performance improvements are broadly applicable to many systems and applications.

New Performance and Scalability Features

Several key new features that improve scalability have been added to the J2SE platform in version 1.4.2. These include the new throughput and concurrent low-pause garbage collectors, Linux NPTL thread library support, and SSE/SSE2 register support for floating point operations.

Garbage Collection Improvements

In the J2SE platform version 1.4.1, two new garbage collectors were introduced, bringing the total to four garbage collectors from which to choose. In J2SE platform version 1.4.2, the performance of the new collectors has been improved through algorithm optimizations and bug fixes, along with documentation updates that help developers and administrators decide which collector to choose and when.

For a detailed look at garbage collection and the new collectors, go to:

here

Throughput Collector

The throughput collector uses a parallel version of the young generation collector. It is enabled by passing the -XX:+UseParallelGC option on the command line. The tenured generation collector is the same as the default collector. Use the throughput collector when you want to improve the performance of your application on machines with larger numbers of processors.

In the default collector, garbage collection is done by one thread, and therefore garbage collection adds to the serial execution time of the application. The throughput collector uses multiple threads to execute a minor collection, which reduces the serial execution time of the application.

Concurrent Low Pause Collector

The concurrent collector is used to collect the tenured generation and does most of the collection concurrently with the execution of the application. The concurrent collector employs a separate collector thread that consumes CPU cycles during application execution; this allows the application to be paused for only short periods of time during the collection, but can lower overall throughput. It is enabled by passing the -XX:+UseConcMarkSweepGC option on the command line.

Use the concurrent collector if your application would benefit from shorter garbage collector pauses and can afford to share processor resources with the garbage collector while the application is running. Typically, applications that have a relatively large set of long-lived data (a large tenured generation) and run on machines with two or more processors tend to benefit from this collector. However, this collector should be considered for any application with a low pause time requirement. Optimal results have been observed for interactive applications with tenured generations of a modest size on a single processor.

AggressiveHeap - Server Performance Option

The -XX:+AggressiveHeap option inspects the machine's resources (size of memory and number of processors) and attempts to set various parameters to be optimal for long-running, memory-allocation-intensive jobs. It was originally intended for machines with large amounts of memory and a large number of CPUs, but in the J2SE platform, version 1.4.1 and later, it has shown itself to be useful even on four-processor machines. The machine must have at least 256 MB of physical memory before AggressiveHeap can be used. The size of the initial heap is calculated from the size of physical memory, and the algorithm attempts to make maximal use of physical memory for the heap: it uses half of physical memory, or all of physical memory less 160 MB, whichever is smaller.
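The heap-sizing rule just described can be sketched in a few lines. This is only an illustration of the documented formula, not the JVM's actual implementation; the class and method names are invented for the example.

```java
// Illustrative sketch of the AggressiveHeap initial-heap calculation:
// use half of physical memory, or all of physical memory minus 160 MB,
// whichever is smaller.
public class AggressiveHeapSketch {
    static final long MB = 1024L * 1024L;

    static long initialHeap(long physicalMemoryBytes) {
        long half = physicalMemoryBytes / 2;
        long allButReserve = physicalMemoryBytes - 160 * MB;
        return Math.min(half, allButReserve);
    }

    public static void main(String[] args) {
        // On a 4 GB machine, half (2048 MB) is smaller than 4096 - 160 = 3936 MB.
        System.out.println(initialHeap(4096 * MB) / MB); // prints 2048
        // On a 256 MB machine, 256 - 160 = 96 MB is smaller than half (128 MB).
        System.out.println(initialHeap(256 * MB) / MB);  // prints 96
    }
}
```

Note how the 160 MB reserve only dominates on small-memory machines, which is consistent with the 256 MB minimum stated above.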

There are several optimizations and changes in parameter values with AggressiveHeap in J2SE platform version 1.4.2 that were added in an effort to make the option more useful for general server use. AggressiveHeap is recommended for server applications requiring high performance and scalability and can greatly ease performance tuning efforts. With Sun's emphasis on low-cost enterprise computing, we renewed our commitment to performance on Sun's latest offerings. Illustration 1 shows the performance gains as measured on SPECjbb®2000 while using AggressiveHeap on J2SE platform version 1.4.1 and J2SE version 1.4.2. Although SPARC was already highly optimized in J2SE version 1.4.1 with AggressiveHeap, we made further improvements of nearly 20% in J2SE version 1.4.2.

SPECjbb2000 Performance Improvement
Illustration 1: J2SE platform version 1.4.2 Performance as shown with SPECjbb2000

SSE and SSE2 Instruction Sets for Floating Point Computation

J2SE platform version 1.4.2 now uses SSE and SSE2 instruction sets for floating point computations on hardware and software platforms that support this feature. Use of the SSE and SSE2 instruction sets allows J2SE platform version 1.4.2 to have optimal performance of scientific and numerical computations and to take full advantage of new hardware and software platforms. The graph below highlights the performance gain of SSE and SSE2 instruction support as measured by SciMark 2.0, a scientific and numerical computing application performing floating point computations.
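A representative kernel of the kind that benefits from SSE/SSE2 code generation is a simple double-precision dot product. The sketch below is a generic illustration of such a floating-point loop, not one of the actual SciMark 2.0 kernels.

```java
// A tight double-precision loop of the sort the JIT compiler can map
// onto SSE/SSE2 instructions on hardware that supports them.
public class DotProduct {
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = new double[1000];
        double[] b = new double[1000];
        for (int i = 0; i < a.length; i++) {
            a[i] = 1.0;
            b[i] = 2.0;
        }
        System.out.println(dot(a, b)); // prints 2000.0
    }
}
```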

SciMark 2.0 Performance Improvement
Illustration 2: J2SE Platform version 1.4.2 Performance as shown with SciMark 2.0

JVM Runtime Optimizations

Several optimizations and bug fixes are included in the J2SE platform version 1.4.2 Java virtual machine (JVM) runtime, which have improved overall performance and, in some cases, show substantial performance improvement. Note that the terms "Java virtual machine" and "JVM" mean a virtual machine for the Java platform.

Example JVM Optimization Performance
Illustration 3: JVM optimization performance running a thread micro-benchmark on Red Hat Linux 7.3

An example of such an improvement was making system dictionary reads lock-free. The system dictionary is an internal JVM data structure that holds all the classes loaded by the system. The change helps significantly for calls like Class.forName(), which do lookups into this data structure at the lowest level. Before this change, both readers and writers took out a lock to access the system dictionary. Illustration 3 above highlights the performance gain from the system dictionary locking improvements, as measured by a heavily threaded micro-benchmark running on Red Hat Linux 7.3 with traditional pre-NPTL Linux threads. The micro-benchmark measures 400 threads all accessing the system dictionary simultaneously.
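A micro-benchmark along these lines could be sketched as follows. The thread count, iteration count, and class looked up are placeholders chosen for the example, not the actual benchmark source.

```java
// Sketch of a micro-benchmark in which many threads repeatedly hit the
// system dictionary through Class.forName(); with the 1.4.2 lock-free
// read path, these concurrent lookups no longer serialize on one lock.
public class DictionaryLookupBench {
    public static void main(String[] args) throws Exception {
        final int threadCount = 8;        // the paper's benchmark used 400
        final int lookupsPerThread = 10000;
        Thread[] workers = new Thread[threadCount];
        long start = System.currentTimeMillis();
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    try {
                        for (int j = 0; j < lookupsPerThread; j++) {
                            // Each call does a system dictionary lookup.
                            Class.forName("java.util.HashMap");
                        }
                    } catch (ClassNotFoundException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
            workers[i].start();
        }
        for (int i = 0; i < threadCount; i++) {
            workers[i].join();
        }
        System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));
    }
}
```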

Light Weight Performance Monitoring

Monitoring the performance of deployed Java applications can be rather challenging. Existing tools are either too intrusive or can only be enabled with a restart of the application. The Java HotSpot virtual machine included with J2SE platform version 1.4.2 includes an experimental lightweight instrumentation and monitoring interface. This interface is always on and provides for non-intrusive, real-time JVM performance monitoring in production environments.

The HotSpot JVM in J2SE platform version 1.4.2 includes instrumentation for the various garbage collectors, the client and server JIT compilers, the class loader, and various configuration parameters. The instrumentation is exported through a private interface that allows for asynchronous monitoring.

A set of experimental performance monitoring tools, called the jvmstat tools, is provided as a separate download from:

here

Note: The jvmstat 1.0 tools only support the HotSpot Java Virtual machine distributed with J2SE platform version 1.4.1. A release of jvmstat tools that supports the HotSpot Java Virtual Machine distributed with J2SE platform version 1.4.2 will be available shortly.

The jvmstat tools provide for asynchronous sampling and display of the instrumentation exported from the HotSpot JVM. The jvmstat command-line tool displays the instrumentation in textual format. The visualgc tool provides a graphical view of the garbage collection system and is useful for diagnosing Java runtime environment heap configuration and tuning issues.

The combination of instrumentation for the Java HotSpot virtual machine and the jvmstat monitoring tools provides a new and powerful mechanism for monitoring the performance of production Java applications.

New Platform Support

IA64 64-bit for Windows and Linux

With the release of J2SE platform version 1.4.2 comes a new addition to the platforms supported by the HotSpot JVM. Full 64-bit support for the Intel IA-64 architecture and the Itanium family of processors is a major addition for the J2SE platform, and is a strong example of Sun's continued focus on enterprise and network computing.

With a new port to IA-64 comes the opportunity to leverage past work on the 64-bit version of the Java virtual machine for the Solaris Operating System (SPARC® Platform Edition), ensuring delivery of a reliable, scalable, high-performance, and highly competitive 64-bit version of the JVM.

Linux Thread Optimizations

The first thing Java developers notice when running their applications on Linux is that the ps command, used to display the list of processes, appears to show multiple copies of the Java runtime environment running even though only one Java application was started.

This is due to the implementation of the system threads library on Linux. Linux threads are implemented as cloned processes, which means each Java thread appears as a new Linux process. The advantage of this approach is that the threads implementation is simpler and more stable; the downside is that it affects the performance of even a moderately threaded Java application on Linux.

The confusing ps command display issue is fixed in version 2.0.7 of the procps package; however, the overhead of using a process for each Java thread has been, until now, the biggest challenge to adopting the Java runtime environment on the Linux platform.

The scalability and signal handling issues of the Linux threads implementation are well known in the Linux community. The two best-known thread library projects that set out to solve this problem have been the NGPT (Next Generation POSIX Threads) library and a new library called NPTL (Native POSIX Thread Library).

The NPTL approach keeps the 1:1 thread mapping (one user or Java thread to one kernel thread) but optimizes the kernel for thread-related operations, including signal handling, synchronization, and thread creation speed. The NPTL library is now available in Red Hat 9 by default, and this is an exciting time for Java developers on the Linux platform.

NPTL Highlights:

  • O(1) scheduler, which claims to scale better.
  • Signal delivery is now POSIX-compliant.
  • No hard limit on the number of threads. The only limit is address space (32-bit) or physical memory.
  • Synchronization is done by the new futex interface within the kernel.
  • Faster thread creation.
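The thread creation cost that NPTL improves can be observed with a simple sketch like the following; the thread count is arbitrary, and this is not the instant messaging benchmark described later in this paper.

```java
// Sketch: time raw thread creation and start-up, one of the operations
// NPTL speeds up relative to the older LinuxThreads library.
public class ThreadCreationBench {
    public static void main(String[] args) throws InterruptedException {
        final int count = 1000;
        Thread[] threads = new Thread[count];
        long start = System.currentTimeMillis();
        for (int i = 0; i < count; i++) {
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    // trivial body: we only measure creation and scheduling
                }
            });
            threads[i].start();
        }
        for (int i = 0; i < count; i++) {
            threads[i].join();
        }
        System.out.println("created and joined " + count + " threads in "
                + (System.currentTimeMillis() - start) + " ms");
    }
}
```

Running such a sketch on a pre-NPTL and an NPTL-enabled kernel makes the per-thread process overhead discussed above directly visible.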

The graph below highlights the performance gain of NPTL support in J2SE platform version 1.4.2 as described above, as measured by a heavily threaded instant messaging application running on Linux with NPTL support.

Note: the faster the time to completion (smaller bar), the better the result.

J2SE 1.4.2 Performance with Linux NPTL
Illustration 5: J2SE 1.4.2 Performance with Linux NPTL

Client-side Performance Improvements

Start-up Performance

Work was done in J2SE version 1.4.2 to decrease the startup time of applications. A few possible approaches were investigated, and the general approach decided upon was measurement and optimization of the J2SE core libraries. New tools were built for acquiring and analyzing fine-grained performance measurements, and the expensive areas of code were optimized.

Several performance optimizations have been made, including:

  • Unicode table setup in the static initializer for java.lang.Character has been optimized for the Latin-1 character sets; more work is planned in future releases to bring the same benefits to all locales.
  • Parsing of jar file manifests has been eliminated in many cases except where the end user application requests it.
  • Most properties files used internally by the J2SE platform have been translated into .class files and are loaded automatically by the ResourceBundle mechanism.
  • Initialization of subsystems such as preferences and logging has been made more lazy.
  • File name canonicalization has been optimized with internal caching which improves performance for redundant or semi-redundant queries. There should be effectively no user-visible semantic changes due to this optimization; see the release notes for more details.
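As an illustration of the lazy-initialization idiom referred to above, an expensive subsystem can be deferred until first use. This is a generic sketch of the pattern, not the platform's actual preferences or logging code.

```java
// Generic lazy-initialization idiom: the expensive setup runs only when
// the subsystem is first requested, keeping it off the startup path.
public class LazySubsystem {
    private static LazySubsystem instance;

    private LazySubsystem() {
        // expensive setup (reading configuration, opening files, ...)
        // would happen here, on first use rather than at class load
    }

    public static synchronized LazySubsystem getInstance() {
        if (instance == null) {
            instance = new LazySubsystem();
        }
        return instance;
    }
}
```

Applications that never touch the subsystem never pay its initialization cost, which is exactly the startup benefit described above.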

Performance measurements indicate that startup time for small command-line applications has been reduced by roughly thirty percent, and for small Swing applications by roughly fifteen to twenty percent. Some of the optimizations appear to have carried over to larger applications. Additional startup time work is planned for future releases. The graph below shows a startup benchmark that measures the aggregate time to load three well-known GUI applications. Startup of J2SE 1.4.2 has improved by over 30% when compared with J2SE 1.4.1.

Startup Performance
Illustration 6: J2SE 1.4.2 Startup Time Improvements

Appendix

Systems Under Test

x86 Test System

System: Dell Power Edge 6650
CPUs: 4 x 1.6 GHz Intel P4 Xeon
Memory: 4 GB RAM
Operating System(s): Microsoft Windows 2000 Server
Red Hat Linux 7.3
Red Hat Linux 9
Solaris 9 Operating System (x86 Platform Edition)

Test System for the SPARC® Architecture

System: Sun Fire V880
CPUs: 4 x 900 MHz Ultra SPARC IIIcu
Memory: 4 GB RAM
Operating System(s): Solaris 9

Testing Methodology

Proper statistical analysis to identify both performance regressions and gains is critical to software development and is a core component of JVM testing within Java Software. This section gives a brief look at the analysis used during the testing highlighted in this paper, along with a few suggestions to point readers in the right direction for further research.

Sample Size

In order to calculate meaningful confidence intervals, all of the performance testing reported in this paper requires no fewer than 10 test samples; in other words, each test is run at least ten times. Ensuring a proper sample size is key to further analysis; doing so makes it possible to identify small regressions and gains, rather than simply dismissing them as noise.

Calculate the Confidence Interval

Calculating a confidence interval adds reliability to your results and makes it possible to identify true regressions or gains when sample differences are small or are not consistent (high standard deviation). For the tests highlighted in this paper, the following computations were performed.

  • Mean, Geometric Mean, Variance, and Standard Deviation for all data points
  • Student's T-Test p-value (equal and unequal)
  • F-Test p-value when comparing two sets of data
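The mean, standard deviation, and confidence-interval arithmetic can be sketched as follows. The t value shown is an illustrative choice for ten samples at 95% confidence, not necessarily the threshold used in the actual analysis.

```java
// Sketch: sample mean, sample standard deviation, and a confidence
// interval of the form mean +/- t * (s / sqrt(n)).
public class ConfidenceInterval {
    static double mean(double[] x) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) sum += x[i];
        return sum / x.length;
    }

    static double stdDev(double[] x) {
        double m = mean(x), ss = 0.0;
        for (int i = 0; i < x.length; i++) ss += (x[i] - m) * (x[i] - m);
        return Math.sqrt(ss / (x.length - 1)); // sample standard deviation
    }

    // Half-width of the confidence interval for a given t value.
    static double halfWidth(double[] x, double t) {
        return t * stdDev(x) / Math.sqrt(x.length);
    }

    public static void main(String[] args) {
        // Ten illustrative benchmark scores (one per run, per the
        // sample-size rule above).
        double[] scores = {101, 99, 100, 102, 98, 100, 101, 99, 100, 100};
        double t = 2.262; // t value for 9 degrees of freedom, 95% confidence
        System.out.println(mean(scores) + " +/- " + halfWidth(scores, t));
    }
}
```

If two builds' intervals do not overlap, the difference between them is unlikely to be noise.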

Benchmark Disclosure

SPECjbb2000

SPECjbb2000 is a benchmark from the Standard Performance Evaluation Corporation (SPEC). The performance referenced is based on Sun internal software testing conforming to the testing methodologies listed above. All SPECjbb2000 comparison tests were run with the following arguments:

java -server -Xms1600m -Xmx1600m -XX:+AggressiveHeap

For the latest SPECjbb2000 results visit http://www.spec.org/osg/jbb2000

SciMark 2.0

SciMark 2.0 is a Java benchmark for scientific and numerical computing. It measures several computational kernels and reports a composite score in approximate Mflops (Millions of floating point operations per second). All SciMark 2.0 comparison tests were run with the following arguments:

java -server -Xms1600m -Xmx1600m -XX:+AggressiveHeap

For more information on SciMark 2.0 visit http://math.nist.gov/scimark2/

Instant Messaging Benchmark

This benchmark simulates a server that routes brief text messages among a number of users. In this case there were 400 users, each serviced by one worker thread. All comparison tests were run with the following arguments:

java -server -Xms1600m -Xmx1600m -XX:+AggressiveHeap

Startup Time Benchmark

The Startup Time benchmark measures the time it takes to load up an application. Load up is defined as the time from when the command is executed until the application reaches a steady state. In this case, we aggregate the startup times of JMol, TeaTimeJ, and Mailpuccino, applications that are heavily dependent on Swing and AWT components. The measurements were taken with the client compiler and all default options.

Sun Microsystems, Sun, the Sun logo, Solaris, Java, J2SE, JVM, and HotSpot are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.