Java™ Platform Performance Engineering
One of the principal design centers for Java Platform, Standard Edition 6 (Java SE 6) was to improve performance and scalability by targeting performance deficiencies highlighted by some of the most popular Java benchmarks currently available and also by working closely with the Java community to determine key areas where performance enhancements would have the most impact.
This guide gives an overview of the new performance and scalability improvements in Java Standard Edition 6 along with various industry standard and internally developed benchmark results to demonstrate the impact of these improvements.
Java SE 6 includes several new features and enhancements to improve performance in many areas of the platform. Improvements include synchronization performance optimizations, compiler performance optimizations, the new Parallel Compaction Collector, better ergonomics for the Concurrent Low Pause Collector, and faster application start-up.
Biased Locking is a class of optimizations that improves uncontended synchronization performance by eliminating atomic operations associated with the Java language’s synchronization primitives. These optimizations rely on the property that not only are most monitors uncontended, they are locked by at most one thread during their lifetime.
An object is "biased" toward the thread which first acquires its monitor via a monitorenter bytecode or synchronized method invocation; subsequent monitor-related operations can be performed by that thread without using atomic operations resulting in much better performance, particularly on multiprocessor machines.
Locking attempts by threads other than the one toward which the object is "biased" trigger a relatively expensive operation in which the bias is revoked. The benefit of eliminating atomic operations must exceed the penalty of revocation for this optimization to be profitable.
Applications with substantial amounts of uncontended synchronization may attain significant speedups while others with certain patterns of locking may see slowdowns.
Biased Locking is enabled by default in Java SE 6 and later. To disable Biased Locking, please add -XX:-UseBiasedLocking to the command line.
For more on Biased Locking, please refer to the ACM OOPSLA 2006 paper by Kenneth Russell and David Detlefs: "Eliminating Synchronization-Related Atomic Operations with Biased Locking and Bulk Rebiasing".
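The locking pattern that Biased Locking targets can be illustrated with a minimal sketch: one thread repeatedly entering the monitor of an object that no other thread ever locks. The class name UncontendedCounter and the workload are hypothetical and chosen only for illustration; timings depend heavily on the JVM and hardware, so no figures are claimed here.

```java
// Hypothetical example class; the name and workload are illustrative only.
class UncontendedCounter {
    private long value;

    // Every call enters and exits the monitor of 'this'.
    synchronized void increment() {
        value++;
    }

    synchronized long get() {
        return value;
    }

    // Locks the same object repeatedly from a single thread, so the monitor
    // is never contended: the case Biased Locking is meant to make cheap.
    static long countTo(int n) {
        UncontendedCounter counter = new UncontendedCounter();
        for (int i = 0; i < n; i++) {
            counter.increment();
        }
        return counter.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long result = countTo(10000000);
        long elapsedMs = (System.nanoTime() - start) / 1000000;
        System.out.println("count=" + result + " elapsed=" + elapsedMs + " ms");
    }
}
```

On a JVM that supports the flag, running this once with default settings and once with -XX:-UseBiasedLocking is one rough way to observe the effect. Other JIT optimizations may also apply to such a small program, so the comparison is indicative at best.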
In some locking patterns, a lock is released and then reacquired within a piece of code where no observable operations occur in between. The lock coarsening optimization implemented in HotSpot eliminates the unlock and relock operations in those situations, reducing the amount of synchronization work by enlarging an existing synchronized region. Doing this around a loop could cause a lock to be held for long periods of time, so the technique is only used on non-looping control flow.
This feature is on by default. To disable it, please add the following option to the command line: -XX:-EliminateLocks
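As a sketch of the kind of code that lock coarsening applies to, consider consecutive calls to synchronized methods on the same object in straight-line code. The class CoarseningCandidate is hypothetical; whether the compiler actually coarsens these locks depends on inlining and other JIT decisions that this example does not control.

```java
// Hypothetical example class showing a candidate for lock coarsening.
class CoarseningCandidate {
    // StringBuffer.append is a synchronized method, so each call below
    // acquires and then releases the monitor of 'sb'.
    static void appendAll(StringBuffer sb, String a, String b, String c) {
        sb.append(a); // lock sb ... unlock sb
        sb.append(b); // lock sb ... unlock sb
        sb.append(c); // lock sb ... unlock sb
        // Conceptually, a coarsened version behaves like a single
        // synchronized (sb) { ... } region around the three appends.
    }

    public static void main(String[] args) {
        StringBuffer shared = new StringBuffer();
        appendAll(shared, "lock", "-", "coarsening");
        System.out.println(shared);
    }
}
```

There is no loop around the appends, which matches the non-looping restriction described above.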
Adaptive spinning is an optimization technique where a two-phase spin-then-block strategy is used by threads attempting a contended synchronized enter operation. This technique enables threads to avoid undesirable effects that impact performance such as context switching and repopulation of Translation Lookaside Buffers (TLBs). It is "adaptive" because the duration of the spin is determined by policy decisions based on factors such as the rate of success and/or failure of recent spin attempts on the same monitor and the state of the current lock owner.
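The two-phase idea can be sketched at the application level as follows. This is not how HotSpot implements monitors: the class SpinThenBlockLock is hypothetical, and it uses a fixed spin budget (SPIN_LIMIT, an assumed value) rather than the adaptive policy described above. It only shows the shape of "spin briefly, then block".

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative user-level lock; not HotSpot's monitor implementation.
class SpinThenBlockLock {
    // Assumed fixed spin budget. HotSpot instead adapts the spin duration
    // to recent success or failure on the same monitor.
    private static final int SPIN_LIMIT = 1000;

    private final AtomicBoolean held = new AtomicBoolean(false);

    void lock() throws InterruptedException {
        // Phase 1: spin briefly, hoping the owner releases the lock soon.
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (held.compareAndSet(false, true)) {
                return;
            }
        }
        // Phase 2: stop burning CPU and block until unlock() notifies us.
        synchronized (this) {
            while (!held.compareAndSet(false, true)) {
                wait();
            }
        }
    }

    void unlock() {
        held.set(false);
        synchronized (this) {
            notify(); // wake one blocked waiter, if any
        }
    }

    // Runs 'threads' workers that each increment a shared counter
    // 'perThread' times under the lock, and returns the final count.
    static long runDemo(int threads, final int perThread) {
        final SpinThenBlockLock lock = new SpinThenBlockLock();
        final long[] counter = new long[1];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    try {
                        for (int i = 0; i < perThread; i++) {
                            lock.lock();
                            try {
                                counter[0]++;
                            } finally {
                                lock.unlock();
                            }
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println("final count: " + runDemo(4, 100000));
    }
}
```

When the lock is released quickly, a waiter usually succeeds during the spin phase and never pays for blocking and waking up; when the owner holds the lock for a long time, the waiter falls through to the blocking phase instead of wasting cycles.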
Java SE 6 supports large page heaps on x86 and amd64 platforms. Because a single Translation Lookaside Buffer (TLB) entry can represent a larger memory range, large page heaps help the operating system avoid costly TLB misses, so that memory-intensive applications perform better.
Please note that large page memory can sometimes negatively impact system performance. For example, when a large amount of memory is pinned by an application, it may create a shortage of regular memory, cause excessive paging in other applications, and slow down the entire system. Also note that on a system that has been up for a long time, excessive fragmentation can make it impossible to reserve enough large page memory. When this happens, the OS may revert to using regular pages. This effect can be minimized by setting -Xms == -Xmx, -XX:PermSize == -XX:MaxPermSize and -XX:InitialCodeCacheSize == -XX:ReservedCodeCacheSize.
Another possible drawback of large pages is that the default sizes of the perm gen and code cache might be larger as a result of using a large page; this is particularly noticeable with page sizes that are larger than the default sizes for these memory areas.
Support for large pages is enabled by default on Solaris. It's off by default on Windows and Linux. Please add to the command line -XX:+UseLargePages to enable this feature. Please note that Operating System configuration changes may be required to enable large pages. For more information, please refer to the documentation on Java Support for Large Memory Pages on Sun Developer Network.
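One way to confirm which of these options a running JVM was actually started with is to read its input arguments through the standard java.lang.management API. The following is a minimal sketch; the class LargePageFlagCheck and its helper methods are hypothetical names, and the check only inspects the command line. It does not prove that the operating system granted large pages.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper that inspects the JVM's own command-line options.
class LargePageFlagCheck {
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    // Returns the text following the given prefix (for example "-Xms"),
    // or null when no argument starts with that prefix.
    static String valueOf(List<String> jvmArgs, String prefix) {
        String value = null;
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                value = arg.substring(prefix.length()); // last occurrence wins
            }
        }
        return value;
    }

    // True when -Xms and -Xmx are both present and textually equal.
    // This is a plain string comparison: "1g" and "1024m" do not match.
    static boolean heapIsFixedSize(List<String> jvmArgs) {
        String min = valueOf(jvmArgs, "-Xms");
        String max = valueOf(jvmArgs, "-Xmx");
        return min != null && min.equalsIgnoreCase(max);
    }

    public static void main(String[] args) {
        List<String> jvmArgs =
                ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("large pages requested: "
                + hasFlag(jvmArgs, "-XX:+UseLargePages"));
        System.out.println("fixed-size heap: " + heapIsFixedSize(jvmArgs));
    }
}
```

For example, starting the program with java -XX:+UseLargePages -Xms2g -Xmx2g LargePageFlagCheck should report both conditions as true, assuming the JVM accepts those options on the platform in question.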
The System.arraycopy() method was further enhanced in Java SE 6. Hand-coded assembly stubs are now used for each type size when no overlap occurs.
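For reference, a non-overlapping copy of the kind described above looks like the following sketch. The class ArrayCopyExample is a hypothetical name; the source and destination are distinct arrays, so the copied ranges cannot overlap.

```java
import java.util.Arrays;

// Hypothetical example class demonstrating a non-overlapping array copy.
class ArrayCopyExample {
    // Copies the first 'length' elements of 'src' into a newly allocated
    // array. Source and destination are different arrays, so no overlap.
    static int[] copyPrefix(int[] src, int length) {
        int[] dst = new int[length];
        System.arraycopy(src, 0, dst, 0, length);
        return dst;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5};
        System.out.println(Arrays.toString(copyPrefix(data, 3)));
    }
}
```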
Prior to Java SE 6, the HotSpot Client compiler did not compile Java methods in the background by default. As a consequence, Hyperthreaded or Multi-processing systems couldn't take advantage of spare CPU cycles to optimize Java code execution speed. Background compilation is now enabled in the Java SE 6 HotSpot client compiler.
The HotSpot client compiler features a new linear scan register allocation algorithm that relies on static single assignment (SSA) form. This has the added advantage of providing a simplified data flow analysis and shorter live intervals which yields a better tradeoff between compilation time and program runtime. This new algorithm has provided performance improvements of about 10% on many internal and industry-standard benchmarks.
For more information on this new feature, please refer to the following paper: Linear Scan Register Allocation for the Java HotSpot™ Client Compiler
Parallel compaction is a feature that enables the parallel collector to perform major collections in parallel resulting in lower garbage collection overhead and better application performance particularly for applications with large heaps. It is best suited to platforms with two or more processors or hardware threads.
Prior to Java SE 6, while the young generation was collected in parallel, major collections were performed using a single thread. For applications with frequent major collections, this adversely affected scalability.
Parallel compaction is used by default in JDK 6, but can be enabled by adding the option -XX:+UseParallelOldGC to the command line in JDK 5 update 6 and later.
Please note that parallel compaction is not available in combination with the concurrent mark sweep collector; it can only be used with the parallel young generation collector (-XX:+UseParallelGC). The documents referenced below provide more information on the available collectors and recommendations for their use.
For more on the Parallel Compaction Collector, please refer to the Java SE 6 release notes. For more information on garbage collection in general, the HotSpot memory management whitepaper describes the various collectors available in HotSpot and includes recommendations on when to use parallel compaction as well as a high-level description of the algorithm.
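To see which collectors a given JVM invocation is actually using, the standard java.lang.management API can list them at run time. The sketch below uses the hypothetical class name GcInventory; the collector names it prints are whatever the JVM reports and vary by version and by the options passed, so none are assumed here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that lists the garbage collectors in use.
class GcInventory {
    // Names of the collectors the running JVM reports through JMX.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<String>();
        for (GarbageCollectorMXBean gc
                : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc
                : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms");
        }
    }
}
```

Running it with and without an option such as -XX:+UseParallelOldGC, on a JVM that accepts that option, shows how the reported collectors and their counters change.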
The Concurrent Mark Sweep Collector has been enhanced to provide concurrent collection for the System.gc() and Runtime.getRuntime().gc() methods. Prior to Java SE 6, these methods stopped all application threads in order to collect the entire heap, which sometimes resulted in lengthy pause times in applications with large heaps. In line with the goals of the Concurrent Mark Sweep Collector, this new feature enables the collector to keep pauses as short as possible during full heap collection.
To enable this feature, add the option -XX:+ExplicitGCInvokesConcurrent to the Java command line.
The concurrent marking task in the CMS collector is now performed in parallel on platforms with multiple processors. This significantly reduces the duration of the concurrent marking cycle and enables the collector to better support applications with larger numbers of threads and high object allocation rates, particularly on large multiprocessor machines.
For more on these new features, please refer to the Java SE 6 release notes.
In Java SE 5, platform-dependent default selections for the garbage collector, heap size, and runtime compiler were introduced to better match the needs of different types of applications while requiring less command-line tuning. New tuning flags were also introduced to allow users to specify a desired behavior which in turn enabled the garbage collector to dynamically tune the size of the heap to meet the specified behavior. In Java SE 6, the default selections have been further enhanced to improve application runtime performance and garbage collector efficiency.
The chart below compares out-of-the-box SPECjbb2005™ performance between Java SE 5 and Java SE 6 Update 2. This test was conducted on a Sun Fire V890 with 24 x 1.5 GHz UltraSPARC CPUs and 64 GB RAM running Solaris 10:
In each case the benchmarks were run without any performance flags. Please see the SPECjbb 2005 Benchmark Disclosure.
We compared VolanoMark™ 2.5 performance between Java SE 5 and Java SE 6. VolanoMark is a pure Java benchmark that measures both (a) raw server performance and (b) server network scalability performance. In this benchmark, the client side simulates up to 4,000 concurrent socket connections. Only those VMs that successfully scale up to 4,000 connections pass the test. In both the raw performance and network scalability tests, the higher the score, the better the result.
In each case we ran the benchmark in loopback mode without any performance flags. The result shown is based upon relative throughput (messages per second with 400 loopback connections).
The full Java version for Java SE 5 is:
java version "1.5.0"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-b64)
Java HotSpot(TM) Client VM (build 1.5.0-b64, mixed mode)
The full Java version for Java SE 6 is:
java version "1.6.0_02"
Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
Java HotSpot(TM) Client VM (build 1.6.0_02-b05, mixed mode)
Please see the VolanoMark™ 2.5 Benchmark Disclosure
Some other improvements in Java SE 6 include:
On server-class machines, a specified maximum pause time goal of less than or equal to 1 second will enable the Concurrent Mark Sweep Collector.
The garbage collector is allowed to move the boundary between the tenured generation and the young generation as needed (within prescribed limits) to better achieve performance goals. This mechanism is off by default; to activate it add this to the command line: option
Promotion failure handling is turned on by default for the serial (-XX:+UseSerialGC) and Parallel Young Generation (-XX:+UseParNewGC) collectors. This feature allows the collector to start a minor collection and then back out of it if there is not enough space in the tenured generation to promote all the objects that need to be promoted.
An alternative order for copying objects from the young to the tenured generation in the parallel scavenge collector has been implemented. The intent of this feature is to decrease cache misses for objects accessed in the tenured generation. This feature is on by default. To disable it, please add this to the command line: -XX:-UseDepthFirstScavengeOrder
The default young generation size has been increased to 1MB on x86 platforms.
The Concurrent Mark Sweep Collector's default Young Generation size has been increased.
The minimum young generation size was increased from 4MB to 16MB.
The proportion of the overall heap used for the young generation was increased from 1/15 to 1/7.
The CMS collector is now using the survivor spaces by default, and their default size was increased.
The primary effect of these changes is to improve application performance by reducing garbage collection overhead. However, because the default young generation size is larger, applications may also see larger young generation pause times and a larger memory footprint.
To reduce application startup time and footprint, Java SE 5.0 introduced a feature called "class data sharing" (CDS). On 32-bit platforms, this mechanism works as follows: the Sun-provided installer loads a set of classes from the system jar file (rt.jar, the jar file containing the entire Java class library) into a private internal representation and dumps that representation to a file called a "shared archive". On subsequent JVM invocations, the shared archive is memory-mapped in, saving the cost of loading those classes and allowing much of the Java Virtual Machine's metadata for these classes to be shared among multiple JVM processes.
In Java SE 6.0, the list of classes in the "shared archive" has been updated to better reflect the changes to the system jar file.
The Java Virtual Machine's boot and extension class loaders have been enhanced to improve the cold-start time of Java applications. Prior to Java SE 6, opening the system jar file caused the Java Virtual Machine to read a one-megabyte ZIP index file that translated into a lot of disk seek activity when the file was not in the disk cache. With "class data sharing" enabled, the Java Virtual Machine is now provided with a "meta-index" file (located in jre/lib) that contains high-level information about which packages (or package prefixes) are contained in which jar files.
This helps the JVM avoid opening all of the jar files on the boot and extension class paths when a Java application class is loaded. Check bug 6278968 for more details.
Below we show a chart comparing application start-up time performance between Java SE 5 and Java SE 6 Update 2. This test was conducted on an Intel Core 2 Duo 2.66GHz desktop machine with 1GB of memory:
The application start-up comparison above shows relative performance (smaller is better). In each case the benchmarks were run without any performance flags.
We also compared the memory footprint required by Java SE 5 and Java SE 6 Update 2. This test was conducted on an Intel Core 2 Duo 2.66GHz desktop machine with 1GB of memory:
The footprint comparison above shows relative performance (smaller is better) and in each case the benchmarks were run without any performance flags.
Despite the addition of many new features, the Java Virtual Machine's core memory usage has been pared down to make the actual memory impact on your system even lower than with Java SE 5.
Java SE 6 provides a solution that allows an application to show a splash screen before the virtual machine starts. Now, a Java application launcher is able to decode an image and display it in a simple non-decorated window.
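From inside the application, the splash screen shown by the launcher can be reached through the java.awt.SplashScreen class introduced in Java SE 6. The following is a minimal sketch: the class name SplashCheck is hypothetical, and the image file name used in the usage note is an assumption for illustration.

```java
import java.awt.GraphicsEnvironment;
import java.awt.SplashScreen;

// Hypothetical example class that queries and closes the launcher's splash.
class SplashCheck {
    // Returns true when the launcher displayed a splash screen that is
    // still visible; returns false when there is none or no display exists.
    static boolean splashIsShowing() {
        if (GraphicsEnvironment.isHeadless()) {
            return false; // getSplashScreen() would throw HeadlessException
        }
        try {
            SplashScreen splash = SplashScreen.getSplashScreen();
            return splash != null && splash.isVisible();
        } catch (UnsupportedOperationException e) {
            return false; // splash screens not supported on this platform
        }
    }

    public static void main(String[] args) {
        boolean showing = splashIsShowing();
        System.out.println("splash showing: " + showing);
        if (showing) {
            // Application start-up work would normally happen here.
            SplashScreen.getSplashScreen().close();
        }
    }
}
```

The splash image itself is supplied to the launcher, for example with java -splash:logo.png SplashCheck (logo.png being a placeholder file name), or through the SplashScreen-Image attribute in a jar manifest.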
Swing's true double buffering has now been enabled. Swing used to provide double buffering on a per-application basis; it now provides it on a per-window basis, and native exposed events are copied directly from the double buffer. This significantly improves Swing performance, especially on remote servers.
Please see Scott Violet's blog for full details.
The UxTheme API, which allows standard Look & Feel rendering of Windows controls on Microsoft Windows systems, has been adopted to improve the fidelity of Swing's system Look & Feels.
Please see the Supported System Configurations chart for full details.