Technical Article

Improving Java Application Performance and Scalability by Reducing Garbage Collection Times and Sizing Memory Using JDK 1.4.1

by Nagendra Nagarajayya and J. Steven Mayer, November 2002

New Parallel and Concurrent Collectors for Low Pause and Throughput applications

Abstract

As Java technology becomes more pervasive in the enterprise and telecommunications (telco) industries, understanding the behavior of the garbage collector becomes more important. Telco applications are typically near-real-time applications. Delays measured in milliseconds are not usually a problem, but delays of hundreds of milliseconds, let alone seconds, can spell trouble for applications of this kind, which compromise on throughput to provide near-real-time performance. Enterprise applications are transaction oriented and tolerate delays better. They need to process as many transactions as possible in the shortest time, so the more compute time and resources available, the better: faster and more numerous CPUs and more memory increase performance, and a garbage collector that can make use of the extra resources enhances it further.

Garbage collection limitations that affected the performance of telco and enterprise applications are being eliminated with the introduction of new parallel and concurrent collectors. The collectors can be used to reduce delays to milliseconds for telco applications (SIP servers, call-processing applications), while increasing throughput by providing more compute time to enterprise applications (J2EE, OSS/BSS, MOM).

Analytical and modeling suggestions are presented along with features available in J2SE 1.4.1. A new tool, GC Analyzer, is available to model application behavior and try out the performance tuning suggestions.

Table of Contents

1. Introduction
2. New Garbage Collectors in JDK 1.4.1
3. Heap Layout in JDK 1.4.1
4. Using JDK 1.4.1 - Different options
5. Problems with earlier JDK versions, and how they are now getting solved with the new collectors
6. How GC Pauses Affect Scalability and Execution Efficiency
7. A SIP Server, the Benchmark Application
8. Modeling Application Behavior to Predict GC Pauses
9. Modeling Application Behavior Using "verbose:gc" Logs
10. Using the GC Analyzer Script to Analyze "verbose:gc" Logs
11. Reducing Garbage Collection Times
12. Reducing the Frequency of Young and Old GCs
13. Detection of Memory Leaks
14. Finding the Optimal Call-Setup Rate by Using the Rates of Creation and Destruction
15. Learning the Actual Object Lifetimes
16. Sizing the Young and Old Generations' Heaps
17. Determining the Scalability and Execution Efficiency of a Java Application
18. Other Ways to Improve Performance
19. On the Horizon & Research Ideas
20. Conclusion
21. Acknowledgments
22. References
Appendix A

1. Introduction

This paper is an update to our previous paper, "Improving Java Application Performance and Scalability by Reducing Garbage Collection Times and Sizing Memory Using JDK 1.2.2" [1]. That paper focused on improving the performance of telecommunication (telco) applications by optimizing garbage collection using collectors like the concurrent garbage collector. It also introduced application modeling and its use to make application behavior deterministic.

This paper builds on the earlier one and introduces the new collectors and options in J2SE 1.4.1. The new collectors can be used to tune the performance of both telco and enterprise applications. Garbage collection (GC) delays can be lowered to milliseconds with the help of the parallel and concurrent collectors, for example by sizing the young generation to about 64 MB and the old generation to about 500 MB. GC, a sizable contributor to performance degradation, can thus be tuned, improving application efficiency to 90+%.

A new tool, GC Analyzer, which works with the JDK 1.4.1 verbose GC logs, can be downloaded to model application behavior and try out the performance tuning suggestions (see Appendix A3 for download instructions).

1.1 Introduction to Garbage Collection, And Existing Garbage Collectors

The previous paper also introduced garbage collection and different collectors: the copy collector, the mark-sweep-compact collector, the incremental collector, generational collection, and the advanced concurrent collector.

JDK 1.4.1 has all of the above collectors, and in addition offers new collectors.

2. New Garbage Collectors in JDK 1.4.1

The Java platform now provides new collectors such as the young generation parallel collector and the old generation concurrent collector. In fact, there are two parallel collectors: one works in conjunction with the old generation concurrent collector and is for near-real-time or pause-dependent applications, while the second is for enterprise or throughput-oriented applications (J2EE, billing, payroll, OSS/BSS, MOM). The parallel collectors make use of multiple threads to parallelize and scale young generation collections on SMP architectures, speeding up the scavenging ¹. The difference between the two collectors is that the low-pause parallel collector works with both the new concurrent old generation collector and the traditional mark-compact collector, while the throughput parallel collector, at the moment, works only with the mark-compact collector.

2.1 Low Pause Collectors

2.1.1 Parallel Copying Collector

The Parallel Copying Collector is similar to the Copying Collector [1], but instead of using one thread to collect young generation garbage, the collector allocates as many threads as the number of CPUs to parallelize the collection. The parallel copying collector works with both the concurrent collector and the default mark-compact collector.

Figure 1 - Single Threaded & Parallel Collection

The parallel copying collection is still stop-the-world, but the cost of the collection now depends on the amount of live data in the young generation heap divided by the number of CPUs available. So bigger young generations can be used to eliminate temporary objects while still keeping the pause low. The degree of parallelism, i.e., the number of threads collecting, can be tuned. This parallel collector works very well with young generations from small to big.

The figure (Fig. 1) illustrates the difference between the single threaded and parallel copy collection. The green arrows represent application threads, and the red arrow(s) represent GC threads. The application threads (green arrows) are stopped when a copy collection has to take place. In case of the parallel copy collector, the work is done by n number of threads compared to 1 thread in case of the single threaded copy collector.

2.1.2 Concurrent Collector

The concurrent collector uses a background thread that runs concurrently with the application threads, so that garbage collection and object allocation/modification can happen at the same time. The collector collects the garbage in six phases: two are stop-the-world, and four are concurrent and run along with the application threads. The phases, in order, are the initial-mark phase (stop-the-world), mark phase (concurrent), pre-cleaning phase (concurrent), remark phase (stop-the-world), sweep phase (concurrent), and reset phase (concurrent). The initial-mark phase takes a snapshot of the old generation heap objects, followed by the marking and pre-cleaning of live objects. Once marking is complete, the remark phase takes a second snapshot of the heap objects to capture changes in live objects. This is followed by a sweep phase to collect dead objects; coalescing of dead-object space may also happen here. The reset phase clears the collector data structures for the next collection cycle. The collector does most of its work concurrently, suspending application execution only briefly.

Figure 2 - Concurrent Collection

The figure (Fig. 2) illustrates the main phases of the concurrent collection. The green arrows represent application threads, and the red arrows represent GC thread(s). The small red arrows represent the brief stop-the-world marking phases, when a snapshot of the heap is made. The GC thread (big red arrow) runs concurrently with application threads (green arrows) to mark and sweep the heap.

Note: If "the rate of creation" of objects is too high, and the concurrent collector is not able to keep up with the concurrent collection, it falls back to the traditional mark-sweep collector.

2.2 Throughput Collectors

2.2.1 Parallel Scavenge Collector

Figure 3 - Parallel Collection

The parallel scavenge collector is similar to the parallel copying collector, and collects young generation garbage. The collector is targeted at large young generation heaps and is designed to scale with more CPUs. It works very well with young generation heaps that are gigabytes in size, such as 12 GB to 80 GB or more, and scales very well as CPUs are added (8 CPUs or more). It is designed to maximize throughput in enterprise environments where plenty of memory and processing power is available.

The parallel scavenge collector is again stop-the-world, and is designed to keep the pause down. The degree of parallelism can again be controlled. In addition, the collector has an adaptive tuning policy that can be turned on to optimize the collection. It balances the heap layout by resizing the Eden, survivor spaces, and old generation to minimize the time spent in the collection. Since the heap layout is different for this collector, with large young generations and smaller old generations, a new feature called "promotion undo" prevents old generation out-of-memory exceptions by allowing the parallel collector to finish the young generation collection.

The figure (Fig. 3) illustrates the application threads (green arrows) which are stopped when a copy collection has to take place. The red arrow represents the n number of parallel threads employed in the collection.

2.2.2 Mark-Compact Collector

The parallel scavenge collector interacts with the mark-sweep-compact collector, the default old generation collector. The mark-compact collector is the traditional mark-compact collector, and is very efficient for enterprise environments where pause is not a big criterion. The throughput collectors are designed to maximize the younger generation heap while keeping the older generation heap to the needed minimum; the old generation is intended for very long-term objects only.

Figure 4 - Mark-Compact Collection

The figure (Fig. 4) illustrates a stop-the-world, old generation mark-sweep-compact collection. The application threads (green arrows) are stopped during the collection. The old generation collection is single threaded.

3. Heap Layout in JDK 1.4.1

3.1 Young Generation Heap

In JDK 1.4.1, the heap is divided into three generations: the young generation, the old generation, and the permanent generation. The young generation is further divided into an Eden and two semi-spaces.

Figure 5 - Heap layout

The size of the Eden and semi-spaces is controlled by the SurvivorRatio and can be calculated roughly as:

Eden = NewSize - ((NewSize / (SurvivorRatio + 2)) * 2)
From space = (NewSize - Eden) / 2
To space = (NewSize - Eden) / 2

NewSize is the size of the young generation and can be specified on the command line using -XX:NewSize option. SurvivorRatio is an integer number and can range from 1 to a very high value.

The young generation can be sized using the following options:

-XX:NewSize
-XX:MaxNewSize
-XX:SurvivorRatio

For example, to size a 128 MB young generation with an Eden of 64MB, a Semi-Space size of 32MB, the NewSize, MaxNewSize, and SurvivorRatio values can be specified as follows:

java -Xms512m -Xmx512m -XX:NewSize=128m -XX:MaxNewSize=128m \
     -XX:SurvivorRatio=2 application
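The rough sizing formulas above can be checked with a few lines of Java. This is only a sketch of the arithmetic given in this section: the class and method names are illustrative, and the JVM may round or align the actual spaces differently.

```java
// Sketch of the rough young-generation layout formulas from Section 3.1.
// Names are illustrative; actual JVM sizes may differ due to alignment.
final class YoungGenLayout {
    static final long MB = 1024L * 1024L;

    // Eden = NewSize - ((NewSize / (SurvivorRatio + 2)) * 2)
    static long edenBytes(long newSizeBytes, int survivorRatio) {
        return newSizeBytes - (newSizeBytes / (survivorRatio + 2)) * 2;
    }

    // From space = To space = (NewSize - Eden) / 2
    static long semiSpaceBytes(long newSizeBytes, int survivorRatio) {
        return (newSizeBytes - edenBytes(newSizeBytes, survivorRatio)) / 2;
    }

    public static void main(String[] args) {
        long newSize = 128 * MB;   // -XX:NewSize=128m
        int survivorRatio = 2;     // -XX:SurvivorRatio=2
        System.out.println("Eden (MB): " + edenBytes(newSize, survivorRatio) / MB);
        System.out.println("Each semi-space (MB): " + semiSpaceBytes(newSize, survivorRatio) / MB);
    }
}
```

For the 128 MB example, this reproduces the 64 MB Eden and 32 MB semi-spaces stated above.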

3.2 Old Generation Heap

The old generation, or tenured generation, is used to hold or age objects promoted from the younger generation. The maximum size of the older generation is controlled by the -Xmx parameter: the old generation gets what remains of the heap after the young generation is sized.

For example, to size a 256 MB old generation heap with a young generation of 256 MB, the -Xmx value can be specified as:

java -Xms512m -Xmx512m -XX:NewSize=256m -XX:MaxNewSize=256m \
     -XX:SurvivorRatio=2 application

The young generation takes 256 MB and the old generation 256 MB. -Xms is used to specify the initial size of the heap.

3.3 Permanent Generation Heap

The permanent generation is used to store class objects and related metadata. The default space for this is 4 MB, and it can be sized using the -XX:PermSize and -XX:MaxPermSize options.

Sometimes you will see Full GCs in the log file, and this could be due to the permanent generation being expanded. This could be prevented by sizing the permanent generation with a bigger heap using the -XX:PermSize and -XX:MaxPermSize options.

For example:

java -Xms512m -Xmx512m -XX:NewSize=256m -XX:MaxNewSize=256m \
     -XX:SurvivorRatio=2 -XX:PermSize=64m -XX:MaxPermSize=64m application

Another way of disabling permanent generation collection is to use the -Xnoclassgc option. This should be used with care, since it prevents class objects from being collected. To use it, size the permanent generation bigger so that there is enough space to store class objects and a garbage collection is not needed to free up space.

For example:

java -Xms512m -Xmx512m -XX:NewSize=256m -XX:MaxNewSize=256m \
     -XX:SurvivorRatio=2 -XX:PermSize=128m -XX:MaxPermSize=128m \
     -Xnoclassgc application

4. Using JDK 1.4.1 - Different options

4.1 Default usage

A java application can be started using the following command:

java application

By default, the young generation uses 2 MB for the Eden, and 64KB for the semi-space. The older generation heap starts from about 5MB and grows up to 44MB. The default permanent generation is 4MB.

4.2 Using The -Xms and -Xmx Switches

The old generation, default heap size can be overridden by using the -Xms and -Xmx switches to specify the initial and maximum sizes respectively:

java -Xms<initial size> -Xmx<maximum size> application

For example:

java -Xms128m -Xmx512m application
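One way to confirm that the switches took effect is to ask the running JVM for its heap sizes. The sketch below uses only the standard java.lang.Runtime methods; the class name HeapReport is illustrative, and the reported values may not match -Xms and -Xmx exactly, since the JVM decides how it accounts for its internal spaces.

```java
// Minimal sketch: print the heap sizes the running JVM reports.
// HeapReport is an illustrative name; reported values may differ slightly
// from the -Xms/-Xmx settings.
final class HeapReport {
    static long toMb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("Maximum heap (MB): " + toMb(rt.maxMemory()));
        System.out.println("Current heap (MB): " + toMb(rt.totalMemory()));
        System.out.println("Free in current heap (MB): " + toMb(rt.freeMemory()));
    }
}
```

Running it as "java -Xms128m -Xmx512m HeapReport" should show a current heap near the initial size and a maximum near the -Xmx value.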

4.3 Using the -XX Switches to Enable the New Low Pause or Throughput Collectors

4.3.1 Using the Low Pause Collectors

The young generation parallel copying collector can be enabled by using the -XX:+UseParNewGC option, while the older generation concurrent collector can be enabled by using the -XX:+UseConcMarkSweepGC option.

For example:

java -server -Xms512m -Xmx512m -XX:NewSize=64m -XX:MaxNewSize=64m \
     -XX:SurvivorRatio=2 -XX:+UseConcMarkSweepGC \
     -XX:+UseParNewGC application

Note:

If -XX:+UseParNewGC is not specified, the young generation will make use of the default copying collector [1].
If -XX:+UseParNewGC is specified on a single-processor machine, the default copy collector is used since the number of CPUs is 1. You can force the parallel copy collector to be enabled by increasing the degree of parallelism.

4.3.1.1 Controlling the Degree of Parallelism ²

By default, the parallel copy collector will start as many threads as there are CPUs on the machine, but if the degree of parallelism needs to be controlled, it can be specified with the following option:

-XX:ParallelGCThreads=<desired parallelism>

Default value is equal to number of CPUs.

For example, to use 4 parallel threads to process young generation collection:

java -server -Xms512m -Xmx512m -XX:NewSize=64m -XX:MaxNewSize=64m \
     -XX:SurvivorRatio=2 -XX:+UseParNewGC -XX:ParallelGCThreads=4 \
     -XX:+UseConcMarkSweepGC application

4.3.1.2 Simulating The "promoteall" Modifier In JDK 1.4.1

"promoteall" is a modifier available in JDK 1.2.2 that enables promotion of all live objects at a young generation collection to be promoted to the older generation without any tenuring. There is no "promoteall" modifier in JDK 1.4.1, but similar behavior can be achieved by controlling the tenuring distribution. The number of times an object is aged in the young generation is controlled by the option MaxTenuringThreshold. Setting this option to 0 means objects are not copied, but are promoted directly to the older generation. SurvivorRatio should be increased to 20000 or a high value (see 3.1 for heap calculations) so that Eden occupies most of the Young Generation Heap space.

-XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=20000

For example:

java -server -Xms512m -Xmx512m -XX:NewSize=64m -XX:MaxNewSize=64m \
     -XX:SurvivorRatio=20000 -XX:MaxTenuringThreshold=0 \
     -XX:+UseParNewGC -XX:+UseConcMarkSweepGC application

4.3.1.3 Controlling the Concurrent collection initiation

The concurrent collector background thread starts running when the percentage of allocated space in the old generation goes above the value of -XX:CMSInitiatingOccupancyFraction (68% by default). This value can be changed, and the concurrent collector can be started earlier, by specifying the following option:

-XX:CMSInitiatingOccupancyFraction=<percent>

For example:

java -server -Xms512m -Xmx512m -XX:NewSize=64m -XX:MaxNewSize=64m \
     -XX:SurvivorRatio=20000 -XX:MaxTenuringThreshold=0 \
     -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=35 application
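To get a feel for what an occupancy fraction means in bytes, the following sketch computes the old generation occupancy at which the concurrent collection would start. It assumes, as in Section 3.2, that the old generation is the total heap minus the young generation; the class and method names are illustrative, and the JVM's own accounting may differ.

```java
// Sketch: old-generation occupancy at which the concurrent collector starts.
// Assumes old generation = heap size - young generation size (Section 3.2).
// Names are illustrative, not part of any JVM API.
final class CmsTrigger {
    static final long MB = 1024L * 1024L;

    static long oldGenBytes(long heapBytes, long newSizeBytes) {
        return heapBytes - newSizeBytes;
    }

    static long triggerBytes(long oldGenBytes, int occupancyPercent) {
        return oldGenBytes * occupancyPercent / 100;
    }

    public static void main(String[] args) {
        long oldGen = oldGenBytes(512 * MB, 64 * MB);   // -Xmx512m, -XX:NewSize=64m
        System.out.println("Old generation (MB): " + oldGen / MB);
        System.out.println("Start at 68% (bytes): " + triggerBytes(oldGen, 68));
        System.out.println("Start at 35% (bytes): " + triggerBytes(oldGen, 35));
    }
}
```

With the 512 MB heap and 64 MB young generation used above, lowering the fraction from 68 to 35 roughly halves the occupancy at which the background collection begins, giving the collector more headroom to finish before the old generation fills.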

4.3.2 Using the Throughput Collectors

The young generation parallel scavenge collector can be enabled by using the -XX:+UseParallelGC option. The older generation collector need not be specified since the mark-compact collector is used by default.

For 32 bit usage:

java -server -Xms3072m -Xmx3072m -XX:NewSize=2560m \
     -XX:MaxNewSize=2560m -XX:SurvivorRatio=2 \
     -XX:+UseParallelGC application

For 64 bit usage:

java -server -d64 -Xms8192m -Xmx8192m -XX:NewSize=7168m \
     -XX:MaxNewSize=7168m -XX:SurvivorRatio=2 \
     -XX:+UseParallelGC application

Note:

-XX:TargetSurvivorRatio is a tenuring threshold that is used to copy the tenured objects in the young generation. With large heaps and a SurvivorRatio of 2, survivor semi-space might be wasted, as the TargetSurvivorRatio by default is 50. This could be increased to maybe 75 or 90, maximizing use of the space.

4.3.2.1 Controlling the Degree of Parallelism

Again, by default, the parallel scavenge collector will start as many threads as there are CPUs on the machine, but if the degree of parallelism needs to be controlled, it can be specified with the following switch:

-XX:ParallelGCThreads=<desired parallelism>

Default value is equal to number of CPUs.

For example, to use 4 parallel threads to process young generation collection:

java -server -Xms3072m -Xmx3072m -XX:NewSize=2560m \
     -XX:MaxNewSize=2560m -XX:SurvivorRatio=2 \
     -XX:+UseParallelGC -XX:ParallelGCThreads=4 application

4.3.2.2 Adaptive Sizing for Performance

The parallel scavenge collector performs better when used with the -XX:+UseAdaptiveSizePolicy option. This automatically sizes the young generation and chooses an optimum survivor ratio to maximize performance. The parallel scavenge collector should always be used with -XX:+UseAdaptiveSizePolicy.

For example:

java -server -Xms3072m -XX:+UseParallelGC \
     -XX:+UseAdaptiveSizePolicy application

5. Problems with earlier JDK versions, and how they are now getting solved with the new collectors

The biggest problem with JDK 1.2.2_08 was stop-the-world collection, which introduced serialization. Serialization directly affects scalability and throughput (see Section 6). Even though the GC pause in JDK 1.2.2_08 could be reduced to under 100 ms (by using the concurrent collector and sizing the young generation smaller, about 12-16 MB), any increase in load would increase the frequency and duration of pauses, because the young generation collection was single threaded. This problem is now solved with JDK 1.4.1, as the young generation collection is parallel (multi-threaded) and runs on multiple CPUs at the same time. The collection is still stop-the-world but happens in parallel, resulting in smaller pauses even with bigger young generations. The decreased frequency and duration of collections reduce the serialization problem: application threads run for longer, increasing scalability and throughput. So GC serialization, which was one of the biggest factors affecting scalability, especially with increased loads, will be less of a factor.

6. How GC Pauses Affect Scalability and Execution Efficiency

Linear scalability [28] of an application running on an SMP (symmetric multiprocessing) machine is directly limited by the serial parts of the application. The serial parts of a Java application are:

Stop-the-world GC
Access to critical sections
Concurrency overhead such as scheduling, context switching, ...
System and communication overhead

The percentage of time spent in the serial portions determines the maximum scalability that can be achieved on a multi-processor machine and in turn the execution efficiency of an application. Because the young- and old-generation collectors all have at least some single-threaded stop-the-world behavior, GC will be a limiting factor to scalability even when the rest of the application can run completely in parallel. Using the concurrent collector will help reduce this effect, but will not completely remove it.

The scalability and execution efficiency of an application can be calculated using Amdahl's law.

6.1 Amdahl's Law & Efficiency calculations

If S is the percentage of time spent (by one processor) on serial parts of the program and P is the percentage of time spent (by one processor) on parts of the program that could be done in parallel, then:

Speedup = (S + P) / (S + (P / N))

-> (1)

where N is the number of processors.

This can be reduced to:

Speedup = 1 / (F + ((1 - F) / N))

-> (2)

where F is the fraction of time spent in the serial parts of the application, F = S / (S + P), and N is the number of processors.

Speedup = 1 / F

-> (3)

when N is very large.

So if F = 0 (i.e., no serial part), then speedup = N (the ideal value). If F = 1 (i.e., completely serial), then speedup = 1 no matter how many processors are used.

The effect of the serial fraction F on the Speedup when N = 10:

Figure 6 - Speedup

6.1.1 Scaled Speedup

Assuming that the problem size is fixed, then Amdahl's Law can predict the speedup on a parallel machine:

Speedup = (S + P) / (S + (P / N)) -> (4)

Speedup = 1 / (F + ((1 - F) / N)) -> (5)

Figure 7 - Scaled Speedup

More details on Amdahl's Law and scaled speedup can be obtained from [22].

6.1.2 Efficiency

Execution efficiency is defined as:

E = Speedup / N -> (6)

Execution efficiency translates directly to the CPU percentage used by an application. For a linear speedup this is 1, or 100%. The higher the number, the better, because it translates to higher application efficiency.
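Equations (2) and (6) are easy to experiment with in code. The following is a minimal sketch; the class and method names are illustrative, and the serial fractions in main are sample values chosen only to show the trend for N = 10.

```java
// Sketch of Amdahl's law, equation (2), and execution efficiency, equation (6).
// Names and sample fractions are illustrative.
final class Amdahl {
    // Speedup = 1 / (F + ((1 - F) / N))
    static double speedup(double serialFraction, int processors) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / processors);
    }

    // E = Speedup / N
    static double efficiency(double serialFraction, int processors) {
        return speedup(serialFraction, processors) / processors;
    }

    public static void main(String[] args) {
        int n = 10;
        double[] fractions = {0.0, 0.05, 0.10, 0.25, 0.50, 1.0};
        for (int i = 0; i < fractions.length; i++) {
            double f = fractions[i];
            System.out.println("F=" + f + " speedup=" + speedup(f, n)
                    + " efficiency=" + efficiency(f, n));
        }
    }
}
```

With F = 0 the speedup equals N and the efficiency is 1; with F = 1 the speedup stays at 1 however many processors are added.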

7. A SIP Server, the Benchmark Application

Our previous paper used a SIP server to gather research data. This time, most of the work has been done with a simple Java program that simulates the creation of different types of objects. The idea is that this program can be extended to emulate different kinds of application behavior, from near-real-time applications (telco) to throughput applications (enterprise). From a macroscopic view of the collector, every application has the same behavior: n objects are created per second; of these, a certain percentage are temporary, some intermediate, and the remainder long term. The ratios and lifetimes of these objects are the most important factors. Next come the types of objects, object complexity, and the relationships between objects, and how these affect the collection. Once this can be emulated, most application behavior can be simulated. The prototype is extremely primitive at the moment, and is available on request.

7.1 Using SIP To Test Real Time Requirements

Session Initiation Protocol (SIP) [21] is a protocol defined by the Internet Engineering Task Force (IETF), used to set up a session between two users in real time. It has many applications, but the one focused on here is setting up a call between users. After a call setup is completed, the SIP portion is complete. The users are then free to pass real-time media (such as voice) between the two endpoints. Note that this portion, the call, does not involve SIP or the servers that are routing SIP call setups.

If this protocol is still unfamiliar, it might help to think of it as akin to Hyper Text Transfer Protocol (HTTP) [24]. It is a similar, text-based protocol founded on the request/response paradigm. One of the key differences is that SIP supports unreliable transport protocols, such as UDP. When using UDP, SIP applications are responsible for ensuring that packets reach their destination. This is accomplished by retransmitting requests until a response is received.

One problem that arises from the model of application-level reliability is retransmission storms. If an application does not respond quickly enough (within 500 ms by default in the specification), the request will be retransmitted. This retransmission continues until a response is received. Each retransmit makes more work for the receiving server. Hence, retransmissions can cause more retransmissions, and thrashing can ensue.

A GC cycle that takes longer than 500 ms will cause retransmissions. One that takes several seconds will ensure many retransmissions. These, in conjunction with the new requests that arrive at the server during garbage collection, will make even more work for the server and it will fall further behind.

Even absent the overburdening problems just described, other constraints make garbage collection pauses unacceptable. There are carrier grade requirements that state acceptable latencies from the moment an initiating side begins a call to the moment it receives a response from the terminating side. These are typically sub-second times, so a multiple-second pause in a SIP server for GC is unacceptable.

7.2 Call Setups

"Call setup" [ 22] is a concept used throughout this paper. A call setup is simply the combination of an originating side stating that it wishes to initiate a session (a call in this case) and a terminating side responding that it is interested as well. In SIP, there is also the concept of an acknowledgement being sent back to the terminating side after it accepts.

After a call setup is complete, the server must maintain call setup state for 32 seconds in order to handle properly any application-level retransmissions that might occur. This specification is the reason the value of 32 seconds is used as an active duration1 throughout the paper.

8. Modeling Application Behavior to Predict GC Pauses

Modeling Java applications makes it possible for developers to remove unpredictability attributed to the frequency and duration of pauses for GC.

The frequency and pause duration of GC are directly related to the following factors:

Incoming call-setup rate
Number and total size of objects created per call setup
Average lifetime of these objects
Size of the Java heap

Developers can construct a theoretical model to show application behavior for various call-setup rates, numbers of objects created, object sizes, object lifetimes, and heap sizes. A theoretical model reflects the extremities of the application behavior for the best conditions and the worst. Once developers know these extremities, they can model a real-life scenario that helps predict the maximum call-setup rate that can be achieved with acceptable pauses, and that shows what can happen when call-setup rates exceed the maximum and application performance deteriorates.

8.1 Best-Case Scenario

Assumptions:

Calculations are for a SIP application, but could be applied to any generic client-server application that has a request, a response, and an active time that request state is maintained
For simplicity, only call setups are modeled (call tear-downs are not considered)
The young collector is configured to use the promoteall modifier
The old generation uses a concurrent collector

The variables that change from scenario to scenario are listed in a table preceding the results

Parameter	Assumed value
Time call-setup state is maintained - active duration of call setup	0 ms
Time between call setups	10 ms
Young generation Eden size	5 MB
Young GC Eden threshold	100%
Old heap size	N/A
Old GC threshold	N/A
Total size of objects per call setup	50 KB
Lifetime of objects	0 ms

A young generation GC (GC[0], see 9.2 Snapshot of a GC Log) occurs when the 5-MB Eden is 100% full.

Number of call setups needed to fill Eden:

= (Eden size / total size of each call setup's objects) = 5 MB / 50 KB = 102.4 call setups per Eden fill

->(10)

So a GC occurs every:

period = (call setups * time between call setups) = 102.4 * 10 = 1,024 ms

->(11)

Remember, in this scenario, we are not accumulating any garbage, because we assume that all objects are very short-lived and they are all collected every young GC. Hence, nothing is ever promoted to the old heap. So, assume that each young GC takes 50 ms.

A young GC occurs roughly once per second (every 1,024 ms), or about 3,600 times per hour. So in an hour of execution the application spends roughly this much time in young GC:

= (young GCs per hour * (GC pause in ms / 1000)) = 3600 * (50 / 1000) = 180 seconds, or 3 minutes

->(12)

Because no objects are promoted to the older generation, the concurrent collector is not activated, and its cost on the application performance is assumed to be zero.

Total processing time available / hour:

= (minutes in an hour) - (time lost to young generation collection) = 60 - 3 = 57 minutes.

Total call setups per hour @ 100 calls / second:

= (call setups per second) * (seconds per minute) * (number of minutes in an hour available to the application) = 100 * 60 * 57 = 342,000 call setups / hour on 1 CPU

Serial portion from (12) is 3 minutes lost to GC every hour.

= 3 / 57 = 0.0526

Scalability with a four-CPU machine (from Amdahl's law):

Speedup = 1 / (F + ((1 - F) / N))
Speedup = 1 / (0.0526 + ((1 - 0.0526) / 4))
Speedup = 1 / 0.28945 = 3.4548
Efficiency = Speedup / N
Efficiency = (3.4548 / 4) = 86.37%

So at 86.37% efficiency, we can process:

(342,000 * 0.8637) * 4 = 1,181,542 call setups / hour
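The best-case arithmetic can be reproduced with a short program, which also makes it easy to try other Eden sizes or pause times. This is a sketch under the assumptions of this section: the names are illustrative, about one young GC per second (3,600 per hour) is assumed as in the text, and the serial fraction is taken as GC time over available time (3 / 57). Because the program does not round intermediate values, its results can differ slightly from the rounded figures above.

```java
// Sketch of the best-case model in Section 8.1. Names are illustrative.
// Assumes roughly one young GC per second, as the text does.
final class BestCaseModel {
    static double callSetupsPerEdenFill(double edenKb, double kbPerCallSetup) {
        return edenKb / kbPerCallSetup;
    }

    static double gcPeriodMs(double callSetupsPerFill, double msBetweenCallSetups) {
        return callSetupsPerFill * msBetweenCallSetups;
    }

    static double gcMinutesPerHour(double gcsPerHour, double pauseMs) {
        return gcsPerHour * pauseMs / 1000.0 / 60.0;
    }

    // Serial fraction as used in the text: GC minutes / available minutes.
    static double serialFraction(double gcMinutesPerHour) {
        return gcMinutesPerHour / (60.0 - gcMinutesPerHour);
    }

    static double speedup(double serialFraction, int processors) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / processors);
    }

    public static void main(String[] args) {
        double fills = callSetupsPerEdenFill(5 * 1024, 50);  // equation (10)
        double period = gcPeriodMs(fills, 10);               // equation (11)
        double gcMinutes = gcMinutesPerHour(3600, 50);       // equation (12)
        double f = serialFraction(gcMinutes);
        double s = speedup(f, 4);
        System.out.println("Call setups per Eden fill: " + fills);
        System.out.println("GC period (ms): " + period);
        System.out.println("GC minutes per hour: " + gcMinutes);
        System.out.println("Serial fraction: " + f);
        System.out.println("Speedup on 4 CPUs: " + s + ", efficiency: " + s / 4);
    }
}
```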

8.2 Worst-Case Scenario

Assumptions:

Parameter	Assumed value
Time call-setup state is maintained - active duration	Infinite
Time between call setups	10 ms
Young generation Eden size	5 MB
Young GC Eden threshold	100%
Old heap size	512 MB
Old GC threshold	68%
Total size of objects per call setup	50 KB
Lifetime of objects	Infinite

Because all objects in this scenario are long-lived and the promoteall option has been specified, at every young-generation GC all the objects are promoted. Because the young GC threshold is 100%, 5 MB of objects will be promoted.

->(20)

So to fill up the 5-MB Eden:

From (10), we should have received about 102.4 call setups

A GC occurs every:

period = (call setups * time between call setups) = 102.4 * 10 = 1,024 ms

->(21)

Each collection promotes 5 MB, 5,242,880 Bytes

->(22)

So if the application is run with an old heap of 512 MB:

The old heap should fill up in about: = (heap size) / (size of objects being promoted) = 512 MB / 5 MB = 536,870,912 / 5,242,880 = 102 promotions

->(23)

So if promoting 5 MB of objects to the old generation takes the young GC collector about 200 ms (because young GC is more expensive when promotion takes place, we assume a higher value than before), then we have:

total pause duration = (number of promotions * pause duration) = 102 * 200 = 20,400 ms = 20.4 seconds

->(24)

A promotion takes place every 1,024 ms, so for 102 promotions:

time for promotions = (number of promotions * (period + pause duration)) = 102 * (1,024 + 200) = 124,848 ms = 2.08 minutes

->(25)

With one CPU the heap should fill up in about 2.08 minutes.

Serial percentage due to young-generation GC:

= (total pause duration in seconds * 100) / (60 * (total time for promotions in minutes)) = 20.4 * 100 / (60 * 2.08) = 16.34%

->(26)

Note: The serial percentage due to the old generation collector is assumed to be zero, because the amount of time spent in stop-the-world GC for the concurrent collector is negligible compared to the amount of time spent in stop-the-world young GC.
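The worst-case figures in equations (23) through (26) can be reproduced as follows. This is a sketch with illustrative names; it assumes, as the text does, that every young GC promotes a full 5 MB and pauses the application for 200 ms.

```java
// Sketch of the worst-case model, equations (23) to (26). Names are illustrative.
final class WorstCaseModel {
    static final long MB = 1024L * 1024L;

    // Whole promotions that fit in the old heap, equation (23).
    static long promotionsToFill(long oldHeapBytes, long promotedBytesPerGc) {
        return oldHeapBytes / promotedBytesPerGc;
    }

    // Total stop-the-world time while the heap fills, equation (24).
    static long totalPauseMs(long promotions, long pauseMs) {
        return promotions * pauseMs;
    }

    // Elapsed time to fill the old heap on one CPU, equation (25).
    static long timeToFillMs(long promotions, long gcPeriodMs, long pauseMs) {
        return promotions * (gcPeriodMs + pauseMs);
    }

    // Share of elapsed time spent in young GC pauses, equation (26).
    static double serialPercent(long totalPauseMs, long timeToFillMs) {
        return totalPauseMs * 100.0 / timeToFillMs;
    }

    public static void main(String[] args) {
        long promotions = promotionsToFill(512 * MB, 5 * MB);
        long pause = totalPauseMs(promotions, 200);
        long elapsed = timeToFillMs(promotions, 1024, 200);
        System.out.println("Promotions to fill old heap: " + promotions);
        System.out.println("Total pause (ms): " + pause);
        System.out.println("Time to fill (ms): " + elapsed);
        System.out.println("Serial percentage: " + serialPercent(pause, elapsed));
    }
}
```

Changing the pause or the promoted size per GC shows directly how quickly the serial percentage grows when promotion is expensive.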

With four CPUs:

Speedup = 1 / (0.1634 + (1 - 0.1634) / 4) = 2.684

->(27)

Execution efficiency = (2.684 / 4) * 100 = 67.10%

->(28)

So with 4 CPUs, we should fill up the heap in about:

= (#25) * (#28) = (2.08 * 0.6710) minutes = 1.396 minutes = 83.74 seconds

->(29)
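
The speedup and efficiency figures above follow Amdahl's law, with the stop-the-world young GC time treated as the serial fraction. The following is a minimal sketch for trying other CPU counts or pause times; the class and method names are hypothetical and chosen only for this illustration.

```java
// Illustrative sketch of Amdahl's law applied to GC pause time; names are hypothetical.
final class AmdahlSpeedup {
    // Fraction of elapsed time spent in stop-the-world young GC.
    static double serialFraction(double totalPauseMs, double totalElapsedMs) {
        return totalPauseMs / totalElapsedMs;
    }

    // Amdahl's law: speedup on `cpus` processors for a given serial fraction.
    static double speedup(double serial, int cpus) {
        return 1.0 / (serial + (1.0 - serial) / cpus);
    }

    // Execution efficiency in percent: speedup divided by the number of CPUs.
    static double efficiencyPercent(double serial, int cpus) {
        return speedup(serial, cpus) / cpus * 100.0;
    }

    public static void main(String[] args) {
        // Total pause and elapsed time taken from steps (24) and (25).
        double serial = serialFraction(20400.0, 124848.0);
        for (int cpus = 1; cpus <= 8; cpus *= 2) {
            System.out.printf("%d CPU(s): speedup=%.3f, efficiency=%.2f%%%n",
                    cpus, speedup(serial, cpus), efficiencyPercent(serial, cpus));
        }
    }
}
```

Because the serial fraction sits in the denominator, adding CPUs gives diminishing returns: shortening the young GC pause raises the ceiling more than adding processors does.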

8.3 A Real Scenario with Each Call Setup Active for 32 Seconds

(Calculated for Concurrent GC)

Assumptions:

Parameter	Assumed value
Time call-setup state is maintained - active duration	32,000 ms
Time between call setups	10 ms
Young Eden size	5 MB
Young GC threshold	100%
Old heap size	512 MB
Old GC threshold	68%
Total size of objects per call setup	50 KB
Lifetime of objects	50% = 0 ms 50% = 32,000 ms

Each young-generation GC promotes about 2.5 MB of live objects, because half the objects are short-lived and half are long-lived.

->(30)

To fill up the 5 MB Eden: from (#10), 102.4 call setups

A GC occurs every:

= (call setups received * time between call setups) = 102.4 * 10 = 1,024 ms

->(31)

A promotion promotes:

= 2.5 MB of live data <- assumption (30) = 2,621,440 Bytes

->(32)

A promotion occurs every 1,024 ms, so in 32,000 ms the number of promotions is:

= (active duration of call setup / periodicity) = 32,000 / 1,024 = 31 promotions

->(33)

32,000 ms is the active duration of a call setup, so the first call setup is released at the end of the first 32,000 ms. The number of call setups that can be received in 32,000 ms is:

= (active duration of a call setup / time between call setups) = 32,000 / 10 = 3,200 calls

So one active-duration segment is made up of 3,200 calls.

->(34)

At the end of 64,000 ms, all call setups from the first active duration will be dead. At the end of 96,000 ms, call setups from two active durations will be dead. At the end of 128,000 ms, call setups from three active durations will be dead. At the end of 160,000 ms, call setups from four active durations will be dead.

In each young GC we promote:

= 2,621,440 Bytes <- from (32),

So the total promoted during one active duration of 32 seconds is:

= (#32) * (#33) = 2,621,440 * 31 = 81,264,640 bytes, or 77.5 MB

->(35)

The initial mark phase will occur when the heap is 68% full, that is, when only 32% of the heap is free:

= (32% * 512 MB) = 163.84 MB free = 512 MB - 163.84 MB = 348.16 MB used

->(36)

The time when an initial mark phase might occur is:

= (initial mark size in MB / promotion size in MB) * periodicity of promotion = (348.16 MB / 2.5 MB) * 1,024 ms = 142,606 ms

->(37)

After an initial mark phase, a remark phase takes place; assume that this happens when the heap is 72% full:

= ((100% - 28%) * 512 MB) = 368.64 MB used

->(38)

The time when a remark phase might occur is:

= (remark size in MB / promotion size in MB) * periodicity of promotion = (368.64 MB / 2.5 MB) * 1,024 ms = 150,995 ms

->(39)

Number of active durations present in the old heap at the remark phase:

= ((#38) / (#35)) = 368.64 / 77.5 = 4.76

->(40)

Time to fill up an active-duration segment:

= (#33) * (#31) = 31 * 1,024 = 31,744 ms

->(41)

Time to fill up 4.76 active durations:

= 31,744 * 4.76 = 151,101 ms

The time when a sweep phase can take place is soon after a remark phase:

= (#39) = 150,995 ms

(assume the sweep phase takes almost 0 ms to free the heap)

->(42)

The number of active-duration segments that will be freed at the end of the sweep phase:

= ((#42) / active duration of a call setup) - (adjustment factor for the current active-duration segment) = (150,995 / 32,000) - 1 = 3.72 active-duration segments

->(43)

The sweep phase should free 3.72 active durations:

= (#43) * (#35) = 3.72 * 77.5 = 288.3 MB

->(44)

A young-generation GC should occur every 1,024 ms, and an old-generation initial mark phase should occur every 142,606 ms.
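
The timing of the concurrent collector's phases in this scenario reduces to two small formulas, sketched below in Java. It assumes, as the text does, that the old heap starts empty, that each young GC promotes a fixed 2.5 MB, and that the remark phase happens at 72% occupancy. The class and method names are hypothetical.

```java
// Illustrative sketch of the phase-timing arithmetic for section 8.3; names are hypothetical.
final class CmsPhaseModel {
    // Time in ms until promotions alone fill `fraction` of an initially empty old heap.
    static double timeToOccupancyMs(double fraction, double oldHeapMb,
                                    double promotedMbPerGc, double gcPeriodMs) {
        return fraction * oldHeapMb / promotedMbPerGc * gcPeriodMs;
    }

    // Active-duration segments that are entirely dead at the given time.
    // One segment is subtracted for the segment still in progress, as in the text.
    static double segmentsFreed(double sweepTimeMs, double activeDurationMs) {
        return sweepTimeMs / activeDurationMs - 1.0;
    }

    public static void main(String[] args) {
        double oldHeapMb = 512.0;
        double promotedMb = 2.5;
        double periodMs = 1024.0;

        double initialMark = timeToOccupancyMs(0.68, oldHeapMb, promotedMb, periodMs); // about 142,606 ms
        double remark = timeToOccupancyMs(0.72, oldHeapMb, promotedMb, periodMs);      // about 150,995 ms
        double freed = segmentsFreed(remark, 32000.0);                                 // about 3.72

        System.out.printf("initial mark at %.0f ms, remark at %.0f ms, segments freed %.2f%n",
                initialMark, remark, freed);
    }
}
```

Lowering the occupancy fraction passed to `timeToOccupancyMs` models an earlier start of the concurrent collection, which is what a smaller `-XX:CMSInitiatingOccupancyFraction` value would request.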

9. Modeling Application Behavior Using "verbose:gc" Logs

The model above is theoretical and relies on a lot of assumptions; but developers can use the "verbose:gc" log from an actual application run to construct a real-world model. The model will allow one to visualize the runtime behavior of both the application and the garbage collector. The verbose:gc logs contain valuable information about garbage-collection times, the frequency of collections, application run times, number of objects created and destroyed, the rate of object creation, the total size of objects per call, and so on. This information can be analyzed on a time scale, and the behavior of the application can be depicted in graphs and tables that chart the different relationships among pause duration, frequency of pauses, and object creation rate, and suggest how these can affect application performance and scalability. Analysis of this information can enable developers to tune an application's performance, optimizing GC frequency and collection times by specifying the best heap sizes for a given call-setup rate (see 16. Sizing the Young and Old Generations' Heaps).

9.1 Java "verbose:gc" Logs

Logs containing detailed GC information are generated when a Java application is run with the -verbose:gc flag turned on. With earlier JDK versions, specifying this flag twice on the command line increased the amount of GC information printed, which could then be analyzed. Similar but richer output is available with the JDK 1.4.1 VM; instead of specifying the -verbose:gc option twice, use the options below to get more detailed output that can be analyzed:

-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC

Note:

Using -XX:+PrintHeapAtGC generates a lot of information. If there are a lot of GCs and log size is a concern, this option can be omitted. The collected information is not self-explanatory but can still be processed by the analyzer (see pattern 2 and pattern 7 under Appendix A4.1).
-Xloggc is a new option available with J2SE 1.4. It writes verbose:gc data to a log file, but it does not log the data output by the switches above, which is why it is mentioned here.

For example:

java -server -Xms512m -Xmx512m -XX:NewSize=64m -XX:MaxNewSize=64m \
     -XX:SurvivorRatio=2 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -XX:+PrintHeapAtGC application
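
Once a log exists, even a small program can total the pauses and the memory reclaimed. The sketch below is not the GC Analyzer; it is a minimal illustration that assumes lines in the basic -verbose:gc form `[GC 325407K->83000K(776768K), 0.2300771 secs]`. Output produced with the detailed switches nests per-generation entries inside each line and varies by collector, so it would need a different pattern. The class name and the sample lines are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: summarizes basic -verbose:gc lines; names and sample data are hypothetical.
final class SimpleGcLogSummary {
    // Groups: 1 = KB used before GC, 2 = KB used after GC, 3 = heap capacity in KB, 4 = pause in seconds.
    private static final Pattern ENTRY = Pattern.compile(
            "\\[(?:Full )?GC (\\d+)K->(\\d+)K\\((\\d+)K\\), ([0-9.]+) secs\\]");

    // Sums the reported pause times, in seconds, over all matching lines.
    static double totalPauseSeconds(String[] lines) {
        double total = 0.0;
        for (String line : lines) {
            Matcher m = ENTRY.matcher(line);
            if (m.find()) {
                total += Double.parseDouble(m.group(4));
            }
        }
        return total;
    }

    // Sums (used before - used after), in KB, over all matching lines.
    static long totalFreedKb(String[] lines) {
        long total = 0L;
        for (String line : lines) {
            Matcher m = ENTRY.matcher(line);
            if (m.find()) {
                total += Long.parseLong(m.group(1)) - Long.parseLong(m.group(2));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String[] sample = {
            "[GC 325407K->83000K(776768K), 0.2300771 secs]",
            "[GC 325816K->83372K(776768K), 0.2454258 secs]",
            "[Full GC 267628K->83769K(776768K), 1.8479984 secs]"
        };
        System.out.println("total pause (s): " + totalPauseSeconds(sample));
        System.out.println("total freed (KB): " + totalFreedKb(sample));
    }
}
```

Lines that do not match the pattern are skipped rather than reported, which keeps the sketch short but would hide format changes in a real log.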

9.2 Snapshot of a GC Log

The snapshot below is of a verbose:gc log entry generated by the JDK 1.4.1 VM with the above options. Highlighted phrases with superscript numbers are footnoted below.

	`= (#51) - (#52) = 1,661,646 - 1,307,750`
	`= 353,896 KB`	-> (53)
	`= (#53) / ((#42) / (#41) = (353,896 * 1,024) / (687,152,824 / 5.22) = 3.31 active durations`

Parameter Name	Meaning
-d	Print out a detailed analysis and a summary
gclogfile	verbosegc log file to process
CPS	Call setups / second (default = 50)
active_call_duration	Duration call setup is active (default = 32,000 ms)
CPUs	Number of CPUs in the system (default = 1)
applcation_run_Time_in_ms	Number of ms application was run (default is calculated from GC time stamps)

Option	Default value	Max Value	Description
`-XX:NewSize`	2 MB	< 100 GB	Controls the initial size of the young generation
`-XX:MaxNewSize`		< 100 GB	Controls the maximum size of the young generation
`-XX:SurvivorRatio`	8 on Windows, 64 on Solaris	< 100000	Controls the relative sizes of Eden and the survivor spaces
`-Xms`	2 MB	< 100 GB	Initial heap size
`-Xmx`	64 MB	< 100 GB	Maximum heap size
`-Xmn`			Size of young generation
`-XX:MaxTenuringThreshold`	31		Maximum number of times an object is aged. Set this to 0 to enable promoteall.
`-XX:PermSize`	4 MB	< 10 GB	Initial size of permanent generation
`-XX:MaxPermSize`	64 MB	< 10 GB	Maximum size of permanent generation
`-XX:NewRatio`	2 for server, 8 for client on Solaris		Ratio of old generation size to young generation size
`-XX:PretenureSizeThreshold`	0		Objects larger than this size are allocated directly in the old generation
`-XX:+UseISM`			Use intimate shared memory (see section 16.2.6.1)
`-XX:+UseMPSS`			Use multiple page sizes (see section 16.2.6.1)
`-XX:+UseTLAB`	True on Solaris		Use thread-local object allocation
`-XX:+AggressiveHeap`			Use for throughput applications with lots of CPUs and more than 256 MB of memory. Turns on various flags, uses the parallel scavenge collector for the young generation, turns on `AdaptiveSizePolicy`, and increases the sizes of TLABs and other data structures
`-XX:TargetSurvivorRatio`	50	100	Desired percentage of survivor space used after a scavenge
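
How `-XX:NewSize` and `-XX:SurvivorRatio` combine into Eden and survivor-space sizes can be worked out by hand. The sketch below assumes the usual HotSpot layout, in which the young generation consists of Eden plus two survivor spaces and `SurvivorRatio` is the ratio of Eden to one survivor space. The VM rounds and aligns the real sizes, so treat the results as approximations; the class and method names are hypothetical.

```java
// Illustrative sketch of young-generation layout arithmetic; names are hypothetical.
final class YoungGenLayout {
    // Young generation = Eden + 2 survivor spaces, and SurvivorRatio = Eden / one survivor space,
    // so one survivor space is young / (SurvivorRatio + 2).
    static double survivorMb(double youngMb, int survivorRatio) {
        return youngMb / (survivorRatio + 2);
    }

    // Eden is whatever remains after the two survivor spaces.
    static double edenMb(double youngMb, int survivorRatio) {
        return youngMb - 2.0 * survivorMb(youngMb, survivorRatio);
    }

    public static void main(String[] args) {
        // Values from the example command line: -XX:NewSize=64m -XX:SurvivorRatio=2
        double young = 64.0;
        int ratio = 2;
        System.out.printf("survivor=%.1f MB each, Eden=%.1f MB%n",
                survivorMb(young, ratio), edenMb(young, ratio));
    }
}
```

With a 64 MB young generation and a survivor ratio of 2, each survivor space comes out at a quarter of the young generation and Eden at half of it.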

Option	Default value	Max Value	Description
`-XX:+UseParNewGC`			Enables young generation parallel copying collector. Use with concurrent collector or default mark-sweep-compact collector
`-XX:ParallelGCThreads`	As many threads as CPUs		Controls the number of threads used for copying collection

Option	Default value	Description
`-XX:+UseConcMarkSweepGC`		Enables old generation concurrent collection
`-XX:+UseCMSCompactAtFullCollection`	false	Enables compaction during a full GC. Consider it if full collections are occurring too frequently
`-XX:CMSFullGCsBeforeCompaction`	1	Parameter that affects compaction of the old generation. If at least this number of concurrent collections have not succeeded between full collections, compact on full collections. If 0, always compact on full collections when `UseCMSCompactAtFullCollection` is true
`-XX:PrintCMSStatistics`	0	If > 0, print statistics about the concurrent collections. For example, the number of times the concurrent collection yielded to a young generation collection and the number of cards precleaned
`-XX:PrintFLSStatistics`	0	If > 0, print statistics about the concurrent free lists. For example, a fragmentation parameter
`-XX:PrintFLSCensus`	0	If > 0, print the populations of the CMS free lists
`-XX:CMSInitiatingOccupancyFraction`	68%	Change the percentage when a CMS collection is started

Option	Default value	Description
`-XX:+UseParallelGC`		Enables the young generation parallel scavenge collector. Works only with the default mark-sweep-compact collector. Do not use with the concurrent collector.
`-XX:+UseAdaptiveSizePolicy`	false	Automatically sizes the young generation and chooses an optimum survivor ratio to maximize performance.
`-XX:+PrintAdaptiveSizePolicy`	false	Prints information about adaptive size policy
`-XX:ParallelGCThreads`	As many threads as CPUs	Controls the number of threads used for copying collection

Option	Default value	Description
`-XX:+PrintGCDetails`	false	Prints GC details
`-XX:+PrintGCTimeStamps`	false	Adds timestamp info to GC details
`-XX:+PrintHeapAtGC`	false	Prints detailed GC info including heap occupancy before and after GC
`-XX:+PrintTenuringDistribution`	false	Prints object aging or tenuring information
`-XX:+PrintHeapUsageOverTime`	false	Prints heap usage and capacity with timestamps
`-Xloggc:filename`	false	Prints GC info to a log file
`-verbose:gc`	false	Prints some GC info
`-XX:+PrintTLAB`	false	Prints TLAB information