|
Developer: J2EE
An Introduction to Java Map Collection Classes
by Jack Shirazi
Learn the basics of one the most commonly used collection types, Maps, and how to optimize Maps for your application-specific data.
Published September 2004
The collection classes in java.util include some of the most commonly used classes in Java. The most commonly used collection types are List and Map. Concrete implementations of List include ArrayList and Vector, which are variable size lists ideal for building, storing and manipulating a list of elements of any type of object. Lists are ideal when you want to access elements by numerical index.
Maps provide a more general way of storing elements. The Map collection type allows you to store pairs of elements, termed "keys" and "values", where each key maps to one value. Conceptually, you could consider Lists as Maps which have numeric keys. However in practice there no direct connection between Lists and Maps except that they are both defined in java.util. In this article, we will focus on the Maps that are available with the core Java distribution, and also consider how to adapt or implement specialized Maps that are more optimal for your application specific data.
Understanding Map Interface and Methods
There are many Map classes pre-defined in the Java core classes. Before we consider concrete implementations, we'll look at the Map interface itself so that we know what all the implementations have in common. The Map interface defines four types of methods that every Map has. Let's look through these starting with a couple of uninteresting ones, listed in table 1.
Table 1: Overriden Methods. These two methods are overriden from Object to enable Map objects to be correctly compared by equality.
| equals(Object o) | Compares the specified object with this map for equality |
| hashCode() | Returns the hash code value for this map |
Map Building
Map defines several mutator methods to insert and remove elements, listed in table 2.
Table 2: Map Updating Methods: Methods which change the contents of the Map.
| clear() | Removes all mappings from the map |
| remove(Object key) | Removes the key and associated value from the map |
| put(Object key, Object value) | Associates the specified value with the specified key |
| clear() | Removes all mappings from the map |
| putAll(Map t) | Copies all of the mappings from the specified map to this map |
Nothing surprising here, though it may be worth noting that putAll() cannot normally be more efficient than using lot's of calls to put(), even assuming that you exclude the cost of building the other Map that you pass to putAll(). That's because putAll() will need to iterate the passed Map's elements, as well as running through the algorithm for adding each key value pair to the Map that put() executes. Note however that putAll() could size the Map correctly before adding all the elements, so if you are not sizing the Map yourself (we'll get to that in a bit), putAll() could be more efficient than expected.
Viewing Maps
There is no straightforward way of running through the elements in a Map. If you want to query a Map to see which elements satisfy a particular query, or if you want to iterate through all the elements for whatever reason, you need to first acquire a "view" of the Map. There are three possible views (see table 3)
- All key-value pairs see entrySet()
- All keys see keySet()
- All values see values()
The first two return Set objects, the last returns a Collection. In each case, the story doesn't end there, because you cannot iterate Collection or Set objects directly, instead you have to obtain an Iterator object to iterate. So to iterate through the elements of a Map, you have the rather cumbersome idioms
Iterator keyValuePairs = aMap.entrySet().iterator();
Iterator keys = aMap.keySet().iterator();
Iterator values = aMap.values().iterator();
It is worth realizing that these objects (Sets, Collections and Iterators) are actually views on the underlying Map, not copies containing all the elements. That makes them fairly efficient to use. The toArray() methods of Collection or Set, on the other hand, do create array objects containing all the elements of the Map, and hence are not as efficient unless you really need the elements in an array.
I ran a small test using HashMap and comparing the cost of running through the Map elements using the following two alternatives (Test1 in the accompanying files):
int mapsize = aMap.size();
Iterator keyValuePairs1 = aMap.entrySet().iterator();
for (int i = 0; i < mapsize; i++)
{
Map.Entry entry = (Map.Entry) keyValuePairs1.next();
Object key = entry.getKey();
Object value = entry.getValue();
...
}
Object[] keyValuePairs2 = aMap.entrySet().toArray();
for (int i = 0; i < mapsize; i++)
{
Map.Entry entry = (Map.Entry) keyValuePairs2[i];
Object key = entry.getKey();
|
Profilers in Oracle JDeveloper
Oracle JDeveloper contains an emdedded profiler which measures memory and execution times, and allows you to quickly identify bottlenecks in code. I used Jdeveloper's execution profiler to profile the containsKey() and containsValue() methods for HashMap, and quickly found that the containsValue() method is considerably slower than the containsKey() method in fact several orders of magnitude longer! (See Figure1 and Figure 2, and the Test2 class in the accompanying files).
|
Object value = entry.getValue();
...
}
There are two alternatives to measure in this test: the times for running through the elements, and the additional cost of creating the array with the toArray call. The first measurement, ignoring the time taken to create the array shows that the running through the elements is faster by about 30%-60% when using an array already created from the toArray call, compared to using the Iterator. But, when you include the overhead of creating the array using the toArray method, then using the Iterator is actually faster by 10%-20%. So, if you need to create an array of elements from a collection for a reason other than to iterate those elements, then you should use that array to run through the elements. But, if you do not need that secondary array, then do not create one, instead use an Iterator to iterate the elements.
Table 3: Map Methods Returning Views: These methods return objects which allow you to traverse the elements of the Map, and also delete elements from the Map.
| entrySet() | Returns a Set view of the mappings contained in the map. Each element in the set is a Map.Entry object which can have it's key and value elements accessed with getKey() and getValue() methods (also has a setValue() method) |
| keySet() | Returns a Set view of the keys contained in the map. Removing elements from the Set will also remove the corresponding mapping (key and value) from the Map |
| values() | Returns a Collection view of the values contained in the map. Removing elements from the Collection will also remove the corresponding mapping (key and value) from the Map |
Accessing Elements
Map accessing methods are listed in table 4. Maps are normally optimized for access by key, not value. There is nothing in the Map definition that specifies that this must be true, but generally you can expect it to be true. So, for example, you can expect the containsKey() method to be about as fast as the get() method. On the other hand, the containsValue() method is likely to need to scan the values in the Map, so is probably going to be much slower.
Table 4: Map Accessing And Testing Methods: Methods which retrieve information about the Map contents without altering the Map contents.
| get(Object key) | Returns the value associated with the specified key |
| containsKey(Object key) | Returns true if the map contains a mapping for the specified key |
| containsValue(Object value) | Returns true if this map maps one or more keys to the specified value |
| isEmpty() | Returns true if the map contains no key-value mappings |
| size() | Returns the number of key-value mappings in the map |
Running a test of the time taken to test all the elements in a HashMap for containsKey() and containsValue() shows containsValue() taking much much longer. In fact several orders of magnitude longer! (See Figure 1 and Figure 2, and Test2 in the accompanying files.) So quite probably if containsValue() is ever a performance problem in your application it will show up very quickly and be easy to identify by profiling your application. In this case I'm sure you will be able to think of efficient alternative ways to enable the equivalent functionality that containsValue() provides. But if you can't, one feasible solution is to create a second Map with all the values of the first Map as keys. Then containsValue() on the first Map becomes the much more efficient containsKey() on the second Map.
|
| Figure 1: Using JDeveloper to create and run the Map testing classes
|
|
| Figure 2: Performance profiling in JDeveloper using the execution profiler identifies where the bottleneck is in the application
|
Core Maps
Java is distributed with a variety of Map classes. There are three types of these Map classes:
- General purpose Maps that you might use in your application to manage mappings, mostly implemented under the java.util package
- HashMap
- Hashtable
- Properties
- LinkedHashMap
- IdentityHashMap
- TreeMap
- WeakHashMap
- ConcurrentHashMap
- Special purpose ones that you wouldn't normally create yourself but instead would access via some other class
- java.util.jar.Attributes
- javax.print.attribute.standard.PrinterStateReasons
- java.security.Provider
- java.awt.RenderingHints
- javax.swing.UIDefaults
- And one abstract class for assistance in implementing your own Map classes
Hashing Internals: Hash Mapping Technique
Almost all the general purpose Maps use hash mapping. This is quite a simple mechanism for mapping elements into an array, and it is worth understanding how hash mapping works so that you can get the best out of your Maps.
Hash map structures consist of an internal array where elements are stored. Since the internal storage is an array, clearly there must be a mechanism for determining an index into the array for an arbitrary key. In fact, the mechanism needs to give an integer index value which is smaller than the size of the array. This mechanism is called the hash function. In Java hashed based Maps, the hash function converts any object into an integer that fits into the internal array. You don't have to look hard to find an easily available hash function: every object has a hashCode() method which returns an integer value. To map that value into any array, it is sufficient to convert it to a positive value and take the remainder after dividing by the array size. So here is a simple hash function for Java that works for any object
int hashvalue = Maths.abs(key.hashCode()) % table.length;
(The % binary operator, known as modulo, returns the remainder as an integer after dividing the left hand side by the right hand side.)
In fact, until the 1.4 release, this was exactly the hash function that was used by the various hash based Map classes. Though if you look in the code you will see
int hashvalue = (key.hashCode() & 0x7FFFFFFF) % table.length;
which is essentially the same function, using a faster mechanism to get a positive value. From the 1.4 release, the HashMap class implementation uses a different and more complex hash function, based on Doug Lea's util.concurrent packages (I'll talk about Doug Lea's classes again in a little more detail later).
|
| Figure 3: How hashing works
|
So that gives the basic groundwork for hash mapping, but we haven't quite taken care of everything. Our hash function maps an arbitrary object into an array location, but what happens if two different keys map to the same location? Nothing prevents that from occurring. In the parlance of hash mapping, this is called a collision. The way the maps deal with these collisions is to insert a linked list at the index location, and simply add elements to the linked list. So a basic put() method for a hash based Map could look like this
public Object put(Object key, Object value) {
//Our internal array is an array of Entry objects
//Entry[] table;
//Get the hashcode, and map to an index
int hash = key.hashCode();
int index = (hash & 0x7FFFFFFF) % table.length;
//Loop through the linked list at table[index] to see if
//we already have this key entry and if so overwrite it
for (Entry e = table[index] ; e != null ; e = e.next) {
//Must check that the key is equal, since the hash
//could be the same for different key objects
if ((e.hash == hash) && e.key.equals(key)) {
//This is the same key, overwrite the value
//and keep the old value to return from the method
Object old = e.value;
e.value = value;
return old;
}
}
//Still here, so it's a new key, just add a new Entry
//An Entry object has Object "key" and "value", an int "hash",
//and an Entry "next" to point to the next Entry in the list
//Create a new Entry pointing to the start of the previous
//list, and insert that new Entry into the table
Entry e = new Entry(hash, key, value, table[index]);
table[index] = e;
return null;
}
If you look at the source for various hash based Map's you'll see this is pretty much how they work. There are some further considerations, like handling null keys and values, and re-sizing the internal array. The put() method defined here also includes the algorithm for the corresponding get(), since the insertion includes searching the entries at the mapped index to see if the key is already present. (I.e. the get() method is the same algorithm as the put(), but without the insertion and overwite code.) Using a linked list is not the only way to handle collisions, and some of the hash maps use another "open addressing" scheme which I won't go into here.
Optimizing Hasmaps
If the internal array of our hash map consisted of only one element, then all the entries would have to go into that one array location, as one long linked list. That is pretty inefficient, our updates and accesses would consist of linear searches through a linked list, which is much slower than if each array index in the map contained just one object. The time to access or update the linked list is linearly dependent on the size of the list, whereas using a hash function to access or update a single element in an array is independent of the size of the array in terms of the asymptotic performance (Big-O notation), the former is O(n), whereas the latter is O(1). So it makes sense to use a bigger array rather than let too many entries accumulate in too few array locations.
Re-sizing Map Implementations
In hashing parlance, each position in the internal array is known as a "bucket", and the number of buckets available (i.e. the size of the internal array) is known as the capacity. In order to make the Map object efficiently handle any number of entries, the map implementations can re-size themselves. But re-sizing is expensive. Re-sizing requires all of the elements to be re-inserted to the new array since a different array size means that objects now map to different index values. Keys that previously collided may no longer collide, whereas other keys that didn't previously collide may now collide. This immediately tells you that if you size your Map sufficiently large then you can reduce or even eliminate the re-sizing it does, which is potentially a big speedup.
Running a simple test of populating a HashMap with a large number of entries (over a million) with the 1.4.2 JVM. Table 5 shows the results, with all times normalized to the pre-sized server mode (Test3 in the associated files). Client and server mode JVM ran in almost identical times for the pre-sized JVMs (after discarding the JIT compilation phase). But using the default size for the Maps incurred many re-sizes, and the cost was considerable, 50% longer in the server mode, and nearly three times as long in client mode!
Table 5: Time taken to populate pre-sized HashMap compared to default sized HashMap
| | Client mode | Server mode |
| Pre-sized large | 100% | 100% |
| Default size | 294% | 157% |
Using Load Factors
In order to decide when to re-size, rather than keep a count of the depth of linked list in each bucket, the hash based Maps use an extra parameter and a rough calculation of how densely packed the buckets are. The Maps use a parameter called "load factor" which indicates how much "load" the Map will take, i.e. how full it will get, before it resizes. The relationship between load factor, number of entries (map size), and the capacity is straightforward:
- When (load factor) x (capacity) > (map size), then the map will be re-sized
So, for example, if the default load factor is 0.75, and the default capacity is 11, then 11 x 0.75 = 8.25, which is rounded down to 8 elements. So when we add the eighth entry to this Map, the Map will re-size itself to a larger value. Conversely, to calculate what initial capacity you need in order to avoid re-sizing, divide the number of entries you will make by the load factor and round up, e.g.
- For 100 entries with load factor of 0.75, then capacity should be set to 100/0.75 = 133.33, rounded up to 134 (or 135 to use an odd number)
Buckets sizes that are odd numbers should let the map perform more efficiently by reducing the number of collisions. Ideally prime numbers should be used for capacities, though tests I've made don't show prime numbers producing consistently better times (Test4 in the associated files). Some Maps since the 1.4 release (e.g. HashMap and LinkedHashMap but not Hashtable or IdentityHashMap) use a hash function that needs powers-of-two capacities, but the next highest power-of-two capacity is automatically calculated by those Maps so you don't need to try and calculate it yourself.
The load factor itself is a tuning tradeoff between space and time. Smaller load factors will take more space but will reduce the likelihood of collisions, thus making access and updates faster. Load factors above 0.75 are probably unwise, and above 1.0 are definitely unwise since that guarantees at least one collision. Load factors below 0.50 will give you diminishing returns, but there should be no performance cost to small load factors as long as you size the map effectively, only a memory cost. But smaller load factors will imply more frequent resizing if you don't pre-size the Map, and that will incur a performance penalty, so do bear that in mind if you are tuning the load factor.
Choosing the Right Map
Which map should you use? Should it be synchronized or not synchronized? These are perhaps the two most important questions to answer to establish optimal performance for your application. Add in sizing the Map and choosing a load factor, and that pretty much covers Map tuning options if you are using one of the available general purpose Maps.
So here is a simple recipe to follow to get optimal Map performance
- Declare all your Map variables as Map, not as any concrete implementation, i.e. not as HashMap or Hashtable or any other Map class implementation.
Map criticalMap = new HashMap(); //GOOD
HashMap criticalMap = new HashMap(); //BAD
This allows you to replace any particular Map instance very easily, by changing just one line of code.
- Download Doug Lea's util.concurrent packages (http://gee.cs.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/intro.html). Use the ConcurrentHashMap as your default Map. When you move to the 1.5 release, change to using java.util.concurrent.ConcurrentHashMap as your default Map. Do not wrap ConcurrentHashMap in synchronized wrappers, even when it will be used multi-threaded. Use the default size and load factors.
- Profile your application. If you find a Map creating a bottleneck, analyze why it is causing the bottleneck, and change some or all of the following for that Map: the Map class; the Map size; the load factor; the key object equals() method implementation. Specialist uses of Maps will almost certainly require special purpose custom Map implementations, but otherwise the general purpose Maps should achieve your desired performance targets.
Map Choice
Perhaps you were expecting more complex deliberations and that seems too easy? Okay, let's take it a little slower. Firstly, which Map should you use? The answer is really straightforward: do not choose any specific Map for your design unless there are real design requirements which will specify a particular type of Map. The choice of concrete Map implementation is normally not a design-time decision. You may know you need a Map, but not which. And that is exactly the point of having a Map interface. Leave the choice of Map implementation until you need to make that decision and if you use "Map" declared variables everywhere, then changing the Map implementation for any particular Map in your application requires one line to be changed, which is a pretty low cost tuning option. And the default Map implementation to use? I'll get to that shortly.
Synchronizing Maps
What about whether to synchronize or not to synchronize? (For synchronization you can either use a synchronized Map or use Collections.synchronizedMap() to turn an unsynchronized Map into a synchronized Map. This latter technique uses "synchronized wrappers") This is an incredibly complex decision, dependent on exactly how your Map will be used in terms of multi-threaded concurrent access and update, and you need to also include maintenance considerations too. For example what if you started without concurrent update to a particular Map, but it later changes to concurrent update. In that sort of situation, it is very easy to start with a non-synchronized Map and forget to change it to be synchronized when you later add the concurrent update threads to your application. Which leaves your application open to corruption, one of the worst kinds of bugs to determine and track down. But if you synchronize by default, then you are going to serialize the execution of multi-threaded applications with a consequent terrible performance. Seems like we need a decision tree of some sort to work our way through the options.
Doug Lea is a professor of Computer Science at the State University of New York at Oswego. He has created a set of public domain packages, collectively called util.concurrent, which contain many utility classes that make high performance concurrent programming simpler. Amongst the classes are two Maps, ConcurrentReaderHashMap and ConcurrentHashMap. These Map implementations are thread-safe and do not require synchronization for concurrent access or updates, and are suitable for most situations that require a Map. They are also much much more scalable than synchronized Maps like Hashtable or using synchronized wrappers, and have only a small performance penalty compared to HashMap. The util.concurrent packages formed the basis of JSR166, which has developed concurrency utilities for inclusion with the 1.5 Java release, and the 1.5 release will include those Maps in a new java.util.concurrent package.
| Next Steps
Download Oracle JDeveloper 10g: Change the way you think about Java development
Profilers in Oracle JDeveloper 10g:
The Profiler takes advantage of the features in the Java Virtual Machine enabling you to find programming inefficiencies, performance problems, and memory leaks in your application code. You can use the Profiler with the Debugger and CodeCoach for powerful and efficient troubleshooting in your application code. Learn more about event profiling, execution profiling, and memory profiling. |
All of which means that you don't need a decision tree to decide whether to use synchronized or non-synchronized Maps. Instead just use ConcurrentHashMap. Of course there will be situations where ConcurrentHashMap is not right. But these are few and far between, and should be able to be handled on a case-by-case basis. And that is what profiling is for.
Is That All?
Oracle JDeveloper made it very easy to create the test classes that allow me to compare the performance of the various Maps. More importantly, the well integrated profilers allows performance bottlenecks to be quickly and easily identifed during development - profilers integrated into the IDE tend to be used more often and so help make for a successful project. So now you are equipped with a profiler, and with the basics about general Maps and their performance, you can start running your own tests to see whether you have bottlenecks in your application due to Maps, and where you might want to change your use of Maps.
That pretty much covers the basics of general Maps and their performance. Of course there are more sophisticated and interesting considerations about specific Map implementations, and how to use them for different requirements, but that is a subject for part 2 of this article.
Jack Shiraziis the author of O'Reilly's Java Performance Tuning and director of the popular website JavaPerformanceTuning.com, the world's premier site for Java performance information. Jack consults and writes on Java performance related matters. He also oversees the output at JavaPerformanceTuning.com, publishing around 1 000 performance tips a year as well as many articles about performance tools, discussion groups, and much more. In his earlier life Jack also published work on protein structure prediction and black hole thermodynamics, and contributed to some Perl5 core modules "back when he had more time".
|