Java Technology and XML-Part 3: Performance Improvement Tips

   
   

Neither Java technology nor XML needs an introduction, nor does the synergy between the two: "Portable Code and Portable Data." With the growing interest in web services and e-business platforms, XML is joining Java in the developer's toolbox. As of today, no fewer than six extensions to the Java Platform empower the developer when building XML-based applications:

  • Java API for XML Processing (JAXP)
  • Java Architecture for XML Binding (JAXB)
  • Long Term JavaBeans Persistence
  • Java API for XML Messaging (JAXM)
  • Java API for XML-based RPC (JAX-RPC)
  • Java API for XML Registries (JAXR)

The first of the three articles in this series gave an overview of the different APIs available to the developer by presenting some sample programs. The differences in performance were addressed in the second article. This third article gives tips on improving the performance of XML-based applications from a programmatic and architectural point of view.

XML processing is very CPU, memory, and I/O or network intensive. XML documents are text documents that need to be parsed before any meaningful application processing can be performed. The parsing of an XML document may result either in a stream of events if the SAX API is used, or in an in-memory document model if the DOM API is used. During parsing, a validating parser may additionally perform some validity checking of the document against a predefined schema (a Document Type Definition or an XML Schema).

Processing an XML document means recognizing, extracting and directly processing the element contents and attribute values or mapping them to other business objects that are processed further on. Before an application can apply any business logic, the following steps must take place:

  • Parsing
  • Optionally, validating (which implies first parsing the schema)
  • Recognizing
  • Extracting
  • Optionally, mapping

Parsing XML documents implies a lot of character encoding and decoding and string processing. Then, depending on the chosen API, recognition and extraction of content may correspond to walking through a tree data structure, or catching the events generated by the parser and processing them according to some context. If an application uses XSLT to preprocess an XML document, even more processing is added before the real business logic work can take place.

Using the DOM API implies the creation in memory of a representation of the document as a DOM tree. If the document is large, so is the DOM tree and the memory consumption.

The physical structure and the logical structure of an XML document may be different. An XML document may contain references to external entities, which are substituted into the document content during parsing and prior to validation. Those external entities and the schema itself (such as a DTD) may be located on remote systems, especially if the document itself originates from another system. In order to proceed with parsing and validation, the external entities must first be loaded (downloaded). Documents with a complex physical structure may therefore be very I/O or network intensive.

In this article, we give some tips for improving performance when processing XML documents, organized around reducing CPU, memory, and I/O or network consumption.

Using the Most Appropriate API: Choosing Between SAX and DOM

Both DOM and SAX have features that make them more suitable for certain tasks than others:

Table 1: SAX and DOM features

SAX                                                           | DOM
--------------------------------------------------------------+----------------------------------------------------------------
Event-based model                                             | Tree data structure
Serial access (flow of events)                                | Random access (in-memory data structure)
Low memory usage (only events are generated)                  | High memory usage (the document is loaded into memory)
To process parts of the document (catching relevant events)   | To edit the document (processing the in-memory data structure)
To process the document only once (transient flow of events)  | To process multiple times (document loaded in memory)

Leaving aside the impact of memory consumption on overall system performance, processing with the DOM API is usually slower than processing with the SAX API, mainly because DOM may have to load the whole document into memory before it can be edited or queried, while SAX allows processing to begin as the document is being parsed. DOM is therefore best reserved for cases where the source document is to be edited or processed multiple times.
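To make the contrast in Table 1 concrete, here is a minimal sketch that performs the same task (counting the elements with a given tag name) with each API. The class and method names (SaxVersusDom, countWithSax, countWithDom) are hypothetical and introduced only for this illustration; the JAXP calls are standard.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class SaxVersusDom {

    // SAX: count matching elements as start-tag events stream by;
    // nothing is kept in memory besides the counter.
    static int countWithSax(String xml, final String tag) throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                        String qName, Attributes attributes) {
                    if (qName.equals(tag)) {
                        count[0]++;
                    }
                }
            });
        return count[0];
    }

    // DOM: the whole tree is built first, then queried.
    static int countWithDom(String xml, String tag) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        return document.getElementsByTagName(tag).getLength();
    }
}
```

Both methods return the same count; the difference lies in what has to be held in memory while they run.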

SAX is very convenient when you want to extract information from an XML document (an element content or an attribute value) regardless of its overall context (that is, its position in the XML document tree), or when the document structure maps exactly to the business object structure. Otherwise, keeping track of the element nesting can be tedious, and DOM may be the better choice. Nevertheless, when the source document is to be mapped to a business object which is not primarily represented as a DOM tree, it's recommended to use SAX to map directly to the business object, avoiding an intermediate resource-consuming representation. Of course, if the business object has a direct representation in Java, technologies like XML Data Binding (JAXB) can be used.

Since high level technologies like XSLT rely on lower level technologies like SAX and DOM, the performance when using those technologies may be impacted by their use of SAX or DOM. JAXP provides support for XSLT engine implementations that accept source input and result output in the form of SAX events. When building complex XML processing pipelines, one can use JAXP SAXTransformerFactory to process the result of another style sheet transformation with a style sheet. Working with SAX events until the last stage in the pipeline will optimize performance by avoiding the creation of in-memory data structures like DOM trees.
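As a rough sketch of such a pipeline, the example below chains two style sheets through JAXP TransformerHandler objects so that the output of the first transformation reaches the second as SAX events. The class name SaxPipeline and the two tiny style sheets are assumptions made for this illustration; how much intermediate state an XSLT engine keeps internally depends on the implementation.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical example class, for illustration only.
class SaxPipeline {
    static final String XSL_HEAD = "<xsl:stylesheet version='1.0' "
        + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>";

    // Stage 1: keep only the KING elements.
    static final String KEEP_KINGS = XSL_HEAD
        + "<xsl:template match='/'><KINGS><xsl:copy-of select='//KING'/></KINGS>"
        + "</xsl:template></xsl:stylesheet>";

    // Stage 2: output the number of KING elements as text.
    static final String COUNT_KINGS = XSL_HEAD
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'><xsl:value-of select='count(/KINGS/KING)'/>"
        + "</xsl:template></xsl:stylesheet>";

    static String run(String xml, String xsl1, String xsl2) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        if (!factory.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new IllegalStateException("SAX transformations not supported");
        }
        SAXTransformerFactory saxFactory = (SAXTransformerFactory) factory;
        TransformerHandler first =
            saxFactory.newTransformerHandler(new StreamSource(new StringReader(xsl1)));
        TransformerHandler second =
            saxFactory.newTransformerHandler(new StreamSource(new StringReader(xsl2)));

        // The first stage feeds the second one with SAX events.
        first.setResult(new SAXResult(second));
        StringWriter out = new StringWriter();
        second.setResult(new StreamResult(out));

        // XSLT processing requires a namespace-aware parser.
        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true);
        XMLReader reader = parserFactory.newSAXParser().getXMLReader();
        reader.setContentHandler(first);
        reader.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }
}
```

Only the last stage writes to a stream; every earlier hand-off is a ContentHandler call.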

Considering Alternative APIs

JDOM is not a wrapper around DOM, although it shares the same purpose as DOM with regard to XML. It has been made generic enough to address any document model. JDOM has been optimized for Java and moreover, by the use of the Java Collection API, it has been made straightforward for the Java developer. JDOM documents can be built directly from, and converted to, SAX events and DOM trees, allowing JDOM to be seamlessly integrated in XML processing pipelines and in particular as the source or result of XSLT transformations.

dom4j is another alternative API very similar to JDOM. It additionally comes with tight integration with XPath: the org.dom4j.Node interface, for example, defines methods to select nodes according to an XPath expression. dom4j also implements an event-based processing model which allows it to efficiently process large XML documents. Handlers can be registered to be called back during parsing when XPath expressions are matched, allowing you to immediately process and dispose of parts of the document without waiting for the whole document to be parsed and loaded into memory.

If a document model fits the core data structure of an application, JDOM and dom4j should be seriously considered. Additionally, as opposed to DOM [1], JDOM or dom4j documents are serializable, which gives even more options when architecting complex inter-communicating applications.

By using alternative APIs like JDOM and dom4j, a developer may avoid some performance pitfalls, like the one described in the second article when accessing elements by their tag names, since these APIs are more straightforward thanks to their support of the Java Collection API. Since they are lightweight and optimized for Java, you may often expect a noticeable gain in performance.

Be Aware of the Differences in the Implementations

As we highlighted in the second part of this series, implementations differ. Some emphasize functionality, others performance. The pluggability feature of JAXP allows the developer to swap between implementations and select the most appropriate one to meet the application requirements.

As an example, when using DOM, a common complaint is the lack of support in the API itself for serialization (that is, transformation of a DOM tree to an XML document). Therefore, it's tempting to step out of the standard API and call implementation-dependent serialization features at the cost of losing JAXP's pluggability benefits. Below are code samples for serializing a DOM tree to an XML stream with both Xerces and Crimson.

Code Sample 1: Serialization with Xerces relies on a separate API which is packaged along with the DOM implementation

import java.io.PrintWriter;
import org.w3c.dom.*;
import org.apache.xerces.dom.*;
import org.apache.xml.serialize.*;

Document document = ...
OutputFormat format = new OutputFormat(document);
format.setDoctype(null, "../samples/dtd/Chessboard.dtd");
XMLSerializer serializer = new XMLSerializer(new
    PrintWriter(System.out), format);
serializer.asDOMSerializer();
serializer.serialize(document.getDocumentElement());

                

Code Sample 2: Serialization with Crimson relies on methods specific to the DOM implementation

import org.w3c.dom.Document;
import org.apache.crimson.tree.XmlDocument;

Document document = ...
((XmlDocument) document).setDoctype(null,
    "../samples/dtd/Chessboard.dtd", null);
((XmlDocument) document).write(System.out);

                

JAXP addresses the serialization of a DOM tree through the use of the Identity Transformer as presented in the example below. The identity transformer just copies the source tree to the result tree and applies the specified output method. To output in XML, the output method needs only to be set to xml. It solves the problem in an easy and implementation-independent way.

Code Sample 3: Implementation-independent serialization with the identity transformer (no argument passed to the factory method TransformerFactory.newInstance)

import org.w3c.dom.Document;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;
import javax.xml.transform.dom.*;

Document document = ...
TransformerFactory transformerFactory
    = TransformerFactory.newInstance();
Transformer transformer
    = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM,
    "../samples/dtd/Chessboard.dtd");
transformer.transform(new DOMSource(document),
    new StreamResult(System.out));

                

JAXP, with its support by many parsers and style sheet engines, is a strong asset for your application. It's worth capitalizing on so that later on, the underlying parser implementations can be swapped easily without requiring any application code changes.

Tuning the Underlying Implementations

The JAXP API defines methods to set and get features and properties in order to configure the underlying implementations. Apart from the standard properties and features, such as the http://xml.org/sax/features/validation feature used to turn validation on or off, a particular parser, document builder or transformer implementation may define specific features and properties to switch on or off specific behaviors dedicated to performance improvement. For example, Xerces defines the feature http://apache.org/xml/features/dom/defer-node-expansion, which enables or disables a lazy DOM mode (enabled by default). In this mode, the creation of the DOM tree nodes is deferred: they are created only when they are accessed. As a result, the construction of a DOM tree from an XML document returns faster, and only the accessed nodes get expanded. This feature is particularly useful when only parts of the DOM tree are to be processed.

Setting specific features and properties should be done with care to preserve the interchangeability of the underlying implementation. When a feature or a property is not supported or not recognized by the underlying implementation, a SAXNotRecognizedException, a SAXNotSupportedException or an IllegalArgumentException may be thrown by the SAXParserFactory, the XMLReader or the DocumentBuilderFactory. Avoid grouping unrelated features and properties, especially standard versus specific ones, in a single try/catch block; handle the exceptions independently so that optional specific features or properties don't prevent switching to a different implementation. You may design your application in such a way that features and properties which are specific to the underlying implementations may also be defined externally to the application, in a configuration file for example.
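The advice above can be sketched as follows, assuming a SAX XMLReader obtained through JAXP. The class FeatureTuning and its helper trySetFeature are hypothetical names introduced for this illustration. The standard validation feature is treated as mandatory, while the Xerces-specific feature http://apache.org/xml/features/nonvalidating/load-external-dtd is treated as optional, so a parser that does not know it can still be plugged in.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper, for illustration only.
class FeatureTuning {
    // Standard SAX feature
    static final String VALIDATION =
        "http://xml.org/sax/features/validation";
    // Implementation-specific (Xerces) feature; other parsers may reject it
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Each feature is set in its own try/catch so that one failure
    // does not prevent the others from being applied.
    static boolean trySetFeature(XMLReader reader, String name, boolean value) {
        try {
            reader.setFeature(name, value);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    static XMLReader newReader() throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        if (!trySetFeature(reader, VALIDATION, false)) {
            // A standard feature is required: fail loudly.
            throw new IllegalStateException("Cannot configure " + VALIDATION);
        }
        // Optional tuning: ignore the outcome if the parser does not know it.
        trySetFeature(reader, LOAD_EXTERNAL_DTD, true);
        return reader;
    }
}
```

The feature names and values could just as well be read from a configuration file and passed to trySetFeature one by one.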

Reusing and Pooling Parsers

An XML application may have to process different types of documents (such as documents conforming to different schemas), and these documents can be accessed from different sources. A single parser may be used (per thread of execution) to handle documents of different types successively just by reassigning the handlers according to the source documents to be processed. Since they are complex objects, parsers may be pooled so that they can be reused by other threads of execution, reducing the burden on memory allocation and garbage collection. Additionally if the number of different document types is large and if the handlers are expensive to create, handlers may be pooled as well. The same considerations apply to style sheets and transformers.
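A very small pool along these lines might look like the sketch below. ParserPool, borrow and release are hypothetical names, and the sketch assumes a JAXP version whose SAXParser supports reset() (JAXP 1.3 and later); a parser that cannot be reset is simply dropped instead of being returned to the pool.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

// Hypothetical pool, for illustration only.
class ParserPool {
    private final BlockingQueue<SAXParser> idle;
    private final SAXParserFactory factory = SAXParserFactory.newInstance();

    ParserPool(int capacity) {
        idle = new ArrayBlockingQueue<SAXParser>(capacity);
    }

    // Returns an idle parser if one is available, otherwise creates one.
    SAXParser borrow() throws ParserConfigurationException, SAXException {
        SAXParser parser = idle.poll();
        if (parser != null) {
            return parser;
        }
        synchronized (factory) { // factories are not guaranteed thread-safe
            return factory.newSAXParser();
        }
    }

    // Resets the parser and keeps it for later reuse if there is room.
    void release(SAXParser parser) {
        try {
            parser.reset();
        } catch (UnsupportedOperationException e) {
            return; // cannot be reset: let it be garbage collected
        }
        idle.offer(parser); // silently dropped when the pool is full
    }
}
```

A thread borrows a parser, calls parse with the handler appropriate to the document type, and releases the parser in a finally block.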

Partial Parsing with SAX

If you can use SAX, and the information you want to extract from the document is located at the beginning or at least not located at the very end, you may have better performance if you can interrupt the parsing as soon as all the information has been extracted. You can achieve this by throwing a SAX exception. This may be especially useful when a document is wrapped inside another document (the envelope) and you need to get some information like the recipient to be able to route it. You may only want to extract information from the envelope without parsing the contained document which may be much bigger.

Code Sample 4: When the first occurrence of the targeted element (variable target) has been extracted, an EndOfProcessingException is thrown to stop the parsing

import org.xml.sax.*;

public class EndOfProcessingException extends SAXException {
 public EndOfProcessingException(String msg) {
  super(msg);
 }
}

public class ChessboardHandler extends HandlerBase {

 private String target;
 private boolean acquired = false;

 public void startElement(String name,
     AttributeList attrs) {
  if (name.equals(target)) {
   acquired = true;
   ...
   // Start processing the targeted element
   ...
  }
  ...
  return;
 }

 public void characters(char[] ch, int start,
     int length) {
  if (acquired) {
   ...
   // Process the targeted element content
   ...
  }
  ...
  return;
 }

 // HandlerBase.endElement takes only the element name
 public void endElement(String name)
     throws SAXException {
  if (name.equals(target)) {
   if (acquired) {
    ...
    // Finish processing the targeted element
    ...
    throw new EndOfProcessingException("Done.");
   }
  }
  ...
  return;
 }
 ...
}

                

When run against the XML source documents used for the benchmark to search for the first occurrence of the KING element, such a test program executes in the same time regardless of the document size.
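For readers working with the SAX2 API, the same technique can be sketched with DefaultHandler. The class FirstElementFinder and its find method are hypothetical names used for illustration; the envelope element names in the usage note are also made up. The handler throws a SAXException once the first occurrence of the target element has been read, and the caller treats that exception as the normal end of processing.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical SAX2 variant of the technique, for illustration only.
class FirstElementFinder extends DefaultHandler {
    private final String target;
    private final StringBuilder text = new StringBuilder();
    private boolean inside;
    private boolean done;

    FirstElementFinder(String target) {
        this.target = target;
    }

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes attributes) {
        if (qName.equals(target)) {
            inside = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inside) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (inside && qName.equals(target)) {
            done = true;
            // Abort the parse: everything we need has been extracted.
            throw new SAXException("Target element extracted");
        }
    }

    // Returns the text of the first target element, or null if absent.
    static String find(String xml, String target) throws Exception {
        FirstElementFinder handler = new FirstElementFinder(target);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            if (!handler.done) {
                throw e; // a genuine parsing error
            }
        }
        return handler.done ? handler.text.toString() : null;
    }
}
```

Calling find on an envelope such as <ENVELOPE><TO>alice</TO><BODY>...</BODY></ENVELOPE> with the target TO yields the recipient without the BODY element ever being scanned.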

Reducing Validation Cost

Validation is important and may be required to guarantee the reliability of an XML application. An application may legitimately rely on validation by the parser to avoid double-checking the validity of element nesting and attribute values. A valid XML document may still be invalid in the application domain. The capabilities of Document Type Definitions are limited. For example, in the Chessboard application domain, nothing could prevent two pieces from having the same row and column attribute values. XML validation doesn't relieve the application of checking constraints that the schema cannot express and that may be violated without invalidating a document. Not relying on XML validation may put more burden on the application; on the other hand, validation affects performance.

In the following discussion, we mainly refer to DTD but the principles discussed can be extended to other XML schema languages.

 

Note: Per the XML specifications, a non-validating parser is not required to read external entities (including external DTD); therefore external entities referenced in the document may not be expanded and attributes may not have their default value substituted. In such a case, the information passed to the invoking application may not be equivalent when using validating and non-validating parsers. In the context of this article, we are only considering parsers which, even when non-validating, do -- by default or through proper configuration -- load and parse the DTD and the entities referenced in the document. This allows, for example, the entities to be substituted, the attribute values to be normalized, and their default value properly substituted, so that the application can run unchanged when switching from validation to non-validation of the input document.


Code Sample 5: A "valid invalid" document. XML valid, but invalid in the application domain: two pawns are at the same position; XML validation doesn't relieve the application of enforcing some domain-specific constraints

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE CHESSBOARDS SYSTEM "dtd/Chessboards.dtd">
<CHESSBOARDS>
 <CHESSBOARD>
  <WHITEPIECES>
   <KING><POSITION COLUMN="G" ROW="1" /></KING>
   <BISHOP><POSITION COLUMN="D" ROW="6" /></BISHOP>
   <ROOK><POSITION COLUMN="E" ROW="1" /></ROOK>
   <PAWN><POSITION COLUMN="A" ROW="4" /></PAWN>
   <PAWN><POSITION COLUMN="A" ROW="4" /></PAWN>
   <PAWN><POSITION COLUMN="B" ROW="3" /></PAWN>
   <PAWN><POSITION COLUMN="C" ROW="2" /></PAWN>
   <PAWN><POSITION COLUMN="F" ROW="2" /></PAWN>
   <PAWN><POSITION COLUMN="G" ROW="2" /></PAWN>
   <PAWN><POSITION COLUMN="H" ROW="5" /></PAWN>
  </WHITEPIECES>
  <BLACKPIECES>
   <KING><POSITION COLUMN="B" ROW="6" /></KING>
   <QUEEN><POSITION COLUMN="A" ROW="7" /></QUEEN>
   <PAWN><POSITION COLUMN="A" ROW="5" /></PAWN>
   <PAWN><POSITION COLUMN="D" ROW="4" /></PAWN>
  </BLACKPIECES>
 </CHESSBOARD>
 <CHESSBOARD>
  ...
 </CHESSBOARD>
</CHESSBOARDS>

                

In a system [2] with components exchanging documents, the cost of validation can be efficiently reduced by taking into account the following observations (see Figure 1):

  1. Documents exchanged within the components of the system may not require validation.
  2. Documents coming from outside the system must be validated when entering.
  3. Documents coming from outside the system, once validated, may be exchanged freely between components without any other validation.

For example, a multitier e-business application exchanging documents with trading partners through a front-end will enforce the validity of any incoming document at the web tier (front-end). It will not only check the validity of the document against its schema, but also ensure that the document type is one (or the one) it can accept. The documents may then be rerouted to other servers to be handled by the proper services. Since the documents have already been validated, they do not require further validation.

In other words, when you own both the producer and the consumer of XML documents you may use validation only for debugging and turn it off when in production.
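In JAXP terms, this policy amounts to choosing the validating flag according to where a document comes from. The sketch below is only an illustration: BoundaryValidation, parserFor and validityErrors are hypothetical names, and the notion of a "trusted" source is left to the application.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class BoundaryValidation {

    // Validate only documents that enter the system from outside.
    static SAXParser parserFor(boolean trustedSource) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(!trustedSource);
        return factory.newSAXParser();
    }

    // Parses the document and returns the number of validity errors
    // reported; always 0 when the parser is not validating.
    static int validityErrors(String xml, boolean trustedSource)
            throws Exception {
        final int[] errors = {0};
        parserFor(trustedSource).parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors[0]++; // validity errors are recoverable errors
                }
            });
        return errors[0];
    }
}
```

A front-end would call validityErrors with trustedSource set to false and reject documents with a nonzero count; internal components would pass true.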

Figure 1: Validation is required when the source cannot be trusted. Once in the system, validation may be considered optional.

Even without validation, the DTDs and entities referenced in the documents still need to be loaded and parsed so that entities can be substituted, attribute values normalized, and default attribute values properly substituted.

At the extreme, documents without DTDs neither require nor allow validation. Since they don't refer to any DTD or external entity, none is loaded or parsed and no validation can be done. Performance is therefore better. This extreme solution, while not viable as such for exchanges between XML applications, can be used between the components of an XML application. In this particular case, the document type declaration may be inserted during debugging to enable validation, and omitted when in production.

Still, a document conforming to a DTD can, after an optional validation, be converted to an equivalent document which will not require validation or external entity substitution by using the XML canonicalization process. This process, which is described below, was not originally intended to improve performance, but one may benefit from it under certain situations and with certain limitations.

Any document can be converted into an equivalent (with some limitations) DTD-less document through a process named XML Canonicalization. The generated document is called a Canonical XML document.

"Any XML document is part of a set of XML documents that are logically equivalent within an application context, but which vary in physical representation based on syntactic changes permitted by XML 1.0 and Namespaces in XML. This specification describes a method for generating a physical representation, the canonical form, of an XML document that accounts for the permissible changes. Except for limitations regarding a few unusual cases, if two documents have the same canonical form, then the two documents are logically equivalent within the given application context. Note that two documents may have differing canonical forms yet still be equivalent in a given context based on application-specific equivalence rules for which no generalized XML specification could account." - Extract from Canonical XML Version 1.0 - W3C Recommendation 15 March 2001

The XML canonicalization process results in some changes from the original document, among others:

  1. Encoding of the document in UTF-8.
  2. Normalization of line breaks to #xA.
  3. Normalization of the attribute values.
  4. Substitution of character and parsed entities.
  5. Removal of the XML declaration.
  6. Removal of the document type declaration.
  7. Addition of the default attributes.

Although canonicalization is not primarily meant for this purpose, we can use it to improve performance. The front-end of the e-business application from our previous example could be improved to validate the incoming documents and generate a canonical form of the documents that are routed to the proper backend services. The backend services are able to parse the document much faster since no validation is required and the document doesn't refer to any external entity. Generated canonical documents, while having the same logical structure, don't share the same physical structure as the original documents. The application may therefore require that the original version be archived early on in the processing pipeline.

Unfortunately, so far, there is no standard XSL output method which could be used with the identity transformer (presented above) to generate a canonical form of a source document. To generate the canonical form of a document you may have to write a custom SAX2 ContentHandler. The Xerces distribution includes sample programs ( sax.SAX2Writer or sax.Writer for example) which generate canonical XML.
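To give an idea of what such a handler involves, here is a deliberately simplified sketch. It is not a conforming implementation of the Canonical XML recommendation: namespaces, processing instructions and comments are ignored, and the class name SimpleCanonicalWriter is hypothetical. It relies on the parser to expand entities and supply default attribute values, sorts attributes by name, drops the XML and document type declarations, and writes line feeds as &#xA;, mirroring the output shown in Code Sample 6.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Simplified, non-conforming sketch, for illustration only.
class SimpleCanonicalWriter extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes attributes) {
        // Attributes (including defaulted ones) sorted by name
        Map<String, String> sorted = new TreeMap<String, String>();
        for (int i = 0; i < attributes.getLength(); i++) {
            sorted.put(attributes.getQName(i), attributes.getValue(i));
        }
        out.append('<').append(qName);
        for (Map.Entry<String, String> a : sorted.entrySet()) {
            out.append(' ').append(a.getKey()).append("=\"")
               .append(escape(a.getValue(), true)).append('"');
        }
        out.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Empty elements are written as start-tag/end-tag pairs
        out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(escape(new String(ch, start, length), false));
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        characters(ch, start, length);
    }

    private static String escape(String s, boolean inAttribute) {
        StringBuilder b = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '&') b.append("&amp;");
            else if (c == '<') b.append("&lt;");
            else if (c == '>' && !inAttribute) b.append("&gt;");
            else if (c == '"' && inAttribute) b.append("&quot;");
            else if (c == '\t' && inAttribute) b.append("&#x9;");
            else if (c == '\n') b.append("&#xA;");
            else if (c == '\r') b.append("&#xD;");
            else b.append(c);
        }
        return b.toString();
    }

    static String canonicalize(String xml) throws Exception {
        SimpleCanonicalWriter writer = new SimpleCanonicalWriter();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), writer);
        return writer.out.toString();
    }
}
```

Because the output carries no document type declaration, a consumer can parse it without loading any DTD or external entity.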

Code Sample 6 shows the canonical form of an XML document. Note the absence of the XML declaration and of the document type declaration, and the replacement of every line break by the character reference &#xA;. While being equivalent in logical structure to the original, it does not share the same physical structure and is far less readable. Therefore, if any archiving were to be done, it would be done with the original document.

Code Sample 6: The canonical form of one of the Chessboards-[10-5000].xml documents (line breaks have been reintroduced after some of the &#xA; character references for readability purposes)

<CHESSBOARDS>&#xA; <CHESSBOARD>&#xA;  <WHITEPIECES>&#xA;
   <KING><POSITION COLUMN="G" ROW="1"></POSITION></KING>&#xA;
   <BISHOP><POSITION COLUMN="D" ROW="6"></POSITION>
</BISHOP>&#xA;
   <ROOK><POSITION COLUMN="E" ROW="1"></POSITION></ROOK>&#xA;
   <PAWN><POSITION COLUMN="A" ROW="4"></POSITION></PAWN>&#xA;
   <PAWN><POSITION COLUMN="B" ROW="3"></POSITION></PAWN>&#xA;
   <PAWN><POSITION COLUMN="C" ROW="2"></POSITION></PAWN>&#xA;
   <PAWN><POSITION COLUMN="F" ROW="2"></POSITION></PAWN>&#xA;
   <PAWN><POSITION COLUMN="G" ROW="2"></POSITION></PAWN>&#xA;
   <PAWN><POSITION COLUMN="H" ROW="5"></POSITION></PAWN>&#xA;
  </WHITEPIECES>&#xA;  <BLACKPIECES>&#xA;
   <KING><POSITION COLUMN="B" ROW="6"></POSITION></KING>&#xA;
   <QUEEN><POSITION COLUMN="A" ROW="7"></POSITION></QUEEN>&#xA;
   <PAWN><POSITION COLUMN="A" ROW="5"></POSITION></PAWN>&#xA;
   <PAWN><POSITION COLUMN="D" ROW="4"></POSITION></PAWN>&#xA;
  </BLACKPIECES>&#xA;</CHESSBOARD>&#xA;
...
</CHESSBOARDS>

The chart below shows the relative time to process an XML document in its original form and in its canonicalized form. Depending on the complexity of the schema and the number of referred external entities, the difference in performance can be even bigger.

Figure 2: Time to process an XML document (containing 1 chessboard configuration/processed 1000 times) and its canonicalized form with SAX, using Xerces without validation (JDK 1.2.2_06)

Unlike validation, which may be turned on for debugging and off in production, canonicalization may be switched on in production only, when looking for the best performance. Canonicalization can only be used if the canonical XML documents and the original documents are equivalent. If the application relies, for example, on comments or any other lexical events generated by the parser, canonicalization cannot be used. Any variant of this process can be applied as long as both forms of the XML document -- the original one and the refined one -- are equivalent for the application.

Figure 3: Since the document (post-validation) is canonical, it does not include a Document Type Declaration and does not refer to any external entities

Reducing the Cost of Referencing External Entities

External entities, including external DTD subsets, must be loaded and parsed even when not validating, in order to deliver the same information to the application regardless of validation. Standalone documents don't reference any external entity but may still use internal DTD subsets. By avoiding the loading of any external entity, they may therefore improve performance, especially compared to cases where the DTD or other external entities reside in a non-local repository.

Nevertheless, standalone documents may not be the solution of choice especially in the case of e-business document exchanges which rely on public XML schemas being published on a common registry or repository.

Code Sample 7: A standalone XML document: the DTD has been embedded as an internal DTD subset. Performance may be improved, especially compared to a situation where the DTD is an external DTD subset located on a remote repository.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<!DOCTYPE CHESSBOARD [
 <!ELEMENT CHESSBOARD (WHITEPIECES, BLACKPIECES)>
 <!-- The content model is written out for each side because
      parameter-entity references may not appear inside markup
      declarations in the internal DTD subset. -->
 <!ELEMENT WHITEPIECES
  (KING,
   QUEEN?,
   BISHOP?, BISHOP?,
   ROOK?, ROOK?,
   KNIGHT?, KNIGHT?,
   PAWN?, PAWN?, PAWN?, PAWN?,
   PAWN?, PAWN?, PAWN?, PAWN?)
 >
 <!ELEMENT BLACKPIECES
  (KING,
   QUEEN?,
   BISHOP?, BISHOP?,
   ROOK?, ROOK?,
   KNIGHT?, KNIGHT?,
   PAWN?, PAWN?, PAWN?, PAWN?,
   PAWN?, PAWN?, PAWN?, PAWN?)
 >
 <!ELEMENT POSITION EMPTY>
 <!ATTLIST POSITION
  COLUMN (A|B|C|D|E|F|G|H) #REQUIRED
  ROW (1|2|3|4|5|6|7|8) #REQUIRED
 >
 <!ELEMENT KING (POSITION)>
 <!ELEMENT QUEEN (POSITION)>
 <!ELEMENT BISHOP (POSITION)>
 <!ELEMENT ROOK (POSITION)>
 <!ELEMENT KNIGHT (POSITION)>
 <!ELEMENT PAWN (POSITION)>
]>

<CHESSBOARD>
 <WHITEPIECES>
  <KING><POSITION COLUMN="G" ROW="1"/></KING>
  <BISHOP><POSITION COLUMN="D" ROW="6"/></BISHOP>
  <ROOK><POSITION COLUMN="E" ROW="1"/></ROOK>
  <PAWN><POSITION COLUMN="A" ROW="4"/></PAWN>
  <PAWN><POSITION COLUMN="B" ROW="3"/></PAWN>
  <PAWN><POSITION COLUMN="C" ROW="2"/></PAWN>
  <PAWN><POSITION COLUMN="F" ROW="2"/></PAWN>
  <PAWN><POSITION COLUMN="G" ROW="2"/></PAWN>
  <PAWN><POSITION COLUMN="H" ROW="5"/></PAWN>
 </WHITEPIECES>
 <BLACKPIECES>
  <KING><POSITION COLUMN="B" ROW="6"/></KING>
  <QUEEN><POSITION COLUMN="A" ROW="7"/></QUEEN>
  <PAWN><POSITION COLUMN="A" ROW="5"/></PAWN>
  <PAWN><POSITION COLUMN="D" ROW="4"/></PAWN>
 </BLACKPIECES>
</CHESSBOARD>
                

Caching External Entities

Caching Using a Proxy Cache

Access to external entities located on a remote repository may be sped up by setting up a proxy that caches any document retrieved, and especially external entities -- provided the references to the external entities are URLs whose protocols are handled by the proxy.
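On the Java side, one way to route the parser's HTTP requests through such a proxy is to set the standard networking system properties before parsing. This is a minimal sketch: the class name ProxySetup is hypothetical, and the host name and port in the usage note are placeholders for whatever caching proxy is actually deployed.

```java
// Hypothetical helper, for illustration only.
class ProxySetup {

    // Directs HTTP connections opened by this JVM (including those the
    // parser opens to fetch DTDs and other external entities) to a proxy.
    static void useCachingProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
    }
}
```

For example, ProxySetup.useCachingProxy("proxy.example.org", 3128) could be called once at application startup; the same properties can also be passed on the command line with -D options.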

Figure 4: Caching architecture. Entities still have to be resolved.

Caching With a Custom EntityResolver

SAX parsers allow XML applications to handle external entities in a customized way. Such applications have to register their own implementation of the org.xml.sax.EntityResolver interface with the parser using the setEntityResolver method. The applications are then able to intercept any external entities (including the external DTD subsets) before they are parsed.

This feature can be used to implement:

  1. A caching mechanism in the application itself, or
  2. A custom URI lookup mechanism that may redirect system and public references to a local copy of a public repository.

Both mechanisms can be used jointly to ensure even better performance. The first one may be used for static entities which have a lifetime greater than the application's. This is especially the case for public DTDs, which usually evolve through successive versions and which include the version in their public or system identifier. The second mechanism may first map public identifiers to system identifiers and then apply the same techniques as a regular caching proxy when dealing with system identifiers in the form of URLs, especially checking for updates and avoiding caching dynamic content.
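The second mechanism, a custom lookup from public identifiers to local copies, can be sketched with a small EntityResolver. The class LocalCatalogResolver, its register method, and the identifiers in the usage note are hypothetical and serve only as an illustration; returning null lets the parser fall back to its default resolution.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver, for illustration only.
class LocalCatalogResolver implements EntityResolver {
    // Maps a public identifier to the system identifier of a local copy
    private final Map<String, String> localCopies =
        new HashMap<String, String>();

    void register(String publicId, String localSystemId) {
        localCopies.put(publicId, localSystemId);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String local = (publicId == null) ? null : localCopies.get(publicId);
        if (local == null) {
            // Unknown entity: let the parser resolve it as usual.
            return null;
        }
        // The parser will open the local copy instead of the remote one.
        InputSource source = new InputSource(local);
        source.setPublicId(publicId);
        return source;
    }
}
```

A resolver populated with, say, register("-//EXAMPLE//DTD Chessboard 1.0//EN", "file:///opt/dtd/Chessboard.dtd") would be installed with the setEntityResolver method of the XMLReader or DocumentBuilder.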

Code Sample 8: A simple cache implementation for external entities (this implementation is incomplete: it doesn't free unused entries in the entities hash map)

import java.io.*;
import java.lang.ref.*;
import java.net.URL;
import java.util.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.xml.parsers.*;
...

public class EntityCache {
 public static final int MAXCHUNCK = 5 * 1024 * 1024;

 private Map entities = new HashMap();
 // Shared scratch buffer: this class is not thread-safe as written
 private byte[] buf = new byte[MAXCHUNCK];

 public InputStream getEntity(String systemId) {
  byte[] entity = null;
  SoftReference reference
    = (SoftReference) entities.get(systemId);
  if (reference != null) {
   // Got a soft reference to the entity,
   // let's get the actual entity.
   entity = (byte[]) reference.get();
  }
  if (entity == null) {
   // The entity has been reclaimed by the GC or was
   // never created, let's download it again!
   return cacheEntity(systemId);
  }
  return new ByteArrayInputStream(entity);
 }

 // Attempt to cache an entity - if it's too big
 // just return an input stream to it
 private InputStream cacheEntity(String systemId) {
  try {
   BufferedInputStream stream = new BufferedInputStream(
       new URL(systemId).openStream());
   int count = 0;
   for (int i = 0; count < buf.length; count += i) {
    if ((i = stream.read(buf, count,
        buf.length - count)) < 0) {
     break;
    }
   }
   byte[] entity = new byte[count];
   System.arraycopy(buf, 0, entity, 0, count);
   if (count != buf.length) { // Not a too big entity
    // The whole entity has been read: release the connection
    stream.close();
    // Cache the entity for future use, using a soft
    // reference so that the GC may reclaim it
    // if the memory is running low.
    entities.put(systemId, new SoftReference(entity));
    return new ByteArrayInputStream(entity);
   }
   // Entity too big to be cached
   return new SequenceInputStream(
       new ByteArrayInputStream(entity), stream);
  } catch (Exception exception) {
   // The default EntityResolver will try to get it...
   return null;
  }
 }
}
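The caption of Code Sample 8 notes that stale entries are never removed from the map. One possible way to close that gap, shown here as a sketch rather than as the authors' implementation, is to register each soft reference with a java.lang.ref.ReferenceQueue and drop the corresponding map entry once the garbage collector has cleared it. The class name PurgingSoftCache and its methods are hypothetical.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical variant of the entity map that forgets cleared entries.
class PurgingSoftCache {
    // A soft reference that remembers its key, so the stale map entry
    // can be located after the referent has been reclaimed.
    private static final class KeyedRef extends SoftReference<byte[]> {
        final String key;

        KeyedRef(String key, byte[] value, ReferenceQueue<byte[]> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final Map<String, KeyedRef> entries = new HashMap<>();
    private final ReferenceQueue<byte[]> queue = new ReferenceQueue<>();

    synchronized void put(String key, byte[] value) {
        purge();
        entries.put(key, new KeyedRef(key, value, queue));
    }

    // Returns the cached bytes, or null if absent or already reclaimed.
    synchronized byte[] get(String key) {
        purge();
        KeyedRef ref = entries.get(key);
        return (ref != null) ? ref.get() : null;
    }

    synchronized int size() {
        purge();
        return entries.size();
    }

    // Removes map entries whose referents were cleared by the GC.
    private void purge() {
        KeyedRef stale;
        while ((stale = (KeyedRef) queue.poll()) != null) {
            // Two-argument remove: only drop the entry if the key still
            // maps to this exact reference (it may have been replaced).
            entries.remove(stale.key, stale);
        }
    }
}
```

Purging on every access keeps the sketch simple; an application with a very large number of entries might prefer to purge from a background thread instead.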
                

Code Sample 9: A sample program using the SAX API and implementing the entity resolver to look up entities in an in-memory cache.

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.xml.parsers.*;
...

public class ChessboardSAXPrinter {
 private SAXParser parser;
 private EntityCache entityCache;

 public class ChessboardHandler extends HandlerBase {

  public void startElement(String name,
                           AttributeList attrs) {
   // XML processing
   ...
  }

  public InputSource resolveEntity(String publicId,
      String systemId) {
   if (entityCache != null) {
    if (systemId != null) {
     InputStream stream
       = entityCache.getEntity(systemId);
     if (stream != null) {
      InputSource source = new InputSource(stream);
      source.setPublicId(publicId);
      source.setSystemId(systemId);
      return source;
     }
    }
   }
   // Let the default entity resolver resolve this one...
   return null;
  }
 }
  
 public ChessboardSAXPrinter(boolean validating, 
    boolean caching) throws Exception {
  this.entityCache = caching ? new EntityCache() : null;
  SAXParserFactory parserFactory 
    = SAXParserFactory.newInstance();
  parserFactory.setValidating(validating);
  parser = parserFactory.newSAXParser();
  return;
 }
}
                

The improvement in performance is quite significant, especially when external entities are located on the network:

Figure 5: Time to process an XML document (containing 1 chessboard configuration and referencing its DTD across a LAN) with SAX, using Xerces with validation (JDK 1.2.2_06)

When combining both suggested architectures for reducing the validation cost and reducing the cost of referencing external entities, the resulting architecture may look as follows:

Figure 6: An architecture to reduce the costs of validation and referencing external entities; caching only occurs on the front-end since the documents processed by the services don't refer to any external entities.

Caching Generated Content and Style Sheets

The eMobile sample end-to-end application demonstrated at the JavaOne Conference 2000 provided an example of how servlets can generate content targeted at different devices from value objects returned by an EJB application. Style sheets were applied to DOM trees built from the value objects in order to transform them to the targeted content type.

There are two places where the performance of the web tier of the eMobile sample application has been improved:

  1. The construction of DOM trees from the value objects returned by the EJB application
  2. The loading of style sheets from files

Figure 7: The two places where performance was improved in the eMobile application

When the generated content did not fit in a single WML deck or HTML page on the device, the result was divided among several decks or pages that the user could browse. The decks or pages were generated one at a time from the DOM tree, upon the user's request. To avoid invoking the EJB application again, the DOM tree was cached in the user's session. Once loaded, the style sheets were also cached to avoid reloading them for every content generation.

Caching the result of a user request to serve subsequent related requests more quickly consumes memory. It must not be done to the detriment of other users: the application must not fail because of a memory shortage caused by the cached results. Soft references (java.lang.ref.SoftReference), introduced with Java 2, let a cache cooperate with the garbage collector: softly referenced objects may be reclaimed when memory runs low.

In the context of a distributed web container, the reference to the DOM tree stored in the session may have to be declared as transient: first, because not all DOM implementations are serializable (a requirement for objects stored in an HttpSession that may be replicated), and second, because, as we will see next, the serialization of a DOM tree may be very expensive and may therefore cancel out the benefits of caching.

The following code sample shows how a query and its result are cached in the client's associated session, and how the result of a previously executed query may be retrieved from the session. This code sample from the eMobile application has been updated to take into account distributed web containers.

Code Sample 10: Caching the query and its result in the client's associated session

import java.io.Serializable;
import java.lang.ref.*;

// Declared static so that serializing an entry does not
// also serialize the enclosing instance
private static class CacheEntry implements Serializable {
 Object query;
 // If the entry is replicated through serialization
 // result will be reset to null
 transient SoftReference result = null;

 CacheEntry(Object query, Object result) {
  this.query = query;
  this.result = new SoftReference(result);
 }

 Object getResult() {
  if (result != null) {
   return result.get();
  }
  return null;
 }
}

                

/**
 * Stores the query and its result in the client's 
 * associated session so that the response to the 
 * same query will be immediate.
 * A soft reference is used so that if the memory is 
 * running low the GC may reclaim the result objects.
 *
 * @param query the query, may be the HTTP request 
 * query string itself...
 * @param result the result to the query
 * @param session the HTTP session
 */

protected void cacheQueryResult(Object query, Object
    result, HttpSession session) {
 session.setAttribute(QUERY_ATTRIBUTE,
     new CacheEntry(query, result));
 return;
}

                

/**
 * Gets the result of a previously executed query  
 * that has been cached in the session. 
 * The previously cached result may be returned only 
 * if the cached query and the requested query match.  
 * Since soft references are used a matching result 
 * may not be returned if the result has already been 
 * reclaimed by the GC due to a shortage of memory.
 *
 * @param query the query, may be the HTTP request query
 *   string itself...
 * @param session the HTTP session
 * @return the associated matching result if it could be
 *   retrieved or null
 */

protected Object getCachedQueryResult(Object query,
    HttpSession session) {
 CacheEntry entry
   = (CacheEntry) session.getAttribute(QUERY_ATTRIBUTE);
 if (entry != null) { // A cached entry was retrieved
  Object cachedResult = entry.getResult();
  if (cachedResult != null) {
   // The referred cached result was not reclaimed by the GC
   Object cachedQuery = entry.query;
   if (cachedQuery.equals(query)) {
    // The cached query and the requested query match
    if (TRACE) {
     System.err.println("Query cache hit.");
    }
    return cachedResult;
   } else if (TRACE) {
    System.err.println("Query cache miss (mismatch).");
   }
  } else if (TRACE) {
   System.err.println("Query cache miss (GC).");
  }
 } else if (TRACE) {
  System.err.println("Query cache miss (session).");
 }
 // The queries didn't match or the cached result could not
 // be retrieved, let's just clean
 session.removeAttribute(QUERY_ATTRIBUTE);
 return null;
}
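The effect of the transient soft reference in CacheEntry can be checked in isolation. The sketch below is a simplified stand-in for the eMobile code (the names SessionEntryDemo, Entry, and replicate are hypothetical): it serializes an entry in memory, as a distributed container might when replicating a session, and shows that the copy keeps the query but no longer holds a result.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.lang.ref.SoftReference;

class SessionEntryDemo {
    // Simplified stand-in for the CacheEntry class of Code Sample 10.
    static class Entry implements Serializable {
        private static final long serialVersionUID = 1L;
        final String query;
        // Not written during serialization, so the cached result
        // (for example a DOM tree) is never replicated.
        transient SoftReference<Object> result;

        Entry(String query, Object result) {
            this.query = query;
            this.result = new SoftReference<>(result);
        }

        Object getResult() {
            return (result != null) ? result.get() : null;
        }
    }

    // Simulates session replication with an in-memory round trip.
    static Entry replicate(Entry entry) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(entry);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Entry) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

On the replica, getResult() returns null, so the lookup logic treats the entry as a cache miss and the result is simply computed again on that node.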

                

Caching the style sheets relied on the same principle as caching the result DOM trees. When loaded, the style sheets were cached in a hashtable using soft references which were shared among all the servlets.
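A shared style sheet cache along these lines might look as follows. This is a minimal sketch, not the eMobile code: it assumes the style sheets are kept in their compiled form as javax.xml.transform.Templates objects (which JAXP specifies as safe to share between threads), and the class name StylesheetCache is hypothetical.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Source;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

// Hypothetical cache of compiled style sheets, shared among servlets.
class StylesheetCache {
    private final Map<String, SoftReference<Templates>> cache = new HashMap<>();
    private final TransformerFactory factory = TransformerFactory.newInstance();

    // Loads the style sheet from its system identifier (file name or URL).
    synchronized Templates get(String systemId) {
        return get(systemId, new StreamSource(systemId));
    }

    // Returns the cached compiled style sheet for key; the given source
    // is only read when the entry is missing or has been reclaimed.
    synchronized Templates get(String key, Source source) {
        SoftReference<Templates> ref = cache.get(key);
        Templates templates = (ref != null) ? ref.get() : null;
        if (templates == null) {
            try {
                templates = factory.newTemplates(source);
            } catch (TransformerConfigurationException e) {
                throw new IllegalStateException("Cannot compile " + key, e);
            }
            cache.put(key, new SoftReference<>(templates));
        }
        return templates;
    }
}
```

Each request then obtains its own Transformer with templates.newTransformer(), since Transformer instances themselves should not be shared between threads.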

Along the same lines, JAXP 1.1 defines the javax.xml.transform.URIResolver interface, which may be implemented to retrieve the resources referred to in style sheets by xsl:import or xsl:include statements. In an application using a large set of componentized style sheets, this may be used to implement a cache in much the same way as the EntityResolver above.
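A caching URIResolver could be sketched as follows. The class name CachingURIResolver and the loadCount method (present only to make the caching observable) are assumptions; unlike Code Sample 8, this sketch places no size limit on cached resources.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.lang.ref.SoftReference;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Source;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

// Hypothetical resolver caching imported or included style sheets.
class CachingURIResolver implements URIResolver {
    private final Map<String, SoftReference<byte[]>> cache = new HashMap<>();
    private int loads = 0;

    @Override
    public synchronized Source resolve(String href, String base) {
        try {
            // Resolve href against the importing style sheet, if known
            URI uri = (base == null || base.isEmpty())
                    ? URI.create(href)
                    : URI.create(base).resolve(href);
            String systemId = uri.toString();
            SoftReference<byte[]> ref = cache.get(systemId);
            byte[] data = (ref != null) ? ref.get() : null;
            if (data == null) {
                data = download(uri);
                cache.put(systemId, new SoftReference<>(data));
            }
            return new StreamSource(new ByteArrayInputStream(data), systemId);
        } catch (Exception e) {
            // Returning null asks the processor to resolve the URI itself
            return null;
        }
    }

    // Number of times a resource was actually fetched.
    synchronized int loadCount() {
        return loads;
    }

    private byte[] download(URI uri) throws Exception {
        loads++;
        try (InputStream in = uri.toURL().openStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            for (int n; (n = in.read(chunk)) >= 0; ) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

The resolver would typically be installed with TransformerFactory.setURIResolver before the style sheets are compiled.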

Using Java 2 SE v 1.3 (and Higher)

XML processing is very CPU- and memory-intensive. For a server-side application, better performance is usually obtained by using the Java HotSpot Server VM, which is activated by passing the -server option when launching the Java virtual machine (see note 3). Other options are available to configure the heap size (for example, -Xms and -Xmx) and garbage collection. Valuable information can be found in Frequently Asked Questions about the Java HotSpot Virtual Machine.

Using XML With Parsimony

XML documents are text documents, so they can easily be exchanged between heterogeneous systems. But they require a parsing phase that, as mentioned earlier, is very expensive. This is the price to pay for allowing loosely coupled systems to work together, loosely coupled not only technically but also at the enterprise level. When system components are tightly coupled, "regular" non-document-oriented techniques (using RMI, for example) are far more efficient, not only in terms of performance but also in terms of coding complexity. With technologies like JAXB, the two worlds can be efficiently combined to develop systems that are internally tightly coupled and object-oriented, and that interact with one another in a loosely coupled, document-oriented way.

Figure 8: A mixed architecture: loosely coupled and document-oriented on the outside, tightly coupled and object-oriented on the inside

To illustrate this statement, let's compare the cost of serializing/deserializing to/from:

  1. An XML document, through an intermediate DOM tree representing the "business objects"
  2. A Java serialized form of the "business objects"
  3. A Java serialized form of the DOM tree representing the "business objects"

We designed Java classes that implement all the pieces (King, Queen, Bishop, Rook, Knight and Pawn classes) as well as classes that implement a chessboard and a set of chessboards (Chessboard and Chessboards classes). All these classes implement methods to create an equivalent intermediate DOM representation. They also implement the Serializable interface so that their instances can be serialized to or deserialized from a Java serialized form. The Chessboards class additionally implements methods to serialize to XML and deserialize from XML.

The code fragment below shows the Chessboards class methods to serialize to XML, deserialize from XML, create an instance from a DOM representation, and create a DOM representation from an instance. The other classes (Chessboard, King, Queen...) implement equivalent DOM methods.

Code Sample 11: Implementation of the Chessboards methods to serialize/deserialize to/from XML and DOM; the serialization/deserialization to/from a Java serialized object is simply enabled by implementing the Serializable interface

public void toXML(OutputStream out, DocumentBuilder builder,
    Transformer transformer) throws Exception {
 transformer.transform(new DOMSource(toDOM(builder)),
    new StreamResult(out));
 return;
}

public Document toDOM(DocumentBuilder builder)
  throws Exception {
 Document document = builder.newDocument();
 Element root = document.createElement("CHESSBOARDS");
 for (Iterator j = chessboards.iterator(); j.hasNext();) {
  Chessboard chessboard = (Chessboard) j.next();
  root.appendChild(chessboard.toDOM(document));
 }
 document.appendChild(root);
 return document;
}

public static Chessboards fromXML(String sourceURI,
    DocumentBuilder builder)
   throws Exception {
 Document document = builder.parse(sourceURI);
 return fromDOM(document);
}

public static Chessboards fromDOM(Document document)
   throws Exception {
 Element root = document.getDocumentElement();
 if (root.getTagName().equals("CHESSBOARDS")) {
  Chessboards chessboards = new Chessboards();
  // A for loop (rather than do/while) also handles
  // a root element that has no children
  for (Node child = root.getFirstChild(); child != null;
      child = child.getNextSibling()) {
   if (child.getNodeType() == Node.ELEMENT_NODE) {
    Element element = (Element) child;
    if (element.getTagName().equals("CHESSBOARD")) {
     chessboards.chessboards.add(Chessboard.fromDOM(element));
    }
   }
  }
  return chessboards;
 }
 throw new IllegalArgumentException(root.getTagName());
}


To measure performance, we implemented three structurally equivalent test programs. Each first loaded the original XML document describing a set of chessboard configurations and then, in a loop, wrote it to a file and read it back, either as an XML document or as Java serialized objects. We ran these test programs on a set of 1000 chessboard configurations, which was processed 10 times in each of the 10 runs. The measured time was the sum of the user and system times, as returned by the ptime command, divided by the total number of serializations/deserializations.

Code Sample 12: The test program to serialize/deserialize to/from XML, through an intermediate DOM tree

DocumentBuilderFactory builderFactory
  = DocumentBuilderFactory.newInstance();
DocumentBuilder builder
  = builderFactory.newDocumentBuilder();
TransformerFactory transformerFactory
  = TransformerFactory.newInstance();
Transformer transformer;
transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM,
    "file:dtd/Chessboards.dtd");
transformer.setOutputProperty(OutputKeys.METHOD,
    "xml");
// Reading the original XML document
Chessboards chessboards
  = Chessboards.fromXML(args[1], builder);
for (int k = 0; k < r; k++) {
 for (int i = 0; i < n; i++) {
  PrintStream out = new PrintStream(
    new BufferedOutputStream(new
        FileOutputStream(args[2])));
   // Serializing
  chessboards.toXML(out, builder, transformer);

  out.close();
   // Deserializing
  chessboards = Chessboards.fromXML(args[2], builder);

 }
}


Code Sample 13: The test program to serialize/deserialize to/from a serialized Java object

DocumentBuilder builder = DocumentBuilderFactory
    .newInstance().newDocumentBuilder();
// Reading the original XML document
Chessboards chessboards
  = Chessboards.fromXML(args[1], builder);
for (int k = 0; k < r; k++) {
 for (int i = 0; i < n; i++) {
  ObjectOutputStream out = new ObjectOutputStream(
    new BufferedOutputStream(new
        FileOutputStream(args[2])));
   // Serializing
  out.writeObject(chessboards);

  out.close();
  ObjectInputStream in = new ObjectInputStream(
    new BufferedInputStream(new
        FileInputStream(args[2])));
   // Deserializing
  chessboards = (Chessboards) in.readObject();

  in.close();
 }
}


Code Sample 14: The test program to serialize/deserialize to/from a Java serialized DOM tree

DocumentBuilderFactory builderFactory
  = DocumentBuilderFactory.newInstance();
DocumentBuilder builder
  = builderFactory.newDocumentBuilder();
// Reading the original XML document
Chessboards chessboards
 = Chessboards.fromXML(args[1], builder);
for (int k = 0; k < r; k++) {
 for (int i = 0; i < n; i++) {
  ObjectOutputStream out = new ObjectOutputStream(
    new BufferedOutputStream(new
        FileOutputStream(args[2])));
   // Serializing
  out.writeObject(chessboards.toDOM(builder));

  out.close();
  ObjectInputStream in = new ObjectInputStream(
    new BufferedInputStream(new
        FileInputStream(args[2])));
   // Deserializing
  chessboards
    = Chessboards.fromDOM((Document) in.readObject());
  in.close();
 }
}


The results show not only that the direct Java serialization of the "business objects" is faster than the XML serialization or the Java serialization of the DOM tree, but also that the resulting serialized object form is smaller than the serialized XML document or the Java serialized DOM tree form. The Java serialization of the DOM tree is the most expensive in processing time as well as in memory footprint; it should therefore be used with extreme care, especially in the context of Enterprise JavaBeans (EJB), where serialization occurs when accessing remote EJBs. When accessing local EJBs, DOM trees or DOM tree fragments can be passed along without incurring the same cost.

Figure 9: Average time to serialize/deserialize a set of 1000 chessboard configurations in its XML document form (through an intermediate DOM tree), in its "business object" Java serialized form, and in its DOM tree Java serialized form; Crimson's DOM implementation does not support Java serialization

Figure 10: Size of the serialized XML document, the Java serialized "business objects" form and the Java serialized DOM tree form

Applications which are internally document-oriented may be designed so that only the most relevant and most accessed information is extracted from the document to be processed and mapped to business objects. These business objects may keep a reference to the original document (in its original text form or in a cached DOM representation) so that more information can be queried when needed from the original document using XPath expressions or XQuery, for example.
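As an illustration of this design, the sketch below maps only one frequently accessed value to a field and answers any other question with an XPath expression evaluated against the cached DOM tree. It relies on the javax.xml.xpath API, which was added to JAXP after this article's samples were written; the class name ChessboardView and the element and attribute names in the usage note are assumptions, not the article's actual DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical business object backed by its original document.
class ChessboardView {
    private final Document source;  // cached DOM representation
    private final String id;        // eagerly extracted, frequently used
    // Note: XPath objects are not thread-safe; one per instance here
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    private ChessboardView(Document source) {
        this.source = source;
        this.id = source.getDocumentElement().getAttribute("id");
    }

    static ChessboardView parse(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return new ChessboardView(document);
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parsable document", e);
        }
    }

    String getId() {
        return id;
    }

    // Less frequently needed information is looked up on demand.
    String query(String expression) {
        try {
            return xpath.evaluate(expression, source);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(expression, e);
        }
    }
}
```

For a document such as <CHESSBOARD id="b1"><WHITEPIECES><KING column="E" row="1"/></WHITEPIECES></CHESSBOARD>, getId() is answered from the field, while query("/CHESSBOARD/WHITEPIECES/KING/@column") walks the cached tree only when that detail is actually requested.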

Conclusion

In this article, we presented different performance improvement tips. The first question to ask when developing an XML-based application is "Should it be XML based?" If the answer is yes, then a sound and balanced architecture has to be designed, one that relies on XML only for what it is good at: open inter-application communications, configuration descriptions, information sharing, or any domain for which a public XML schema may exist. XML may not be the solution of choice for unexposed interfaces or for exchanges between components that should otherwise be tightly coupled. Whether XML processing is just a pre- or post-processing stage of the business logic, or whether it makes sense for the application to have its core data structure represented as documents, the developer will have to choose between the different APIs and implementations, considering not only their functionality and ease of use but also their performance. Ultimately, Java XML-based applications are developed in Java, so any Java performance improvement rule applies as well, especially those regarding string processing and object creation.

Resources

Java Technology & XML Part 1 -- An Introduction to APIs for XML Processing
Java Technology & XML Part 2 -- API Benchmarks
Java Technology & XML
Java APIs for XML Processing (JAXP)
The Simple API for XML (SAX)
Document Object Model (DOM)
Extensible Stylesheet Language (XSL)
XML Path Language (XPath)
Xerces Java Parser Readme
Crimson 1.1 Release
Xalan - Java Version 2.3.1
JDOM
dom4j: the flexible XML framework for Java
Canonical XML Version 1.0
eMobile End-to-End Application using the Java 2, Enterprise Edition - Part II

1 Depending on the implementation, a DOM tree may or may not be "Java" serializable: it's not a requirement from the specification.

2 A system here is understood as any set of hardware and software that composes your solution and which defines a boundary within which any exchange between components is considered secure and reliable.

3 As used on this web site, the terms Java virtual machine or Java VM mean a virtual machine for the Java platform.
