Technical Article

Java Technology and XML-Part 3: Performance Improvement Tips

By Thierry Violleau
March 2002

Neither Java nor XML technology needs an introduction, nor does the synergy between the two: "Portable Code and Portable Data." With the growing interest in web services and e-business platforms, XML is joining Java in the developer's toolbox. As of today, no fewer than six extensions to the Java platform empower the developer when building XML-based applications:

Java API for XML Processing (JAXP)
Java Architecture for XML Binding (JAXB)
Long Term JavaBeans Persistence
Java API for XML Messaging (JAXM)
Java API for XML-based RPC (JAX-RPC)
Java API for XML Registries (JAXR)

The first of the three articles in this series gave an overview of the different APIs available to the developer by presenting some sample programs. The differences in performance were addressed in the second article. This third article gives tips on improving the performance of XML-based applications from a programmatic and architectural point of view.

XML processing is intensive in its use of CPU, memory, and I/O or network resources. XML documents are text documents that need to be parsed before any meaningful application processing can be performed. The parsing of an XML document may result either in a stream of events if the SAX API is used, or in an in-memory document model if the DOM API is used. During parsing, a validating parser may additionally check the validity of the document against a predefined schema (a Document Type Definition or an XML Schema).
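The two parsing outcomes described above can be compared on a small document. The following is a minimal sketch using only the standard JAXP entry points; the class name `ParseModes` and its method names are illustrative and not part of any API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative helper (hypothetical name): counts elements with each API.
class ParseModes {

    // SAX: the parser pushes events to the handler; nothing is retained
    // unless the handler chooses to keep it.
    static int countElementsWithSax(String xml) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    // DOM: the whole document is built in memory before it can be queried.
    static int countElementsWithDom(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // "*" matches every element in the document, including the root.
        return doc.getElementsByTagName("*").getLength();
    }
}
```

For the document `<order id='42'><item>pen</item><item>ink</item></order>`, both methods count three elements; the difference lies in what has to be held in memory to get there.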

Processing an XML document means recognizing, extracting and directly processing the element contents and attribute values or mapping them to other business objects that are processed further on. Before an application can apply any business logic, the following steps must take place:

Parsing
Optionally, validating (which implies first parsing the schema)
Recognizing
Extracting
Optionally, mapping

Parsing XML documents implies a lot of character encoding and decoding and string processing. Then, depending on the chosen API, recognition and extraction of content may correspond to walking through a tree data structure, or catching the events generated by the parser and processing them according to some context. If an application uses XSLT to preprocess an XML document, even more processing is added before the real business logic work can take place.

Using the DOM API implies creating an in-memory representation of the document as a DOM tree. If the document is large, so are the DOM tree and its memory consumption.

The physical structure and the logical structure of an XML document may be different. An XML document may contain references to external entities, which are substituted into the document content during parsing and prior to validation. Those external entities and the schema itself (such as a DTD) may be located on remote systems, especially if the document itself originates from another system. In order to proceed with parsing and validation, the external entities must first be loaded (downloaded). Documents with a complex physical structure may therefore be very I/O or network intensive.
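To make the cost of remote entities concrete, here is a minimal sketch of one way to keep the parser off the network: a SAX `EntityResolver` that serves local copies keyed by system identifier. The class `LocalEntityResolver`, its `register` method, and any URL registered with it are assumptions made for illustration; only `EntityResolver`, `InputSource`, and `DocumentBuilder.setEntityResolver` are standard API.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: maps system IDs to locally held entity content.
class LocalEntityResolver implements EntityResolver {
    private final Map<String, String> localCopies = new HashMap<String, String>();
    int hits = 0; // number of entities served locally

    void register(String systemId, String content) {
        localCopies.put(systemId, content);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String content = localCopies.get(systemId);
        if (content == null) {
            // Returning null lets the parser fall back to its default
            // resolution, which may open a network connection.
            return null;
        }
        hits++;
        InputSource source = new InputSource(new StringReader(content));
        source.setSystemId(systemId);
        return source;
    }

    // Parses a document with the resolver installed.
    static Document parse(String xml, LocalEntityResolver resolver) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

In a real application the local copies would more likely come from files or a cache than from strings; the string map only keeps the sketch self-contained.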

In this article, we give some tips for improving performance when processing XML documents, organized around reducing CPU, memory, and I/O or network consumption.

Using the Most Appropriate API: Choosing Between SAX and DOM

Both DOM and SAX have features that make them more suitable for certain tasks than others:

Table 1: SAX and DOM features

SAX	DOM
Event based model	Tree data structure
Serial access (flow of events)	Random access (in-memory data structure)
Low memory usage (only events are generated)	High memory usage (the document is loaded into memory)
To process parts of the document (catching relevant events)	To edit the document (processing the in-memory data structure)
To process the document only once (transient flow of events)	To process multiple times (document loaded in memory)

Leaving aside the impact of memory consumption on overall system performance, processing with the DOM API is usually slower than processing with the SAX API, mainly because the DOM API may have to load the whole document into memory first so that it can be edited or so that data can be easily retrieved, while the SAX API allows immediate processing as the document is being parsed. Therefore, DOM should be used when the source document is to be edited or processed multiple times.
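The case for DOM made above (editing, or several passes over the same document) can be sketched as follows. The document is parsed once, read, edited in place, and read again without reparsing. The class `DomEditing` and the `price` element name are assumptions chosen for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical example: several passes over one in-memory tree.
class DomEditing {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // First kind of pass: read-only walk over the children of the root.
    static int total(Document doc) {
        int sum = 0;
        for (Node n = doc.getDocumentElement().getFirstChild();
             n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && "price".equals(n.getNodeName())) {
                sum += Integer.parseInt(n.getTextContent().trim());
            }
        }
        return sum;
    }

    // Second kind of pass: edit the tree in place.
    static void doublePrices(Document doc) {
        for (Node n = doc.getDocumentElement().getFirstChild();
             n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && "price".equals(n.getNodeName())) {
                int value = Integer.parseInt(n.getTextContent().trim());
                n.setTextContent(String.valueOf(value * 2));
            }
        }
    }
}
```

With SAX, each of these passes would require parsing the source again, and the edit would require regenerating the document from the event stream.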

SAX is very convenient when you want to extract information from an XML document (an element's content or an attribute value) regardless of its overall context, that is, its position in the XML document tree, or when the document structure maps exactly to the business object structure. Otherwise, keeping track of the element nesting can be very tedious, and you may be better off using DOM. Nevertheless, when the source document is to be mapped to a business object that is not primarily represented as a DOM tree, it is recommended to use SAX to map directly to the business object, avoiding an intermediate, resource-consuming representation. Of course, if the business object has a direct representation in Java, technologies like XML data binding (JAXB) can be used.
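Mapping directly from SAX events to a business object, with no intermediate tree, might look like the sketch below. The `LineItem` class, the `OrderHandler` name, and the `item`/`sku` vocabulary are hypothetical; the handler callbacks are the standard `DefaultHandler` ones.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical business object.
class LineItem {
    final String sku;
    final int quantity;

    LineItem(String sku, int quantity) {
        this.sku = sku;
        this.quantity = quantity;
    }
}

// Builds LineItem objects straight from the event stream.
class OrderHandler extends DefaultHandler {
    final List<LineItem> items = new ArrayList<LineItem>();
    private final StringBuilder text = new StringBuilder();
    private String sku;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("item".equals(qName)) {
            sku = atts.getValue("sku"); // extract the attribute value
            text.setLength(0);          // start collecting element content
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver content in several chunks, so accumulate.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("item".equals(qName)) {
            items.add(new LineItem(sku, Integer.parseInt(text.toString().trim())));
        }
    }

    static List<LineItem> read(String xml) throws Exception {
        OrderHandler handler = new OrderHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.items;
    }
}
```

This works well here because each `item` can be understood on its own. Once the meaning of an element depends on its ancestors, the handler has to track nesting explicitly, which is the tedium mentioned above.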

Since high level technologies like XSLT rely on lower level technologies like SAX and DOM, the performance when using those technologies may be impacted by their use of SAX or DOM. JAXP provides support for XSLT engine implementations that accept source input and result output in the form of SAX events. When building complex XML processing pipelines, one can use JAXP SAXTransformerFactory to process the result of another style sheet transformation with a style sheet. Working with SAX events until the last stage in the pipeline will optimize performance by avoiding the creation of in-memory data structures like DOM trees.
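A two-stage pipeline of this kind can be sketched with the standard `javax.xml.transform.sax` classes. The first transformation sends its result as SAX events into a `TransformerHandler` for the second style sheet, so no DOM tree is built by the application between the stages. The class `SaxPipeline` and its method name are illustrative, and how much an XSLT engine buffers internally remains implementation dependent.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: applies two style sheets in sequence over SAX events.
class SaxPipeline {

    static String transformTwice(String xml, String xslt1, String xslt2) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        // Not every implementation is required to support SAX input and output.
        if (!factory.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new IllegalStateException("SAX transformations not supported");
        }
        SAXTransformerFactory saxFactory = (SAXTransformerFactory) factory;

        // Stage 2: consumes SAX events and writes the final text output.
        TransformerHandler second =
                saxFactory.newTransformerHandler(new StreamSource(new StringReader(xslt2)));
        StringWriter out = new StringWriter();
        second.setResult(new StreamResult(out));

        // Stage 1: emits its result as SAX events directly into stage 2.
        Transformer first =
                saxFactory.newTransformer(new StreamSource(new StringReader(xslt1)));
        first.transform(new StreamSource(new StringReader(xml)), new SAXResult(second));

        return out.toString();
    }
}
```

Longer pipelines follow the same pattern: each intermediate stage is a `TransformerHandler` whose result is a `SAXResult` wrapping the next handler, and only the last stage writes to a stream.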

Considering Alternative APIs

JDOM is not a wrapper around DOM, although it shares the same purpose as DOM with regard to XML. It has been made generic enough to address any document model. JDOM has been optimized for Java and, through its use of the Java Collections API, is straightforward for the Java developer. JDOM documents can be built directly from, and converted to, SAX events and DOM trees, allowing JDOM to be seamlessly integrated into XML processing pipelines, in particular as the source or result of XSLT transformations.

dom4j is another alternative API, very similar to JDOM. It additionally comes with tight integration with XPath: the org.dom4j.Node interface, for example, defines methods to select nodes according to an XPath expression. dom4j also implements an event-based processing model that allows it to process large XML documents efficiently. Handlers can be registered to be called back during parsing when XPath expressions are matched, allowing you to immediately process and dispose of parts of the document without waiting for the whole document to be parsed and loaded into memory.

If a document model fits the core data structure of an application, JDOM and dom4j should be seriously considered. Additionally, as opposed to DOM¹, JDOM or dom4j documents are serializable, which gives even more options when architecting complex inter-communicating applications.

Using alternative APIs like JDOM and dom4j, a developer may avoid some performance pitfalls, such as the one described in the second article that arises when accessing elements by their tag names, because these APIs, through their support of the Java Collections API, are more straightforward to use. Since they are lightweight and optimized for Java, you may often expect a noticeable gain in performance.

Be Aware of the Differences in the Implementations

As we highlighted in the second part of this series, implementations differ. Some emphasize functionality, others performance. The pluggability feature of JAXP allows the developer to swap between implementations and select the most appropriate one to meet the application's requirements.
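As a small illustration of this pluggability, the sketch below asks JAXP which implementation classes it has actually selected. The class `JaxpProviders` is a hypothetical helper; the factory calls are standard, and the result depends on the runtime and on configuration such as the `javax.xml.parsers.DocumentBuilderFactory` and `javax.xml.parsers.SAXParserFactory` system properties.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper: reports which JAXP implementations are in use.
class JaxpProviders {

    // Class name of the DOM parser factory that JAXP resolved.
    static String domFactoryClass() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Class name of the SAX parser factory that JAXP resolved.
    static String saxFactoryClass() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("DOM factory: " + domFactoryClass());
        System.out.println("SAX factory: " + saxFactoryClass());
    }
}
```

Because application code only names the abstract factories, switching parsers is a deployment decision (a system property or a different library on the class path) rather than a code change, as long as the code stays within the standard API.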

As an example, when using DOM, a common complaint is the lack of support in the API itself for serialization (that is, transformation of a DOM tree to an XML document). Therefore, it's tempting to step out of the standard API and call implementation-dependent serialization features, at the cost of losing JAXP's pluggability benefits. Below are code samples for serializing a DOM tree to an XML stream with both Xerces and Crimson.
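For comparison with the implementation-specific approaches, one portable route that stays within JAXP is an identity transformation from a `DOMSource` to a `StreamResult`. The sketch below is offered as a point of reference and is not one of this article's own samples; the class name `DomSerializer` is an assumption.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper: serializes a DOM tree using only standard JAXP calls.
class DomSerializer {

    static String toXml(Document doc) throws Exception {
        // newTransformer() with no style sheet yields an identity transformation.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

This keeps the code independent of the parser in use, though it routes serialization through the XSLT engine, so its cost relative to a native serializer is worth measuring.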

Code Sample 1: Serialization with Xerces relies on a separate API which is packaged along with the DOM implementation
