Java API for XML Processing (JAXP) Tutorial

Chapter 1

Introduction to JAXP

The Java API for XML Processing (JAXP) is for processing XML data using applications written in the Java programming language. JAXP leverages the parser standards Simple API for XML Parsing (SAX) and Document Object Model (DOM) so that you can choose to parse your data as a stream of events or to build an object representation of it. JAXP also supports the Extensible Stylesheet Language Transformations (XSLT) standard, giving you control over the presentation of the data and enabling you to convert the data to other XML documents or to other formats, such as HTML. JAXP also provides namespace support, allowing you to work with DTDs that might otherwise have naming conflicts. Finally, as of version 1.4, JAXP implements the Streaming API for XML (StAX) standard.

Designed to be flexible, JAXP allows you to use any XML-compliant parser from within your application. It does this with what is called a pluggability layer, which lets you plug in an implementation of the SAX or DOM API. The pluggability layer also allows you to plug in an XSL processor, letting you control how your XML data is displayed.

The JAXP APIs

The main JAXP APIs are defined in the javax.xml.parsers and javax.xml.transform packages. These packages contain vendor-neutral factory classes: SAXParserFactory and DocumentBuilderFactory in javax.xml.parsers, and TransformerFactory in javax.xml.transform. The factories give you a SAXParser, a DocumentBuilder, and an XSLT transformer, respectively. DocumentBuilder, in turn, creates a DOM-compliant Document object.

The factory APIs let you plug in an XML implementation offered by another vendor without changing your source code. The implementation you get depends on the setting of the javax.xml.parsers.SAXParserFactory, javax.xml.parsers.DocumentBuilderFactory, and javax.xml.transform.TransformerFactory system properties. You can set these properties by calling System.setProperty() in the code, by using <sysproperty key="..." value="..."/> in an Ant build script, or by passing -DpropertyName="..." on the command line. The default values (unless overridden at runtime on the command line or in the code) point to Sun's implementation.
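The following is a minimal sketch of the factory lookup described above, not a definitive recipe. The class name FactoryLookup and its helper method are hypothetical names chosen for illustration. The sketch only reads the system properties and reports which implementation classes were selected, because those class names depend on the platform and on any overrides in effect.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;

// Hypothetical demo class: shows which factory implementations JAXP selected.
class FactoryLookup {

    /** Returns the concrete class names of the three factories chosen at runtime. */
    static String[] implementationNames() {
        return new String[] {
            SAXParserFactory.newInstance().getClass().getName(),
            DocumentBuilderFactory.newInstance().getClass().getName(),
            TransformerFactory.newInstance().getClass().getName()
        };
    }

    public static void main(String[] args) throws Exception {
        // A null value means no override is set, so the platform default applies.
        System.out.println("SAX override: "
                + System.getProperty("javax.xml.parsers.SAXParserFactory"));
        System.out.println("DOM override: "
                + System.getProperty("javax.xml.parsers.DocumentBuilderFactory"));
        System.out.println("XSLT override: "
                + System.getProperty("javax.xml.transform.TransformerFactory"));

        for (String name : implementationNames()) {
            System.out.println("Selected factory: " + name);
        }

        // Each factory hands out the corresponding worker object.
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        System.out.println(saxParser.getClass().getName());
        System.out.println(builder.getClass().getName());
        System.out.println(transformer.getClass().getName());
    }
}
```

Running the same program with a -D option that names another vendor's factory class would change the reported class names without any change to this source code.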

Overview of the Packages

The SAX and DOM APIs are defined by the XML-DEV group and by the W3C, respectively. The libraries that define those APIs are as follows:

  • javax.xml.parsers: The JAXP APIs, which provide a common interface for different vendors' SAX and DOM parsers.
  • org.w3c.dom: Defines the Document interface (a DOM) as well as interfaces for all the components of a DOM.
  • org.xml.sax: Defines the basic SAX APIs.
  • javax.xml.transform: Defines the XSLT APIs that let you transform XML into other forms.
  • javax.xml.stream: Defines the StAX APIs for pull-parsing and writing XML documents.

The Simple API for XML (SAX) is the event-driven, serial-access mechanism that does element-by-element processing. The API for this level reads and writes XML to a data repository or the web. For server-side and high-performance applications, you will want to fully understand this level. But for many applications, a minimal understanding will suffice.

The DOM API is generally an easier API to use. It provides a familiar tree structure of objects. You can use the DOM API to manipulate the hierarchy of application objects it encapsulates. The DOM API is ideal for interactive applications because the entire object model is present in memory, where it can be accessed and manipulated by the user.

On the other hand, constructing the DOM requires reading the entire XML structure and holding the object tree in memory, so it is much more CPU- and memory-intensive. For that reason, the SAX API tends to be preferred for server-side applications and data filters that do not require an in-memory representation of the data.

The XSLT APIs defined in javax.xml.transform let you write XML data to a file or convert it into other forms. As shown in the XSLT section of this tutorial, you can even use it in conjunction with the SAX APIs to convert legacy data to XML.

Finally, the StAX APIs defined in javax.xml.stream provide a streaming Java technology-based, event-driven, pull-parsing API for reading and writing XML documents. StAX offers a simpler programming model than SAX and more efficient memory management than DOM.

Simple API for XML APIs

The basic outline of the SAX parsing APIs is shown in Figure 1-1. To start the process, an instance of the SAXParserFactory class is used to generate an instance of the parser.

Figure 1-1 SAX APIs

The parser wraps an XMLReader object (the org.xml.sax.XMLReader interface, often loosely called the SAX reader). When the parser's parse() method is invoked, the reader invokes one of several callback methods implemented in the application. Those methods are defined by the interfaces ContentHandler, ErrorHandler, DTDHandler, and EntityResolver.

Here is a summary of the key SAX APIs:

SAXParserFactory

A SAXParserFactory object creates an instance of the parser determined by the system property, javax.xml.parsers.SAXParserFactory.

SAXParser

The SAXParser class defines several kinds of parse() methods. In general, you pass an XML data source and a DefaultHandler object to the parser, which processes the XML and invokes the appropriate methods in the handler object.

XMLReader

The SAXParser wraps an XMLReader (the SAX reader). Typically, you do not care about that, but every once in a while you need to get hold of it using SAXParser's getXMLReader() so that you can configure it. It is the XMLReader that carries on the conversation with the SAX event handlers you define.

DefaultHandler

Not shown in the diagram, a DefaultHandler implements the ContentHandler, ErrorHandler, DTDHandler, and EntityResolver interfaces (with null methods), so you can override only the ones you are interested in.

ContentHandler

Methods such as startDocument, endDocument, startElement, and endElement are invoked when an XML tag is recognized. This interface also defines the methods characters() and processingInstruction(), which are invoked when the parser encounters the text in an XML element or an inline processing instruction, respectively.

ErrorHandler

Methods error(), fatalError(), and warning() are invoked in response to various parsing errors. The default error handler throws an exception for fatal errors and ignores other errors (including validation errors). This is one reason you need to know something about the SAX parser, even if you are using the DOM. Sometimes, the application may be able to recover from a validation error. Other times, it may need to generate an exception. To ensure the correct handling, you will need to supply your own error handler to the parser.

DTDHandler

Defines methods you will generally never be called upon to use. Used when processing a DTD to recognize and act on declarations for an unparsed entity.

EntityResolver

The resolveEntity method is invoked when the parser must identify data identified by a URI. In most cases, a URI is simply a URL, which specifies the location of a document, but in some cases the document may be identified by a URN, which is a public identifier, or name, that is unique in the web space. The public identifier may be specified in addition to the URL. The EntityResolver can then use the public identifier instead of the URL to find the document, for example, to access a local copy of the document if one exists.

A typical application implements most of the ContentHandler methods, at a minimum. Because the default implementations of the interfaces ignore all inputs except for fatal errors, a robust implementation may also want to implement the ErrorHandler methods.
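The SAX flow summarized above can be sketched as follows. This is an illustrative example under stated assumptions, not the tutorial's own sample: the class name SaxElementCounter, its accessor methods, and the inline XML string are hypothetical. The handler extends DefaultHandler, overrides two ContentHandler callbacks, and overrides error() so that recoverable errors are no longer silently ignored.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts elements and collects character data.
class SaxElementCounter extends DefaultHandler {
    private int elements;
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        elements++; // called once for every start tag
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one run of text, so append.
        text.append(ch, start, length);
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        // DefaultHandler ignores recoverable errors; here they stop the parse.
        throw e;
    }

    int elementCount() { return elements; }

    String text() { return text.toString(); }

    /** Parses the given XML string and returns the populated handler. */
    static SaxElementCounter parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();
        SaxElementCounter handler = new SaxElementCounter();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        SaxElementCounter h = parse("<colors><color>blue</color><color>red</color></colors>");
        System.out.println(h.elementCount() + " elements, text: " + h.text());
    }
}
```

Because the handler keeps only a counter and a text buffer, memory use does not grow with the depth or structure of the document, which is the property that makes SAX attractive for server-side processing.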

SAX Packages

The SAX parser is defined in the packages listed in Table 1-1.

Table 1-1 SAX Packages

Package Description
org.xml.sax Defines the SAX interfaces. The name org.xml is the package prefix that was settled on by the group that defined the SAX API.
org.xml.sax.ext Defines SAX extensions that are used for doing more sophisticated SAX processing, for example, to process a document type definition (DTD) or to see the detailed syntax for a file.
org.xml.sax.helpers Contains helper classes that make it easier to use SAX, for example, by defining a default handler that has null methods for all the interfaces, so that you only need to override the ones you actually want to implement.
javax.xml.parsers Defines the SAXParserFactory class, which returns the SAXParser. Also defines exception classes for reporting errors.

Document Object Model APIs

Figure 1-2 shows the DOM APIs in action.

Figure 1-2 DOM APIs

You use the javax.xml.parsers.DocumentBuilderFactory class to get a DocumentBuilder instance, and you use that instance to produce a Document object that conforms to the DOM specification. The builder you get, in fact, is determined by the system property javax.xml.parsers.DocumentBuilderFactory, which selects the factory implementation that is used to produce the builder. (The platform's default value can be overridden from the command line.)

You can also use the DocumentBuilder newDocument() method to create an empty Document that implements the org.w3c.dom.Document interface. Alternatively, you can use one of the builder's parse methods to create a Document from existing XML data. The result is a DOM tree like that shown in Figure 1-2.
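Both ways of obtaining a Document can be sketched as follows. The class name DomBuilding, its method names, and the palette and color element names are hypothetical choices made for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class: two ways to obtain an org.w3c.dom.Document.
class DomBuilding {

    static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    /** Builds a DOM tree from existing XML data. */
    static Document parse(String xml) throws Exception {
        return newBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Starts from an empty Document and adds nodes programmatically. */
    static Document createPalette() throws Exception {
        Document doc = newBuilder().newDocument();
        Element root = doc.createElement("palette");
        Element color = doc.createElement("color");
        color.setTextContent("blue");
        root.appendChild(color);
        doc.appendChild(root);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document parsed = parse("<palette><color>blue</color></palette>");
        System.out.println(parsed.getDocumentElement().getTagName());

        Document created = createPalette();
        System.out.println(created.getDocumentElement().getFirstChild().getTextContent());
    }
}
```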

Note - Although they are called objects, the entries in the DOM tree are actually fairly low-level data structures. For example, consider this structure: <color>blue</color>. There is an element node for the color tag, and under that there is a text node that contains the data, blue. This issue will be explored at length in the DOM chapter of this tutorial, but developers who are expecting objects are usually surprised to find that invoking getNodeValue() on the element node returns null. For a truly object-oriented tree, see the JDOM API at http://www.jdom.org.
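A short sketch makes the point of the note concrete. The class name DomNodeValues and its method are hypothetical. The sketch parses the <color>blue</color> structure and compares getNodeValue() on the element node with getNodeValue() on its text child.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: where the data of <color>blue</color> actually lives.
class DomNodeValues {

    /** Returns { element node value, text node value, element text content }. */
    static String[] values(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element color = doc.getDocumentElement();
        Node textNode = color.getFirstChild(); // the text node under the element
        return new String[] {
            color.getNodeValue(),     // null: element nodes carry no value
            textNode.getNodeValue(),  // the character data itself
            color.getTextContent()    // convenience: concatenated descendant text
        };
    }

    public static void main(String[] args) throws Exception {
        String[] v = values("<color>blue</color>");
        System.out.println("element.getNodeValue(): " + v[0]);
        System.out.println("text.getNodeValue():    " + v[1]);
        System.out.println("element.getTextContent(): " + v[2]);
    }
}
```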

DOM Packages

The Document Object Model implementation is defined in the packages listed in Table 1-2.

Table 1-2 DOM Packages

Package Description
org.w3c.dom Defines the DOM programming interfaces for XML (and, optionally, HTML) documents, as specified by the W3C.
javax.xml.parsers Defines the DocumentBuilderFactory class and the DocumentBuilder class, which returns an object that implements the W3C Document interface. The factory that is used to create the builder is determined by the javax.xml.parsers.DocumentBuilderFactory system property, which can be set from the command line or overridden when invoking the newInstance method. This package also defines the ParserConfigurationException class for reporting errors.

Extensible Stylesheet Language Transformations APIs

Figure 1-3 shows the XSLT APIs in action.

Figure 1-3 XSLT APIs

A TransformerFactory object is instantiated and used to create a Transformer. The source object is the input to the transformation process. A source object can be created from a SAX reader, from a DOM, or from an input stream.

Similarly, the result object is the result of the transformation process. That object can be a SAX event handler, a DOM, or an output stream.

When the transformer is created, it can be created from a set of transformation instructions, in which case the specified transformations are carried out. If it is created without any specific instructions, then the transformer object simply copies the source to the result.
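The two ways of creating a transformer can be sketched as follows. The class name TransformSketch, its method names, and the small inline stylesheet are hypothetical and chosen only for illustration. Stream-based source and result objects are used here. A DOM or SAX source or result would be passed to transform() in the same way.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: identity copy versus stylesheet-driven transformation.
class TransformSketch {

    // Assumed stylesheet for illustration: emits the text of the <color> root element.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'><xsl:value-of select='/color'/></xsl:template>"
        + "</xsl:stylesheet>";

    private static String run(Transformer transformer, String xml) throws Exception {
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    /** No instructions given: the transformer copies the source to the result. */
    static String copy(String xml) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        return run(identity, xml);
    }

    /** Instructions given: the transformer applies the stylesheet. */
    static String applyStylesheet(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        return run(transformer, xml);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(copy("<color>blue</color>"));
        System.out.println(applyStylesheet("<color>blue</color>"));
    }
}
```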

XSLT Packages

The XSLT APIs are defined in the packages shown in Table 1-3.

Table 1-3 XSLT Packages

Package Description
javax.xml.transform Defines the TransformerFactory and Transformer classes, which you use to get an object capable of doing transformations. After creating a transformer object, you invoke its transform() method, providing it with an input (source) and output (result).
javax.xml.transform.dom Classes to create input (source) and output (result) objects from a DOM.
javax.xml.transform.sax Classes to create input (source) objects from a SAX parser and output (result) objects from a SAX event handler.
javax.xml.transform.stream Classes to create input (source) objects and output (result) objects from an I/O stream.

Streaming API for XML APIs

StAX is the latest API in the JAXP family, and provides an alternative to SAX, DOM, and TrAX for developers looking to do high-performance stream filtering, processing, and modification, particularly with low memory and limited extensibility requirements.

To summarize, StAX provides a standard, bidirectional pull parser interface for streaming XML processing, offering a simpler programming model than SAX and more efficient memory management than DOM. StAX enables developers to parse and modify XML streams as events, and to extend XML information models to allow application-specific additions. More detailed comparisons of StAX with several alternative APIs are provided in Chapter 5, Streaming API for XML, in Comparing StAX to Other JAXP APIs.
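The pull model can be sketched briefly. The class name StaxPullSketch and its method are hypothetical. Unlike SAX, where the parser calls back into the application, here the application asks the reader for the next event whenever it is ready for one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: the application pulls events from the parser.
class StaxPullSketch {

    /** Returns the local names of all start elements, in document order. */
    static List<String> elementNames(String xml) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                int event = reader.next(); // advance only when the caller asks
                if (event == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
                elementNames("<colors><color>blue</color><color>red</color></colors>"));
    }
}
```

Because the loop is ordinary application code, it can stop early, skip events, or hand the reader to another method, which is harder to arrange with SAX callbacks.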

StAX Packages

The StAX APIs are defined in the packages shown in Table 1-4.

Table 1-4 StAX Packages

Package Description
javax.xml.stream Defines the XMLStreamReader interface, which is used to iterate over the elements of an XML document. The XMLStreamWriter interface specifies how the XML should be written.
javax.xml.transform.stax Provides StAX-specific transformation APIs.
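For the writing side, a minimal sketch with XMLStreamWriter might look like this. The class name StaxWriteSketch, its method, and the color element name are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class: writes a single element with XMLStreamWriter.
class StaxWriteSketch {

    /** Writes <color>value</color>, letting the writer escape the text. */
    static String writeColor(String value) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartElement("color");
        writer.writeCharacters(value); // markup characters in value are escaped
        writer.writeEndElement();
        writer.flush();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(writeColor("blue"));
    }
}
```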

Finding the JAXP Sample Programs

A set of JAXP sample programs is provided in the JAXP download bundle. After you install JAXP, the sample programs are found in the directory INSTALL_DIR/jaxp-version/samples.

The sample programs are intended to be run on the Java Platform, Standard Edition (Java SE) version 6.

Where Do You Go from Here?

At this point, you have enough information to begin picking your own way through the JAXP libraries. Your next step depends on what you want to accomplish. You might want to go to any of these chapters:

  • If the data structures have already been determined, and you are writing a server application or an XML filter that needs to do fast processing, see Chapter 2, Simple API for XML.
  • If you need to build an object tree from XML data so you can manipulate it in an application, or convert an in-memory tree of objects to XML, see Chapter 3, Document Object Model.
  • If you need to transform XML tags into some other form, if you want to generate XML output, or (in combination with the SAX API) if you want to convert legacy data structures to XML, see Chapter 4, Extensible Stylesheet Language Transformations.
  • If you want a streaming Java technology-based, event-driven, pull-parsing API for reading and writing XML documents or want to create bidirectional XML parsers that are fast, relatively easy to program, and have a light memory footprint, then see Chapter 5, Streaming API for XML.