Easy and Efficient XML Processing: Upgrade to JAXP 1.3

By Neeraj Bajaj, October 11, 2005    


This article explains some of the new concepts and important features introduced in the Java API for XML Processing (JAXP) 1.3. JSR 206 was developed with performance and ease of use in mind. The new Validation Framework gives much more power to any application dealing with XML schema and improves performance significantly. XPath APIs provide access to the XPath evaluation environment. JAXP 1.3 brings richer XML Schema data type support to the Java platform by defining new data types that map to data types defined in the W3C XML Schema: Datatypes specification.

Keeping pace with the evolution of XML standards, JAXP 1.3 also adds complete support for the following standards: XML 1.1, Document Object Model (DOM) L3, XInclude, and Simple API for XML (SAX) 2.0.2. All this has already gone into the Java platform in the latest release of the Java 2 Platform, Standard Edition (J2SE) 5.0, code-named Tiger. If you are using J2SE 1.3 or 1.4, you can download a stand-alone stable implementation of JAXP 1.3 from java.net.

This article mainly concentrates on the work done as part of the JSR 206 effort and explains new Schema Validation Framework concepts, along with providing working code and diagrams. All the samples are available for download from here. The major new features introduced are the following:

Schema Validation Framework

JAXP 1.3 introduces a new schema-independent Validation Framework (called the Validation APIs). This new framework gives much more power to the application dealing with XML schema and can accomplish things that were not possible before. The new approach makes a fundamental shift in the way XML processing and validation are performed. Validation used to be considered an integral part of XML parsing, and previous versions of JAXP supported validation as a feature of an XML parser: a SAXParser or DocumentBuilder instance.

The new Validation APIs treat the validation of an instance document as a process independent of parsing. This new approach has several advantages. Applications that rely heavily on XML schema can greatly improve the performance of schema validation. Perhaps more importantly, many previously unsolvable problems can now be solved in an efficient, easy, and secure way. Let's look at what you can do with the new Schema Validation Framework.

Validate XML Against Any Schema

Though JAXP 1.3 requires support only for W3C XML schema language, you can easily plug in support for other schema languages, such as RELAX NG. The Validation APIs provide a pluggability layer through which applications can provide specialized validation libraries supporting additional schema languages. This is achieved using a SchemaFactory class that is capable of locating implementations for the schema languages at runtime. The first step is to specify the schema language to be used and obtain the concrete factory implementation:

SchemaFactory sf = SchemaFactory.newInstance(<SCHEMA LANGUAGE>);

Here, <SCHEMA LANGUAGE> could be W3C XML Schema, RELAX NG, and so on.

If this function returns successfully, it means that an implementation capable of supporting the specified schema language is available. Getting the SchemaFactory implementation is the entry point to the Validation APIs. This step goes through the pluggability mechanism that has long been at the core of JAXP. You can write the code in such a way that applications can switch between W3C XML Schema and RELAX NG validation without changing a single line of code.
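As a minimal sketch of the lookup step described above, the following probes whether a SchemaFactory implementation can be located for a given schema language URI. The class name SchemaLanguageProbe and the method isSupported are illustrative names, not part of JAXP. The sketch relies on SchemaFactory.newInstance() throwing IllegalArgumentException when no implementation is available.

```java
import javax.xml.XMLConstants;
import javax.xml.validation.SchemaFactory;

final class SchemaLanguageProbe {

    // Returns true when a SchemaFactory for the given schema language URI can be located.
    static boolean isSupported(String schemaLanguage) {
        try {
            SchemaFactory.newInstance(schemaLanguage);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown when no implementation of the schema language is available.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("W3C XML Schema supported: "
                + isSupported(XMLConstants.W3C_XML_SCHEMA_NS_URI));
        System.out.println("RELAX NG supported: "
                + isSupported(XMLConstants.RELAXNG_NS_URI));
    }
}
```

Whether the RELAX NG probe succeeds depends on which validation libraries are present on the class path.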

Compile Schema

With the new Validation APIs, an application has the option to parse only the schema, checking schema syntax and semantics against the constraints that the particular schema language imposes. This is quite useful when you are writing a schema and want to make sure that the schema conforms to the specification. The SchemaFactory class does this job, loading the schemas and also preparing them in a special form represented as a javax.xml.validation.Schema object that can be used for validating instance documents against the schema. A schema may include or import other schemas. In that case, those schemas are also loaded.

When reading a schema, a SchemaFactory may need to resolve resources and can encounter errors. As Figure 1 indicates, an LSResourceResolver and an ErrorHandler can be registered on SchemaFactory. The ErrorHandler is used to report any errors encountered during schema compilation. The LSResourceResolver is used to customize resolution of resources. This is a new interface introduced as part of DOM L3. Functionally, it is the same as SAX EntityResolver, except that it also provides the information about the namespace of the resource being resolved -- for example, the targetNamespace of the W3C XML schema.

Figure 1. Getting the Schema Object

Here is a code sample that shows how SchemaFactory can be used to compile schema and get a Schema object:

String language = XMLConstants.W3C_XML_SCHEMA_NS_URI;
SchemaFactory factory = SchemaFactory.newInstance(language); 
factory.setErrorHandler(new MyErrorHandler()); 
factory.setResourceResolver( new MyLSResourceResolver()); 
StreamSource ss = new StreamSource(new File("mySchema.xsd"));
Schema schema = factory.newSchema(ss);

A Schema object is an immutable in-memory representation of a schema. A Schema instance can be shared with many different parser instances, even if they are running in different threads. You can write applications so that the same set of schemas is parsed only once and the same Schema instance is passed to different instances of the parser.
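The compile step can be tried without any files on disk. The sketch below compiles a schema held in a string and collects whatever the registered ErrorHandler reports. The class SchemaCompileCheck, its problems() method, and the two tiny sample schemas are assumptions made for illustration; they are not part of the downloadable samples.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class SchemaCompileCheck {

    // Hypothetical sample schemas: the second refers to a type that does not exist.
    static final String GOOD_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='note' type='xs:string'/>"
        + "</xs:schema>";
    static final String BAD_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='note' type='xs:nosuchtype'/>"
        + "</xs:schema>";

    // Compiles the schema text and returns the errors found; an empty list means none.
    static List<String> problems(String schemaText) {
        final List<String> found = new ArrayList<String>();
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        factory.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* warnings are ignored in this sketch */ }
            public void error(SAXParseException e) { found.add("error: " + e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        try {
            factory.newSchema(new StreamSource(new StringReader(schemaText)));
        } catch (SAXException e) {
            // Fatal errors, and any error the implementation chooses to rethrow, end up here.
            found.add("fatal: " + e.getMessage());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println("good schema: " + problems(GOOD_XSD));
        System.out.println("bad schema: " + problems(BAD_XSD));
    }
}
```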

Validate XML Using Compiled Schema

Before we look at this approach, let's look at how we have been doing schema validation using the two schema properties defined in JAXP 1.2, schemaLanguage and schemaSource.

Here is an example showing how these two properties are used in JAXP 1.3:

SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setNamespaceAware(true);
spf.setValidating(true);
SAXParser sp = spf.newSAXParser();
sp.setProperty("http://java.sun.com/xml/jaxp/properties/schemaLanguage",
"http://www.w3.org/2001/XMLSchema");
sp.setProperty("http://java.sun.com/xml/jaxp/properties/schemaSource",
"mySchema.xsd");
sp.parse(<XML Document>, <ContentHandler>);

The user sets the schemaLanguage and/or the schemaSource property on SAXParser and sets the validation to true. Generally, a business application defines a set of schemas containing the business rules against which XML documents must be validated. To accomplish this, an application sets the schema using the schemaSource property or relies on the xsi:schemaLocation attribute in the instance document to specify the schema location(s).

This approach works well, but there is a tremendous performance penalty: The specified schemas are loaded again and again for every XML document that needs to be validated! However, with the new Validation APIs, an application needs to parse a set of schemas only once. See Figure 2.

Figure 2. Set Compiled Schema on DocumentBuilder/SAXParserFactory

After the Compile Schema step, do the following.

SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setSchema(schema);
SAXParser saxParser = spf.newSAXParser();
saxParser.parse(new File("instance.xml"), myHandler);

Just set the Schema instance on the factory and you are done. There is no need to set the validation to true and no need to set the schemaLanguage or schemaSource property. Validation of XML documents is done against the compiled schema set on the factory. You will be amazed by the performance gain using this approach. Try it yourself.

Run the sample ComparePerformance.java, which can be downloaded from here. Performance gain largely depends on the ratio of the size of the XML schema to the size of the XML document. Larger ratios lead to a larger performance gain. Look at the Reusing a Parser Instance section to further improve the performance.
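To make the compile-once idea concrete, here is a small sketch that compiles a schema a single time, sets it on a SAXParserFactory, and then validates several in-memory documents. The class CompiledSchemaParsing, the one-element schema, and the Strict handler are hypothetical names chosen for this illustration. Note that DefaultHandler silently ignores validation errors, so the sketch overrides error() to make them visible. The factory is also made namespace aware, which is a cautious choice for schema validation rather than something the text above requires.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class CompiledSchemaParsing {

    // Hypothetical schema: a single element holding an integer.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='quantity' type='xs:int'/>"
        + "</xs:schema>";

    // Counts how many documents are valid; the schema is compiled only once.
    static int countValid(String... documents) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            spf.setSchema(schema);
            int valid = 0;
            for (String document : documents) {
                SAXParser parser = spf.newSAXParser();
                try {
                    parser.parse(new InputSource(new StringReader(document)), new Strict());
                    valid++;
                } catch (SAXException e) {
                    // The document is invalid or not well formed; it is not counted.
                }
            }
            return valid;
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Turns validation errors into exceptions instead of ignoring them.
    static final class Strict extends DefaultHandler {
        @Override
        public void error(SAXParseException e) throws SAXException {
            throw e;
        }
    }

    public static void main(String[] args) {
        System.out.println(countValid("<quantity>3</quantity>", "<quantity>three</quantity>"));
    }
}
```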

Note that it is an error to use either the schemaLanguage or the schemaSource property in conjunction with a non-null Schema object. Such a configuration will cause a SAXException when those properties are set on SAXParser or DocumentBuilderFactory.

Validate a SAXSource or DOMSource

As we mentioned earlier, there has been a fundamental shift in XML parsing and validation. Now XML validation is considered a process independent from XML parsing. Once you have the Schema instance loaded into memory, you can do many things. You can create a ValidatorHandler that can validate a SAX stream or create a stand-alone Validator (see Figure 3). A stand-alone Validator can validate a SAXSource, a DOMSource, or an XML document against any schema. In fact, a Validator can still work if the SAX stream or DOM object comes from a different implementation.

Figure 3. Validate a SAXSource or DOMSource Using a Validator

To receive any errors during the validation, an ErrorHandler should be registered with the Validator. Let's look at some working code. (Note: For clarity, only a section of code is shown here. For the complete source, look at the sample Validate.java, which can be downloaded here.)

Validator validator = schema.newValidator();
validator.setErrorHandler( new ErrorHandlerImpl());
validator.validate(new StreamSource(<XML Document>));

Validator can also be used to validate the instance document or DOM object in memory, with the augmented result sent to DOMResult.

Document document = //DOM object
validator.validate(new DOMSource(document), new DOMResult());
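For readers who want to run the DOM case end to end, the following sketch builds a DOM from a string and validates it with a stand-alone Validator, collecting the reported errors. The class DomValidation, its validate() method, and the one-element schema are illustrative assumptions, not code from Validate.java.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class DomValidation {

    // Hypothetical schema: an age that must not be negative.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='age' type='xs:nonNegativeInteger'/>"
        + "</xs:schema>";

    // Parses the XML into a DOM, validates the DOM, and returns the errors reported.
    static List<String> validate(String xml) {
        final List<String> errors = new ArrayList<String>();
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document document = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            validator.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            validator.validate(new DOMSource(document));
        } catch (SAXException e) {
            errors.add(String.valueOf(e.getMessage()));
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("42 -> " + validate("<age>42</age>"));
        System.out.println("-1 -> " + validate("<age>-1</age>"));
    }
}
```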

The Validation APIs can validate a SAX stream and work in conjunction with Transformation APIs to achieve pipeline processing, as we will see in the next section.

Validate XML After Transformation

Transformation APIs are used to transform one XML document into another by applying a style sheet. There are times when we need to validate the transformed XML document against a schema. Should we feed that XML document to a parser and then use the schema feature to do the schema validation? No. The new Validation APIs give you the power to validate the transformed XML document against a different schema by allowing the application to create a pipeline and pass the output of a transformer to the Validation APIs to validate against the desired schema. It doesn't matter if the output of the transformation is a SAX stream or a DOM in memory.

Validate a SAX Stream

The following code snippet shows you how to use the specially designed javax.xml.validation.ValidatorHandler to validate a SAX stream. In the downloadable source, look at the sample ValidateSAXStream.java for more detail. Also look at the sample TransformerValidationHandler.java, which shows how to chain the output of Transformer to ValidatorHandler. Here is a section of the code:

String language = XMLConstants.W3C_XML_SCHEMA_NS_URI;
SchemaFactory sf = SchemaFactory.newInstance(language);
Schema schema = sf.newSchema(new File(<SCHEMA>));
ValidatorHandler vh = schema.newValidatorHandler();
vh.setErrorHandler(new ErrorHandlerImpl());
vh.setContentHandler(new ApplicationContentHandler());
TransformerFactory tf = TransformerFactory.newInstance();
StreamSource ss = new StreamSource(<STYLESHEET>);
Transformer t = tf.newTransformer(ss);
StreamSource xml = new StreamSource(<XML DOCUMENT>);
t.transform(xml, new SAXResult(vh));

Figure 4 shows the whole flow, with an XML document and a style sheet given as input to a Transformer and a SAX stream as the output. We take advantage of the modular approach of doing validation independent from parsing. The ValidatorHandler is a special handler that is capable of working directly with a SAX stream. It validates the stream and passes it to the application.

Figure 4. Validating a SAX Stream
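The pipeline in Figure 4 can be sketched with in-memory inputs as follows. The stylesheet, the schema, and the class TransformThenValidate are invented for this illustration: the stylesheet turns an items/item document into a list/entry document, and the schema describes the transformed shape, which must contain at least one entry.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class TransformThenValidate {

    // Hypothetical stylesheet: renames items/item to list/entry.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/items'><list><xsl:apply-templates/></list></xsl:template>"
        + "<xsl:template match='item'><entry><xsl:value-of select='.'/></entry></xsl:template>"
        + "</xsl:stylesheet>";

    // Hypothetical schema for the transformed document.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='list'><xs:complexType><xs:sequence>"
        + "<xs:element name='entry' type='xs:string' maxOccurs='unbounded'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    // Transforms the input and validates the SAX output of the transformer.
    static List<String> errorsAfterTransform(String xml) {
        final List<String> errors = new ArrayList<String>();
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            ValidatorHandler vh = schema.newValidatorHandler();
            vh.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            // The transformer writes SAX events straight into the ValidatorHandler.
            t.transform(new StreamSource(new StringReader(xml)), new SAXResult(vh));
        } catch (SAXException | TransformerException e) {
            errors.add(String.valueOf(e.getMessage()));
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(errorsAfterTransform("<items><item>a</item></items>"));
        System.out.println(errorsAfterTransform("<items/>"));
    }
}
```

No downstream ContentHandler is set on the ValidatorHandler here, so the validated events are simply discarded; an application would normally register its own handler to receive them.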

Validate DOM in Memory

The Transformation APIs also allow a transformed result to be obtained as a DOM object. The DOM object in memory can be validated against a schema. This can be done as follows:

DOMResult dr = new DOMResult();
t.transform(xml, dr);
DOMSource ds = new DOMSource(dr.getNode());
schema.newValidator().validate(ds);

So you see that the Validation APIs can be used with the Transformation APIs to do complex things easily. This approach also boosts performance because it avoids the step of parsing the XML again when validating a transformed XML document.

Validate a JDOM Document

The ValidatorHandler can be used to validate various object models such as JDOM against the schema(s). In fact, any object model (XOM, DOM4J, and so on) that can be built on top of a SAX stream or can emit SAX events can be used with the Schema Validation Framework to validate an XML document against a schema. This is possible because ValidatorHandler can validate a SAX stream.

Let's see how a JDOM document can be validated against schema(s):

SAXOutputter so = new SAXOutputter(vh);
so.output(jdomDocument);

It is that simple. JDOM has a way to output a JDOM document as a stream of SAX events. SAXOutputter fires SAX events that are validated by ValidatorHandler. Any error encountered is reported through ErrorHandler set on ValidatorHandler.

Obtain Schema Type Information

ValidatorHandler can give access to TypeInfoProvider, which can be queried to access the type information determined by the validator. This object is dynamic in nature and returns the type information of the current element or attribute assessed by the ValidatorHandler during validation of the XML document. This interface allows an application to know three things:

  • Whether the attribute is declared as an ID type

  • Whether the attribute was declared in the original XML document or was added by Validator during validation

  • What type information of the element or attribute as declared in the schema is associated with the document

Type information is returned as an org.w3c.dom.TypeInfo object, which is defined as part of DOM L3. The TypeInfo object returned is immutable, and the caller can keep references to the obtained TypeInfo object longer than the callback scope. The methods of this interface may only be called by the startElement event of the ContentHandler that the application sets on the ValidatorHandler. For example, look at the section of the code below. (Note: For clarity, only part of the code is shown here. For the complete source, look at the sample SchemaTypeInformation.java, which can be downloaded from here.)

ValidatorHandler vh = schema.newValidatorHandler(); 
vh.setErrorHandler(eh);
vh.setContentHandler(new MyContentHandler(vh.getTypeInfoProvider()));
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setNamespaceAware(true);
XMLReader reader = spf.newSAXParser().getXMLReader();
reader.setContentHandler(vh);
reader.parse(<XML Document>);
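The snippet above passes the TypeInfoProvider to a MyContentHandler whose body is not shown. As one possible shape for such a handler, the sketch below records the schema type name of each element from inside startElement. The class ElementTypes, its typesOf() method, and the item/price schema are assumptions made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.TypeInfoProvider;
import javax.xml.validation.ValidatorHandler;
import org.w3c.dom.TypeInfo;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class ElementTypes {

    // Hypothetical schema: an item with a decimal price.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='item'><xs:complexType><xs:sequence>"
        + "<xs:element name='price' type='xs:decimal'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    // Maps each element's local name to the schema type name seen during validation.
    static Map<String, String> typesOf(String xml) {
        final Map<String, String> types = new LinkedHashMap<String, String>();
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            ValidatorHandler vh = schema.newValidatorHandler();
            final TypeInfoProvider provider = vh.getTypeInfoProvider();
            vh.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                                         Attributes atts) {
                    // The provider may only be queried from within this callback.
                    TypeInfo info = provider.getElementTypeInfo();
                    if (info != null) {
                        types.put(localName, info.getTypeName());
                    }
                }
            });
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();
            reader.setContentHandler(vh);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println(typesOf("<item><price>9.99</price></item>"));
    }
}
```

The name reported for the anonymous complex type of item is implementation specific, so only named types such as xs:decimal should be relied on.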

Ensure Data Security

Validating an XML document against an untrusted schema could have serious consequences, as validation may modify the actual data by adding default attributes and possibly corrupting the data. Validation against an untrusted schema may also mean that an incoming instance document might not conform to your business's constraints or rules.

With the new Validation APIs, getting a Schema instance is the first step before being able to validate an instance document, and it is the application that determines how to create the Schema instance. Validation using the Schema instance makes sure that an incoming instance document is not validated against any other (untrusted) schema(s) but only against the schema(s) from which the instance is created. If the instance XML document has elements or attributes that refer to schema(s) from a different targetNamespace and are not part of javax.xml.validation.Schema representation, an error will be thrown. This approach protects you from accidental mistakes and malicious documents.

Reusing a Parser Instance

Is it possible to use the same parser instance to parse multiple XML documents? This was not clear, and the behavior was implementation dependent. JAXP 1.3 has added the new function reset() on SAXParser, DocumentBuilder, and Transformer. This guarantees that the same instance can be reused. The reset function improves the overall performance by saving resources, time associated with creating memory instances, and garbage collection time. Let's see how the reset() function can be used.

SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setSchema(schema);
SAXParser saxParser = spf.newSAXParser();
for (int i = 0; i < args.length; i++) {
    saxParser.parse(new File(args[i]), myHandler);
    // return the parser to its factory settings before the next document
    saxParser.reset();
}

The same function has also been added to the newly designed javax.xml.validation.Validator, as well as to javax.xml.xpath.XPath. Applications are encouraged to reuse the parser, transformer, validator, and XPath instances by calling reset() when processing multiple XML documents. Note that reset() sets the instance back to factory settings.
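The same reuse pattern applies to DocumentBuilder. Here is a minimal sketch, assuming in-memory documents; the class ReusedBuilder and its rootNames() method are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class ReusedBuilder {

    // Parses every document with one DocumentBuilder and returns the root element names.
    static List<String> rootNames(String... documents) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            List<String> names = new ArrayList<String>();
            for (String document : documents) {
                Document dom = builder.parse(new InputSource(new StringReader(document)));
                names.add(dom.getDocumentElement().getNodeName());
                // Back to the state the builder had when the factory created it.
                builder.reset();
            }
            return names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootNames("<a/>", "<b/>"));
    }
}
```

Because reset() restores the factory settings, anything configured on the instance after creation, such as an ErrorHandler, has to be set again after each reset.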

XPath Support

Accessing XML is made simple using XPath: A single XPath expression can be used to replace many lines of DOM API code. JAXP 1.3 has defined XPath APIs that conform to the XPath 1.0 specification and provide object-model-neutral APIs for the evaluation of XPath expressions and access to the evaluation environment. Though current APIs conform to XPath 1.0, the APIs have been designed with future XPath 2.0 support in mind.

To use JAXP 1.3 XPath APIs, the first step is to get the instance of XPathFactory. Though the default model is W3C DOM, it can be changed by specifying the object model URI:

XPathFactory factory = XPathFactory.newInstance();
// or, to select a different object model:
XPathFactory factory = XPathFactory.newInstance(<OBJECT MODEL URI>);

Evaluate the XPath Expression

XPathFactory is used to create XPath objects. The XPath interface provides access to the XPath evaluation environment and expressions. XPath has overloaded the evaluate() function, which can return the result by evaluating an XPath expression based on the return type set by the application. For example, look at the following XML document:

<Books>
<Book>
     <Author> Author1 </Author>
     <Name> Name1 </Name>
     <ISBN> ISBN1 </ISBN>
</Book>
<Book>
     <Author> Author2 </Author>
     <Name> Name2 </Name>
     <ISBN> ISBN2 </ISBN>
</Book>
</Books>

Following is the working code to evaluate the XPath expression and print the names of all the books in the XML document:

XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "/Books/Book/Name/text()";
NodeList nameNodes = (NodeList) xpath.evaluate(expression,
    new InputSource("Books.xml"), XPathConstants.NODESET);
// print all the names of the books
for (int i = 0; i < nameNodes.getLength(); i++) {
    System.out.println("Book name " + (i + 1) + " is " +
        nameNodes.item(i).getNodeValue());
}
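A self-contained variant of this evaluation, reading the document from a string instead of Books.xml, might look like the following. The class BookNames and the compact copy of the Books document are assumptions for illustration. Note that the NODESET return type maps to org.w3c.dom.NodeList in the default DOM object model.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class BookNames {

    // Compact in-memory copy of the Books document shown earlier.
    static final String BOOKS =
        "<Books>"
        + "<Book><Author>Author1</Author><Name>Name1</Name><ISBN>ISBN1</ISBN></Book>"
        + "<Book><Author>Author2</Author><Name>Name2</Name><ISBN>ISBN2</ISBN></Book>"
        + "</Books>";

    // Returns the text of every Name element, in document order.
    static List<String> names(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate("/Books/Book/Name/text()",
                    new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> names = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeValue().trim());
            }
            return names;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(names(BOOKS));
    }
}
```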

Evaluate With Context Specified

XPath is also capable of evaluating an expression based on the context set by the application. The following example sets the Document node as the context for evaluation:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document d = db.parse(new File("Books.xml"));
XPath xpath = XPathFactory.newInstance().newXPath();
String exp = "/Books/Book";
NodeList books = (NodeList) xpath.evaluate(exp, d, XPathConstants.NODESET);

With a reference to a Book element, a relative XPath expression can now be written to select the Name element as follows:

String expression = "Name";
Node name = (Node) xpath.evaluate(expression, books.item(0), XPathConstants.NODE);
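Putting the two steps together, the sketch below selects every Book node from a parsed Document and then evaluates relative expressions against each one. It uses the two-argument evaluate() overload, which returns the result as a String. The class RelativeXPath and its describe() method are illustrative names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class RelativeXPath {

    // Compact in-memory copy of the Books document shown earlier.
    static final String BOOKS =
        "<Books>"
        + "<Book><Author>Author1</Author><Name>Name1</Name><ISBN>ISBN1</ISBN></Book>"
        + "<Book><Author>Author2</Author><Name>Name2</Name><ISBN>ISBN2</ISBN></Book>"
        + "</Books>";

    // Builds one "Name by Author" line per Book, using each Book node as the context.
    static List<String> describe(String xml) {
        try {
            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList books = (NodeList) xpath.evaluate("/Books/Book", d, XPathConstants.NODESET);
            List<String> lines = new ArrayList<String>();
            for (int i = 0; i < books.getLength(); i++) {
                Node book = books.item(i);
                // Relative expressions are resolved against the current Book element.
                String name = xpath.evaluate("Name", book);
                String author = xpath.evaluate("Author", book);
                lines.add(name + " by " + author);
            }
            return lines;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(BOOKS));
    }
}
```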

NamespaceContext XPath Evaluation

What happens if the XML document uses namespaces? Look at the following XML document, in which the first Book element is in the publisher1 domain and the second in the publisher2 domain:

<Books >
<Book xmlns="www.publisher1.com">
     <Author>Author1</Author>
     <Name>Name1</Name>
     <ISBN>ISBN1</ISBN>
</Book>
<Book xmlns="www.publisher2.com">
     <Author>Author2</Author>
     <Name>Name2</Name>
     <ISBN>ISBN2</ISBN>
     <Cover>Hard</Cover>
</Book>
</Books>

In this case, the XPath expression /Books/Book/Name/text() won't give any result because the expression is not fully qualified. You can use an expression such as /Books/p1:Book/p1:Name with a p1 prefix. However, you should set NamespaceContext on the XPath instance so that the p1 prefix can be resolved. In the following sample, the NamespaceContext capable of resolving p1 is set on the XPath instance. Note that the two Book elements are in different namespaces, so the expression would result in only one node.

XPath xpath = XPathFactory.newInstance().newXPath();
String exp = "/Books/p1:Book/p1:Name";
xpath.setNamespaceContext(new MyNamespaceContext());
InputSource is = new InputSource("Books.xml");
NodeList nn = (NodeList) xpath.evaluate(exp, is, XPathConstants.NODESET);
// Print the count.
System.out.println("Node count = " + nn.getLength());
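The sample relies on a MyNamespaceContext class that is not shown. One way such a class could be written is sketched below, under the assumption that only the p1 prefix needs to be resolved; the class name PublisherNamespaces and the helper countPublisher1Names() are hypothetical.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class PublisherNamespaces implements NamespaceContext {

    static final String P1_URI = "www.publisher1.com";

    // Compact in-memory copy of the namespaced Books document shown earlier.
    static final String BOOKS =
        "<Books>"
        + "<Book xmlns='www.publisher1.com'><Name>Name1</Name></Book>"
        + "<Book xmlns='www.publisher2.com'><Name>Name2</Name></Book>"
        + "</Books>";

    public String getNamespaceURI(String prefix) {
        if (prefix == null) {
            throw new IllegalArgumentException("prefix must not be null");
        }
        return "p1".equals(prefix) ? P1_URI : XMLConstants.NULL_NS_URI;
    }

    public String getPrefix(String namespaceURI) {
        return P1_URI.equals(namespaceURI) ? "p1" : null;
    }

    public Iterator<String> getPrefixes(String namespaceURI) {
        String prefix = getPrefix(namespaceURI);
        return prefix == null
                ? Collections.<String>emptyList().iterator()
                : Collections.singletonList(prefix).iterator();
    }

    // Counts the Name elements that belong to the publisher1 namespace.
    static int countPublisher1Names(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new PublisherNamespaces());
            NodeList nodes = (NodeList) xpath.evaluate("/Books/p1:Book/p1:Name",
                    new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Node count = " + countPublisher1Names(BOOKS));
    }
}
```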

XPathVariableResolver

The XPath specification allows variables to be used in XPath expressions. XPathVariableResolver is defined to provide access to the set of user-defined XPath variables. Here is an example of an XPath expression using a variable:

String exp = "/Books/j:Book[j:Name=$bookName]";
xpath.setXPathVariableResolver(new SimpleXPathVariableResolver());
InputSource is = new InputSource("Books.xml");
Node n = (Node) xpath.evaluate(exp, is, XPathConstants.NODE);
System.out.println("Node name is " + n.getNodeName());

A SimpleXPathVariableResolver can implement the resolveVariable() function as follows. (Note: For clarity, only the relevant code is shown here.)

public Object resolveVariable(javax.xml.namespace.QName qName) {
    if (qName.getLocalPart().equals("bookName")) {
        return "Name1";
    }
    ....
}
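A complete, runnable variation on this idea is sketched below. To keep it short it uses a document without namespaces, so no NamespaceContext is needed, and it resolves the single variable bookName from a method argument. The class VariableLookup and its isbnFor() method are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathVariableResolver;
import org.xml.sax.InputSource;

final class VariableLookup {

    // Compact in-memory copy of the Books document shown earlier.
    static final String BOOKS =
        "<Books>"
        + "<Book><Author>Author1</Author><Name>Name1</Name><ISBN>ISBN1</ISBN></Book>"
        + "<Book><Author>Author2</Author><Name>Name2</Name><ISBN>ISBN2</ISBN></Book>"
        + "</Books>";

    // Returns the ISBN of the book whose Name equals the given value.
    static String isbnFor(final String bookName) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setXPathVariableResolver(new XPathVariableResolver() {
                public Object resolveVariable(QName variableName) {
                    // Only $bookName is known; null signals an unresolvable variable.
                    return "bookName".equals(variableName.getLocalPart()) ? bookName : null;
                }
            });
            return xpath.evaluate("/Books/Book[Name=$bookName]/ISBN",
                    new InputSource(new StringReader(BOOKS)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isbnFor("Name2"));
    }
}
```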

XML Schema Data Types

JAXP 1.3 has introduced new data types in the Java platform, the javax.xml.datatype package, that directly map to some of the XML schema data types, thus bringing XML schema data type support directly into the Java platform.

The DatatypeFactory has functions to create different types of data types -- for example, xs:date, xs:dateTime, xs:duration, and so on. The javax.xml.datatype.XMLGregorianCalendar takes care of many W3C XML Schema 1.0 date and time data types, specifically, dateTime, time, date, gYearMonth, gMonthDay, gYear, gMonth, and gDay defined in this XML namespace:

http://www.w3.org/2001/XMLSchema

These data types are normatively defined in W3C XML Schema 1.0, Part 2, Section 3.2.7-14.

The data type javax.xml.datatype.Duration is an immutable representation of a time span as defined in the W3C XML Schema 1.0 specification. A Duration object represents a period of Gregorian time, which consists of six fields (years, months, days, hours, minutes, and seconds) as well as a sign field (+ or -).

Table 1 shows the mapping of XML schema data types to Java platform data types. Table 2 shows the mapping of XPath data types and Java Platform data types.

Table 1. XML Schema and Java Platform Data Type Mapping

W3C XML Schema Data Type    Java Platform Data Type
xs:date                     XMLGregorianCalendar
xs:dateTime                 XMLGregorianCalendar
xs:duration                 Duration
xs:gDay                     XMLGregorianCalendar
xs:gMonth                   XMLGregorianCalendar
xs:gMonthDay                XMLGregorianCalendar
xs:gYear                    XMLGregorianCalendar
xs:gYearMonth               XMLGregorianCalendar
xs:time                     XMLGregorianCalendar



Table 2. XPath and Java Platform Data Type Mapping

XPath Data Type             Java Platform Data Type
xdt:dayTimeDuration         Duration
xdt:yearMonthDuration       Duration



These data types have a rich set of functions introduced to perform basic operations over data types, for example, addition, subtraction, and multiplication.

Also, there are ways to get the lexicalRepresentation of a particular data type that is defined at XML Schema 1.0, Part 2, Section 3.2.[7-14].1, Lexical Representation. There is no need to understand the complexities of XML schema data types such as what types of operations are allowed on a data type, how to write a lexical representation, and so on. The javax.xml.datatype APIs have defined a rich set of functions to make it easy for you.
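As a small illustration of these data types, the sketch below parses the lexical form of an xs:dateTime and an xs:duration, adds one to the other, and prints the lexical form of the result. The class SchemaDateMath and its shift() method are hypothetical names; the DatatypeFactory, XMLGregorianCalendar, and Duration calls are the standard javax.xml.datatype ones.

```java
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.Duration;
import javax.xml.datatype.XMLGregorianCalendar;

final class SchemaDateMath {

    // Adds an xs:duration to an xs:dateTime and returns the lexical form of the result.
    static String shift(String dateTime, String duration) {
        try {
            DatatypeFactory factory = DatatypeFactory.newInstance();
            XMLGregorianCalendar calendar = factory.newXMLGregorianCalendar(dateTime);
            Duration offset = factory.newDuration(duration);
            // XMLGregorianCalendar is mutable: add() changes the instance in place.
            calendar.add(offset);
            return calendar.toXMLFormat();
        } catch (DatatypeConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // One day and two hours after 2005-10-11T09:30:00Z
        System.out.println(shift("2005-10-11T09:30:00Z", "P1DT2H"));
    }
}
```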

XInclude Support

JAXP 1.3 has also defined the support for XInclude. SAXParserFactory/DocumentBuilderFactory should be configured to make it XInclude aware. Do this by calling setXIncludeAware(true) on the factory.

Security Enhancements

JAXP 1.3 has defined a security feature:

http://javax.xml.XMLConstants/feature/secure-processing

When set to true, this operates the parser in a secure manner and instructs the implementation to process XML securely and avoid conditions such as denial-of-service attacks. Examples include restricting the number of entities that can be expanded, the number of attributes an element can have, and the XML schema constructs that would consume large amounts of resources, such as large values for minOccurs and maxOccurs. If XML processing is limited for security reasons, it will be reported by a call to the registered ErrorHandler.fatalError().
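The feature URI is also available as the constant XMLConstants.FEATURE_SECURE_PROCESSING. A minimal sketch of turning it on for a SAXParserFactory follows; the class SecureParsing and its helper method are illustrative names only.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

final class SecureParsing {

    // Turns on secure processing and reports whether the factory now has it enabled.
    static boolean enableSecureProcessing(SAXParserFactory factory) {
        try {
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            return factory.getFeature(XMLConstants.FEATURE_SECURE_PROCESSING);
        } catch (ParserConfigurationException | SAXException e) {
            // The implementation does not recognize or support the feature.
            return false;
        }
    }

    public static void main(String[] args) {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        System.out.println("secure processing enabled: " + enableSecureProcessing(spf));
    }
}
```

Parsers created from the factory after this call inherit the setting.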

Summary

This article has introduced you to some of the new features in JAXP 1.3. You have seen the benefits of the Schema Validation Framework and seen how it can be used to improve the performance of schema validation. Developers working with applications using JAXP 1.2 schema properties to validate XML documents against schemas should upgrade to JAXP 1.3 and use this framework. Remember to reuse the parser instance by calling the reset() method to improve performance.

New object-model-neutral XPath APIs bring XPath support and can work with different object models. XML schema data type support is brought directly into the Java platform with the introduction of new data types. Security features introduced in JAXP 1.3 can help protect the application from denial-of-service attacks. Also, JAXP 1.3 provides complete support for the latest standards: XML 1.1, DOM L3, XInclude, and SAX 2.0.2. These are enough reasons to upgrade to JAXP 1.3, and the implementation is available for downloading from java.net.

For More Information

W3C XML Schema: Datatypes
RELAX NG home page
XPath APIs
XML 1.1 specification
DOM L3 specification
XInclude specification
SAX 2.0.2 home page

About the Author

Neeraj Bajaj is a member of the technical staff in the Web Technology and Standards group at Sun Microsystems. He has been working in the area of core XML processing-related technologies for more than four years. He is the architect of the Sun Java Streaming XML Parser and the co-specification lead of JAXP 1.4. He has contributed to the development of Apache's open-source Xerces2-J project and to the implementation of JSR 60 (JAXP 1.2), JSR 206 (JAXP 1.3), JSR 173 (StAX), and JAXP 1.4.
