Load, Save and Filter XML Documents Using the DOM Level 3 API

by Deepak Vohra
07/19/2006

The final method in the interface, startElement(), specifies whether an Element node is to be accepted, rejected, or skipped. As the specification indicates, "The parser will call this method after each Element start tag has been scanned, but before the remainder of the Element is processed. The intent is to allow the element, including any children, to be efficiently skipped." This may be why three methods are available for filtering: to improve efficiency.

The return values of the startElement() method are also listed in Table 1. Only an Element and the Element's attributes are input to the startElement() method. The method may be used to modify the attributes of an element. The differences between the acceptNode() method and the startElement() method are:

Only the Element nodes are input to the startElement() method, as compared to the acceptNode() method in which all the nodes except the Document, DocumentType, Notation, Entity, DocumentFragment, and Attribute nodes may be input. The Attribute nodes may be input to the acceptNode() method of the LSSerializerFilter interface.
The Element node input to startElement() will include all the Element's attributes but none of the children nodes. The nodes input to the acceptNode() method of the LSParserFilter include all the children nodes but none of the attribute nodes. The nodes input to the acceptNode() method of the LSSerializerFilter include all the children nodes, and may include the attribute nodes.
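
The two differences above can be observed with a small logging filter. The following is a minimal sketch, not part of the article's DOM3Filter.java: the class name CallbackViewDemo, the observe() helper, and the one-journal sample document are assumptions made for illustration, and the sketch relies on the LS implementation bundled with the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSParserFilter;
import org.w3c.dom.traversal.NodeFilter;

// Hypothetical demo class: records what each filter callback can see.
class CallbackViewDemo {
    // Stand-in document; the article's real input file is not reproduced here.
    static final String XML =
        "<catalog><journal date=\"May 2005\"><article/></journal></catalog>";

    static List<String> observe(String xml) throws Exception {
        final List<String> log = new ArrayList<>();
        LSParserFilter filter = new LSParserFilter() {
            // Called after the start tag: attributes are present, children are not.
            public short startElement(Element e) {
                log.add("start " + e.getTagName()
                    + " attrs=" + e.getAttributes().getLength()
                    + " hasChildren=" + e.hasChildNodes());
                return LSParserFilter.FILTER_ACCEPT;
            }
            // Called once the node is complete: children have been parsed.
            public short acceptNode(Node n) {
                log.add("accept " + n.getNodeName()
                    + " hasChildren=" + n.hasChildNodes());
                return LSParserFilter.FILTER_ACCEPT;
            }
            public int getWhatToShow() {
                return NodeFilter.SHOW_ELEMENT;
            }
        };

        DOMImplementationLS impl = (DOMImplementationLS)
            DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
        LSParser parser =
            impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        parser.setFilter(filter);

        LSInput input = impl.createLSInput();
        input.setStringData(xml);
        parser.parse(input);
        return log;
    }

    public static void main(String[] args) throws Exception {
        for (String line : observe(XML)) {
            System.out.println(line);
        }
    }
}
```

For the journal element, the log should show one attribute and no children at startElement() time, and children present by the time acceptNode() is called.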

In the following example InputFilter class, I specify the return value of the getWhatToShow() method as NodeFilter.SHOW_ELEMENT. In other words, I only want to show Element nodes to the filter. The return value of the acceptNode() and startElement() methods is NodeFilter.FILTER_ACCEPT:


private class InputFilter implements LSParserFilter {
   public short acceptNode(Node node) {
     return NodeFilter.FILTER_ACCEPT;
   }

   public int getWhatToShow() {
     return NodeFilter.SHOW_ELEMENT;
   }

   public short startElement(Element element) {
     System.out.println("Element Parsed " + element.getTagName());
     return NodeFilter.FILTER_ACCEPT;
   }
}

The example input filter inputs only the Element nodes to the filter's acceptNode() method; the other nodes are included in the DOM document without filtering. In this example, the acceptNode() method of the filter accepts all the nodes that are input. The startElement() method prints out the Element nodes as they are parsed in the XML document. To use a filter, create an instance of the InputFilter class and set the filter on the LSParser:


InputFilter inputFilter=new InputFilter();
parser.setFilter(inputFilter);

Parse and filter the XML document:


Document document=parser.parse(input);
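
The snippets above assume that parser and input already exist. As a self-contained illustration, the following sketch wires everything together and uses startElement() to reject an element before its children are parsed. The class names ParseFilterDemo and RejectAprilFilter, the parse() helper, and the inline sample document (with a placeholder title for the April journal) are assumptions for this sketch, and rejecting at parse time is a variation on the article's example, which accepts every node in the input filter.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSParserFilter;
import org.w3c.dom.traversal.NodeFilter;

// Hypothetical demo class, not the article's DOM3Filter.java.
class ParseFilterDemo {
    // Stand-in for the article's catalog; the April title is a placeholder.
    static final String XML =
        "<catalog title=\"dev2dev\">"
        + "<journal date=\"April 2005\"><article>"
        + "<title>Placeholder</title></article></journal>"
        + "<journal date=\"May 2005\"><article>"
        + "<title>Session Management for Clustered Applications</title>"
        + "</article></journal>"
        + "</catalog>";

    // Rejects the April 2005 journal as soon as its start tag is seen.
    static class RejectAprilFilter implements LSParserFilter {
        public short startElement(Element element) {
            if (element.getTagName().equals("journal")
                    && element.getAttribute("date").equals("April 2005")) {
                return LSParserFilter.FILTER_REJECT;
            }
            return LSParserFilter.FILTER_ACCEPT;
        }

        public short acceptNode(Node node) {
            return LSParserFilter.FILTER_ACCEPT;
        }

        public int getWhatToShow() {
            return NodeFilter.SHOW_ELEMENT;
        }
    }

    static Document parse(String xml) throws Exception {
        DOMImplementationLS impl = (DOMImplementationLS)
            DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
        LSParser parser =
            impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        parser.setFilter(new RejectAprilFilter());

        LSInput input = impl.createLSInput();
        input.setStringData(xml);
        return parser.parse(input);
    }

    public static void main(String[] args) throws Exception {
        Document document = parse(XML);
        System.out.println("journal elements kept: "
            + document.getElementsByTagName("journal").getLength());
    }
}
```

Because the decision is made in startElement(), the parser does not need to build the rejected journal's subtree at all, which is the efficiency argument made in the specification.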

Now I'll show how to create an output filter. As an example, I'll filter a node from the Document in the output filter. Create an OutputFilter class that implements the LSSerializerFilter interface. In addition to the return values listed in Table 2, the getWhatToShow() method of the LSSerializerFilter interface may also return NodeFilter.SHOW_ATTRIBUTE.
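
The value returned by getWhatToShow() is a bit mask, so several SHOW_ constants of NodeFilter can be combined with a bitwise OR. The sketch below only illustrates the arithmetic; the class name WhatToShowDemo and its helper methods are hypothetical.

```java
import org.w3c.dom.traversal.NodeFilter;

// Hypothetical helper class showing how whatToShow masks combine.
class WhatToShowDemo {
    // A mask that shows both Element and Attribute nodes to a filter.
    static int elementsAndAttributes() {
        return NodeFilter.SHOW_ELEMENT | NodeFilter.SHOW_ATTRIBUTE;
    }

    // True if the given SHOW_ flag is set in the mask.
    static boolean shows(int whatToShow, int flag) {
        return (whatToShow & flag) != 0;
    }

    public static void main(String[] args) {
        int mask = elementsAndAttributes();
        System.out.println("elements:   " + shows(mask, NodeFilter.SHOW_ELEMENT));
        System.out.println("attributes: " + shows(mask, NodeFilter.SHOW_ATTRIBUTE));
        System.out.println("text:       " + shows(mask, NodeFilter.SHOW_TEXT));
    }
}
```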

In the following example OutputFilter class, I specify the return value of the getWhatToShow() method as NodeFilter.SHOW_ELEMENT and the return value of the acceptNode() method as FILTER_ACCEPT for all journal nodes other than the journal node with date attribute April 2005, which I reject:


private class OutputFilter implements LSSerializerFilter {
   public short acceptNode(Node node) {
      Element element = (Element) node;

      if (element.getTagName().equals("journal")) {
         if (element.getAttribute("date").equals("April 2005")) {
            return NodeFilter.FILTER_REJECT;
         }
      }

      return NodeFilter.FILTER_ACCEPT;
   }

   public int getWhatToShow() {
      return NodeFilter.SHOW_ELEMENT;
   }
}

Create an instance of the LSSerializer:


LSSerializer domWriter = impl.createLSSerializer();

Create an instance of the OutputFilter and set the filter on the LSSerializer:


OutputFilter outputFilter = new OutputFilter();
domWriter.setFilter(outputFilter);

Create an LSOutput object and set the OutputStream for the LSOutput object:


LSOutput lsOutput = impl.createLSOutput();
OutputStream outputStream = 
          new FileOutputStream(new File("c:/output/filter.xml"));
lsOutput.setByteStream(outputStream);

Output the filtered XML document:


domWriter.write( document, lsOutput);
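
For quick experiments it can be easier to serialize to a String than to a file. The following sketch applies the same kind of output filter and returns the result of writeToString(). The class name SerializeFilterDemo, the serializeWithoutApril() helper, and the inline sample document are assumptions for illustration. How the descendants of a rejected element are treated should be confirmed against the serializer implementation in use; the sketch only relies on the rejected journal's own start tag being left out.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSSerializer;
import org.w3c.dom.ls.LSSerializerFilter;
import org.w3c.dom.traversal.NodeFilter;

// Hypothetical demo class, not the article's DOM3Filter.java.
class SerializeFilterDemo {
    // Stand-in for the article's catalog; the April title is a placeholder.
    static final String XML =
        "<catalog title=\"dev2dev\">"
        + "<journal date=\"April 2005\"><article>"
        + "<title>Placeholder</title></article></journal>"
        + "<journal date=\"May 2005\"><article>"
        + "<title>Session Management for Clustered Applications</title>"
        + "</article></journal>"
        + "</catalog>";

    static String serializeWithoutApril(String xml) throws Exception {
        DOMImplementationLS impl = (DOMImplementationLS)
            DOMImplementationRegistry.newInstance().getDOMImplementation("LS");

        // Load the document without any input filter.
        LSParser parser =
            impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        LSInput input = impl.createLSInput();
        input.setStringData(xml);
        Document document = parser.parse(input);

        // Reject the April 2005 journal while serializing.
        LSSerializer domWriter = impl.createLSSerializer();
        domWriter.setFilter(new LSSerializerFilter() {
            public short acceptNode(Node node) {
                if (node instanceof Element) {
                    Element element = (Element) node;
                    if (element.getTagName().equals("journal")
                            && element.getAttribute("date").equals("April 2005")) {
                        return NodeFilter.FILTER_REJECT;
                    }
                }
                return NodeFilter.FILTER_ACCEPT;
            }

            public int getWhatToShow() {
                return NodeFilter.SHOW_ELEMENT;
            }
        });
        return domWriter.writeToString(document);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serializeWithoutApril(XML));
    }
}
```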

Run the filter application DOM3Filter.java. The input filter lists the elements as they are parsed:


Element Parsed journal
Element Parsed article
Element Parsed title
Element Parsed author
Element Parsed journal
Element Parsed article
Element Parsed title
Element Parsed author

The output from the output filter is listed in the following code listing:


<?xml version="1.0" encoding="UTF-8"?>
<catalog title="dev2dev"> 
 <journal date="May 2005"> 
  <article section="WebLogic Server">
    <title>Session Management for Clustered Applications</title>
    <author> Jon Purdy</author> 
   </article>
  </journal>
</catalog>

As illustrated in the code listing, the journal node with date="April 2005" has been removed.

DOM3Filter.java, the Java class used to filter an XML document, is available in the Additional Reading section at the end of this article.

Prior to the DOM Level 3 Load and Save specification, an XML document could not be filtered as the document was parsed or output. In the DOM Level 2 API, nodes are removed from an already loaded document with the removeChild() method of the Node interface.
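
For comparison, the pre-Level 3 approach can be sketched as follows: load the whole document first, then walk it and remove the unwanted nodes. The class name RemoveChildDemo, the removeAprilJournals() helper, and the inline sample document are assumptions for illustration; the sketch uses the JAXP DocumentBuilder rather than the Load and Save API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class showing removal after loading.
class RemoveChildDemo {
    // Stand-in for the article's catalog document.
    static final String XML =
        "<catalog title=\"dev2dev\">"
        + "<journal date=\"April 2005\"><article/></journal>"
        + "<journal date=\"May 2005\"><article/></journal>"
        + "</catalog>";

    static Document removeAprilJournals(String xml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        // The NodeList is live, so iterate backwards while removing.
        NodeList journals = document.getElementsByTagName("journal");
        for (int i = journals.getLength() - 1; i >= 0; i--) {
            Element journal = (Element) journals.item(i);
            if ("April 2005".equals(journal.getAttribute("date"))) {
                journal.getParentNode().removeChild(journal);
            }
        }
        return document;
    }

    public static void main(String[] args) throws Exception {
        Document document = removeAprilJournals(XML);
        System.out.println("journal elements left: "
            + document.getElementsByTagName("journal").getLength());
    }
}
```

Unlike a parser filter, this approach builds the complete tree, including the nodes that are about to be discarded, before anything is removed.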

Download

Download the source code of the examples found in this article: resources.zip

Conclusion

With the DOM Level 3 Load and Save API, an XML document may be loaded, saved, and filtered. In this tutorial, the DOM Level 3 implementation in Xerces2-j 2.7.0 is used to load, save, and filter an example XML document. JAXP 1.3, which is included in JDK 5.0, also includes a reference implementation of the DOM Level 3 Load and Save API. In this article I have shown you how to load an XML document (with schema validation), save an XML document or a node to a file or a String, and filter nodes from an XML document.

Additional Reading

W3C DOM 3 - the W3C DOM Level 3 Specification
Xerces2 Java Parser 2.7.1 - download site for the Xerces 2j Parser
Xerces2 DOM - programming with DOM
Xerces2 Java Parser 2.7.1 DOM Level 3 - Xerces 2j DOM Level 3 implementation

Deepak Vohra is a NuBean consultant and web developer. He is a Sun Certified Java 1.4 Programmer and Sun Certified Web Component Developer for J2EE.