|
Implementation
Two Java classes handle the normalization. XMLNormalization takes a file
name as input, converts it to a URL, and parses that URL with a SAXParser
whose content handler is MyDocumentBuilder, which normalizes the whitespace
characters. When MyDocumentBuilder returns the normalized document,
XMLNormalization.main prints the result to standard output.
public static void main(String[] args) throws Exception
{
    SAXParser saxParser = new SAXParser();
    DocumentBuilder docBuilder = new MyDocumentBuilder();
    saxParser.setContentHandler(docBuilder);
    saxParser.parse(fileNameToURL(args[0]));
    XMLDocument doc = docBuilder.getDocument();
    doc.print(System.out);
}
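The main method above relies on a fileNameToURL helper that is not shown here. As a rough sketch only, assuming the helper simply maps a local file name to a file: URL, it could look something like the following. The class name FileNameToUrlSketch and the choice to rethrow a malformed URL as an IllegalArgumentException are assumptions made for this illustration, not the original code.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical stand-in for the fileNameToURL helper; the original is not shown.
class FileNameToUrlSketch {

    // Converts a local file name into a file: URL.
    static URL fileNameToURL(String fileName) {
        try {
            // Resolve relative names against the current working directory,
            // then let java.io.File build a well-formed file: URL.
            return new File(fileName).getAbsoluteFile().toURI().toURL();
        } catch (MalformedURLException e) {
            // Assumption for this sketch: report a bad name as an unchecked exception.
            throw new IllegalArgumentException("Cannot convert to URL: " + fileName, e);
        }
    }

    public static void main(String[] args) {
        // "sample.xml" is a placeholder name used when no argument is given.
        String name = args.length > 0 ? args[0] : "sample.xml";
        System.out.println(fileNameToURL(name));
    }
}
```

Going through File.toURI rather than concatenating "file:" with the name lets the library handle spaces and platform-specific path separators.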
MyDocumentBuilder extends DocumentBuilder, which provides
the getDocument method called above. DocumentBuilder also provides
the characters method, which receives notification of character
data inside an element. The following code from MyDocumentBuilder overrides
the DocumentBuilder.characters method to normalize whitespace. First, it
replaces tabs, newlines, and carriage returns with spaces. It then calls
String.trim to remove leading and trailing spaces, and finally collapses
each remaining run of spaces into a single space.
/**
 * Receive notification of character data inside an element.
 * @param ch The characters.
 * @param start The start position in the character array.
 * @param length The number of characters to use from the
 *               character array.
 * @exception org.xml.sax.SAXException Any SAX exception, possibly
 *            wrapping another exception.
 * @see org.xml.sax.DocumentHandler#characters
 */
public void characters(char[] ch, int start, int length) throws SAXException
{
    String str = new String(ch, start, length);

    // replace: map tabs, newlines, and carriage returns to spaces
    str = str.replace('\t', ' ');
    str = str.replace('\n', ' ');
    str = str.replace('\r', ' ');

    // collapse: drop leading and trailing spaces, then squeeze each
    // run of spaces down to one, compacting the array in place
    str = str.trim();
    char[] ca = str.toCharArray();
    int i, j;
    boolean seenWS = false;
    for (i = 0, j = 0; j < str.length(); j++)
    {
        if (ca[j] != ' ' || !seenWS)
        {
            ca[i++] = ca[j];
            seenWS = (ca[j] == ' ');
        }
    }

    // pass only the first i (normalized) characters to DocumentBuilder
    super.characters(ca, 0, i);
}
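To try the replace-and-collapse logic without a parser on the classpath, the same steps can be pulled into a plain static method. This is only an illustrative sketch: the class name WhitespaceNormalizerSketch and the method name normalize are made up here and are not part of MyDocumentBuilder or the parser API.

```java
// Standalone sketch of the replace, trim, and collapse steps described above.
class WhitespaceNormalizerSketch {

    // Returns text with tabs, newlines, and carriage returns turned into spaces,
    // leading and trailing spaces removed, and inner runs of spaces collapsed to one.
    static String normalize(String text) {
        // replace: same three substitutions as the characters override
        String replaced = text.replace('\t', ' ')
                              .replace('\n', ' ')
                              .replace('\r', ' ');

        // trim leading and trailing spaces
        String trimmed = replaced.trim();

        // collapse: copy a space only if the previous character was not a space
        StringBuilder out = new StringBuilder(trimmed.length());
        boolean previousWasSpace = false;
        for (int k = 0; k < trimmed.length(); k++) {
            char c = trimmed.charAt(k);
            boolean isSpace = (c == ' ');
            if (!isSpace || !previousWasSpace) {
                out.append(c);
            }
            previousWasSpace = isSpace;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "[a b c]"
        System.out.println("[" + normalize("  a\t\tb \r\n c  ") + "]");
    }
}
```

Because characters may be called more than once for a single run of text, this per-call normalization treats each chunk independently, just as the override does.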
|