Cross-Referencing HTML the Tiger Way

   
By Dr. Matthias Laux, September 2004  

The enormously widespread adoption that HTML enjoys as the markup language of the internet is a testimony to its unparalleled usefulness in getting content published easily. HTML offers a variety of tags to structure content, among them tags to mark headers at different levels like <h1> through <h6>, or tags to define lists like <ul> or <ol>. HTML documents are simple text documents which can be created using any editor or by exporting a document from any current office productivity suite. What HTML lacks, however, is real support for cross-referencing for items like chapters, figures, or tables. Maintaining an ordered numbering scheme for such items, plus making sure that references to these items from elsewhere in the document are kept up-to-date should items be moved to a different location, added, or removed entirely, is beyond the scope of HTML. Yet such capabilities come in very handy, especially when managing such references across multiple documents. This article describes a simple Java tool called Xref which provides such capabilities, and some others beyond that as well. This tool demonstrates the benefits of the new ease-of-development features that have been added to the upcoming J2SE 5.0 release ("Tiger"). The main features used here are generics, typesafe enums, and other language extensions like the enhanced for loop for collections.

I use HTML documents to manage information daily. This page description language is very simple to use, and storing useful data in a web page is just so convenient -- no paper documents lying around and getting lost between piles, never to be found again. The more complex HTML documents are typically structured using header tags like <h1> through <h6> (although the latter is admittedly not used too frequently) or lists, be they ordered (<ol>) or unordered (<ul>). In addition, there can be figures and tables in the document, among other structural elements. Typically, elements like chapter headings, figures, and tables get numbers assigned to be able to refer to them, such as

...
3. Configuring a proxy server
...
3.1 Setting up the server
...
3.2 Configuring proxy rules
...

Somewhere in the text, we could find something like

...
For further details on how to set up a proxy server, please see chapter 3.1.
...

It works the same way with figures and tables. The problem here becomes obvious: if I insert something in between the existing structure, for example a new chapter 3.1, all the chapter headings below need to be renumbered. In addition, the references to the chapter number need to be modified as well. This is obviously both boring and -- worse yet -- error-prone, especially if such numberings and references span multiple HTML files comprising an actual document.

Your everyday word processor in your office suite offers another nice capability: automatically creating a table of contents, or maybe also a list of figures and a list of tables. In HTML, such lists have to be created and maintained manually. This is also not something we would typically enjoy doing.

The tool, Xref, described in this article, offers a simple solution to these problems, and some additional capabilities beyond them. In addition, it is also a useful example of how to make productive use of some of the new J2SE 5.0 capabilities.

While Xref itself can be used in multiple ways to simplify the creation of HTML content, its primary target is the area of web authoring, where texts are created that span one or more HTML documents. This area is supported at several levels of complexity:

  1. In its simplest form of application, existing documents which are structured using HTML header tags can be processed, and the headers will automatically be numbered. This can be achieved without any changes. If desired, a table of contents with all the chapters and hyperlinks pointing to all chapters can be generated by simply adding one new tag to the document. This would be the only necessary change.
  2. A more complex form of application requires changes in the HTML document(s): by adding new HTML tags, an additional automated numbering can be achieved for other objects like figures or tables. Again, the creation of lists (list of figures, list of tables) is supported.
  3. The full scale of Xref's capabilities is exploited by using the referencing capabilities. Using another new tag (<ref>), references to any labelled object (such as chapter headers, figures, or tables) can be created in the document, and the correct numbering will be inserted at that location. In addition, a hyperlink pointing to the referred object will be created.

All the source code described in this article and the JAR file with the compiled classes are available for download -- see the Resources section.

Some of the functionality described here could also be implemented based on counters as specified in CSS2 (see the Resources section), although many popular browsers don't yet support them. These counters can also be used for the automated numbering of, for example, chapter headers. The tool described here is more focused on the cross-referencing task, and offers several capabilities beyond CSS2 counters:

  • Numbering of items like figures and tables (and other, arbitrary sequences) based on new pseudo-HTML tags inserted directly into the document.
  • Automatic cross-referencing capabilities, including resolution of references to numbered items like chapters or figures and the creation of named anchors for the targets and <a href> references to these.
  • Automatic generation of cross-reference lists in the document, including <a href> references to the referenced items.
  • Cross-document capabilities (numbering and references can span multiple documents).

Besides that, it is also intended to be a showcase application for some of the new capabilities of the J2SE 5.0 release.

 
Tags for the Cross-Referencing Process

Three new tags have been defined to achieve the purpose of this tool:

  1. Tag <lab name="myLabel" type="figure">...</lab>:

    This tag sets a label of a given type in the HTML document which can later be referred to. This tag also accepts an optional text attribute which can be used to explicitly set the text used for this label in auto-generated lists (per default, the text contained in between the opening and closing <lab> tag is used for that purpose). More details on this will follow.
  2. Tag <ref name="MainData" type="table">...</ref>:

    This tag refers to a label of a given type by its name.
  3. Tag <list type="chapter"></list>:

    This tag inserts an automatically generated list of known labels into the document, for example a table of contents, or a list of figures.

In addition, the existing header tags <h1> through <h6> have been extended to accept an additional name and a text attribute, which are both optional. This is useful since these header tags are the already existing structuring mechanism for HTML, and rather than defining an additional <lab> tag for each chapter heading, it is more useful to rely on the existing tags. So basically instead of

<lab name="Chapter1" type="chapter">The first chapter</lab>

we would use

<h1 name="Chapter1"> The first chapter</h1>

This offers one additional, extremely important benefit: if an existing document which uses HTML header tags to structure its content is to be processed with this tool, there is no need to actually modify the document to achieve the automated numbering of chapter headings! XrefParser uses a dummy value for the name attribute of header tags, if no such attribute is present. The numbering of headers is solely based on the document structure, and not on the attributes of the header tags, and thus such an unmodified document would be rewritten with the correct header numbers. It would not be possible, of course, to add <ref> references pointing to such headers. Such additional information could, however, be inserted into an existing document as required -- so it is definitely not necessary to rewrite all HTML documents to get some of the benefits of this tool. The full range of the benefits requires some rewriting, however.

A few remarks are in order here:

  1. The tags can be inserted into documents using any standard text editor. HTML editors may also support the insertion of HTML tags (Netscape Composer is an example of an editor providing this capability).
  2. The tag names used here are admittedly borrowed in part from Leslie Lamport's LaTeX macro package, which is based on Donald Knuth's wonderful text processor TeX. This is still probably the best tool under the sun for creating technical and scientific documents, especially if they contain many formulas and equations.
  3. We can of course define any number of new HTML tags and insert them into documents -- this doesn't mean that a browser can make any sense of them. Fortunately, browsers couldn't care less about these tags, and just ignore them. This is actually good, since it means that the tags produce no visible output when the source document is viewed in a browser. In the target document, which is created by the tool, appropriate text (heading, table, or figure numberings) has been added where these tags were specified.
  4. The good news is of course that the org.htmlparser.Parser we use here does not ignore these tags. In fact, it provides an instance of org.htmlparser.lexer.nodes.TagNode for each such tag, and the method getRawTagName() provides the name of the tag as a string, which can then be used for further processing.
  5. XrefParser not only resolves all references with the correct numberings, it also inserts HTML anchors (<a name> ... </a>) for the reference targets, and hyperlinks for the references to them, so that navigation is very much simplified. Hyperlinks are also created for each item within a list created by the <list> tag.
  6. The closing tags are not a must in all cases for the newly introduced <ref> and <list> tags. It is, however, good style to use closing tags to be xHTML compliant. For <lab> tags, it is a must to have such closing tags to allow for the automated collection of text data.
 
Technical Implementation

The approach chosen here is to define a few new HTML tags which are added to the HTML document, and to extend the standard header tags <h1> through <h6> with additional attributes. Such an HTML document is then processed using an instance of org.htmlparser.Parser, which is part of the HTMLParser Sourceforge project (see the Resources section). This class provides the capabilities to parse an HTML document and provide the document information in the form of a tree-structure. This tree basically consists of three different node elements:

  1. org.htmlparser.lexer.nodes.TagNode: A node containing all information about an HTML tag that the parser encountered.
  2. org.htmlparser.lexer.nodes.StringNode: A node containing plain HTML text.
  3. org.htmlparser.lexer.nodes.RemarkNode: A node containing the information about an HTML comment that the parser encountered.

This tree structure can be processed recursively, and appropriate action can be taken, depending on the type of node encountered. Obviously, for the purpose of the Xref tool, the TagNodes are the most relevant ones since we are looking for specific tags to trigger certain actions, but the StringNodes are also very important to add additional information to the document.

The org.htmlparser.Parser class -- or,  more precisely, the entire HTMLParser package -- is really a fine piece of software, and I have to applaud the authors who implemented it! It is a robust parser that can handle not only perfectly xHTML-compliant documents, but also correctly parses documents with a lot of not-so-conformant HTML, which is nevertheless acceptable to today's browsers. Probably the most common example of such HTML markup are missing end tags like </td> or </p>. org.htmlparser.Parser nevertheless correctly parses such documents and produces the document tree structure. This structure can be processed in code similar to this example:

//.... Parse a file whose name is stored in the String fileName

org.htmlparser.Parser parser = new org.htmlparser.Parser(fileName);

//.... nodes is a previously created list holding the top-level nodes,
//.... e.g. java.util.ArrayList<org.htmlparser.Node> nodes

for (org.htmlparser.util.NodeIterator i = parser.elements(); i.hasMoreNodes(); ) {
  org.htmlparser.Node node = i.nextNode();
  nodes.add(node);
  recurse(node);
}

where the recursive path through the tree is implemented as

private void recurse(org.htmlparser.Node node) throws
    org.htmlparser.util.ParserException {

//.... Do processing for the given node (depending on the node type)

  ...

//.... Recurse into children

  org.htmlparser.util.NodeList nodeList = node.getChildren();
  if (nodeList != null)
    for (org.htmlparser.util.NodeIterator i = nodeList.elements(); i.hasMoreNodes(); )
      recurse(i.nextNode());
}

One very convenient feature of the org.htmlparser.Parser class is that it is not limited to a set of known HTML tags, but uses more of a pattern-matching approach to identify tags. This allows us to sneak the newly introduced tags into the documents, which are then reported as instances of TagNode, just like other standard HTML tags are.
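To illustrate the idea behind this pattern-matching approach (this is a sketch of the concept, not HTMLParser's actual implementation), a tag name can be extracted from any tag-like token without consulting a fixed list of known HTML tags:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagNameDemo {

    // Matches any tag-like token and captures its name, known HTML tag or not
    private static final Pattern TAG = Pattern.compile("</?\\s*([a-zA-Z][a-zA-Z0-9]*)");

    // Returns the raw tag name (analogous in spirit to TagNode.getRawTagName()),
    // or null if the token does not start like a tag
    static String rawTagName(String token) {
        Matcher m = TAG.matcher(token);
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(rawTagName("<h1 name=\"Chapter1\">"));              // h1
        System.out.println(rawTagName("<lab name=\"fig1\" type=\"figure\">")); // lab
        System.out.println(rawTagName("</ref>"));                              // ref
    }
}
```

Because the pattern only requires something that looks like a tag, made-up tags such as <lab> and <ref> are recognized just as readily as <h1>.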

Once the document tree structure is available, it is processed twice: in pass 1, all the necessary information about referenceable objects is collected, and in pass 2, all references to such objects are resolved in the document, and an automated numbering of such objects is performed. All of this is implemented in the ml.htmlkit.XrefParser class, which is the crucial component of the Xref tool.

So, to summarize the process:

  1. Create an instance of XrefParser:

    XrefParser parser = new XrefParser();
  2. Run pass 1 for a file inpfile:

    parser.collect(inpfile);

    This creates an org.htmlparser.Parser under the hood to create the document tree, and then uses exactly the recursive procedure outlined above to traverse this tree and inspect all its elements.
  3. Run pass 2 for a file inpfile:

    parser.resolve(inpfile, outfile);

    This step again recursively traverses the document tree (which has been stored in the XrefParser instance in pass 1) to resolve all references and to insert numberings where necessary. The output is written to the given file outfile.

This is admittedly a somewhat simplified view, but it demonstrates the basic approach.

Xref implements this process in an easy to use way. The application accepts an XML file describing the task to accomplish in the form of a project, the details of which will be described later in this article. At this point, it is important to point out a few aspects:

  1. XrefParser and Xref support the processing of multiple HTML documents within a project. References can span multiple documents, and the links generated and inserted into the documents will correctly resolve to the appropriate document.
  2. XrefParser accepts a set of parameters to control certain aspects of the output it generates. The defaults chosen for these parameters can be overridden within the XML file describing the cross-referencing project.
  3. Xref can also be instantiated programmatically, and the actual cross-referencing task can then be performed by invoking the

    public void xrefFiles(java.util.List<String> inpfiles,
                          java.util.List<String> outfiles,
                          XrefParser parser)
        throws java.io.IOException, org.htmlparser.util.ParserException

    method. This allows the capabilities of this tool to be used from within other Java applications as well.

It should also be pointed out that initial versions of this tool relied on the HTML editor toolkit that is provided in the Swing packages that are part of the standard JDK ( javax.swing.text.html.HTMLEditorKit). This toolkit also contains an HTML parser which can be readily applied to the task at hand. Unfortunately, however, this parser currently uses an HTML 3.2 DTD under the hood (also in the upcoming J2SE 5.0), which means that HTML 4.0 (or later) features can cause problems. One example of such problems that were observed is the incorrect processing of entities like &alpha; which were introduced in HTML 4.0. Since some of these newly introduced entities are deemed useful for the Xref tool, eventually the Swing parser was replaced by the HTMLParser package, which does not have these limitations.

 
Namespace Concept

XrefParser supports the concept of namespaces. This basically means that the names for labels are assigned to different namespaces, and labels -- while being unique within a namespace -- can have the same name when they belong to different namespaces. Namespaces are also directly related to the actual numbering being used, e.g. for tables or figures.

Namespaces are implemented using an enum, one of the many new and cool ease-of-development features in J2SE 5.0:

  private enum NameSpace {
    chapter,
    figure,
    table,
    sequence1,
    sequence2,
    sequence3,
    sequence4;
  }

During pass 1, XrefParser collects the information for header and label tags. Each such tag belongs to a namespace: header tags implicitly belong to the chapter namespace, whereas the namespace for a label tag is chosen with the type attribute.

Using an enum adds a great deal of type safety to the coding since we have a Java type which controls exactly the set of allowed values. This is also exploited in the data structures holding the collected information. This information is stored in an instance of a helper class TagInfo for each namespace. Note the use of J2SE 5.0 generics here to control the type of data that can be stored in this hash map:

private java.util.HashMap<NameSpace, TagInfo> tagInfo = new java.util.HashMap<NameSpace, TagInfo>();

On a side note, the code described here often has two requirements: first, to check whether a given text string in the HTML document (here, the namespace as specified in the type attribute) is a valid value; and second, to obtain the enum instance to which this string maps in order to work with the enum in the code (for example, in a switch statement). Both can easily be achieved using a hash map like

java.util.HashMap<String, NameSpace> map =
  new java.util.HashMap<String, NameSpace>();

This hash map can be filled with strings (namespace names) mapping to instances of the NameSpace enum:

for (NameSpace nameSpace: NameSpace.values())
  map.put(nameSpace.toString(), nameSpace);

Note the use of the J2SE 5.0 enhanced for loop here, plus the capabilities of the new enum construct. Now achieving the goals described above is straightforward:

  • An existence check is done through the containsKey() method of the hash map.
  • Getting an enum instance is done through the get() method of the hash map.

Note that the generics feature in J2SE 5.0 ensures that only the right type of arguments can be used when working with this hash map. In addition, no casting is necessary when retrieving objects from that hash map:

NameSpace nameSpace = map.get("chapter");

Using this approach ensures that typos or just unsupported data in the input document can be easily discovered and reported. This safety can be achieved in a very generic way; we can easily add or remove namespaces simply by modifying the NameSpace enum, and we don't have to maintain chains of string comparisons in endless if-then-else constructs for each supported namespace.
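Putting these pieces together, the lookup can be sketched as a small self-contained class (the enum mirrors the article's NameSpace; the class name and the lookup helper are my own illustration, not XrefParser's actual code):

```java
import java.util.HashMap;

public class NameSpaceLookup {

    // Mirrors the NameSpace enum shown above
    enum NameSpace { chapter, figure, table, sequence1, sequence2, sequence3, sequence4 }

    // Maps namespace names as found in type attributes to enum instances
    static final HashMap<String, NameSpace> map = new HashMap<String, NameSpace>();

    static {
        for (NameSpace nameSpace : NameSpace.values())
            map.put(nameSpace.toString(), nameSpace);
    }

    // Returns the enum instance for a valid type attribute value,
    // or null for a typo or unsupported namespace
    static NameSpace lookup(String type) {
        return map.containsKey(type) ? map.get(type) : null;
    }

    public static void main(String[] args) {
        System.out.println(lookup("figure"));   // figure
        System.out.println(lookup("figrue"));   // null -- typo detected
    }
}
```

A null return value signals invalid input data that can be reported to the user, while a non-null result can go straight into a switch statement.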

As already mentioned, the namespace concept is closely related to the numbering scheme used. So if we have, say, 12 figures and 7 tables within a document (or a set of documents), we would add a label tag for each one of them, but use the figure namespace for the figures, and the table namespace for the tables. XrefParser would then insert numbers ranging from 1 to 12 for the figures, and 1 to 7 for the tables.

Note that this concept is not limited to chapters, figures, or tables -- it can be applied to any sequence where necessary. HTML of course offers the <ol> tag for automated numbering of items, but if there should ever be a need for a dynamically created and updated sequence number in a document (or a set of documents) that cannot be satisfied by an <ol> tag (which can become quite nasty or impossible across a large document or a set of documents), then XrefParser offers a simple solution in the form of namespaces. Four generic multi-purpose namespaces have already been defined (sequence1 through sequence4), and more can be added as necessary simply by adding an item to the NameSpace enum. So one could simply use

...
<lab type="sequence1" name="step1"></lab>
...
<lab type="sequence1" name="step2"></lab>
...
<lab type="sequence1" name="step3"></lab>
...

throughout a document, and XrefParser would reformat these tags as

...
<lab type="sequence1" name="step1"></lab><a name="sequence1:step1">1</a>
...
<lab type="sequence1" name="step2"></lab><a name="sequence1:step2">2</a>
...
<lab type="sequence1" name="step3"></lab><a name="sequence1:step3">3</a>
...

The names given here could also be referred to using <ref> tags, if necessary:

...
<ref type="sequence1" name="step2"></ref>
...

which would be translated to

...
<ref type="sequence1" name="step2"></ref><a href="data.html#sequence1:step2">2</a>
...

where we assume that the actual HTML output document is named data.html. In addition, a <list> tag can also be used for such sequences, should the need arise.
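The generated markup follows a simple pattern, which can be sketched as plain string assembly (the class and method names here are illustrative, not XrefParser's actual internals):

```java
public class LinkBuilder {

    // Anchor inserted after a <lab> tag, e.g. <a name="sequence1:step2">2</a>
    static String anchor(String nameSpace, String name, String value) {
        return "<a name=\"" + nameSpace + ":" + name + "\">" + value + "</a>";
    }

    // Hyperlink inserted after a <ref> tag, e.g. <a href="data.html#sequence1:step2">2</a>;
    // the prefix is the name of the HTML document containing the anchor
    static String reference(String prefix, String nameSpace, String name, String value) {
        return "<a href=\"" + prefix + "#" + nameSpace + ":" + name + "\">" + value + "</a>";
    }

    public static void main(String[] args) {
        System.out.println(anchor("sequence1", "step2", "2"));
        System.out.println(reference("data.html", "sequence1", "step2", "2"));
    }
}
```

Since the anchor name always combines the namespace and the label name, equal label names in different namespaces never collide, and the prefix is what makes references work across multiple documents.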

One possible application of such a sequence is a list of literature references in a document. An HTML ordered list can usually make sure that the entries themselves are numbered in the right order, since such references are typically collected at the end of a document. Referring to such references from within the text using their actual number in the list, however, is not possible without maintaining these reference numbers manually. Xref can provide this capability easily, as in this example:

...
<lab type="sequence1" name="sun">
  Sun's home page at http://www.sun.com
</lab><br>
<lab type="sequence1" name="sap">
  SAP's home page at http://www.sap.com
</lab><br>
<lab type="sequence1" name="java">
  The main Java page at http://www.oracle.com/technetwork/java/index.html
</lab><br>
...

A reference could then look like this:

... a lot of useful information on Java can be obtained from Sun's main Java
page [<ref type="sequence1" name="java"></ref>].

which would be translated to

...
<lab type="sequence1" name="sun"><a name="sequence1:sun">1</a>
  Sun's home page at http://www.sun.com
</lab><br>
<lab type="sequence1" name="sap"><a name="sequence1:sap">2</a>
  SAP's home page at http://www.sap.com
</lab><br>
<lab type="sequence1" name="java"><a name="sequence1:java">3</a>
  The main Java page at http://www.oracle.com/technetwork/java/index.html
</lab><br>
...

... a lot of useful information on Java can be obtained from Sun's main
Java page [<ref type="sequence1" name="java"></ref><a
href="doc.html#sequence1:java">3</a>].

Chapters are all part of one namespace, and as such, dealing with them is slightly more complicated, since there is a hierarchy within the different header tag levels. The main numbering of the namespace applies to <h1>, but the numbering of the lower header levels depends on their higher levels: whenever a new <h(n)> tag is encountered, the numbering for <h(n+1)> restarts at 1. This requires some additional logic in the code for this namespace.
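This reset logic can be sketched with a simple counter array (a self-contained illustration; the class and method names are my own, not the actual XrefParser code):

```java
public class ChapterCounter {

    // counters[0] tracks <h1>, counters[1] tracks <h2>, and so on
    private final int[] counters = new int[6];

    // Called when an <h(level)> tag is encountered (level is 1-based).
    // Returns the numbering string, e.g. "3.1" for the first <h2>
    // under the third <h1>.
    String next(int level) {
        counters[level - 1]++;
        // A new header at this level restarts the numbering of all lower levels
        for (int i = level; i < counters.length; i++)
            counters[i] = 0;
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < level; i++) {
            if (i > 0) sb.append('.');
            sb.append(counters[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ChapterCounter c = new ChapterCounter();
        System.out.println(c.next(1));   // 1
        System.out.println(c.next(2));   // 1.1
        System.out.println(c.next(2));   // 1.2
        System.out.println(c.next(1));   // 2
    }
}
```

The separator between the levels is hard-wired to "." here; the real tool makes it configurable via the HnHmSeparator parameters described below.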

 
The XrefParser Class

The class containing the main cross-referencing logic is, as already mentioned, XrefParser. Nevertheless, several helper classes are used which need to be described before delving into the XrefParser details.

Parameters
XrefParser is configured through parameters, which are implemented by the classes ml.htmlkit.SimpleParameter and ml.htmlkit.SetParameter. Both are derived from the abstract base class ml.htmlkit.Parameter:

public abstract class Parameter {

  public enum Style {

    Number,
    RomanLower,
    RomanUpper,
    AlphaLower,
    AlphaUpper,
    GreekLower,
    GreekUpper,
    Custom,
    None;

  }

}

This class serves two purposes: It is a marker for all derived parameter classes, and it holds an enum for the supported header styles.

Parameters are pre-initialized with default values taken from two configuration property files contained in the ml.htmlkit package. XrefParser offers additional methods to change these default values, and the project XML file described later also offers XML tags to control every such parameter.

An excerpt of SimpleParameter is shown here:

public class SimpleParameter extends Parameter {

  private SimpleParameter.Name name  = null;
  private String              value = null;

  public enum Name {

    H1Style,
    H2Style,
    H3Style,
    H4Style,
    H5Style,
    H6Style,

    H1H2Separator,
    H2H3Separator,
    H3H4Separator,
    H4H5Separator,
    H5H6Separator,

    TextAfterChapter,
    TextAfterFigure,
    TextAfterTable;

  }

  ...

}

Again, an enum is used to specify the supported simple parameter names. This is exploited when reading the property files with the default values, and also in the XrefParser API methods controlling access to the parameter's value:

// A hash map mapping the names of simple parameters to enum instances

HashMap<String, SimpleParameter.Name> names
  = new HashMap<String, SimpleParameter.Name>();
...
for (SimpleParameter.Name name: SimpleParameter.Name.values())
  names.put(name.toString(), name);
...

// A hash map mapping enum instances to instances of SimpleParameter

HashMap<SimpleParameter.Name, SimpleParameter> params
  = new HashMap<SimpleParameter.Name, SimpleParameter>();
...
Properties prop = new Properties();
prop.load(...);                         // Load the property file

// Setup the parameter data with the data read from the property file

for (Enumeration e = prop.propertyNames(); e.hasMoreElements(); ) {
  String n = (String)e.nextElement();
  if (names.containsKey(n)) {
    params.put(names.get(n), new SimpleParameter(names.get(n), prop.getProperty(n)));
  } else {
    writeError("Unknown parameter: " + n);
  }
}

Note how simple and convenient it is to check whether the property names in the property file are valid and to make sure that the data stored in the two HashMap instances is exactly of the type desired. The compiler will flag any attempt to store an unsupported object type into these maps.

The other API methods of SimpleParameter are:

public SimpleParameter(SimpleParameter.Name name, String value);
public SimpleParameter.Name getName();
public String getValue();
public void setValue(String value);

The Name enum is very useful here: it ensures, already in the constructor, that only valid parameter names can be specified. The SetParameter class is slightly more complex:

public class SetParameter extends Parameter {

  private java.util.ArrayList<String> values = null;
  private SetParameter.Name          name   = null;
  
  public enum Name {
  
    RomanLowerSet,
    RomanUpperSet,
    GreekLowerSet,
    GreekUpperSet,
    AlphaLowerSet,
    AlphaUpperSet,
    CustomSet;
        
  }
  
  ...
  
}

Again, an enum is used for the valid parameter names, while the API methods are slightly more complex for this type of parameter:

public SetParameter(SetParameter.Name name, String valueString);
public SetParameter.Name getName();
public void addValue(String value);
public void removeValue(String value);
public java.util.ArrayList<String> getValues();
public void setValues(String valueString);

The approach chosen here is that SetParameter is initialized using a valueString that contains the valid values separated by blanks. An example would be a list of tag names such as "h1 h2 h3 h4 h5 h6". Internally, this string is split into individual strings which are stored in the ArrayList. Values can be added or removed from this list as necessary.
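The splitting step might look like this (a minimal sketch of the behavior described above; SetParameter's actual implementation may differ):

```java
import java.util.ArrayList;

public class ValueStringDemo {

    // Splits a blank-separated value string into individual values,
    // roughly as SetParameter.setValues() is described to do
    static ArrayList<String> split(String valueString) {
        ArrayList<String> values = new ArrayList<String>();
        for (String value : valueString.trim().split("\\s+"))   // enhanced for loop over the parts
            values.add(value);
        return values;
    }

    public static void main(String[] args) {
        System.out.println(split("h1 h2 h3 h4 h5 h6"));   // [h1, h2, h3, h4, h5, h6]
    }
}
```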

Setting or modifying set parameters is again also supported through the project XML format. The following table lists all the currently supported parameters:

Table 1: Parameter Names and Their Meaning

  Name              Type    Meaning                                                                 Valid Values
  H1Style           Simple  The style used for <h1> header tags                                     Parameter.Style
  H2Style           Simple  The style used for <h2> header tags                                     Parameter.Style
  H3Style           Simple  The style used for <h3> header tags                                     Parameter.Style
  H4Style           Simple  The style used for <h4> header tags                                     Parameter.Style
  H5Style           Simple  The style used for <h5> header tags                                     Parameter.Style
  H6Style           Simple  The style used for <h6> header tags                                     Parameter.Style
  H1H2Separator     Simple  The separator used between the numberings for <h1> and <h2>             Any string
  H2H3Separator     Simple  The separator used between the numberings for <h2> and <h3>             Any string
  H3H4Separator     Simple  The separator used between the numberings for <h3> and <h4>             Any string
  H4H5Separator     Simple  The separator used between the numberings for <h4> and <h5>             Any string
  H5H6Separator     Simple  The separator used between the numberings for <h5> and <h6>             Any string
  TextAfterChapter  Simple  The text that follows chapter heading numberings                        Any string
  TextAfterFigure   Simple  The text that follows figure numberings                                 Any string
  TextAfterTable    Simple  The text that follows table numberings                                  Any string
  RomanLowerSet     Set     The literals used for the numbering style Parameter.Style.RomanLower    i, ii, iii, iv, ...
  RomanUpperSet     Set     The literals used for the numbering style Parameter.Style.RomanUpper    I, II, III, IV, ...
  AlphaLowerSet     Set     The literals used for the numbering style Parameter.Style.AlphaLower    a, b, c, ...
  AlphaUpperSet     Set     The literals used for the numbering style Parameter.Style.AlphaUpper    A, B, C, ...
  GreekLowerSet     Set     The literals used for the numbering style Parameter.Style.GreekLower    α, β, γ, ...
  GreekUpperSet     Set     The literals used for the numbering style Parameter.Style.GreekUpper    Α, Β, Γ, ...
  CustomSet         Set     The literals used for the numbering style Parameter.Style.Custom        Any set of strings

This class structure for holding the parameters is designed to be extensible to allow the easy addition of new simple or set parameters should this become necessary.
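As an illustration of how such a style set could be applied (my own assumption about the mechanics, not the actual Xref code), a counter value can simply index into the configured literal set, falling back to the plain number when the set is exhausted:

```java
import java.util.ArrayList;
import java.util.Arrays;

public class StyleDemo {

    // The first few literals of the RomanLowerSet parameter from Table 1
    static ArrayList<String> romanLowerSet =
        new ArrayList<String>(Arrays.asList("i", "ii", "iii", "iv", "v", "vi"));

    // Maps a 1-based counter value to the corresponding literal of the set;
    // falls back to the plain number if the set has no entry for it
    static String format(int counter, ArrayList<String> set) {
        if (counter >= 1 && counter <= set.size())
            return set.get(counter - 1);
        return String.valueOf(counter);
    }

    public static void main(String[] args) {
        System.out.println(format(3, romanLowerSet));    // iii
        System.out.println(format(99, romanLowerSet));   // 99 -- set exhausted
    }
}
```

The same lookup works for any of the set parameters, including CustomSet, which is why adding a new numbering style only requires providing a new set of literals.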

TagInfo Class
The helper class TagInfo stores the label information collected during pass 1 and offers the following API:

public boolean add(String name, String value, String prefix, String text);
public boolean containsName(String name);
public void reset();
public boolean hasNext();
public int size();
public void next();
public String getName();
public String getValue();
public String getValue(String name);
public String getPrefix(String name);
public String getText(String name);
public void setText(String name, String text);
public java.util.ArrayList<String> getNames();

There is one TagInfo instance holding all the label information for each namespace.

The add() method takes four arguments for each label:

  1. name: this is the name under which the label can be referenced by a <ref> tag. For header tags (<h1>, <h2>, ...), it is either specified in the name attribute, or a dummy value is automatically generated if no name attribute is found (which then cannot be used for referencing purposes). For <lab> tags, this value is obtained from the mandatory name attribute.
  2. value: this is the value inserted for a label. For header tags, this is the numbering inserted prior to the chapter title (like "3.1" in the example above), for label tags this is the current number to be used instead of the label. In addition, this is the text inserted for each reference created to the name, and this text is also used in any auto-generated list created for a given namespace.
  3. prefix: this text is used as a prefix for the automatically generated hyperlinks, so it typically will be the filename to precede a reference to an anchor, like "chapter3.html" in <a href="chapter3.html#mySubTitle">. This is a must when a text spans multiple HTML documents, and Xref fully supports this in the <ref> tags, and also in any list generated using the <list> tag.
  4. text: this is the text used in auto-generated lists as the name for a label in the list.
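To make the four arguments concrete, here is a minimal sketch of a TagInfo-like store. This is a simplified stand-in, not the actual implementation -- the real class keeps more state (for example, the iteration cursor behind reset()/hasNext()/next()):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified sketch of a TagInfo-like store: each label keeps its
// value, link prefix, and list text under its (unique) name.
public class TagInfoSketch {

    private static class Entry { String value, prefix, text; }

    // LinkedHashMap preserves document order for list generation
    private final Map<String, Entry> entries = new LinkedHashMap<String, Entry>();

    public boolean add(String name, String value, String prefix, String text) {
        if (entries.containsKey(name)) return false;   // names must be unique
        Entry e = new Entry();
        e.value = value; e.prefix = prefix; e.text = text;
        entries.put(name, e);
        return true;
    }

    public boolean containsName(String name) { return entries.containsKey(name); }
    public String  getValue(String name)     { return entries.get(name).value; }
    public String  getPrefix(String name)    { return entries.get(name).prefix; }
    public String  getText(String name)      { return entries.get(name).text; }

    public static void main(String[] args) {
        TagInfoSketch figures = new TagInfoSketch();   // one instance per namespace
        figures.add("fig1", "1", "paper_01_out.html", "Average population density");
        // A reference to this label could then be resolved to something like
        // <a href="paper_01_out.html#figure:fig1">1</a>
        System.out.println(figures.getValue("fig1"));  // prints "1"
    }
}
```

Storing the prefix per label, rather than globally, is what allows references in one document to point into a different output file.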
Determining the Text in Lists

For example, given a text fragment like

<h1>The first chapter</h1>
...
<lab type="figure" name="fig1">Average population density</lab>
...

we would like to see a table of contents that contains a line like

1 The first chapter

and a list of figures that contains a line like

1 Average population density

How can this be accomplished? The task is not trivial, but let's look at the first line in the example above to explain how this can be solved. For this line of HTML text, org.htmlparser.Parser creates a sequence of node elements:

  1. TagNode: The opening <h1> tag
  2. StringNode: The actual text
  3. TagNode: The closing </h1> tag

Since we're parsing one node at a time, a text collection process is activated whenever a relevant opening tag like <h1> is encountered:

Tag tag = ...     // Derive from tag name of tag found by HTML parser
switch (tag) {
  case h1:    endTag = Tag.h1End; collectBuffer = new StringBuffer(); break;
  case h2:    endTag = Tag.h2End; collectBuffer = new StringBuffer(); break;
  case h3:    endTag = Tag.h3End; collectBuffer = new StringBuffer(); break;
  case h4:    endTag = Tag.h4End; collectBuffer = new StringBuffer(); break;
  case h5:    endTag = Tag.h5End; collectBuffer = new StringBuffer(); break;
  case h6:    endTag = Tag.h6End; collectBuffer = new StringBuffer(); break;
  case label: endTag = Tag.labelEnd; collectBuffer = new StringBuffer(); break;
}

Tag here is an enum which contains all the opening and closing tags observed by the Xref tool. This will be described in more detail in the next chapter.

The text contained in any subsequently encountered StringNode is then appended to the collection buffer:

...
if ((node instanceof org.htmlparser.lexer.nodes.StringNode) && (endTag != null)) {
  collectBuffer.append(' ');
  collectBuffer.append(((org.htmlparser.lexer.nodes.StringNode)node).getText());
...

until the matching end tag is encountered:

if (endTag != null) {      // In this case, we are in text collection mode
  switch (tag) {           // See if we have found the end tag we look for
    case h1End:
    case h2End:
    case h3End:
    case h4End:
    case h5End:
    case h6End:
    case labelEnd:
      String s = collectBuffer.toString().trim();
      if (s.length() > 0)
        tagInfo.get(lastNameSpace).setText(lastName, s);
      endTag = null;
      break;
  }
}

If the required end tag is encountered, several things happen:

  • The collected text is stored in the appropriate TagInfo element using the setText() method. The namespace of this TagInfo instance and the name of the actual element for which we need to store the text (for example, in the case above, the opening <h1> tag) have been stored previously in lastNameSpace and lastName, respectively, to allow direct access to the correct element to store the text value.
  • It is checked whether the collected text is empty. This can happen for constructs such as <lab type="figure" name="fig1"></lab> since the parser will report no StringNode element in between the opening and the closing tag. In this case, the default text value will be used for the text, which is the name of the actual namespace.
  • Further text collection is disabled by setting the endTag to null.

The approach taken here has an additional benefit: sometimes there might be additional markup used in a chapter heading, for example

<h1>The <i>real</i> McCoy</h1>

Typically, such additional markup would not be desirable in a table of contents, and we get this effect here for free: the text collection process only accounts for StringNode nodes until the required closing tag is encountered. The additional tags <i> and </i> are simply ignored by the text collection process.

Note that Xref accepts an additional text attribute for all HTML header tags and the label tags to allow overriding the text collection process:

<h1 name="ChapterProxy" text="A Proxy server">
A lengthy heading describing a proxy server and what it's good for
</h1>

Sometimes it might be desirable to not use the actual text between the opening and the closing tags, for example for the sake of readability and clarity. Using the text attribute, this purpose can be easily achieved, and in the example above, "A Proxy server" would be used in generated lists.

The Actual XrefParser Class

First of all, there are two enums defined within this class. The NameSpace enum has already been described:

  private enum NameSpace {
    chapter,
    figure,
    table,
    sequence1,
    sequence2,
    sequence3,
    sequence4;
  }

whereas the Tag enum has been used in several places, but not yet fully explained:

private enum Tag {

  h1("h1"),
  h2("h2"),
  h3("h3"),
  h4("h4"),
  h5("h5"),
  h6("h6"),
  list("list"),
  label("lab"),   
  ref("ref"),
  h1End("/h1"),
  h2End("/h2"),
  h3End("/h3"),
  h4End("/h4"),
  h5End("/h5"),
  h6End("/h6"),
  labelEnd("/lab");
  
  private String name = null;
  
  private Tag(String name) {
    this.name = name;
  }

  public String getName() {
    return name;
  }
  
}

The NameSpace enum, as the name implies, defines the set of valid namespaces. Note that four additional sequences have been added for convenience, and any further namespace can be defined as necessary by extending this enum in the source code. The default behaviour for a namespace is to support cross-document auto-numbering as described in the example in the previous chapter "Namespace Concept". Only the chapter namespace gets a special treatment to account for the different levels of header tags.

The Tag enum defines the tags that XrefParser is interested in. This enum can be conveniently used in switch statements to support an easily readable programming style. Note that this enum uses some of the additional capabilities that the J2SE 5.0 implementation offers: enums can have their own methods (including constructors) and member variables, and we make use of that here to allow for an abstraction between the names of the enum elements and the HTML tag names: the enum names (such as h1, h2, or label) can be used for example in switch statements, and the actual tag names they map to can be retrieved using the getName() method. This comes in very handy here, for example to conveniently refer to the <lab> tag as XrefParser.Tag.label, and also to allow for an easy redefinition of the tag names, should this be desired: just changing one line in this enum is sufficient to change an HTML tag name Xref uses, since everywhere else only the enum instances are used.
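The reverse mapping -- from the tag name string reported by the HTML parser to the enum constant that drives the switch statements -- can be sketched as follows. Only a few constants are reproduced here, and the valueOfName() helper is a hypothetical illustration, not part of Xref itself:

```java
// Sketch: resolving a parser-reported tag name back to a Tag enum
// constant. The enum carries its HTML tag name as a member variable,
// exactly as in the Tag enum shown above.
public class TagLookup {

    enum Tag {
        h1("h1"), h1End("/h1"), label("lab"), labelEnd("/lab");

        private final String name;

        Tag(String name) { this.name = name; }

        public String getName() { return name; }

        // A linear scan is fine for a handful of constants
        static Tag valueOfName(String tagName) {
            for (Tag t : Tag.values())                 // enhanced for loop
                if (t.getName().equalsIgnoreCase(tagName))
                    return t;
            return null;                               // not a tag Xref cares about
        }
    }

    public static void main(String[] args) {
        System.out.println(Tag.valueOfName("LAB"));    // prints "label"
        System.out.println(Tag.valueOfName("p"));      // prints "null"
    }
}
```

Note that the built-in Tag.valueOf() cannot be used here, since enum constant names like labelEnd differ from the HTML tag names like "/lab" -- which is exactly the abstraction discussed above.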

The XrefParser class supports two main operations: collection of label and header tag information (pass 1), and resolution of reference information (pass 2). The most interesting methods are consequently

public void collect(String inpfile)                   // Pass 1
public void resolve(String inpfile, String outfile)   // Pass 2

The node data which is returned by the HTML parser is stored in a hash map during the collection phase. This hash map holds one instance of java.util.List of nodes for each input file:

private java.util.HashMap<String, java.util.List<org.htmlparser.Node>> nodeLists
  = new java.util.HashMap<String, java.util.List<org.htmlparser.Node>>();

This data is then reused in the resolution step.

A high-level perspective of the collection process is given in the following figure:

 
Figure 1: The Collection Process - High-Level View

The collect() method has the following structure:

public void collect(String fileName) throws java.io.IOException, org.htmlparser.util.ParserException {

  ...

//.... Setup the structure to hold the node information for this particular file

  java.util.List<org.htmlparser.Node> nodes = new java.util.ArrayList<org.htmlparser.Node>();
  nodeLists.put(fileName, nodes);

//.... Parse the file

  org.htmlparser.Parser parser = new org.htmlparser.Parser(fileName);
  org.htmlparser.Node   node   = null;

  for (org.htmlparser.util.NodeIterator i = parser.elements(); i.hasMoreNodes(); ) {
    node = i.nextNode();
    nodes.add(node);
    collectRecurse(node);
  }

}

All org.htmlparser.Node instances reported by the parser are stored in the list for this file. Then a recursive method invocation is used to traverse all existing subnodes of each node:

private void collectRecurse(org.htmlparser.Node node) throws org.htmlparser.util.ParserException {

  if ((node instanceof org.htmlparser.lexer.nodes.StringNode) && (endTag != null)) {

//.... Collect text data if we are in text collection mode

  } else if (node instanceof org.htmlparser.lexer.nodes.TagNode) {

//.... Activate text collection if necessary

    ...

//.... Handle text collection if necessary (setText())

    ...

//.... Collect data for the tags. Store it in TagInfo instances.

      switch (tag) {
        case h1: ...
        case h2: ...
        case h3: ...
        case h4: ...
        case h5: ...
        case h6: ...
        case label: ...
      }
    }
  }

//.... Recurse into children

  ...

}

After all data has been collected for all input files, the resolution step is processed. A high-level perspective of the resolution process is given in the following figure:

 
Figure 2: Resolution Process -- High-Level View

The resolve() method has the following structure:

public void resolve(String inpfile, String outfile)
  throws java.io.IOException, org.htmlparser.util.ParserException {

//.... Retrieve the previously stored document structure

  java.util.List<org.htmlparser.Node> nodes = nodeLists.get(inpfile);

//.... Resolve

  for (org.htmlparser.Node n : nodes)
    resolveRecurse(n);

//.... Output

  java.io.BufferedWriter writer = new java.io.BufferedWriter(new java.io.FileWriter(outfile));
  for (org.htmlparser.Node n : nodes)
    writer.write(n.toHtml());
  writer.flush();
  writer.close();

}

We also use a recursive procedure here to deal with the resolution of references for all the nodes:

private void resolveRecurse(org.htmlparser.Node node) throws org.htmlparser.util.ParserException {

  ...

    switch (tag) {

//.... Chapter headings

    case h1:
    case h2:
    case h3:
    case h4:
    case h5:
    case h6:

      ...   // Insert numbering and <a name ...> ... </a>

//.... Labels

    case label:

      ...   // Insert numbering and <a name ...> ... </a>

//.... References

    case ref:
        
      ...   // Insert reference number and <a href ...> ... </a>
          
//.... Lists

    case list:
        
      ...   // Insert a <table> with <a href ...> ... </a> elements

    }
        
  ...
  
//.... Recurse into children

    org.htmlparser.util.NodeList nodeList = node.getChildren();
    if (nodeList != null)
      for (org.htmlparser.util.NodeIterator i = nodeList.elements(); i.hasMoreNodes(); )
        resolveRecurse(i.nextNode());

  }
}

One little trick should be mentioned here: in several cases, additional HTML markup is inserted into the document structure, for example in the form of auto-generated HTML anchors. Rather than modifying the actual document tree by creating and inserting additional TagNode instances, we simply prepend plain text to the next StringNode element that follows a tag that triggers the creation of such additional markup.

So, for example, a name anchor would be assembled into a StringBuffer using these lines:

sb.append("<a name=\"");
sb.append(NameSpace.chapter.toString());
sb.append(":");
sb.append(tagInfo.get(NameSpace.chapter).getName());
sb.append("\">");
sb.append(tagInfo.get(NameSpace.chapter).getValue());
sb.append("</a>");
sb.append(simpleParameters.get(SimpleParameter.Name.TextAfterChapter).getValue());

and then stored in prev using prev = sb.toString();

The next time an org.htmlparser.lexer.nodes.StringNode is found, we use this construct:

if (prev != null) {
  org.htmlparser.lexer.nodes.StringNode stringNode = (org.htmlparser.lexer.nodes.StringNode)node;
  stringNode.setText(prev + stringNode.getText());
  prev = null;
}

to modify the text for this string node by injecting the additional markup for the name anchor.

It is important to note that there are different styles supported for the actual numbering of chapter headings. Based on the H*Style parameters, different output styles will be created, and chapter headings like

...
1.2.a
1.2.b
1.3
...

or

...
1.I
1.II
1.III
1.IV
...

are possible. The style can be chosen separately for each header tag level. See the Parameter.Style enum documentation for further details on supported styles.
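For illustration, letter and Roman-numeral representations like those shown above could be computed as follows. These helpers are hypothetical sketches, not Xref's actual style code:

```java
// Sketch: two of the numbering styles seen above. toLetter() produces
// a, b, ..., z, aa, ab, ...; toRoman() produces I, II, III, IV, ...
public class StyleDemo {

    static String toLetter(int n) {
        StringBuilder sb = new StringBuilder();
        while (n > 0) {
            n--;                                       // shift to 0-based
            sb.insert(0, (char) ('a' + n % 26));
            n /= 26;
        }
        return sb.toString();
    }

    static String toRoman(int n) {
        int[]    vals = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] syms = {"M", "CM", "D", "CD", "C", "XC", "L", "XL",
                         "X", "IX", "V", "IV", "I"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < vals.length; i++)
            while (n >= vals[i]) { sb.append(syms[i]); n -= vals[i]; }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLetter(2));   // prints "b"  (as in 1.2.b)
        System.out.println(toRoman(4));    // prints "IV" (as in 1.IV)
    }
}
```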

Quite some effort was invested in making the cross-referencing process bulletproof in the sense that erroneous input data is detected and handled appropriately. If such problems are found, useful workarounds are employed to ensure that the process can continue irrespective of the problem, and a warning is written to a java.io.Writer instance, which defaults to STDERR (but can be set to any desired writer).

 
Xref and the Project XML File Format

Finally, all the classes to perform the cross-referencing are in place, but using them could be easier. Rather than having to write a Java program using XrefParser for each document or set of documents to process, a generic approach was developed in the form of the Xref application. This application is fully controlled by an XML file which allows for the specification of the input and output files as well as the values of the parameters to use.

An example XML file is given here:

<?xml version="1.0" encoding="ISO-8859-1"?>

<project>

  <filepair inp="paper_01.html" out="paper_01_out.html"/>
  <filepair inp="paper_02.html" out="paper_02_out.html"/>
  <filepair inp="paper_03.html" out="paper_03_out.html"/>

  <parameter name="H2H3Separator"    value="-"/>
  <parameter name="TextAfterChapter" value=". "/>

  <set name="CustomSet" mode="set"    value="10 11 12 13 14"/>
  <set name="CustomSet" mode="add"    value="15"/>
  <set name="CustomSet" mode="remove" value="10"/>

</project>

Within an enclosing <project> tag, three further tags are supported:

  1. filepair: a pair of an input file and the file where the output for this file is written
  2. parameter: set a simple parameter (override default value)
  3. set: set or modify a set parameter. The mode attribute accepts the values "add", "remove", or "set".
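Reading such a project file with the XML parser built into the JDK is straightforward. The following is a sketch of how the <filepair> entries might be extracted; Xref's actual parsing code may differ:

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch: extracting the <filepair> entries of a project file using
// the JAXP DOM parser that ships with the JDK.
public class ProjectFileDemo {

    // Returns {inp, out} pairs in document order
    static List<String[]> filePairs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                         .parse(new ByteArrayInputStream(xml.getBytes("ISO-8859-1")));
        NodeList pairs = doc.getElementsByTagName("filepair");
        List<String[]> result = new ArrayList<String[]>();
        for (int i = 0; i < pairs.getLength(); i++) {
            Element e = (Element) pairs.item(i);
            result.add(new String[] { e.getAttribute("inp"), e.getAttribute("out") });
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<project>"
                   + "<filepair inp=\"paper_01.html\" out=\"paper_01_out.html\"/>"
                   + "<parameter name=\"TextAfterChapter\" value=\". \"/>"
                   + "</project>";
        for (String[] pair : filePairs(xml))           // enhanced for loop
            System.out.println(pair[0] + " -> " + pair[1]);
    }
}
```

The two lists returned this way map directly onto the inpfiles and outfiles arguments of xrefFiles() described below.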

The most important method of the Xref application is xrefFiles():

public void xrefFiles(java.util.List<String> inpfiles,
                      java.util.List<String> outfiles,
                      XrefParser             parser) throws java.io.IOException,
                                                    org.htmlparser.util.ParserException

The main() method of Xref performs the following tasks:

  1. Create an Xref instance
  2. Create an XrefParser instance
  3. Parse the project XML file provided as argument using the XML parser built into the JDK. Setup the lists of input and output files, and set or modify the parameters of the XrefParser instance, if necessary.
  4. Invoke the xrefFiles() method of the Xref instance.

The actual cross-referencing is then performed by xrefFiles(), and the design approach chosen here ensures that Xref can also be instantiated within other applications, such that these can also invoke xrefFiles().

The source code of this method is given here:

public void xrefFiles(java.util.List<String> inpfiles,
                     java.util.List<String> outfiles,
                     XrefParser            parser) throws java.io.IOException,  
                                                           org.htmlparser.util.ParserException {

  ...

//.... Pass 1

  for (int ifile = 0; ifile < inpfiles.size(); ifile++) {
    parser.setLinkPrefix(outfiles.get(ifile));
    parser.collect(inpfiles.get(ifile));
  }

//.... Pass 2

  for (int ifile = 0; ifile < inpfiles.size(); ifile++) {
    parser.resolve(inpfiles.get(ifile), outfiles.get(ifile));
  }

}

Note that using special HTML characters like "&auml;" in the XML file requires the ampersand to be escaped. The ampersand is a reserved character in XML to designate entity references, and thus the correct syntax for such characters would be for example

<parameter name="TextAfterChapter" value=".&amp;nbsp;"/>

 
Example

A stripped-down version of this paper itself is used here as an example to illustrate the capabilities of Xref. The original file can be seen here, whereas the processed file is available from here.

Note the changes in the modified file:

  • The table of contents, list of figures, and list of tables has been automatically generated with all the text and hyperlinks.
  • All headings, tables, and figures have been automatically numbered.
  • The demo reference (preceded by the text "The resources are in chapter") has been resolved and a hyperlink to the correct chapter heading has been introduced.

The project XML file used for this process looked like:

<?xml version="1.0" encoding="ISO-8859-1"?>
<project>
  <filepair inp="orig.html" out="mod.html"/>
  <parameter name="H3Style"         value="Number"/>
  <parameter name="H2H3Separator"   value=""/>
  <parameter name="TextAfterChapter" value=".&amp;nbsp;"/>
</project>

In this case, the headings started at level <h3>, not at <h1>, simply because the <h1> formatting would have been too large. Setting H2H3Separator to an empty string ensures that we don't end up with a period preceding the h3-level numbering: instead of

.1. Introduction

we now get

1. Introduction

In addition, the period after the heading number was created by setting TextAfterChapter to ". ". By default, just a blank space is used.

In the example above, the heading for the introduction chapter was declared like this

<h3 name="h1">Introduction</h3>

in the input HTML file. Xref transformed this into

<h3 name="h1"><a name="chapter:h1">1</a>.&nbsp;Introduction</h3>

in the output file. Note the auto-generated HTML anchor, which can be referred to in a <ref> tag.

One final note: Xref relies heavily on the parsing capabilities of the org.htmlparser.Parser parser. Given the gazillions of web pages out there (which obviously haven't been tested within the scope of this development project), it is quite possible that there are cases where this parser has trouble handling certain data. So the version number of 1.0 assigned to the release described here is justified in the sense that all the test cases created during development were handled successfully; at the same time, feedback on problematic cases would be appreciated to improve the next version. In addition, Xref cannot repair documents with a broken structure: while the parser does a good job of handling, for example, missing closing tags (like </td> or </p>, which are quite frequently forgotten), the numbering added by Xref will not be as expected if the nesting of header tags is incorrect, like an <h2> inside an <h3>. Such problems cannot be corrected automatically, neither by the parser nor by Xref, and the TITO (Trash In, Trash Out) principle applies.

 
Summary

Xref is a -- hopefully useful -- tool to primarily support authors of structured textual web content. It offers several levels of complexity, ranging from the simple automated generation of chapter heading numbers to a full automatic cross-referencing solution including auto-generated reference lists for arbitrary objects in texts that can span multiple HTML documents. The implementation of Xref also illustrates several of the powerful new ease-of-development features in the upcoming J2SE 5.0 (Tiger) like generics, enums, and enhanced for loops. I have found these extremely useful for this project, and I would estimate that -- even when factoring in the initial learning curve -- the development effort was reduced by approximately 30%, and the number of lines of code by about 40%. The latter is mostly due to the simple use of enums and the ensuing type safety.

 
Resources
  1. The sources, Javadocs, and the package JAR file can be downloaded from here. This package also contains the original and the processed HTML sources for this article.
  2. The documentation on counters in CSS2 can be found here: http://www.w3.org/TR/CSS21/generate.html#counters
  3. The HTMLParser Sourceforge project: http://htmlparser.sourceforge.net
  4. LaTeX -- A Document Preparation System. http://www.latex-project.org/intro.html
  5. J2SE 5.0 (Tiger) is required to use Xref. It can be downloaded from here.
  6. The full set of the Tiger documentation is available from here.
  7. For information on the new features of Tiger, see: New Language Features for Ease of Development in the Java 2 Platform, Standard Edition 1.5: A Conversation with Joshua Bloch here.
  8. Another great article on Tiger is J2SE 1.5 in a Nutshell by Calvin Austin, which is available from here.
 
About the Author

Dr. Matthias Laux is a senior engineer for Sun Microsystems working in the Global SAP-Sun Competence Center in Walldorf, Germany. His main interests are Java and J2EE technology, architecture, and programming, as well as web services and XML technology in general, databases, and performance and benchmarking. Although he also has a background in aerospace engineering and HPC / parallel programming, today his languages of choice are Java and Perl. He is a certified Solaris Administrator, Java Programmer, and Java Enterprise Architect.
