Keeping Those Links Up-to-Date

   
   


Frequent users of the Internet face a common problem: there are so many interesting links which they not only would like to come back to at some point, but which are also interesting to others -- for example, colleagues who share a corporate intranet, and who are interested in similar areas. Browser bookmarks are suitable for storing only small numbers of such links, since they are only visible on one particular machine (unless the browser configuration data is shared via some mechanism like NFS), and are not shareable with other users of the network.

So sooner or later, many of us create simple HTML pages with our favorite links, typically organized by main area and subtopic, and often including a personal note in the design to make them even more interesting to other users.

But the web is a very dynamic medium, and links are bound to be migrated to some other location, or vanish entirely for any number of reasons. So these web pages have to be maintained. This process, however, is almost impossible to handle manually, since you'd have to click on each link, see what happens, and update the HTML code manually depending on the response received. With hundreds of links in a typical collection, this is clearly not practical. The lack of a solution to this problem sooner or later leads to frustrating user experiences because of many broken links. If the web page creator is lucky, s/he may get a friendly email from someone who detected such a broken link; in the more unlucky cases, the emails are not friendly, or users simply do not return to that collection, thereby making it essentially obsolete.

But there is a solution. I have developed three tools to store link information in a database, allowing for automated link validity checks (including a corresponding update of the database). The tools also provide a powerful mechanism to automatically update the web pages that contain the link collections, using the open-source Velocity template engine.

(To give credit where credit is due: I got the idea to create this set of tools after reading John Zukowski's article "Validating URL Links," in which he describes a technique to check whether an HTTP URL is still available.)

  • URLManage manages all interactions with the link data in the backend storage, which is a relational database system. This RDBMS stores the data in some tables, and either a new database needs to be set up, or an existing database can be used (using a separate schema or instance) to hold these tables. URLManage relies on the DBAccessor package, a JDBC wrapper that I developed earlier. URLManage creates, modifies, deletes, and retrieves link data, and manages the database schema itself. 
  • URLCheck automatically checks the links stored in the database against the network; that is, URLCheck tries to connect to that network resource and determine its status (availability). The actions taken (such as updating or removing a link in the database) depend on the status response received.
  • URLPublish creates web pages using the Velocity template engine, an Apache Jakarta project. The link data from the database is combined with a template web page to create the actual HTML output required for publishing.

URLManage

Data Representation

The first design decision I made was to store the link data in a database to provide a clean structure for the information, and to take advantage of all the usual benefits of an RDBMS (such as transaction or backup support).

The next step was to define the database schema. The information we want to store for a link consists of the actual URL, description text, and some additional information: the date this entry was created, the date it was last checked, and the response code obtained during this check. A link can be known in one or more contexts, where a context is basically a topic area to which this link belongs.

Here's an example to shed more light on my approach. We have a link with the URL and the description "Sun's main Java entry page" (plus the additional information described above). In a link collection, this link could be known in the context of "Programming Languages," but also in the context of "Sun Microsystems." Looking at the actual web pages created, this could result in the link appearing in two different chapters on that page, or on different pages -- all depending on the output template(s) chosen (details on this will follow later).

This data structure is represented by two tables in the database. Their schema is described by an XML file (schema.xml), and it has the following properties:

Table URLMGR_LINK

  Column       Type     Length
  URL          VARCHAR  200
  DESCRIPTION  VARCHAR  200
  CREATED      VARCHAR  20
  LASTCHECK    VARCHAR  20
  LASTCODE     INTEGER

Table URLMGR_CONTEXT

  Column   Type     Length
  URL      VARCHAR  200
  CONTEXT  VARCHAR  200

Note that the column names shown in bold form the primary key for the respective tables, indicating which entries are unique.

An interesting detail is that the column lengths can be modified, if necessary, in the XML file above (which is part of the URLManager package) before you actually create the database schema. The DBAccessor package is used internally to transform this XML schema description into SQL statements, which are then used to create these tables in the database using the URLManage tool. When creating or updating data in these tables, the chosen lengths of these fields are retrieved directly from the database (again using DBAccessor capabilities). This information is then used to check whether or not the link and context data provided by the user fits into the space provided in the database tables. Note that once the field lengths have been chosen and the schema created, these lengths cannot be modified using the URLManager package.
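To make the idea of turning a schema description into SQL more concrete, here is a minimal sketch. It does not use DBAccessor or schema.xml; the class SchemaSketch, its Column helper, and the exact SQL text produced are assumptions made purely for illustration, and the length check only mirrors the kind of validation described above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class SchemaSketch {

    /** One column definition; a length of 0 means "no length", as for INTEGER. */
    static final class Column {
        final String name;
        final String type;
        final int length;

        Column(String name, String type, int length) {
            this.name = name;
            this.type = type;
            this.length = length;
        }

        String toSql() {
            return length > 0 ? name + " " + type + "(" + length + ")" : name + " " + type;
        }
    }

    /** Builds a CREATE TABLE statement from the column definitions. */
    static String createTable(String table, List<Column> columns) {
        return columns.stream()
                .map(Column::toSql)
                .collect(Collectors.joining(", ", "CREATE TABLE " + table + " (", ")"));
    }

    /** Returns true if the value fits into the given column. */
    static boolean fits(String value, Column column) {
        return column.length <= 0 || value.length() <= column.length;
    }

    public static void main(String[] args) {
        List<Column> context = Arrays.asList(
                new Column("URL", "VARCHAR", 200),
                new Column("CONTEXT", "VARCHAR", 200));
        System.out.println(createTable("URLMGR_CONTEXT", context));
    }
}
```

In the real package the lengths are read back from the database rather than kept in memory, so a check like fits() would use the values reported by the database.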

Links and Contexts

The two core data components -- links and contexts -- are modeled with the classes Link and Context, both of which are subclasses of DBAccessor's RowData class. This class provides built-in persistence support via the insert(), update(), select(), and delete() methods, which automatically create and issue the corresponding SQL statements to perform these actions in the database.

The Context class is actually very simple, holding just two string-valued properties -- for the URL and the name of the context. To create this class, only a few convenience extensions to the services provided by RowData were implemented; all other required capabilities (such as persistence or setter/getter methods) are provided by RowData. The Link class is slightly more complex, since it holds references to all contexts defined for a link in a hash map. For example, to insert a link into the database, the insert() method of the Link class would first call its own insert() method (in the superclass), and then call insert() for all of its contexts:

  public class Link extends ml.jdbc.RowData {

    // All contexts defined for this link
    java.util.HashMap contexts = new java.util.HashMap();

    ...

    public void insert(ml.jdbc.DBUser user) throws ml.jdbc.AccessorException {
      // Insert the link itself, then all of its contexts
      super.insert(user);
      ml.jdbc.Transfer.exportTableData(contexts.values(), user);
    }

    ...
  }

DBAccessor offers another convenience bulk method here (exportTableData()), which performs the insert step for the contexts under the covers.

Using URLManage

URLManage is a simple Java application whose main task is to first validate the command line options provided by the user, and then issue these commands to the database. This database interaction is handled by the PersistenceManager class, an instance of which is created within URLManage. PersistenceManager offers all the capabilities to manage links and contexts (creation, deletion, updates, and selections of individual entries or lists of entries). Other capabilities include bulk import of data from input files (which is useful for the import of existing link collections) and database schema management (creation and deletion of the database tables used to hold links and contexts). Since PersistenceManager provides all the persistence capabilities required, it could also be instantiated and used by other tools, such as a graphical front end (rich client or web UI) instead of the URLManage command line tool.

Before you use the URLManager package, an important first step is to create the database schema required to hold the link information. URLManage also supports schema management, and the database tables can be created (or deleted, should this ever be necessary) using these commands:

  java -classpath $CP ml.urlmgr.URLManage -init db.properties
  java -classpath $CP ml.urlmgr.URLManage -drop db.properties

The property file db.properties contains the description of the database connection parameters, as required by the DBAccessor package. This is a Java property file, that is, a text file with key|value pairs, where key and value are separated by an equals sign. A typical file might look like this:

# Database host
HOST = host.company.com
# JDBC port
PORT = 3306
# DB type (see the DBAccessor API docs for details;
# these are included with the URLManager download)
TYPE = msc
# The name/id of the database
NAME = urlmgr
# DB user holding the URLManager tables
USER = myuser
# Password for this user
PASS = passcode

This property file is passed on to DBAccessor to set up the database connectivity, and no further data is required, as long as the database is one of the types currently supported by DBAccessor's default configuration. In the example above, msc indicates a MySQL database. Other supported types are Oracle, DB2, Cloudscape, and PostgreSQL. Naturally, DBAccessor also offers additional capabilities to configure support for other database types.
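For readers who have not worked with Java property files, the following sketch shows how such a file can be read with java.util.Properties. How DBAccessor itself parses the file is not shown in this article, so the class DbPropertiesSketch and the MySQL-style JDBC URL it assembles are assumptions for illustration only. One detail worth knowing: java.util.Properties treats # as a comment marker only at the start of a line, which is why the sketch keeps comments on their own lines.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class DbPropertiesSketch {

    /** Parses property file text; values are trimmed because load() keeps trailing blanks. */
    static Properties parse(String text) {
        Properties raw = new Properties();
        try {
            raw.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Properties trimmed = new Properties();
        for (String key : raw.stringPropertyNames()) {
            trimmed.setProperty(key, raw.getProperty(key).trim());
        }
        return trimmed;
    }

    /** Hypothetical mapping of the connection parameters to a MySQL-style JDBC URL. */
    static String mysqlUrl(Properties p) {
        return "jdbc:mysql://" + p.getProperty("HOST") + ":" + p.getProperty("PORT")
                + "/" + p.getProperty("NAME");
    }

    public static void main(String[] args) {
        String text = "# Database host\nHOST = host.company.com\nPORT = 3306\nNAME = urlmgr\n";
        System.out.println(mysqlUrl(parse(text)));
    }
}
```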

Once the database schema has been established, a typical usage example of URLManage would look like this:

java -classpath $CP ml.urlmgr.URLManage -v -c db.properties
  "#" "Sun's main Java entry page" "Programming Languages"

Here, the (optional) flag -v enables verbose output, and -c selects link creation. The string parameters are pretty much self-explanatory and correspond to the example described above: the first argument is the actual HTTP URL, the second argument is the description for this URL, and the third argument is the context in which this URL is to be known.

If you wanted to add the additional context "Sun Microsystems" for this link, the command would be:

  java -classpath $CP ml.urlmgr.URLManage -ac db.properties
    "#" "Sun Microsystems"

When migrating link collections to the URLManager package, the facilities provided by URLManage for bulk imports come in handy:

  java -classpath $CP ml.urlmgr.URLManage -bc db.properties data.link

This command would import all the link data provided in the text file data.link. This is much more efficient than importing many links using separate URLManage invocations, since only one Java VM process needs to be created; it can insert this data into the database using the same JDBC connection, as opposed to creating a new VM process and database connection for each link.

Here is a typical input file for bulk link data creation:

http://www.sun.com
Sun Microsystems home page
Computer companies
http://www.sap.com
SAP AG home page
Computer companies
http://www.google.com
Google - a cool search engine for the WWW
Search Engines
..

The three string parameters required to create a link in the database are provided on a separate line each: URL, description, and context.
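The three-line record format is easy to process. The sketch below shows one way to read it; it is not the parser used by URLManage, and the class name BulkLinkSketch as well as the decision to skip blank lines are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BulkLinkSketch {

    /** Groups the input lines into records of URL, description, and context. */
    static List<String[]> parse(List<String> lines) {
        List<String[]> records = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // tolerate blank lines (an assumption, not documented behavior)
            }
            buffer.add(line.trim());
            if (buffer.size() == 3) {
                records.add(buffer.toArray(new String[0]));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            throw new IllegalArgumentException("Incomplete record starting at: " + buffer.get(0));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "http://www.sun.com", "Sun Microsystems home page", "Computer companies");
        for (String[] r : parse(lines)) {
            System.out.println(r[0] + " | " + r[1] + " | " + r[2]);
        }
    }
}
```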

Since link collections typically refer to some links in different contexts, URLManage also provides a bulk import method for additional contexts:

  java -classpath $CP ml.urlmgr.URLManage -bac db.properties data.context

An example bulk context data file might look like this:

  http://www.sun.com
  Leading UNIX Vendors
  http://www.sap.com
  ERP Vendors
  ...

The format is quite simple, expecting one line for the URL (which is unique in the database), followed by one line for the additional context for this URL.

You can create the text files for bulk data import based on existing link collections using tools like Perl or the Java regular expression package (as of JDK 1.4), and then import into the database used by the URLManager package.
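As a rough sketch of the regular expression approach, the following code pulls simple anchors out of an existing HTML page and emits the three-line records expected by the bulk link import. The class LinkExtractSketch and its pattern are assumptions; the pattern only handles plain anchors whose href is the first attribute and is not a full HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkExtractSketch {

    // Matches simple anchors of the form <a href="URL">text</a>.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+href=\"([^\"]+)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Emits URL, description, and context lines for every anchor found. */
    static List<String> toBulkLines(String html, String context) {
        List<String> lines = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            lines.add(m.group(1).trim());
            // Collapse line breaks and runs of blanks inside the link text.
            lines.add(m.group(2).replaceAll("\\s+", " ").trim());
            lines.add(context);
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<li> <a href=\"http://www.sun.com\"> Sun Microsystems home page </a> </li>";
        for (String line : toBulkLines(html, "Computer companies")) {
            System.out.println(line);
        }
    }
}
```

Link texts that contain nested markup would need additional cleanup before being used as descriptions.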

You can obtain the complete usage description for the tool by invoking it without (or with an illegal number of) arguments. The description is also contained in the file doc/Manage.usage, which is part of the distribution.

URLCheck

Now that we have stored all the required data in the database, the next step is to provide a tool to check all of these links against the network and to take appropriate action, depending on the outcome of this check. This is what URLCheck was developed for.

The CheckManager Concept

A CheckManager is a class that uses a specific protocol (HTTP, HTTPS, FTP, LDAP, IMAP, and so on) to check whether the resource identified by a link is still available on the network. The methods required by every CheckManager are:

  public abstract CheckResult check(Link link) throws URLManagerException;

  public abstract boolean update(Link link, CheckResult result,
    PersistenceManager persistence) throws URLManagerException;

  public abstract void init(java.util.Properties config, boolean verbose,
    boolean update) throws URLManagerException;

Besides the check() method, the init() method is required to transfer configuration data to an actual instance, whereas the update() method implements the updates in the database, depending on the result of a check. It is assumed here that the different protocols (such as HTTP or FTP) return an integer-valued response code, which is encapsulated in a CheckResult instance. This helper class holds the response code, but can also hold any number of additional properties (via a generic mechanism based on a java.util.Properties member variable), as required by the specific protocol checked. For HTTP, as an example, CheckResult also holds the value of the Location header in the HTTP response, which is required to properly handle HTTP redirect responses.
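The following sketch illustrates what an HTTP check along these lines might look like using only the standard java.net classes. It is not the HttpCheckManager from the package: the class HttpCheckSketch, the nested Result class (a stand-in for CheckResult), the use of a HEAD request, and the timeout values are all assumptions.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Properties;

final class HttpCheckSketch {

    /** Stand-in for CheckResult: a response code plus free-form properties. */
    static final class Result {
        final int code;
        final Properties properties = new Properties();

        Result(int code) {
            this.code = code;
        }
    }

    /** Wraps a response code and an optional Location header value. */
    static Result toResult(int code, String location) {
        Result result = new Result(code);
        if (location != null) {
            result.properties.setProperty("Location", location);
        }
        return result;
    }

    /** Sends a HEAD request to an http or https URL without following redirects. */
    static Result check(String url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            con.setRequestMethod("HEAD");
            con.setInstanceFollowRedirects(false); // keep 3xx codes and the Location header visible
            con.setConnectTimeout(5000);
            con.setReadTimeout(5000);
            return toResult(con.getResponseCode(), con.getHeaderField("Location"));
        } finally {
            con.disconnect();
        }
    }
}
```

Disabling automatic redirects matters here: otherwise the connection would silently follow a redirect and the check would never see the new location that needs to be stored.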

All of this can vary, depending on the protocol to be checked. And (even though, admittedly, most of the links encountered in link collections probably use either HTTP or HTTPS) URLCheck was designed to allow for the inclusion of any protocol, provided a CheckManager is implemented for it. You can implement and add additional CheckManager subclasses to the URLCheck tool without any code changes, just by specifying a corresponding property file which contains all the required parameters for such a protocol, especially the class name which implements this protocol's CheckManager.

For HTTP, this file could look like:

  MANAGER    = ml.urlmgr.HttpCheckManager
  HTTP_PROXY = webcache.germany.sun.com
  HTTP_PORT  = 8080
  UPDATE     = 301|303
  REMOVE     = 404|410|500|505

(These properties are passed to the init() method as the first argument.)

MANAGER is the only mandatory property for all protocol properties files. This property specifies the class name implementing the CheckManager for the protocol. All other configuration parameters specified in these property files are completely dependent on the CheckManager implementation used for a specific protocol. In the example above, other configuration parameters for the HTTP CheckManager include the proxy configuration (if required) and (optionally) lists of HTTP response codes for which the link data needs to be updated in the database due to redirections (UPDATE) or removals (REMOVE) -- for example, due to the much-dreaded HTTP 404 response ("Not found"). These response codes are specified in RFC 2616, and the approach chosen here allows for a very flexible handling of update/remove actions, depending on the user's requirements for the different response codes.
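A small sketch may help to show how the UPDATE and REMOVE lists can drive the decision for a given response code. The class ActionSketch and its Action enum are hypothetical, and giving REMOVE precedence over UPDATE when a code appears in both lists is an assumption.

```java
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

final class ActionSketch {

    enum Action { NONE, UPDATE, REMOVE }

    /** Parses a list such as "301|303" into a set of integer codes. */
    static Set<Integer> parseCodes(String value) {
        Set<Integer> codes = new HashSet<>();
        if (value != null) {
            for (String token : value.split("\\|")) {
                if (!token.trim().isEmpty()) {
                    codes.add(Integer.valueOf(token.trim()));
                }
            }
        }
        return codes;
    }

    /** Decides what to do with a link for the given response code. */
    static Action actionFor(int code, Properties config) {
        if (parseCodes(config.getProperty("REMOVE")).contains(code)) {
            return Action.REMOVE;
        }
        if (parseCodes(config.getProperty("UPDATE")).contains(code)) {
            return Action.UPDATE;
        }
        return Action.NONE;
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("UPDATE", "301|303");
        config.setProperty("REMOVE", "404|410|500|505");
        System.out.println(actionFor(404, config));
    }
}
```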

You define the different CheckManagers which are to be used within URLCheck (and thus the different supported protocols to check for) by using a property file specified as a command line argument to URLCheck. An example for such a file would be:

  http = config/http.properties

which basically means that the properties for the HTTP protocol are specified in the given property file. Any number of protocols can be handled in this way, where the names of the protocols are those returned by java.net.URL.getProtocol().  
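The mechanism that makes this work without code changes is class loading by name. The sketch below shows the general technique; the Manager and DemoHttpManager classes are simplified stand-ins, not the actual CheckManager hierarchy, and the error handling is an assumption.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.util.Properties;

final class ManagerLookupSketch {

    /** Minimal stand-in for the abstract CheckManager class. */
    abstract static class Manager {
        abstract String name();
    }

    /** Example implementation that a MANAGER property could name. */
    static final class DemoHttpManager extends Manager {
        String name() {
            return "demo-http";
        }
    }

    /** Returns the protocol part of a URL, for example "http". */
    static String protocolOf(String url) {
        try {
            return URI.create(url).toURL().getProtocol();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Unsupported URL: " + url, e);
        }
    }

    /** Instantiates the class named by the MANAGER property via reflection. */
    static Manager instantiate(Properties protocolConfig) {
        String className = protocolConfig.getProperty("MANAGER");
        try {
            return (Manager) Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create " + className, e);
        }
    }
}
```

Because java.net.URL only accepts protocols for which a handler is installed, a URL with an unknown protocol is rejected before any manager lookup takes place.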

CheckManager also provides some basic services deemed useful for all subclasses, the most important of which is the management of protocol response codes. As mentioned before, all protocols are expected to return some integer-valued response code. These are typically specified in RFCs such as

  Protocol     RFC
  HTTP/HTTPS   2616
  LDAP         2251
  FTP          640
  IMAP         2060

and CheckManager offers the following methods to support generic response code management:

  public void addCode(int code, String description)
  public int getMinCode()
  public int getMaxCode()
  public String getCodeText(int code)

The idea here is that CheckManager subclasses define the response codes for the protocol they handle in their init() method using these methods.
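One plausible way to back these four methods is a sorted map, as in the sketch below. This is not the CheckManager source; the class ResponseCodesSketch and the text returned for unknown codes are assumptions.

```java
import java.util.TreeMap;

final class ResponseCodesSketch {

    // Sorted by code, so the smallest and largest codes are directly available.
    private final TreeMap<Integer, String> codes = new TreeMap<>();

    public void addCode(int code, String description) {
        codes.put(code, description);
    }

    /** Smallest registered code; throws NoSuchElementException if none was added. */
    public int getMinCode() {
        return codes.firstKey();
    }

    /** Largest registered code; throws NoSuchElementException if none was added. */
    public int getMaxCode() {
        return codes.lastKey();
    }

    public String getCodeText(int code) {
        return codes.getOrDefault(code, "Unknown code " + code);
    }

    public static void main(String[] args) {
        ResponseCodesSketch http = new ResponseCodesSketch();
        http.addCode(200, "OK");
        http.addCode(404, "Not Found");
        System.out.println(http.getMinCode() + ".." + http.getMaxCode());
    }
}
```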

Using URLCheck

The following code invokes a complete check of links using URLCheck:

  java -classpath $CP ml.urlmgr.URLCheck -u -v db.properties web.properties

Again, -v (optional) enables verbose output. The optional flag -u causes the specified update/remove actions to be actually executed (without this flag, the links would only be checked, but no changes would be made to the database). The database properties file is again required to access the database, and the web properties file is the master properties file described above, which contains references to the properties files for the individual supported protocols.

To obtain a complete usage description, you invoke URLCheck without (or with the wrong number of) arguments. URLCheck uses the following operation sequence:

  • First, URLCheck reads the web properties file and instantiates a CheckManager for each of the protocols specified, then calls the init() method of these instances with the protocol-specific properties as an argument. These CheckManager instances are stored in another helper class, ProtocolHandler. This class also holds integer counters which are used to collect statistics on the responses received for a protocol during the checks; statistics are printed after all links have been checked.
  • Next, a PersistenceManager is created using the database properties file specified on the command line. This instance is then used to retrieve all the links from the database.
  • All these links are treated in a loop, and the CheckManager.check() method is called for each link, provided that such an instance exists for the link's protocol (if not, a warning message is printed). Statistics are collected based on the CheckResult object received using the ProtocolHandler's count() method.
  • The link columns containing the date of the last check and the response code obtained during that check are updated.
  • The CheckManager's update() method is called. It is now that instance's responsibility to effect any changes in the database required, depending on the settings provided in the properties file for this protocol.
  • After all links have been checked, statistics are printed for each protocol, using ProtocolHandler.printStatistics().
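The sequence above can be sketched as a simple loop. The code below is a simplified model, not the URLCheck source: checkers are represented as plain functions from a URL to a response code, statistics are kept in nested maps instead of a ProtocolHandler, and the database steps are only indicated by a comment.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToIntFunction;

final class CheckLoopSketch {

    /** Runs every link through the checker for its protocol and counts the response codes. */
    static Map<String, Map<Integer, Integer>> run(List<String> links,
            Map<String, ToIntFunction<String>> checkers) {
        Map<String, Map<Integer, Integer>> statistics = new TreeMap<>();
        for (String link : links) {
            String protocol = URI.create(link).getScheme();
            ToIntFunction<String> checker = protocol == null ? null : checkers.get(protocol);
            if (checker == null) {
                System.err.println("Warning: no checker for " + link);
                continue;
            }
            int code = checker.applyAsInt(link);
            statistics.computeIfAbsent(protocol, p -> new TreeMap<>())
                    .merge(code, 1, Integer::sum);
            // A real run would now store the check date and code, then call update().
        }
        return statistics;
    }
}
```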

Developing other CheckManagers

Currently, the only CheckManager actually implemented is based on the HTTP protocol to cover the most important case. Additional CheckManagers are fairly easy to implement, since they can extend the abstract CheckManager class and use its services. The following issues need to be addressed before you develop a new implementation class:

  • First, determine how a specific protocol can actually be accessed on the network from within the Java code. Although the Java class library is very large, it doesn't contain handler classes to support all the relevant web protocols; you'll need to find (or create) a Java package that provides the required services for a protocol.
  • Identify and define the parameters that are required to configure the handling of the protocol. These need to be defined in the protocol-specific properties file, and will be provided through the java.util.Properties argument to the init() method.
  • Create the actual implementation class by extending CheckManager and providing the init(), update(), and check() methods. The HttpCheckManager class can serve as an example of how this can be done.
  • Add a property to the web properties file to enable the handling of the protocol and to identify the protocol-specific properties that the implementation class needs to perform its tasks.

One additional complication when working with non-HTTP protocols is the use of proxy servers -- for example, when accessing the WWW from within corporate networks. Since both the client's browser and the proxy server forwarding the request use HTTP as their protocol, the HTTP response codes are visible on the client side. Other protocols, however, will be wrapped within an HTTP request in between the client and the proxy, and only the proxy will then use the actually chosen (non-HTTP) protocol to access the network resource. One such example is the FTP protocol: an FTP request of the form ftp://server.acme.com would be sent to the proxy as an HTTP GET request for the URL ftp://server.acme.com. The proxy will then access that FTP server directly by connecting to port 21 (the default FTP port). The response back to the client browser will again be wrapped into a message transferred by HTTP. The problem here is to identify the actual response code of the FTP server, since this will also be wrapped within an HTML message transferred by HTTP. This is something a CheckManager implementation for FTP would need to address.

URLPublish

The Velocity Template Engine

The final step in the process of keeping link collections up-to-date is to recreate the web pages based on the data that has been checked (and possibly updated) by URLCheck. The Java-based Velocity template engine is a very convenient tool to achieve this goal, with only a few lines of additional code required.

The approach Velocity uses is very simple, yet powerful:

  • A text file ("template") is instrumented with tags that Velocity recognizes. This template is then processed within a Java application using the Velocity API, and Velocity parses these tags and fills them with data obtained from the Java application, where required. From within Velocity tags, Java objects and methods can be accessed directly using reference names. In addition, these tags provide some capabilities available in other programming languages -- for example, flow-control structures. Velocity also allows for the definition of macros, which is very useful to avoid repeating the same tag structures several times in a template.
  • To establish the connection between the actual data required to fill the template and the template itself, we use the VelocityContext class. Using this context, Java objects are assigned to the reference names used in the template. Velocity provides very powerful capabilities (based on Java Reflection) to figure out what capabilities such a Java object has. This covers, for example, automatically identifying property getter methods for JavaBeans, or providing iteration capabilities for objects based on the Java Collections API with a very simple syntax.
  • While Velocity can be used to create any kind of text file -- producing, for example, SQL, PostScript, or even Java output -- in the case of the URLManager, the primary focus is HTML.

One additional benefit of Velocity is that it directly supports the MVC approach by letting the web designer focus on the View (the template), and the application programmer focus on the Model (in our case, the URLManage tool and the database) and the Controller (URLPublish).

Here is an example of a Velocity template that would create a web page with all the links in the database, grouped by context:

#macro( list $context )
<p><b>$context</b> </p>
<ul>
#foreach( $link in $links.get($context) )
<li> <a href="$link.url"> $link.Description </a> </li>
#end
</ul> <p>
#end

<html>
<body>

#foreach( $context in $contexts )
#list($context)
#end

</body>
</html>

Here, $links and $contexts are names which are linked with Java objects using the VelocityContext in URLPublish:

  PublishManager manager = new PublishManager(...);
  VelocityContext context = new VelocityContext();
  context.put("links", manager.getMap());
  context.put("contexts", manager.getContexts());


$links references a java.util.HashMap that maps context names to instances of java.util.TreeSet. Each TreeSet instance holds all the Link objects for that context. $contexts represents a java.util.TreeSet instance holding just context names. Note that TreeSets provide sorting capabilities: while the natural sort order is used for the context names, a custom comparator has been implemented to sort links according to various criteria. Currently, you can sort by creation date and by link description, selected via URLPublish command line flags.
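To illustrate this data structure, the following sketch builds a map from context names to sorted sets of links. The Link class here is a stripped-down stand-in, the comparator is not the one shipped with the package, and the creation date is assumed to be stored as a string that sorts chronologically (for example, 2003-01-31).

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeSet;

final class SortedLinksSketch {

    /** Simplified link holding only the fields needed here. */
    static final class Link {
        final String url;
        final String description;
        final String created;

        Link(String url, String description, String created) {
            this.url = url;
            this.description = description;
            this.created = created;
        }
    }

    /** Sorts by creation date or description, optionally in reverse order. */
    static Comparator<Link> comparator(boolean byDate, boolean reverse) {
        Comparator<Link> primary = byDate
                ? Comparator.comparing((Link l) -> l.created)
                : Comparator.comparing((Link l) -> l.description);
        // Tie-break on the URL so links with equal sort keys are not dropped by the TreeSet.
        Comparator<Link> full = primary.thenComparing((Link l) -> l.url);
        return reverse ? full.reversed() : full;
    }

    /** Adds a link to the sorted set of its context, creating the set on first use. */
    static void add(Map<String, TreeSet<Link>> map, String context, Link link,
            Comparator<Link> order) {
        map.computeIfAbsent(context, c -> new TreeSet<>(order)).add(link);
    }
}
```

The tie-break is worth noting: a TreeSet treats two elements as duplicates when the comparator returns 0, so comparing on the description alone would silently drop one of two links that share a description.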

In the example above, the list macro takes the name of a context as argument ($context) and uses it to first print a header line with the context name, and then create an HTML list with all the links available for this context. Velocity automatically determines that $links.get($context) is a java.util.Collection and iterates over it. You can access the link data using an abbreviated syntax ($link.url) which Velocity translates to a call to the Link.getUrl() method. Note that -- apart from the Velocity tags which will be replaced by plain HTML code -- this template is a simple HTML page, and thus all the fancy layout techniques required to make a page visually attractive can be employed in the usual way by a web designer, if necessary.

Running URLPublish with the template described above and some simple test data results in this output web page:

<html>
<body>
 <p><b>Companies</b> </p>
 <ul>
   <li> <a href="http://www.sun.com"> Sun Microsystems Inc. </a> </li>
 </ul> <p>
 <p><b>Programming Languages</b> </p>
 <ul>
   <li> <a href="/j2se"> J2SE home page </a> </li>
   <li> <a href="/index.jsp"> Sun's main Java page </a> </li>
 </ul> <p>
 <p><b>Sun Microsystems</b> </p>
 <ul>
   <li> <a href="/index.jsp"> Sun's main Java page </a> </li>
 </ul> <p>
</body>
</html>

Using URLPublish

You invoke URLPublish with this command:

  java -classpath $CP ml.urlmgr.URLPublish -d -r -v db.properties web.vm page.html

The optional -v flag enables verbose output, whereas -d enables sorting of the links by creation date (the default is to sort links by their description text). The optional -r flag enables reverse sorting (that is, it toggles between ascending and descending sort order). The database properties file is again required to access the database, and the Velocity template file to use is the second argument (here: web.vm). The output file to create (here: page.html) completes the set of arguments.

As with the other tools, you can obtain a complete usage description by invoking URLPublish without (or with the wrong number of) arguments.

Internally, URLPublish uses a PublishManager helper class to assemble the data structures holding the link and context data with the input taken from the database. These data structures are then merged into the template through the VelocityContext.

It's simple to create several web pages for different topic areas: the contexts to be included on a web page are selected through the template, so you create several templates, each containing only the contexts deemed suitable to the topic covered by the individual web page. The command described above can then be used for each template (with a different output file, of course) to create the set of HTML pages. The appropriate data to include in the pages is automatically selected by Velocity. An alternative approach would be to store the different data sets in separate database schemata and use a simple default template for all of them.

Future Directions

The set of tools contained in the URLManager bundle is fairly complete for handling the tasks it was designed for. One really nice-to-have feature would be a web GUI component to control the tools from within a browser; for example, the user could enter new link data using a standard HTML form. Such a web user interface could be designed using one of the popular frameworks like Struts or JavaServer Faces (JSF), and, in fact, the class structure of the URLManager bundle has been designed with other applications using it in mind. It should be straightforward to implement such a solution. Other than that, CheckManagers for HTTPS and FTP would also be nice to have.

About the Author

Dr. Matthias Laux is a senior engineer working in the Global SAP-Sun Competence Center in Walldorf, Germany. His main interests are Java and J2EE technology, architecture, and programming, web services and XML technology in general, databases, and performance and benchmarking. Although he also has a background in aerospace engineering and HPC/parallel programming, today his languages of choice are Java and Perl.

See Also

Download the URLManager package
The Velocity Template Engine
RFC 640: Revised FTP Reply Codes
RFC 2060: Internet Message Access Protocol
RFC 2251: Lightweight Directory Access Protocol (v3)
RFC 2616: Hypertext Transfer Protocol
DBAccessor - A JDBC Wrapper Package
Validating URL Links

Java, J2EE, J2SE, J2ME, and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.