Frequent users of the Internet face a common problem: there are so many interesting links which they not only would like to come back to at some point, but which are also interesting to others -- for example, colleagues who share a corporate intranet, and who are interested in similar areas. Browser bookmarks are suitable for storing only small numbers of such links, since they are only visible on one particular machine (unless the browser configuration data is shared via some mechanism like NFS), and are not shareable with other users of the network.
So sooner or later, many of us create simple HTML pages with our favorite links, typically organized by main area and subtopic, and often including a personal note in the design to make them even more interesting to other users.
But the web is a very dynamic medium, and links are bound to be migrated to some other location, or vanish entirely for any number of reasons. So these web pages have to be maintained. This process, however, is almost impossible to handle manually, since you'd have to click on each link, see what happens, and update the HTML code manually depending on the response received. With hundreds of links in a typical collection, this is clearly not practical. The lack of a solution to this problem sooner or later leads to frustrating user experiences because of many broken links. If the web page creator is lucky, s/he may get a friendly email from someone who detected such a broken link; in the more unlucky cases, the emails are not friendly, or users simply do not return to that collection, thereby making it essentially obsolete.
But there is a solution. I have developed three tools to store link information in a database, allowing for automated link validity checks (including a corresponding update of the database). The tools also provide a powerful mechanism to automatically update the web pages that contain the link collections, using the open-source Velocity template engine.
(To give credit where credit is due: I got the idea to create this set of tools after reading John Zukowski's article "Validating URL Links," in which he describes a technique to check whether an HTTP URL is still available.)
The first design decision I made was to store the link data in a database to provide a clean structure for the information, and to take advantage of all the usual benefits of an RDBMS (such as transaction or backup support).
The next step was to define the database schema. The information we want to store for a link consists of the actual URL, description text, and some additional information: the date this entry was created, the date it was last checked, and the response code obtained during this check. A link can be known in one or more contexts, where a context is basically a topic area to which this link belongs.
Here's an example to shed more light on my approach. We have a link with the URL "http://java.sun.com" and the description "Sun's main Java entry page" (plus the additional information described above). In a link collection, this link could be known in the context of "Programming Languages," but also in the context of "Sun Microsystems." Looking at the actual web pages created, this could result in the link appearing in two different chapters on that page, or on different pages -- all depending on the output template(s) chosen (details on this will follow later).
This data structure is represented by two tables in the database. Their schema is described by an XML file (schema.xml), and it has the following properties:
Note that the column names shown in bold form the primary key for the respective tables, indicating which entries are unique.
An interesting detail is that the column lengths can be modified, if necessary, in the XML file above (which is part of the URLManager package) before you actually create the database schema. The DBAccessor package is used internally to transform this XML schema description into SQL statements, which are then used to create these tables in the database using the URLManage tool. When creating or updating data in these tables, the chosen lengths of these fields are retrieved directly from the database (again using DBAccessor capabilities). This information is then used to check whether or not the link and context data provided by the user fits into the space provided in the database tables. Note that once the field lengths have been chosen and the schema created, these lengths cannot be modified using the URLManager package.
The two core data components -- links and contexts -- are modeled with the classes Link and Context, both of which are subclasses of DBAccessor's RowData class. This class provides built-in persistence support via the insert(), update(), and delete() methods, which automatically create and issue the corresponding SQL statements to perform these actions in the database.

The Context class is actually very simple, holding just two string-valued properties -- for the URL and the name of the context. To create this class, only a few convenience extensions to the services provided by RowData were implemented; all other required capabilities (such as persistence and setter/getter methods) are provided by the superclass.

The Link class is slightly more complex, since it holds references to all of the contexts defined for a link in a hash map. For example, to insert a link into the database, the insert() method of the Link class first calls the insert() method of its superclass, and then calls insert() for all of its contexts. DBAccessor offers another convenience bulk method here (exportTableData()), which performs the insert step for the contexts under the covers.
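The cascading insert described above can be sketched in plain Java. This is not the actual URLManager code -- RowData here is a simplified stand-in that merely records which SQL statements would be issued:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for the DBAccessor/URLManager classes described in
// the text; the real RowData.insert() issues an SQL INSERT statement.
class RowData {
    static final List<String> issued = new ArrayList<>(); // records "SQL" for the demo
    private final String table;
    RowData(String table) { this.table = table; }
    public void insert() { issued.add("INSERT INTO " + table); }
}

class Context extends RowData {
    final String url, name;
    Context(String url, String name) { super("CONTEXTS"); this.url = url; this.name = name; }
}

class Link extends RowData {
    final String url;
    final Map<String, Context> contexts = new HashMap<>(); // contexts keyed by name
    Link(String url) { super("LINKS"); this.url = url; }
    void addContext(String name) { contexts.put(name, new Context(url, name)); }

    // Insert the link row first, then one row per context.
    @Override
    public void insert() {
        super.insert();
        for (Context c : contexts.values()) c.insert();
    }
}

public class CascadeDemo {
    public static void main(String[] args) {
        Link link = new Link("http://java.sun.com");
        link.addContext("Programming Languages");
        link.addContext("Sun Microsystems");
        link.insert(); // issues one LINKS insert and two CONTEXTS inserts
        System.out.println(RowData.issued.size()); // 3
    }
}
```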
URLManage is a simple Java application whose main task is to first validate the command line options provided by the user, and then issue the corresponding commands to the database. This database interaction is handled by the PersistenceManager class, an instance of which is created within URLManage. PersistenceManager offers all the capabilities needed to manage links and contexts (creation, deletion, updates, and selection of individual entries or lists of entries). Other capabilities include bulk import of data from input files (useful for importing existing link collections) and database schema management (creation and deletion of the database tables used to hold links and contexts). Since PersistenceManager provides all the persistence capabilities required, it could also be instantiated and used by other tools, such as a graphical front end (rich client or web UI), instead of the URLManage command line tool.
Before you use the URLManager package, an important first step is to create the database schema required to hold the link information. URLManage also supports schema management, and the database tables can be created (or deleted, should this ever be necessary) using these commands:
java -classpath $CP ml.urlmgr.URLManage -init db.properties
java -classpath $CP ml.urlmgr.URLManage -drop db.properties
The property file db.properties contains the description of the database connection parameters, as required by the DBAccessor package. This is a Java property file, that is, a text file with key/value pairs, where key and value are separated by an equals sign. A typical file might look like this:
HOST = host.company.com   # Database host
PORT = 3306               # JDBC port
TYPE = msc                # DB type (see DBAccessor API docs for details
                          #   - these are included with the URLManager download)
NAME = urlmgr             # The name/id of the database
USER = myuser             # DB user holding the URLManager tables
PASS = passcode           # Password for this user
This property file is passed on to DBAccessor to set up the database connectivity, and no further data is required, as long as the database is one of the types currently supported by DBAccessor's default configuration. In the example above, msc indicates a MySQL database. Other supported types are Oracle, DB2, Cloudscape, and PostgreSQL. Naturally, DBAccessor also offers additional capabilities to configure support for other database types.
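Reading such a file needs nothing beyond the standard java.util.Properties API. A minimal sketch follows (the keys are taken from the example above; note that java.util.Properties does not strip trailing # comments from a value, so real comments are safest on their own lines):

```java
import java.io.StringReader;
import java.util.Properties;

public class PropsDemo {
    public static void main(String[] args) throws Exception {
        // A minimal db.properties-style snippet, with the comment on its own line.
        String text =
              "# Database host\n"
            + "HOST = host.company.com\n"
            + "PORT = 3306\n"
            + "TYPE = msc\n";

        Properties props = new Properties();
        props.load(new StringReader(text));
        System.out.println(props.getProperty("HOST")); // host.company.com
        System.out.println(props.getProperty("TYPE")); // msc
    }
}
```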
Once the database schema has been established, a typical usage example of URLManage would look like this:
java -classpath $CP ml.urlmgr.URLManage -v -c db.properties "http://java.sun.com" "Sun's main Java entry page" "Programming Languages"
Here, the (optional) flag -v enables verbose output, and -c selects link creation. The string parameters are pretty much self-explanatory and correspond to the example described above: the first argument is the actual HTTP URL, the second argument is the description for this URL, and the third argument is the context in which this URL is to be known.
If you wanted to add the additional context "Sun Microsystems" for this link, the command would be:
java -classpath $CP ml.urlmgr.URLManage -ac db.properties "http://java.sun.com" "Sun Microsystems"
When migrating link collections to the URLManager package, the facilities provided by URLManage for bulk imports come in handy:
java -classpath $CP ml.urlmgr.URLManage -bc db.properties data.link
This command would import all the link data provided in the text file data.link. This is much more efficient than importing many links using separate URLManage invocations: only one Java VM process needs to be created, and it can insert all the data into the database using a single JDBC connection, as opposed to creating a new VM process and database connection for each link.
Here is a typical input file for bulk link data creation:
http://www.sun.com
Sun Microsystems home page
Computer companies
http://www.sap.com
SAP AG home page
Computer companies
http://www.google.com
Google - a cool search engine for the WWW
Search Engines
...
The three string parameters required to create a link in the database are provided on a separate line each: URL, description, and context.
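A reader for this three-line-per-record format is straightforward. The following sketch is not the actual URLManage implementation, just an illustration of how the format parses:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class BulkParseDemo {
    // One record = three consecutive lines: URL, description, context.
    record LinkRecord(String url, String description, String context) {}

    static List<LinkRecord> parse(BufferedReader in) throws Exception {
        List<LinkRecord> records = new ArrayList<>();
        String url;
        while ((url = in.readLine()) != null && !url.isBlank()) {
            records.add(new LinkRecord(url, in.readLine(), in.readLine()));
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        String data = "http://www.sun.com\n"
                    + "Sun Microsystems home page\n"
                    + "Computer companies\n";
        List<LinkRecord> records = parse(new BufferedReader(new StringReader(data)));
        System.out.println(records.size());           // 1
        System.out.println(records.get(0).context()); // Computer companies
    }
}
```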
Since link collections typically refer to some links in different contexts, URLManage also provides a bulk import method for additional contexts:
java -classpath $CP ml.urlmgr.URLManage -bac db.properties data.context
An example bulk context data file might look like this:

http://www.sun.com
Leading UNIX Vendors
The format is quite simple, expecting one line for the URL (which is unique in the database), followed by one line for the additional context for this URL.
You can create the text files for bulk data import from existing link collections using tools like Perl or the Java regular expression package (available as of JDK 1.4), and then import them into the database used by the URLManager package.
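For example, java.util.regex can pull the URL and description out of an existing HTML link page. The pattern below is deliberately naive (good enough only for uniformly formatted pages), and the context string is an assumption you would supply by hand:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractDemo {
    public static void main(String[] args) {
        // A fragment of an existing hand-maintained link page.
        String html = "<li><a href=\"http://www.sun.com\">Sun Microsystems home page</a></li>";

        // Capture group 1 = URL, group 2 = anchor text (the description).
        Pattern p = Pattern.compile("<a href=\"([^\"]+)\">([^<]+)</a>");
        Matcher m = p.matcher(html);
        while (m.find()) {
            // Emit the three-line bulk format: URL, description, context.
            System.out.println(m.group(1));
            System.out.println(m.group(2));
            System.out.println("Computer companies");
        }
    }
}
```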
You can obtain the complete usage description for the tool by invoking it without (or with an illegal number of) arguments. The description is also contained in the file doc/Manage.usage, which is part of the distribution.
Now that we have stored all the required data in the database, the next step is to provide a tool to check all of these links against the network and to take appropriate action, depending on the outcome of this check. This is what URLCheck was developed for.
CheckManager is an abstract class that allows a specific protocol (HTTP, HTTPS, FTP, LDAP, IMAP, and so on) to check whether the resource identified by a link is still available on the network. The methods required of every CheckManager implementation are:
public abstract CheckResult check(Link link) throws URLManagerException;
public abstract boolean update(Link link, CheckResult result,
PersistenceManager persistence) throws URLManagerException;
public abstract void init(java.util.Properties config, boolean verbose,
boolean update) throws URLManagerException;
Apart from the check() method, which performs the actual availability test, the init() method is required to transfer configuration data to an actual instance, whereas the update() method implements the updates in the database, depending on the result of a check. It is assumed here that the different protocols (such as HTTP or FTP) return an integer-valued response code, which is encapsulated in a CheckResult instance. This helper class holds the response code, but can also hold any number of additional properties (via a generic mechanism based on a java.util.Properties member variable), as required by the specific protocol checked. For HTTP, as an example, CheckResult also holds the value of the Location header in the HTTP response, which is required to properly handle HTTP redirect responses.
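The shape of such a helper class can be sketched as follows. The field and method names here are assumptions; only the integer response code and the java.util.Properties-based extras come from the description above:

```java
import java.util.Properties;

public class CheckResultDemo {
    // Stand-in for the CheckResult helper: an integer response code plus
    // arbitrary protocol-specific properties.
    static class CheckResult {
        private final int code;
        private final Properties extras = new Properties();
        CheckResult(int code) { this.code = code; }
        int getCode() { return code; }
        void setProperty(String key, String value) { extras.setProperty(key, value); }
        String getProperty(String key) { return extras.getProperty(key); }
    }

    public static void main(String[] args) {
        // An HTTP redirect: code 301 plus the Location header.
        CheckResult result = new CheckResult(301);
        result.setProperty("Location", "http://java.sun.com/new-home");
        System.out.println(result.getCode());               // 301
        System.out.println(result.getProperty("Location")); // http://java.sun.com/new-home
    }
}
```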
All of this can vary, depending on the protocol to be checked. And (even though, admittedly, most of the links encountered in link collections probably use either HTTP or HTTPS) URLCheck was designed to allow for the inclusion of any protocol, provided a CheckManager is implemented for it. You can implement additional CheckManager subclasses and add them to the URLCheck tool without any code changes, simply by specifying a corresponding property file which contains all the required parameters for such a protocol, especially the name of the class implementing this protocol's CheckManager.
For HTTP, this file could look like:
MANAGER = ml.urlmgr.HttpCheckManager
HTTP_PROXY = webcache.germany.sun.com
HTTP_PORT = 8080
UPDATE = 301|303
REMOVE = 404|410|500|505
(These properties are actually passed to the init() method as the first argument.)

MANAGER is the only mandatory property in all protocol properties files; it specifies the name of the class implementing the CheckManager for the protocol. All other configuration parameters specified in these property files are completely dependent on the CheckManager implementation used for a specific protocol. In the example above, the other configuration parameters for the HTTP CheckManager include the proxy configuration (if required) and, optionally, lists of HTTP response codes for which the link data needs to be updated in the database due to redirections (UPDATE) or removals (REMOVE) -- for example, due to the much-dreaded HTTP 404 response ("Not Found"). These response codes are specified in RFC 2616, and the approach chosen here allows for very flexible handling of update/remove actions, depending on the user's requirements for the different response codes.
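Parsing such '|'-separated code lists and mapping a response code to an action takes only a few lines. This sketch is not the HttpCheckManager code, just an illustration of the configured policy:

```java
import java.util.HashSet;
import java.util.Set;

public class ResponsePolicyDemo {
    // Parse a '|'-separated code list such as UPDATE = 301|303.
    static Set<Integer> parseCodes(String value) {
        Set<Integer> codes = new HashSet<>();
        for (String s : value.split("\\|")) codes.add(Integer.parseInt(s.trim()));
        return codes;
    }

    public static void main(String[] args) {
        Set<Integer> update = parseCodes("301|303");
        Set<Integer> remove = parseCodes("404|410|500|505");
        int code = 404; // response received for a checked link
        String action = update.contains(code) ? "update"
                      : remove.contains(code) ? "remove" : "keep";
        System.out.println(action); // remove
    }
}
```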
You define the different CheckManagers to be used within URLCheck (and thus the different protocols supported for checking) in a property file specified as a command line argument to URLCheck. An example of such a file would be:

http = config/http.properties

which basically means that the properties for the HTTP protocol are specified in the given property file. Any number of protocols can be handled in this way, where the names of the protocols are those returned for the stored URLs by java.net.URL's getProtocol() method.
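One natural way to obtain the protocol name for a stored link is the standard java.net.URL API (whether URLCheck does exactly this is an assumption):

```java
import java.net.URL;

public class ProtocolDemo {
    public static void main(String[] args) throws Exception {
        // No network access happens here; the URLs are only parsed.
        URL http = new URL("http://java.sun.com");
        System.out.println(http.getProtocol()); // http

        URL ftp = new URL("ftp://server.acme.com");
        System.out.println(ftp.getProtocol()); // ftp
    }
}
```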
CheckManager also provides some basic services deemed useful for all subclasses, the most important of which is the management of protocol response codes. As mentioned before, all protocols are expected to return some integer-valued response code. These are typically specified in RFCs, such as RFC 2616 for HTTP.
CheckManager offers the following methods to support generic response code management:
public void addCode(int code, String description)
public int getMinCode()
public int getMaxCode()
public String getCodeText(int code)
The idea here is that CheckManager subclasses define the response codes for the protocol they handle in their init() method, using these methods.
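A possible implementation of this registry uses a sorted map. The TreeMap storage here is an assumption; only the four method signatures come from the text:

```java
import java.util.TreeMap;

public class CodeRegistryDemo {
    // Response codes mapped to their textual descriptions, kept sorted
    // so that the min/max queries are trivial.
    private final TreeMap<Integer, String> codes = new TreeMap<>();

    public void addCode(int code, String description) { codes.put(code, description); }
    public int getMinCode() { return codes.firstKey(); }
    public int getMaxCode() { return codes.lastKey(); }
    public String getCodeText(int code) { return codes.get(code); }

    public static void main(String[] args) {
        CodeRegistryDemo registry = new CodeRegistryDemo();
        registry.addCode(200, "OK");        // codes as defined in RFC 2616
        registry.addCode(404, "Not Found");
        System.out.println(registry.getMinCode());     // 200
        System.out.println(registry.getCodeText(404)); // Not Found
    }
}
```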
The following code invokes a complete check of links using URLCheck:
java -classpath $CP ml.urlmgr.URLCheck -u -v db.properties web.properties
The optional flag -v enables verbose output. The optional flag -u causes the specified update/remove actions to actually be executed (without this flag, the links are only checked, and no changes are made to the database). The database properties file is again required to access the database, and the web properties file is the master properties file described above, which contains references to the properties files for the individual supported protocols.
To obtain a complete usage description, you invoke URLCheck without (or with the wrong number of) arguments. URLCheck uses the following operation sequence:

First, a CheckManager instance is created for each of the protocols specified, and the init() method is called for these instances with the protocol-specific properties as an argument. These CheckManager instances are stored in another helper class, ProtocolHandler. This class also holds integer counters which are used to collect statistics on the responses received for a protocol during the checks; the statistics are printed after all links have been checked.

Next, a PersistenceManager is created using the database properties file specified on the command line. This instance is then used to retrieve all the links from the database.

The CheckManager.check() method is then called for each link, provided that such an instance exists for the link's protocol (if not, a warning message is printed). Statistics are collected based on the CheckResult object received.

Finally, if updates are enabled, the update() method is called. It is then that instance's responsibility to effect any changes required in the database, depending on the settings provided in the properties file for this protocol.
Currently, the only CheckManager actually implemented is based on the HTTP protocol, covering the most important case. Additional CheckManagers are fairly easy to implement, since they can extend the abstract CheckManager class and use its services. The following issues need to be addressed before you develop a new implementation class: defining the protocol's response codes, handling the configuration passed in the java.util.Properties argument to the init() method, and registering the new CheckManager by providing the corresponding properties file. The HttpCheckManager class can serve as an example of how this can be done.
One additional complication when working with non-HTTP protocols is the use of proxy servers -- for example, when accessing the WWW from within a corporate network. Since both the client's browser and the proxy server forwarding the request use HTTP as their protocol, the HTTP response codes are visible on the client side. Other protocols, however, are wrapped within an HTTP request between the client and the proxy, and only the proxy then uses the actually chosen (non-HTTP) protocol to access the network resource. One such example is FTP: an FTP request of the form ftp://server.acme.com is sent to the proxy as an HTTP GET request for the URL ftp://server.acme.com. The proxy then accesses that FTP server directly by connecting to port 21 (the default FTP port). The response back to the client browser is again wrapped in a message transferred over HTTP. The problem is to identify the actual response code of the FTP server, since it, too, will be wrapped within an HTML message transferred over HTTP. This is something a CheckManager implementation for FTP would need to address.
The final step in the process of keeping link collections up-to-date is to recreate the web pages based on the data that has been checked (and possibly updated) by URLCheck. The Java-based Velocity template engine is a very convenient tool to achieve this goal, with only a few lines of additional code required.
The approach Velocity uses is very simple, yet powerful: you write a template containing references, and create a context via the VelocityContext class. Using this context, Java objects are assigned to the reference names used in the template. Velocity provides very powerful capabilities (based on Java reflection) to figure out what such a Java object can do. This covers, for example, automatically identifying property getter methods for JavaBeans, or providing iteration over objects based on the Java Collections API with a very simple syntax.
One additional benefit of Velocity is that it directly supports the MVC approach by letting the web designer focus on the View (the template), and the application programmer focus on the Model (in our case, the URLManage tool and the database) and the Controller (URLPublish).
Here is an example of a Velocity template that would create a web page with all the links in the database, grouped by context:
#macro( list $context )
<p><b>$context</b> </p>
<ul>
#foreach( $link in $links.get($context) )
<li> <a href="$link.url"> $link.Description </a> </li>
#end
</ul>
<p>
#end

<html>
<body>
#foreach( $context in $contexts )
#list($context)
#end
</body>
</html>
$links and $contexts are names which are linked with Java objects using the VelocityContext in URLPublish:

PublishManager manager = new PublishManager(...);
VelocityContext context = new VelocityContext();
context.put("links", manager.getMap());
context.put("contexts", manager.getContexts());
$links references a java.util.HashMap that maps context names to instances of java.util.TreeSet; each TreeSet instance holds all the Link objects for that context. $contexts represents a java.util.TreeSet instance holding just the context names. Note that TreeSets provide sorting capabilities: while the natural sort order is used for the context names, a custom comparator has been implemented to sort links according to various criteria. Currently, you can sort by creation date and by link description, selected via URLPublish command line flags.
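The sorting behavior can be illustrated with a TreeSet and a custom Comparator. The Link record here is a simplified stand-in for the real class:

```java
import java.util.Comparator;
import java.util.TreeSet;

public class SortDemo {
    // Simplified stand-in for the Link class, with a creation timestamp.
    record Link(String url, String description, long created) {}

    public static void main(String[] args) {
        // Sorting by description text corresponds to the URLPublish default;
        // comparing on created() would model the -d flag, and .reversed()
        // the -r flag.
        Comparator<Link> byDescription = Comparator.comparing(Link::description);
        TreeSet<Link> links = new TreeSet<>(byDescription);
        links.add(new Link("http://www.sap.com", "SAP AG home page", 2L));
        links.add(new Link("http://java.sun.com", "Sun's main Java page", 1L));
        System.out.println(links.first().description()); // SAP AG home page
    }
}
```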
In the example above, the list macro takes the name of a context as its argument ($context) and uses it to first print a header line with the context name, and then create an HTML list with all the links available for this context. Velocity automatically determines that $links.get($context) is a java.util.Collection and iterates over it. You can access the link data using an abbreviated syntax ($link.url), which Velocity translates into a call to the Link.getUrl() method. Note that -- apart from the Velocity tags, which will be replaced by plain HTML code -- this template is a simple HTML page, and thus all the fancy layout techniques required to make a page visually attractive can be employed in the usual way by a web designer, if necessary.
Running URLPublish with the template described above and some simple test data results in this output web page:
<html>
<body>
<p><b>Companies</b> </p>
<ul>
<li> <a href="http://www.sun.com"> Sun Microsystems Inc. </a> </li>
</ul>
<p>
<p><b>Programming Languages</b> </p>
<ul>
<li> <a href="/j2se"> J2SE home page </a> </li>
<li> <a href="/index.jsp"> Sun's main Java page </a> </li>
</ul>
<p>
<p><b>Sun Microsystems</b> </p>
<ul>
<li> <a href="/index.jsp"> Sun's main Java page </a> </li>
</ul>
<p>
</body>
</html>
You invoke URLPublish with this command:
java -classpath $CP ml.urlmgr.URLPublish -d -r -v db.properties web.vm page.html
The -v flag enables verbose output, whereas -d enables sorting of the links by creation date (the default is to sort links by their description text). The optional -r flag enables reverse sorting (that is, it toggles between ascending and descending sort order). The database properties file is again required to access the database, the Velocity template file to use is the second argument (here: web.vm), and the output file to create (here: page.html) completes the set of arguments.
As with the other tools, you can obtain a complete usage description by invoking URLPublish without (or with the wrong number of) arguments.
Internally, URLPublish uses a PublishManager helper class to assemble the data structures holding the link and context data taken from the database. These data structures are then merged into the template through the VelocityContext described above.
It's simple to create several web pages for different topic areas: the contexts to be included on a web page are selected through the template, so you create several templates, each containing only the contexts deemed suitable to the topic covered by the individual web page. The command described above can then be used for each template (with a different output file, of course) to create the set of HTML pages. The appropriate data to include in the pages is automatically selected by Velocity. An alternative approach would be to store the different data sets in separate database schemata and use a simple default template for all of them.
The set of tools contained in the URLManager bundle is fairly complete for the tasks it was designed for. One really nice-to-have feature would be a web GUI component to control the tools from within a browser; for example, the user could enter new link data using a standard HTML form. Such a web user interface could be designed using one of the popular frameworks like Struts or JavaServer Faces (JSF), and, in fact, the class structure of the URLManager bundle has been designed with such applications in mind, so it should be straightforward to implement one. Other than that, CheckManagers for HTTPS and FTP would also be nice to have.
Dr. Matthias Laux is a senior engineer working in the Global SAP-Sun Competence Center in Walldorf, Germany. His main interests are Java and J2EE technology, architecture, and programming, web services and XML technology in general, databases, and performance and benchmarking. Although he also has a background in aerospace engineering and HPC/parallel programming, today his languages of choice are Java and Perl.
Java, J2EE, J2SE, J2ME, and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.