's Web Crawler: A Massively Scaled Java Technology Application

By Janice J. Heiss, August 25, 2005  

Articles Index's innovative shopping search engine has done what many in the search engine community thought impossible: The company has successfully created a Java technology web crawler that may be the most sophisticated, massively scaled Java technology application in existence, obtaining information on over 3 billion web pages and writing well over 8 terabytes of data (and growing) on 30 fully distributed servers in seven days.

"We made the radical decision to implement a crawler using Java technology. No one believed it was possible, but we were able to build the prototype crawler in three months using two developers, which was a major achievement."
Yeogirl Yun
CTO, Chairman, and Cofounder,'s decision to deploy Java technology followed the experience of the company's CTO, chairman, and cofounder, Yeogirl Yun, at, where Wisenut spent a year creating a C++ web crawler that had significant memory and threading problems.

"We needed to do it faster this time," observes Yun. "So we made the radical decision to implement a crawler using Java technology. No one believed it was possible, but we were able to build the prototype crawler in three months using two developers, which was a major achievement. The built-in network library, multithreading framework, and RMI [remote method invocation] saved a lot of development time. The performance is pretty good. My experience with Wisenut made it clear that managing uncertain data on the web is a big challenge. There can be memory leaks and threading issues. We're very pleased with the performance of the Java platform."

After writing the first crawler -- Crawler A -- entirely in the Java programming language, the company completed a second -- Crawler B -- by writing the fetcher in the Java language and the controller in C++. The fetcher does I/O and gathers, parses, and analyzes the content of web pages. It extracts links and sends data to the controller, which manages data structures and records data to disks. Fetchers communicate with controllers but not with each other.

Each crawler uses 200 Mbps lines and writes more than 8 terabytes of data. One run takes roughly a week. Crawler A was written first, starting in June 2004. Crawler B was begun in November 2004. Both crawlers are pure Java software, with no Java Native Interface (JNI). Crawlers share some packages for content analysis, and all content-analysis software used during the crawl is written entirely in the Java language.

The Affinity Index Ranking's patented Affinity Index Ranking (AIR) algorithm provides highly targeted search results by understanding the context of pages on the web. AIR integrates advanced concepts from applied physics and engineering dynamics that will eventually make public.

AIR identifies exceptional web pages by assessing the level of interconnection between valuable sites from within specific fields of interest. In other words, AIR evaluates a web page based on what other "knowledgeable" sites in a specific field "say" about it and on what the page "says" about other "knowledgeable" sites in the specific field. This contrasts with Google's PageRank, which estimates the popularity of a given web page by looking only at links into the page and ignoring both the page's outlinks and context.

"By obtaining the standing of a web page within a given topic,'s AIR technology is able to provide exceptionally focused search results for users," says Yun.

Building the Web Index began in 2004 with the mission of helping consumers make ideal buying decisions.'s founders recognized that general-purpose search engines frequently provided search results consisting of information unrelated to shopping. Price comparison sites failed to provide comprehensive research information such as buying guides, articles, forums, and expert reviews.'s innovative approach to search engine shopping technology addresses these problems.

Founded by Michael Yang and Yeogirl Yun, who created the online comparison-shopping site mySimon, launched the beta version of its comparison-shopping service on July 19, 2005, and will be adding services later in 2005. They are eager to receive user feedback.'s crawlers build a web index, a searchable database, roughly every two weeks, based on what the crawler discovers on the web. The crawl controller first looks at seed pages wherein it finds links that identify more pages on the web, in a search for valuable shopping-related information. Next, the fetcher, which itself stores no information, classifies information by running several checks on every page it locates. It looks for page type and language and filters out duplicates or spam. It identifies links, buying guides, expert reviews, forums, articles, and other relevant materials. Then it sends information back to the crawl controller, which guides the crawl. Once the process is finished, it forms a database of all pages visited, in order by URL. Although searches are currently limited to English, the crawler is constructed so that it can scale easily to other languages.

The gathered information then goes to an "inverted" index, currently of 3.2 billion web pages, in order not by URLs but by keywords. Finally, the index is fine-tuned to both expert feedback from the research team and page-value connectivity analysis, which notes the frequency with which other pages on the same topic link to a page. The crawler takes about a week to complete its task. Finally, all of this information goes into the next crawl.

C++ is used to build the index, which is highly CPU-intensive. The crawler, machine learning, classification, and spell-checker use Java technology. initially stored the information in an SQL database, which required replacement because it lacked the requisite performance characteristics.'s new database, which uses an internal format and includes APIs for both the Java language and C++, provides far better throughput than did SQL.

JFreeChart, a free open source charting package that takes advantage of Java Foundation Classes/Swing (JFC/Swing) and Java 2D technology, is used to talk to the controllers, with primitive interactive abilities. Crawler A, the pure Java technology crawler, has 39,000 lines of Java code functioning on 40 to 50 machines, with a total of 160 to 180 GB of memory, with roughly 5000 threads. The controllers are RMI servers, while the fetchers are RMI clients. Controllers have a "state," and if one stops, human intervention is required to restart the crawler from the checkpoint. Fetchers are semiautonomous and can crash and restart. JVM* errors have been rare.

Developing's Java Technology Crawlers
Figure 1: Crawler A's developers, Adrian King and Bart Niechwiej, made use of both built-in and free third- party Java technologies libraries to speed up development. "We spent zero time debugging low-level memory problems," explains King, the developer of Crawler A, which he wrote entirely in the Java language.

"We made use of Java 2 Platform, Standard Edition 5.0 [J2SE 5.0] features from the start, relying on different releases, and had fewer problems than we expected. Generics made it easier to organize code and make it readable, and we made extensive use of blocking queue classes to move the work flow from one thread to the next because the pages arrive asynchronously. The RMI model was simple to use and informs us when a call has successfully executed. Garbage collection was good. The fetchers have a lot of asynchronous behavior, which could have made memory management difficult. The Javadoc tool helped a lot. It's good to have a standard way for everyone to understand and document code."

Although creating the prototype for Crawler A took a couple of months, approximately six months passed before the crawler had the desired degree of stability.

Challenges of Crawling

Creating the Java technology crawler was not without its challenges. "Early on, we found that fetching processes crashed fairly frequently," explains King. "Some of that apparently had to do with problems in the prerelease version of JDK 5.0, which have been subsequently fixed in production releases. ZLIB compression was necessary because data wouldn't fit on disk uncompressed. There appeared to be native memory leaks in the built-in compression routines in the prerelease version, which disappeared in the production release."

"We made use of Java 2 Platform, Standard Edition 5.0 features from the start and had fewer problems than we expected. Generics made it easier to organize code and make it readable, and we made extensive use of blocking queue classes."
Adrian King
Software Developer,

King continued: "JVM errors were rare but at times problematic, because when they occur in the controllers, they require manual intervention to restart the crawl. We filed one bug regarding controller JVM crashes and were told by Sun to send source code, which was impractical because Sun would have had to run it for a month on 40 machines to have a chance of observing the problem. But this problem only happened twice in 210 days, which is pretty good stability."

The developers successfully worked around fetcher crashes by using Perl script to restart any fetcher process that died. is running all Intel machines on Linux. Hardware and operating system problems have occurred more frequently than have JVM problems. Controllers run on SuSE 9.1, while fetchers run on White Box 3.'s developers have encountered some problems in, which occasionally misparses URLs and divides them into parts incorrectly. This is a time-critical part of the application that's developers have rewritten both to improve performance and prevent accidentally fetching the same page twice under slightly different names. Java.swing.html also caused some problems because it lacked a robust enough parser to deal with HTML.

King had one request: "Most of our RMI calls send data structures that have no loops. It would be nice if RMI let us tell it that a structure being sent has no loops." The logic behind this request is as follows: The RMI mechanism can save time if it doesn't have to check whether it has already included a particular object in the structure it builds to send to the target process. If a data structure has no loops, the same object cannot appear more than once in the structure.

Figure 2: Crawler B

In developing Crawler B, Bart Niechwiej tried out the java.nio library (NIO) and got better performance than with a multithreaded version. Unfortunately, some classes -- such as URL -- did not support the NIO, so he implemented a URL connection. He used Tomcat for his statistics server and required 20 GB of memory for fetchers, which ran on 10 separate 32-bit machines of 2 GB each.

Java Architecture for XML Binding (JAXB) was good for configuration files stored in XML. Niechwiej made use of 19,000 lines of Java technology code and 20,000 of JAXB-generated code. He wrote the HTTP client-HTTP parser from scratch because HttpURLConnection lacked support for NIO. He designed a communication protocol between the fetcher and controller based on TCP protocol.

Only 5 to 10 machines are used for fetchers, each with 4 GB of memory. This has resulted in good stability and scalability. There were no problems with JVM crashes.

Niechwiej also used Java technology in another project to facilitate's comparison-shopping service. In this project, a server designed to manage a large number of product images, he made use of Eclipse, IntelliJ, and ultimately, the NetBeans IDE 4.1, which he found "extremely fast compared to Eclipse."

Looking Ahead

As enhances its offerings, the company will continue to rely on the Java platform, particularly in areas where rapid development is of greater concern than precise control of memory. Java technology has played a significant role in building the comparison-shopping service, which launched in July 2005. And as scales its web index in both U.S. and international markets, Java technology will continue to figure prominently in its development plans.

For More Information

The terms 'Java Virtual Machine' and 'JVM' mean a Virtual Machine for the Java(TM) platform.

Rate and Review
Tell us what you think of the content of this page.
Excellent   Good   Fair   Poor  
Your email address (no reply is possible without an address):
Sun Privacy Policy

Note: We are not able to respond to all submitted comments.