BioJava -- Java Technology Powers Toolkit for Deciphering Genomic Codes

by Steven Meloan

(June 2004)

The 2003 completion of the Human Genome Project marked a true milestone in the history of the human race--a 13-year, global enterprise to map the entirety of our genetic blueprint. But the real work of this reverse engineering of our genetic makeup has only just begun. The next step is one of meaningfully interpreting this massive volume of data--a task that amounts to deciphering a text of three billion characters, in a language that's only passably understood even by molecular biologists. Such analyses present bioinformatics developers with computational tasks on scales, and at levels of complexity, rarely seen before.

Using the scalable, cross-platform, network-aware power of Java technology, researchers at Great Britain's famed Sanger Institute for genetic study have spawned BioJava--an open-source project dedicated to providing genomic researchers with a Java technology-based developer's toolkit. BioJava offers bioinformatics developers over 1200 classes and interfaces for manipulating genomic sequences, file parsing, CORBA interoperability, and more. The facility is already being used at major research and pharmaceutical centers, and in over 85 countries around the world.

DNA and RNA provide the genetic "code" for all living things. Unlike the binary code of computers, DNA has a basic alphabet of four nucleotides: A (Adenine), G (Guanine), C (Cytosine), and T (Thymine).

While the computer byte contains eight bits, the logical unit of DNA is the "triplet code." Each group of three nucleotides (say, C-T-A) codes for a particular amino acid, the building blocks of all proteins. These triplet codes (or "codons") instruct the step-by-step addition of amino acids, until, finally, a completed protein chain emerges.
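The triplet grouping described above can be sketched in a few lines of plain Java. This is only an illustration and does not use BioJava; the class name CodonSplitter and its codons method are hypothetical, and the sketch assumes the sequence is read from its first base.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch (no BioJava): split a DNA string into three-letter codons. */
final class CodonSplitter {

    /** Returns the codons of dna, reading from the first base onward. */
    static List<String> codons(String dna) {
        if (dna.length() % 3 != 0) {
            throw new IllegalArgumentException(
                "Sequence length " + dna.length() + " is not divisible by three");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < dna.length(); i += 3) {
            result.add(dna.substring(i, i + 3)); // one codon = three bases
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [atg, gcc, att, gaa, tga]
        System.out.println(codons("atggccattgaatga"));
    }
}
```

Each element of the returned list is one codon, which is the unit that maps to a single amino acid (or to a stop signal) during translation.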

Market on the Move

In this era following the completion of the Human Genome Project, bioinformatics is clearly a field whose time has come. In an earlier time, genetics was a field of research whose major discoveries were typically made at the lab bench. But many, if not a majority, of such discoveries are now being made in silicon. With the sheer volume of data and analysis, spanning multi-national pharmaceutical companies and academic collaborative networks, the work simply could not be done without computers. It's not at all uncommon for larger research labs to generate several hundred Gbytes (or more) of new data per day.

Bioinformatics Facts

  • There are currently over 1400 biotech companies in the U.S., with total revenues of $28.5 billion.
  • 1000 genomes have been studied at some level of detail (including mammals, plants, insects, viruses, and bacteria).
  • The region of the Human Genome coding for proteins comprises less than 2% of the total. The remainder performs currently unknown functions.
  • GenBank, a public database of DNA, RNA, and protein sequences, is doubling in size every six months.
  • Larger genomic research facilities generate upwards of several hundred Gbytes of data per day.
  • The pharmaceutical industry is the most profitable sector of the Fortune 500.


BioJava was spawned in the mid-90s as a result of the computational needs of Matthew Pocock and Thomas Down, two Ph.D. students at the Sanger Institute in Cambridge, England. As a major sequencing center of the Human Genome Project, Sanger attracts many of the best and the brightest from around the globe. "At the time we arrived," says Pocock, "the Sanger Centre was just getting into sequencing the Human Genome, and it was a very exciting time. The whole world passed through the door. Anybody who was anybody was there at some point during the weekly lab meetings."

Pocock arrived at Sanger with C++ and Perl coding experience already under his belt, but soon found the languages lacking for his tasks. "With Perl, I just couldn't get the performance I needed," says Pocock. "When you're working with Genomic data sets, you're often dealing with Gigabytes of data. And Perl didn't handle that very well. C++ could handle that amount of data, but the language really didn't help you to write portable, robust code."

Enter Java Technology

Pocock began meeting regularly with other developers at Sanger, exchanging bioinformatics coding techniques and design patterns. Then, a year or two into his PhD program, word got out that the Java 2 platform had been ported to the various platforms being used at the center. "At that time, we had Sun Solaris nodes, Pentium Pro boxes running Red Hat Linux, and DEC Alpha systems running OSF1," he says.

As a result of the diverse systems being used at the Sanger Center, cross-platform compatibility was of the utmost importance. "Once we had a Java virtual machine (JVM) for all of our various platforms, I could develop and run Java applications at my desktop, but also have them run on our large compute clusters. And not only was the code portable, the Java language kept me from making terrible coding mistakes."


In the same way that genomic data tends to be shared and networked, so does genomic software development. And this made the cross-platform compatibility of Pocock's Java applications all the more valuable to colleagues. "If anyone wrote a piece of software, there were typically ten people who immediately wanted to use it," he reports. "But you have no control over what hardware they may be using."

Once Pocock became office-mates with Thomas Down, the two began sharing software and development tips. "We'd regularly find that we'd both written parsers for the same types of files, or had created object models for the same types of data. After a while, it all got a bit silly." So they joined forces, sharing code, and creating a CVS repository for the code. "That's when the BioJava project was really born," says Pocock.

The first set of classes was officially released in the Fall of 2000. "Just in time to catch the new wave of students arriving at University," says Pocock. "We had about 100 classes at that point."

From its inception, BioJava was designed around interfaces, with working implementations provided so that developers could extend or replace behavior as needed.

Culture of Collaboration

While BioJava was originally hosted out of a hardware box in someone's bedroom, the facility is now hosted and maintained under the auspices of the Open Bioinformatics Foundation, which also hosts such companion bioinformatics facilities as BioPerl, BioSQL, and BioPython.

The OBF handles the member facilities' modest requirements of hardware ownership, domain name management, and funding for conferences and workshops. While the foundation does not participate directly in the development or structure of the facilities, the members of the foundation are drawn from the member projects, so there is a clear commonality of direction and purpose. "Once a year or so, we all meet up on a sort of coding/hackathon vacation," notes Pocock, "to make sure all the different projects interoperate, as best as possible."

The collaborative and shared nature of genomic research all but cried out for BioJava and the OBF's open source model. BioJava is distributed under an LGPL (GNU Lesser General Public License). The LGPL allows developers to modify the code and fix bugs, and to use the facility as a library foundation upon which to build both free software and commercial packages.

Every attempt has been made to keep BioJava usable in almost any environment. "A major constraint is that it has to be possible for anyone to download it and use it, and compile it from scratch if they want to," says Pocock. "So we couldn't tie it into any one IDE. I personally use IntelliJ, but other people use Emacs, Eclipse, NetBeans, and more. Meanwhile, we use the standard javac compiler, Apache Ant for the build, and JUnit for testing. And we also use that for regression testing. Every night, we build the entire project and run all the tests."

Growth Curve

BioJava has grown tremendously since its beginnings. The most recent site statistics show 1,264 public classes and interfaces, with over 200,000 lines of code, and over 14 people regularly contributing to the code. "The total number of classes sounds a bit scary when you count it," explains Pocock, "but there are really only about 15 interfaces. And pretty much everything you ever write is to those 15 interfaces. So there's a frightening amount of complexity that you never see, and are happy not to see!"

Both Pocock and Down keep an active hand in maintaining and enhancing the BioJava code, but it is a truly collaborative open source effort. "Someone like myself, or Thomas, or Mark Schreiber, who is now a major contributor to the site, would approve anything that touched the core object model. And we would also discuss that on the mailing list or the IRC. But the project is actually quite modular. There are two people who are involved with the sequence-searching algorithm code. And they would be in charge of making sure that anything committed to that was safe and sane. In the current era, there is no one person who knows the entire library, or who has responsibility for it."

The most recent monthly site statistics (for April of 2004) show a hit rate of over 170,000, with greater than 400 downloads of the BioJava package, comprising a total of over 130,000 files. At peak times, the site receives over 10,000 hits an hour.

Agent of Change

The cross-platform compatibility of BioJava has definitely served it well, particularly in an industry where vendor hardware can change at the stroke of a politician's pen, or at the dictate of a biotech corporate merger. "We're currently running on Wintel boxes," says Pocock, "on three or four flavors of Linux, on Sun Sparc, and on Mac OS X-based Xserve arrays--pretty much anywhere that can run a Java 1.2 JVM."

And while cross-platform compatibility is a key requirement for the BioJava facility, of nearly equal importance are such features of Java technology as ease of development, scalability, and compatibility with legacy applications and systems.

"Scalability is a huge one for us," says Pocock. "You can have genetic sequences that are seven or eight characters long, or that are 3 Gbytes long. And ideally, you want the same API to manipulate them both."

The bioinformatics realm is clearly still somewhat in "wild west" mode, prone to disparate and home-brewed formats and facilities. So BioJava has to be flexible enough to accommodate this rapidly changing industry. "In bioinformatics, I can sit in my office, write a program that turns out to be useful, give it to a friend, and before I know it, it's being used by 20,000 people. And suddenly, people around the world are writing programs that consume a data format that I made up in my room at home."

In addition to more newly-created home-brewed formats, developers in the genomics realm must also contend with internationally-recognized genomic data banks, as well as formats created by other major software facilities being used in the industry.

In terms of genetic databanks, there are a handful of major entities. "There are three historical ones," reports Pocock, "EMBL, which is the European databank, Genbank, which is the American version, and DDBJ, which is the Japanese version. Then, for storing protein sequences, which is important in the growing field of proteomics, there's Swiss-Prot."

In addition to established data banks, there is also a plethora of genetic analysis facilities, often with their own proprietary formats. "You have Fasta, which just stores the sequences raw," says Pocock, "and then Blast, SSearch, T-Coffee--and every one can have a different file format."

BioJava has to seamlessly handle all of these disparate data types and data formats. "We try to make the memory representation format-agnostic, so that it doesn't matter how you read or write the sequence while you're manipulating it. You have the same sequence in memory, but can then dump it out into multiple formats. You have to enable people to drop-in their own particular formats without having to get into the guts of BioJava."
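The idea of a format-agnostic in-memory representation with drop-in formats might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not BioJava's actual design: SequenceRecord, SequenceWriter, FastaWriter, TabWriter, and the WRITERS registry are all hypothetical names, and the tab-separated format is made up to stand in for a home-brewed one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; these types are illustrative and are not BioJava classes. */
final class FormatDemo {

    /** Format-neutral in-memory record: just a name and the residues. */
    static final class SequenceRecord {
        final String name;
        final String residues;

        SequenceRecord(String name, String residues) {
            this.name = name;
            this.residues = residues;
        }
    }

    /** Anything that can serialize a record; a new format implements this. */
    interface SequenceWriter {
        String write(SequenceRecord record);
    }

    /** FASTA: a '>' header line followed by the residues. */
    static final class FastaWriter implements SequenceWriter {
        public String write(SequenceRecord r) {
            return ">" + r.name + "\n" + r.residues + "\n";
        }
    }

    /** A made-up tab-separated format: name, length, residues. */
    static final class TabWriter implements SequenceWriter {
        public String write(SequenceRecord r) {
            return r.name + "\t" + r.residues.length() + "\t" + r.residues + "\n";
        }
    }

    /** Registry keyed by format name, so a format can be added without touching the record type. */
    static final Map<String, SequenceWriter> WRITERS = new LinkedHashMap<>();
    static {
        WRITERS.put("fasta", new FastaWriter());
        WRITERS.put("tab", new TabWriter());
    }

    public static void main(String[] args) {
        SequenceRecord record = new SequenceRecord("demo", "atggccattgaatga");
        // The same in-memory record is dumped out in every registered format.
        for (Map.Entry<String, SequenceWriter> e : WRITERS.entrySet()) {
            System.out.print(e.getKey() + ":\n" + e.getValue().write(record));
        }
    }
}
```

Supporting another format in this sketch means writing one more SequenceWriter and registering it; the code that manipulates SequenceRecord never changes.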

And in a realm driven by both massive data storage facilities and globally networked systems, and interfacing with disparate systems and facilities, features of the Java platform such as JDBC, JNDI, RMI, and CORBA interoperability also play a vital role. "We have huge data sets," says Pocock, "and it's not practical to load all of that into memory at any one instant. So we use JDBC a lot."

For Pocock, even computational throughput can sometimes be best facilitated by using the Java language. "When you're processing entire genomes, you can never have enough computing power," he admits. "But with the Java language, because you have the flexibility of using objects by API, by interface, and to delegate and shuttle things around, you can take a generic algorithm that's optimized to the particular problem such that it often runs faster than a comparable C program. If you're doing a raw compute problem, then Java won't go faster than C. But if it's a problem of complexity, Java can often, particularly with the HotSpot compiler, perform quite well."

Getting Started

While the BioJava facility is large and offers great flexibility and features, it also offers extensive documentation and help content. The site's "Getting Started" section details the location of the downloadable binary files, CLASSPATH setup, how to obtain the modifiable source code, and how to compile and run the provided demo programs. There is also a tutorial area, and extensive JavaDoc content. Finally, thanks to Mark Schreiber, a Principal Scientist for the Novartis Institute for Tropical Diseases in Singapore, there is the "BioJava In Anger" area on the site, an extensive, generously documented "How To" section.

BioJava In Anger offers a cookbook, "How Do I...?" approach to using BioJava, which was precipitated by Schreiber's own early experiences with the facility, and by frequently asked questions on the mailing list. "BioJava can be hard to understand at first," says Schreiber, "particularly for novice programmers. The API is huge, so it's a bit hard to find a starting point to begin learning, and to figure out how to instantiate some of the objects. Also, it uses many advanced concepts--like interface-based design, singleton objects, and objects with private and protected constructors."

Schreiber recognized that a great majority of users might have relatively simple tasks they'd like to perform. "BioJava is enormously flexible, and you can do some pretty complex things with it," he says. "But 95% of users will, at least initially, want to do fairly standard tasks."

He began putting up generously-documented solutions to common tasks using BioJava. And the content has proven so popular, it's now been translated into other languages. "The most incredible thing for me is that it's now been translated into French, Japanese, and Chinese," he enthuses.

Schreiber's only regret is being unable to always find the time to add desired new content. "I'm very appreciative of any contributions that other people make," he says. "I'd also like to start developing advanced pages for the site, to demonstrate some of the more complex things you can do with the API."

Out For a Spin

It doesn't require a geneticist to take BioJava out for a test drive. Both Genbank and EMBL offer publicly accessible databases of genomic data. Using these, and examples from BioJava In Anger, it's relatively straightforward to put together sample BioJava-powered applications.

To review, DNA has a basic alphabet of four nucleotides: A (Adenine), G (Guanine), C (Cytosine), and T (Thymine). While the computer byte contains eight digital bits, the logical unit of DNA is the "triplet code." Each group of three nucleotide bases (say, C-T-A) codes for a particular amino acid, the building blocks of all proteins. These triplet codes (or "codons") instruct the methodical, step-by-step addition of amino acids, until, finally, a completed protein chain emerges.

In the cell, DNA is first "transcribed" into RNA, and this complementary RNA is then "translated" into protein (through the step-by-step addition of amino acids coded for by the RNA). In the transcription of DNA, each nucleotide (A, G, C, and T) codes for a complement in the resultant RNA: A->U (Uracil), G->C, C->G, and T->A. It's a bit like translating a number from octal to hex. The same information is contained in both, it's just a matter of different storage systems.
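The base-by-base complement rule above can be written as a small lookup. The sketch below is plain Java with hypothetical names (TemplateTranscriber, complement, transcribe); it applies only the four-letter mapping as stated, ignores strand direction, and is not a substitute for the BioJava library call used in the BioJava In Anger listing, whose conventions may differ.

```java
/** Plain-Java sketch of the complement rule A->U, G->C, C->G, T->A. */
final class TemplateTranscriber {

    /** Maps one DNA base to its RNA complement; input case is ignored. */
    static char complement(char dnaBase) {
        switch (Character.toUpperCase(dnaBase)) {
            case 'A': return 'U';
            case 'G': return 'C';
            case 'C': return 'G';
            case 'T': return 'A';
            default:
                throw new IllegalArgumentException("Not a DNA base: " + dnaBase);
        }
    }

    /** Applies the mapping to every base, keeping the original order. */
    static String transcribe(String dna) {
        StringBuilder rna = new StringBuilder(dna.length());
        for (int i = 0; i < dna.length(); i++) {
            rna.append(complement(dna.charAt(i)));
        }
        return rna.toString();
    }

    public static void main(String[] args) {
        System.out.println(transcribe("TACCGG")); // prints AUGGCC
    }
}
```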

Below is an example from BioJava In Anger to transcribe a DNA sequence to RNA, and to then translate that sequence to a protein.

/*
 * NOTE: if you try to create a 'triplet view' on a SymbolList or
 * Sequence whose length is not evenly divisible by three, an
 * IllegalArgumentException will be thrown. See 'how to get a
 * subsequence' for a description of how to get a portion of a
 * Sequence for translation.
 */

import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class Translate {

  public static void main(String[] args) {
    try {
      // create a DNA SymbolList
      SymbolList symL = DNATools.createDNA("atggccattgaatga");

      // transcribe to RNA
      symL = RNATools.transcribe(symL);

      // prove that it worked
      System.out.println(symL.seqString());

      // translate to protein
      symL = RNATools.translate(symL);

      // prove that it worked
      System.out.println(symL.seqString());
    }
    catch (IllegalAlphabetException ex) {
      /*
       * this will occur if you try to transcribe a non DNA sequence, or
       * translate a sequence that isn't an RNA sequence.
       */
      ex.printStackTrace();
    }
    catch (IllegalSymbolException ex) {
      /*
       * this will occur if non IUB characters are used to create the
       * DNA SymbolList
       */
      ex.printStackTrace();
    }
    catch (IllegalArgumentException ex) {
      /*
       * This will occur if you try to translate a SymbolList using a
       * length that is not evenly divisible by three.
       * To resolve this issue make a sub-list of only the part you wish
       * to translate.
       */
      ex.printStackTrace();
    }
  }
}

Career Opportunities

The exploding field of bioinformatics offers Java developers an array of exciting new career opportunities. While a majority of current developers have come to the field from the biology world, picking up coding expertise along the way, the reverse path is also a very real possibility. "I would say that the industry is currently about 2/3 biologists who got into bioinformatics, and 1/3 computer people," says Pocock. "You don't need to be an expert in biology to be a bioinformatics person, but you need to know the basic processes that go on in the cell. If you're an individual who's already an experienced developer, I'd recommend reading the first few chapters of a couple of genetics textbooks, and then writing some code."

Looking Forward

BioJava 1.3 is the current official release of the facility, built upon the Java 1.2 platform. In late 2004 or early 2005, BioJava 2 is slated for release, with new features built upon the Java 1.5 platform.

Pocock sees the generics support of the Java 1.5 platform as a particularly big plus for their upcoming release. "It allows us to re-use much more utility code, such as from the Collections framework, without the current type safety issues," he says. "For example, using generics when processing a sequence file, we can return an Iterator over Sequence objects, whereas in earlier Java platform releases, we would have had to return an Iterator over Object. About a quarter of the bugs in the last year were due to these types of type-safety issues. The other big feature that Java 1.5 offers us is source code annotations, which allow a developer to tag source code (such as a class or method) with extra annotations."
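The difference Pocock describes can be seen in a short comparison. This is a sketch under stated assumptions: the Sequence class here is a hypothetical stand-in rather than the BioJava interface, and rawIterator, typedIterator, totalLength, and sample are names invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch; Sequence here is a stand-in, not the BioJava interface. */
final class GenericsDemo {

    static final class Sequence {
        final String name;
        final String residues;

        Sequence(String name, String residues) {
            this.name = name;
            this.residues = residues;
        }
    }

    /** Two small records standing in for the result of parsing a sequence file. */
    static List<Sequence> sample() {
        List<Sequence> parsed = new ArrayList<>();
        parsed.add(new Sequence("a", "atg"));
        parsed.add(new Sequence("b", "ggcc"));
        return parsed;
    }

    /** Pre-generics style: callers receive Object and must cast. */
    @SuppressWarnings("rawtypes")
    static Iterator rawIterator(List<Sequence> parsed) {
        return parsed.iterator();
    }

    /** Generic style: the element type is part of the signature. */
    static Iterator<Sequence> typedIterator(List<Sequence> parsed) {
        return parsed.iterator();
    }

    /** Sums residue counts with no casts anywhere. */
    static int totalLength(Iterator<Sequence> sequences) {
        int total = 0;
        while (sequences.hasNext()) {
            total += sequences.next().residues.length();
        }
        return total;
    }

    @SuppressWarnings("rawtypes")
    public static void main(String[] args) {
        // Raw iterator: a wrong cast here would compile and fail only at run time.
        Iterator raw = rawIterator(sample());
        while (raw.hasNext()) {
            Sequence s = (Sequence) raw.next();
            System.out.println(s.name);
        }

        // Typed iterator: no cast, and a wrong element type is a compile-time error.
        System.out.println(totalLength(typedIterator(sample()))); // prints 7
    }
}
```

With the typed version, the class of bug Pocock mentions moves from run time to compile time, because the compiler rejects code that treats the elements as anything other than Sequence.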

Business Opportunities

While BioJava is an open source toolkit, there are already ample business opportunities in and around the facility. In addition to his academic teaching duties, Pocock offers training sessions in the use of BioJava. And in partnership with Down and several other colleagues, he has formed Symference, a company providing customized life-science computing solutions to biotech companies. In addition to working with BioJava, Symference offers a companion Java technology based facility called Taverna, which is a web services oriented workflow management tool. Taverna unifies biological discovery on one platform, integrating disparate computing resources.

Meanwhile, Chris Dagdigian (who was instrumental in forming the Open Bioinformatics Foundation and is on its board of directors), has partnered with several colleagues to form BioTeam, a consulting collective dedicated to delivering vendor-neutral informatics solutions to the life sciences industry. The team offers everything from needs assessment, to application optimization, to infrastructure design, to platform installation and tuning.

Just the Beginning

And the bioinformatics field is really just getting started. Less than 2% of the Human Genome codes for protein. The rest, like the dark matter of cosmology, is a relative unknown, seemingly comprised of various regulatory and "epigenetic" regions of information. Then there's the burgeoning field of proteomics (the study of protein structure and function), which may eventually dwarf genomics in terms of data and computational analysis needs. And more recently, bioinformatics has even entered into the realm of "in-silico" computational simulations of living systems--beginning at the cellular level, then, theoretically, on up to the organ level, and eventually to the level of entire organisms. Such simulations offer the promise of predictive models, where the effects of a potential new drug can be tested computationally, prior to any actual animal or human testing.

In short, the more we learn in bioinformatics, the more we discover there is to learn! The computational tasks, and developer opportunities, are virtually endless.

For More Information