jsoup HTML Parsing Library for Java Developers

Easily parse HTML, extract specified elements, validate structure, and sanitize content.

By Mert Çalişkan


Today, enterprise Java web application developers use HTML in every aspect of a project. This work is made difficult at times because parsing HTML content is a tedious task. Doing so without a parser framework is a most undesirable chore. Fortunately, there are a handful of Java-based HTML parsers publicly available. In this article, I will focus on one of my favorites, jsoup, which was first released as open source in January 2010. It has been under active development since then by Jonathan Hedley, and the code uses the liberal MIT license.

What It Is

jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors.

jsoup can manipulate the content: the HTML element itself, its attributes, or its text. It updates older content based on HTML 4.x to HTML5 or XHTML by converting deprecated tags to new versions. It can also do cleanup based on whitelists, tidy HTML output, and complete unbalanced tags automagically. I will demonstrate these features with some working examples.
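
As a minimal sketch of this kind of manipulation, the following code rewrites a link's attribute, text, and class through the Element API. The HTML fragment, the URL, and the class name are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ManipulationSketch {

    // Rewrites the first link found in the fragment and returns its HTML.
    static String rewriteFirstLink(String html) {
        Document document = Jsoup.parseBodyFragment(html);
        Element link = document.select("a").first();
        link.attr("href", "https://example.com/new"); // hypothetical URL
        link.text("Updated link");                     // replaces the link text
        link.addClass("external");                     // hypothetical class name
        return link.outerHtml();
    }

    public static void main(String... args) {
        System.out.println(
            rewriteFirstLink("<p><a href=\"/old\">Old link</a></p>"));
    }
}
```

The setters return the element itself, so the three calls could also be chained into a single expression.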

All the examples in this article are based on jsoup version 1.10.2, which is the latest available version at the time of this writing. The complete source code for this article is available on GitHub.

The DOM and jsoup Essentials

The DOM is a language-independent representation of an HTML document that defines the document’s structure. Figure 1 shows the class diagram of the jsoup framework classes. Later, I’ll show you how they map to the DOM elements.

The org.jsoup.nodes.Node abstract class is the main element of jsoup. It represents a node in the DOM tree, which can be the document itself, a text node, a comment, or an element (for example, a form element) within the document. Each Node keeps a reference to its parent node and a list of its own child nodes, so it can also reach its siblings through the parent.

The Element class represents an HTML element, which consists of a tag name, attributes, and child nodes. The Attributes class is a container for the attributes of the HTML elements and is composed within the Node class.
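
To make these relationships concrete, here is a small sketch that inspects one element's parent, attributes, and child nodes. The paragraph markup is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

class NodeModelSketch {

    // Parses a body fragment and returns its first element.
    static Element firstElement(String html) {
        Document document = Jsoup.parseBodyFragment(html);
        return document.body().child(0);
    }

    // Collects the node names of the direct children (elements, text, comments).
    static List<String> childNodeNames(Element element) {
        List<String> names = new ArrayList<String>();
        for (Node child : element.childNodes()) {
            names.add(child.nodeName());
        }
        return names;
    }

    public static void main(String... args) {
        Element p = firstElement(
            "<p id=\"intro\" class=\"lead\">Hello <b>World</b><!-- note --></p>");
        System.out.println("parent: " + p.parent().nodeName());
        for (Attribute attribute : p.attributes()) {
            System.out.println("attribute: " + attribute.getKey()
                + "=" + attribute.getValue());
        }
        System.out.println("children: " + childNodeNames(p));
    }
}
```

Text and comments show up as child nodes with the names #text and #comment, while the nested <b> tag shows up as an element.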

Figure 1. jsoup class diagram

Getting Started

You can obtain the latest version of jsoup from Maven’s Central Repository with the following dependency definition. The current release will run on any version of Java since Java 5.

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>

Gradle users can retrieve the artifact with

org.jsoup:jsoup:1.10.2

The main access point class, org.jsoup.Jsoup, is the principal way to use the functionality of jsoup. It provides base methods that can parse an HTML document passed to it as a file or an input stream, a string, or an HTML document provided through a URL. The example in Listing 1 parses HTML text and outputs first the node name of the element and then the HTML text owned by the element, as shown immediately below the code.

Listing 1.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Example1Main {

    static String htmlText = "<!DOCTYPE html>" +
            "    <html>" +
            "    <head>" +
            "       <title>Java Magazine</title>" +
            "    </head>" +
            "    <body>" +
            "       <h1>Hello World!</h1>" +
            "    </body>" +
            "</html>";

    public static void main(String... args) {
        Document document = Jsoup.parse(htmlText);
        // The document itself plus every element in it
        Elements allElements = 
            document.getAllElements();
        for (Element element : allElements) {
            // ownText() excludes text of child elements
            System.out.println(element.nodeName() 
            + " " + element.ownText());
        }
    }
}

The output is

#document 
html 
head 
title Java Magazine
body 
h1 Hello World!
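
Listing 1 parses a string. The other entry points mentioned above work the same way, as the following sketch suggests for an input stream and a file. It assumes Java 7 or later for the java.nio helpers, and the temporary file exists only for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;

class ParseSourcesSketch {

    static final String HTML =
            "<html><head><title>Java Magazine</title></head>" +
            "<body><h1>Hello World!</h1></body></html>";

    // Parse directly from a string.
    static String titleFromString() {
        return Jsoup.parse(HTML).title();
    }

    // Parse from an input stream; the third argument is the base URI
    // used to resolve relative links (left empty here).
    static String titleFromStream() throws IOException {
        InputStream in = new ByteArrayInputStream(
                HTML.getBytes(StandardCharsets.UTF_8));
        return Jsoup.parse(in, "UTF-8", "").title();
    }

    // Parse from a file on disk (a throwaway temporary file).
    static String titleFromFile() throws IOException {
        File file = File.createTempFile("jsoup-sketch", ".html");
        try {
            Files.write(file.toPath(),
                    HTML.getBytes(StandardCharsets.UTF_8));
            return Jsoup.parse(file, "UTF-8").title();
        } finally {
            file.delete();
        }
    }

    public static void main(String... args) throws IOException {
        System.out.println(titleFromString());
        System.out.println(titleFromStream());
        System.out.println(titleFromFile());
    }
}
```

All three calls return the same Document type, so the rest of the code does not need to know where the HTML came from.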

Ways to select DOM elements. jsoup provides several ways to iterate through the parsed HTML elements and find the requested ones. You can use either the DOM-specific getElementBy* methods or CSS and jQuery-like selectors. I will demonstrate both approaches by parsing a web page and extracting all links that have HTML <a> tags. The code in Listing 2 parses the Java Champions bio page and extracts the link names for all the Java Champions marked as “New!” (see Figure 2).

Figure 2. Part of the HTML page to be parsed

The marking was done by adding a <font> tag with text New! right next to the link. So, I will be checking for the content of the next-sibling element of each link.

Listing 2.
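import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Example2Main {

    public static void main(String... args) 
        throws IOException {
        // timeout(0) disables the connection timeout
        Document document = Jsoup.connect(
            "https://java.net/website/" + 
            "java-champions/bios.html")
            .timeout(0).get();

        Elements allElements = 
            document.getElementsByTag("a");
        for (Element element : allElements) {
            // The New! marker, if present, is the
            // element right after the link
            Element next = 
                element.nextElementSibling();
            if (next != null 
                && "New!".equals(next.ownText())) {
                System.out.println(
                    element.ownText());
            }
        }
    }
}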

The same extraction of the links can also be done with selectors, as shown in Listing 3. This code extracts the links whose href value contains #.

Listing 3.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Example3Main {

    public static void main(String... args) 
            throws IOException {
        Document document = Jsoup.connect(
            "https://java.net" + 
            "/website/java-champions/bios.html")
            .timeout(0).get();
        // Links whose href attribute contains #
        Elements allElements = 
            document.select("a[href*=#]");
        for (Element element : allElements) {
            Element next = 
                element.nextElementSibling();
            if (next != null 
                && "New!".equals(next.ownText())) {
                System.out.println(
                    element.ownText());
            }
        }
    }
}

Selectors are powerful compared with DOM-specific methods. They can be combined to refine a selection. In the previous code examples, we do the New! text check ourselves, which is trivial. The example in Listing 4 selects the <font> tag that contains the New! text and that follows a sibling link whose href contains the value #. This really shows the power of selectors.

Listing 4.
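import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Example4Main {

    public static void main(String... args) 
            throws IOException {
        Document document = Jsoup.connect(
            "https://java.net" +
            "/website/java-champions/bios.html")
            .timeout(0).get();
        // <font> tags with own text New! that follow
        // a sibling link whose href contains #
        Elements allElements = document.select(
            "a[href*=#] ~ font:containsOwn(New!)");
        for (Element element : allElements) {
            // Step back from the marker to the link
            System.out.println(element
                    .previousElementSibling()
                    .ownText());
        }
    }
}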

Here, the selector locates the <font> tag as an element. I then call the previousElementSibling() method on it to step one element back to the link. The select() method is available in the Document, Element, and Elements classes. As of version 1.10.2, jsoup does not support XPath queries. More information about selectors is available at the jsoup site.
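
Because select() is available at all three levels, a query can be scoped to part of a page. The sketch below runs the selector from the listings against a small inline document instead of the live page. The markup and the names in it are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectScopesSketch {

    // Hypothetical markup that mimics the structure described above.
    static final String HTML =
            "<ul id=\"champions\">" +
            "<li><a href=\"#ada\">Ada</a> <font>New!</font></li>" +
            "<li><a href=\"#bob\">Bob</a></li>" +
            "</ul>" +
            "<a href=\"#other\">Other</a>";

    // Document.select() finds the list; Element.select() then searches
    // only inside it, so the link outside the list is not returned.
    static List<String> linksInsideList() {
        Document document = Jsoup.parse(HTML);
        Element list = document.select("#champions").first();
        List<String> names = new ArrayList<String>();
        for (Element link : list.select("a[href*=#]")) {
            names.add(link.text());
        }
        return names;
    }

    // Elements.select() runs the query against every matched <li>.
    static List<String> newChampions() {
        Document document = Jsoup.parse(HTML);
        Elements items = document.select("li");
        Elements markers =
                items.select("a[href*=#] ~ font:containsOwn(New!)");
        List<String> names = new ArrayList<String>();
        for (Element marker : markers) {
            names.add(marker.previousElementSibling().text());
        }
        return names;
    }

    public static void main(String... args) {
        System.out.println(linksInsideList());
        System.out.println(newChampions());
    }
}
```

Testing against an inline string like this also keeps the example independent of network access.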

Traversing nodes. jsoup provides the org.jsoup.select.NodeVisitor interface, which contains two methods: head() and tail(). By implementing that interface as an anonymous class and passing the instance to the document.traverse() method, you receive one callback when a node is first visited (head) and another when it is last visited, after all of its descendants have been visited (tail). The code in Listing 5 uses this technique to traverse a simple HTML text and output all node details.

Listing 5.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

public class Example5Main {

    static String htmlText = "<!DOCTYPE html>" +
            "<html>" +
            "<head>" +
            "<title>Java Magazine</title>" +
            "</head>" +
            "<body>" +
            "<h1>Hello World!</h1>" +
            "</body>" +
            "</html>";

    public static void main(String... args) {
        Document document = Jsoup.parse(htmlText);

        document.traverse(new NodeVisitor() {
            // Called when a node is first visited
            @Override
            public void head(Node node, int depth){
                System.out.println("Node start: "
                        + node.nodeName());
            }

            // Called after the node's descendants
            // have all been visited
            @Override
            public void tail(Node node, int depth){
                System.out.println("Node end: " +
                        node.nodeName());
            }
        });
    }
}

The output from this traversal is as follows:

Node start: #document
Node start: #doctype
Node end: #doctype
Node start: html
Node start: head
Node start: title
Node start: #text
Node end: #text
Node end: title
Node end: head
Node start: body
Node start: h1
Node start: #text
Node end: #text
Node end: h1
Node end: body
Node end: html
Node end: #document
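
The depth parameter, which Listing 5 ignores, reports how far a node sits below the node where the traversal started. As a rough sketch, the following visitor uses it to build a flat outline of the elements in a document. The class name and the output format are my own choices.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

class OutlineSketch {

    // Returns one "depth:tag" entry per element, in document order.
    static List<String> outline(String html) {
        Document document = Jsoup.parse(html);
        final List<String> lines = new ArrayList<String>();
        document.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Skip text, comments, and the document node itself.
                if (node instanceof Element
                        && !(node instanceof Document)) {
                    lines.add(depth + ":" + node.nodeName());
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return lines;
    }

    public static void main(String... args) {
        for (String line : outline(
                "<html><head><title>Java Magazine</title></head>" +
                "<body><h1>Hello World!</h1><p>Bye</p></body></html>")) {
            System.out.println(line);
        }
    }
}
```

The document node is at depth 0, so the html element is reported at depth 1 and its children at depth 2.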

Parsing XML files. jsoup supports parsing of XML files with a built-in XML parser. The example in Listing 6 parses an XML text and outputs it with appropriate formatting. Note once again how easily this is accomplished.

Listing 6.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class Example6Main {

    static String xml = 
         "<?xml version=\"1.0\" " +
         "encoding=\"UTF-8\"?><entries><entry>" +
         "<key>xxx</key>" +
         "<value>yyy</value></entry>" +
         "<entry><key>xxx</key>" +
         "<value>zzz</value>" +
         "</entry></entries>";

    public static void main(String... args) {
        // Use the XML parser instead of the
        // default HTML parser
        Document doc = 
          Jsoup.parse(xml, "", Parser.xmlParser());
        System.out.println(doc.toString());
    }
}

As you would expect, the output from this is

<?xml version="1.0" encoding="UTF-8"?>
<entries>
 <entry>
  <key>
   xxx
  </key>
  <value>
   yyy
  </value>
 </entry>
 <entry>
  <key>
   xxx
  </key>
  <value>
   zzz
  </value>
 </entry>
</entries>

It’s also possible to use selectors for picking up values from specified XML tags. The code snippet in Listing 7 selects <value> tags that reside in <entry> tags.

Listing 7.
Document doc = 
    Jsoup.parse(xml, "", Parser.xmlParser());
Elements elements = doc.select("entry value");
Iterator<Element> it = elements.iterator();
while (it.hasNext()) {
    Element element = it.next();
    System.out.println(element.nodeName() + 
        " - " + element.ownText());
}
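
Listing 7 is a fragment. A self-contained variant might look like the sketch below, which also pairs each key with its value. The keys first and second are placeholders I chose so that the two entries can be told apart.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

class XmlEntriesSketch {

    // Hypothetical input with distinct keys.
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
            "<entries>" +
            "<entry><key>first</key><value>yyy</value></entry>" +
            "<entry><key>second</key><value>zzz</value></entry>" +
            "</entries>";

    // Maps each <key> to the <value> inside the same <entry>.
    static Map<String, String> readEntries(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Map<String, String> result =
                new LinkedHashMap<String, String>();
        for (Element entry : doc.select("entries > entry")) {
            String key = entry.select("key").text();
            String value = entry.select("value").text();
            result.put(key, value);
        }
        return result;
    }

    public static void main(String... args) {
        System.out.println(readEntries(XML));
    }
}
```

Selecting the entry elements first and then querying inside each one keeps every key next to its own value, which a flat "entry value" query cannot guarantee.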

Preventing XSS attacks. Many sites prevent cross-site scripting (XSS) attacks by prohibiting the user from submitting HTML content or by enforcing the use of an alternative markup syntax, such as Markdown. A clever solution for preventing malicious HTML input is to use a WYSIWYG editor and filter the HTML output with jsoup’s whitelist sanitizer. The sanitizer parses the HTML, iterates through it, and removes the unwanted tags, attributes, or values according to a whitelist, such as the ones built into the framework.

The example in Listing 8 defines a test method that cleans up HTML text according to a simple text whitelist. This list, as you will see in a moment, allows only simple text formatting with HTML tags: b, em, i, strong, and u.

Listing 8.
@Test
public void simpleTextCleaningWorksOK() {
    String html = "<div>" + 
        "<a href='http://www.oracle.com'>" +
        "<b>Hello Reader</b>!</a></div>";
    String cleanHtml = Jsoup.clean(
        html, Whitelist.simpleText());
    assertThat(cleanHtml, 
               is("<b>Hello Reader</b>!"));
}

The Whitelist class offers prebuilt lists such as simpleText(), which limits HTML to the elements listed above. There are other acceptance options, such as none(), basic(), basicWithImages(), and relaxed().
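
To compare these options, the following sketch cleans the same fragment with two of the prebuilt lists and with a custom list built by calling addTags(). The input markup and the custom tag selection are illustrative only. The code uses the Whitelist class name from jsoup 1.10.2, the version this article is based on; newer jsoup releases rename the class to Safelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

class WhitelistPresetsSketch {

    // Hypothetical user input.
    static final String HTML = "<p>Hi <b>there</b> <i>you</i></p>";

    static String clean(Whitelist whitelist) {
        return Jsoup.clean(HTML, whitelist);
    }

    // A custom list: start from nothing and allow only <p> and <b>.
    static Whitelist paragraphsAndBold() {
        return Whitelist.none().addTags("p", "b");
    }

    public static void main(String... args) {
        // none(): all tags are removed, only the text remains
        System.out.println(clean(Whitelist.none()));
        // simpleText(): <b> and <i> survive, <p> is removed
        System.out.println(clean(Whitelist.simpleText()));
        // custom: <p> and <b> survive, <i> is removed
        System.out.println(clean(paragraphsAndBold()));
    }
}
```

In each case, the text inside a removed tag is kept; only the tag itself is dropped.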

Listing 9 shows an example of the usage of basic(), which allows these HTML tags: a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, span, strike, strong, sub, sup, u, ul.

Listing 9.
@Test
public void basicCleaningWorksOK() {
    String html = "<div><p><a " +
            "href='javascript:hackSystem()" +
            "'>Hello</a></div>";
    String cleanHtml = Jsoup.clean(html,
            Whitelist.basic());
    assertThat(cleanHtml, is("<p><a " +
            "rel=\"nofollow\">Hello</a></p>"));
}

As seen in the test, the script call is eliminated and the tags that are not allowed, such as div, are also removed. In addition, jsoup automatically completes unbalanced tags, such as the missing </p> in our example.
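
If you want to reject suspicious input rather than silently repair it, jsoup also offers Jsoup.isValid(), which reports whether a fragment already conforms to a whitelist. The sketch below is one possible use of it. The helper name isAcceptable() is hypothetical, and the code again assumes the Whitelist class of jsoup 1.10.2.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

class ValidationSketch {

    // Returns true only if cleaning with basic() would remove nothing.
    static boolean isAcceptable(String bodyHtml) {
        return Jsoup.isValid(bodyHtml, Whitelist.basic());
    }

    public static void main(String... args) {
        // Only tags that basic() allows
        System.out.println(
            isAcceptable("<p><b>Hello</b></p>"));
        // A javascript: link, which basic() does not allow
        System.out.println(isAcceptable(
            "<a href='javascript:hackSystem()'>Hello</a>"));
        // A script element, which basic() does not allow
        System.out.println(
            isAcceptable("<script>alert(1)</script>"));
    }
}
```

A validity check like this suits form validation, where the user can be asked to correct the input, whereas Jsoup.clean() suits cases where the content must be accepted in some form.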

Conclusion

This article, which previously appeared in Java Magazine but has been updated here, shows only a subset of what jsoup can do. jsoup also offers features such as tidying HTML, manipulating the attributes or text of HTML tags, and more. Put another way, any HTML processing you might need to do is a likely candidate for using jsoup.


This article originally was published in Java Magazine.


Mert Çalişkan (@0hjc) is a Java Champion and coauthor of PrimeFaces Cookbook and Beginning Spring (Wiley Publications). He is the founder of AnkaraJUG, which is the most active Java user group in Turkey.