Easily parse HTML, extract specified elements, validate structure, and sanitize content.
By Mert Çalişkan
Today, enterprise Java web application developers use HTML in every aspect of a project. This work is made difficult at times because parsing HTML content is a tedious task. Doing so without a parser framework is a most undesirable chore. Fortunately, there are a handful of Java-based HTML parsers publicly available. In this article, I will focus on one of my favorites, jsoup, which was first released as open source in January 2010. It has been under active development since then by Jonathan Hedley, and the code uses the liberal MIT license.
jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors.
jsoup can manipulate the content: the HTML element itself, its attributes, or its text. It updates older content based on HTML 4.x to HTML5 or XHTML by converting deprecated tags to new versions. It can also do cleanup based on whitelists, tidy HTML output, and complete unbalanced tags automagically. I will demonstrate these features with some working examples.
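As a quick taste of the manipulation API, here is a minimal sketch of my own (the HTML fragment and URLs are placeholders, not from the article) that parses a fragment, rewrites a link's attribute and text, and prints the result.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ManipulationSketch {
    public static void main(String... args) {
        Document doc = Jsoup.parse(
                "<p>Read the <a href='http://old.example.com'>guide</a></p>");

        Element link = doc.select("a").first();
        link.attr("href", "https://new.example.com"); // rewrite an attribute
        link.text("updated guide");                   // replace the link text

        // jsoup has already wrapped the fragment in html/head/body for us
        System.out.println(doc.body().html());
    }
}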
All the examples in this article are based on jsoup version 1.10.2, which is the latest available version at the time of this writing. The complete source code for this article is available on GitHub.
The DOM is a language-independent representation of an HTML document that defines the document's structure and styling. Figure 1 shows the class diagram of the jsoup framework classes. Later, I'll show you how they map to DOM elements.
The org.jsoup.nodes.Node abstract class is the main element of jsoup. It represents a node in the DOM tree, which could be the document itself, a text node, a comment, or an element (including form elements) within the document. A Node refers to its parent node and knows all of its own child nodes.

The Element class represents an HTML element, which consists of a tag name, attributes, and child nodes. The Attributes class is a container for the attributes of HTML elements and is composed within the Node class.
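To make the mapping concrete, here is a small sketch of my own (the snippet and identifiers are illustrative, not from the article) that walks from an Element to its parent Node and iterates over its Attributes container.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class NodeModelSketch {
    public static void main(String... args) {
        Document doc = Jsoup.parse("<p id='intro' class='lead'>Hello</p>");
        Element paragraph = doc.getElementById("intro");

        // Every Node knows its parent and its child nodes
        Node parent = paragraph.parent();
        System.out.println("parent: " + parent.nodeName()
                + ", child nodes: " + parent.childNodes().size());

        // The Attributes container is reachable from the Node/Element
        for (Attribute attribute : paragraph.attributes()) {
            System.out.println(attribute.getKey() + " = " + attribute.getValue());
        }
    }
}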
You can obtain the latest version of jsoup from Maven’s Central Repository with the following dependency definition. The current release will run on any version of Java since Java 5.
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>
Gradle users can retrieve the artifact with the coordinates org.jsoup:jsoup:1.10.2.
The main access point class, org.jsoup.Jsoup, is the principal way to use the functionality of jsoup. It provides base methods that can parse an HTML document passed to it as a file, an input stream, a string, or an HTML document provided through a URL. The example in Listing 1 parses HTML text and outputs first the node name of each element and then the text owned by that element, as shown immediately below the code.
public class Example1Main {

    static String htmlText = "<!DOCTYPE html>" +
            " <html>" +
            " <head>" +
            " <title>Java Magazine</title>" +
            " </head>" +
            " <body>" +
            " <h1>Hello World!</h1>" +
            " </body>" +
            "</html>";

    public static void main(String... args) {
        Document document = Jsoup.parse(htmlText);
        Elements allElements = document.getAllElements();
        for (Element element : allElements) {
            System.out.println(element.nodeName() + " " + element.ownText());
        }
    }
}
The output is
#document
html
head
title Java Magazine
body
h1 Hello World!
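Listing 1 parses an in-memory string, but the same entry point accepts the other sources mentioned above. The following sketch is mine (the file name and URL are placeholders); it shows the file, input-stream, and URL overloads.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class InputSourcesSketch {
    public static void main(String... args) throws IOException {
        // Parse a file on disk, stating its character set
        Document fromFile = Jsoup.parse(new File("page.html"), "UTF-8");

        // Parse any input stream; the base URI is used to resolve relative links
        try (InputStream in = new FileInputStream("page.html")) {
            Document fromStream = Jsoup.parse(in, "UTF-8", "https://example.com/");
            System.out.println(fromStream.title());
        }

        // Fetch and parse a document over HTTP
        Document fromUrl = Jsoup.connect("https://example.com/").get();
        System.out.println(fromFile.title() + " / " + fromUrl.title());
    }
}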
Ways to select DOM elements. jsoup provides several ways to iterate through the parsed HTML elements and find the requested ones. You can use either the DOM-specific getElementBy* methods or CSS and jQuery-like selectors. I will demonstrate both approaches by parsing a web page and extracting all links that have HTML <a> tags. The code in Listing 2 parses the Java Champions bio page and extracts the link names for all the Java Champions marked as "New!" (see Figure 2). The marking was done by adding a <font> tag with the text New! right next to the link, so I will be checking the content of the next-sibling element of each link.
public class Example2Main {

    public static void main(String... args) throws IOException {
        Document document = Jsoup.connect(
                "https://java.net/website/" +
                "java-champions/bios.html")
                .timeout(0).get();
        Elements allElements = document.getElementsByTag("a");
        for (Element element : allElements) {
            if ("New!".equals(element.nextElementSibling() != null
                    ? element.nextElementSibling().ownText()
                    : "")) {
                System.out.println(element.ownText());
            }
        }
    }
}
The same extraction of the links can also be done with selectors, as shown in Listing 3. This code extracts the links whose href value contains #.
public class Example3Main {

    public static void main(String... args) throws IOException {
        Document document = Jsoup.connect(
                "https://java.net/website/" +
                "java-champions/bios.html")
                .timeout(0).get();
        Elements allElements = document.select("a[href*=#]");
        for (Element element : allElements) {
            if ("New!".equals(element.nextElementSibling() != null
                    ? element.nextElementSibling().ownText()
                    : "")) {
                System.out.println(element.ownText());
            }
        }
    }
}
Selectors are powerful compared with DOM-specific methods, and they can be combined to refine a selection. In the previous code examples, we check for the New! text ourselves, which is trivial. The example in Listing 4 selects the <font> tag that contains the New! text and that resides after a link whose href contains the value #. This really shows the power of selectors.
public class Example4Main {

    public static void main(String... args) throws IOException {
        Document document = Jsoup.connect(
                "https://java.net/website/" +
                "java-champions/bios.html")
                .timeout(0).get();
        Elements allElements = document.select(
                "a[href*=#] ~ font:containsOwn" +
                "(New!)");
        for (Element element : allElements) {
            System.out.println(
                element.previousElementSibling().ownText());
        }
    }
}
Here, the selectors locate the <font> tag as an element. I then call the previousElementSibling() method on it to step one element back to the link. The select() method is available in the Document, Element, and Elements classes. Currently, jsoup does not support XPath queries. More information about selectors is available at the jsoup site.
Traversing nodes. jsoup provides the org.jsoup.select.NodeVisitor interface, which contains two methods: head() and tail(). By implementing an anonymous class from that interface and passing it as a parameter to the document.traverse() method, it is possible to have a callback when each node is first and last visited. The code in Listing 5 uses this technique to traverse a simple HTML text and output all the node details.
public class Example5Main {

    static String htmlText = "<!DOCTYPE html>" +
            "<html>" +
            "<head>" +
            "<title>Java Magazine</title>" +
            "</head>" +
            "<body>" +
            "<h1>Hello World!</h1>" +
            "</body>" +
            "</html>";

    public static void main(String... args) throws IOException {
        Document document = Jsoup.parse(htmlText);
        document.traverse(new NodeVisitor() {
            public void head(Node node, int depth) {
                System.out.println("Node start: " + node.nodeName());
            }
            public void tail(Node node, int depth) {
                System.out.println("Node end: " + node.nodeName());
            }
        });
    }
}
The output from this traversal is as follows:
Node start: #document
Node start: #doctype
Node end: #doctype
Node start: html
Node start: head
Node start: title
Node start: #text
Node end: #text
Node end: title
Node end: head
Node start: body
Node start: h1
Node start: #text
Node end: #text
Node end: h1
Node end: body
Node end: html
Node end: #document
Parsing XML files. jsoup supports parsing of XML files with a built-in XML parser. The example in Listing 6 parses an XML text and outputs it with appropriate formatting. Note once again how easily this is accomplished.
public class Example6Main {

    static String xml = "<?xml version=\"1.0\"" +
            "encoding=\"UTF8\"><entries><entry>" +
            "<key>xxx</key>" +
            "<value>yyy</value></entry>" +
            "<entry><key>xxx</key>" +
            "<value>zzz</value>" +
            "</entry></entries></xml>";

    public static void main(String... args) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        System.out.println(doc.toString());
    }
}
As you would expect, the output from this is
<?xml version="1.0"encoding="UTF8">
<entries>
 <entry>
  <key>
   xxx
  </key>
  <value>
   yyy
  </value>
 </entry>
 <entry>
  <key>
   xxx
  </key>
  <value>
   zzz
  </value>
 </entry>
</entries>
It’s also possible to use selectors to pick up values from specified XML tags. The code snippet in Listing 7 selects the <value> tags that reside in <entry> tags.
Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
Elements elements = doc.select("entry value");
Iterator<Element> it = elements.iterator();
while (it.hasNext()) {
    Element element = it.next();
    System.out.println(element.nodeName() + " - " + element.ownText());
}
Preventing XSS attacks. Many sites prevent cross-site scripting (XSS) attacks by prohibiting users from submitting HTML content or by enforcing the use of an alternative markup syntax, such as Markdown. A clever solution for handling potentially malicious HTML input is to let users compose content in a WYSIWYG editor and filter the submitted HTML with jsoup's whitelist sanitizer. The whitelist sanitizer parses the HTML, iterates through it, and removes the unwanted tags, attributes, or values according to the whitelist it is given; jsoup ships with several prebuilt whitelists.
The example in Listing 8 defines a test method that cleans up HTML text according to the simple-text whitelist. This list, as you will see in a moment, allows only simple text formatting with the HTML tags b, em, i, strong, and u.
@Test
public void simpleTextCleaningWorksOK() {
    String html = "<div>" +
            "<a href='http://www.oracle.com'>" +
            "<b>Hello Reader</b>!</a></div>";
    String cleanHtml = Jsoup.clean(html, Whitelist.simpleText());
    assertThat(cleanHtml, is("<b>Hello Reader</b>!"));
}
The Whitelist class offers prebuilt lists such as simpleText(), which limits HTML to the previous elements. There are other acceptance options, such as none(), basic(), basicWithImages(), and relaxed().
Listing 9 shows an example of the usage of basic(), which allows these HTML tags: a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, span, strike, strong, sub, sup, u, and ul.
@Test
public void basicCleaningWorksOK() {
    String html = "<div><p><a " +
            "href='javascript:hackSystem()'>Hello</a></div>";
    String cleanHtml = Jsoup.clean(html, Whitelist.basic());
    assertThat(cleanHtml,
        is("<p><a rel=\"nofollow\">Hello</a></p>"));
}
As seen in the test, the script call is eliminated, and tags that are not allowed, such as div, are also removed. In addition, jsoup automatically completes unbalanced tags, such as the missing </p> in our example.
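The prebuilt lists can also be tailored. The following sketch is my own illustration (the extra tags and attributes are arbitrary choices, not part of the article): it starts from basic() and widens it slightly before cleaning.

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CustomWhitelistSketch {
    public static void main(String... args) {
        // Start from basic() and allow a few extras; the choices are illustrative only
        Whitelist custom = Whitelist.basic()
                .addTags("h1", "h2")              // permit simple headings
                .addAttributes("a", "target")     // permit target on links
                .preserveRelativeLinks(true);     // keep relative hrefs instead of dropping them

        String dirty = "<h1 onclick='hackSystem()'>Title</h1>" +
                "<a href='/docs' target='_blank'>Docs</a>";
        System.out.println(Jsoup.clean(dirty, "https://example.com/", custom));
    }
}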
This article, which previously appeared in Java Magazine but has been updated here, shows only a subset of what jsoup can do. The library also offers features such as tidying HTML, manipulating HTML tags' attributes or text, and more. Put another way, any HTML processing you might need to do is a likely candidate for jsoup.
Mert Çalişkan (@0hjc) is a Java Champion and coauthor of PrimeFaces Cookbook and Beginning Spring (Wiley Publications). He is the founder of AnkaraJUG, which is the most active Java user group in Turkey.