As Published In

Oracle Magazine
January/February 2004
Developer XML

XQuery Tricks and Traps

By Jason Hunter

In part 3 of the XQuery series, learn how to avoid common XQuery mistakes.

The previous two articles in my series on XQuery introduced you to this powerful language designed for querying collections of XML data. (See "X Is for Xquery" from the May/June 2003 issue of Oracle Magazine and "What's New in Xquery" from the November/December issue.)

To help you better understand how to correctly form XQuery queries, I focus in this article on important but tricky and commonly misunderstood aspects of the XQuery language. I warn you: Things will get hairy beyond this point. But it's better to have fair warning of the complex parts than to have to deduce it on your own when your query fails. This article is based on the May 2003 XQuery specification drafts.

Input Functions

The first step of writing a good query is selecting the right input. XQuery has three input functions to give a query access to the backing store data. (Backing store data refers to the XML document set against which XQuery can execute.) The first function, input(), returns the input sequence, usually a sequence of document nodes corresponding to all the documents available in the database. In previous articles, you saw this function used to print the URI for all stored documents:

for $n in input() return document-uri($n)

The input() function usually starts an XPath expression. So, for example, the following returns all book elements within any document in the backing store:

input()//book

The XQuery specification intentionally leaves vague the rules on what goes into the input sequence. This lets special environments return whatever makes sense for their situations. In a relational environment, it can be the document retrieved from a SQL query. In another environment, input() can return the direct results of a previous query, allowing a sort of query chaining.

There's one flaw to this approach. The input() function signature returns a sequence of nodes (nodes are XML constructs such as elements, documents, comments, and text), whereas a query can return a sequence of items (an item can be a node or an atomic value such as xs:integer or xs:string). Thus, input() can't fully represent the results of an arbitrary query.

The second input function, collection($uri as xs:string), returns a sequence of nodes corresponding to a given URI. Exactly how each URI resolves to a set of nodes depends on the server and its configuration. This function allows a server administrator to preselect a set of documents to be queried. A server administration tool can map documents into named collections. As with input(), the function most often returns document nodes, but the signature allows the function to return any type of nodes, so different environments have some flexibility. The following query returns the address elements from any mortgage deadbeat account:

collection("deadbeats")//address

The last input function, doc($uri as xs:string), returns the document node associated with the given URI name. As with collection(), the engine dictates the URI-to-node mapping. The following query fetches a <part> element from a specific document:

doc("inventory.xml")/parts/part[@id="54"]

Before using multiple doc() input functions to query over a common set of documents, make sure to check your server's configuration (or ask the server administrator) to see if a common collection has already been mapped to a URI for use with collection().

Sequences

The input() and collection() functions both return sequences. To make the most of sequences in XQuery, you have to understand a bit about how they work and their special oddities. Each sequence has an order (like a Java List and unlike a Java Set) and allows any number of items, including none, which is the empty sequence, represented by ().

The following expression represents a sequence with one atomic and one element node:

(1, <elt/>)

Sequences can have duplicate values:

(1, 2, 2, 3)

But sequences are never nested. Any nesting is automatically removed:

(10, (1, 2))

is equivalent to

(10, 1, 2)

(1 to 3, 1 to 2)

is equivalent to

(1, 2, 3, 1, 2)

Perhaps surprisingly, an item is the same as a sequence of length 1 containing that item. This lets you pass a single item to a function supporting a sequence:

1 = (1)

You can combine sequences of nodes by using the operators union, intersect, and except. The union operator, also written as |, joins two sequences; the intersect operator returns all nodes that occur in either sequence; and the except operator returns all nodes in the first sequence but not the second. Note that the union, intersect, and except operators work against sequences of nodes, not items. That means that you can't use these operators on atomics.

If you wanted to query against the union of two collections, you could do it like this:

(collection("a")|collection("b"))//elt

As another example, the following expression generates search results based on two user-defined search functions operating against a source $list. The query calculates the "and" of the two criteria, the "or" of the two criteria, and all matches of the first criteria that don't match the second criteria. It returns the empty sequence only to make the expression complete.

let $x := match-condition-one($list)
let $y := match-condition-two($list)
let $andResult := $x intersect $y
let $orResult := $x union $y
let $xorResult := 
   $orResult except $andResult
return ()

The union operator provides XQuery with one of its greatest advantages over SQL: the ability to deal with dissimilar inputs. Document formats can shift over time, changing structure in subtle or significant ways, depending on the source of the document and the time of the document's generation. Using the union operator, a single query can handle multiple input schemas:

input()/root/(title|titl|head/title)/text()

Sequences are considered equal if their intersection is nonempty. This produces some truly surprising results!

(1, 2, 3) = 1
("pink", "floyd") = ("pink", "posies")

Although the behavior appears counterintuitive, this style of comparison can appear quite intuitive in other contexts. For example, the test $book/[author = "Hunter"] should and does return true if any author of the book has the value Hunter.

When you want a full comparison between two sequences, you can use the deep-equal() function. This function, akin to the Java equals() method, determines if any two parameters are equivalent. Here's its signature:

deep-equal($param1 as item()*,
	$param2 as item()*) as xs:boolean

This function works on sequences or single items (because a single item is just a sequence of length 1). When comparing XML element nodes, the deep-equal() function compares their name, namespace, attributes, text, and child elements recursively. Comments and processing instructions are ignored. The following returns true:

deep-equal(
<elt a="b">
	<!-- first -->
	<child>c</child>
</elt>,
<elt a="b">
	<!-- second -->
	<child>c</note>
</elt>)

The deep-equal() call can take some time to execute for large element nodes. In some situations, it's better to use a shortcut function that compares nodes based just on node identity (that is, nodes that are the exact same node and not just equivalent, akin to the Java == operator). This function is sequence-node-identical(). The following returns true, because the exact same nodes are in each sequence in the same order:

sequence-node-identical(
	($elt1, $elt2), ($elt1, $elt2))

But the following returns false, because two different nodes are created:

sequence-node-identical(<elt1/>, <elt1/>)

The following query quickly determines if two collections have been identically defined:

sequence-node-identical(
	collection("a"), collection("b"))

Data and String Values

The data model of XQuery has been the source of much confusion. That's partly because it's complicated and partly because an effort to make queries simpler has made the number of special cases quite large. It sometimes seems that everything in XQuery having to do with data typing operates as if by magic. Here's an explanation of that magic.

Items in XQuery have both a typed value and a string value. A typed value is always an atomic value. Samples include the xs:integer 5, the xs:string "foo", the xs:double 500.1, and so on for all the standard XML Schema types. A string value is always a simple xs:string.

You can calculate an item's typed value by calling the data($seq) function. You rarely call this directly however; it's usually called implicitly by the XQuery engine when it is performing an operation that needs to treat a node like an atomic. For example, when you call $track/PlayCount * $track/TotalTime, you're technically multiplying nodes, but it works, because the engine implicitly uses their typed values.

An element's string value is usually its contained text recursively concatenated. The following element has the string value "It's true he sold 10 units." You can check this with the string($elt) function:

<elt>It's <stmt>true</stmt> he sold <count>10</count> units.</elt>

An element's data value depends on its XML Schema type. If the element <count> above had the schema type xs:integer, its data value would be the xs:integer 10. If the element instead had the schema type xs:double, the typed value would be the xs:double 10.0. If it had the schema type xs:string, the typed value would be the xs:string "10".

The Boolean element works similarly. If the <stmt> element above had the schema type xs:boolean, its data value would be the xs:boolean true().

What if no schema is in use? Then an element's data value is the same as its string value, but as an instance of the special type xdt:untypedAtomic. The "xdt" prefix indicates it's part of the XQuery datatypes, introduced by the XQuery specification. An xdt:untypedAtomic type can shift and act as both a string and numeric value, depending on the usage. So if <count> were untyped, $count = "10" would be true and so would $count = 10.

For an element, such as <elt>, that has complex content, there is no typed value, only a string value. Calling data($elt) generates an error.

Boolean Values

Every item and sequence has an "effective Boolean value." You see this value used frequently, especially in where clauses like this:

for $c in input()//item/count
where $c
return $c

The where clause here says where $c, and $c is bound to a <count> element. How does this work? The rules are that an item or a sequence has the effective Boolean value of false() if it's

  • The empty sequence
  • A Boolean false()
  • The empty string
  • A numeric 0
  • A double/float NaN

Otherwise, it's true(). You can determine the Boolean value of an item or sequence with the boolean($seq) function. Here are some true/false puzzlers (the answers appear below them):

  1. boolean(1)
  2. boolean((1))
  3. boolean(0)
  4. boolean((0))
  5. boolean((0, 0))
  6. boolean(())
  7. boolean(string(<foo/>))
  8. string(boolean(<foo/>))
  9. boolean(string(data(<foo/>)))
  10. boolean(string(data(<foo>x</foo>)))
  11. boolean(data(<foo>false</foo>))

Here are the answers:

  1. true(), because the value doesn't satisfy any of the false conditions.
  2. true() also, because it's a sequence of length 1 and thus identical to the value shown in puzzler No. 1.
  3. false(), because it matches a numeric 0.
  4. false() also, because it's identical to the value shown in puzzler No. 3.
  5. true(), because it doesn't satisfy any of the false conditions.
  6. false(), because it's an empty sequence.
  7. false(), because the string value is "" and that matches a false condition.
  8. true, because the boolean() call takes an element that doesn't satisfy any false condition and the string of true() is true.
  9. false(), because the data of <foo/> is "" as type xdt:untypedAtomic, which has the string value "", which has the Boolean value false().
  10. true(), because the data of <foo>x</ foo> is x as type xdt:untypedAtomic, which has the string value x, which has the Boolean value true().
  11. Depends. Without a schema, the typed value of <foo> is false as type xdt:untypedAtomic, which doesn't match any false criteria so is true().
    With a schema typing <foo> as xs:boolean, the typed value is false() with an effective Boolean value that is also false().

Now look back at the original <count> example and see if you can figure out how it would work.

for $c in input()//item/count
where $c
return $c

Answer: If <count> had the XML Schema type xs:integer, this would return every <count> whose content was nonzero. If it had the type xs:double or xs:float, it would return every <count> whose content was an actual number (as opposed to NaN, a special nonnumber case). In the absence of a specific schema type, it would return any <count> that had any text content at all. And if <count> had complex content (attributes or child elements), it would generate an error.
Test-First Learning

The BumbleBee XQuery test harness provides a convenient way to learn the tricky aspects of XQuery. Normally used for testing full production queries, a test harness also provides an environment for experimenting, exploring, and recording behaviors apart from the production environment. The test case file acts like a repository of knowledge, documenting how your queries execute (or should execute). Not only do the tests help you learn and predict behavior but, by running the tests against a new XQuery engine release, you can also check to see if any regressions were introduced that might cause problems with your production queries. You can find more about BumbleBee in the "Next Steps" section of this article.

In practical XQuery, it's best not to assume that everyone understands these rules. Be explicit, and use an expression in the where clause that directly returns a Boolean, rather than implying an effective Boolean value. Not being explicit can get you in trouble. For example, you may recall the following query from an earlier article:

define function star-count($movie-id) {
  let $review-doc := 
    doc("movie-reviews.xml")
  return avg($reviewdoc/reviews/review
    [movie-id = $movie-id]/stars)
}
let $movie-doc := document("movies.xml")
for $movie in $movie-doc/movies/movie
let $stars := star-count($movie/@id)
return
	<movie id="{$movie/@id}">
	  {$movie/title}
	  <stars>{
	  if ($stars) then $stars else "N/A"
	}</stars>
	</movie>

There's actually a bug here. Can you find it? The if clause was supposed to return N/A for any movie where the star-count() function returned the empty sequence() because no reviews were stored. However, it can also return N/A if the star-count() returns the xs:integer 0, because that qualifies as an effective Boolean value false(). In this case, it would be better to have written if (exists($stars)) then $stars else "N/A".

Sorting

The XQuery data model matters most when you are sorting with the order by FLWOR clause. Can you find the bug in the following query that's supposed to return books sorted by price?

for $b in doc("bib.xml")/bib/book
order by $b/price
return $b

The bug is that this performs a lexical sort, rather than the intended numeric sort. The prices will be sorted as 129.95, 39.95, and 65.95. In a lexical sort, anything starting with 1 comes before something starting with 3. This happens because ordering is based on a node's typed value.

Assuming that bib.xml doesn't have a schema to inform the engine that the <price> element is an xs:decimal, the <price> element will have a typed value that is the string value xdt:untypedAtomic.

You can make the sort work by specifying a schema or including a cast within the order by clause:

for $b in doc("bib.xml")/bib/book
order by xs:decimal($b/price)
return $b

Oddly enough, the following query works fine and returns a numeric average price:

avg(doc("bib.xml")/bib/book/price)

This succeeds because avg() has special rules for automatically promoting the xdt:untypedAtomic values to numerics.

Casting Functions

Casting functions are provided for every built-in XML Schema and XQuery datatype, and if you add a new user-defined schema type, an implicit casting function is created also. These come in handy when the implicit typing rules don't match what a query requires. Some examples:

xs:decimal("5.2") = 5.2
xs:string(xs:decimal("5.2")) = "5.2"
xs:boolean("false") = false()
xs:boolean("0") = false()
xs:float("314159e-5") = 3.14159

Expert XQuery programmers should know that boolean() and xs:boolean() are not the same thing. The boolean() function returns the effective Boolean value of an item. The xs:boolean() function casts the item to a Boolean. Casting rules are different from effective Boolean value rules. An xs:boolean() cast of a string looks at the string's value, so a "false" and a "0" evaluate to false(). In effective Boolean evaluation, the string contents don't matter.

This expression returns false(), because "false" cast to an xs:boolean is false().

xs:boolean(<a>false</a>)

This expression returns false() also if a schema defines <a> as xs:boolean.

boolean(<a>false</a>)

But this expression returns true() if the type of <a> has not been declared in a schema, because it has the typed value "false" as xdt:untypedAtomic with the effective Boolean value true().
Glossary of Terms

Atomic value. A number or string, generally.

Data value. An atomic value or sequence of atomic values obtained by atomization of an item or sequence of items. Returned by the data($seq) function. Some items and sequences cannot be atomized and may throw an error.

Effective Boolean value. An xs:boolean representation of an item or sequence of items. Returned by boolean($seq). False if it is an empty sequence, a Boolean false(), an empty string, a numeric 0, or a double/float NaN. True otherwise.

FLWOR. The core expression in XQuery with for, let, where, order by, and return clauses.

Item. A single node or an atomic value.

Node. Any component of an XML document. Node types include document, element, attribute, text, namespace, processing instruction, and comment.

Sequence. A collection of zero or more items. Sequences have order but are not nestable. A sequence of a single item is in all respects the same as the item alone.

String value. An xs:string representation of an item or sequence of items. Returned by the string($seq) function.

boolean(<a>false</a>) = true()

Finally, this expression errors out, because "x" can't be cast to an xs:boolean.

xs:boolean(<a>x</a>) errors out

My best advice is to avoid xs:boolean() casting and either always use or always avoid schemas, because the presence or absence of queries can greatly alter behavior.

Enhanced Sorting

Sometimes sorting in XQuery requires muscle beyond what the "order by" clause can provide. For example, assume you want to order items that don't have a quantifiable value for ordering but that can be compared to determine which is greater. Also assume you have a better() function that takes two items and, based on numerous internal criteria, determines the better item in a way that's hard to quantify but possible to execute. You then want to determine the best() item.

To do this, you use tail recursion, a common technique in expression languages for tasks such as this. You can see tail recursion in action with a factorial function definition:

define function 
factorial($n as xs:integer)
as xs:integer
{
	if ($n < 0) then 0
	else if ($n = 0) then 1
	else $n * factorial($n - 1)
}
factorial(5)
= 5*4*3*2*1 = 120

It's called tail recursion because the end of the function starts the recursion. This special type of recursion can be readily optimized into straight iteration, because the state of the stack within the function doesn't matter if the recursive call is the last call.

Using the better() function and tail recursion, you can write a best() function:

define function best($seq)
{
	if (empty($seq)) then ()
	else if (count($seq) = 1) then $seq
	else better($seq[1],
	  best($seq[position() > 1]))
}
best(($item1, $item2, $item3))

The best() function performs a conditional check on the value of the incoming $seq sequence. If it's empty, the function returns the empty sequence(). If it contains just one item, the function returns that item as the best. If it's a true sequence, it returns the better() of the first item versus the best() of all remaining items, triggering the tail recursion. This accomplishes the nonquantifiable best() determination in O(n) time.

One special thing to note about the last best() call is that it has double parentheses. Without the extra set, the function call would be passing three arguments, but the function takes only one: a single sequence. The extra set of parentheses wraps the items into a single sequence to be passed into the function.

Quantified Expressions

Many queries written with FLWORs could more easily be written with quantified expressions. A quantified expression lets you generate a Boolean based on a sequence's content. A "some" expression is true if any values match. An "every" expression is true if all values match. The expression looks like this:

some $var in $expr satisfies $predicate

For example, assume an event log that has <event> elements holding numerous <error> conditions:

<log>
	<event>
	  <error>
	    <desc>Disk corruption</desc>
     <resolved>false</resolved>
	  </error>
	  <error>
	   ...

The following query returns every <event> that has one or more unresolved conditions:

for $event in doc("log.xml")/log/event
where some $error in $event/error 
satisfies
	$error/resolved = "false"
return $event

As another example, assume you want to write a function to determine if a number is prime (not evenly divisible by any number other than 1 and itself). It's easy with a quantified expression. Let the expression perform an exhaustive search:

define function prime($i as xs:integer) as xs:boolean {
	$i = 2 or not( some $denom in (2 to $i - 	1)
	satisfies $i mod $denom = 0 )
}

<primes>
{
	for $i in (80 to 100)
	return
	  if (prime($i)) then 	  	  <prime>{$i}</prime>
	  else ()
}
</primes>

This produces the following output:

<primes>
	<prime>83</prime>
	<prime>89</prime>
	<prime>97</prime>
</primes>

The prime() function checks to see if any value $denom between 2 and $i-1 satisfies the expression that $i mod $denom is exactly 0. If that quantified expression returns false, you know that the value is prime and that the not() function causes the outer expression to return true. The initial "$i = 2" expression handles 2 as a special case that qualifies as prime even though it would confuse the second half of the test. A more advanced prime() implementation could search only odd divisors, stop searching at the square root of $i, and so on.
Next Steps

MORE Xquery
Author's XQuery resource site
xquery.com

BumbleBee XQuery test harness
xquery.com/bumblebee

XML Query working group
www.w3.org/XML/Query

XML Query Use Cases
www.w3.org/TR/xquery-use-cases/

Oracle's XML home
otn.oracle.com/tech/xml

To really understand quantified expressions, see if you can figure out what the following determines. Hint: The << operator returns true if the left operand precedes the right operand in document order.

for $proc 
in /book/section[title = "Procedure"]
where not (some $a in $proc//anesthesia
	   satisfies $a << ($proc//incision)[1])
return $proc

Here's how you figure out the answer. The FLWOR expression iterates over <section> elements that have the title "Procedure". The where clause then selects only those <section> elements in which a certain quantified expression is false. The quantified expression looks for any <anesthesia> element that precedes the first <incision> element within the <section>. If there is no preceding <anesthesia>, the quantified expression returns false, the not() function around it returns true, the where clause is satisfied, and the <section> is returned. So this query returns any overly painful procedures.

This example was borrowed from the use cases, which accompany the XQuery specification. For those who are interested in other good query examples, the use cases provide a great resource.

Conclusion

It seems that every XML-related technology has its own quirks, and XQuery is no exception. Getting started with XQuery turns out to be fairly easy, and writing productive queries comes quickly, but as your queries get more complicated and you start to write queries that go beyond simple transformations, it's likely you'll stumble on some of the issues or need some of the tricks outlined in this article. If not everything in this article makes sense to you, that's OK. Just keep it around. When you get stuck, look back and see if anything here helps break you free.

Jason Hunter ( jasonhunter@servlets.com) is a consultant, author of Java Servlet Programming and coauthor of Java Enterprise Best Practices (both from O'Reilly & Associates), and a publisher of XQuery.com.

E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy