Oracle Text 9.0.1
XML Features Overview
Garrett Kaminaga, Product Development, Database and Application Server Technologies, Oracle HQ
This is a technical overview of the XML feautes in Oracle Text (the
product formerly known as interMedia Text) version 9.0.1. This is intended
for an audience fairly familiar with previous versions of the product.
Table of Contents
- New XML Features
New XML Features
Path Indexing
XML Path Searching
Every new version of Oracle Text/interMedia Text/ConText Option has added new
structured document searching features, so we've built up a lot of
functionality -- 8.1.7 has nested section search, doctype disambiguation,
attribute value searching, automatic section indexing, and more. But with the
industry embracing XML, demand is growing for even more sophisticated features
which are beyond the capabilities of the current section architecture.
9.0.1 introduces a new section type and new query operators which support an
XPath-like query language. ConText indexes with XML path searching are able
to perform sophisticated section searches that were impossible in previous
versions.
The Path Section Group
In order to use XML path searching, the index must be created with the new
path section group:
exec ctx_ddl.create_section_group('mypathgroup','PATH_SECTION_GROUP');
create index myindex on mytable(doc)
indextype is ctxsys.context
parameters ('SECTION GROUP mypathgroup');
You can add only SPECIAL sections to a path section group; you
do not have to
add ZONE, FIELD, or ATTR sections, and
it does not support STOP sections in
this version. Most of the time, you won't even need to create your own -- just
use the default one:
create index myindex on mytable(doc)
indextype is ctxsys.context
parameters ('SECTION GROUP ctxsys.path_section_group');
The path section group is somewhat like the auto section group in that it
automatically indexes all tags and attributes it encounters. For query, you
can still use WITHIN, but the path section group also supports
the more
powerful INPATH and HASPATH operators.
Path Section Internals
Now we'll talk a little about how path sections are indexed, and how they
differ from zone and attribute sections. You can skip this section if you're
just interested in usage.
We'll illustrate the difference in index data with this simple doc:
<OUTER><INNER ATTR="attrvalue">text</INNER></OUTER>
The auto section group produces the following in $I:
| TOKEN_TEXT |
TOKEN_TYPE |
TOKEN_INFO |
OUTER |
2 (zone section) |
DOC 1 START 0 LENGTH 1 |
INNER |
2 (zone section) |
DOC 1 START 0 LENGTH 1 |
INNER@ATTR |
5 (attribute section) |
DOC 1 START 0 LENGTH 1 |
ATTRVALUE |
4 (attribute value) |
DOC 1 POS 1 |
TEXT |
0 (plain text) |
DOC 1 POS 1 |
A simple within query like:
select id from mytable where contains(doc, 'text WITHIN inner')>0
can be fulfilled by fetching the info for word TEXT and section
INNER, then
looping through each word offset of TEXT, and checking to see if
it is between
INNER START and START + LENGTH. A simple attribute
query like:
attrvalue WITHIN inner@attr
(we'll just write the text query for brevity) can be fulfilled in much the same
way, but using type 4 and 5 tokens instead of types 0 and 2.
This section type cannot support more complex queries. For instance, equal
sections like INNER and OUTER are a problem. The
query:
(text WITHIN outer) WITHIN inner
matches this document. The index data records the same offsets for
INNER and
OUTER, so it is impossible to tell if INNER is
inside OUTER or vice versa.
Another problem is attribute value sensitive section search. For document:
<SHIPMENT>
<MEDIA TYPE="dvd">Ice Blue Eyes</MEDIA>
<MEDIA TYPE="book">The Monopoly Companion</MEDIA>
</SHIPMENTS>
(We won't include the xml declaration and DOCTYPE stuff -- just
pretend that
they are there) If you want to find shipments that include the DVD "Ice Blue
Eyes", you'd have to include both section and attribute criteria:
find documents where "Ice Blue Eyes" occurs within a "MEDIA" section
whose "TYPE" attribute has the value "dvd"
Unfortunately, attribute values and sections (types 4 and 5) are completely
separate from normal text and sections (types 0 and 2). There is no way to
link an attribute section to the particular zone section occurrence in which
it occurs.
The path section group solves both problems by indexing the document like:
| TOKEN_TEXT |
TOKEN_TYPE |
TOKEN_INFO |
OUTER |
7 (path section) |
DOC 1 START 0 LENGTH 1 LEVEL 1 |
INNER |
7 (path section) |
DOC 1 START 0 LENGTH 1 LEVEL 2 |
INNER@ATTR |
8 (path attr section) |
DOC 1 START 0 LENGTH 1 OCC 1 |
ATTRVALUE |
4 (attribute value) |
DOC 1 POS 1 |
TEXT |
0 (plain text) |
DOC 1 POS 1 |
Zone sections and attribute sections (types 2 and 5) have been replaced with
path sections and path attribute sections (types 7 and 8). Each occurrence
of a path section has a LEVEL bit which indicates the nesting
level of the tag.
The root tag gets a level of 1, its children get a level of 2, their children
get a level of 3, etc. Including level information solves the equal section
problem, because we can now tell which tag is the outer tag.
Path attribute sections have a OCC part which links the
attribute section
occurrence to a path section occurrence. Here, INNER@ATTR has an
OCC of 1
because it occurs in the first occurrence of INNER. The ability
to correlate
attributes and tags solves the attribute value sensitive section search.
New XML Features
Path Query
Querying XML Documents
The query interface is through SQL selects, so your XML queries return entire
documents, not just selected parts of them. While we work on better extraction
features for future release, you can explore using the new
XMLType, which has
extraction methods. Just use extraction methods in the select list and
contains in the where clause.
The INPATH Operator
The way path sections are indexed enables more complicated section searches,
but the WITHIN operator is not expressive enough to handle them. Instead of
<text query> WITHIN <section name> (e.g. Ice WITHIN MEDIA )
indexes with a path section group use the INPATH operator in queries:
<text query> INPATH(<path expression>) (e.g. Ice INPATH(//MEDIA) )
but it functions in much the same way, limiting the scope of the text query to
certain parts of the indexed documents. The parentheses around the path
expression are required. The path expression is more than just a simple
section name -- it is a mini query, with a specialized query language. The
next section explores the path query language in more detail.
You can still use the WITHIN operator even if you are using the
path section
group. There should be no difference in behavior between the path section
group or auto section group when using WITHIN queries.
Path Query Language
The Text path query language is based on XPath, and we will probably continue
to use XPath as a guide for future development, but it is NOT XPath. Not all
the XPath operators exist in the Text path query language, for instance. Also,
the Text path query language operators are case-insensitive, while XPath's are
strictly lower-case. There are other semantic differences covered below. Just
don't make assumptions about the path language based on XPath expectations.
Tags
When specifying tags in path queries, you must specify it exactly as it
appears in the document in order for it to match. There are two commonly-made
mistakes you should avoid.
First, tag names are case-sensitive so the query "title" does
not match the
tag <TITLE> or the tag <Title>. It
will match only <title>.
Second, there is no namespace support in this version. Take the fragments:
DOC 1 <A xmlns:ORCL="/index.html"><ORCL:B>
DOC 2 <A xmlns:ORACLE="/index.html"><ORACLE:B>
<ORCL:B> in DOC 1 is the same tag as
<ORACLE:B> in DOC 2, because their
namespace tags normalize to the same URI. However, when querying for these
tags, you must specify it as written in the document, so "ORCL:B"
to find the
tag in doc 1, and "ORACLE:B" to find it in doc 2.
"B" alone will not find
either tag, nor will something like "http://www.oracle.com/:B".
Future
versions will probably add more sophisticated namespace support.
Top-Level Tag
The simplest INPATH query string is a single tag:
perro INPATH(TITLE)
Like a WITHIN query, this query finds perro where it occurs between
<TITLE> and
</TITLE>. However, unlike a WITHIN query,
<TITLE> must be the top-level tag. Take
these two documents:
DOC 1 <TITLE>Clifford El Gran Perro Colorado</TITLE>
DOC 2 <BOOK><TITLE>Años De Perro</TITLE></BOOK>
The query
perro WITHIN TITLE
will find both documents, but the INPATH query will find only
document 1. It
does not match document 2 because there the TITLE tag has a
level of 2.
What's really happening is that no level for the query node is specified, so
it uses the default context, which is always the top level for
INPATH queries.
You can explicitly specify the top level context with slash:
perro INPATH(/TITLE)
or explicitly specify the default context using dot:
perro INPATH(./TITLE)
both are equivalent to the query without the slash. All examples from here
will include the top level slash for readability.
Any-Level Tag
A double slash indicates "any number of levels down". So, the query:
perro INPATH(//TITLE)
is looking for perro inside a TITLE tag that occurs
at the top level or any
level below. In other words, this query is equivalent to:
perro WITHIN TITLE
and finds both DOC 1 and DOC 2.
Ancestor/Descendant Searching
A child tag is a tag which is enclosed within another tag. For instance, in:
DOC 2 <BOOK><TITLE>Años De Perro</TITLE></BOOK>
TITLE is a child of BOOK. We can find this
document using the any-level
tag searching, as in the previous section. But what if the corpus also
contained:
DOC 3 <MOVIE><TITLE>Mi vida como un perro</TITLE></MOVIE>
In order to find only books with perro in the title, we need to
limit the
search to title tags whose parent is a book tag:
perro INPATH(/BOOK/TITLE)
Reading the path right-to-left, we are looking for a top-level
BOOK tag with
a child TITLE tag, which matches only DOC 2.
The single slash is direct parentage. The query above will not find:
DOC 4 <BOOK><DESCRIPTION>
<TITLE>Años De Perro</TITLE>
</DESCRIPTION></BOOK>
Because here TITLE is not a direct child of BOOK.
TITLE's direct parent is
DESCRIPTION, whose parent is BOOK --
TITLE is a grand-child of BOOK.
To find this doc, you can use the any-level slashes:
perro INPATH(/BOOK//TITLE)
Reading the path right-to-left, we are looking for a top-level
BOOK tag with
some descendant TITLE tag. This query will match both
DOC 3 and DOC 4.
Note that this is not the same as:
((perro WITHIN TITLE) WITHIN BOOK)
First, the INPATH query restricts BOOK to the
top-level. Second, equal
sections are not confused. That is, the query:
((perro WITHIN BOOK) WITHIN TITLE)
would match DOC 4, but the query:
perro INPATH(/TITLE//BOOK)
would not. Path sections know that TITLE is a child of
BOOK, even though
they occur at the same text offsets.
Finally, if you wanted to match only DOC 4 and not
DOC 3 -- that is, you want
to match TITLE only if it is a grandchild of BOOK,
and not a child or great
grandchild, etc. -- you can use the single level wildcard:
perro INPATH(/BOOK/*/TITLE)
The * matches exactly one level, so this path query filters out
DOC 3.
You can combine these ancestor/descendant elements for even more complicated
queries:
felis INPATH(//kindgom/*/*/order/family//genus)
Attributes
You can search within an attribute value using the syntax
<tag>/@<attribute>:
perro INPATH(//MOVIE/@SPANISHTITLE)
matches:
DOC 5 <MOVIE SPANISHTITLE="Mi vida como un perro">My Life As A Dog</MOVIE>
and is equivalent to the query:
perro WITHIN MOVIE@SPANISHTITLE
One limitation resulting from how attributes are indexed is that all attributes
must specify their direct-parent tags. The following:
perro INPATH(//@TITLE)
perro INPATH(A/*/@TITLE)
are not allowed, because the tag for the title attribute is not specified:
select * from doc where contains(text, 'perro INPATH(//@TITLE)')>0;
*
ERROR at line 1:
ORA-20000: Oracle Text error:
DRG-50951: Unable to resolve element name for attribute TITLE
Existence Test
The square brackets are used to impose a condition on a node without changing
the path context. For instance, the query:
monopoly INPATH(/auction[image])
is looking for monopoly inside a top-level auction
tag which has an image
tag as a direct child. The search for monopoly occurs within the
entirety
of <auction> and </auction>, and not
just within <image> and </image>. This
document will match:
<auction>Sailing Monopoly <image src="...">pic</image></auction>
but will not match:
<auction>Sailing Monopoly</auction>
because there is no image element. The default context inside a
test element
is the tag to which it is applied, so
monopoly INPATH(/auction[image])
is actually the same as:
monopoly INPATH(/auction[./image])
You need the dot to reference the default context. Without the dot:
monopoly INPATH(/auction[/image])
it would mean top-level image tag. This is not supported, and will result in
a syntax error.
The existence test for image will match only if image exists
and is a
direct child. It does not match:
<auction>Sailing Monopoly<desc><image src="...">pic</image></desc></auction>
because here image is not a direct child of auction.
You can match this
document using the any-level wildcard, instead:
monopoly INPATH(/auction[.//image])
You can also test for attribute existence:
monopoly INPATH(/auction[@reserve])
The test node can be combined with other operators for interesting searches:
monopoly INPATH(/auction[.//image]/title)
The test node does not change context, so the /title applies to
/auction
rather than /auction//image -- this query finds auctions where
monopoly
occurs inside a direct-child title tag, but only if the auction
has an
image tag in it somewhere. For instance, the doc:
<auction>
<title>Sailing Monopoly</title>
<description>
New Sailing Monopoly with custom pewter tokens from USAOpoly
<image src="...">here is a picture</image>
</description>
</auction>
To test for non-existence, use the NOT operator:
monopoly INPATH(/auction[not(taxinfo)])
this query looks for monopoly within an auction element that does not have
a direct child taxinfo. The NOT operator is
case-insensitive in our path
query language. In XPath it only works in lowercase.
Attribute and Tag Value Equality
The test operator is capable of more than simple existence testing. More
useful is attribute value testing, which contrains nodes by the value of their
attributes. For instance, given a document like:
<MOVIE>
<TITLE LANGUAGE="German">Tiger und Dragon</TITLE>
<TITLE LANGUAGE="French">Tigre et Dragon</TITLE>
<TITLE LANGUAGE="Spanish">Tigre y Dragón</TITLE>
<TITLE LANGUAGE="Mandarin">Wo hu cang long</TITLE>
<TITLE LANGUAGE="English">Crouching Tiger, Hidden Dragon</TITLE>
</MOVIE>
the query:
dragon INPATH(//TITLE)
will search all language titles. To limit the search to just English titles,
you can add an attribute value equality test:
dragon INPATH(//TITLE[@LANGUAGE = "English"])
Only equality and inequality (using !=) are supported. Range
searches are
not supported in this version. The left-hand side must be an attribute or tag,
while the right-hand side must be a literal. The query:
gato INPATH(//A[@B = @C])
is not allowed, nor is something like
gato INPATH(//A["dog" = "cat"]
Only string literals are allowed. Numeric literals, such as
tora INPATH(//MOVIE[@PRICE = 5])
will raise a syntax error. This means that numbers are not normalized. The
query above will not match:
<MOVIE PRICE="5.0">Tora! Tora! Tora!</MOVIE>
because the string 5 is not equal to the string 5.0,
although numerically
they are equal.
The equality test is not strict equality -- it uses "contains-equality". Two
text fragments are contains-equal if the lexer produces identical index info.
Some of the significant ways that this deviates from strict equality are:
- contains equality is probably case insensitive
Unless the lexer has MIXED_CASE on, it would consider
foo and FOO to
be equal strings.
- lexer transformation can make strings equal
The lexer can be set to normalize word forms. Base letter, for instance,
converts accented characters to non-accented forms. A word with u-umlaut
(Müller) is contains-equal to the same word with an unaccented u
(Muller) in this case.
- lexer ignores whitespace and non-alphabetic, non-join characters
The lexer deals in word offsets. In a phrase like
WORD1 WORD2, the word
offset of WORD2 is always 1 greater than the word offset of
WORD1 -- it
doesn't matter how many spaces or newlines there are between them. Also,
any non-alphabetic, non-join character is converted to whitespace (and
subsequently ignored). This can confuse names, with
Chase Matthew being
contains-equal to Chase, Matthew, or phrases, with
fruit-plants being
contains-equal to fruit, plants.
- stopwords in the query string are considered any word wildcard
Any stopword appearing in the query text fragment will match any single
word in the document's text fragment. This means that the query
Paris in the spring would be contains-equal to the document
Paris: avoid during spring.
The rules for contains equality seem complex, but it works the same as regular
text queries hitting document text -- you've probably internalized these rules
already. One significant difference between equality and contains, though, is
that the equality test always makes sure that the number of words in the
attribute value is the same as the number of words in the query string.
dragon INPATH(//TITLE[@LANGUAGE = "French"])
does not match any of these fragments:
<TITLE LANGUAGE="Canadian French">dragon</TITLE>
<TITLE LANGUAGE="French Colloquial">dragon</TITLE>
<TITLE LANGUAGE="Medieval French Colloquial">dragon</TITLE>
Although each LANGUAGE attribute value has the word
French, there are extra
words. These would match a contains in the attribute value, but they do not
meet the "same number of words" equality criteria.
While docu-head people use a lot of attributes in their DTD's, data-heads
prefer child tags. For instance, a docu-head might write:
<MOVIE YEAR="2001" TITLE="Moulin Rogue">...
While a data-head would prefer:
<MOVIE>
<YEAR>2001</YEAR>
<TITLE>Moulin Rogue</TITLE>
...
To match the data-head version, you can use equality testing on tag values:
moulin INPATH(//MOVIE[YEAR = "2001"])
Tag value equality uses contains-equality just like attribute value testing.
Inequality is also supported in both attribute and tag value equality, using
the != operator:
moulin INPATH(//MOVIE[@YEAR != "2000"])
moulin INPATH(//MOVIE[YEAR != "2000"])
Note that inequality implies existence. The queries above do not match
<MOVIE>Moulin Rouge</MOVIE>
Because the MOVIE tag does not have a YEAR
attribute or YEAR child element. To
test for non-existence, use the NOT operator.
AND and OR Operators
You can use boolean AND and OR to combine
existence or equality predicates in
a test. Say you have documents like:
<MOVIE>
<TITLE>Big Trouble in Little China</TITLE>
<ACTORS>
<ACTOR>Kurt Russell</ACTOR>
<ACTOR>Kim Cattrall</ACTOR>
</ACTORS>
<DVD>2 DISCS</DVD>
</MOVIE>
and you want to find movies with china in the title starring
Kurt Russell
and Kim Cattrall that are available on DVD:
china INPATH(/MOVIE[DVD and
.//ACTOR = "Kurt Russell" and .//ACTOR = "Kim Cattrall"]/TITLE)
You can use parentheses for precedence:
blue INPATH(/video[DVD and (discount or @rating = "4")])
AND and OR are case-insensitive in our path query
language. In XPath they
must be lowercase.
Nested INPATH
Nested INPATH operators are allowed, but the two are independent -- the
default context of an INPATH is always top level. For instance:
(perro INPATH(A)) INPATH(B)
will never hit any documents, because both INPATH's are looking
for top-level
tags, and, except for invalid documents, a document cannot have two different
top-level tags.
The HASPATH Operator
The HASPATH operator is not a path query language operator;
it's a ConText
query language operator like INPATH. INPATH is used
when you want to search
for a text query within a path. HASPATH is used when all you want
to do is
test for path existence; it takes a path as its only argument, and returns 100
for a document if the path exists, 0 otherwise.
select id from documents
where contains(doc, 'HASPATH(/movie/dvd)')>0;
will return all documents where the top-level tag is a movie
element which
has a dvd element as a direct child.
HASPATH can also do tag value equality tests:
HASPATH(//A = "dog")
Attribute value equality tests and AND and
OR operators are not currently
supported. You can use the ConText query language AND and
OR, with multiple
HASPATHs to achieve the same effect. Instead of:
HASPATH(A and B)
write:
HASPATH(A) and HASPATH(B)
HASPATH can return false hits when there are empty sections.
Path sections
are recorded with level information, but not true parentage. As a result, a
document like:
<A>
<B>
<C></C>
</B>
<D>
<E></E>
</D>
</A>
is matched by the query:
HASPATH(//B/E)
Since we do not have real parent information, we cannot detect that
E is not
the child of B. The index tells us only that E and
B surround the same
word offsets, and that E is a third-level tag and B
is a second-level tag.
Normally this indicates that E is a child of B.
In this boundary case it
does not. This limitation only applies to empty sections like this -- any
words in the document would ensure correct behavior.
New XML Features
Highlighting
XMLType Indexing
Highlighting
Highlighting with the INPATH and HASPATH operators
is not supported in this
version. You can still highlight and markup regular words, and
WITHIN queries,
but use of the path operators will result in an error message. We are working
on support for a future release.
XMLType Indexing
Oracle 9i introduces a new datatype for storing XML -- the XMLType. This is
a core database feature, and you can find out about the type and its usage in
the XML features manual.
You can create a ConText index on this type, but you need a few database
privileges first:
1. the user creating the index must have query rewrite:
grant query rewrite to <user>
Without this privilege, the create index will fail with:
ORA-01031: insufficient privileges
<user> should be the user creating the index. The
schema that owns the
index (if different) does not need the grant.
2. query_rewrite_enabled should be true, and
query_rewrite_integrity should
be trusted. You can add them to the init.ora:
query_rewrite_enabled=true
query_rewrite_integrity=trusted
or turn it on for the session:
alter session set query_rewrite_enabled=true;
alter session set query_rewrite_integrity=trusted;
Without these, queries will fail with:
DRG-10599: column is not indexed
These privileges are needed because under the covers a ConText index on an
XMLType column is actually a function-based index on the
getclobval() method
of the type. These are the standard grants you need to use function-based
indexes, as covered in the general Oracle documentation. However, unlike
function-based b-tree indexes, you do not need to calculate statistics.
When an XMLType column is detected, and no section group is specified in the
parameters string, the default system examines the new system parameter
DEFAULT_XML_SECTION, and uses the section group specified there.
At install
time this system parameter is set to CTXSYS.PATH_SECTION_GROUP,
which is the
default path sectioner. The default filter system parameter for
XMLType is
DEFAULT_FILTER_TEXT, which probably means that the INSO filter
is not engaged
by default.
Other than the database privileges and the special default section group
system parameter, indexes on XMLType columns work like any other
ConText index.
Here is a simple example:
connect ctxsys/ctxsys
grant query rewrite to xtest;
connect xtest/xtest
create table xtest(doc sys.xmltype);
insert into xtest values (sys.xmltype.createxml('<A>simple</A>'));
create index xtestx on xtest(doc)
indextype is ctxsys.context;
alter session set query_rewrite_enabled = true;
alter session set query_rewrite_integrity = trusted;
select a.doc.getclobval() from xtest a
where contains(doc, 'simple INPATH(A)')>0;
|