This is a technical overview of the XML feautes in Oracle Text (the product formerly known as interMedia Text) version 9.0.1. This is intended for an audience fairly familiar with previous versions of the product.
Every new version of Oracle Text/interMedia Text/ConText Option has added new structured document searching features, so we've built up a lot of functionality -- 8.1.7 has nested section search, doctype disambiguation, attribute value searching, automatic section indexing, and more. But with the industry embracing XML, demand is growing for even more sophisticated features which are beyond the capabilities of the current section architecture.
9.0.1 introduces a new section type and new query operators which support an XPath-like query language. ConText indexes with XML path searching are able to perform sophisticated section searches that were impossible in previous versions.
In order to use XML path searching, the index must be created with the new path section group:
exec ctx_ddl.create_section_group('mypathgroup','PATH_SECTION_GROUP');
create index myindex on mytable(doc)
indextype is ctxsys.context
parameters ('SECTION GROUP mypathgroup');
You can add only
SPECIAL sections to a path section group; you do not have to add
ZONE,
FIELD, or
ATTR sections, and it does not support
STOP sections in this version. Most of the time, you won't even need to create your own -- just use the default one:
create index myindex on mytable(doc)
indextype is ctxsys.context
parameters ('SECTION GROUP ctxsys.path_section_group');
The path section group is somewhat like the auto section group in that it automatically indexes all tags and attributes it encounters. For query, you can still use
WITHIN, but the path section group also supports the more powerful
INPATH and
HASPATH operators.
Now we'll talk a little about how path sections are indexed, and how they differ from zone and attribute sections. You can skip this section if you're just interested in usage.
We'll illustrate the difference in index data with this simple doc:
<OUTER><INNER ATTR="attrvalue">text</INNER></OUTER>
The auto section group produces the following in $I:
| TOKEN_TEXT | TOKEN_TYPE | TOKEN_INFO |
|---|---|---|
OUTER
|
2 (zone section)
|
DOC 1 START 0 LENGTH 1
|
INNER
|
2 (zone section)
|
DOC 1 START 0 LENGTH 1
|
INNER@ATTR
|
5 (attribute section)
|
DOC 1 START 0 LENGTH 1
|
ATTRVALUE
|
4 (attribute value)
|
DOC 1 POS 1
|
TEXT
|
0 (plain text)
|
DOC 1 POS 1
|
A simple within query like:
select id from mytable where contains(doc, 'text WITHIN inner')>0
can be fulfilled by fetching the info for word
TEXT and section
INNER, then looping through each word offset of
TEXT, and checking to see if it is between
INNER START and
START + LENGTH. A simple attribute query like:
attrvalue WITHIN inner@attr
(we'll just write the text query for brevity) can be fulfilled in much the same way, but using type 4 and 5 tokens instead of types 0 and 2.
This section type cannot support more complex queries. For instance, equal sections like
INNER and
OUTER are a problem. The query:
(text WITHIN outer) WITHIN inner
matches this document. The index data records the same offsets for
INNER and
OUTER, so it is impossible to tell if
INNER is inside
OUTER or vice versa. Another problem is attribute value sensitive section search. For document:
<SHIPMENT>
<MEDIA TYPE="dvd">Ice Blue Eyes</MEDIA>
<MEDIA TYPE="book">The Monopoly Companion</MEDIA>
</SHIPMENTS>
(We won't include the xml declaration and
DOCTYPE stuff -- just pretend that they are there) If you want to find shipments that include the DVD "Ice Blue Eyes", you'd have to include both section and attribute criteria:
find documents where "Ice Blue Eyes" occurs within a "MEDIA" section
whose "TYPE" attribute has the value "dvd"
Unfortunately, attribute values and sections (types 4 and 5) are completely separate from normal text and sections (types 0 and 2). There is no way to link an attribute section to the particular zone section occurrence in which it occurs.
The path section group solves both problems by indexing the document like:
| TOKEN_TEXT | TOKEN_TYPE | TOKEN_INFO |
|---|---|---|
OUTER
|
7 (path section)
|
DOC 1 START 0 LENGTH 1 LEVEL 1
|
INNER
|
7 (path section)
|
DOC 1 START 0 LENGTH 1 LEVEL 2
|
INNER@ATTR
|
8 (path attr section)
|
DOC 1 START 0 LENGTH 1 OCC 1
|
ATTRVALUE
|
4 (attribute value)
|
DOC 1 POS 1
|
TEXT
|
0 (plain text)
|
DOC 1 POS 1
|
Zone sections and attribute sections (types 2 and 5) have been replaced with path sections and path attribute sections (types 7 and 8). Each occurrence of a path section has a
LEVEL bit which indicates the nesting level of the tag. The root tag gets a level of 1, its children get a level of 2, their children get a level of 3, etc. Including level information solves the equal section problem, because we can now tell which tag is the outer tag.
Path attribute sections have a
OCC part which links the attribute section occurrence to a path section occurrence. Here,
INNER@ATTR has an
OCC of 1 because it occurs in the first occurrence of
INNER. The ability to correlate attributes and tags solves the attribute value sensitive section search.
The query interface is through SQL selects, so your XML queries return entire documents, not just selected parts of them. While we work on better extraction features for future release, you can explore using the new
XMLType, which has extraction methods. Just use extraction methods in the select list and contains in the where clause.
The way path sections are indexed enables more complicated section searches, but the
WITHIN operator is not expressive enough to handle them. Instead of
<text query> WITHIN <section name> (e.g. Ice WITHIN MEDIA )
indexes with a path section group use the INPATH operator in queries:
<text query> INPATH(<path expression>) (e.g. Ice INPATH(//MEDIA) )
but it functions in much the same way, limiting the scope of the text query to certain parts of the indexed documents. The parentheses around the path expression are required. The path expression is more than just a simple section name -- it is a mini query, with a specialized query language. The next section explores the path query language in more detail.
You can still use the
WITHIN operator even if you are using the path section group. There should be no difference in behavior between the path section group or auto section group when using
WITHIN queries.
The Text path query language is based on XPath, and we will probably continue to use XPath as a guide for future development, but it is NOT XPath. Not all the XPath operators exist in the Text path query language, for instance. Also, the Text path query language operators are case-insensitive, while XPath's are strictly lower-case. There are other semantic differences covered below. Just don't make assumptions about the path language based on XPath expectations.
When specifying tags in path queries, you must specify it exactly as it appears in the document in order for it to match. There are two commonly-made mistakes you should avoid.
First, tag names are case-sensitive so the query
"title" does not match the tag
<TITLE> or the tag
<Title>. It will match only
<title>.
Second, there is no namespace support in this version. Take the fragments:
DOC 1 <A xmlns:ORCL="/index.html"><ORCL:B> DOC 2 <A xmlns:ORACLE="/index.html"><ORACLE:B>
<ORCL:B> in
DOC 1 is the same tag as
<ORACLE:B> in
DOC 2, because their namespace tags normalize to the same URI. However, when querying for these tags, you must specify it as written in the document, so
"ORCL:B" to find the tag in doc 1, and
"ORACLE:B" to find it in doc 2.
"B" alone will not find either tag, nor will something like
"http://www.oracle.com/:B". Future versions will probably add more sophisticated namespace support.
The simplest
INPATH query string is a single tag:
perro INPATH(TITLE)
Like a
WITHIN query, this query finds
perro where it occurs between
<TITLE> and
</TITLE>. However, unlike a
WITHIN query,
<TITLE> must be the top-level tag. Take these two documents:
DOC 1 <TITLE>Clifford El Gran Perro Colorado</TITLE> DOC 2 <BOOK><TITLE>Años De Perro</TITLE></BOOK>
The query
perro WITHIN TITLE
will find both documents, but the
INPATH query will find only document 1. It does not match document 2 because there the
TITLE tag has a level of 2.
What's really happening is that no level for the query node is specified, so it uses the default context, which is always the top level for
INPATH queries. You can explicitly specify the top level context with slash:
perro INPATH(/TITLE)
or explicitly specify the default context using dot:
perro INPATH(./TITLE)
both are equivalent to the query without the slash. All examples from here will include the top level slash for readability.
A double slash indicates "any number of levels down". So, the query:
perro INPATH(//TITLE)
is looking for
perro inside a
TITLE tag that occurs at the top level or any level below. In other words, this query is equivalent to:
perro WITHIN TITLE
and finds both
DOC 1 and
DOC 2.
A child tag is a tag which is enclosed within another tag. For instance, in:
DOC 2 <BOOK><TITLE>Años De Perro</TITLE></BOOK>
TITLE is a child of
BOOK. We can find this document using the any-level tag searching, as in the previous section. But what if the corpus also contained:
DOC 3 <MOVIE><TITLE>Mi vida como un perro</TITLE></MOVIE>
In order to find only books with
perro in the title, we need to limit the search to
title tags whose parent is a
book tag:
perro INPATH(/BOOK/TITLE)
Reading the path right-to-left, we are looking for a top-level
BOOK tag with a child
TITLE tag, which matches only
DOC 2.
The single slash is direct parentage. The query above will not find:
DOC 4 <BOOK><DESCRIPTION>
<TITLE>Años De Perro</TITLE>
</DESCRIPTION></BOOK>
Because here
TITLE is not a direct child of
BOOK.
TITLE's direct parent is
DESCRIPTION, whose parent is
BOOK --
TITLE is a grand-child of
BOOK. To find this doc, you can use the any-level slashes:
perro INPATH(/BOOK//TITLE)
Reading the path right-to-left, we are looking for a top-level
BOOK tag with some descendant
TITLE tag. This query will match both
DOC 3 and
DOC 4. Note that this is not the same as:
((perro WITHIN TITLE) WITHIN BOOK)
First, the
INPATH query restricts
BOOK to the top-level. Second, equal sections are not confused. That is, the query:
((perro WITHIN BOOK) WITHIN TITLE)
would match
DOC 4, but the query:
perro INPATH(/TITLE//BOOK)
would not. Path sections know that
TITLE is a child of
BOOK, even though they occur at the same text offsets.
Finally, if you wanted to match only
DOC 4 and not
DOC 3 -- that is, you want to match
TITLE only if it is a grandchild of
BOOK, and not a child or great grandchild, etc. -- you can use the single level wildcard:
perro INPATH(/BOOK/*/TITLE)
The * matches exactly one level, so this path query filters out
DOC 3.
You can combine these ancestor/descendant elements for even more complicated queries:
felis INPATH(//kindgom/*/*/order/family//genus)
You can search within an attribute value using the syntax
<tag>/@<attribute>:
perro INPATH(//MOVIE/@SPANISHTITLE)
matches:
DOC 5 <MOVIE SPANISHTITLE="Mi vida como un perro">My Life As A Dog</MOVIE>
and is equivalent to the query:
perro WITHIN MOVIE@SPANISHTITLE
One limitation resulting from how attributes are indexed is that all attributes must specify their direct-parent tags. The following:
perro INPATH(//@TITLE) perro INPATH(A/*/@TITLE)
are not allowed, because the tag for the title attribute is not specified:
select * from doc where contains(text, 'perro INPATH(//@TITLE)')>0;
*
ERROR at line 1:
ORA-20000: Oracle Text error:
DRG-50951: Unable to resolve element name for attribute TITLE
The square brackets are used to impose a condition on a node without changing the path context. For instance, the query:
monopoly INPATH(/auction[image])
is looking for
monopoly inside a top-level
auction tag which has an
image tag as a direct child. The search for
monopoly occurs within the entirety of
<auction> and
</auction>, and not just within
<image> and
</image>. This document will match:
<auction>Sailing Monopoly <image src="...">pic</image></auction>
but will not match:
<auction>Sailing Monopoly</auction>
because there is no
image element. The default context inside a test element is the tag to which it is applied, so
monopoly INPATH(/auction[image])
is actually the same as:
monopoly INPATH(/auction[./image])
You need the dot to reference the default context. Without the dot:
monopoly INPATH(/auction[/image])
it would mean top-level image tag. This is not supported, and will result in a syntax error.
The existence test for image will match only if
image exists and is a direct child. It does not match:
<auction>Sailing Monopoly<desc><image src="...">pic</image></desc></auction>
because here
image is not a direct child of
auction. You can match this document using the any-level wildcard, instead:
monopoly INPATH(/auction[.//image])You can also test for attribute existence:
monopoly INPATH(/auction[@reserve])
The test node can be combined with other operators for interesting searches:
monopoly INPATH(/auction[.//image]/title)
The test node does not change context, so the
/title applies to
/auction rather than
/auction//image -- this query finds auctions where
monopoly occurs inside a direct-child
title tag, but only if the auction has an
image tag in it somewhere. For instance, the doc:
<auction>
<title>Sailing Monopoly</title>
<description>
New Sailing Monopoly with custom pewter tokens from USAOpoly
<image src="...">here is a picture</image>
</description>
</auction>
To test for non-existence, use the NOT operator:
monopoly INPATH(/auction[not(taxinfo)])
this query looks for
monopoly within an
auction element that does not have a direct child
taxinfo. The
NOT operator is case-insensitive in our path query language. In XPath it only works in lowercase.
The test operator is capable of more than simple existence testing. More useful is attribute value testing, which contrains nodes by the value of their attributes. For instance, given a document like:
<MOVIE>
<TITLE LANGUAGE="German">Tiger und Dragon</TITLE>
<TITLE LANGUAGE="French">Tigre et Dragon</TITLE>
<TITLE LANGUAGE="Spanish">Tigre y Dragón</TITLE>
<TITLE LANGUAGE="Mandarin">Wo hu cang long</TITLE>
<TITLE LANGUAGE="English">Crouching Tiger, Hidden Dragon</TITLE>
</MOVIE>
the query:
dragon INPATH(//TITLE)
will search all language titles. To limit the search to just English titles, you can add an attribute value equality test:
dragon INPATH(//TITLE[@LANGUAGE = "English"])
Only equality and inequality (using
!=) are supported. Range searches are not supported in this version. The left-hand side must be an attribute or tag, while the right-hand side must be a literal. The query:
gato INPATH(//A[@B = @C])
is not allowed, nor is something like
gato INPATH(//A["dog" = "cat"]
Only string literals are allowed. Numeric literals, such as
tora INPATH(//MOVIE[@PRICE = 5])
will raise a syntax error. This means that numbers are not normalized. The query above will not match:
<MOVIE PRICE="5.0">Tora! Tora! Tora!</MOVIE>
because the string
5 is not equal to the string
5.0, although numerically they are equal.
The equality test is not strict equality -- it uses "contains-equality". Two text fragments are contains-equal if the lexer produces identical index info. Some of the significant ways that this deviates from strict equality are:
MIXED_CASE on, it would consider
foo and
FOO to be equal strings.
WORD1 WORD2, the word offset of
WORD2 is always 1 greater than the word offset of
WORD1 -- it doesn't matter how many spaces or newlines there are between them. Also, any non-alphabetic, non-join character is converted to whitespace (and subsequently ignored). This can confuse names, with
Chase Matthew being contains-equal to
Chase, Matthew, or phrases, with
fruit-plants being contains-equal to
fruit, plants.
Paris in the spring would be contains-equal to the document
Paris: avoid during spring.
The rules for contains equality seem complex, but it works the same as regular text queries hitting document text -- you've probably internalized these rules already. One significant difference between equality and contains, though, is that the equality test always makes sure that the number of words in the attribute value is the same as the number of words in the query string.
dragon INPATH(//TITLE[@LANGUAGE = "French"])
does not match any of these fragments:
<TITLE LANGUAGE="Canadian French">dragon</TITLE> <TITLE LANGUAGE="French Colloquial">dragon</TITLE> <TITLE LANGUAGE="Medieval French Colloquial">dragon</TITLE>
Although each
LANGUAGE attribute value has the word
French, there are extra words. These would match a contains in the attribute value, but they do not meet the "same number of words" equality criteria.
While docu-head people use a lot of attributes in their DTD's, data-heads prefer child tags. For instance, a docu-head might write:
<MOVIE YEAR="2001" TITLE="Moulin Rogue">...
While a data-head would prefer:
<MOVIE>
<YEAR>2001</YEAR>
<TITLE>Moulin Rogue</TITLE>
...
To match the data-head version, you can use equality testing on tag values:
moulin INPATH(//MOVIE[YEAR = "2001"])
Tag value equality uses contains-equality just like attribute value testing.
Inequality is also supported in both attribute and tag value equality, using the
!= operator:
moulin INPATH(//MOVIE[@YEAR != "2000"]) moulin INPATH(//MOVIE[YEAR != "2000"])
Note that inequality implies existence. The queries above do not match
<MOVIE>Moulin Rouge</MOVIE>
Because the
MOVIE tag does not have a
YEAR attribute or
YEAR child element. To test for non-existence, use the NOT operator.
You can use boolean
AND and
OR to combine existence or equality predicates in a test. Say you have documents like:
<MOVIE>
<TITLE>Big Trouble in Little China</TITLE>
<ACTORS>
<ACTOR>Kurt Russell</ACTOR>
<ACTOR>Kim Cattrall</ACTOR>
</ACTORS>
<DVD>2 DISCS</DVD>
</MOVIE>
and you want to find movies with
china in the title starring Kurt Russell and Kim Cattrall that are available on DVD:
china INPATH(/MOVIE[DVD and
.//ACTOR = "Kurt Russell" and .//ACTOR = "Kim Cattrall"]/TITLE)
You can use parentheses for precedence:
blue INPATH(/video[DVD and (discount or @rating = "4")])
AND and
OR are case-insensitive in our path query language. In XPath they must be lowercase.
Nested
INPATH operators are allowed, but the two are independent -- the default context of an
INPATH is always top level. For instance:
(perro INPATH(A)) INPATH(B)
will never hit any documents, because both
INPATH's are looking for top-level tags, and, except for invalid documents, a document cannot have two different top-level tags.
The
HASPATH operator is not a path query language operator; it's a ConText query language operator like
INPATH.
INPATH is used when you want to search for a text query within a path.
HASPATH is used when all you want to do is test for path existence; it takes a path as its only argument, and returns 100 for a document if the path exists, 0 otherwise.
select id from documents where contains(doc, 'HASPATH(/movie/dvd)')>0;
will return all documents where the top-level tag is a
movie element which has a
dvd element as a direct child.
HASPATH can also do tag value equality tests:
HASPATH(//A = "dog")
Attribute value equality tests and
AND and
OR operators are not currently supported. You can use the ConText query language
AND and
OR, with multiple
HASPATHs to achieve the same effect. Instead of:
HASPATH(A and B)
write:
HASPATH(A) and HASPATH(B)
HASPATH can return false hits when there are empty sections. Path sections are recorded with level information, but not true parentage. As a result, a document like:
<A>
<B>
<C></C>
</B>
<D>
<E></E>
</D>
</A>
is matched by the query:
HASPATH(//B/E)
Since we do not have real parent information, we cannot detect that
E is not the child of
B. The index tells us only that
E and
B surround the same word offsets, and that
E is a third-level tag and
B is a second-level tag. Normally this indicates that
E is a child of
B. In this boundary case it does not. This limitation only applies to empty sections like this -- any words in the document would ensure correct behavior.
Highlighting with the
INPATH and
HASPATH operators is not supported in this version. You can still highlight and markup regular words, and
WITHIN queries, but use of the path operators will result in an error message. We are working on support for a future release.
Oracle 9i introduces a new datatype for storing XML -- the
XMLType. This is a core database feature, and you can find out about the type and its usage in the XML features manual.
You can create a ConText index on this type, but you need a few database privileges first:
1. the user creating the index must have
query rewrite:
grant query rewrite to <user>
Without this privilege, the create index will fail with:
ORA-01031: insufficient privileges
<user> should be the user creating the index. The schema that owns the index (if different) does not need the grant.
2.
query_rewrite_enabled should be
true, and
query_rewrite_integrity should be
trusted. You can add them to the
init.ora:
query_rewrite_enabled=true
query_rewrite_integrity=trusted
or turn it on for the session:
alter session set query_rewrite_enabled=true;
alter session set query_rewrite_integrity=trusted;
Without these, queries will fail with:
DRG-10599: column is not indexed
These privileges are needed because under the covers a ConText index on an
XMLType column is actually a function-based index on the
getclobval() method of the type. These are the standard grants you need to use function-based indexes, as covered in the general Oracle documentation. However, unlike function-based b-tree indexes, you do not need to calculate statistics.
When an
XMLType column is detected, and no section group is specified in the parameters string, the default system examines the new system parameter
DEFAULT_XML_SECTION, and uses the section group specified there. At install time this system parameter is set to
CTXSYS.PATH_SECTION_GROUP, which is the default path sectioner. The default filter system parameter for
XMLType is
DEFAULT_FILTER_TEXT, which probably means that the INSO filter is not engaged by default.
Other than the database privileges and the special default section group system parameter, indexes on
XMLType columns work like any other ConText index.
Here is a simple example:
connect ctxsys/ctxsys
grant query rewrite to xtest;
connect xtest/xtest
create table xtest(doc sys.xmltype);
insert into xtest values (sys.xmltype.createxml('<A>simple</A>'));
create index xtestx on xtest(doc)
indextype is ctxsys.context;
alter session set query_rewrite_enabled = true;
alter session set query_rewrite_integrity = trusted;
select a.doc.getclobval() from xtest a
where contains(doc, 'simple INPATH(A)')>0;
Popular Downloads | ||
|
| ||
|
More Database Downloads
|
||
