Oracle Text 9.0.1
XML Features Overview

Garrett Kaminaga, Product Development, Database and Application Server Technologies, Oracle HQ

This is a technical overview of the XML feautes in Oracle Text (the product formerly known as interMedia Text) version 9.0.1. This is intended for an audience fairly familiar with previous versions of the product.

Table of Contents

  • New XML Features
  • New XML Features

    Path Indexing

    XML Path Searching

    Every new version of Oracle Text/interMedia Text/ConText Option has added new structured document searching features, so we've built up a lot of functionality -- 8.1.7 has nested section search, doctype disambiguation, attribute value searching, automatic section indexing, and more. But with the industry embracing XML, demand is growing for even more sophisticated features which are beyond the capabilities of the current section architecture.

    9.0.1 introduces a new section type and new query operators which support an XPath-like query language. ConText indexes with XML path searching are able to perform sophisticated section searches that were impossible in previous versions.

    The Path Section Group

    In order to use XML path searching, the index must be created with the new path section group:

      exec ctx_ddl.create_section_group('mypathgroup','PATH_SECTION_GROUP');
    

      create index myindex on mytable(doc)
        indextype is ctxsys.context
        parameters ('SECTION GROUP mypathgroup');
    

    You can add only SPECIAL sections to a path section group; you do not have to add ZONE, FIELD, or ATTR sections, and it does not support STOP sections in this version. Most of the time, you won't even need to create your own -- just use the default one:

      create index myindex on mytable(doc)
        indextype is ctxsys.context
        parameters ('SECTION GROUP ctxsys.path_section_group');
    

    The path section group is somewhat like the auto section group in that it automatically indexes all tags and attributes it encounters. For query, you can still use WITHIN, but the path section group also supports the more powerful INPATH and HASPATH operators.

    Path Section Internals

    Now we'll talk a little about how path sections are indexed, and how they differ from zone and attribute sections. You can skip this section if you're just interested in usage.

    We'll illustrate the difference in index data with this simple doc:

      <OUTER><INNER ATTR="attrvalue">text</INNER></OUTER>
    
    

    The auto section group produces the following in $I:

    TOKEN_TEXT TOKEN_TYPE TOKEN_INFO
    OUTER 2 (zone section) DOC 1 START 0 LENGTH 1
    INNER 2 (zone section) DOC 1 START 0 LENGTH 1
    INNER@ATTR 5 (attribute section) DOC 1 START 0 LENGTH 1
    ATTRVALUE 4 (attribute value) DOC 1 POS 1
    TEXT 0 (plain text) DOC 1 POS 1

    A simple within query like:

    
      select id from mytable where contains(doc, 'text WITHIN inner')>0
    

    can be fulfilled by fetching the info for word TEXT and section INNER, then looping through each word offset of TEXT, and checking to see if it is between INNER START and START + LENGTH. A simple attribute query like:

      attrvalue WITHIN inner@attr
    

    (we'll just write the text query for brevity) can be fulfilled in much the same way, but using type 4 and 5 tokens instead of types 0 and 2.

    This section type cannot support more complex queries. For instance, equal sections like INNER and OUTER are a problem. The query:

      (text WITHIN outer) WITHIN inner
    

    matches this document. The index data records the same offsets for INNER and OUTER, so it is impossible to tell if INNER is inside OUTER or vice versa. Another problem is attribute value sensitive section search. For document:

      <SHIPMENT>
        <MEDIA TYPE="dvd">Ice Blue Eyes</MEDIA>
        <MEDIA TYPE="book">The Monopoly Companion</MEDIA>
      </SHIPMENTS>
    

    (We won't include the xml declaration and DOCTYPE stuff -- just pretend that they are there) If you want to find shipments that include the DVD "Ice Blue Eyes", you'd have to include both section and attribute criteria:

      find documents where "Ice Blue Eyes" occurs within a "MEDIA" section
        whose "TYPE" attribute has the value "dvd"
    

    Unfortunately, attribute values and sections (types 4 and 5) are completely separate from normal text and sections (types 0 and 2). There is no way to link an attribute section to the particular zone section occurrence in which it occurs.

    The path section group solves both problems by indexing the document like:

    TOKEN_TEXT TOKEN_TYPE TOKEN_INFO
    OUTER 7 (path section) DOC 1 START 0 LENGTH 1 LEVEL 1
    INNER 7 (path section) DOC 1 START 0 LENGTH 1 LEVEL 2
    INNER@ATTR 8 (path attr section) DOC 1 START 0 LENGTH 1 OCC 1
    ATTRVALUE 4 (attribute value) DOC 1 POS 1
    TEXT 0 (plain text) DOC 1 POS 1

    Zone sections and attribute sections (types 2 and 5) have been replaced with path sections and path attribute sections (types 7 and 8). Each occurrence of a path section has a LEVEL bit which indicates the nesting level of the tag. The root tag gets a level of 1, its children get a level of 2, their children get a level of 3, etc. Including level information solves the equal section problem, because we can now tell which tag is the outer tag.

    Path attribute sections have a OCC part which links the attribute section occurrence to a path section occurrence. Here, INNER@ATTR has an OCC of 1 because it occurs in the first occurrence of INNER. The ability to correlate attributes and tags solves the attribute value sensitive section search.



    New XML Features

    Path Query

    Querying XML Documents

    The query interface is through SQL selects, so your XML queries return entire documents, not just selected parts of them. While we work on better extraction features for future release, you can explore using the new XMLType, which has extraction methods. Just use extraction methods in the select list and contains in the where clause.

    The INPATH Operator

    The way path sections are indexed enables more complicated section searches, but the WITHIN operator is not expressive enough to handle them. Instead of

      <text query> WITHIN <section name>      (e.g. Ice WITHIN MEDIA   )
    

    indexes with a path section group use the INPATH operator in queries:

      <text query> INPATH(<path expression>)  (e.g. Ice INPATH(//MEDIA) )
    

    but it functions in much the same way, limiting the scope of the text query to certain parts of the indexed documents. The parentheses around the path expression are required. The path expression is more than just a simple section name -- it is a mini query, with a specialized query language. The next section explores the path query language in more detail.

    You can still use the WITHIN operator even if you are using the path section group. There should be no difference in behavior between the path section group or auto section group when using WITHIN queries.

    Path Query Language

    The Text path query language is based on XPath, and we will probably continue to use XPath as a guide for future development, but it is NOT XPath. Not all the XPath operators exist in the Text path query language, for instance. Also, the Text path query language operators are case-insensitive, while XPath's are strictly lower-case. There are other semantic differences covered below. Just don't make assumptions about the path language based on XPath expectations.

    Tags

    When specifying tags in path queries, you must specify it exactly as it appears in the document in order for it to match. There are two commonly-made mistakes you should avoid.

    First, tag names are case-sensitive so the query "title" does not match the tag <TITLE> or the tag <Title>. It will match only <title>.

    Second, there is no namespace support in this version. Take the fragments:

      DOC 1    <A xmlns:ORCL="/index.html"><ORCL:B> 
      DOC 2    <A xmlns:ORACLE="/index.html"><ORACLE:B> 
    

    <ORCL:B> in DOC 1 is the same tag as <ORACLE:B> in DOC 2, because their namespace tags normalize to the same URI. However, when querying for these tags, you must specify it as written in the document, so "ORCL:B" to find the tag in doc 1, and "ORACLE:B" to find it in doc 2. "B" alone will not find either tag, nor will something like "http://www.oracle.com/:B". Future versions will probably add more sophisticated namespace support.

    Top-Level Tag

    The simplest INPATH query string is a single tag:

      perro INPATH(TITLE)
    

    Like a WITHIN query, this query finds perro where it occurs between <TITLE> and </TITLE>. However, unlike a WITHIN query, <TITLE> must be the top-level tag. Take these two documents:

      DOC 1    <TITLE>Clifford El Gran Perro Colorado</TITLE>
      DOC 2    <BOOK><TITLE>Años De Perro</TITLE></BOOK>
    

    The query

      perro WITHIN TITLE
    

    will find both documents, but the INPATH query will find only document 1. It does not match document 2 because there the TITLE tag has a level of 2.

    What's really happening is that no level for the query node is specified, so it uses the default context, which is always the top level for INPATH queries. You can explicitly specify the top level context with slash:

      perro INPATH(/TITLE)
    

    or explicitly specify the default context using dot:

      perro INPATH(./TITLE)
    

    both are equivalent to the query without the slash. All examples from here will include the top level slash for readability.

    Any-Level Tag

    A double slash indicates "any number of levels down". So, the query:

      perro INPATH(//TITLE)
    

    is looking for perro inside a TITLE tag that occurs at the top level or any level below. In other words, this query is equivalent to:

      perro WITHIN TITLE
    

    and finds both DOC 1 and DOC 2.

    Ancestor/Descendant Searching

    A child tag is a tag which is enclosed within another tag. For instance, in:

      DOC 2    <BOOK><TITLE>Años De Perro</TITLE></BOOK>
    

    TITLE is a child of BOOK. We can find this document using the any-level tag searching, as in the previous section. But what if the corpus also contained:

      DOC 3    <MOVIE><TITLE>Mi vida como un perro</TITLE></MOVIE>
    

    In order to find only books with perro in the title, we need to limit the search to title tags whose parent is a book tag:

      perro INPATH(/BOOK/TITLE)
    
    

    Reading the path right-to-left, we are looking for a top-level BOOK tag with a child TITLE tag, which matches only DOC 2.

    The single slash is direct parentage. The query above will not find:

      DOC 4    <BOOK><DESCRIPTION>
                 <TITLE>Años De Perro</TITLE>
               </DESCRIPTION></BOOK> 
    

    Because here TITLE is not a direct child of BOOK. TITLE's direct parent is DESCRIPTION, whose parent is BOOK -- TITLE is a grand-child of BOOK. To find this doc, you can use the any-level slashes:

      perro INPATH(/BOOK//TITLE)
    

    Reading the path right-to-left, we are looking for a top-level BOOK tag with some descendant TITLE tag. This query will match both DOC 3 and DOC 4. Note that this is not the same as:

    
      ((perro WITHIN TITLE) WITHIN BOOK)
    

    First, the INPATH query restricts BOOK to the top-level. Second, equal sections are not confused. That is, the query:

      ((perro WITHIN BOOK) WITHIN TITLE)
    

    would match DOC 4, but the query:

      perro INPATH(/TITLE//BOOK)
    

    would not. Path sections know that TITLE is a child of BOOK, even though they occur at the same text offsets.

    Finally, if you wanted to match only DOC 4 and not DOC 3 -- that is, you want to match TITLE only if it is a grandchild of BOOK, and not a child or great grandchild, etc. -- you can use the single level wildcard:

      perro INPATH(/BOOK/*/TITLE)
    

    The * matches exactly one level, so this path query filters out DOC 3.

    You can combine these ancestor/descendant elements for even more complicated queries:

      felis INPATH(//kindgom/*/*/order/family//genus)
    

    Attributes

    You can search within an attribute value using the syntax <tag>/@<attribute>:

      perro INPATH(//MOVIE/@SPANISHTITLE)
    

    matches:

      DOC 5    <MOVIE SPANISHTITLE="Mi vida como un perro">My Life As A Dog</MOVIE>
    

    and is equivalent to the query:

      perro WITHIN MOVIE@SPANISHTITLE
    

    One limitation resulting from how attributes are indexed is that all attributes must specify their direct-parent tags. The following:

      perro INPATH(//@TITLE)
      perro INPATH(A/*/@TITLE)
    

    are not allowed, because the tag for the title attribute is not specified:

      select * from doc where contains(text, 'perro INPATH(//@TITLE)')>0;
                                             *
      ERROR at line 1:
      ORA-20000: Oracle Text error:
      DRG-50951: Unable to resolve element name for attribute TITLE
    

    Existence Test

    The square brackets are used to impose a condition on a node without changing the path context. For instance, the query:

      monopoly INPATH(/auction[image])
    

    is looking for monopoly inside a top-level auction tag which has an image tag as a direct child. The search for monopoly occurs within the entirety of <auction> and </auction>, and not just within <image> and </image>. This document will match:

      <auction>Sailing Monopoly <image src="...">pic</image></auction>
    

    but will not match:

      <auction>Sailing Monopoly</auction>    
    

    because there is no image element. The default context inside a test element is the tag to which it is applied, so

      monopoly INPATH(/auction[image])
    

    is actually the same as:

      monopoly INPATH(/auction[./image])
    

    You need the dot to reference the default context. Without the dot:

      monopoly INPATH(/auction[/image])
    

    it would mean top-level image tag. This is not supported, and will result in a syntax error.

    The existence test for image will match only if image exists and is a direct child. It does not match:

      <auction>Sailing Monopoly<desc><image src="...">pic</image></desc></auction>
    

    because here image is not a direct child of auction. You can match this document using the any-level wildcard, instead:

      monopoly INPATH(/auction[.//image])
    
    You can also test for attribute existence:

      monopoly INPATH(/auction[@reserve])
    

    The test node can be combined with other operators for interesting searches:

      monopoly INPATH(/auction[.//image]/title)
    

    The test node does not change context, so the /title applies to /auction rather than /auction//image -- this query finds auctions where monopoly occurs inside a direct-child title tag, but only if the auction has an image tag in it somewhere. For instance, the doc:

      <auction>
        <title>Sailing Monopoly</title>
    
        <description>
           New Sailing Monopoly with custom pewter tokens from USAOpoly
           <image src="...">here is a picture</image>
        </description>
      </auction>
    

    To test for non-existence, use the NOT operator:

      monopoly INPATH(/auction[not(taxinfo)])
    

    this query looks for monopoly within an auction element that does not have a direct child taxinfo. The NOT operator is case-insensitive in our path query language. In XPath it only works in lowercase.

    Attribute and Tag Value Equality

    The test operator is capable of more than simple existence testing. More useful is attribute value testing, which contrains nodes by the value of their attributes. For instance, given a document like:

      <MOVIE>
        <TITLE LANGUAGE="German">Tiger und Dragon</TITLE>
        <TITLE LANGUAGE="French">Tigre et Dragon</TITLE>
        <TITLE LANGUAGE="Spanish">Tigre y Dragón</TITLE>
        <TITLE LANGUAGE="Mandarin">Wo hu cang long</TITLE>
        <TITLE LANGUAGE="English">Crouching Tiger, Hidden Dragon</TITLE>
      </MOVIE>
    

    the query:

      dragon INPATH(//TITLE)
    

    will search all language titles. To limit the search to just English titles, you can add an attribute value equality test:

      dragon INPATH(//TITLE[@LANGUAGE = "English"])
    

    Only equality and inequality (using !=) are supported. Range searches are not supported in this version. The left-hand side must be an attribute or tag, while the right-hand side must be a literal. The query:

      gato INPATH(//A[@B = @C])
    

    is not allowed, nor is something like

      gato INPATH(//A["dog" = "cat"]
    

    Only string literals are allowed. Numeric literals, such as

      tora INPATH(//MOVIE[@PRICE = 5])
    

    will raise a syntax error. This means that numbers are not normalized. The query above will not match:

      <MOVIE PRICE="5.0">Tora! Tora! Tora!</MOVIE>
    

    because the string 5 is not equal to the string 5.0, although numerically they are equal.

    The equality test is not strict equality -- it uses "contains-equality". Two text fragments are contains-equal if the lexer produces identical index info. Some of the significant ways that this deviates from strict equality are:

    • contains equality is probably case insensitive
      Unless the lexer has MIXED_CASE on, it would consider foo and FOO to be equal strings.
    • lexer transformation can make strings equal
      The lexer can be set to normalize word forms. Base letter, for instance, converts accented characters to non-accented forms. A word with u-umlaut (Müller) is contains-equal to the same word with an unaccented u (Muller) in this case.
    • lexer ignores whitespace and non-alphabetic, non-join characters
      The lexer deals in word offsets. In a phrase like WORD1 WORD2, the word offset of WORD2 is always 1 greater than the word offset of WORD1 -- it doesn't matter how many spaces or newlines there are between them. Also, any non-alphabetic, non-join character is converted to whitespace (and subsequently ignored). This can confuse names, with Chase Matthew being contains-equal to Chase, Matthew, or phrases, with fruit-plants being contains-equal to fruit, plants.
    • stopwords in the query string are considered any word wildcard
      Any stopword appearing in the query text fragment will match any single word in the document's text fragment. This means that the query Paris in the spring would be contains-equal to the document Paris: avoid during spring.

    The rules for contains equality seem complex, but it works the same as regular text queries hitting document text -- you've probably internalized these rules already. One significant difference between equality and contains, though, is that the equality test always makes sure that the number of words in the attribute value is the same as the number of words in the query string.

      dragon INPATH(//TITLE[@LANGUAGE = "French"])
    

    does not match any of these fragments:

      <TITLE LANGUAGE="Canadian French">dragon</TITLE>
      <TITLE LANGUAGE="French Colloquial">dragon</TITLE>
      <TITLE LANGUAGE="Medieval French Colloquial">dragon</TITLE>
    

    Although each LANGUAGE attribute value has the word French, there are extra words. These would match a contains in the attribute value, but they do not meet the "same number of words" equality criteria.

    While docu-head people use a lot of attributes in their DTD's, data-heads prefer child tags. For instance, a docu-head might write:

      <MOVIE YEAR="2001" TITLE="Moulin Rogue">...
    

    While a data-head would prefer:

      <MOVIE>
    
        <YEAR>2001</YEAR>
        <TITLE>Moulin Rogue</TITLE>
        ...
    

    To match the data-head version, you can use equality testing on tag values:

      moulin INPATH(//MOVIE[YEAR = "2001"])
    

    Tag value equality uses contains-equality just like attribute value testing.

    Inequality is also supported in both attribute and tag value equality, using the != operator:

      moulin INPATH(//MOVIE[@YEAR != "2000"])
      moulin INPATH(//MOVIE[YEAR != "2000"])
    

    Note that inequality implies existence. The queries above do not match

      <MOVIE>Moulin Rouge</MOVIE>
    

    Because the MOVIE tag does not have a YEAR attribute or YEAR child element. To test for non-existence, use the NOT operator.

    AND and OR Operators

    You can use boolean AND and OR to combine existence or equality predicates in a test. Say you have documents like:

      <MOVIE>
        <TITLE>Big Trouble in Little China</TITLE>
    
        <ACTORS>
          <ACTOR>Kurt Russell</ACTOR>
          <ACTOR>Kim Cattrall</ACTOR>
        </ACTORS>
        <DVD>2 DISCS</DVD>
      </MOVIE>
    

    and you want to find movies with china in the title starring Kurt Russell and Kim Cattrall that are available on DVD:

      china INPATH(/MOVIE[DVD and 
                   .//ACTOR = "Kurt Russell" and .//ACTOR = "Kim Cattrall"]/TITLE)
    

    You can use parentheses for precedence:

      blue INPATH(/video[DVD and (discount or @rating = "4")]) 
    

    AND and OR are case-insensitive in our path query language. In XPath they must be lowercase.

    Nested INPATH

    Nested INPATH operators are allowed, but the two are independent -- the default context of an INPATH is always top level. For instance:

      (perro INPATH(A)) INPATH(B)
    

    will never hit any documents, because both INPATH's are looking for top-level tags, and, except for invalid documents, a document cannot have two different top-level tags.

    The HASPATH Operator

    The HASPATH operator is not a path query language operator; it's a ConText query language operator like INPATH. INPATH is used when you want to search for a text query within a path. HASPATH is used when all you want to do is test for path existence; it takes a path as its only argument, and returns 100 for a document if the path exists, 0 otherwise.

      select id from documents
       where contains(doc, 'HASPATH(/movie/dvd)')>0;
    

    will return all documents where the top-level tag is a movie element which has a dvd element as a direct child.

    HASPATH can also do tag value equality tests:

      HASPATH(//A = "dog")
    

    Attribute value equality tests and AND and OR operators are not currently supported. You can use the ConText query language AND and OR, with multiple HASPATHs to achieve the same effect. Instead of:

      HASPATH(A and B)
    

    write:

      HASPATH(A) and HASPATH(B)
    

    HASPATH can return false hits when there are empty sections. Path sections are recorded with level information, but not true parentage. As a result, a document like:

      <A>
       <B>
        <C></C>
       </B>
       <D>
        <E></E>
       </D>
      </A>
    

    is matched by the query:

      HASPATH(//B/E)
    

    Since we do not have real parent information, we cannot detect that E is not the child of B. The index tells us only that E and B surround the same word offsets, and that E is a third-level tag and B is a second-level tag. Normally this indicates that E is a child of B. In this boundary case it does not. This limitation only applies to empty sections like this -- any words in the document would ensure correct behavior.



    New XML Features

    Highlighting
    XMLType Indexing

    Highlighting

    Highlighting with the INPATH and HASPATH operators is not supported in this version. You can still highlight and markup regular words, and WITHIN queries, but use of the path operators will result in an error message. We are working on support for a future release.

    XMLType Indexing

    Oracle 9i introduces a new datatype for storing XML -- the XMLType. This is a core database feature, and you can find out about the type and its usage in the XML features manual.

    You can create a ConText index on this type, but you need a few database privileges first:

    1. the user creating the index must have query rewrite:

           grant query rewrite to <user>
    

    Without this privilege, the create index will fail with:

           ORA-01031: insufficient privileges
    

    <user> should be the user creating the index. The schema that owns the index (if different) does not need the grant.

    2. query_rewrite_enabled should be true, and query_rewrite_integrity should be trusted. You can add them to the init.ora:

     
           query_rewrite_enabled=true
           query_rewrite_integrity=trusted
    

    or turn it on for the session:

           alter session set query_rewrite_enabled=true;
           alter session set query_rewrite_integrity=trusted;
    

    Without these, queries will fail with:

           DRG-10599: column is not indexed
    

    These privileges are needed because under the covers a ConText index on an XMLType column is actually a function-based index on the getclobval() method of the type. These are the standard grants you need to use function-based indexes, as covered in the general Oracle documentation. However, unlike function-based b-tree indexes, you do not need to calculate statistics.

    When an XMLType column is detected, and no section group is specified in the parameters string, the default system examines the new system parameter DEFAULT_XML_SECTION, and uses the section group specified there. At install time this system parameter is set to CTXSYS.PATH_SECTION_GROUP, which is the default path sectioner. The default filter system parameter for XMLType is DEFAULT_FILTER_TEXT, which probably means that the INSO filter is not engaged by default.

    Other than the database privileges and the special default section group system parameter, indexes on XMLType columns work like any other ConText index.

    Here is a simple example:

      connect ctxsys/ctxsys
      grant query rewrite to xtest;
    
      connect xtest/xtest
      create table xtest(doc sys.xmltype);
      insert into xtest values (sys.xmltype.createxml('<A>simple</A>'));
    
    
      create index xtestx on xtest(doc)
        indextype is ctxsys.context;
    
      alter session set query_rewrite_enabled = true;
      alter session set query_rewrite_integrity = trusted;
    
      select a.doc.getclobval() from xtest a
        where contains(doc, 'simple INPATH(A)')>0;
    

E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy