Oracle Text 9.0.1 Technical Overview

Oracle Text 9.0.1
Technical Overview

Garrett Kaminaga, Product Development, Database and Application Server Technologies, Oracle HQ

This is a technical overview of the improvements found in Oracle Text (the product formerly known as interMedia Text) version 9.0.1. This is intended for an audience fairly familiar with previous versions of the product.

If you are new to the product, you should familiarize yourself with the product before reading this document. The official documentation, especially the new Application Developer's Guide, is a good place to start, as are the interMedia Text technical overviews for 8.1.5, 8.1.6, and 8.1.7, in order.

This document is only a summary of new features and internal changes. As always, please refer to the official documentation for detailed information.



Table of Contents



Overview of 9.0.1 Improvements

Three big new features stand out in the set of 9.0.1 improvements: local partitioning, XML path searching, and document classification. With local partitioning, large text indexes are easier to manage and faster to create, maintain, and query. XML path searching enables sophisticated queries which can reference and leverage the embedded structure of XML documents. The new CTXRULE indextype is the inverse of the ConText indextype -- instead of using a text query to find documents, you use a document to find queries.

The rest of the improvements are smaller changes, but were driven by user feedback, and make using the product a little easier, a little faster, or a little more robust. These include new, more sophisticated lexers for Japanese and Korean, an INSO filter timeout, per-row logging during indexing to find faulty documents, transportable tablespace support, and an on-demand filter.



Upgrading 8.1.7 to 9.0.1

Manual Upgrade From 8.1.7

Oracle Text is now integrated with the Database Migration Assistant (DBMA), so the Text upgrade should happen as part of the overall database upgrade procedure. This section details the manual upgrade process, in case something goes wrong or you choose to not use DBMA.

This process assumes that you are starting with a valid 8.1.7 ctxsys data dictionary, and that the system is not being used for index, DML, or query. Use SQL*Plus to run all scripts.

1. run ?/ctx/admin/s0900010.sql as SYS
this grants new, needed database privileges to ctxsys

2. run ?/ctx/admin/u0900010.sql as CTXSYS
this upgrades the ctxsys schema to 9.0.1

It is not necessary to run anything else. The u0900010.sql script calls other scripts to upgrade the indextypes (t0900010.sql), recompile all packages (dr0pkh.sql and dr0plb.sql), and recreate the shared library (dr0ulib.sql).

The script is not expected to encounter errors other than ORA-955 ("name is already used by an existing object", for public synonyms). After the script is complete, all ctx-created objects owned by ctxsys should be VALID.

For iMT veterans who may be wondering, the path is not a typo. The upgrade scripts, which used to be in their own sub-directory, are now found in the ?/ctx/admin directory.

SQL*Net Listener Setup

interMedia Text 8.1.7 shifted all code to trusted callouts, eliminating the need for SQL*Net listener configuration (and there was great rejoicing).

Sorry, extproc is back again. But just for one new document service function -- ctx_doc.ifilter, the on-demand INSO filtering call (this is covered below). If you don't use this function, you don't need to set up the listener.

A brief description of setting up the listener is below; complete details are in the Oracle 9i Server Administrator's Guide.

1. Add an entry to the tnsnames.ora:

       extproc_connection_data = 
         (DESCRIPTION = 
            (ADDRESS = (PROTOCOL = ipc)
                       (KEY = DBSID))
            (CONNECT_DATA = (SID = ep_agt1)))

DBSID is the database SID. ep_agt1 can be named anything. extproc_connection_data should not be changed.

2. Add the following to the listener SID_LIST:

       (SID_DESC = (SID_NAME = ep_agt1)
                   (ORACLE_HOME = /oracle)
                   (ENVS = LD_LIBRARY_PATH=/oracle/ctx/lib)
                   (PROGRAM = extproc))

ep_agt1 matches the CONNECT_DATA SID for extproc_connection_data in the tnsnames.ora. The PROGRAM section tells the Net8 listener to start the external procedure process. The ENVS section, which is shown here for UNIX, will ensure that the environment includes ?/ctx/lib in LD_LIBRARY_PATH. This is needed so that indexing can use the INSO filters.

3. Since the extproc_connection_data ADDRESS section specifies ipc, make sure that the ADDRESS_LIST of listener.ora accepts ipc connections.

A quick way to test the Net8 configuration is to do:

    exec ctx_adm.test_extproc;

from SQL*Plus, as ctxsys. If you get the error:

    DRG-50704: Net8 listener is not running or cannot start external procedures

then things aren't set up correctly. OTN and Oracle Support have helpful debugging guides for extproc setup, compiled problem-by-problem, and paid for in past headaches.

Upgrading Existing Indexes

Nothing needs to be done to upgrade existing text indexes. These should still work without any problems in 9.0.1. However, we made ctxcat DML a bit more efficient, so you might choose to re-constitute the triggers for ctxcat indexes. All you have to do is rename the index:

  alter index <indexname> rename to <newname>

You can, of course, rename it back to the original name if you choose.



Local Partitioned Indexes

Introduction
Creating Local Partitioned Indexes

Local Partitioned Indexes

A local partitioned index is a partitioned index on a partitioned table with a one-to-one mapping of index partitions to table partitions. It's like splitting up a large table and index into a set of smaller tables and indexes.

Local partitioned indexes have been around since partitioning support was first added to the database. However, this support was limited to b-tree and bitmap indexes -- domain indexes (including ConText indexes) could only be global indexes. Kernel's extensible indexing framework has been enhanced in 9i, allowing local partitioned text indexes for the first time.

Why Local Partitioned Indexes?

We won't discuss why table partitioning is good for large data sets -- that's covered pretty well in the general database Concepts manual. Instead, we'll list some reasons why a local partitioned text index is better than a global index:

  • Easier Table Maintenance
    With a global index, partition-level DDL invalidates the entire index. You could truncate one partition, and spend days rebuilding the index, even though the data in the other partitions hasn't changed. A local partitioned index truncates the corresponding index partition and the rest of the index doesn't even notice. This makes managing large sets of data much easier. This is discussed further below in the section "Partitioned Table Maintenance"
  • Parallel Sync
    Oracle Text allows only one session to sync a given index at a time. Users with large datasets have found this too limiting -- the system can't keep up with huge amounts of incoming data. With local partitioning, the index partitions are independent, and can be sync'ed in parallel. We discuss this further in the section "Local Partitioned Index Maintenance".
  • Query Performance
    A local partitioned ConText index can take advantage of partition pruning and partition iteration for faster mixed query and sorted query. In contrast, a global index wastes time producing rowids in other partitions which are simply filtered out by the query plan. Query is examined in the section "Query Considerations".

Creating a Local Partitioned Index

Let's use the following partitioned table as an example:

  create table ptest (id number, text varchar2(80))
    partition by range (id)
    (partition p1 values less than (100),
     partition p2 values less than (200),
     partition p3 values less than (300));

To create a local partitioned index, simply specify the LOCAL keyword:

  create index ptestx on ptest(text)
    indextype is ctxsys.context
    local;

The local keyword must be after the indextype clause. You can stick a parameters clause before or after the local keyword:

  create index ptestx on ptest(text)
    indextype is ctxsys.context
    parameters ('filter ctxsys.inso_filter')
    local

  create index ptestx on ptest(text)
    indextype is ctxsys.context
    local
    parameters ('filter ctxsys.inso_filter')

are both OK. By default, the index partitions will have the same name as the corresponding table partitions, so in our examples above the index partitions are named p1, p2, and p3. You can override this by specifying individual partition clauses:

  create index ptestx on ptest(text)
    indextype is ctxsys.context
    local
    (partition larry, partition moe, partition curly)
    parameters ('filter ctxsys.inso_filter')

The partition clauses must come immediately after the local keyword. The partition clauses correspond to the table partitions in order. For instance, our first table partition is p1, so, in our example above, its index partition will be named larry, because that's the first partition clause in the set. Because it's done solely by order, this means that you can't just name a few partitions here -- you must specify a partition clause for each table clause. If your table has 100 partitions, you must put 100 partition clauses here.

Each partition clause can optionally have a parameters clause:

  create index ptestx on ptest(text)
    indextype is ctxsys.context
    local
    (partition larry, 
     partition moe parameters ('memory 5m'),
     partition curly)
    parameters ('filter ctxsys.inso_filter')

These partition-level parameters clauses can be used to specify index memory and storage attributes on a per-partition basis. We'll discuss the difference between the index-level and partition-level parameters clauses below.

Limitations

There are a few limitations to be aware of when creating local partitioned ConText indexes:

  • Base table must be RANGE partitioned
    Other partitioning methods, such as HASH, COMPOSITE, or LIST, are not supported for local indexes. You can still create a global index on tables partitioned by these methods, though.
  • ConText indextype only
    In this version, only the ConText indextype supports local partitioning. You will just get an error message if you try to create a local CTXCAT or CTXRULE index.
  • Index name must be 21 bytes or less
    Due to the table naming scheme used for local partitioned indexes (more on this below), the names of local partitioned indexes cannot exceed 21 bytes in length.
  • Maximum number of partitions is 9999
    Again due to the table naming scheme, the index (and therefore the table) cannot have more than 9999 partitions.

Local Partitioned Indexing Process

Now let's take a closer look at what happens under the covers when a local partitioned index is created:

  1. Create the index global meta-data
  2. For each partition in the table, do the following:
    1. create the index partition meta-data
    2. create the index partition internal tables
    3. populate the index data
    4. create the $X b-tree

First, the index global meta-data is created. Like a regular ConText index, it parses the index-level parameters clause, determines the preferences to use for all classes, then deep-copies the preference values to index meta-data.

In this version, the index meta-data is shared by all index partitions. This means that you cannot, for instance, specify a Japanese lexer for partition 1 and a Basic lexer for partition 2. The documents in the table are assumed to be partitioned by contiguous numeric qualities like date or id, rather than by document type. The existing heterogeneous data support (charset, format, and language column) can be used to work around this limitation.

Since all the index partitions share common meta-data, the partition parameters clause does not allow anything except MEMORY and STORAGE -- use the index-level parameters clause to specify everything else.

Once the index meta-data is created, we iterate over the table partitions. (Parallel index creation is covered below) For each table partition, an index partition is created. Creating an index partition is not that different from creating a regular ConText index. There are four stages: meta-data, internal table creation, population, and b-tree index.

The partition meta-data is usually just a single row in dr$index_partition, visible through the view CTX_USER_INDEX_PARTITIONS, which has the columns:

IXP_ID
This is a numeric ID for index partition. It is Text-assigned, so does not correspond to the index partition object id. The value ranges from 1 to 9999, and is unique within an index. The values are designed to be contiguous, so as index partitions are dropped and added, values will get re-used. Because most partition operations take in an index partition name, you generally won't care about the ixp_id with one exception -- the ixp_id makes up part of the internal tablenames.

IXP_INDEX_OWNER
IXP_INDEX_NAME
The owner and name of the index to which this index partition belongs.

IXP_INDEX_PARTITION_NAME
The index partition name. This is the same as the partition name in user_ind_partitions. Use this name to identify the index partition in all index partition operations.

IXP_TABLE_OWNER
IXP_TABLE_NAME
IXP_TABLE_PARTITION_NAME
The base table partition for this index.

IXP_DOCID_COUNT
IXP_STATUS
Like idx_docid_count and idx_status for normal text indexes, only now these apply on a partition level, so the values for these columns in CTX_USER_INDEXES can be ignored.

The second step in creating the index partition is internal table creation. Each index partition gets its own set of DR$ tables. The naming scheme is somewhat different -- DR#<indexname><ixp_id><suffix>. For instance, index partition p1 from our example above will have a $I table named DR#PTESTX0001$I. The # instead of $ avoids collisions with global ConText index tablenames. The addition of ixp_id is the reason why local partitioned index names are limited to 21 bytes instead of the 25 bytes for global indexes, and the four- digit decimal representation of the index partition id imposes the limit of 9999 index partitions.

The tablespace and storage for these internal tables are determined mainly by the storage preference specified in the index-level parameters clause. However, you can override this by using the partition parameters clause. Modifying our example above:

  create index ptestx on ptest(text)
    indextype is ctxsys.context
    local
    (partition larry, 
     partition moe parameters ('storage p2_storage'),
     partition curly)
    parameters ('storage global_storage')

The internal tables for larry and curly will be created using the global_storage preference, while moe will be created using p2_storage. Specifying a per-partition storage preference uses a little more index meta- data (stored in table dr$index_values).

The third step in creating the index partition is population. This is the real work -- reading the documents, inverting them, and dumping the index data to the $I table. This is the same as a global index population, except that the datastore will use a partition-extended tablename; we are indexing only a particular partition of the base table here.

Index memory size determination works like storage preference determination. First, we check if MEMORY was specified in the partition-level parameters clause. If not, we then check MEMORY in the index-level parameters clause. If not there, then we take system parameter DEFAULT_INDEX_MEMORY.

The fourth and final stage of index partition creation is the creation of the $X index on the $I table. The tablespace and storage for this index comes from the same storage preference used for internal table creation.

After these four stages are complete, it goes to the next table partition and starts all over. When all table partitions have been indexed, the index creation completes, and you can do contains queries.

Parallel Local Partitioned Index Creation

Normally the index partitions are created serially -- one at a time. More powerful systems may be able to create index faster by creating the partitions in parallel (inter-partition parallelism). This is not as simple as adding a PARALLEL clause to the create index statement. Principally this is because we currently ignore it for local partitioned text indexes. However, even if it were supported, it wouldn't create the partitions in parallel. With local b-tree indexes, specifying a PARALLEL degree still creates the index partitions serially, though each partition is parallelized (intra-partition parallelism).

The strategy for creating a local partitioned index with inter-partition parallelism is:

  1. create an empty index
  2. rebuild the index partitions in parallel

Normally, step 1 is accomplished using the UNUSABLE keyword on create index. Unfortunately, this is not supported for domain indexes -- you'll just get syntax errors. Instead, you can use the Text way of creating an empty index -- the NOPOPULATE keyword:

  create index ptestx on ptest(text)
    indextype is ctxsys.context
    local
    parameters ('NOPOPULATE');

This will create the index meta-data, then create partition meta-data, tables, and $X index for each partition. The partitions are processed serially, but since we're not doing index population, this should take only a few seconds per partition.

For step 2, rebuilding the index partitions in parallel, the database has a nifty utility -- DBMS_PCLXUTIL -- which uses DBMS_JOB to kick off multiple sessions which rebuild the index partitions. Unfortunately, this does not work with domain indexes, because the jobs that it submits include the PARALLEL clause, which is not supported for domain indexes, and raises a syntax error.

Instead, you have to manually submit the rebuild jobs. The easiest way to do this is to simply open multiple SQL*Plus sessions and, in each one, run scripts that have statements like:

  alter index ptestx rebuild partition p1;

Issuing alter index rebuild partition like this will drop the internal tables, re-create them, do the indexing, then re-create the $X index. So, there's a little extra work involved because we've created the tables already during create index, but it's not that wasteful. You can specify an index memory size like this:

  alter index ptestx rebuild partition p1 parameters ('replace memory 50M');

We are working with extensible framework development to simplify parallel local domain index creation for 9i Release 2.



Local Partitioned Index

Query Considerations

Query Syntax

There is no difference between local partitioned text index queries and global text index queries -- the syntax is the same. However, local partitioning can change the performance characteristics of certain types of queries. We'll examine this in this section using the following test table:

  create table ptest(a number not null, b varchar2(128))
  partition by range(a)
  (
    partition p1  values less than (10000),
    partition p2  values less than (20000),
    partition p3  values less than (30000),
    partition p4  values less than (40000),
    partition p5  values less than (50000),
    partition p6  values less than (60000),
    partition p7  values less than (70000),
    partition p8  values less than (80000),
    partition p9  values less than (90000),
    partition p10 values less than (100000)
  );

populated with synthetic data. There are 100,000 rows. Every row has the token EVRY. TM07 occurs every tenth row, between 1 and 7 times within the row (to give a score range). HSPC occurs every 100 rows, and KSPC occurs every 1000 rows.

We will be comparing two runs -- one with a global ConText index on this table, and one with a local partitioned index. The timing numbers given are for comparison against each other only, and do not represent tuned, publishable benchmark numbers.

SCORE

The contains score is calculated from several statistics, including number of documents containing the query token, and number of documents in the corpus. In a local partitioned index, each partition is independent, so these numbers are partition-scope. That is, it uses the number of documents in the partition that contain the token, and the number of documents in the table partition. As a result, the SCORE values for any given row that you get from a query on a local partitioned index may be different than if the index were a global index. However, we do not believe that the partition-scope of the score will have noticeable adverse effects for most applications.

Simple Text Queries

Each partition of a local partitioned ConText index is an almost-independent index. If many index partitions have to be scanned to fulfill a query, local index query time can be much slower than global index query time on the same data, because essentially you are scanning many indexes.

A plain text query like this:

  select * from ptest where contains(b, 'HSPC')>0;

with a global index shows a single domain index scan:

  1.1 DOMAIN INDEX  PTESTX

The local partitioned index, on the other hand, includes a partition iterator:

  1.1 PARTITION RANGE ALL
    2.1 DOMAIN INDEX  PTESTX

This shows that the query plan is looping over the index partitions. For each partition, it is doing a separate domain index scan.

Each domain index scan incurs some fixed amount of overhead, plus a query time which is proportional to the amount of indexed data and number of hits. The partitions of the local index have much less data than the global index, so that portion of the query time is lessened. However, the overhead is much more because the same fixed overhead is incurred once per partition. This slowdown increases as the number of partitions increases.

The following table illustrates this effect. We ran successive tests with the same data at different levels of partitioning -- for instance, the first test partitioned the 100,000 rows into 10 partitions of 10,000 rows, the second test partitioned the 100,000 rows into 50 partitions of 2,000 rows, etc. up to 1000 partitions. We created a local ConText index each time, and ran the query above, for HSPC. HSPC occurs every 100th row, so there's 1000 hits overall, and a minimum of 1 hit per partition. Times given below are in minutes:seconds and are to select all 1000 rows

# of partitions in ptest
10 50 100 250 500 1000
global index 1.14 1.11 1.14 1.11 1.12 1.18
local index 1.34 11.37 21.23 50.42 2:25.48 5:01.82

As the number of partitions increases, the overhead for scanning all those index partitions swamps the query time -- at 1000 partitions, the local index is about 250 times as slow.

Local partitioning may not be looking too good so far. Keep in mind, though, that this is the worst case scenario -- where local partitioning is weakest. After all, it makes no sense to split up your index unless you're going to leverage that split -- and this is what we turn to next.

Also, these times are for selecting all 1000 hits. The runtimes would be faster if you were interested in only the first 10 hits, for instance. In the 50 partition case, for instance, the query can get 10 hits from the first partition alone, and would stop without scanning the other 49 partitions. This would yield times very close to global index.

Partition Pruning

We've seen that local partitioned indexes have the disadvantage when all the partitions have to be scanned; the flip side of this is that they have a distinct advantage when you only have to scan a handful of partitions.

The query class that most obviously benefits from local partitioning is mixed text and structured, with structured condition on the partition key. Finding news articles from this week that contain some phrase, for instance. With global indexes, mixed queries work by getting all the text hits, then, for each hit, evaluating the structured condition. If most of the text hits don't match the structured condition, ConText does a lot of wasted work.

With local indexes, performance can be improved using partition pruning. Partition pruning acts as a first-level filter, avoiding query on those partitions that will never meet the structured criteria -- this results in a higher text hit to structured hit ratio, and less wasted work in ConText. For instance:

  select /*+ FIRST_ROWS(50) */ 12345 
    from ptest
   where contains(b, 'EVRY') > 0
     and a between 90000 and 100000
     and rownum < 50;

With a global ConText index, the query plan is:

  1.1 COUNT STOPKEY
    2.1 TABLE ACCESS BY GLOBAL INDEX ROWID PTEST
      3.1 DOMAIN INDEX  PTESTX

The ConText index produces a list of rowids in the table which contain the token EVRY. Since every row contains this token, this results in 100,000 rowids, in docid order. For each of those rowids, the kernel does a table access by rowid to get the value of column A. In this testcase, the rowids resulted from the ConText index happen to come back in ascending A order. As a result, the ConText index wastes time producing 90,000 rowids that are just thrown away by the structured condition filter. (This is only for a domain index scan. This is not true of a functional invocation, though that has its own problems, discussed in "Sort by Partition Key" below)

With a local partitioned ConText index, the query plan does not look much different from the global index:

  1.1 COUNT STOPKEY
    2.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
      3.1 DOMAIN INDEX  PTESTX

And, in fact, it works pretty much the same -- ask the ConText index for a list of rowids containing EVRY, then for each rowid, do a table access to get the value of A, and ensure that the value is between 90,000 and 100,000.

The difference in this case is that partition pruning has occurred. Since we have a local partitioned index, instead of one big index, we have 10 small indexes. The optimizer is smart enough to know that 9 of them will never get any hits (because of the structured condition on A, the partition key) so it only asks p10. All of the rowids produced by p10 fulfill the structured condition, so none are thrown away.

Time for global index: 4.07. Time for local partitioned index: 0.09

The example above prunes to a single index partition, but this technique can still be useful with multiple partitions. The query plan simply iterates over the selected index partitions. For instance:

  select /*+ FIRST_ROWS(50) */ 12345 
    from ptest
   where contains(b, 'EVRYS') > 0
     and a between 80000 and 100000
     and rownum < 50;

Involves two partitions, using the query plan:

1.1 COUNT STOPKEY
  2.1 PARTITION RANGE ITERATOR 
    3.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
      4.1 DOMAIN INDEX  PTESTX

EVERYS is a special token just for this test, which occurs in every row from 1-79999, and then every 400th row after that. This lopsided distribution is to force the partition query to hit two partitions.

Time for global index: 3.59. Time for local partitioned index: 0.20

We are scanning multiple partitions, so there's increased overhead, as the section "Simple Text Queries" above illustrated. The query takes about twice as long as the pruned-to-single-partition query above (which took 0.09). However, scanning two index partitions is still a lot better than producing all the text hits and filtering out the ones that don't match the range.

In order for true partition pruning to happen, the query must be parsed, and the structured condition must be literals. If the structured condition uses bind variables, the optimizer will be forced to produce a plan that iterates over all index partitions. However, our informal experiments showed that at execution time there was no performance difference between the pruned query plan and the iterator query plan. Evidently, the SQL execution engine's partition iterator recognizes that certain partitions will not yield hits, and skips over them without opening an index scan. So, using bind variables should not be cause for alarm. If you want to be sure, you can use literal values, or use partition-extended tablenames in your select.

Score Sorting

Score-sorted queries are generally slower on local partitioned indexes than on global indexes. The need to scan multiple partitions vs. a single index makes it slower just to produce the results, as we saw above in "Simple Text Queries", but the major factor is the sorting difference.

With a global index, ConText can "push down" the sort condition and return the rows in sorted order:

  select /*+ FIRST_ROWS(50) */ 12345 
    from ptest
   where contains(b, 'TM07', 1)>0
     and rownum < 50
   order by score(1) desc;

  1.1 COUNT STOPKEY
    2.1 TABLE ACCESS BY GLOBAL INDEX ROWID PTEST
      3.1 DOMAIN INDEX  PTESTX

The contains produces the 10,000 rows containing TM07, sorts them internally, then returns them in order.

A local partitioned index, on the other hand, is split up, so we cannot produce a list of all the hits and sort internally. Instead, the local index scans every partition individually, then does a regular SORT ORDER BY on the amassed results:

  select * from (
    select /*+ FIRST_ROWS(50) */ 12345 from ptest
     where contains(b, 'TM07', 1)>0
     order by score(1) desc
  ) where rownum < 50

  1.1 COUNT STOPKEY
    2.1 VIEW
      3.1 SORT ORDER BY STOPKEY
        4.1 PARTITION RANGE ALL
          5.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
            6.1 DOMAIN INDEX  PTESTX

The SORT ORDER BY makes the query much slower. The global index query runs in 1.19 sec, while the local index takes 7.87 seconds. Also, the query syntax has to be changed to incorporate a view, because of the different semantics introduced by using a sort instead of internal ordering.

Because of this effect, if your application does a lot of text-only, score- sorted queries, local indexes are not a good choice. If you have score-sorted mixed query that can do partition pruning, however, local indexes can offer performance advantages even with the sort order by. For instance, taking the same TM07 text query, but adding a two-partition condition:

  select /*+ FIRST_ROWS(50) */ 12345 
    from ptest
   where contains(b, 'TM07', 1)>0
     and a between 80000 and 100000
     and rownum < 50
   order by score(1) desc;

The global index does a lot of wasted work producing and sorting rows that get filtered out by the text condition. The global index here takes 3.61 sec, while the local index finishes in only 1.31 sec, thanks to partition pruning.

Even better, if your structured condition prunes down to a single partition, then the score sorting WILL be pushed down:

  select /*+ FIRST_ROWS(50) */ 12345 
    from ptest
   where contains(b, 'TM07', 1)>0
     and a between 90000 and 100000
     and rownum < 50
   order by score(1) desc;

  1.1 COUNT STOPKEY
    2.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
      3.1 DOMAIN INDEX PTESTX

Note the absence of the SORT ORDER BY. For this query you get the benefits of both partition elimination and ConText score sorting. The time for the global index is 3.51 seconds, while the local index takes only 0.56 seconds. Again, the order by score pushdown occurs only when you prune to a single partition.

Sort by Partition Key

Another situation where local partitioning can improve performance is sort by partition key. Selecting news articles that contain some phrase, with most recent articles first, for instance. Or, in our example:

  select /*+ FIRST_ROWS(50) INDEX(ptest ptestx) */ 12345 from ptest
   where contains(b, 'TM07') > 0
   order by a

We've chosen TM07 here as a medium-selectivity text query, which is the most problematic for this query class. A very selective text query produces few hits which can be sorted very quickly, while a very unselective text query produces so many hits that it's faster to do an ordered b-tree scan, invoking contains as a function for each row -- you'll only have to do a few to fill the 50-row buffer. Medium-selectivity text queries produce too many rows to sort quickly, but too few rows to drive off the b-tree. (the example sql includes an INDEX hint to force driving from the domain index scan)

With a global index, the ConText index produces all the hits, then does a SORT ORDER BY to sort them by the structured column. In this case, we have to produce and sort 10,000 rows:

  1.1 SORT ORDER BY
    2.1 TABLE ACCESS BY GLOBAL INDEX ROWID PTEST
      3.1 DOMAIN INDEX  PTESTX

This takes 2.34 sec for the first 50 rows.

With a local partitioned index, it is more efficient to do an ordered partition iteration. The base table is range-partitioned, so the value of A for any row in partition p1 is guaranteed to be less than the value of A for any row in partition p2. Therefore, any hits from partition p1 are guaranteed to be first in the result set when you order by A, and so it is correct to produce, sort, and result the hits from partition p1 before touching any other partition. This is an ordered partition iteration, and the query plan looks like this:

  1.1 PARTITION RANGE ALL
    2.1 SORT ORDER BY
      3.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
        4.1 DOMAIN INDEX  PTESTX

The SORT ORDER BY is under the partition iterator. This indicates that it is sorting on a partition level, rather than a global level. This is actually more efficient, because it is faster to sort smaller batches (sort is O(n lg n) if you can recall those dimly-remembered computer science lectures), but the primary savings in this case are because we can saturate our 50-hit buffer with the results from the first partition alone. This query takes only 0.20 sec for the first 50 hits, something like 1200% faster than global's 2.34 sec.

On the other hand, with very selective queries, you may have to hit multiple partitions, and we have already seen how this can put the local index at a disadvantage. Changing the query to KSPC (1/1000 selectivity) forces the partition iterator to hit 5 of 10 partitions, and its performance -- 0.34 sec is more than the global index's 0.30 sec. However, the ordered iteration may still be a good all-around plan -- it's much faster on the queries that are now slow, and a little slower on the queries that are already fast, so you should see an overall improvement as long as the number of partitions is not large.

One note of caution: do not use an inline view to do the top-n hit buffering. Something like this:

  select * from (
    select /*+ FIRST_ROWS(50) INDEX(ptest ptestx) */ 12345 from ptest
     where contains(b, 'TM07') > 0
      order by a
  ) where rownum < 50
Produces a very different query plan:

  1.1 COUNT STOPKEY
    2.1 VIEW
      3.1 SORT ORDER BY STOPKEY
        4.1 PARTITION RANGE ALL
          5.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
            6.1 DOMAIN INDEX  PTESTX

Note that the SORT ORDER BY is above the partition iterator. This means a global sort, and the timing for this -- 2.63 sec -- is slower than the global index. This seems to be a quirk in query plan parsing and optimization, and is currently being investigated (see bug 1800113).

Parallel Query

While local b-tree index queries can be run in parallel, parallel contains queries are not supported in this version. To illustrate, let's modify our test table with a numeric column C and its own local b-tree index:

  alter table ptest add (c number);
  begin for i in 1..100000 loop
    update ptest set c = mod(i, 113) where a = i;
  end loop; end;
  create index ptestc on ptest(c) local;

We can use the PARALLEL_INDEX hint to force parallel execution of this query:

  select /*+ INDEX(ptest ptestc) PARALLEL_INDEX(ptest, ptestc, 4) */ 12345
    from ptest
   where c = 111
     and a between 20000 and 100000;

  1.1 PARTITION RANGE ITERATOR                   PARALLEL_COMBINED_WITH_PARENT
    2.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST  PARALLEL_TO_SERIAL
      3.1 INDEX RANGE SCAN PTESTC NON-UNIQUE     PARALLEL_COMBINED_WITH_PARENT

Unfortunately, parallel query is not supported by the extensible indexing framework in this version:

  select /*+ PARALLEL_INDEX(ptest, ptestx, 4) */ 12345
    from ptest
   where contains(b, 'KSPC')>0
     and a between 20000 and 100000;

  1.1 PARTITION RANGE ITERATOR
    2.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
      3.1 DOMAIN INDEX  PTESTX

Note the lack of any parallel comments in the plan. This capability might be added by the kernel's extensible indexing framework development in some future version.

Query Considerations Summary

These tests show that local partitioned indexes are much faster than global indexes as long as you are leveraging the partitioned nature of the base table, either through significant partition pruning or ordered partition iteration. Plain text queries are much slower because of the overhead incurred by scanning multiple partitions -- especially if the number of partitions is large.

You might therefore conclude that, if your application were doing plain text queries or mixed queries that do not involve the partition key, that global indexes would be much better. However, do not overlook the maintenance advantage of local partitioned indexes (discussed in the following sections). Overall application performance (including admin and indexing time) might still be better with local indexes, especially if the number of partitions is small.



Local Partitioned Index

Table Partition Operations
Partitioned Table Maintenance
Index Maintenance
Query Services
Document Services

Table Partition Operations

A major benefit of using partitioned tables is easier maintenance. To quote the concepts manual, "partitions enable data management operations like data loads, index creation, and data purges at the partition level, rather than on the entire table, resulting in significantly reduced times for these operations".

But what about the indexes? When you do partition-level DDL, a global ConText index built on the table is a monolithic entity, and knows nothing about table partitions. The only option is for the whole index to become unusable, so you have to rescan the entire table and rebuild the index from scratch -- even though most of the data has not changed.

A local partitioned index, on the other hand, can take more limited and appropriate action. If you truncate a table partition, for instance, the index simply truncates the corresponding index partition's internal tables automatically -- no extra rebuild time required. At worst, just the affected index partition is marked unusable. That partition must be rebuilt (which takes less time than a full rebuild because it has just a portion of the data) but the other index partitions are still available and can be queried. Rebuilding index partitions is covered below in the section "Rebuilding UNUSABLE Index Partitions".

We'll now examine the effects of the table partition operations individually.

RENAME PARTITION

  alter table ptest rename partition p5 to pnew;

This renames a table partition. It has no effect on the index. In fact, the index is not even notified when this happens. The index partition name does not change to match the table partition name. To rename an index partition, use alter index:

  alter index ptestx rename partition p5 to pnew; 

TRUNCATE PARTITION

  alter table ptest truncate partition p5;

This removes all data in a table partition. The internal tables of the affected index partition are truncated. The index partition remains valid and usable.

DROP PARTITION

  alter table ptest drop partition p5;

This removes the data and definition of a table partition. The corresponding index partition's meta-data and internal tables are dropped. The remaining index partitions are not affected. The index partition id's ( ixp_id in view ctx_user_index_partitions), which are normally contiguous, will have a hole. This doesn't have much effect, and the id will get reused if partitions are added (through ADD or SPLIT).

ADD PARTITION

  alter table ptest add partition p11 values less than (110000);

This creates a new table partition. This new table partition will be empty, because with RANGE partitioning (the only method currently supported) you can add partitions only at the end of the range, where there will be no data. The ConText index automatically creates a new index partition, creating the meta-data and internal tables. The new index partition will be valid and can be queried, although there is no data, because the underlying table partition is empty.

The name of the new index partition will be the same as the new table partition name. If that name is already used by an existing index partition, a system-generated name is assigned instead.

You cannot control the storage characteristics of the new index partition. It will use the default storage values for the index. You can override this by rebuilding the partition.

MOVE PARTITION

  alter table ptest move partition p5 tablespace testing;

This reconstructs the table partition's data. For example, the command above moves p5 to a new tablespace. Because ROWIDs and object ids might change because of this operation, the corresponding index partition will be marked as UNUSABLE. The internal tables will still exist and contain the same data, but the index partition cannot be queried until it is rebuilt. The index has to be rebuilt from scratch even though the data has not changed. This is because there is no way to determine the "old" rowid of each row, so it is not possible to simply rebuild the $R and $K tables.

SPLIT PARTITION

  alter table ptest split partition p5 at (45000)
    into (partition p5_1, partition p5_2);

This takes an existing partition and splits it into two adjacent partitions. The ConText index first drops the existing partition's meta-data and internal tables, then creates two new index partitions. The internal tables for these new index partitions will be created, but they will be empty.

After the operation, the space taken up by the old index partition is free, and the two new index partitions are empty and UNUSABLE -- they cannot be queried until they are rebuilt. (see Rebuilding UNUSABLE Index Partitions, below)

The two new partitions will have the same name as their table partitions. If an existing index partition already has that name, a system-generated name is assigned instead.

You cannot control the storage characteristics of the two new index partitions. They will use the default storage values for the index. You can override this when rebuilding the partitions.

MERGE PARTITION

  alter table ptest merge partitions p2, p3
    into partition p2and3;
This takes two neighbor partitions and merges them into a single partition. It's the opposite of SPLIT. The local ConText index drops the meta-data and internal tables for the two old index partitions, then creates a new index partition, with empty internal tables. The new index partition is UNUSABLE and cannot be queried until it is rebuilt. (see Rebuilding UNUSABLE Index Partitions, below)

The new index partition will have the same name as the table partition. If an existing index partition already has that name, a system-generated name is assigned instead.

You cannot control the storage characteristics of the new index partition. It will use the default storage values for the index. You can override this when rebuilding the partition.

EXCHANGE PARTITION

  alter table ptest exchange partition p7 with table p7_ext;

This is the most involved of the partition operations. It takes a specified table partition and a non-partitioned table and swaps their data. If you do not specify INCLUDING INDEXES, as above, or you explicitly exclude indexes:

  alter table ptest exchange partition p7 with table p7_ext
    excluding indexes;

then the corresponding index partition is marked as UNUSABLE and it must be rebuilt. If a ConText index exists on the non-partitioned table, then that is also marked as UNUSABLE and must be rebuilt.

If you include indexes:

  alter table ptest exchange partition p7 with table p7_ext
    including indexes;

Then the index partition will be automatically rebuilt. If there's a ConText index on the non-partitioned table, that will also be rebuilt. Since you will be re-creating two ConText indexes, this may take some time.

The local index and the index on the non-partitioned table are rebuilt with their existing preferences. It will not synchronize or reconcile divergent preferences, so you might find queries on the data return different results after the move.

Rebuilding UNUSABLE Index Partitions

Certain table partition operations cause associated index partitions to become UNUSABLE. When an index partition is UNUSABLE, it cannot be queried. Queries that involve that partition -- including global queries where no partition pruning occurs -- will fail. For example:

  alter table ptest move partition p5 tablespace testing;
  select * from ptest where contains(b, 'KSPC')>0;
  --> gets ORA-29954: domain index partition is marked LOADING/FAILED/UNUSABLE

Partition-pruned queries will still work, though:

  select * from ptest partition(p4) where contains(b, 'KSPC')>0;
  --> OK

You can find which partitions are UNUSABLE by querying USER_IND_PARTITIONS:

  select index_name, partition_name
    from user_ind_partitions
   where status != 'USABLE'

Once the UNUSABLE partitions have been identified, you must rebuild them using alter index:

  alter index ptestx rebuild partition p5;

Note that multiple partitions of a given index can be queried or even rebuilt at the same time. Alternatively, you can use:

  alter table ptest modify partition p5 rebuild unusable local indexes;

which will go out itself and find all UNUSABLE partitions in all local indexes for partition p5, and rebuild them.

Partition rebuild is also useful to reset storage characteristics of partitions which are implicitly created through split, merge, or add; these use the default storage values for the index when first created:

  alter index ptestx rebuild partition p5 
    parameters ('replace storage p5_store');

The amount of index memory to use when rebuilding the index can also be specified in the parameters string, although the REPLACE keyword must be used:

  alter index ptestx rebuild partition p5 
    parameters ('replace memory 5m');

Concurrency and Table Partition Operations

During a table partition operation, the entire domain index is unusable. For instance, if in session 1 we do:

  alter table ptest move partition p5 tablespace testing;

and, while it is in progress, in session 2 we query:

  select * from ptest where contains(b, 'KSPC')>0;

you would see the query plan:

  1.1 TABLE ACCESS FULL PTEST

and the query itself would get an error:

  DRG-10599: column is not indexed

Even partition-specific queries that do not touch the affected partitions will fail with this error. This is a quirk of the extensible indexing framework -- local b-trees and bitmap indexes do not have this same limitation. This is currently being investigated by extensible indexing framework development.

Most table partition operations are relatively quick, so the window where queries are unavailable should be small. However, be aware of this effect during potentially long-running partition operations -- especially EXCHANGE PARTITION INCLUDING INDEXES, which will force two ConText index rebuilds.

Local Partitioned Index Maintenance

Local partitioned ConText indexes are not fundamentally different from global ConText indexes -- they still have to be sync'ed to reflect new data, and still should be optimized once in a while to defragment. The only real difference is that a local index is made up of many sets of internal tables; each set is independent, so maintenance is on a partition level.

SYNC

CTX_DDL.SYNC_INDEX has been extended to take a partition name:

    PROCEDURE sync_index(
      idx_name  in  varchar2 default NULL,
      memory    in  varchar2 default NULL,
      part_name in  varchar2 default NULL
   );

Actually, the memory parameter is also new, and lets you pass in the index memory size to use for the sync.

For local indexes, you must specify both an index name and index partition name in the idx_name and part_name parameters. The ConText index will then scan dr$pending for rowids covered by that index partition, and update just that index partition's internal tables.

part_name must be specified when idx_name is a local index. This means that, in this version at least, you must know which index partitions need to be sync'ed, and you must issue one sync_index call for each of those. The view CTX_USER_PENDING has been extended with a PND_PARTITION_NAME column that shows the index partition name for each row, so you can find which partitions need sync with:

  select distinct pnd_index_name, pnd_partition_name
    from ctx_user_pending;

We are investigating enhancing sync_index in future versions to enable it to find and sync all out-of-date partitions without user specification.

Only one session can sync a given index partition at a time. However, you can sync different partitions of the same index at the same time. Just open multiple sessions -- either separate SQL*Plus windows, or, even better, use several DBMS_JOBs and multiple SNP slaves. This can improve ingest rates if data flows to multiple partitions, but it won't help something like date partitioning, where new records are all going into the same partition.

OPTIMIZE

Like sync, optimize is also now done at the partition level; so, like SYNC_INDEX, OPTIMIZE_INDEX has been extended to take a partition name:

    PROCEDURE optimize_index(
      idx_name  in  varchar2,
      optlevel  in  varchar2,
      maxtime   in  number    default null,
      token     in  varchar2  default null,
      part_name in  varchar2  default null
    );

For local indexes, you must specify both an index name and index partition name in the idx_name and part_name parameters. Only one session can optimize a given index partition at a time, but you can optimize different index partitions of the same index at the same time.

Each index partition records its own optimization state (it is not in a user view, but is stored in dr$index_partition) so running a time- limited FULL optimization on an index partition will not interfere with any other index partition. When FULL optimize on partition 1 runs out of time, the next FULL optimize on that partition will pick up where it left off, no matter what you do to the other index partitions.

Query Services

All query services (CTX_QUERY package) that require an index name -- COUNT_HITS, EXPLAIN, HFEEDBACK, and BROWSE_WORDS -- have been extended with a part_name argument. When specifying a local index, you must specify an index partition. The operation will be run against the internal tables of that partition only. So, for instance, you can count the hits in a particular partition, but to count hits in the whole table, you must call count_hits multiple times -- once per partition -- and manually add up the individual return values.

We are investigating multi-partition versions of these calls for future release. The results can be calculated only on a per-partition level, so there is a performance impact, but the main stumbling block is that, other than COUNT_HITS, merging multiple partition results is somewhat complicated.

Document Services

There are no changes to document services (CTX_DOC package) for local partitioned indexes. The document services operate on a per-document basis, specifying the document by rowid or primary key. The index is not even used, except to find out the base table and indexed column, and to know which indexing objects and attribute values to use. So, the document services don't really care whether the index is local or global.



New XML Features

Path Indexing

XML Path Searching

Every new version of Oracle Text/interMedia Text/ConText Option has added new structured document searching features, so we've built up a lot of functionality -- 8.1.7 has nested section search, doctype disambiguation, attribute value searching, automatic section indexing, and more. But with the industry embracing XML, demand is growing for even more sophisticated features which are beyond the capabilities of the current section architecture.

9.0.1 introduces a new section type and new query operators which support an XPath-like query language. ConText indexes with XML path searching are able to perform sophisticated section searches that were impossible in previous versions.

The Path Section Group

In order to use XML path searching, the index must be created with the new path section group:

  exec ctx_ddl.create_section_group('mypathgroup','PATH_SECTION_GROUP');

  create index myindex on mytable(doc)
    indextype is ctxsys.context
    parameters ('SECTION GROUP mypathgroup');

You can add only SPECIAL sections to a path section group; you do not have to add ZONE, FIELD, or ATTR sections, and it does not support STOP sections in this version. Most of the time, you won't even need to create your own -- just use the default one:

  create index myindex on mytable(doc)
    indextype is ctxsys.context
    parameters ('SECTION GROUP ctxsys.path_section_group');

The path section group is somewhat like the auto section group in that it automatically indexes all tags and attributes it encounters. For query, you can still use WITHIN, but the path section group also supports the more powerful INPATH and HASPATH operators.

Path Section Internals

Now we'll talk a little about how path sections are indexed, and how they differ from zone and attribute sections. You can skip this section if you're just interested in usage.

We'll illustrate the difference in index data with this simple doc:

  <OUTER><INNER ATTR="attrvalue">text</INNER></OUTER>

The auto section group produces the following in $I:

TOKEN_TEXT TOKEN_TYPE TOKEN_INFO
OUTER 2 (zone section) DOC 1 START 0 LENGTH 1
INNER 2 (zone section) DOC 1 START 0 LENGTH 1
INNER@ATTR 5 (attribute section) DOC 1 START 0 LENGTH 1
ATTRVALUE 4 (attribute value) DOC 1 POS 1
TEXT 0 (plain text) DOC 1 POS 1

A simple within query like:

  select id from mytable where contains(doc, 'text WITHIN inner')>0

can be fulfilled by fetching the info for word TEXT and section INNER, then looping through each word offset of TEXT, and checking to see if it is between INNER START and START + LENGTH. A simple attribute query like:

  attrvalue WITHIN inner@attr

(we'll just write the text query for brevity) can be fulfilled in much the same way, but using type 4 and 5 tokens instead of types 0 and 2.

This section type cannot support more complex queries. For instance, equal sections like INNER and OUTER are a problem. The query:

  (text WITHIN outer) WITHIN inner

matches this document. The index data records the same offsets for INNER and OUTER, so it is impossible to tell if INNER is inside OUTER or vice versa. Another problem is attribute value sensitive section search. For document:

  <SHIPMENT>
    <MEDIA TYPE="dvd">Ice Blue Eyes</MEDIA>
    <MEDIA TYPE="book">The Monopoly Companion</MEDIA>
  </SHIPMENTS>

(We won't include the xml declaration and DOCTYPE stuff -- just pretend that they are there) If you want to find shipments that include the DVD "Ice Blue Eyes", you'd have to include both section and attribute criteria:

  find documents where "Ice Blue Eyes" occurs within a "MEDIA" section
    whose "TYPE" attribute has the value "dvd"

Unfortunately, attribute values and sections (types 4 and 5) are completely separate from normal text and sections (types 0 and 2). There is no way to link an attribute section to the particular zone section occurrence in which it occurs.

The path section group solves both problems by indexing the document like:

TOKEN_TEXT TOKEN_TYPE TOKEN_INFO
OUTER 7 (path section) DOC 1 START 0 LENGTH 1 LEVEL 1
INNER 7 (path section) DOC 1 START 0 LENGTH 1 LEVEL 2
INNER@ATTR 8 (path attr section) DOC 1 START 0 LENGTH 1 OCC 1
ATTRVALUE 4 (attribute value) DOC 1 POS 1
TEXT 0 (plain text) DOC 1 POS 1

Zone sections and attribute sections (types 2 and 5) have been replaced with path sections and path attribute sections (types 7 and 8). Each occurrence of a path section has a LEVEL bit which indicates the nesting level of the tag. The root tag gets a level of 1, its children get a level of 2, their children get a level of 3, etc. Including level information solves the equal section problem, because we can now tell which tag is the outer tag.

Path attribute sections have a OCC part which links the attribute section occurrence to a path section occurrence. Here, INNER@ATTR has an OCC of 1 because it occurs in the first occurrence of INNER. The ability to correlate attributes and tags solves the attribute value sensitive section search.



New XML Features

Path Query

Querying XML Documents

The query interface is through SQL selects, so your XML queries return entire documents, not just selected parts of them. While we work on better extraction features for future release, you can explore using the new XMLType, which has extraction methods. Just use extraction methods in the select list and contains in the where clause.

The INPATH Operator

The way path sections are indexed enables more complicated section searches, but the WITHIN operator is not expressive enough to handle them. Instead of

  <text query> WITHIN <section name>      (e.g. Ice WITHIN MEDIA   )

indexes with a path section group use the INPATH operator in queries:

  <text query> INPATH(<path expression>)  (e.g. Ice INPATH(//MEDIA) )

but it functions in much the same way, limiting the scope of the text query to certain parts of the indexed documents. The parentheses around the path expression are required. The path expression is more than just a simple section name -- it is a mini query, with a specialized query language. The next section explores the path query language in more detail.

You can still use the WITHIN operator even if you are using the path section group. There should be no difference in behavior between the path section group or auto section group when using WITHIN queries.

Path Query Language

The Text path query language is based on XPath, and we will probably continue to use XPath as a guide for future development, but it is NOT XPath. Not all the XPath operators exist in the Text path query language, for instance. Also, the Text path query language operators are case-insensitive, while XPath's are strictly lower-case. There are other semantic differences covered below. Just don't make assumptions about the path language based on XPath expectations.

Tags

When specifying tags in path queries, you must specify it exactly as it appears in the document in order for it to match. There are two commonly-made mistakes you should avoid.

First, tag names are case-sensitive so the query "title" does not match the tag <TITLE> or the tag <Title>. It will match only <title>.

Second, there is no namespace support in this version. Take the fragments:

  DOC 1    <A xmlns:ORCL="/index.html"><ORCL:B> 
  DOC 2    <A xmlns:ORACLE="/index.html"><ORACLE:B> 

<ORCL:B> in DOC 1 is the same tag as <ORACLE:B> in DOC 2, because their namespace tags normalize to the same URI. However, when querying for these tags, you must specify it as written in the document, so "ORCL:B" to find the tag in doc 1, and "ORACLE:B" to find it in doc 2. "B" alone will not find either tag, nor will something like "http://www.oracle.com/:B". Future versions will probably add more sophisticated namespace support.

Top-Level Tag

The simplest INPATH query string is a single tag:

  perro INPATH(TITLE)

Like a WITHIN query, this query finds perro where it occurs between <TITLE> and </TITLE>. However, unlike a WITHIN query, <TITLE> must be the top-level tag. Take these two documents:

  DOC 1    <TITLE>Clifford El Gran Perro Colorado</TITLE>
  DOC 2    <BOOK><TITLE>Años De Perro</TITLE></BOOK>

The query

  perro WITHIN TITLE

will find both documents, but the INPATH query will find only document 1. It does not match document 2 because there the TITLE tag has a level of 2.

What's really happening is that no level for the query node is specified, so it uses the default context, which is always the top level for INPATH queries. You can explicitly specify the top level context with slash:

  perro INPATH(/TITLE)

or explicitly specify the default context using dot:

  perro INPATH(./TITLE)

both are equivalent to the query without the slash. All examples from here will include the top level slash for readability.

Any-Level Tag

A double slash indicates "any number of levels down". So, the query:

  perro INPATH(//TITLE)

is looking for perro inside a TITLE tag that occurs at the top level or any level below. In other words, this query is equivalent to:

  perro WITHIN TITLE

and finds both DOC 1 and DOC 2.

Ancestor/Descendant Searching

A child tag is a tag which is enclosed within another tag. For instance, in:

  DOC 2    <BOOK><TITLE>Años De Perro</TITLE></BOOK>

TITLE is a child of BOOK. We can find this document using the any-level tag searching, as in the previous section. But what if the corpus also contained:

  DOC 3    <MOVIE><TITLE>Mi vida como un perro</TITLE></MOVIE>

In order to find only books with perro in the title, we need to limit the search to title tags whose parent is a book tag:

  perro INPATH(/BOOK/TITLE)

Reading the path right-to-left, we are looking for a top-level BOOK tag with a child TITLE tag, which matches only DOC 2.

The single slash is direct parentage. The query above will not find:

  DOC 4    <BOOK><DESCRIPTION>
             <TITLE>Años De Perro</TITLE>
           </DESCRIPTION></BOOK> 

Because here TITLE is not a direct child of BOOK. TITLE's direct parent is DESCRIPTION, whose parent is BOOK -- TITLE is a grand-child of BOOK. To find this doc, you can use the any-level slashes:

  perro INPATH(/BOOK//TITLE)

Reading the path right-to-left, we are looking for a top-level BOOK tag with some descendant TITLE tag. This query will match both DOC 3 and DOC 4. Note that this is not the same as:

  ((perro WITHIN TITLE) WITHIN BOOK)

First, the INPATH query restricts BOOK to the top-level. Second, equal sections are not confused. That is, the query:

  ((perro WITHIN BOOK) WITHIN TITLE)

would match DOC 4, but the query:

  perro INPATH(/TITLE//BOOK)

would not. Path sections know that TITLE is a child of BOOK, even though they occur at the same text offsets.

Finally, if you wanted to match only DOC 4 and not DOC 3 -- that is, you want to match TITLE only if it is a grandchild of BOOK, and not a child or great grandchild, etc. -- you can use the single level wildcard:

  perro INPATH(/BOOK/*/TITLE)

The * matches exactly one level, so this path query filters out DOC 3.

You can combine these ancestor/descendant elements for even more complicated queries:

  felis INPATH(//kindgom/*/*/order/family//genus)

Attributes

You can search within an attribute value using the syntax <tag>/@<attribute>:

  perro INPATH(//MOVIE/@SPANISHTITLE)

matches:

  DOC 5    <MOVIE SPANISHTITLE="Mi vida como un perro">My Life As A Dog</MOVIE>

and is equivalent to the query:

  perro WITHIN MOVIE@SPANISHTITLE

One limitation resulting from how attributes are indexed is that all attributes must specify their direct-parent tags. The following:

  perro INPATH(//@TITLE)
  perro INPATH(A/*/@TITLE)

are not allowed, because the tag for the title attribute is not specified:

  select * from doc where contains(text, 'perro INPATH(//@TITLE)')>0;
                                         *
  ERROR at line 1:
  ORA-20000: Oracle Text error:
  DRG-50951: Unable to resolve element name for attribute TITLE

Existence Test

The square brackets are used to impose a condition on a node without changing the path context. For instance, the query:

  monopoly INPATH(/auction[image])

is looking for monopoly inside a top-level auction tag which has an image tag as a direct child. The search for monopoly occurs within the entirety of <auction> and </auction>, and not just within <image> and </image>. This document will match:

  <auction>Sailing Monopoly <image src="...">pic</image></auction>

but will not match:

  <auction>Sailing Monopoly</auction>    

because there is no image element. The default context inside a test element is the tag to which it is applied, so

  monopoly INPATH(/auction[image])

is actually the same as:

  monopoly INPATH(/auction[./image])

You need the dot to reference the default context. Without the dot:

  monopoly INPATH(/auction[/image])

it would mean top-level image tag. This is not supported, and will result in a syntax error.

The existence test for image will match only if image exists and is a direct child. It does not match:

  <auction>Sailing Monopoly<desc><image src="...">pic</image></desc></auction>

because here image is not a direct child of auction. You can match this document using the any-level wildcard, instead:

  monopoly INPATH(/auction[.//image])
You can also test for attribute existence:

  monopoly INPATH(/auction[@reserve])

The test node can be combined with other operators for interesting searches:

  monopoly INPATH(/auction[.//image]/title)

The test node does not change context, so the /title applies to /auction rather than /auction//image -- this query finds auctions where monopoly occurs inside a direct-child title tag, but only if the auction has an image tag in it somewhere. For instance, the doc:

  <auction>
    <title>Sailing Monopoly</title>
    <description>
       New Sailing Monopoly with custom pewter tokens from USAOpoly
       <image src="...">here is a picture</image>
    </description>
  </auction>

To test for non-existence, use the NOT operator:

  monopoly INPATH(/auction[not(taxinfo)])

this query looks for monopoly within an auction element that does not have a direct child taxinfo. The NOT operator is case-insensitive in our path query language. In XPath it only works in lowercase.

Attribute and Tag Value Equality

The test operator is capable of more than simple existence testing. More useful is attribute value testing, which contrains nodes by the value of their attributes. For inst