
Oracle Text - Technical Overviews

This document brings together all the technical overview documents that were issued with previous versions of Oracle Text.

The original purpose of the technical overviews was as follows:

  • Collect all the new features and only the new features in one document so that expert users can catch up just by reading this instead of reading the whole doc and mentally diff'ing.
  • Sketch each new feature without covering every option, bell, and whistle. Give the reader enough information to decide if the new feature merits further investigation in the main doc.
  • Include information, observations, and notes on each feature which may not be suitable for documentation. For instance, opinions on the pros and cons of feature X, the internal mechanisms of feature Y, speculation of how feature Z could be useful, etc.
While some of this information is a little dated, it is nevertheless a valuable addition to the official documentation. You can browse through the topic areas at the beginning of each chapter, or simply search the entire document for any topics you are interested in.

Chapter List

These overviews are often somewhat more technical than the official Oracle Text documentation, including insights into implementation details and tricks that can be done with the features which wouldn't normally be considered standard functionality.

Be aware that when internal tables or data structures are mentioned these may (and sometimes do) change in later versions of the product. If such tables or data structures are not mentioned in the official documentation, they should NOT be accessed directly (for reading or writing) by any application.

Chapter 1 : Initial Release: Oracle 8.1.5

Oracle8i interMedia Text 8.1.5 - Technical Overview

Note: In this release, Oracle Text was known as interMedia Text. The name is maintained in all Oracle 8i sections of this document.
  1. Introduction to interMedia Text
  2. Installation and Setup
  3. Indexing
  4. The Preference System
  5. Datastores
  6. Filters
  7. Section Groups
  8. Lexers
  9. Other Indexing Classes
  10. DML
  11. Optimization
  12. Querying
  13. Query Operators (Part 1)
  14. Query Operators (Part 2)
  15. Thesaurus
  16. Query Tuning
  17. Query Feedback
  18. Highlighting
  19. Linguistic Extraction

Introduction to interMedia Text

What does it do?

interMedia Text extends Oracle8i by indexing any text or documents stored in Oracle8i, in operating system flat files, or at URLs. It also enables content-based queries (e.g., find text and documents which contain this word) using familiar, standard SQL. It enables file-based text applications to use Oracle8i to manage text and documents in an integrated fashion with traditional relational information.

The easiest way to grasp what it does is to see it in action...

How do you use it?

Here is a quick example. We'll start by creating a table to hold our documents. For now, we'll use a small varchar2 column to hold the document text. We'll also need a primary key to identify each document.

    create table docs (id number primary key, text varchar2(80)); 
Nothing special about that. Let's populate it with two example documents:
    insert into docs values (1, 'first document');
    insert into docs values (2, 'second document');
    commit;
Now comes the fun part -- we build a Text index:
    create index doc_index on docs(text)
    indextype is ctxsys.context;

This is, for the most part, familiar DDL -- the difference is the indextype clause, which instructs Oracle to build a Text index instead of a regular, b-tree index.

Once the index has been created, we can run content-based queries on our huge database of two documents using the contains function:

    select id from docs
     where contains(text, 'first') > 0;

This will find all rows in docs where the text column contains the word "first" (which is document 1, for those not following closely). The > 0 part is necessary to make it legal Oracle SQL, which does not support boolean return values for functions (at this time).

Integration Architecture

interMedia Text is more tightly integrated with Oracle8i than previous versions of ConText were. For those of you familiar with those versions, notice that in the previous example no ctxsrv was required. The result is a specialized "domain index" which works like any b-tree index. It's all transparent to the user -- Oracle8i is simply extended to support text. Unlike previous versions of Oracle ConText, there are no servers to start up, there is no query rewrite, and index creation is done through familiar SQL rather than through a custom PL/SQL interface.

Installation and Setup

Migrating from Previous Versions

Due to significant architectural changes, interMedia Text is not backward-compatible with previous versions of Oracle ConText. Existing installations and applications need to be migrated to work with Oracle8i. There is a manual which details the steps of this process. Please remember that some of those steps need to be completed on the existing system before you install Oracle8i.

Installation

By default, the Oracle interMedia installation procedure installs interMedia Text. interMedia Text can also be installed separately, if you choose a custom install.

Important Files

?/bin/ctxsrv This may be renamed ctxsrv8 or some such. Oracle8i still supports the ctxsrv server. However, the only valid personality mask is M. You need to run it only when you want background DML. Batch DML (sync) does not require a running server. We'll go into DML in more depth later.
?/ctx/admin Data dictionary scripts. See below for details.
?/ctx/admin/defaults Language-specific default preferences. See below, and an upcoming issue on the preference system for more details.
?/ctx/data This directory has data files needed by interMedia Text, which includes the linguistic lexicon files, the Xerox stemming files, and the Korean lexer dictionary. If you get strange internal errors in linguistics, stemming, etc. check this directory.
?/ctx/bin This directory is for user-defined filters, which we'll talk about later. This directory should also contain the ctxhx program which is used by the INSO filter.
?/ctx/lib This directory has platform-specific shared libraries and template files used by the INSO filter. It also contains the safe callout used by indexing and document services -- libctxx8.so (name will change from platform to platform -- look for "ctxx")

Data Dictionary Installation

interMedia Text is integrated with the Oracle Database Creation Assistant (DBCA), so the ctxsys data dictionary should normally be installed with that tool. If a manual installation is required, follow these steps.

  1. Before starting, make sure that:
    • The interMedia Text software tree is installed
    • The database does not have a ctxsys user
    • Current directory is ?/ctx/admin
    • You can sqlplus internal

     

  2. Create the ctxsys user. Pass it the ctxsys password, default tablespace, and temporary tablespace as arguments.
             sqlplus internal @dr0csys <password> <def_tblspc> <tmp_tblspc>
    
  3. Install the data dictionary:
             sqlplus ctxsys/<password> @dr0inst <ctxx_library>
    
    The argument is the full path to the ctxx library, for example:
             sqlplus ctxsys/<password> @dr0inst /some_directory/app/oracle/product/8.1.5/ctx/lib/libctxx8.so
    
    On Unix, you must not use the environment symbol $ORACLE_HOME. You have to use the actual physical path.
     
  4. Install appropriate language-specific default preferences. There are forty-odd scripts in ?/ctx/admin/defaults which create language-specific default preferences. They are named in the form drdefXX.sql, where XX is the language code (from the Server Reference Manual).

    To install the US defaults, for instance:

             sqlplus ctxsys/<password> @defaults/drdefus.sql
    
    We'll talk more about this in the upcoming preference system issue.

    After these steps, interMedia Text should be installed and working.

Post-Installation Setup

If this database was an existing ConText site, make sure to remove text_enable from the init.ora. It is no longer used in Oracle8i, and will actually prevent Oracle8i from operating properly -- you'll get errors like "cannot find package DR_REWRITE".

Finally, ensure that the Net8 listener is running and is configured to invoke external procedures. A brief description of the process is below; complete details are in the Oracle8i Server Administrator's Guide.

     

  1. Add an entry to the tnsnames.ora:
          extproc_connection_data =
            (DESCRIPTION =
              (ADDRESS = (PROTOCOL = ipc)
                         (KEY = DBSID))
              (CONNECT_DATA = (SID = ep_agt1)))
    DBSID is the database SID. ep_agt1 can be named anything. extproc_connection_data should not be changed.

     

  2. Add the following to the listener SID_LIST:
          (SID_DESC = (SID_NAME = ep_agt1)
                      (ORACLE_HOME = /oracle)
                      (ENVS = LD_LIBRARY_PATH=/oracle/ctx/lib)
                      (PROGRAM = extproc))
    ep_agt1 matches the CONNECT_DATA SID for extproc_connection_data in the tnsnames.ora.
    The PROGRAM section tells the Net8 listener to start the external procedure process.
    The ENVS section, which is shown here for UNIX, will ensure that the environment includes ?/ctx/lib in LD_LIBRARY_PATH. This is needed so that indexing can use the INSO filters.
    On NT, you may need to have ORACLE_HOME set in this section as well.

     

  3. Since the extproc_connection_data ADDRESS section specifies ipc, make sure that the ADDRESS_LIST of listener.ora accepts ipc connections.

    A quick way to test the Net8 configuration is to do:

        exec ctx_output.start_log('log') 
    from SQL*Plus. If you get the error:
        DRG-50704: Net8 listener is not running or cannot start external procedures 
    then things aren't set up correctly. Some of the things to check:
    • listener is not running
    • listener.ora is not configured for extproc
    • need to reload the listener
    • tnsnames.ora is not configured for extproc

Indexing

Creating Indexes

The first step in using interMedia Text is to create a Text index. Without a b-tree index, value queries are slower; without a Text index, contains queries are simply not possible. As we've seen, the index is created using the create index command:

  create index INDEXNAME on TABLE(COLUMN)
  indextype is ctxsys.context

 

Unlike previous versions of ConText, there is no separate policy creation step. We'll talk more about this when we discuss the preference system. View indexing is not allowed in Oracle8i, consistent with regular b-tree indexes. Parallel index creation is also not supported in this first Oracle8i version.

Composite indexes are not supported -- only one column is allowed in the column list. This column must be one of the following types: CHAR, VARCHAR, VARCHAR2, LONG, LONG RAW, BLOB, CLOB, BFILE. Date, number, and nested table columns cannot be indexed. Object columns also cannot be indexed, but their attributes can be, provided they are atomic datatypes.

The table must also have a primary key constraint. This is needed mainly for identifying the documents for document services, and may be used in the future for other purposes. Composite primary keys are supported, up to 16 columns.

The issuing user does not need the ctxapp role to create an index. If the user has Oracle grants to create a b-tree index on the column, then they have sufficient permission to create a Text index.

Unlike previous versions of ConText, the issuing owner, table owner, and index owner can all be different users, just like regular b-tree indexes.

IMPORTANT: If a syntax error occurs in the create index statement, the index is still created. This is different from regular b-tree indexes; before you reissue the corrected statement, you must drop the failed index first.

If an error occurs during actual indexing (e.g. you run out of tablespace) then you can pick up where you left off (after correcting the problem, of course) using alter index:

  alter index INDEXNAME rebuild parameters ('resume') 

Storage clauses to the create index statement are not used. Index storage parameters are provided using the STORAGE object, which we will discuss later. Partitioning clauses are also not supported at this time, although you can manually partition the index tables if you wish, again using the STORAGE object.

Once the index is created, any export will include the index definition. At import time, imp will re-create the index by issuing the create index statement.

The Indexing Pipeline

Oracle8i detects that this is a Text index and performs the text indexing. The index is created using a pipeline of steps - "The Indexing Pipeline".

The diagram below shows the indexing pipeline:

 

  datastore --(doc data)-------> filter --(filtered doc text)--> sectioner
  sectioner --(section offsets)--------------------------------> engine
  sectioner --(plain text)-----> lexer --(tokens)--------------> engine
  engine    --(index data)-----> database
  database  --(column data)----> datastore
Let's step through this pipeline, stage-by-stage. Each stage will be covered in depth later, so they'll be described here briefly.

Start with the datastore. This stage loops over the rows of the table and reads the data out of the column. Unlike previous versions, the rows are not read in any particular order. The datastore passes out document data. Usually, this will simply be the column data, but some datastores use the column data as a pointer to the document data. The URL_DATASTORE, for instance, uses the column data as a URL, does a GET, and passes out the returned data.

The filter takes the document data and transforms it to some kind of text representation. This is needed when storing binary documents such as Word or Acrobat files. Unlike previous versions, the output of the filter does not have to be plain text -- it can be a text format such as XML or HTML.

The sectioner, new for Oracle8i, takes the output from the filter, and converts it to plain text. There are different sectioners for different text formats, including XML and HTML. Conversion to plain text includes detecting important section tags, removing "invisible" information, and reformatting the text.

The lexer takes the plain text from the sectioner and splits it into discrete tokens. We have lexers for whitespace-delimited languages, and specialized lexers for Asian languages, where segmentation is quite a bit more complex. The basic lexer also includes theme functionality to build unified text/theme indexes.

Finally, Oracle8i takes all the tokens from the lexer, the section offsets from the sectioner, and a list of low-information words called the stoplist, and builds an inverted index. An inverted index stores tokens, and the documents in which these tokens occur. For instance, our example from the introduction:

  insert into docs values (1, 'first document');
  insert into docs values (2, 'second document');
would produce an inverted index like this:
  DOCUMENT  ---> doc 1 position 2, doc 2 position 2
  FIRST     ---> doc 1 position 1
  SECOND    ---> doc 2 position 1
Each document is assigned an identifier called a docid, which is used in the inverted index. The document primary key isn't stored.

Logging

The time needed to create an index is a function of the amount of text being indexed. Oracle8i can create a log file which can be monitored during indexing. start_log starts the logging, and should be called before issuing the create index, in the same session:

  ctx_output.start_log('mylog'); 
This will create a mylog file in ?/ctx/log which can be used to monitor indexing progress. The directory can be changed by ctxsys, using the system parameter LOG_DIRECTORY. See documentation for details.

IMPORTANT: In this first release of interMedia Text, ?/ctx/log is not created at install time. You may get errors like "unable to open file". Simply have the DBA create this directory.

Logging is halted when the session exits or when end_log is called:

  ctx_output.end_log; 

Errors During Indexing

Processing errors encountered during indexing -- filter errors, file cannot be read, etc. -- do not bring indexing to a halt. Instead, the error is logged and indexing continues to the next file.

You can see these logged errors in the ctx_user_index_errors view. It will tell you the rowid of the failed document, and the error which was encountered. You need to manually empty this table when you are finished reviewing the errors:

    delete from ctx_user_index_errors; 

The Preference System

Classes, Objects, Preferences

The preference system allows you to customize Text indexing. Each stage of the indexing pipeline is really an interface into which customized objects can be plugged in. These stages are called "classes", and they can be found in the ctx_classes view:

DATASTORE Reads column data and returns document data
FILTER Reads document data and returns formatted document text
SECTION_GROUP Reads formatted document text and returns section offsets and plain document text
LEXER Reads plain document text and breaks it into tokens
WORDLIST Contains query expansion defaults
STOPLIST Contains a list of low-information words
STORAGE Index table storage parameters

Each class has one or more "objects" which are like behavioral templates. The DATASTORE class, for instance, has the following objects:

DIRECT_DATASTORE Passes column data directly as document data
DETAIL_DATASTORE Concatenates multiple rows of a detail table to construct document data
FILE_DATASTORE Interprets column data as a filename. Reads file and passes file contents as document data
URL_DATASTORE Interprets column data as a URL. Performs a GET and passes return as document data
USER_DATASTORE Invokes a stored procedure to synthesize document data

Objects can be found in the ctx_objects view.

But objects are only templates, and cannot be used directly in an index. Instead, you create a "preference" from one of these object templates, customize it by setting "attributes", then use the preferences to create the index. Let's take a closer look:

Creating Preferences

You create a preference using ctx_ddl.create_preference, specifying an object to use as template:

    ctx_ddl.create_preference('mypref', 'FILE_DATASTORE'); 
This creates the preference mypref, using the FILE_DATASTORE object. If you are familiar with previous versions, the object names have been changed, and there is no comment argument.

This is done in PL/SQL, so surround it with "begin" and "end", or use "exec" if you are in SQL*Plus. Also, in order to create preferences, the issuing owner must have the ctxapp role.

Once the preference is created, we can customize it using set_attribute. For our example, let's say all our files are in /docs. We can simply set the PATH attribute to communicate this to interMedia:

    ctx_ddl.set_attribute('mypref', 'PATH', '/docs'); 

If you are familiar with previous versions, set_attribute now comes *after* create_preference, which allows us to do attribute-level validation, and necessitates the first argument, which is the preference name.

This attribute is a text attribute, so it is set to a text value. There are also number attributes and boolean attributes, which should be set to 'TRUE' or 'FALSE' (or 'YES'/'NO'). The attributes for each object and their types can be found in view ctx_object_attributes or in the documentation.

Some attributes have a list of allowed values. The allowed values can be found in the view ctx_object_attribute_lov. Unlike previous versions, LOV attributes can be set using a descriptive "tag" rather than a numeric value.

Non-preference Classes

Preferences are used for all classes except the SECTION_GROUP and STOPLIST classes. These classes use specialized objects and have their own API. We'll discuss this in the respective upcoming issues.

Using Preferences

Once you've created your preferences, you can build all kinds of customized indexes by mixing and matching them. Because all objects of a class adhere to a common interface, any filter can be hooked up to any datastore, etc. Got a bunch of Korean Word 97 files? No problem -- hook up a FILE_DATASTORE preference, an INSO_FILTER preference, and a KOREAN_LEXER preference.

You plug in your preferences using the parameters clause of create index:

    create index doc_index on docs(text)
      indextype is ctxsys.context
      parameters ('datastore mypref')
This tells create index to use mypref as the datastore. Each of the classes has a parameters keyword to name a preference. The keyword is the same as the class name, except for SECTION_GROUP, whose keyword is the phrase "section group". You can specify multiple classes by simply tacking the keyword-preference pairs on the end of the string:
    parameters('datastore mypref filter myfilter') 

Any user can use any preference. To specify a preference in another user's schema, simply add the owner name in the parameters string:

    parameters('datastore kenny.mypref') 

Unlike previous versions, the index makes a copy of the preference settings, which means that after use in an index, the preference can be modified or dropped without affecting the index. Note that only the preference owner can modify or drop it.

The Default System

In the example above, only the datastore class is set. Since preferences were not specified for the other classes in the parameters string, they get their values from the default system.

The default system uses system parameters. System parameters are name-value pairs which apply to the interMedia Text installation as a whole -- kind of our version of init.ora. They can be seen in the view ctx_parameters.

The values for the parameters used by the default system are preference names. For instance, in our example we have not specified a lexer preference. interMedia gets the value of the system parameter DEFAULT_LEXER, and uses the preference named there. This allows the administrator to set the default behavior for the system.

The parameters used by the default system are:

DEFAULT_DATASTORE The default datastore preference. At install time (the default default) it is set to CTXSYS.DEFAULT_DATASTORE, a preference which uses the DIRECT_DATASTORE object.
DEFAULT_FILTER_BINARY This is the filter preference to use when indexing binary datatypes such as RAW, LONG RAW, or BLOB. At install time it is set to CTXSYS.INSO_FILTER, which uses the INSO_FILTER object.
DEFAULT_FILTER_FILE This is the filter preference to use when indexing files, either BFILE or the FILE_DATASTORE. At install time it is set to CTXSYS.INSO_FILTER.
DEFAULT_FILTER_TEXT This is the filter preference to use when indexing text datatypes such as CHAR, VARCHAR2, or CLOB. At install time it is set to CTXSYS.NULL_FILTER, which does no filtering.
DEFAULT_SECTION_HTML This is the section group to use when the filter class uses the INSO filter, or the datastore uses the URL_DATASTORE. The INSO filter filters binary files to HTML (see filters). At install time it is set to CTXSYS.HTML_SECTION_GROUP, which merely converts the HTML to plain text.
DEFAULT_SECTION_TEXT This is the section group to use in all other cases. At install time it is set to CTXSYS.NULL_SECTION_GROUP, which does no sectioning.
DEFAULT_LEXER This is the default lexer preference. At install time it is set to CTXSYS.DEFAULT_LEXER. The definition of that preference is language-specific. See below for more details.
DEFAULT_WORDLIST This is the default wordlist preference. At install time it is set to CTXSYS.DEFAULT_WORDLIST. The definition of that preference is language-specific. See below for more details.
DEFAULT_STOPLIST This is the default stoplist. At install time it is set to CTXSYS.DEFAULT_STOPLIST. The contents of that stoplist are language-specific. See below for more details.
DEFAULT_STORAGE This is the default storage preference. At install time it is set to CTXSYS.DEFAULT_STORAGE, which has no settings.
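For example, an administrator who wants every new index to pick up a custom lexer by default can repoint the corresponding parameter. This is a minimal sketch, assuming the ctx_adm.set_parameter call and a hypothetical my_german_lexer preference that already exists; it must be run as ctxsys:

    -- run as ctxsys: make my_german_lexer the system-wide default lexer
    ctx_adm.set_parameter('DEFAULT_LEXER', 'my_german_lexer');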

Language-Specific Defaults

The LEXER, WORDLIST, and STOPLIST classes are the most affected by the language of the documents. Every installation gets DEFAULT_LEXER, DEFAULT_WORDLIST, and DEFAULT_STOPLIST preferences, but the definition of these preferences will depend on the language chosen at install time.

?/ctx/admin/defaults holds language-specific scripts for each supported language. Based on the language chosen, DBCA runs the matching language-specific script, which creates the default preferences most appropriate for that language.

The result is that a German instance gets a DEFAULT_LEXER which is case-sensitive and does German decompounding, a DEFAULT_STOPLIST which has common German words, and a DEFAULT_WORDLIST with German stemming. A Japanese instance gets different defaults, more tailored to Japanese.

Browsing the Data Dictionary

Here is a list of views useful for browsing the interMedia Text data dictionary:

ctx_classes Available classes
ctx_objects Available objects for each class
ctx_object_attributes Attribute names and types for each object
ctx_object_attribute_lov Allowed values for each LOV object attribute
ctx_preferences Preferences
ctx_preference_values Attribute values of each preference
ctx_user_index_objects Objects used by the user's indexes
ctx_user_index_values Attribute values for the user's indexes
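To poke around, plain selects from SQL*Plus are enough; for example:

    select * from ctx_classes;
    select * from ctx_objects;
    select * from ctx_object_attribute_lov;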

Datastores

The DATASTORE class

The datastore class is responsible for reading the column data from the database, getting the document data based on the column data, and passing that up to the filter.

It is important to keep in mind that some form of the document data must be in a table. Datastore objects which access external resources such as files and web pages still require a table of pointers to those resources.

Datastore preferences are plugged into indexes using the datastore keyword:

    create index
    ...
    parameters ('datastore mydstore');
The datastore class has five objects.

DIRECT_DATASTORE

The direct datastore is the simplest case -- it assumes that document data is stored in the indexed column. It is so simple it has no attributes to customize. There should be no reason to create a preference based on the DIRECT_DATASTORE object -- CTXSYS.DEFAULT_DATASTORE can be used for any index.

IMPORTANT: If you are indexing BFILEs, make sure the user ctxsys has READ permission on the BFILE directory.

FILE_DATASTORE

The file datastore reads the column data as a filename. It opens the file and returns the contents as the document data. The indexed column cannot be LONG or LOB types. The files must be accessible from the database host machine.

This object has one attribute: PATH. PATH is optional, and specifies the directory where the documents are stored.

IMPORTANT: If PATH is specified, the column data must be simple file names; if not specified, the column data must be full path names. You cannot mix and match -- for instance, with PATH of /A, and a column of B/C.TXT it will NOT find /A/B/C.TXT.

On some platforms, PATH can specify multiple directories. If you do this, make sure that your documents have unique names -- if a document C.TXT is in directories /A and /B, and /A is in the PATH before /B, only /A/C.TXT will be indexed.

There is not much of a difference between this object and BFILEs; it's a matter of choice. Like BFILEs, when the file changes, the row will not be automatically reindexed. You should update the column value to itself to trigger a reindex.
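Putting it together, here is a minimal sketch. The table docs_f and its filename column are hypothetical; since PATH is set, the column stores bare file names:

    ctx_ddl.create_preference('my_files', 'FILE_DATASTORE');
    ctx_ddl.set_attribute('my_files', 'PATH', '/docs');

    create index docs_f_idx on docs_f(filename)
      indextype is ctxsys.context
      parameters ('datastore my_files');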

URL_DATASTORE

The URL datastore reads the column data as a URL. It performs a GET and returns the contents as the document data. The URLs for the documents must be in the table -- the URL datastore does not crawl. The indexed column cannot be LONG or LOB types.

http, ftp, and file protocols are supported. The URL datastore also has multi-threaded read (on most platforms) to maximize throughput. It is very customizable, and has several attributes:

TIMEOUT Read request timeout in seconds
MAXTHREADS Maximum number of threads to use for multithreaded read
URLSIZE Maximum length of an URL
MAXURLS Maximum number of URL's to read at one time
MAXDOCSIZE Maximum length of document
HTTP_PROXY HTTP proxy
FTP_PROXY FTP proxy
NO_PROXY Domain for proxy exclusion

Like the file datastore, if the web page changes, then you should manually update the column value to itself to trigger the reindex.
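As a sketch, a URL datastore preference only needs the attributes you want to change; the proxy host below is hypothetical:

    ctx_ddl.create_preference('my_urls', 'URL_DATASTORE');
    ctx_ddl.set_attribute('my_urls', 'TIMEOUT', '30');
    ctx_ddl.set_attribute('my_urls', 'HTTP_PROXY', 'proxy.mycompany.com');
    ctx_ddl.set_attribute('my_urls', 'NO_PROXY', 'mycompany.com');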

DETAIL_DATASTORE

Many legacy applications store text line-by-line, in multiple rows of a detail table. The detail datastore constructs documents by concatenating the text of these lines into a single, coherent document.

The detail datastore must be told the specifics of the detail table:

DETAIL_TABLE Name of the detail table
DETAIL_KEY The foreign key column(s) in the detail table
DETAIL_LINENO The sequence column in the detail table
DETAIL_TEXT The text column in the detail table

For instance, let's say the master and detail tables look like this:

    the_master                   my_detail
    ID     TITLE                 M_ID   SEQ  LINETEXT
    1      Grate Expectations    1      1    It was the best of times
                                 1      2    it was the worst of times.
"That's G-R-A-T-E Expectations, also by Edmund Wells."

The attributes of the preference should be set like this:

DETAIL_TABLE my_detail
DETAIL_KEY m_id
DETAIL_LINENO seq
DETAIL_TEXT linetext

There is one final attribute: BINARY. By default, the detail datastore treats each row as a separate line, and, when concatenating the contents, will automatically stick a newline at the end of the text for each row. BINARY set to TRUE suppresses this. In our example, we should set BINARY to FALSE or simply leave it unset.
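In PL/SQL, the settings above look like this (a sketch; the preference is named my_detail to match the create index example that follows, and so happens to share its name with the detail table):

    ctx_ddl.create_preference('my_detail', 'DETAIL_DATASTORE');
    ctx_ddl.set_attribute('my_detail', 'DETAIL_TABLE',  'my_detail');
    ctx_ddl.set_attribute('my_detail', 'DETAIL_KEY',    'm_id');
    ctx_ddl.set_attribute('my_detail', 'DETAIL_LINENO', 'seq');
    ctx_ddl.set_attribute('my_detail', 'DETAIL_TEXT',   'linetext');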

The index itself should be built on the master table:

    create index myidx on the_master(somecolumn)
      indextype is ctxsys.context
      parameters ('datastore my_detail')
The master table must have a primary key column, just like any other table indexed by interMedia Text. This primary key column is used to find the corresponding detail rows, where detail.fk = master.pk.

The indexed column can be any column of allowable type -- the contents are not used by interMedia. If you can afford the space, you can add a dummy char(1) column called "text" or "body", to make the queries more readable:

    select author, title from the_master
     where contains(text, 'best & worst') > 0;

If detail rows are added, removed, or changed without changing the master text column, you should manually update the text column to itself to trigger a reindex.

For those of you familiar with previous versions of ConText, this is similar to the master detail new datastore. The old master detail datastore is no longer supported.

USER_DATASTORE

The user datastore is new for Oracle8i. You write a stored procedure which, given a rowid, synthesizes a document. interMedia calls this procedure whenever document data is required. Let's look at an example. Say we have a table like this:

  articles
    id       number
    author   varchar2(80)
    title    varchar2(120)
    text     clob
and we want to automatically have author and title be part of the indexed document text. We can write a stored procedure following the user datastore interface:
    create procedure myproc(rid in rowid, tlob in out NOCOPY clob) is
      offset number := 1;
    begin
      for c1 in (select author, title, text from articles
                  where rowid = rid)
      loop
        append_varchar_to_lob(tlob, c1.title, offset);
        append_varchar_to_lob(tlob, 'by '||c1.author, offset);
        dbms_lob.append(tlob, c1.text);
      end loop;
    end;
This procedure takes in a rowid and a temporary clob locator, and concatenates all the articles columns into the temp clob. It assumes a helper procedure append_varchar_to_lob has been written (a sketch is given below).

Note the use of the NOCOPY in the procedure signature. This is very important to achieve best performance and to reduce TEMP space usage.
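The helper itself is not part of interMedia; here is one minimal sketch of what append_varchar_to_lob might look like, created in the application schema alongside myproc. It writes the string plus a newline at the current character offset and advances the offset:

    create procedure append_varchar_to_lob(tlob   in out NOCOPY clob,
                                           str    in varchar2,
                                           offset in out number) is
      buf varchar2(32767) := str || chr(10);
    begin
      -- write at the current character offset, then advance it
      dbms_lob.write(tlob, length(buf), offset, buf);
      offset := offset + length(buf);
    end;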

To ensure that the DBA has signed off on the code, only ctxsys-owned stored procedures are allowed for the user datastore. Thus, we need to do something like this as ctxsys:

    create procedure s_myproc(rid in rowid, tlob in out NOCOPY clob) is
    begin
      appowner.myproc(rid, tlob);
    end;
And, we need to make sure that the index owner can execute the stub procedure, so:
    grant execute on s_myproc to appowner 

Now, back as the app owner, we create the preference, setting the PROCEDURE attribute to the name of the ctxsys stub procedure:

  ctx_ddl.create_preference('myud', 'user_datastore');
  ctx_ddl.set_attribute('myud', 'procedure', 's_myproc');
When we then create the index on articles(text) using this preference, interMedia indexing will see author and title in the document text.

The user datastore can be used for any on-the-fly document synthesis, including more complex master-detail relationships, nested tables, text preprocessing, or multicolumn concatenation, like the example.

There are four constraints on the procedures used in user datastores:

  1. They must be owned by ctxsys
  2. They must be executable by the index owner
  3. They cannot issue DDL or transaction control statements like "commit"
  4. They cannot be safe callouts or call safe callouts
If you change the stored procedure, indexes based upon it will not be notified, so you should manually recreate such indexes. interMedia cannot tell what you are doing in the stored procedure, so if the stored procedure makes use of other columns, and those column values change, the row will not be reindexed. The row is only reindexed when the indexed column changes.

When using the USER_DATASTORE, you may run into OCI_INVALID_HANDLE errors. This is bug 881904, and is fixed in patch release 8.1.5.1.

Filters

The FILTER Class

The filter class takes the document data provided by the datastore class, and filters it to readable text, passing it to the sectioner. With Oracle8i, this does not have to be plain text -- it can be a text format such as HTML.

Filter preferences are plugged into indexes using the filter keyword:

    create index
    ...
    parameters ('filter myfilter');
The filter class has four objects.

NULL_FILTER

The null filter is used when the document contents are not in a binary format. It simply passes text from the datastore to the sectioner. The default CTXSYS.NULL_FILTER preference can be used in any index to employ the null filter.

CHARSET_FILTER

 

The charset filter is new for Oracle8i. It converts documents from a foreign character set to the database character set. This is most useful for Japanese customers, who have to deal with two widely-used and incompatible character sets (and one infrequently-used incompatible character set).

The charset filter has one attribute, CHARSET, which takes the NLS name of the source character set. A list of NLS charsets can be found in the Oracle National Language Support Guide. The destination character set is always the database character set, so does not need to be specified.

Additionally, you can specify JAAUTO, which is a custom setting for Japanese character set auto-detection. Oracle8i will automatically detect Shift-JIS, JIS7, or EUC for each document and convert it to the database charset if needed.

USER_FILTER

The user filter is a method for plugging in your own filters. You write a filtering program, place it in ?/ctx/bin, and the indexing engine will invoke it to filter each document. Here's an example -- we'll create an uppercase filter which will uppercase every word.

We start by creating a program to do the filtering -- in this case, we'll write a perl script:

  #!/usr/local/bin/perl

  open(IN, $ARGV[0]);
  open(OUT, ">".$ARGV[1]);

  while (<IN>)
  {
    tr/a-z/A-Z/;
    print OUT;
  }

  close(IN);
  close(OUT);
This perl script converts a file to uppercase.

User filter programs like this take two arguments. The first argument is the name of the input file. The second argument is the name of the output file. Our filter reads the contents of the input file, filters it, and writes the output to the output file.

The program (called upcase.pl) is placed in ?/ctx/bin. Ensure that it's executable by the oracle operating-system user. Now the preference is created:

    ctx_ddl.create_preference('mypref', 'USER_FILTER');
    ctx_ddl.set_attribute('mypref', 'EXECUTABLE', 'upcase.pl');
When this preference is used in an index, the indexing engine will invoke the user filter program for each document when the text is required.

INSO_FILTER

The INSO filter automatically recognizes and filters over a hundred different formats, including Word 97 and Acrobat. The full list can be found in the documentation.

The INSO filter does not have any attributes at this time, so the CTXSYS.INSO_FILTER preference can be used in any index needing filtering.

IMPORTANT: This filter outputs HTML, not plain text. Make sure you employ the HTML section group or all these tags will be indexed. The default system will employ the HTML section group when the INSO filter is detected.
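For example, an index over binary documents stored in a hypothetical docs_bin(doc) BLOB column could name both shipped preferences explicitly:

    create index docs_bin_idx on docs_bin(doc)
      indextype is ctxsys.context
      parameters ('filter ctxsys.inso_filter
                   section group ctxsys.html_section_group');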

The INSO filter uses code from Inso, "the world leader in filtering and viewer technology," and Adobe. This code is not integrated directly; instead, it follows the user filter architecture. The user filter executable for the INSO filter is ctxhx.

ctxhx requires shared libraries and other files (.flt and .tpt files) located in ?/ctx/lib. The installer should copy the correct platform-specific files to this directory. Additionally, ensure that the external procedure agent has ?/ctx/lib in the LD_LIBRARY_PATH environment variable (PATH on NT). This can be done using the ENVS section in the listener.ora. On NT you may need to have ORACLE_HOME set in the ENVS section as well. See the Installation section above for details.

If you are encountering problems using this filter, it can be run by hand. First get a formatted binary file. Let's call it testfile.doc. You do:

    ctxhx testfile.doc testfile.txt 
This should create the file testfile.txt with the HTML representation. The error messages you get, if any, should help you determine what's wrong.

The INSO filter is supported only on Solaris, HP-UX, AIX, and NT. Other platforms are not able to use this filter at this time.

Where's the HTML filter ?

Some of you familiar with previous versions of ConText may be wondering about the HTML filter. In 8i, this code has been moved to the new sectioner class, which allows any filter to spit out HTML if it so desires.

Section Groups

The SECTION_GROUP Class

The section group class is new for Oracle8i, and incorporates functionality that was part of the wordlist and filter classes in ConText. It takes a text format, such as XML or HTML, as input. It is unique in having two outputs -- the section boundaries, which go to the engine, and the plaintext, which goes to the lexer.

Creating Section Groups

Section groups are not created using create_preference -- they have their own API in ctx_ddl:

    ctx_ddl.create_section_group('mygroup','html_section_group'); 
The first argument is the name of the new section group. The second argument is the section group type. This specifies the input text format, and tells the sectioner the rules for detecting section boundaries and transforming to plaintext. Each section group type is covered briefly below.

An empty section group can transform formatted text to plaintext, but will not index any section boundaries. You must first tell it which sections to index by adding sections to the group.

Sections have three important attributes: TAG, NAME, and TYPE. TAG tells the section group how to recognize the section. Because the section group already knows the format, you do not need to specify start and end tags; a section in an HTML section group with TAG of B knows that the section starts with <b> and ends with </b>, for instance. Tags are unique across the sections of a section group.

NAME is how you refer to the section in queries. You may want to name the B tag BOLD, for readability. Multiple tags can be mapped to the same name -- they are then treated as instances of the same section. H1, H2, H3 can all be treated as instances of the HEADLINE section. We recommend avoiding non-alphanumeric characters -- such as underscore -- in section names. Using these characters will force you to escape the names in queries.

TYPE is totally new for 8i. There are three different types of sections: ZONE, SPECIAL, and FIELD.

Section Types

ZONE sections are like sections from previous versions. interMedia records where in the document the section start and end tags occur. WITHIN queries check that the hit words occur between the start and end word offset.

If a ZONE section repeats, each instance is treated separately in query semantics. See Examples, below. ZONE sections can enclose other sections, including themselves, and can be enclosed by other sections. ZONE sections are added like this:

    ctx_ddl.add_zone_section('groupname', 'sectionname', 'tag'); 
SPECIAL sections are so named because they are not recognized by tags. There are two special sections in Oracle8i -- SENTENCE and PARAGRAPH -- both recognized by punctuation in the lexer. They index the start and end of sentences and paragraphs, respectively. You add special sections like this:
    ctx_ddl.add_special_section('groupname', 'sectionname'); 
No tag argument is needed. There are only two allowed values for section name: SENTENCE, and PARAGRAPH.

ZONE and SPECIAL sections index section start and end word offsets, but do nothing to the words in the section. FIELD sections, on the other hand, extract their contents and index them separately from other words in the document. WITHIN queries are run on this separate, smaller index. This makes field section query faster than zone section query -- up to three times as fast in tests we've conducted -- especially when the section tags occur in every document.

This speed comes at the cost of flexibility. FIELD sections are meant for non-repeating, non-overlapping sections. If a field section repeats, it is treated as a continuation of the section, not a distinct instance. If a field section is overlapped by itself or by another field section, it is implicitly closed at the point where the other section starts. Also, there is a maximum of 64 field sections in any section group. This is 64 section names, not tags. Remember that you can map multiple tags to the same section name. You add field sections like this:

    ctx_ddl.add_field_section('groupname', 'sectionname', 'tag'); 
Let's work through a couple of examples to illustrate the different types of sections and their impact on query semantics. We'll use the following document:
    <A>rat</A><A>ox</A>
    <B>tiger rabbit</B>
    <C>dragon<C>snake</C></C>
This is an XML-style markup, but without a DTD, so we will use the basic section group type:
    ctx_ddl.create_section_group('mygroup','basic_section_group'); 
Let's start with ZONE sections:
    ctx_ddl.add_zone_section('mygroup', 'asec', 'a');
    ctx_ddl.add_zone_section('mygroup', 'bsec', 'b');
    ctx_ddl.add_zone_section('mygroup', 'csec', 'c');
This tells the section group to recognize A tags as the section ASEC, B tags as the section BSEC, etc. To do section queries, we use the WITHIN operator. Remember to use the section name and not the tag:
    contains(text, 'rat within asec') > 0 
This finds the document.
    contains(text, 'tiger within asec') > 0 
This does not find the document. Although it has "tiger", it does not occur in the ASEC. If instead of the original setup we had mapped A and B to the same section name:
    ctx_ddl.add_zone_section('mygroup', 'asec', 'a');
    ctx_ddl.add_zone_section('mygroup', 'asec', 'b');
Then both:
    contains(text, 'rat within asec') > 0
    contains(text, 'tiger within asec') > 0
would find the document, because both A and B are treated as ASEC.

An important facet of ZONE section query semantics is that each instance is treated distinctly. That is, a query like this:

    contains(text, '(tiger and rabbit) within bsec') > 0 
finds the document, but a query like this:
    contains(text, '(rat and ox) within asec') > 0 
does not find the document. Although the document has "rat" and has "ox", and "rat" is in ASEC, and "ox" is in ASEC, "rat" and "ox" are not within the SAME ASEC. Note that
    contains(text, '(dragon and snake) within csec') > 0 
finds the document, since they are both in the outer CSEC, even though the inner CSEC contains only "snake".

Special sections follow the same semantics, so a query like this:

    contains(text, '(jumbo and shrimp) within sentence') > 0 
is looking for documents with "jumbo" and "shrimp" in the SAME sentence.

Now let's look at field sections. What if in setup we made ASEC a field section instead of a zone section:

    ctx_ddl.add_field_section('mygroup', 'asec', 'a'); 
Unlike zone sections, each instance of a field section is considered a continuation of the section. The query
    contains(text, '(rat and ox) within asec') > 0 
finds the document, although it didn't when ASEC was a zone section. The field section simply stitches the two instances together. Field sections work best on sections like BSEC, which does not repeat like ASEC nor nest like CSEC.

One last subtlety with field sections. If ASEC is a zone section, then

    contains(text, 'rat') > 0 
finds the document -- it contains the word rat. However, remember that field sections work by extracting the document contents and indexing them separately. Thus, if ASEC were a field section, this query would not find the document. "rat" is in ASEC, which is separate from the document contents.

You can, however, change this by making the field section visible. There is an optional boolean fourth argument to add_field_section. If this boolean argument is set to true:

    ctx_ddl.add_field_section('mygroup', 'asec', 'a', TRUE); 
then the field section is visible, and the section contents are visible to non-within queries, just like zone sections. This is accomplished by double-indexing the word -- once as part of the extracted section, and once as part of the body -- so this option has an index space cost.

Using Section Groups

Section groups are plugged into indexes using the section group keyword:

    create index
    ...
    parameters ('section group mygroup');
Unlike previous versions, you do not need to set an attribute in the wordlist object. You also no longer need to set STARTJOINS and ENDJOINS in the lexer. However, the WHITESPACE, NEWLINE, and PUNCTUATIONS settings in the basic lexer affect sentence and paragraph boundary recognition, which in turn affects special sections. There are five section group types.

NULL_SECTION_GROUP

The null section group is used when there is no section information to be extracted -- it simply passes the text through to the lexer. Only special sections can be added to the null section group.

BASIC_SECTION_GROUP

The basic section group is for simple XML-style markup without going full-bore on the DTD stuff. It can be used in custom situations where full XML is unnecessary. Tags are in the form <TAG>...</TAG>. Entities and tag attributes are not supported. The only processing it does for plaintext is removal of the markup tags.

HTML_SECTION_GROUP

The HTML section group is for HTML. Good name, huh? It knows the rules for HTML, including ISO Latin-1 entity translation and HTML to plaintext conversion. It knows HTML 4.0 tags, and has code to deal with unknown tags.

SCRIPT and STYLE contents, and comments are removed. Contents of the TITLE section are preserved. META tag information indexing is a known customer requirement for a future version.

For those familiar with previous versions, there is no need for KEEP_TAG anymore -- this is automatic when you add the tag to the section group.

XML_SECTION_GROUP

The XML section group is for XML. It does no processing to plaintext except for entity translation and tag removal. It can handle non-system, non-parameter entities in the internal DTD. It cannot access external DTD's in this version.

NEWS_SECTION_GROUP

The news section group handles newsgroup-style postings -- RFC-1036 in particular. This format is used in newsgroups and is similar to the one used for e-mail. Note that while RFC-1036 defines the set of allowable header fields, Oracle does not require this -- any header fields can be used.

Messages in this format consist of a header section, a blank line, and a body section. The news section group recognizes <beginning of line>TAG: as the start of sections, and <end of line> as the end of the section. It stops looking for this pattern when the blank line is encountered.

Translation to plaintext consists of removing all header lines which are not important to section searching. A message like this:

  From:  me
  To:    you
  X-ref: 5!oowj

  hello!  How are you!
with only the from and to tags in the section group, will look like this to the indexing engine:
  <from section start>me<from section end>
  <to section start>you<to section end>
  hello!  How are you!

Lexers

The LEXER Class

The lexer class gets plaintext from the sectioner and splits it into words.

Lexer preferences are plugged into indexes using the lexer keyword:

    create index
    ...
    parameters ('lexer mylex');
The lexer class has five objects.

BASIC_LEXER

The basic lexer can be used for most European languages. It is programmed with default rules for splitting whitespace-delimited text into words. You can modify its parsing behavior to some extent with the lexical attributes.

Ordinarily, non-alphanumeric characters end a word. SQL*Plus is lexed as "SQL" and "Plus". The JOINS attributes allow you to modify this behavior by declaring sets of non-alphanumeric characters to treat as valid word letters.

PRINTJOINS are non-alphanumeric characters which are treated as valid characters, and become part of the word. Declaring PRINTJOINS of "*" lets SQL*Plus be lexed as "SQL*Plus".

SKIPJOINS are like PRINTJOINS, but they do not become part of the word. Declaring SKIPJOINS of "*" lets SQL*Plus be lexed as "SQLPlus".

STARTJOINS are like PRINTJOINS, but they only come at the start of the word. If they occur in the middle of a word, a new word is started. Having "*" as STARTJOINS lets SQL*Plus be lexed as "SQL" and "*Plus". Multiple STARTJOINS are allowed at the start of a word.

ENDJOINS are like STARTJOINS, but for the end of the word. They implicitly end a word, too. So, "*" as ENDJOINS lexes SQL*Plus as "SQL*" and "Plus". Multiple ENDJOINS are allowed at the end of a word. STARTJOINS and ENDJOINS used to be important for section searching in previous versions, but with the new sectioner class, they are no longer needed for that purpose.

Each of these four JOINS attributes is a set of characters. Setting PRINTJOINS to "_%*" means that "_", "%", and "*" are all PRINTJOINS.
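For instance, a lexer preference that keeps SQL*Plus together as a single token could be sketched like this, and then named in the parameters clause as shown above:

    ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
    ctx_ddl.set_attribute('mylex', 'PRINTJOINS', '*');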

NUMJOIN is a single character which is the numeric join character. NUMGROUP is the single numeric group character. They are defaulted based on your NLS_LANG setting. For US, NUMJOIN is "." and NUMGROUP is ",".

Finally, CONTINUATION is the set of characters which indicate line continuation. These and the following newline are removed from the text. The default value is hyphen and backslash.

Then there are three attributes important for sentence/paragraph indexing. PUNCTUATIONS are the set of punctuation marks (?!. by default). WHITESPACE is a set of additional whitespace characters other than space and tab, which you get for free and cannot change. NEWLINE can be set to NEWLINE or CARRIAGE_RETURN. This controls whether lines are ended by \n or \r\n.

A sentence is then recognized as a PUNCTUATION followed by one or more WHITESPACE. A paragraph is a PUNCTUATION followed by a NEWLINE. There are other patterns which are variations on this, but that's the basic idea.

The third set of basic lexer attributes controls term normalization -- the process of converting words to a standard form. BASE_LETTER is a boolean attribute which, if set to YES, will convert accented characters to their unaccented forms.

By default, each word is uppercased during indexing so that queries are case-independent. MIXED_CASE is a boolean, which, if set to YES, does not do this. This makes indexing and queries case-sensitive.

ALTERNATE_SPELLING is an attribute which, if set, uses letter transformation rules for common alternate spellings. It can be set to GERMAN, DANISH, or SWEDISH. In GERMAN mode, for instance, A-umlaut gets transformed to AE. Since the lexer is used at both indexing and query time, query for a word containing A-umlaut will find the word spelled with A-umlaut or AE.

Finally, COMPOSITE controls word decompounding. In German, for instance, multiple words are often combined into a single string. This makes lexing difficult, because the words are no longer whitespace-delimited. If COMPOSITE is set, then the lexer breaks up these compound words, which allows interMedia to index and query as usual. COMPOSITE can be set to GERMAN or DUTCH. Unset or DEFAULT means no decompounding.

The last set of basic lexer attributes controls text and theme indexing. New for 8i, the basic lexer can generate and index themes for each document, in addition to splitting it up into words. This merges the functionality found in the separate THEME LEXER and BASIC LEXER of previous versions.

INDEX_TEXT is a boolean which controls word indexing. INDEX_THEMES controls theme indexing. Setting INDEX_THEMES to NO makes it a text-only index. Setting INDEX_TEXT to NO makes it a theme-only index. They cannot, of course, be both NO. Indexing themes takes longer and uses a bit more index space, but improves the efficacy of ABOUT queries. We'll talk more about this when we discuss ABOUT query.

The CTXSYS.DEFAULT_LEXER preference has theme indexing ON for the English language-specific scripts.
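If you want a text-only index instead, switch theme indexing off in your own lexer preference and name it in the parameters clause; a minimal sketch:

    ctx_ddl.create_preference('textonly_lexer', 'BASIC_LEXER');
    ctx_ddl.set_attribute('textonly_lexer', 'INDEX_THEMES', 'NO');

    create index
    ...
    parameters ('lexer textonly_lexer');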

JAPANESE_VGRAM_LEXER

Japanese is not a whitespace-delimited language (except in textbooks) so it is very difficult for computers to pick out individual words. The Japanese V-gram lexer gets around this problem by indexing overlapping clusters of characters. The Japanese word ABC might be decomposed into AB and BC, for instance. Query terms are also decomposed this way. Thus, the contains query is not looking for words, but for patterns of characters.

The Japanese lexer does not have any attributes in this version.

CHINESE_VGRAM_LEXER

Chinese is also not whitespace-delimited, so a similar solution has been applied for Chinese, as well. The Chinese v-gram lexer also does not have any attributes.

CHINESE_LEXER

New for Oracle8i is a segmenting Chinese lexer, which can actually recognize some Chinese words as whole entities. The rest of the text is still broken into v-grams, but this should be more space-efficient and have faster query than the pure v-gram method. The Chinese segmentation lexer does not have any attributes.

KOREAN_LEXER

Korean is whitespace-delimited, but has problems with verbs, which can have thousands of different forms. Our Korean lexer is a lexicon-driven engine (using a third-party 3-soft lexicon) which simply eliminates verbs from indexing. New for Oracle8i is the ability to eliminate adverbs and adjectives, do various form conversions and perform morphological and segmentation decomposition.

Other Indexing Classes

The STOPLIST Class

 

Not every word in a document is worth indexing. Linguistic lubricants like prepositions and conjunctions are important for language understanding, but they are not very useful for information retrieval -- they are very common and so convey very little information about the document by their presence. Most of the time, it's not worth the space to index these words.

The stoplist class holds a list of these words, called stop words. During indexing, the engine consults the stoplist and filters out the stop words. The stoplist class does not use the create_preference API. Instead, it has its own API:

    ctx_ddl.create_stoplist('mylist'); 
Words are added using add_stopword:
    ctx_ddl.add_stopword('mylist', 'the'); 
Here the article "THE" has been added to the stoplist. You can see the words in a stoplist using the ctx_stopwords view. A stoplist can have a maximum of 4095 stopwords. Each can be up to 64 bytes in length.

Stopwords are case-sensitive in a stoplist, but if the index is not case-sensitive, the stopwords will not be case-sensitive. So, you could add "THE" and "the" to a stoplist, but if the index is not case-sensitive, there would be no difference -- you'd just be wasting a stopword slot.

Stoplists are plugged into indexes using the stoplist keyword:

    create index     ...     parameters ('stoplist mylist'); 
New for Oracle8i, stopwords can be added to the index without re-indexing:
    alter index myidx rebuild parameters ('add stopword AND'); 
This adds "AND" as a new stopword after indexing. Remember that the index makes its own copy of everything, so this does not affect the stoplist used to create the index.

The default stoplist, CTXSYS.DEFAULT_STOPLIST, is language-specific, and default stoplists are provided for most European languages.

Other Stop Objects

Stoplists can also hold two other kinds of objects: stop themes and stop classes. Stop classes are classes of words, rather than individual words. The only stop class available in Oracle8i is NUMBERS, which stops all numbers from being indexed. Alphanumeric words, such as 123.456.789 and abc231 are still indexed. This behavior depends on the lexer to recognize numbers.

Stop themes are used only during theme indexing. The entries added as stop themes are themed, and these themes AND THEIR CHILDREN are not indexed. For instance, say you enter "biology" as a stop theme. This will prevent themes like "Physical Sciences:Biology:Genetics" and "Physical Sciences:Biology:Botany" from being indexed. This is just an example, and does not actually reflect the knowledge base.

Stop themes are used only during theme indexing, so adding "biology" as a stop theme will not stop the word "biology" or biology-associated words from being indexed.
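
Stop classes and stop themes are added with their own ctx_ddl calls, analogous to add_stopword. A sketch, reusing the mylist stoplist from above:

    ctx_ddl.add_stopclass('mylist', 'NUMBERS');     -- stop all pure numbers
    ctx_ddl.add_stoptheme('mylist', 'biology');     -- stop this theme and its children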

The WORDLIST Class

The wordlist class is an odd class because it has no effect on indexing. Instead, it holds fuzzy and stem expansion settings used at query time.

Wordlist preferences are plugged into indexes using the wordlist keyword:

    create index     ...     parameters ('wordlist mylist'); 
There is only one wordlist object: BASIC_WORDLIST, with the following attributes:

STEMMER This attribute is set to the default stemming type. Stemming is an expansion of a word to different forms. The stem expansion of GO might include GO, GOING, WENT, and GONE. The rules for this vary from language to language. (See expansion query operators).

STEMMER can be set to one of the following: ENGLISH (inflectional), (English) DERIVATIONAL, DUTCH, FRENCH, GERMAN, ITALIAN, SPANISH, and NULL, which means no stemming. The CTXSYS.DEFAULT_WORDLIST setting is language-specific.

FUZZY_MATCH This attribute is set to the default type of fuzzy match. Fuzzy match is an expansion technique designed to find words close in form, such as mistyped or mis-OCR'ed versions.

FUZZY_MATCH can be set to one of the following: GENERIC, JAPANESE_VGRAM, KOREAN, CHINESE_VGRAM, ENGLISH, DUTCH, FRENCH, GERMAN, ITALIAN, SPANISH, and OCR. The CTXSYS.DEFAULT_WORDLIST setting is language-specific.

FUZZY_SCORE This is the default score floor for fuzzy expansions. New for Oracle8i is the ability to limit the fuzzy expansion to the best matches. The fuzzy score is a measure of how close the expanded word is to the query word -- higher is better. Setting fuzzy score means that fuzzy expansions below this score will not be produced. You can set this from 0 to 80.
FUZZY_NUMRESULTS This is the default maximum number of fuzzy expansions. Setting fuzzy numresults limits the fuzzy expansion to a certain number of the best matching words. You can set this up to 5000.
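
Putting the wordlist attributes together, a preference might be built something like this -- a sketch only, with illustrative values, using the mylist name from the wordlist keyword example above:

    begin
      ctx_ddl.create_preference('mylist', 'BASIC_WORDLIST');
      ctx_ddl.set_attribute('mylist', 'STEMMER', 'ENGLISH');
      ctx_ddl.set_attribute('mylist', 'FUZZY_MATCH', 'GENERIC');
      ctx_ddl.set_attribute('mylist', 'FUZZY_SCORE', '60');
      ctx_ddl.set_attribute('mylist', 'FUZZY_NUMRESULTS', '100');
    end;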

The STORAGE Class

The storage class holds storage parameters for the index tables. The Text index is made up of five oracle objects, so the single storage clause in the create index statement is insufficient to specify storage. Storage preferences are plugged into indexes using the storage keyword:

    create index     ...     parameters ('storage mystore'); 
There is only one storage object: BASIC_STORAGE, with the following attributes:

I_TABLE_CLAUSE This attribute is the storage parameters for the I table, which is the main index table. There will be at least one row per unique word in your index -- probably multiple rows per word. Each row has a BLOB which tries to stay inline, making the row a bit large. The value is simply tacked on to the end of the create table statement, so you could set it like this:
      ctx_ddl.set_attribute('mystore', 'I_TABLE_CLAUSE',                              'tablespace foo storage (initial 1K)'); 
K_TABLE_CLAUSE This attribute is the same, but for the K table, which is an IOT for rowid to docid conversion. There will be one row per row in your base table. Each row consists of a number and a rowid.
R_TABLE_CLAUSE This attribute is the same, but for the R table, which is a BLOB table for docid to rowid conversion. There will be only a few rows in this table, and each row will have a large BLOB.
N_TABLE_CLAUSE This attribute is the same, but for the N table, which has invalid docids waiting for garbage collection. The number of rows in this table will depend on the frequencies of update and delete DML, and of garbage collection optimization. Each row is short, holding two numbers.
I_INDEX_CLAUSE This attribute is the same, but for the unique index on the I table.
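
Putting it together, a storage preference that moves all five objects into one tablespace might be sketched like this -- the mystore name matches the storage keyword example above, and the tablespace name is made up:

    begin
      ctx_ddl.create_preference('mystore', 'BASIC_STORAGE');
      ctx_ddl.set_attribute('mystore', 'I_TABLE_CLAUSE', 'tablespace textidx storage (initial 1M)');
      ctx_ddl.set_attribute('mystore', 'K_TABLE_CLAUSE', 'tablespace textidx');
      ctx_ddl.set_attribute('mystore', 'R_TABLE_CLAUSE', 'tablespace textidx');
      ctx_ddl.set_attribute('mystore', 'N_TABLE_CLAUSE', 'tablespace textidx');
      ctx_ddl.set_attribute('mystore', 'I_INDEX_CLAUSE', 'tablespace textidx');
    end;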

Indexing Memory

Those of you familiar with previous versions may be wondering where the index memory setting is. For new users, interMedia builds an inverted index, so documents are not added to the index one at a time, like a b-tree index. Instead, information from multiple documents is buffered up and periodically flushed to the index during indexing. In previous versions, you could set the size of this buffer in the analogue of the storage class. Setting this buffer higher reduces index fragmentation but uses more memory.

In Oracle8i, index memory can be set directly in the parameters string. This allows you to set index memory on a per-operation basis. The keyword used is MEMORY:

    create index     ...     parameters ('memory 2M'); 

Here a paltry 2 megabytes of memory is used for indexing. If not specified, the system parameter DEFAULT_INDEX_MEMORY is consulted for the systemwide default index memory. At install time, this is 12M.

The amount of memory cannot exceed system parameter MAX_INDEX_MEMORY. This allows the system administrator to disallow outrageous amounts of index memory.

The MEMORY keyword is also allowed in the alter index parameter string for batch DML (background DML uses DEFAULT_INDEX_MEMORY). This allows you to have different memory sizes for create index and DML, something which is not possible in previous versions. You simply specify a different memory amount in the create index parameter string than in the alter index parameter string.
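
For example, a sketch with a generous create-time memory and a smaller amount for later syncs -- the names and sizes are illustrative, and the memory keyword simply rides along in each parameters string:

    create index myidx on mytab(text) indextype is ctxsys.context
      parameters ('memory 50M');

    alter index myidx rebuild online parameters ('sync memory 10M');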

DML

DML Processing

Most text search engines have asynchronous document updates and additions -- changes to the index are usually done in batches, after and separately from the document changes. This is due to three aspects of inverted indexing:

  1. Text indexing a single document is a lot of work. The values are usually long, and at the very least you have to break it into words. Depending on the objects, you may also have to do web page requests, filtering, and HTML parsing.
  2. Inverted indexes, composed of lists of documents by word, are best updated in batches of documents at a time. Updating the index one document at a time leads to either word lists one document long or a lot of appending to existing lists.
  3. Most text applications are fairly static, having relatively lower DML frequency, and less need for full consistency semantics than traditional transaction processing systems.
interMedia faces the same challenges. Here is what it does for each DML on an indexed column:

INSERT The document rowid is placed into a queue, dr$pending, for later addition to the text index. Queries before this DML is processed will not find the new document contents.
UPDATE The old document contents are invalidated immediately, and the document rowid is placed into the dr$pending queue for later reindexing. Queries before this DML is processed will not find the old contents, but neither will they find the new contents.
DELETE The old document contents are invalidated immediately. No further DML processing is required -- queries will no longer find the document.

Thus, interMedia has synchronous invalidation but asynchronous addition. This extends to transactions, as well:

    delete from table where contains(text, 'delword') > 0;

    select * from table where contains(text, 'delword') > 0;
    --> you get no rows returned

    rollback;

    select * from table where contains(text, 'delword') > 0;
    --> you get the rows returned
and:
    insert into table values (1, 'insword');

    select * from table where contains(text, 'insword') > 0
    --> you do not get the row

    commit;

    select * from table where contains(text, 'insword') > 0
    --> you still do not get the row
The synchronous invalidation is new to this version. Also new is that the DML notification is done through integrated kernel code, and not through triggers as in previous versions.

Processing Additions

So how do you process your inserts and updates when they are sitting in the queue? interMedia provides two methods: sync and background.

Sync is the manual way -- you control when and how it's invoked. Each invocation processes all pending inserts and updates for an index:

    alter index myindex rebuild online parameters ('sync') 
The ONLINE keyword is very important. Without this, during the sync operation queries are blocked and DML on the base table fails. Make sure to use it for sync operations.

Background DML requires you to start a ctxsrv background daemon in the operating system:

    ctxsrv -user ctxsys/<password> 
Once started, the daemon polls the dr$pending queue for DML, and automatically processes additions as they become available.

You can use either or both -- it's largely a matter of your application requirements. Background DML scans for DML constantly. This leads to new additions being indexed automatically and quickly. However, it also tends to process documents in smaller batches, which increases index fragmentation. Sync DML is done at the user's request, so usually the batches are larger and thus there's less index fragmentation. A happy compromise could possibly be reached by invoking sync through dbms_sql in a dbms_job which runs at longer intervals.
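
A minimal sketch of that compromise, assuming an index called myindex -- the procedure name and the one-hour interval are made up for illustration:

    create or replace procedure sync_myindex is
      cur integer;
    begin
      cur := dbms_sql.open_cursor;
      -- DDL is executed at parse time, so no separate execute call is needed
      dbms_sql.parse(cur,
        'alter index myindex rebuild online parameters (''sync'')',
        dbms_sql.native);
      dbms_sql.close_cursor(cur);
    end;

    declare
      jobno binary_integer;
    begin
      dbms_job.submit(jobno, 'sync_myindex;', sysdate, 'sysdate + 1/24');
      commit;
    end;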

The DML Queues

The dr$pending queue holds the documents waiting to be indexed. It should be queried through the ctx_user_pending view, which makes it more readable. Additionally, there is a dr$waiting queue, which is used when documents that are already waiting to be indexed are updated again. The drq_inprog, drq_batches, and drq_batchno tables from previous versions are no longer needed.
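
For example, to see what is waiting for a particular index -- the column names here are those of the ctx_user_pending view:

    select pnd_index_name, pnd_rowid, pnd_timestamp
      from ctx_user_pending
     where pnd_index_name = 'MYINDEX';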

Optimization

Index Fragmentation

The Text index is an inverted index, so it essentially consists of words and lists of the documents which contain each word. When new documents are added to the index, these lists have to be extended. Instead of appending to the existing lists, new rows for the word are created. However, this can lead to many rows with sub-optimal, short lists -- called "index fragmentation". For instance, say we have the following in our index:

    DOG   DOC 1 DOC 3 DOC 5 
Now a new document with "dog" comes along. Simply add it to the index as a new row:
    DOG   DOC 1 DOC 3 DOC 5
    DOG   DOC 7
Subsequent DML will also create new rows:
    DOG   DOC 1 DOC 3 DOC 5
    DOG   DOC 7
    DOG   DOC 9
    DOG   DOC 11
This is why background DML generally produces more index fragmentation than spot invocation of sync: processing DML less frequently in larger batches allows newly added rows to have longer lists, which reduces the number of rows in the index table.

Keep in mind that a certain amount of fragmentation is expected in 8i. interMedia will try to keep index table rows below 4000 bytes long, to inline the LOB and to speed up index row merges used during phrase search, AND, section search, etc. However, frequent DML will probably result in much more fragmentation than this.

Document Invalidation

During an update or delete, the old document information must be removed from the index so that queries will no longer hit it. However, because the inverted index consists of words and rows of occurrences, this would entail finding all words in the old version of the document (which may not be available), and removing an occurrence from the lists of those rows in the index table. This is too much work to do synchronously, and deletes must be synchronous in the integrated model.

Instead, interMedia marks the old document contents as invalid and does not touch the index. The mark signals queries to remove the document from any query result sets. However, this leaves the old information in the index, taking up space in the index tables.

Optimization

To solve these potential problems in the index, you run optimization. Optimization has two modes: FAST and FULL. FAST optimization targets fragmentation only:

    alter index myindex rebuild online parameters ('optimize fast'); 
The ONLINE keyword is important -- without it, queries are blocked and DML fails with an error. FAST optimization runs through the whole index table and glues fragmented rows together, to a maximum of 4000 bytes for a single row. This reduces the number of rows in the index table.

FULL optimization does both defragmentation and garbage collection, which removes the old information left over after document invalidation:

    alter index myindex rebuild online parameters ('optimize full'); 
It loops through each row of the index table, but, unlike fast optimization, cracks open each row's list, removing old document information. It also glues separate rows together where possible.

Because garbage collection is more involved and time-consuming than defragmentation, FULL optimization does not have to be run on the whole table at one time. Instead, you can run it for a limited period of time:

    ... parameters ('optimize full maxtime 5') 
which means run for a maximum of 5 minutes. At the end of 5 minutes, it will stop. The next time you run optimization, it will pick up where it left off. This lets you do a little bit of optimization each night during off hours, and ensures that it is not loading the system when the off time ends.

Garbage collection is needed only after document invalidation, which happens only for delete and updates. If your document table is append-only, meaning that documents are inserted and never changed or removed, then it should suffice to run only defragmentation.

Optimization Concurrency

Unlike previous versions, optimization (either mode) does not block background DML processing. However, because Oracle prevents alter index operations from running concurrently, sync invocations will be prevented.

Querying

The Basics

At this point all aspects of building an index and keeping it up to date have been discussed. However, the index is just a means to an end: enabling content-based queries.

IMPORTANT: Unlike a value query, which is slower without a b-tree index, contains queries are completely disallowed without a Text index.

You do a query using the contains operator:

    select id       from texttab      where contains(textcol, 'query') > 0 
The first argument to contains is the name of the text column. The second is the text query, which is limited to 2000 bytes in length. It returns a number, which indicates the strength of match. 0 means not a match, so it is constrained to > 0. The result is that this query finds the id's of all rows in texttab where the textcol has the word "query".

Scoring

Of course, now you're wondering how to use that return value in the select list and order by clause. An operator is provided just for that purpose:

    select id, score(1)       from texttab      where contains(textcol, 'query', 1) > 0      order by score(1) desc 
The "1" in the score and contains operators is called the "contains label" and it's a number which links the score in the select list to the contains in the where clause -- it can be any number, so long as it matches.

Score can be between 0 and 100, but the top-scoring document in the query will not necessarily have a score of 100 -- scoring is relative, not absolute. This means that scores are not comparable across indexes, or even across different queries on the same index. Score for each document is computed using the standard Salton formula:

    3f(1+log(N/n)) 
Where f is the frequency of the search term in the document, N is the total number of rows in the table, and n is the number of rows which contain the search term. This is converted into an integer in the range 0 - 100.
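
As a rough worked example, assuming a base-10 log: if a document contains the search term 4 times (f = 4), the table has 10,000 rows (N = 10000), and 100 rows contain the term (n = 100), then the raw value is

    3 * 4 * (1 + log(10000/100)) = 12 * (1 + 2) = 36

before the final mapping onto the 0 - 100 integer range.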

Query operators like AND and OR operate on the scores of their operands.

Where Can I Use contains?

Unlike the transparent query rewrite of previous versions, you can include a contains clause in any SQL statement using any tool which can issue oracle SQL. Here is a short list of new places where contains queries can pop up:

  • subqueries
  • virtual tables
  • PL/SQL
  • View definitions
  • DML (insert as select, e.g.)

There is no longer a limitation on the length of the contains SQL statement, although there is a limit on the query term length.

Other Query Methods

For those of you familiar with previous versions, there are no two-step queries or text cursors (in-memory queries). If you absolutely must have result tables, use insert as select. Instead of text cursors, use real PL/SQL cursors. To sort by score, use an order by clause with a FIRST_ROWS hint. The hint will avoid an Oracle SORT ORDER BY.
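
A sketch of the cursor approach, reusing the texttab example from above -- the query term and the 10-row cutoff are just illustrative:

    declare
      cursor c is
        select /*+ FIRST_ROWS */ id, score(1) scr
          from texttab
         where contains(textcol, 'dog', 1) > 0
         order by score(1) desc;
    begin
      for r in c loop
        dbms_output.put_line(r.id || ': ' || r.scr);
        exit when c%rowcount >= 10;   -- just the first few hits
      end loop;
    end;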

However, interMedia still supports count_hits. count_hits is a fast way to get the hit count for a text query, in PL/SQL:

    num_hits := ctx_query.count_hits('indexname','query',TRUE); 

The first argument is the name of the index. This should be a string, so use quotes if you are using a literal value. The second argument is the query string, and the third is a boolean toggle called "exact". If exact is TRUE, then the result is accurate. If exact is FALSE, then the result is only an upper bound. The actual count could be lower. exact FALSE is faster, because it doesn't screen out invalid documents (see issues on DML and optimization for more details).

You could also do "select count(*)" with a contains clause, and this is recommended when you have structured conditions, but count_hits is faster for text-only queries.

Simple Queries

The simplest query is a single word:

    contains(text, 'dog') > 0 
You can escape your word with curlies:
    contains(text, '{dog}') > 0 
This is useful when you want to query on a reserved word or your word has special characters which are query operators. The list of reserved words and the query operator characters are in the documentation.

You can query on a phrase just as easily:

    contains(text, 'dog my cat') > 0 
Unlike internet search engines, this searches for the phrase "dog my cat", not just any document which has these words.

If your phrase contains a stopword:

    contains(text, 'dog the cat') > 0 
then the stopword ("the" in this case) is treated as a wildcard, matching any word. This query would match all the following documents:
    dog the cat
    dog my  cat
    dog cat cat
but not this document:
    dog cat frog 
In other words, the stopword must match something. Stopwords alone disappear from the query. The query:
    contains(text, 'the & cat') > 0 
is reduced to
    contains(text, 'cat') > 0 
Queries on just stopwords will return no rows.

Query Operators (Part 1)

Boolean Operators

Boolean operators are used to combine the results of subqueries using boolean algebra. interMedia has three boolean operators:

AND (&)
OR (|)
AND and OR are the basic boolean operators. However, the operands return numeric scores rather than boolean values, so we've implemented AND as the minimum of its operand scores, and OR as the maximum of its operand scores. This results in the expected behavior -- if any operand of AND scores 0 (the word is not in the document), then the AND returns 0 (the document is not returned). You can use either word or symbol: 'dog and cat' is the same as 'dog & cat'.
NOT (~) NOT is not the unary negator. Instead, it is "AND NOT". 'dog NOT cat' returns all documents which have "dog" except those which also have "cat". The score returned is the score of the left child.

Subqueries and Grouping

You can use parentheses for subqueries and precedence, as in:

    (dog my cat AND fido) OR horse 
Without parentheses, the operators follow a precedence hierarchy which can be found in the documentation. Since AND has higher precedence than OR, the parens are not actually needed in the example above. However, they are needed in this query:
    dog my cat AND (fido OR horse) 
in order to override the precedence hierarchy.

Scoring Operators

The scoring operators operate on the scores of their operands. There are four scoring operators:

WEIGHT (*) The weighting operator multiplies a search term's score to make it more or less important in the query. The multiplier can be from .1 to 10:
      contains(text, '(dog*2) AND cat') > 0 
This query looks for documents with "dog" and "cat", but the score for "dog" is multiplied by 2. This makes documents with "dog" more likely to have a higher score in the result set than those with "cat". WEIGHT does not have a word equivalent -- you must use the character *.
THRESHOLD (>) The threshold operator filters out documents below a particular score. 'dog > 10' will return only those documents containing "dog" which score higher than 10. You can also do it this way:
      contains(text, 'dog') > 10 
but threshold is implemented at a lower level, so will be faster. THRESHOLD does not have a word equivalent -- you must use >.
MINUS (-) The minus operator takes the left operand score and subtracts the right operand score. 'dog - cat' looks for documents with "dog", but those documents which also have "cat" will probably score lower than those without. This is not the same as NOT, which completely filters out the document if the right operand search term is present.
ACCUM (,) Accumulate groups several words or phrases, and scores higher when more of its operands are hit. 'dog, cat, frog' will score any document with all three words higher than any document with two of the three, which will score higher than any document with only one of the three. This is changed from previous versions, where ACCUM merely added the scores of its operands. You can either use the symbol , or the word ACCUM.

Set Operators

For those of you familiar with previous versions, the MAXDOC (:) and FIRST/NEXT (#) operators have been removed. You can use cursor fetching loops to get the same effect.

Word Expansion Operators

Word expansion operators expand a word to find similar forms of the word in the document set. For Oracle8i, the limit on maximum number of expansions has been removed. However, the more words a term expands to, the slower the query will run. There are five word expansion operators:

WILDCARD (% _) You can use the SQL like wildcards % and _ to find words matching a pattern. 'do%', for instance, finds all documents with words beginning with do, such as dog, door, etc. This is done via a like query on the index table, so a wildcard query like '%do%' will be slow, as a full table scan of the index table is required.
FUZZY (?) Fuzzy query finds words with similar form, using a proprietary algorithm. This is useful for finding mis-typed or mis-OCR'd words. The fuzzy operator is ?, as in '?dog'. You can change the rules used for fuzzy and limit the expansion using the wordlist class. FUZZY has no word equivalent -- you must use the ? operator.
STEM ($) Stem query finds words with the same stem form, using integrated Xerox linguistic code. This is useful for finding "GOING" and "WENT" from "GO", for instance. You can change the settings for stem expansion using the wordlist class preferences. STEM has no word equivalent -- you must use the $ operator, as in '$go'.
SOUNDEX (!) Soundex query finds words which sound alike. This no longer uses a separate table, but is instead a specialized fuzzy expansion. SOUNDEX has no word equivalent -- you must use the ! operator, as in '!dog'.
EQUIV (=) Equivalence is for manually inputting the various forms of a word, as in 'dog = cat', searching for documents with 'dog' or 'cat', treating them as different forms of the same word. Equiv only works on simple words -- you cannot equiv phrases. You can either use the symbol = or the word EQUIV.

Proximity Operator

Other than phrases, interMedia has one proximity operator -- NEAR, which finds documents where the input phrases are close to each other. The closer the words are to each other, the higher the score. NEAR has two forms. The shorthand form uses the ; character:

     dog ; cat ; boat 
This query finds documents which have dog, cat, and boat, and scores the document higher the closer they are to each other.

The function form has options to set range and directionality:

    NEAR((dog,boat), 10, TRUE) 
The first argument to NEAR is the list of words. In the example, we are looking for dog and boat. The second argument is the maximum allowed span. In the example, we are constraining it to within a 10-word span. That means there cannot be more than 10 words between "dog" and "boat". This range can be up to 100. It is optional -- if omitted, then it defaults to 100. The third argument is directionality. If TRUE, as in the example, then the words have to appear in the order specified in the list. If FALSE, the words can appear in any order, as long as they fit within the specified word span. This can be omitted, in which case FALSE is the default.

Query Operators (Part 2)

Within

The within operator limits a subquery to a particular section:

    dog within title 
will find all documents with "dog", but only if it occurs in the "title" section. The name of the section is used rather than the tag. The "title" section has to have been defined in your section group -- you cannot specify arbitrary tags. If your section name has non-alphanumeric characters, you should enclose the name in curlies:
    dog within {my_title} 
which is why it isn't recommended to use non-alphanumeric characters in section names. Within is high in the precedence hierarchy, so if you have a subquery, it is safer to use parentheses to group it:
    (dog and cat) within title 
If you don't use parentheses, it will search for documents with "dog" anywhere and with "cat" in the title section:
    dog and cat within title     ==     dog and (cat within title) 
which is a different query.

If the section is a zone or special section, the subquery is constrained to a particular instance of the section:

    (dog and cat) within sentence 
means dog and cat within the SAME sentence. This is different from:
    (dog within sentence) and (cat within sentence) 
for instance -- which means dog in any sentence, and cat in any sentence. Section types and their query semantics were covered in the Section Groups issue.

Within does not nest, so this is not allowed:

    (dog within sentence) within paragraph     <-- BAD

About

ABOUT is an advanced query operator. We take the input to ABOUT and do our best to increase precision and recall. If your index has a theme component (INDEX_THEMES is YES for BASIC_LEXER) then ABOUT does a theme query. It determines the theme of the input term, and finds documents with the same theme. This allows a query like this:

    contains(text, 'about(canines)') 
to find relevant documents even if they don't have the word "canines" in them -- they might have "dogs" or "wolves" instead, for instance. It will also filter out documents which have the word "canines" but are not significantly about canines, such as an article on dentistry, for instance.

The input to ABOUT does not have to be a single term. It can be an unstructured phrase:

    about(japanese banking investments in indonesia) 
This phrase will most likely have multiple themes -- banking, Japan, etc. The documents retrieved by this query will have all those themes. This allows you to use plainly-worded descriptive phrases instead of rigid query syntax and still find relevant documents.

If your index does not have a theme component, then ABOUT will do a query expansion to increase recall -- it will split the input into words, stem each, then string it back together using accumulate:

    about(go home now) 
is transformed to
    $go,$home,$now 
Note that the purpose of ABOUT is to have interMedia do what it can to find the best-fitting documents. It does not have a formulaic definition like the other operators.

ABOUT can be combined with other operators:

    contains(text, 'about(dogs) and $cat') 
for instance. This allows you to do a combined theme and text search, which was not as easy to do in previous versions.

Thesaurus Operators

The thesaurus operators allow you to expand a thesaurus phrase using a thesaurus. The SYN operator, for instance, expands a word into synonyms:

    SYN(dog)   ==    {dog} | {mutt} | {canine} 
The expansion uses a previously loaded thesaurus. This is usually a thesaurus of your own creation -- interMedia does not install a thesaurus by default, although a sample thesaurus is included in ?/ctx/sample/thes. You can see the expansion by using the ctx_thes package functions:
    declare
      output varchar2(80);
    begin
      output := ctx_thes.syn('dog');
      dbms_output.put_line(output);
    end;
Thesaurus operators take simple thesaurus terms as input -- they cannot nest or take expansions as input:
    SYN($dog)         <-- BAD
    SYN(BT(dog))      <-- BAD

SQE

SQE's are Stored Query Expressions -- a macro operation. You store a query string and give it a short name, using the ctx_query PL/SQL package:

    ctx_query.store_sqe('cats', 'cat = feline = kitty'); 
The first argument is the name of the SQE -- "cats". The second argument is the stored expression. When you use it in a query:
    contains(text, 'SQE(cats)') 
interMedia expands it to the stored query expression, making this equivalent to:
    contains(text, 'cat = feline = kitty') 
SQE text can even refer to other SQE's:
    ctx_query.store_sqe('animals', 'frog | hog | sqe(cats)'); 
Although if you set up a circular reference, you will get an error at query time. There are fundamental changes to SQE in 8i. First, stored query expressions are no longer tied to an index -- you can use any SQE in any query on any index. You can even use other users' SQEs by prepending the owner name:
    SQE(bob.cats) 
Second, SQE partial results are no longer stored. Instead, SQE works as a macro operation, replacing lexically during query parsing. Finally, there is no longer the notion of session-level SQE -- store_sqe does a commit.

PL/SQL

For those of you familiar with previous versions, the PL/SQL operator (@) has been removed. This is due mainly to stricter transactional restrictions resulting from tighter integration with the kernel.

Thesaurus

Overview of Thesaurus Functionality

interMedia provides a program to load formatted, file-based thesauri into ctxsys-owned tables in the database. These thesauri can then be used at query time to expand query terms for improved recall. interMedia also provides PL/SQL functions for browsing the thesaurus while it's in the database, and a program for exporting the thesaurus back out to a file.

These thesauri are usually customer-defined. interMedia provides a sample thesaurus in ?/ctx/sample/thes, but this is not imported by default.

The Thesaurus File

The first step in thesaurus import is construction of the thesaurus file. This is a list of terms in the thesaurus, each followed by their relationship to other terms. For instance:

    canine
      SYN mutt
      BT  mammal
      NT  wolf
    lizard
      BT  reptile
The first term in this thesaurus is "canine". The second line declares that "canine" has a synonym, "mutt". The following lines declare that the broader term of "canine" is "mammal", and a narrower term of "canine" is "wolf".

Terms, such as "canine" or "lizard" in our example above, must be on their own line, and must be at the start of the line -- no whitespace is allowed between the beginning of the line and the term.

Relationship lines, such as "SYN mutt", must have whitespace before them. The relationship word (SYN) is not case-sensitive. You cannot have more than one relationship word / word pair on a single line:

    canine
      SYN mutt  BT mammal     <-- BAD

Relationships cannot be on the same line as the term:

    canine  SYN mutt          <-- BAD

These rules are required for proper parsing of the thesaurus file. The more relaxed rules in previous versions were not able to correctly parse certain cases, especially when terms had common substrings with relationship words, such as "NT 4.0".

Thesaurus Relationships

The thesaurus file supports the following set of relationship words:

SYN Synonymity. A SYN B means that A and B mean the same thing. A SYN B implies B SYN A. A SYN B and B SYN C implies A SYN C.
UF Use For. Same as SYN.
PT Preferred term. A PT B means that B is the preferred term for A. There can be only one preferred term for any thesaurus term. If multiple PT lines are seen, latter PT lines override previous lines.
USE Same as PT.
SEE Same as PT.
BT Broader Term. A BT B means that B is a broader term of A. A BT B implies B NT A (A is a narrower term of B). Terms can have multiple broader terms. Circular references should not be introduced.

You can specify whole hierarchies under a single term using BTn:

        dog
          BT1 canine
            BT2 mammal
              BT3 vertebrate
                BT4 animal
This isn't declaring multiple broader terms of dog, but instead the whole hierarchy above dog -- canine is the first-level BT of dog, then mammal is the second-level BT -- meaning that mammal is the BT of canine. BTn goes up to BT16.
BTP Broader Term Partitive. A BTP B means that B is a broader term of A in the sense that A is a part of B. Hard drive BTP Computer, for instance -- Computer is a broader term of hard drive in the sense that a hard drive is a part of a computer (NCA notwithstanding). A BTP B implies B NTP A, but does not imply A BT B -- the partitive hierarchy is separate from the normal hierarchy. This relationship is sometimes called "meronymy", although not by anyone with a life.
BTG Broader Term Generic. A BTG B means that B is a broader term of A in the sense that B is a generic name for A. Rats BTG Rodents, for instance. A BTG B implies B NTG A, but does not imply A BT B. This relationship is sometimes called "hypernymy".
BTI Broader Term Instance. A BTI B means that B is a broader term of A in the sense that A is an instance of B. Cinderella BTI Fairy Tale, for instance. A BTI B implies B NTI A, but does not imply A BT B. I don't know any fancy name for this relationship.
NT Narrower Term. The opposite of BT. You can use NTn, just like BTn.
NTP Narrower Term Partitive. The opposite of BTP.
NTG Narrower Term Generic. The opposite of BTG.
NTI Narrower Term Instance. The opposite of BTI.
RT Related Term. A RT B means that A and B are associated, but are not synonyms and cannot be arranged into a hierarchy. A RT B implies B RT A. A RT B and B RT C does not imply A RT C.
<lang>: Foreign language translation. <lang> can be any label you wish to use for the language, 10 characters or less. This applies to the immediately preceding term, even if it is a relationship word:
        cat
          french: chat
          RT hat
            french: chapeau
Here "chapeau" is the french translation for "hat", not "cat".
SN Scope note. You can attach a little note to a term as a comment. This can be up to 2000 characters. If you need multiple lines, simply repeat the SN keyword -- the text is concatenated:
        cat
          SN I had a cat once.  He was brown and
          SN grey and well-behaved.

Our thesaurus functionality also supports homographic disambiguation using parentheses:

    mercury (god)
      BT Greek mythology
    mercury (element)
      SYN quicksilver
as well as compound terms using the + sign:
    thermometer
      SYN temperature + instrument

Importing a Thesaurus File

Once you have the file, you need to import it into the interMedia data dictionary. From the OS command line:

    ctxload -user ctxsys/ctxsys -thes -name mythes -file mythes.txt 
The login user can be any CTXAPP user -- it does not have to be ctxsys. -thes is the toggle for thesaurus import. mythes is the name for the new thesaurus, and mythes.txt is the file to import.

If a thesaurus of this name already exists, you will get an error and you must drop the existing thesaurus before importing this one. Thesaurus names are database-global.

Optionally, you can specify the thescase argument to make the thesaurus terms case-sensitive:

    ctxload ... -thescase Y 
The default is N, meaning that the thesaurus is not case-sensitive. Mercury (the god) and mercury (the element) are the same term if found in the thesaurus file.

Thesaurus Query Operators

Once the thesaurus is loaded, any user can use it in any query on any index using the thesaurus operators. These mimic the relationship names for the most part. For instance:

    contains(text, 'SYN(dog, mythes)') > 0 
will search for documents which contain any of the synonyms of "dog" defined in the "mythes" thesaurus. The actual expansion uses OR like this:
    {canine}|{doggie}|{mutt}|{dog} 
Homographic disambiguators are not included in the expansion:
    mercury (element)
      SYN quicksilver

    SYN(quicksilver) == {quicksilver}|{mercury}
Compound phrases are treated as a conjunctive:
    thermometer
      SYN temperature + instrument

    SYN(thermometer) == {thermometer}|({temperature}&{instrument})
Thesaurus operators take simple thesaurus terms as input -- they cannot nest or take expansions as input:
    SYN($dog)         <-- BAD
    SYN(BT(dog))      <-- BAD
Here are the specific thesaurus functions:

SYN( term [, thesname] ) The input term is always part of the output; if no synonyms are found then term is the expansion. thesname is the name of the thesaurus to use. If omitted, thesaurus DEFAULT is consulted. There is no DEFAULT thesaurus installed automatically -- it is up to the user to load a thesaurus named DEFAULT.
PT( term [, thesname] ) Preferred term. Only the preferred term is returned in the expansion.
BT( term [, level [,thesname]] )
BTP( term [, level [,thesname]] )
BTG( term [, level [,thesname]] )
BTI( term [, level [,thesname]] )
NT( term [, level [,thesname]] )
NTP( term [, level [,thesname]] )
NTG( term [, level [,thesname]] )
NTI( term [, level [,thesname]] )
Broader/Narrower terms of all types. The optional second argument is a numeric scope indicating how many levels of hierarchy you want. A BT with a level of 2, for instance, would search for the term, all broader terms of the term, and all broader terms of those terms.

NOTE: the syntax demands that if you want to specify thesname then you MUST specify level too, eg
  'NT(term,99,mythes)'

TT( term [, thesname] ) Top term. Travel up the BT hierarchy to the root term. Only this root term is searched.
RT( term [,thesname] ) Related term. Search for term and all related terms.
TR( term [, lang [, thesname]] ) Foreign language translation of the term. Lang argument should match label used in the input thesaurus. If omitted all foreign language translations will be searched. Note that the foreign language translation must have been entered in the input thesaurus -- we don't do automatic translation.
TRSYN( term [, lang [, thesname]] ) Foreign language translation of term synonyms. This is functionally equivalent to taking SYN of the term, then TR of each resulting term. Like TR, if the target language is omitted, all available translations will be searched.

NOTE: the syntax demands that if you want to specify thesname then you MUST specify lang too, eg
  'TR(term,all,mythes)' and 'TRSYN(term,all,mythes)'
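
For example, combining a level and a thesaurus name in a query -- the term is illustrative, and mythes is the thesaurus loaded in the import example above:

    contains(text, 'NT(vertebrate, 2, mythes)') > 0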

Browsing a Loaded Thesaurus

There are numerous functions in the ctx_thes package for browsing the thesaurus through PL/SQL. For the most part, they have the same names and arguments as the query functions, but they return the expansion text. For instance:

    declare
      exp varchar2(4000);
    begin
      exp := ctx_thes.syn('dog','mythes');
    end;
Here the synonyms of "dog" in the "mythes" thesaurus are returned. The expansion is returned as a string, so exp will get the value:
    {canine}|{doggie}|{mutt}|{dog} 
The return can be up to 32000 characters in length.

Exporting a Thesaurus

Thesaurus export is also done through ctxload:

    ctxload -thesdump -user ctxsys/ctxsys -name mythes -file mythes.txt 
The exported version will be structurally similar to the imported version -- all the words will be there in the right places, but it may not be exactly the same -- in particular, it will probably be more verbose, specifying explicitly relationships which were implied in the input thesaurus.

Thesaurus Standards

The phrase "ISO-2788 or ISO-5964 thesaurus standard" is somewhat misleading. The computing industry considers a "standard" to be a specification of behavior or interface. These standards do not specify anything. If you are looking for a thesaurus function interface, or a standard thesaurus file format, you won't find it here. Instead, these are guidelines for thesaurus compilers -- compiler here being an actual human, not a program. They give general advice on how to build and maintain a thesaurus, such as "Adverbs should not be used alone as indexing terms".

What Oracle has done is taken the ideas in these guidelines and in ANSI Z39.19 -- an American version of ISO-2788 -- and used them as the basis for a specification of our own creation. Therefore interMedia can handle all the inter-word relationships mentioned in Z39.19, and even uses the same abbreviations, but the interface is our own. The interMedia file format looks a bit like "Figure 1" of ISO-2788, but Oracle imposed its own rules to make it parseable. interMedia can implement foreign language lookups like ISO-5964, but the input thesaurus looks nothing like their examples. So, it is more accurate to say that interMedia supports ISO-2788 relationships, or ISO-2788-compliant thesauri.

Query Tuning

Analyzing Text Queries

In Oracle8i, tuning Text queries is not very different from tuning regular SQL queries. This is demonstrated with an extended example, using this table of magazine articles:

    create table articles (
      id         number        primary key,    -- article id
      issue      number,                       -- issue number
      title      varchar2(80),                 -- article title
      published  date,                         -- publication date
      text       clob                          -- article text
    );
Assume issue has a non-unique b-tree index called art_issx, and text has a Text index called art_textx.

The first step in query tuning is analysis of the query plan. This is done using explain plan. Note that the SQL explain plan command is different from ctx_query.explain. explain plan works on a SQL query, and prints the SQL query plan executed by Oracle. ctx_query.explain works on the input to the contains function, and prints out the Text query plan executed by interMedia.

The Oracle Server Tuning manual details how to set up and run SQL explain plan, so check there if you do not know how this is done. You don't need to do anything special for interMedia queries -- thanks to tighter integration with the kernel, the optimizer now understands what the contains clause means and can generate plans for Text queries -- it's all transparent to the end user. Here is a simple query:

    select title from articles                              -- query 1
     where contains(text, 'iced lolly') > 0
This will normally generate a query plan like this:
    1.1 TABLE ACCESS BY INDEX ROWID ARTICLES
      2.1 DOMAIN INDEX  ART_TEXTX
The DOMAIN INDEX part is our equivalent of an INDEX RANGE SCAN -- it means that the index (the Text index, in this case) is producing a stream of rowids for the kernel. This is called rowsource invocation, and, like an index range scan, is a fairly efficient way of finding matching rows.

Score

Now add score to the select list:

    select score(1) score, title from articles              -- query 2
     where contains(text, 'iced lolly', 1) > 0
This generates the same query plan as above -- score does not appear in the query plan. However, score does make query 2 execute a bit more slowly than query 1. The SCORE function is called "ancillary data", and it is delivered separately from the rowid stream. The kernel will ask for some rowids, then, in a separate call, will give a rowid and ask for that row's score. This second call imposes a bit of overhead, which is why query 2 is slower than query 1.

But how about a query like this:

    select score(1) score, title from articles              -- query 3
     where contains(text, 'iced lolly', 1) > 0
     order by published desc

    1.1 SORT ORDER BY
      2.1 TABLE ACCESS BY INDEX ROWID ARTICLES
        3.1 DOMAIN INDEX  ART_TEXTX
The kernel classifies the SORT ORDER BY as a "blocking operation". The presence of a blocking operation means that Oracle8i cannot guarantee that it will ask for scores in the same order as the produced rowid stream. It could produce rowid ABC first, but get asked for its score last. This means that the scores for all rows must be kept in memory, called "batch processing". If the hitlist is large, this could use disk, and be slower than query 2 score resolution. Of course, the order by would normally make query 3 slower anyway.

The presence of any blocking operation switches interMedia to batch processing, even something like this:

    select score(1) score, title from articles              -- query 4
     where contains(text, 'iced lolly', 1) > 0
     order by score(1) desc
could use incremental processing, but still uses batch processing.

Now, query 4 is important in information retrieval, because it is very common to show only the first few top-scoring hits in the hitlist. Web search engines work this way, for instance. This case puts a premium on response time -- the time to get the first page is critical, the time to get the second page is not as important. Unfortunately, in the database world, the order by decreases response time, because all the hits have to be produced and sorted before the first row is displayed.

Because this is a common situation, interMedia has an alternative query method which improves response time for score sorting:

    select /*+ FIRST_ROWS */ score(1) score, title          -- query 5
      from articles
     where contains(text, 'iced lolly', 1) > 0
     order by score(1) desc
The addition of the FIRST_ROWS hint causes the kernel to notify interMedia of the sort by score. interMedia produces the rowid stream in score-sorted order, and the kernel eliminates the SORT ORDER BY -- similar to a b-tree index range scan ordered by the key prefix:
    1.1 TABLE ACCESS BY INDEX ROWID ARTICLES
      2.1 DOMAIN INDEX  ART_TEXTX
This is called "pushing down" the sort. Also, the elimination of the SORT ORDER BY allows us to use incremental score processing instead of batch score processing.

The payoff of FIRST_ROWS is improved response time; the cost is decreased throughput. interMedia cannot produce the hits in score-sorted order naturally. Instead, it runs the Text portion internally, and remembers the top-scoring 500 or so rows. Because the set size is small and because we keep it all in memory rather than disk, Oracle8i can sort it faster than a SORT ORDER BY. However, when it runs out of hits in this batch, it has to re-execute the query to get another batch. Thus, if you're fetching all the results, do the SORT ORDER BY; if you just want the first few hits, use the FIRST_ROWS hint.

The FIRST_ROWS hint will not just affect interMedia -- the entire query will be optimized for response time. This is not usually a problem, since you want the whole query optimized for response time or throughput. However, if you want to optimize the overall query one way and interMedia another way, you can use the DOMAIN_INDEX hints. To have interMedia score-sorting, but optimize the overall query for throughput:

    select /*+ ALL_ROWS DOMAIN_INDEX_SORT */ 
Conversely, if you want the overall query optimized for response time, but you don't want interMedia score-sorting:
    select /*+ FIRST_ROWS DOMAIN_INDEX_NO_SORT */ 

Mixed Queries

So far, the discussion has looked at rowsource invocation for contains -- Oracle8i asks for a stream of rowids which match the contains. However, there is an alternate invocation called "function invocation". In function invocation, Oracle8i produces the rowid stream some other way -- a b-tree index, a full table scan, etc. -- and then asks, for each rowid in that stream, if the row matches the contains:

    select /*+ FULL(articles) */ title                      -- query 6
      from articles
     where contains(text, 'face flannel') > 0

    1.1 TABLE ACCESS FULL ARTICLES
Here a full table scan is forced, which makes Oracle8i go through the table row by row, calling function contains for each one. Note that the contains function invocation is not shown in the plan, just as a structured condition filter would not be shown.

For the same number of rowids, function invocation is much slower than rowsource invocation. So why have it? Consider a query like this:

    select title from articles                              -- query 7
     where contains(text, 'face flannel') > 0
       and issue = 7
Recall that issue has a non-unique b-tree index. rowsource invocation could be used - get a rowid stream from the Text index, and check each rowid to see if issue = 7:
    1.1 TABLE ACCESS BY INDEX ROWID ARTICLES                -- plan 7
      2.1 DOMAIN INDEX  ART_TEXTX
On the other hand, the b-tree index could be used on issue - find all rows where issue = 7, and call function contains individually for each of those:
    1.1 TABLE ACCESS BY INDEX ROWID ARTICLES                -- plan 8
      2.1 INDEX RANGE SCAN ART_ISSX NON-UNIQUE
The query plan is driving from the index on issue, which implies function invocation of contains. Why do that? Function invocation is slower than rowsource for the same number of rowids.

If the number of articles in issue 7 is very small, and the number of articles with "face flannel" in the text is very large (they were quite the rage in the 80's) then the number of rowids is not the same. It may be faster to drive from the b-tree index and use function contains on a small number of rows than to drive from the Text index and do a TABLE ACCESS for a large number of rows.

Of course, it is usually impossible to have a priori knowledge of the selectivities, which makes it impractical to use hints to specify which index to use. Thanks to integration with the cost-based optimizer, you don't have to hard-code it -- let the CBO decide which method is more efficient.

Using the CBO

In order to choose correctly, the cost-based optimizer needs to know the selectivities of the structured and Text conditions. You provide this information by generating statistics:

    analyze table articles      compute statistics 
This will (eventually) generate stats for the table and all indexes, including the Text index. You can use estimate statistics as well -- it makes no difference with respect to the Text index.

When interMedia is asked to compute statistics, it analyzes term frequencies in the Text index. For the most common terms, it records the number of documents which contain that term. It also calculates the average number of documents per term to use for all other terms.

Given a Text query, and using these stats, it can compute an estimate of the query selectivity. Because it doesn't record exact term frequency for every term, and because a Text query can combine terms using AND, phrase, WITHIN, etc. this selectivity is only a rough estimate, not an exact number.

The CBO, confronted with a query like query 7, can then use the b-tree stats and the Text index stats and -- without hints -- can choose the best index to use to drive the query.

If your b-tree data is not uniformly distributed, then histograms for the data can help:

    analyze table articles      compute statistics for column issue size 127 
See Oracle SQL Reference and Server Tuning manuals for more information. Nothing special is done for histograms -- it simply provides more accurate stats to the CBO. Like any SQL query, you can use hints to override the CBO when it's wrong.

Rowid Bitmap Merges

The CBO can choose a third option for query 7 -- use both indexes, and perform a rowid bitmap merge of the two results. You can force this using the INDEX_COMBINE hint:

    select /*+ INDEX_COMBINE(articles art_issx art_textx) */ -- query 9
           title
      from articles
     where contains(text, 'face flannel') > 0
       and issue = 7
which produces the following plan:
    1.1 TABLE ACCESS BY INDEX ROWID ARTICLES                 -- plan 9
      2.1 BITMAP CONVERSION TO ROWIDS
        3.1 BITMAP AND
          4.1 BITMAP CONVERSION FROM ROWIDS
            5.1 SORT ORDER BY
              6.1 DOMAIN INDEX  ART_TEXTX
          4.2 BITMAP CONVERSION FROM ROWIDS
            5.1 INDEX RANGE SCAN ART_ISSX NON-UNIQUE
This is the best performer when the result lists for the structured and contains are roughly equal, somewhat large, and the intersection is small. In comparison, plan 8 would be hobbled by function contains overhead, and plan 7 would require unnecessary TABLE ACCESSes to check the structured condition. A similar plan is used when bitmap indexes are employed.

Bitmap merge, like SORT ORDER BY, is a blocking operation, which, if you are selecting SCORE, will force batch processing instead of incremental processing.

Query Feedback

Text Query Tweaking

Text queries rarely give you exactly what you want. Consider web search engines like AltaVista. Sometimes there are too many documents. Sometimes you wonder why a certain document turned up in the hitlist. Sometimes the document you wanted to find isn't in the results at all.

Similar results can occur with interMedia, so it provides two query feedback methods which allow you to refine your Text query.

Explain

Explain is the interMedia analogue of explain plan, and lets you see exactly what your query looks like. This lets you trim unintended wildcard expansions, for instance, or add parentheses to correct precedence errors.

You need a result table, just like explain plan, with the following columns, in the following order:

    create table xres (
      explain_id      varchar2(30),
      id              number,
      parent_id       number,
      operation       varchar2(30),
      options         varchar2(30),
      object_name     varchar2(64),
      position        number
    )
explain_id A label which distinguishes one query plan from another.
id A step sequence number. This is an ascending sequence number assigned to each step which ignores hierarchy. The root step of each plan has id 1, the second step has id 2, etc.
parent_id The id of the parent of this query plan step. The root step has parent_id = 0.
operation The type of operation for this step. This can include WORD, AND, WITHIN, etc.
options The expansion operators (fuzzy, wildcard, etc.) are transformed into equivalence operators. The options column tells you what the original expansion function was. options is also used to show arguments to NEAR.
object_name The operand of the query step. For WORD operations, this is the actual word being searched. It also holds section names for SECTION operations, weight values, etc.
position A sequence number to sequence children of the same parent. The first child step will have position 1, the second will have position 2, etc.

You then run explain for your query on your index:

    ctx_query.explain('myindex', 'dog and ($run spot)', 'xres'); 
The first argument is the name of the index. The second argument is the query string, and the third is the name of the result table. The query is explained and the results inserted into the plan table. These results are not committed, which mimics the behavior of SQL explain plan. You can then print it out in explain plan style using this SQL:
     select lpad(' ',2*(level-1))||level||'.'||position||' '||
            operation||' '||
            decode(options, null, null, options || ' ') ||
            object_name plan
       from xres
      start with id = 1
    connect by prior id = parent_id;
Which produces the following plan:
    1.1 AND
      2.1 WORD DOG
      2.2 PHRASE
        3.1 EQUIVALENCE ($) RUN
          4.1 WORD RAN
          4.2 WORD RUN
          4.3 WORD RUNNING
          4.4 WORD RUNS
        3.2 WORD SPOT
The optional fourth and fifth arguments to explain, sharelevel and explain_id, allow multiple explains to use the same result table. When sharelevel is 0, the default, then the result table is automatically truncated before the explain is run. When sharelevel is 1, then any rows in the result table with the same explain_id are deleted before the explain is run, but other rows are not touched. Thus, multiple explains can use the same result table as long as they use distinct explain_id's.

When using shared result tables, the start with and connect by clauses of the example sql have to be modified to limit them to a specific explain_id:

      start with id = 1 and explain_id = 'YOURID'
    connect by prior id = parent_id and explain_id = 'YOURID'
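
For example, a sketch of an explain call into a shared result table, reusing the hypothetical explain_id 'YOURID' from above:

    ctx_query.explain('myindex', 'dog and ($run spot)', 'xres',
                      sharelevel => 1, explain_id => 'YOURID');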

Hierarchical Query Feedback

Explain lets you see exactly how your query is run, which allows you to adjust your query to refine the results. Hierarchical query feedback goes one step further and actually suggests other search terms you might want to try.

You start by creating a result table. This result table looks a bit like the explain result table, but with the addition of nested tables for suggested alternative terms:

    create table fres (
      feedback_id     varchar2(30),
      id              number,
      parent_id       number,
      operation       varchar2(30),
      options         varchar2(30),
      object_name     varchar2(64),
      position        number,
      bt_feedback     ctxsys.ctx_feedback_type,
      rt_feedback     ctxsys.ctx_feedback_type,
      nt_feedback     ctxsys.ctx_feedback_type
    )
    nested table bt_feedback store as fres_bt
    nested table rt_feedback store as fres_rt
    nested table nt_feedback store as fres_nt
The bt_feedback, rt_feedback, and nt_feedback columns store suggested broader terms, related terms, and narrower terms, respectively. Each is a nested table of ctx_feedback_type objects, which stores the suggested term in the attribute "text". The other attributes are not currently used, but are reserved for future use.

Now you run hfeedback on your query:

    ctx_query.hfeedback('myindex', 'felines', 'fres'); 
hfeedback will then draw on the linguistics knowledge base to suggest broader, narrower, and related terms. The results are inserted into the result table. Like explain and SQL explain plan, these inserts are not committed implicitly. You can select from the nested tables to get feedback:
    select text
      from the(select nt_feedback
                 from fres
                where object_name = 'felines')
    TEXT
    ----
    cat
    leopards
    tiger
Be aware that when getting feedback for queries with ABOUT clauses, the input is broken into themes, so you cannot simply select rows back by the original query text. For instance, if your query were "ABOUT(African safari)", there would be no row with object_name = 'African safari' -- more likely, there would be separate rows for "Africa" and "safaris".

The feedback terms are obtained from the built-in linguistics knowledge base. If the knowledge base has been extended, feedback can also draw on these extensions.

The results will depend on the contents of your index -- only worthwhile terms are shown. "Lion" is a narrower term of feline, but my index does not have any documents with "Lion" in them, so it does not appear in the narrower term table. Note that for performance reasons, we do not screen out invalid documents, so it is possible (but unlikely) to get feedback terms which would not produce query hits.

The feedback for non-ABOUT portions of the query can contain alternate forms of words, as well -- narrower terms for "learning" might include "education", "educating", "educational", etc.

If the index is case-sensitive, then case-sensitive feedback terms are provided. For instance, narrower terms for "databases" might include both "PL/SQL" and "Pl/sql".

When the index is case-insensitive, queries are case-insensitive -- a contains query for "oracle" will pull up the same records as "Oracle". However, feedback uses the linguistic knowledge base, which is case-sensitive. So, even if your index is case-insensitive, the feedback you get depends on the case of the query -- the feedback for "oracle" will not be the same as the feedback for "Oracle".

You can also give hfeedback multiple terms in a query string:

    ctx_query.hfeedback('myindex', 'felines and canines', 'fres'); 
hfeedback will generate query plan steps as in explain, but phrases are not broken into individual words and expansion operators such as fuzzy, wildcard, soundex and stem are ignored. Each word or phrase step will have bt, nt, and rt tables. Operator steps like AND will not have anything in these tables.

Like explain, hfeedback has sharelevel and feedback_id arguments for result table sharing.
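
For example, a sketch of an hfeedback call into a shared result table (the feedback_id value here is arbitrary):

    ctx_query.hfeedback('myindex', 'felines and canines', 'fres',
                        sharelevel => 1, feedback_id => 'YOURID');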

Highlighting

Document Services

Highlighting and linguistic extraction have been combined into a unified document services package called ctx_doc. A document service is any ad hoc operation on the document text.

Filter

The filter command fetches the document contents and filters them to readable text. This allows you to see what the indexing engine sees. You first make a result table:

  create table fres ( query_id number, document clob ) 
Now request a filter:
  ctx_doc.filter('myindex', '123', 'fres');  
The first argument is the name of the index -- you can use USER.INDEX, but the issuing user must have SELECT access to the base table. The second argument is the primary key value of the document you wish to filter. If you have a composite primary key, use commas to separate the key fields:
  ctx_doc.filter('myindex', 'ABC,123', 'fres');  
Alternatively, you can use the function ctx_doc.pkencode:
  ctx_doc.filter('myindex', ctx_doc.pkencode('ABC','123'), 'fres');  
The third argument to filter is the name of the result table. interMedia will translate the primary key to rowid, fetch the document contents by rowid, and filter to readable text. It will insert a new row into the result table and the readable text output will be placed in the document clob column.

Query_ID

There is an optional fourth argument -- query_id -- which can be any integer. This will be the value of the query_id column for the row in the result table. If you are sharing result tables across several filter requests, you can have each request use a different query_id, and distinguish the outputs by this column. filter does not pre-truncate or pre-delete rows in the result table, regardless of the setting of query_id. The default value of query_id is 0.
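
For example, a sketch of two filter requests sharing one result table, distinguished by query_id (the document keys and id values here are arbitrary):

    ctx_doc.filter('myindex', '123', 'fres', query_id => 1);
    ctx_doc.filter('myindex', '456', 'fres', query_id => 2);

    select document from fres where query_id = 2;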

Plaintext

Recall part of the indexing pipeline for a moment:

     +-----------+           +----------+ filtered   +-----------+ plain
     |           | doc data  |          | doc text   |           | text
     | datastore |---------->|  filter  |----------->| sectioner |----->
     |           |           |          |            |           |
     +-----------+           +----------+            +-----------+
There are two "readable text" outputs in this segment of the indexing chain -- the filtered text, output from the filter, and the plain text, output from the sectioner. When using the INSO filter, for instance, the filtered text is HTML and might have text like this:
      <BODY>
        <P>This is the first
           <B>sentence.</B>
      </BODY>
The HTML section group will transform this into plain text, removing tags and reformatting whitespace, much like a browser might save to text file:
    This is the first sentence. 
By default, filter will provide the filtered text. An optional boolean parameter "plaintext" (the fifth argument) allows you to request the output from the sectioner instead:
    ctx_doc.filter('myindex', '123', 'fres', plaintext=>TRUE); 
The plaintext output can be used for human readers, while the filtered text output can be used if you are sending the results to a browser. Note that if you are using the null section group, the plain text and filter text are the same, so the plaintext argument would have no effect.

The default value of plaintext is FALSE, meaning filter text is produced.

Highlight

The highlight service takes a query string, fetches the document contents, and shows you which words in the document cause it to match the query. The result table for highlight looks like this:

    create table hres (
      query_id   number,
      offset     number,
      length     number
    )
Now request the highlight:
    ctx_doc.highlight('myindex', '123', 'dog | cat', 'hres'); 
Like filter, the first argument is the name of the index and the second is the document primary key. The third argument is the text query, and the fourth is the name of the result table.

interMedia will parse the query and determine which words in the document match the query. It then creates rows in the result table, showing the character offset and length of each matching word. Let's say that our document looks like this for the highlight call above:

    I have a black cat and a white dog. 
"I" is character 1, per oracle convention, so "cat" is at character 16 and has length 3. "dog" is at character 32 and has length 3. If you are using a multi-byte character set, remember that these are character offsets and lengths, not byte offsets and lengths. The hres table will look like this:
      QUERY_ID     OFFSET     LENGTH
    ---------- ---------- ----------
             0         16          3
             0         32          3
A single row in the result table can cover multiple words in the source document -- the example simply uses single words.

Highlight has an optional query_id argument (the fifth, since the query string pushes the remaining arguments down one position), working just like it does in the filter service. Offset and length in the result table can apply to either the plain text or the filter text version of the document; this is controlled by the optional plaintext argument, which follows query_id.

You can use the filter service output in conjunction with the highlight output to display the matching words:

    select dbms_lob.substr(fres.document, hres.length, hres.offset)
      from fres, hres
If your documents are stored in a VARCHAR2 or CLOB column, you can do something similar with the base table instead of the filter output.
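
For example, a sketch against a hypothetical base table mydocs (a numeric primary key id and a VARCHAR2 text column), for the document with key 123:

    select substr(d.text, h.offset, h.length)
      from mydocs d, hres h
     where d.id = 123;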

Markup

Markup takes the highlight service one step further, and produces a text version of the document with the matching words marked up. The markup result table is the same format as the filter result table:

    create table mres (
      query_id   number,
      document   clob
    )
Now call markup:
    ctx_doc.markup('myindex', '123', 'cat | dog', 'mres'); 
And the result is:
    I have a black <<<cat>>> and a white <<<dog>>>. 
Like the other highlight services, markup has a query_id argument for result table sharing, and a plaintext argument to specify markup of filter text or plain text.

You can also override the text used to mark up the matching words. This is broken into start and end tags. The start tag is the text which goes before the matching word, and the end tag is the text which goes after it. By default, the start tag is <<< and the end tag is >>>. You can override this using the starttag and endtag arguments:

    ctx_doc.markup('myindex', '123', 'cat | dog', 'mres',
                   starttag => '-->', endtag => '<--');

    I have a black -->cat<-- and a white -->dog<--.
The length limit on tags is 30 characters.

Highlight Navigation

Highlight navigation is a new feature for Oracle8i which allows you to mark up your text so that an interpreting program can navigate between highlights. As an example, let's take HTML output being sent to a browser. The start and end tags could be overridden to display highlighted words in bold:

    ctx_doc.markup('myindex', '123', 'cat | dog', 'mres',
                   starttag => '<b>', endtag => '</b>');

    <b>cat</b> ... <b>dog</b>
Now you might want to have links next to each highlighted word which take you to the previous or next highlight. You would first have to anchor and name each highlight using the A tag:
    <A NAME=h1><b>cat</b></a> ... <A NAME=h2><b>dog</b></a>
Note that the start tags for these two words are different -- cat has h1 and dog has h2. You can do this using markup macros. Each highlighted word is numbered in document occurrence order. If %CURNUM is in a tag, it is replaced with this number. This lets you do the anchoring above using start and end tag definitions like so:
    ctx_doc.markup('myindex', '123', 'cat | dog', 'mres',
                   starttag => '<a name=h%CURNUM><b>',
                   endtag   => '</b></a>');
This still does not give us clickable links for highlight navigation, however. Links would need to be added before and after each highlight, like this:
    <a name=h1><b>cat</b></a><a href=#h2>></a> ...
    <a href=#h1><</a><a name=h2><b>dog</b></a>
The first problem is that the start and end tags cannot be used for this: for highlight navigation, the first highlight should not have a previous link and the last highlight should not have a next link, but start and end tags are added to every highlight.

Luckily, markup has prevtag and nexttag specifically for this. Prevtag is added before starttag for every highlight except the first one. Nexttag is added after endtag for every highlight except the last one.

The second problem is that nexttag needs to reference the next highlight number, and prevtag the previous highlight number. For this, two other markup macros, %PREVNUM and %NEXTNUM, are used.

Putting it all together, highlight navigation is obtained like this:

    ctx_doc.markup('myindex', '123', 'cat | dog', 'mres',
                   starttag => '<a name=h%CURNUM><b>',
                   endtag   => '</b></a>',
                   prevtag  => '<a href=#h%PREVNUM><</a>',
                   nexttag  => '<a href=#h%NEXTNUM>></a>');
This gives highlighted words in bold, with < and > links next to each one to take the reader to the previous or next highlighted word.

Tagset

The call is complex, so three common sets of definitions for starttag, endtag, prevtag, and nexttag have been predefined. You can use these sets via the tagset argument of markup:

    ctx_doc.markup('myindex', '123', 'cat | dog', 'mres',
                   tagset => 'HTML_NAVIGATE')
This will give you similar functionality to the call above. Here are the defined tagsets:
    TAGSET
    -------------
    TEXT_DEFAULT
      starttag = <<<
      endtag   = >>>

    HTML_DEFAULT
      starttag = <b>
      endtag   = </b>

    HTML_NAVIGATE
      starttag = <A NAME=ctx%CURNUM><b>
      endtag   = </B></A>
      prevtag  = <A HREF=#ctx%PREVNUM><</A>
      nexttag  = <A HREF=#ctx%NEXTNUM>></A>
You cannot define your own tagsets in this version. The tagset default is TEXT_DEFAULT. Any specified values for starttag, endtag, prevtag, and nexttag override the values for the tagset, so something like this:
    ctx_doc.markup('myindex', '123', 'cat | dog', 'mres',
                   tagset => 'TEXT_DEFAULT',
                   endtag => '<--')
would highlight like this:
    <<<dog<-- 
That is, the endtag has been overridden, but the starttag definition from the tagset is still valid.

Linguistic Extraction

Linguistic Extraction

With the INDEX_THEMES attribute of the basic lexer, and the ABOUT clause, you can easily enable thematic indexing and searching for your document table. The linguistic extraction routines in the document services enable you to generate document themes or theme summaries, on-demand and per-document.

The big change from previous versions is the elimination of the services queue. In ConText, linguistics was asynchronous -- you submitted a linguistic request, then eventually the answer would appear in the result table. Vast improvements in linguistic engine performance have allowed us to make these requests synchronous in Oracle8i. Now you invoke the linguistic function, and when the function returns, you know the data is in the result table.

Linguistic extraction works only on English documents.

Themes

The themes service generates the themes of a document. First you need a result table like this:

    create table tres (
      query_id   number,
      theme      varchar2(80),
      weight     number
    )
Then invoke the themes request:
    ctx_doc.themes('myindex', '123', 'tres'); 
Like the highlighting calls, the first argument is the name of the index, the second is the document primary key, and the third is the name of the result table.

interMedia will fetch the document text, filter it, run it through the sectioner, then parse it for thematic content. It will generate up to 50 themes, and place the results in the result table, in the theme column, one row per theme. The weight column is a number showing relative importance of the theme. It is only comparable to other themes of the same document. You cannot compare theme weights from different documents.

An optional fourth query_id argument lets you share result tables. Like highlighting, themes will not pre-truncate or pre-delete the result table.
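
For example, a sketch of two requests sharing one result table, then reading back one set of themes, most important first (the document keys and query_id values here are arbitrary):

    ctx_doc.themes('myindex', '123', 'tres', query_id => 1);
    ctx_doc.themes('myindex', '456', 'tres', query_id => 2);

    select theme, weight
      from tres
     where query_id = 1
     order by weight desc;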

The optional fifth argument, full_themes, allows you to generate themes with their full knowledge catalog hierarchy. The call above might generate a theme of "cats", for instance. Setting full_themes on:

    ctx_doc.themes('myindex', '123', 'tres', full_themes => TRUE); 
would generate
    :science and technology:hard sciences:life sciences:biology:zoology:
     vertebrates:mammals:carnivores:felines:cats:
instead. If you are using full themes, the length of the theme column in your result table should be increased to 2000 or so.

Gist

Gist allows you to generate a theme summary of the document text. This does not generate an original abstract like a human would. Instead, it is composed of selected paragraphs or sentences which contribute most to the themes of the document. The gist result table looks like this:

    create table gres (
      query_id  number,
      pov       varchar2(80),
      gist      clob
    );
and the gist function looks like this:
    ctx_doc.gist('myindex', '123', 'gres'); 

interMedia will fetch the document text, filter it, run it through the sectioner, then parse it for thematic content. It will generate the themes, then generate gists for those themes, one row per gist, with the gist column consisting of those paragraphs which contributed to the theme. The theme is contained in the pov (point-of-view) column. Additionally, it might generate a GENERIC gist, which consists of those paragraphs which best represent the overall themes of the document.

The optional fourth argument to gist is query_id, which allows you to share result tables, like all other calls in document services.

The optional fifth argument, glevel, is a flag which allows you to specify the gist level. If glevel is 'P', the default, the gists are composed of whole paragraphs from the document text. This is called the paragraph-level gist. If glevel is set to 'S', the gists are composed of individual sentences from the document text. This is called the sentence-level gist.

The optional sixth argument, pov, allows you to constrain output to just one point-of-view. Usually, this is done to generate only the GENERIC gist. The value is case-sensitive, so specify GENERIC in uppercase. This is not a way to force the gist to a particular point-of-view; you cannot specify pov's here which would not normally get produced. For instance, if you have a document in which "dogs" and "cats" are pov's, and you specify "frogs", you will get no output.

The optional seventh and eighth arguments, numparagraphs and maxpercent, allow you to constrain the size of the generated gist. numparagraphs specifies the maximum number of paragraphs (or sentences, if you are doing a sentence-level gist) which can make up each generated gist. maxpercent is similar, but limits it to a percentage of the source document. If both are specified, the smaller is the effective limit.
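
Putting the optional arguments together, a sketch of a sentence-level GENERIC gist limited to three sentences (parameter names as described above):

    ctx_doc.gist('myindex', '123', 'gres',
                 glevel        => 'S',
                 pov           => 'GENERIC',
                 numparagraphs => 3);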

The Extensible Knowledge Base

All theme based features in interMedia -- themes, ABOUT queries, gists, ABOUT query highlighting, and hierarchical query feedback -- depend on an internal knowledge base (KB) that is supplied with interMedia. The knowledge base consists of a lexicon of English words and phrases organized in a hierarchy of categories. For the first time in Oracle8i, you can extend and customize this knowledge base by adding new terms or redefining existing ones.

Consider a simple example in the domain of coffee futures. Documents in this domain may mention the term ICO (International Coffee Organization). interMedia can identify the theme "ICO" in such documents, but it does not exist in the supplied KB -- it is known that the documents are talking about the ICO, but not what ICO is. ICO is not classified in the concept hierarchy, and is not associated with terms such as "coffee" or "International Coffee Organization". ICO would look like this in a full-themes listing:

   :ICO: 
if it showed up at all. Because it is an isolated term with no parents, it won't be judged a strong theme.

Loading a Custom Thesaurus

What we need to do is tell interMedia what ICO is all about. The first step is to create a simple thesaurus for the term:

 international trading organizations
     BT international trade and finance

 International Coffee Organization
     BT international trading organizations
     RT coffee

 ICO
     USE International Coffee Organization
Now load the thesaurus using ctxload, just like any other thesaurus. Make sure the thesaurus is loaded case-sensitive, since case matters for theme generation:
 ctxload -user ctxsys/ctxsys -thes -thescase Y -name COFFEE -file  
and run the KB compiler, ctxkbtc (a new executable in 8i):
 ctxkbtc -user ctxsys/ctxsys -name COFFEE 
This will generate a new, extended KB. ctxkbtc would interpret the thesaurus above as the following commands:
  add "international trading organizations" to the KB under        "international trade and finance"    add "International Coffee Organization" under        "international trading organizations"    make "coffee" a related term to "International Coffee Organization"    add "ICO" as an alternate form for "International Coffee Organization"  
Note that "international trade and finance" is an existing category in the supplied KB. All supplied categories are listed in the documentation. It is recommended that new terms be linked to one of these categories for best results in theme extraction. If new terms are kept completely disjoint from existing categories, they might not be strong enough to qualify as themes.

There can be only one user extension per installation -- each run of ctxkbtc affects all users. However, you can specify multiple thesauri (up to 16) in the -name argument of ctxkbtc to add them all to the extended KB:

 ctxkbtc -user ctxsys/ctxsys -name COFFEE TEA MILK 
Because ctxkbtc affects the entire installation, we recommend that only interMedia administrators should extend the knowledge base.

Using the Extended KB

Nothing special needs to be done to use the new KB, other than to close and re-establish open sessions -- interMedia Text will use the new KB automatically. If you generate full themes for the ICO document now, you will see:

    :business and economics:commerce and trade:international trade and
     finance:international trading organizations:International Coffee
     Organization:
Note that ICO has been normalized to "International Coffee Organization", it now has parents, and its weight is higher because it is a known term.

There are some size limits and other restrictions on thesaurus terms for extending the KB. See the Oracle8i interMedia documentation for details.

Chapter 2 : Oracle 8.1.6

Oracle8i interMedia Text 8.1.6 - Technical Overview

Garrett Kaminaga, Product Development, Database and Application Server Technologies, Oracle HQ

This is a technical overview of the improvements found in interMedia Text, version 8.1.6. This is intended for an audience familiar with version 8.1.5. If you are new to interMedia Text, please start with Oracle8i interMedia Text 8.1.5 - Technical Overview.  Please refer to the documentation for detailed information.

 

Table of Contents

  1. Overview of 8.1.6 Improvements
  2. Upgrading 8.1.5 to 8.1.6
  3. Datastore Improvements
  4. Filter Improvements
  5. Section Group Improvements
  6. Lexer Improvements
  7. Wordlist Improvements
  8. Storage Improvements
  9. Index Creation and Maintenance
  10. Query Language Improvements
  11. Document Services Improvements
  12. Thesaurus Improvements



Overview of 8.1.6 Improvements

While 8.1.6 has features which improve index performance and ease-of-use, most of the new features are targeted in two key areas: structured documents and multi-lingual databases.

The improvements in structured document support include indexing and searching attribute text, nested within for sophisticated queries, doctype-limited tag detection, dynamic add section for existing indexes, and a new AUTO sectioner which requires no section pre-definition.

Multi-lingual databases can now store documents of different languages in a single column, thanks to the new MULTI lexer. Furthermore, improvements in INSO and charset filters allow documents of different formats and character sets in a single column. UNICODE support has been added to the Japanese and Korean lexers.

Other interesting features include limited parallel indexing, more flexible stem and fuzzy operators, and in-memory document services.



Upgrading 8.1.5 to 8.1.6

Automated Upgrade

For an automated upgrade from 8.1.5 to 8.1.6, use Oracle's Database Migration Assistant tool (ODMA) -- this will run all needed scripts for interMedia.

There's no easy way to verify that the upgrade completed successfully. However, you can, after the upgrade, issue:

  select * from ctx_version

as ctxsys. If this reports 8.1.6, the upgrade script has been run. It's probably wise to check for any invalid ctxsys objects anyway.
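
One way to check for invalid objects, also as ctxsys (a sketch using the standard USER_OBJECTS view):

  select object_name, object_type
    from user_objects
   where status = 'INVALID';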

 

Manual Upgrade

In the event that the automatic upgrade fails, the manual process is below. This process assumes that you are starting with a valid 8.1.5 ctxsys data dictionary, and that the system is not being used for index, DML, or query.

1. run ?/ctx/admin/upgrade/s0801060.sql as SYS
this grants new, needed database privileges to ctxsys

2. run ?/ctx/admin/upgrade/u0801060.sql as CTXSYS
this brings your ctxsys data dictionary to 8.1.6

3. run ?/ctx/admin/dr0pkh.sql as CTXSYS
this re-creates all public and private package headers

4. run ?/ctx/admin/dr0plb.sql as CTXSYS
this re-creates all public and private package bodies

5. run ?/ctx/admin/dr0type.plb as CTXSYS
this re-creates the index type body

6. check for any invalid ctxsys objects

 

Upgrading Existing Indexes

Nothing needs to be done to upgrade existing 8.1.5 context indexes. These will still work without any problems in 8.1.6.



Datastore Improvements

USER_DATASTORE

The user datastore in 8.1.5 is limited to CLOB output for the generated documents. A new attribute, OUTPUT_TYPE, allows you to change the output type to BLOB or VARCHAR2. For instance, a preference like this:

  begin
     ctx_ddl.create_preference('mystore','user_datastore');
     ctx_ddl.set_attribute('mystore','procedure','foo');
  end;

mandates a foo procedure with the following signature:

  procedure foo (in rowid, in out clob)

If instead we add:

  ctx_ddl.set_attribute('mystore','output_type','varchar2');

then interMedia Text will expect the foo procedure to have the signature:

  procedure foo (in rowid, in out varchar2)

If the synthesized document is expected to be less than 32000 bytes, then a varchar2 output function is probably easier to write, easier to test, and faster to execute than the CLOB version.

You can also set the output_type to BLOB, for formatted documents or documents in a different character set. Note, however, that it is still not possible to use or call an external procedure in the user datastore.

If output_type is not specified, CLOB is assumed, for backward compatibility.
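
For illustration, a minimal sketch of a VARCHAR2 user datastore procedure, assuming a hypothetical base table basedocs with title and body VARCHAR2 columns:

  create or replace procedure foo (rid in rowid, doc in out varchar2) is
  begin
    -- synthesize the document by concatenating two columns of the base row
    select title || ': ' || body
      into doc
      from basedocs
     where rowid = rid;
  end;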

 

NESTED_DATASTORE

The new NESTED_DATASTORE object allows indexing the rows of a nested table as parts of a single document -- it is very much like the detail datastore, but here the detail table is a nested table.

For example, consider a table mytab, where the document text is stored in a nested table, one line per row of the nested table:

  create type nt_rec as object (
    lno  number,        -- line number
    ltxt varchar2(80)   -- text of line
  );

  create type nt_tab as table of nt_rec;

  create table mytab (
    id    number   primary key,
    dummy char(1),
    doc   nt_tab
  )
  nested table doc store as myntab;

If you are running the above in SQL*Plus, you'll need a forward slash after each statement to execute it.

 insert into mytab values (1, null, nt_tab());
  insert into table(select doc from mytab where id = 1) 
    values (1, 'this is the first line');
  insert into table(select doc from mytab where id = 1) 
    values (2, 'this is the second line');

For each row of mytab, interMedia Text needs to concatenate the rows of the nested doc table into a single document. In 8.1.5, this has to be done using a user datastore. 8.1.6 provides this as built-in functionality.

We start by creating a nested table datastore:

  ctx_ddl.create_preference('mynt','nested_datastore');

We need to provide the name of the nested table column in the base table:

  ctx_ddl.set_attribute('mynt','nested_column','doc');

We also need to provide the type of the nested table. This should include the type owner schema. This cannot be determined automatically because the table name is not known until create index time:

  ctx_ddl.set_attribute('mynt','nested_type','scott.nt_tab');

Finally, we need to specify the names of the line number and text columns within the nested table record type:

  ctx_ddl.set_attribute('mynt','nested_lineno','lno');
  ctx_ddl.set_attribute('mynt','nested_text',  'ltxt');

The extensible indexing framework does not allow creating an index on a nested table type, so we need to create the index on a dummy column, just like the detail datastore:

  create index myidx on mytab(dummy)
   indextype is ctxsys.context
   parameters ('datastore mynt');

For each record of the base table, interMedia Text will select the rows of the nested table using sql somewhat like this:

  select <nested_text>
    from table(select <nested_column>
                 from <base_table>
                where rowid = <current_row_rowid>)
   order by nested_table_id, <nested_lineno>

and concatenate the rows into a single document. Thus, a search like:

  select id from mytab where contains(dummy, 'first and second')>0

will return the one base table row, even though the hits are in different rows of the nested table.

Like the detail datastore, the nested table datastore has a BINARY attribute to control the concatenation of the nested table lines. If BINARY is TRUE, interMedia Text will simply concatenate the text together. If BINARY is FALSE, interMedia Text will insert a newline in between each nested table line.

Given the nested table SQL above, a unique index on the nested table storage table, on the columns (nested_table_id, <nested_lineno>), should improve indexing performance.
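
For example, a sketch for the myntab storage table above (the index name is arbitrary):

  create unique index myntab_ux on myntab(nested_table_id, lno);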

Because the index is created on a dummy column, any DML on the nested table must be manually propagated to the dummy column -- interMedia Text cannot automatically detect such changes and mark rows for reindex.
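
For example, after changing the nested rows for document 1, something like this (a sketch) would mark that base row for reindexing:

  update mytab set dummy = dummy where id = 1;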



Filter Improvements

INSO_FILTER Character Set Support

In 8.1.5, the INSO filter outputs ISO8859-1 text, which is then converted to the database character set. However, not all documents can be expressed in ISO8859-1, so the INSO filter cannot be used in all situations.

8.1.6 uses a newer revision of the INSO libraries which features UNICODE output. Since UNICODE is designed to be able to express any text, the INSO filter can now be used on any database, as long as the document text can be expressed in the database character set.

For instance, in 8.1.5, a Japanese document in MS Word on a JA16EUC database cannot be sent to the INSO filter, because the plain text output is forced to ISO8859-1, which is incapable of expressing the document. In 8.1.6, the plain text is UNICODE, and everything works fine. Sending the same document through on a US7ASCII database, however, will not work since the Japanese text cannot be converted to the database character set.

 

Format Column

Another improvement in the INSO filter is the addition of optional bypasses for heterogeneous docsets. In 8.1.5, if you employ the INSO filter in your index, every document in the base table is sent through the INSO filter. This works fine when all the documents are formatted documents, but the docset may also contain HTML and plain text documents. Filtering these wastes time and sometimes produces unexpected results.

8.1.6 adds a mechanism to bypass INSO filtering for selected rows. This requires a column in the base table -- called the format column -- which specifies that the data is BINARY or TEXT. For instance, let's consider the table:

  create table hdocs (
    id   number        primary key,
    fmt  varchar2(10), 
    text varchar2(80)
  );

We can insert a word doc, with the value 'BINARY' in the format column:

  insert into hdocs values (1, 'binary', '/docs/myword.doc');

and an HTML document, with the value 'TEXT' in the format column:

  insert into hdocs values (2, 'text',   '/docs/index.html');

When we create the index, we need to specify the name of the format column, using the new 'format column' clause of the parameters string:

  create index hdocsx on hdocs(text)
   indextype is ctxsys.context
   parameters ('datastore file_datastore filter inso_filter 
                format column fmt');

Since we have specified a format column, for each row the INSO filter will examine the value in 'fmt'. If this value is 'BINARY', as it is for row 1, the document is sent through the INSO filter. If the value is 'TEXT', the INSO filter is bypassed and the text is passed straight through.

The format column must be in the same table as the text column, and it must be of CHAR, VARCHAR, or VARCHAR2 type. The value is case-insensitive, so 'BINARY' and 'binary' are the same. If the column value is null, or is neither 'BINARY' nor 'TEXT', then the document is assumed to be binary and is filtered by the INSO filter.

The format column does not have to be specified when using the INSO filter. This bypass mechanism is optional. The format column can be specified when not using the INSO filter, but it has no effect.

 

Charset Column

If the document set is even more heterogeneous, the TEXT documents may not all be in the same character set, and may not be in the database character set. The INSO filter can do per-row character set translation for such documents, if the character set is specified in another column, called the charset column. For instance, let's expand our example:

  create table hdocs (
    id   number        primary key,
    fmt  varchar2(10),
    cset varchar2(20), 
    text varchar2(80)
  );

We can insert a plain text Japanese document in EUC:

  insert into hdocs values (1, 'text', 'JA16EUC', '/docs/tekusuto.euc');

and one in Shift-JIS:

  insert into hdocs values (2, 'text', 'JA16SJIS', '/docs/tekusuto.sjs');

When we create the index, we need to specify the names of the format and charset columns:

  create index hdocsx on hdocs(text)
   indextype is ctxsys.context
   parameters ('datastore file_datastore filter inso_filter 
                format  column fmt
                charset column cset');

The INSO filter first looks at the format column. If the format column value is TEXT, then it looks at the charset column for the character set of the document. The document text is converted from this character set to the database character set.

The charset column must be in the same table as the text column, and it must be CHAR, VARCHAR, or VARCHAR2 type. The value is case-insensitive, but should be an NLS character set string. If the column value is null or is not recognized, then the document is passed through without any conversion. Charset column does not have to be specified when using the INSO filter.

If the charset column is specified and the format column is not specified, then all documents are assumed to be text documents, and the INSO filter works like the CHARSET filter, described below.

 

CHARSET_FILTER

The CHARSET filter can also take advantage of the charset column, which allows you to index a table of text documents in different character sets. When using the charset column, the CHARSET attribute becomes the default value -- when the charset column is null or the value is not recognized, then the document is assumed to be in the character set named in the attribute, and is converted from that to the database character set.

As with the INSO filter, you do not need to specify the charset column -- without it, the CHARSET filter works just like the 8.1.5 version and assumes all documents are in the CHARSET attribute character set.
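
For illustration, a sketch of a charset filter preference used with a charset column, assuming a hypothetical tdocs table of file names with a cset column; documents with a null or unrecognized cset value are treated as JA16EUC:

  ctx_ddl.create_preference('mycs', 'charset_filter');
  ctx_ddl.set_attribute('mycs', 'charset', 'JA16EUC');

  create index tdocsx on tdocs(text)
   indextype is ctxsys.context
   parameters ('datastore file_datastore filter mycs charset column cset');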



Section Group Improvements

BASIC_SECTION_GROUP

The tagged section groups (basic, html, xml) have all been re-written to share a common parser. As a result, the BASIC section group can now parse tags with attributes, although currently it is not able to index those attributes.

 

HTML_SECTION_GROUP

The HTML section group can now index META tags. META tags in HTML have two forms. The first is used for embedding http header data, and is not interesting to information retrieval. The second form, however, is used for encoding document meta-data in name/value pairs:

  <META NAME="author" CONTENT="Mr. Roboto">

8.1.5 is unable to make use of this because the information is in attribute values, which are ignored. Furthermore, even if the product had attribute value searching, the name part of the name/value pair is actually an attribute value rather than an attribute name.

Instead, the HTML section group has been enhanced with special detection of the META tag. Like any other section, the META pairs which should be indexed are declared using add_section. However, the tag portion has a special syntax: META@<name>:

  ctx_ddl.create_section_group('mysg','html_section_group');
  ctx_ddl.add_field_section('mysg','author','meta@author');

With this tag syntax, the HTML section group will interpret

  <META NAME="author" CONTENT="Mr. Roboto">

as:

  <META@AUTHOR>Mr. Roboto</META@AUTHOR>

and index "Mr. Roboto" as the contents of the field section "author". The query looks like any other section query:

  Roboto within author

to find the document.

META name/value pairs which have not been added to the section group are ignored and not indexed. META tags which do not have NAME or CONTENT attributes are also ignored.

 

XML_SECTION_GROUP Doctype Limiter

One problem with the XML section group in 8.1.5 is the inability to distinguish between tags in different DTD's. For instance, perhaps you have a DTD for storing contact information:

  <!DOCTYPE contact>
  <contact>
    <address>506 Blue Pool Road</address>
    <email>dudeman@radical.com</email>
  </contact>

Appropriate sections might look like:

  ctx_ddl.add_field_section('mysg','email',  'email');
  ctx_ddl.add_field_section('mysg','address','address');

Which works out fine... until you start getting a different kind of document in the same table:

  <!DOCTYPE mail>
  <mail>
    <address>dudeman@radical.com</address>
  </mail>

Now your address section, originally intended for street addresses, starts picking up email addresses, because of tag collision.

8.1.6 allows you to specify doctype limiters to distinguish between these tags across doctypes. You simply specify the doctype in parentheses before the tag:

  ctx_ddl.add_field_section('mysg','email','email');
  ctx_ddl.add_field_section('mysg','address','(contact)address');
  ctx_ddl.add_field_section('mysg','email','(mail)address');

Now when the XML section group sees an address tag, it will index it as the address section when the document type is contact, or as the email section when the document type is mail.

If you have both doctype-limited and unlimited tags in a section group:

  ctx_ddl.add_field_section('mysg','sec1','(type1)tag1');
  ctx_ddl.add_field_section('mysg','sec2','tag1');

Then the limited tag applies in that doctype, and the unlimited tag applies in all other doctypes.

Querying is unaffected by this -- the query is done on the section name, not the tag, so querying for an email address would be done like:

  radical within email

which, since we have mapped two different kinds of tags to the same section name, finds documents independent of which tags are used to express the email address.

 

XML_SECTION_GROUP Attribute Sections

Another commonly-requested feature in the 8.1.6 XML section group is the ability to index and search within attribute values. For instance, take a document with the following:

  <comment author="jeeves">
    I really like interMedia Text
  </comment>

in 8.1.5, the author information, like all attributes, is ignored and cannot be indexed or searched. In 8.1.6, the XML section group offers a new type of section called an attribute section, which nominates interesting attribute values to index:

  ctx_ddl.add_attr_section('mysg','author','comment@author');

The syntax is very similar to the other add_section calls -- the first argument is the name of the section group, the second is the name of the section, and the third is the tag, in the form <tag_name>@<attribute_name>.

This tells interMedia Text to index the contents of the author attribute of the comment tag as the section "author". Query syntax is just like any other section:

  jeeves within author

finds the document. Attribute text is considered to be invisible, though, so the following:

  jeeves

does NOT find the document, somewhat like field sections. Unlike field sections, however, attribute section within searches can distinguish between occurrences. Given the document:

  <comment author="jeeves">
    I really like interMedia Text
  </comment>
  <comment author="bertram">
    Me too
  </comment>

the query:

  (jeeves and bertram) within author

will not find the document, because "jeeves" and "bertram" do not occur within the SAME attribute text.

Attribute section names cannot overlap with zone or field section names, although you can map more than one tag@attr to a single section name.
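
For example (book@writer here is a hypothetical second tag/attribute pair mapped to the same section name):

  ctx_ddl.add_attr_section('mysg', 'author', 'comment@author');
  ctx_ddl.add_attr_section('mysg', 'author', 'book@writer');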

Attribute sections do not support default values. Given the document:

  <!DOCTYPE foo [
    <!ELEMENT foo (bar)>
    <!ELEMENT bar (#PCDATA)>
      <!ATTLIST bar
        rev CDATA "8i">
  ]>
  <foo>
    <bar>whatever</bar>
  </foo>

and attribute section:

  ctx_ddl.add_attr_section('mysg','barrev','bar@rev');

the query:

  8i within barrev

does not hit the document, although in XML semantics, the "bar" element has a default value for its "rev" attribute.

 

Attribute Value Sensitive Section Search

Attribute sections allow you to search the contents of attributes. They do not allow you to use attribute values to specify sections to search. For instance, given the document:

  <comment author="jeeves">
    I really like interMedia Text
  </comment>

You can find this document by asking:

  jeeves within comment@author

which is equivalent to "find me all documents which have a comment element whose author attribute's value includes the word jeeves".

However, there is no way to ask for something like:

  interMedia within comment where (@author = "jeeves")

in other words, "find me all documents where interMedia appears in a comment element whose author is jeeves". This feature -- attribute value sensitive section searching -- is planned for future versions of the product.

 

Dynamic Add Section

Because the section group is defined before creating the index, 8.1.5 is limited in its ability to cope with changing structured document sets; if your documents start coming with new tags, or you start getting new doctypes, you have to re-create the index to start making use of those tags.

8.1.6 allows you to add new sections to an existing index without rebuilding the index, using alter index and the new add section parameters string syntax:

  add zone  section <section_name> tag <tag>
  add field section <section_name> tag <tag> [ visible | invisible ]

For instance, to add a new zone section named tsec using the tag title:

  alter index <indexname> rebuild
  parameters ('add zone section tsec tag title')

To add a new field section named asec using the tag author:

  alter index <indexname> rebuild
  parameters ('add field section asec tag author')

This field section would be invisible by default, just like when using add_field_section. To add it as visible field section:

  alter index <indexname> rebuild
  parameters ('add field section asec tag author visible')

Dynamic add section only modifies the index meta-data, and does not rebuild the index in any way. This means that these sections take effect for any document indexed after the operation, and do not affect any existing documents -- if the index already has documents with these sections, they must be manually marked for re-indexing (usually with an update of the indexed column to itself).

This operation does not support addition of special sections. Those would require all documents to be re-indexed, anyway. This operation cannot be done using rebuild online, but it should be a fairly quick operation.

 

AUTO_SECTION_GROUP

The ultimate in coping with changing structured document sets is the new AUTO section group. The AUTO section group is like the XML section group, but requires no pre-definition of sections. Instead, any non-empty tag is automatically indexed as a zone section with the section name the same as the tag. For instance, given a document like this:

  <book>
    <author>Neal Stephenson</author>
    <title>The Diamond Age</title>
    <description>
      Decades into our future ...
    </description>
  </book>

The auto sectioner will create zone sections named book, author, title, and description. Queries like this:

  diamond within description

will find the document. Because sections are created dynamically from the tags, the auto sectioner is ideal when the structure of the documents is not known to the index creator, or when they are expected to change so that pre-definition with dynamic add is impractical.

The only quirk to watch out for is that section names are case-insensitive, but tags in XML are case-sensitive. For instance, given the above document, the query

  diamond within description

finds the document, as does

  diamond within DESCRIPTION

Both queries would also find a document:

  <DESCRIPTION>Diamond Earring</DESCRIPTION>

even though "description" and "DESCRIPTION" are different tags in XML.

You cannot add zone, field, or attribute sections to an AUTO_SECTION_GROUP in this version. Furthermore, because the AUTO_SECTION_GROUP may encounter documents using the tags PARAGRAPH or SENTENCE, it also does not support special sections.

 

AUTO_SECTION_GROUP Stop Sections

Even when using the AUTO section group, there may be some tags which are so common that indexing them would be a waste of time, since they are not useful for searching. For such tags, the AUTO section group has stop sections, which indicate tags to ignore -- much like stop words indicate words to ignore.

Stop sections use the new add_stop_section call:

  ctx_ddl.add_stop_section(<group name>, <tag>)

which indicates to the named AUTO section group that <tag> should be ignored and not indexed as a section. add_stop_section can be used only on AUTO section groups. The number of stop sections per AUTO group is unlimited.

The tag is case-sensitive, so:

  ctx_ddl.add_stop_section('mysg','description')

is different from

  ctx_ddl.add_stop_section('mysg','DESCRIPTION')

and these stop different things. If you have only:

  ctx_ddl.add_stop_section('mysg','description')

there may still be a section named "DESCRIPTION", if another document uses "Description" or "DESCRIPTION" as the tag. For this reason, a query that does "within <stop section>" does not error.

The tag also supports doctype limiters. So, for instance, if you wanted to stop tag "foo", but only in document type "bar":

  ctx_ddl.add_stop_section('mysg','(bar)foo')

Stop sections can be added to an existing index dynamically, using the parameters string syntax:

  add stop section <tag>

as in:

  alter index <indexname> rebuild
  parameters ('add stop section author')

Unlike stop words, adding a stop section dynamically does not apply to documents already indexed. This is partially to be consistent with the other dynamically add section calls, but mainly because there could be other tags using that section name due to case-insensitivity of section names.

 

AUTO_SECTION_GROUP Attribute Sections

The AUTO section group also automatically indexes attribute values as attribute sections named <tag>@<attr>. For instance, given:

  <book author="Edmund Wells">
    <title>Rarnaby Budge</title>
  </book>

You can find this document with the query:

  wells within book@author

since the author attribute is automatically indexed as an attribute section.

 

Nested Within

8.1.6 also adds nested within queries -- also known as hierarchical section searching. For instance, given a document with:

  <book>
    <title>Star Wars</title>
  </book>
  <video>
    <title>Star Wars</title>
  </video>

You can distinguish between the two types of titles using a nested within:

  (Star Wars within title) within video

for instance, which is not allowed in 8.1.5. 8.1.6 provides this feature with two important caveats. First, this only works with zone sections. Attribute or field sections -- even visible field sections -- cannot be used in a nested within.

Second, because of the way interMedia Text indexes zone sections, equal sections cannot be distinguished. For instance, the document:

  <A><B>foo</B></A>

can be found with either:

  (foo within A) within B

or

  (foo within B) within A

This applies only when the sections bound exactly the same range. Any intervening text allows interMedia Text to distinguish between the sections -- for instance:

  <A><B>foo</B>bar</A>

Another aspect of this is that nested within does not imply an exact parent-child relationship. In XQL, a query string with:

  A/B

means that the B section is a first-level child of the A section. This is not equivalent to nested within:

  (x within A) within B

does not mean that the A section is a child of the B section -- it could be a grandchild or great-grandchild. Nested within is more like the XQL:

  B//A



Lexer Improvements

BASIC_LEXER Generic Base Lettering

The BASE_LETTER function of the BASIC lexer is designed to translate any accented characters to their unaccented representation for indexing and query. In 8.1.5, this feature uses an NLS function which is sensitive to the language part of the NLS_LANG setting. This means, for instance, that BASE_LETTER does not work for an AMERICAN database, because the NLS function does not work in this language. For 8.1.6, based on our input, NLS has provided a generic base lettering function which works in all languages. This means that the BASE_LETTER function now works no matter what language your database uses, which is ideal for multi-lingual and foreign-language databases.

This new function requires the NLS file lx30044.nlb to be installed, so if base lettering is not working in 8.1.6, check for this file.

 

BASIC_LEXER THEME_LANGUAGE

The French knowledge base and lexicon are now public in 8.1.6. This allows indexing and generation of themes for French documents. Normally, the BASIC lexer will choose which knowledge base to use (French or English) based on the NLS_LANG setting for the session. If, however, your NLS_LANG setting does not match your documents -- your documents are in French and your environment is English -- then this will choose the wrong lexicon, or give an error (if your environment is neither English nor French).

For such situations, the BASIC_LEXER has a new THEME_LANGUAGE setting to override environment detection. This attribute can be set to 'ENGLISH' or 'FRENCH' to choose a lexicon explicitly:

  ctx_ddl.set_attribute('mylexer','theme_language','french');

This lexer now produces French themes independent of the environment. This attribute can also be set to 'AUTO', or left unset. In such cases, the lexicon is chosen based on the environment.

 

BASIC_LEXER Theme Admissibility

8.1.5 theme indexing produces a maximum of 50 themes per document. This means that long documents may be difficult to find with a theme query, since lesser themes tend not to make it into the top 50.

8.1.6 solves this by removing the 50-theme limit for theme indexing -- all document themes are produced and indexed. Theme generation, via document services, however, is not changed, and is still subject to the 50-theme maximum.

 

JAPANESE/KOREAN_LEXER UTF-8 Support

In 8.1.5, the JAPANESE_VGRAM_LEXER works only if your database character set is JA16EUC or JA16SJIS. The KOREAN_LEXER works only in KSC5601. These lexers are enhanced in 8.1.6 to also work in UTF-8. This allows multi-lingual databases to finally use the Japanese and Korean lexers.

 

MULTI_LEXER

The biggest 8.1.6 addition in the lexer area is the new MULTI lexer, which allows documents of different languages to be stored in a single table. The MULTI lexer chooses a language-specific lexer for each row, based on a language column.

The first step with the MULTI lexer is the creation of language-specific lexer preferences for each language in the multi-lingual table. As an example, let's create English, French, and German lexers:

  ctx_ddl.create_preference('english_lexer', 'basic_lexer');
  ctx_ddl.set_attribute('english_lexer','index_themes','yes');

  ctx_ddl.create_preference('french_lexer', 'basic_lexer');
  ctx_ddl.set_attribute('french_lexer','base_letter','yes');

  ctx_ddl.create_preference('german_lexer', 'basic_lexer');
  ctx_ddl.set_attribute('german_lexer','composite','german');
  ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');
  ctx_ddl.set_attribute('german_lexer','mixed_case','yes');

Once the language-specific lexer preferences are created, they need to be gathered together under a single MULTI lexer preference. First, we need to create the MULTI lexer preference, using the MULTI_LEXER object:

  ctx_ddl.create_preference('global_lexer','multi_lexer');

Now we must add the language-specific lexers to the multi lexer preference using the new add_sub_lexer call:

  ctx_ddl.add_sub_lexer(<multi_lexer>, <language>, <lang-specific lexer>)

For our example:

 ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer');
  ctx_ddl.add_sub_lexer('global_lexer', 'french', 'french_lexer');
  ctx_ddl.add_sub_lexer('global_lexer', 'default','english_lexer');

This nominates the german_lexer preference to handle german documents, the french_lexer preference to handle french documents, and the english_lexer preference to handle everything else, using 'DEFAULT' as the language.

The MULTI lexer decides which lexer to use for each row based on a language column. This is a character column in the table which stores the language of the document in the text column. For instance, our table looks like:

 create table globaldoc (
    doc_id   number       primary key,
    lang     varchar2(10),
    text     clob
  );

  insert into globaldoc values (1, 'german',  'Ich bin Berliner');
  insert into globaldoc values (2, 'french',  'Je suis Parisian');
  insert into globaldoc values (3, 'english', 'I''m from New Yawk');

To index this table, we need to use the MULTI lexer preference, but we also need to specify the name of the language column:

  create index globalx on globaldoc(text)
  indextype is ctxsys.context
  parameters ('lexer global_lexer
               language column lang');

The MULTI lexer then reads the value of the language column for each row, and uses the associated language-specific lexer for that language.

The language column is case-independent, and can contain either the NLS name or abbreviation of the language. If the table uses some other value for the language, you need to specify that alternate value as the fourth argument when adding the sub lexers:

  ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer', 'ger');

for instance, to use the ISO 639-2 string for german instead. Note that these alternate values are in addition to the NLS names and abbreviations, so "D" and "GERMAN" in the language column still select the german lexer. This also means that an alternate value cannot itself be the NLS name or abbreviation of another language.

The DEFAULT lexer is used when the language of the document is unknown or is not mapped to a language-specific lexer. The DEFAULT lexer is also used to lex stopwords. A DEFAULT lexer MUST be specified, or you will get an error when using the MULTI lexer preference in create index. You cannot specify an alternate value for the DEFAULT sub lexer -- this does not make sense.

At query time, the MULTI lexer chooses a language-specific lexer to lex the query tokens. This is based on the NLS_LANG setting for the query session. Thus, a query session in the FRENCH language will use the lexer for FRENCH. A more explicit method of specifying the query language may appear in future versions of interMedia Text.
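 
For instance, a minimal sketch using the globaldoc example above -- switching the session language to French causes the MULTI lexer to parse the query with the french_lexer:

  alter session set nls_language = 'FRENCH';

  select doc_id from globaldoc
   where contains(text, 'parisian') > 0;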



Wordlist Improvements

AUTO Setting for STEMMER and FUZZY_MATCH

In 8.1.5, the stemmer and fuzzy_match attributes are set to a single, specific language at create index time, so they cannot accommodate multi-lingual tables. 8.1.6 provides the new AUTO setting for STEMMER and FUZZY_MATCH:

  ctx_ddl.create_preference('mywl','basic_wordlist');
  ctx_ddl.set_attribute('mywl', 'STEMMER', 'AUTO');
  ctx_ddl.set_attribute('mywl', 'FUZZY_MATCH', 'AUTO');

During queries, when set to AUTO, the stemmer or fuzzy match expander will examine the language setting of the session and choose the most appropriate setting for that language. A user with NLS_LANG French, for instance, will get French stemming and fuzzy matching. This mapping is not configurable in this release -- you cannot specify a particular stemmer or fuzzy match cluster to use for each language.

This setting is designed for use with the MULTI lexer, which switches lexers based on the query session's language setting. Together, they allow multi-lingual tables to be indexed and queried.

 

WILDCARD_MAXTERMS

All versions of interMedia Text and ConText offer wildcard searching -- a contains for any word matching a pattern. This is done by expanding the pattern against the main index table using LIKE, and searching on all the resulting words.

ConText limited the size of the expansion to around 2500 words. interMedia Text 8.1.5 has no limit, but when the pattern is unselective, it uses up memory quickly.

8.1.6 re-establishes the expansion limit, but with a configurable maximum expansion -- the new WILDCARD_MAXTERMS attribute:

  ctx_ddl.create_preference('mywl','basic_wordlist');
  ctx_ddl.set_attribute('mywl', 'WILDCARD_MAXTERMS', '2500');

Wildcard patterns matching more words than this will receive a "query too complex" error. The default value, if unset, is 5000 words.

 

SUBSTRING_INDEX

Patterns which start with a wildcard pose another problem for wildcard searching. Because wildcard searches are processed by expanding the pattern using a LIKE, they force a full-table scan of the main index table. This type of query, sometimes called a left-truncated search, takes a long time.

For those applications which require faster left-truncated searching, 8.1.6 offers the new SUBSTRING_INDEX option. This is set like any other boolean attribute:

  ctx_ddl.create_preference('mywl','basic_wordlist');
  ctx_ddl.set_attribute('mywl', 'SUBSTRING_INDEX', 'TRUE');

When set, the index will include a new pattern table ($P) which stores rotations of all unique words in the main index table. Using the pattern table instead of the main index table allows interMedia Text to process left-truncated searches much faster.

The query-time savings come at the cost of increased index time and space. Indexing time can take as much as four times as long, and use up to 25% more space, so this option should be used only when left-truncated wildcard searches are frequent and need to be fast.



Storage Improvements

P_TABLE_CLAUSE for BASIC_STORAGE

Since we have a new index table when SUBSTRING_INDEX is set, the BASIC storage object gets a new attribute -- P_TABLE_CLAUSE, which is tacked on the end of the create table statement for the P table:

  create table DR$<indexname>$P (
    pat_part1   varchar2(61),
    pat_part2   varchar2(64),
    primary key (pat_part1, pat_part2) 
  ) organization index
  <p_table_clause>

We estimate that the P table should have approximately the same number of rows as the $I table.
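 
For illustration, a hedged example of setting the new attribute on a storage preference (the preference and tablespace names here are hypothetical); the clause is appended after "organization index", so it should contain only clauses legal in that position:

  ctx_ddl.create_preference('mystore', 'BASIC_STORAGE');
  ctx_ddl.set_attribute('mystore', 'P_TABLE_CLAUSE',
                        'tablespace text_ts storage (initial 1M)');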



Index Creation and Maintenance

Parallel Indexing

ConText V2.x supported parallel index creation, using multiple ctxsrv processes. This was removed in 8.1.5 because tighter integration with the kernel made it unfeasible. While we continue to work for full support in future versions, 8.1.6 adds a temporary workaround which allows parallel index creation with two caveats.

First, in order to use 8.1.6 parallel indexing, the base table must be a partitioned table -- each parallel indexer will work on its own partition, so the parallel degree should not exceed the number of partitions.

Second, because of the limitations of the kernel-interMedia Text interface, we must use dbms_job to spawn the parallel indexers. This means that the database must be running with job_queue_processes > 0, and that the parallel degree cannot exceed this number.

Otherwise, the syntax is just as you would expect:

  create index textidx on doctab(text)
  indextype is ctxsys.context
  parallel 3

This will create the index with parallel degree 3. Note that because we use dbms_job to spawn the parallel indexers, there is some lag before all indexers are running, proportional to job_queue_interval.

If logging is active, multiple log files will be created by this -- one for each parallel indexer. Those log files will be named <logfile>_<process#>. If, for instance, you did:

  ctx_output.start_log('mylog')

you could expect files "mylog", "mylog_1", "mylog_2", etc.

 

PL/SQL Sync and Optimize

PL/SQL functions for sync and optimize have been added. Previously, the preferred method for sync and optimize was alter index rebuild online. Ideally, these operations could be scheduled through dbms_job, but separate sync and optimize jobs can conflict, because two alter index operations cannot be run at the same time for a single index.

The new PL/SQL functions resolve this by providing an alternate PL/SQL API for sync and optimize:

  ctx_ddl.sync_index(<idx_name>)

idx_name is the name of the index to sync. The effective user must be the owner of the index or ctxsys. If effective user is ctxsys, idx_name can be NULL, which will cause this call to examine all context indexes for pending DML. This uses the same code as ctxsrv. We now recommend processing pending DML through a dbms_job -- an individual job for each index, a ctxsys job for all indexes, or some combination of the two. This eliminates the need for ctxsrv, and thus eliminates the problem of starting ctxsrv each time the database starts. The ctxsrv program itself is now deprecated, and may be removed in some future release.
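 
For instance, a sketch of a background sync job (the index name and five-minute interval are just examples):

  declare
    jobno number;
  begin
    dbms_job.submit(jobno,
                    'ctx_ddl.sync_index(''myindex'');',
                    interval => 'SYSDATE + 5/1440');
    commit;
  end;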

  ctx_ddl.optimize_index(<idx_name>, <optlevel>, <maxtime>)

idx_name is the name of the index to optimize. optlevel is 'FAST' or 'FULL' for the optimization level. There are also symbols CTX_DDL.OPTLEVEL_FAST and CTX_DDL.OPTLEVEL_FULL for these values. maxtime is the maximum time, in minutes, to spend on optimization. This applies only to FULL optimization. By default, maxtime is unlimited. You can also pass in NULL or the symbol CTX_DDL.MAXTIME_UNLIMITED.
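 
For example, a hedged call (assuming an index named myidx) limiting a FULL optimization to one hour:

  exec ctx_ddl.optimize_index('myidx', CTX_DDL.OPTLEVEL_FULL, 60);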



Query Language Improvements

New Accumulate Scoring

8.1.6 has a new algorithm for processing accumulate queries. The changes mostly resolve scoring inconsistencies the 8.1.5 algorithm has in certain cases. The new algorithm also handles weighted children better than its predecessor.

 

Nested Within

8.1.6 can process nested withins -- also known as hierarchical section searches -- which allows queries like this:

  (dog within title) within book

Please refer above for details and caveats.

 

BROWSE_WORDS

Many IR systems include a feature called vocabulary browse. This takes an input word and displays words in the text index which are alphabetically close. This feature is useful for users to correct misspelled words in the query. By including a rough document count for each word, it also allows the user to eliminate unselective words. Administrators may also use it to correct misspellings in document text.

8.1.6 provides a new function for this -- browse_words:

  ctx_query.browse_words(<index_name>, <seed>, <restab>, 
                         <browse_id>, <numwords>, <direction>)

index_name is the name of the index to browse.

seed is the start word. This must be a single word, and not a phrase.

restab is the name of a result table which should have the schema:

  browse_id  number
  word       varchar2(64)
  doc_count  number

The results of the browse are placed in this table. Existing contents are not deleted before browse.

browse_id is a numeric identifier for the operation. Output rows in the result table will have this value in the browse_id column. If not specified this defaults to 0.

numwords is the length of the produced list, in words. This must be between 1 and 1000. If not specified, this defaults to 10 words.

direction specifies the direction of the browse. 'BEFORE' means the seed word and words alphabetically before the seed, 'AFTER' means the seed and words alphabetically after the seed, and 'AROUND' means the seed and words both before and after. There are also symbols CTX_QUERY.BROWSE_BEFORE, CTX_QUERY.BROWSE_AROUND, and CTX_QUERY.BROWSE_AFTER for these values. This defaults to 'AROUND'.

An example might look like:

     exec ctx_query.browse_words('myidx','dog','myres', numwords=>5);
     select word, doc_count from myres order by word;

  WORD     DOC_COUNT
  -------- ---------
  CZAR     15
  DARLING  5
  DOG      73
  DUNK     100
  EAR      3

Note that the DOC_COUNT is an upper bound which does not take into account old information from deleted or updated documents. FULL optimization removes this old information from the index, and makes these counts exact.

Browse also has a variant which delivers the result in a PL/SQL table instead of a result table:

  ctx_query.browse_words(<index_name>, <seed>, <resarr>, 
                         <numwords>, <direction>)

Here, resarr is a PL/SQL table of type ctx_query.browse_tab:

  type browse_rec is record (
    word      varchar2(64),
    doc_count number
  );
  type browse_tab is table of browse_rec index by binary_integer;

The results are placed in the table, ordered alphabetically, with the alphabetically first word at index 1. Existing elements of the table are deleted upon entry into the function.
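 
A minimal usage sketch, assuming an index named myidx:

  declare
    res ctx_query.browse_tab;
  begin
    ctx_query.browse_words('myidx', 'dog', res, 5, CTX_QUERY.BROWSE_AROUND);
    for i in 1..res.count loop
      dbms_output.put_line(res(i).word || ' ' || res(i).doc_count);
    end loop;
  end;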



Document Services Improvements

Rowid Input

All document services work on single documents. To specify the document, these interfaces take in a primary key specification. This may not be convenient when the table uses a composite primary key or you have only the rowid. 8.1.6 allows you to specify a rowid instead.

The document services interfaces have not been changed for this feature. The textkey argument can now take either a rowid or a primary key spec. The key type of the document services determines how the value is interpreted.

You set the key type using the new set_key_type call:

  ctx_doc.set_key_type('ROWID');

After which, all document services will expect rowid values. This setting has session duration and scope. To set it back:

  ctx_doc.set_key_type('PRIMARY_KEY');

There are also symbols (CTX_DOC.TYPE_ROWID and CTX_DOC.TYPE_PRIMARY_KEY) defined for these values.

The default is primary key mode, but you can change the default using the new system parameter CTX_DOC_KEY_TYPE. For instance, if you want your system to default to ROWID input:

  ctx_adm.set_parameter('CTX_DOC_KEY_TYPE', CTX_DOC.TYPE_ROWID);

This has system scope and permanent duration. Note that this only sets the default -- sessions can always override this via set_key_type.

 

In-Memory Document Services

In 8.1.5, the document services use result tables to store their results. If you don't need to store the result of the operation, result table I/O is wasteful, and application code must be written to manage these temporary rows. When you do need to store the results, the schema of the result tables may not be sufficient for your needs -- you may need to store a varchar2 primary key, for instance.

8.1.6 introduces in-memory variants of all document services which avoid result tables and instead output to PL/SQL variables. This avoids unneeded I/O and allows the application to more easily process the transient results. We'll look at these new functions individually, starting with:

  ctx_doc.filter(<index_name>, <textkey>, <result>, <plaintext>)

The result argument is a CLOB locator. The document will be filtered and the filtered text placed into this CLOB. If the locator passed in is NULL, a temporary CLOB is allocated for you. It is up to the app to deallocate it.
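 
For illustration, a hedged sketch (the index name and textkey here are hypothetical) which lets the call allocate a temporary CLOB and frees it afterwards:

  declare
    result clob;   -- left NULL, so a temporary CLOB is allocated for us
  begin
    ctx_doc.filter('myindex', '1', result, TRUE);
    dbms_output.put_line(dbms_lob.substr(result, 200, 1));
    dbms_lob.freetemporary(result);   -- the app must deallocate it
  end;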

  ctx_doc.markup(<index_name>, <textkey>, <text_query>, <result>,
                 <plaintext>, <tagset>, <starttag>, <endtag>,
                 <prevtag>, <nexttag>)

The result argument is a CLOB locator, just like in the filter call.

  ctx_doc.highlight(<index_name>, <textkey>, <text_query>, 
                    <restab>, <plaintext>)

The highlight call can return multiple highlights, so a single PL/SQL variable isn't sufficient. Instead, the restab argument is a PL/SQL table of type ctx_doc.highlight_tab:

  type highlight_rec is record (
    offset number,
    length number
  );
  type highlight_tab is table of highlight_rec index by binary_integer; 

The first highlight will be placed at index 1. Existing contents of the table when passed in are deleted.
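 
A minimal sketch of the in-memory highlight (index name, textkey, and query are hypothetical):

  declare
    hl ctx_doc.highlight_tab;
  begin
    ctx_doc.highlight('myindex', '1', 'oracle', hl, TRUE);
    for i in 1..hl.count loop
      dbms_output.put_line('offset '||hl(i).offset||' length '||hl(i).length);
    end loop;
  end;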

  ctx_doc.themes(<index_name>, <textkey>, <restab>, <full_themes>)

Like highlight, ctx_doc.themes can return multiple themes. restab is a PL/SQL table of type ctx_doc.theme_tab:

  type theme_rec is record (
    theme  varchar2(2000),
    weight number
  );
  type theme_tab is table of theme_rec index by binary_integer;

The first theme is placed at index 1. Existing contents of the table when passed in are deleted. The themes are not guaranteed to be placed in the theme_tab in any particular order.

  ctx_doc.gist(<index_name>, <textkey>, <result>, <glevel>, <pov>,
               <numparagraphs>, <maxpercent>)

The in-memory version of gist is the only call which deviates from its result-table version. Unlike the result-table version, which returns multiple gists -- one per point-of-view -- the in-memory version returns a single gist, selected by the pov argument. If no pov is specified, the GENERIC gist is generated. result is a CLOB locator as in ctx_doc.filter or ctx_doc.markup.



Thesaurus Improvements

Thesaurus Maintenance

The thesaurus functionality in 8.1.5 supports file import and export, but once a thesaurus is in the database, it's difficult to change -- the API allows only the addition of new phrases and relations.

8.1.6 rounds out the API with new maintenance functions which allow changing, renaming, and removal of phrases and relations. There are also calls for adding new phrases and relations which are hopefully less confusing than the existing functions.

Let's look at the new additions call-by-call.

 

ALTER_THESAURUS

  ctx_thes.alter_thesaurus(<tname>, <op>, <operand>)

This call has two variations. To rename a thesaurus, op is set to 'RENAME' and operand is set to the new name:

  ctx_thes.alter_thesaurus('oldthesname','RENAME','newthesname');

To truncate a thesaurus -- that is, remove all phrases and relations -- op is set to 'TRUNCATE' and the operand is not needed:

  ctx_thes.alter_thesaurus('mythes','TRUNCATE');

'RENAME' and 'TRUNCATE' also have symbols defined -- CTX_THES.OP_RENAME and CTX_THES.OP_TRUNCATE.

 

ALTER_PHRASE

  ctx_thes.alter_phrase(<tname>, <phrase>, <op>, <operand>)

This call has three variations. To change a phrase, set op to 'RENAME':

  ctx_thes.alter_phrase('mythes','old phrase','RENAME','new phrase');

This leaves all the relations intact -- it just changes the phrase. You can also use this to add, change, or remove qualifiers:

  ctx_thes.alter_phrase('mythes','crane','RENAME','crane(bird)');
  ctx_thes.alter_phrase('mythes','crane(bird)','RENAME','crane(stork)');
  ctx_thes.alter_phrase('mythes','crane(bird)','RENAME','crane');

The second variation makes a phrase the preferred term in its synonym ring:

  ctx_thes.alter_phrase('mythes','laser','PT');

If the phrase's synonym ring already has a preferred term, this operation overrides it. The old preferred term becomes a non-preferred synonym and the named phrase becomes the preferred term.

The third variant changes the scope note on a phrase:

  ctx_thes.alter_phrase('mythes','ships','SN', 'sea-going vessels');

The operand overwrites whatever scope note was on the phrase, so this can be used to add, change, or remove a scope note.

The operations all have symbols: CTX_THES.OP_RENAME, CTX_THES.OP_PT, and CTX_THES.OP_SN.

 

DROP_PHRASE

  ctx_thes.drop_phrase(<tname>, <phrase>)

This call removes a phrase from a thesaurus. BT*/NT* relations are patched around the dropped phrase. A relation is established between each narrower term and each broader term. For instance, given the simple thesaurus:

  A
    BT B
  B
    BT C

and you do:

  ctx_thes.drop_phrase('mythes','B');

then the resulting thesaurus looks like:

  A
    BT C

For multiple broader and narrower terms:

  A   B
   \ /
    C
   / \
  D   E

when C is removed, the thesaurus looks like this:

  A  B
  |\/|
  |/\|
  D  E

This patching is done for all hierarchical relations -- BT, BTI, BTP, BTG, etc. Associative (RT) relations are not patched. If A RT B and B RT C, and B is removed, then no A RT C relation is created. Synonymity is also not affected. If the removed phrase is the preferred term, no phrase in the ring is elevated to preferred term.

 

CREATE_RELATION

  ctx_thes.create_relation(<tname>, <phrase>, <rel>, <relphrase>);

This call creates a relation between phrase and relphrase. You can do this in 8.1.5 using create_phrase, but the usage is confusing. For instance, if you want to create the thesaurus:

  dog
    BT  animal

It looks like:

  ctx_thes.create_phrase('mythes', 'dog');
  ctx_thes.create_phrase('mythes', 'animal', 'BT',  'dog');

Two problems here. The first problem is that you need two calls to create a thesaurus with one relation. The second problem is that in the second call, DOG and ANIMAL are in opposite order of how they look in the thesaurus file. This is confusing and unintuitive.

create_relation solves both by automatically creating unknown phrases or relphrases, and adhering closely to thesaurus file order. The above thesaurus is created using create_relation like this:

  ctx_thes.create_relation('mythes', 'dog', 'BT', 'animal');

create_relation also allows you to add translations to phrases, using <lang>: as the relation:

  ctx_thes.create_relation('mythes', 'dog', 'french:', 'chien');

Again, very similar to how it would look in the thesaurus file.

create_relation is now the preferred method; the rel and relphrase arguments of create_phrase are deprecated and should not be used.

 

DROP_RELATION

  ctx_thes.drop_relation(<tname>, <phrase>, <rel>, <relphrase>);

This is the opposite of create_relation, and removes a relation between two phrases, leaving the phrases and their other relations untouched. For instance, the sequence:

  ctx_thes.create_relation('mythes', 'A', 'BT', 'B');
  ctx_thes.drop_relation('mythes', 'A', 'BT', 'B');

results in mythes having the phrases A and B, but no relations between them.

You can remove BT*/NT* and RT relations. You can also use SYN to remove the phrase from its synonym ring:

  ctx_thes.create_relation('mythes', 'A', 'SYN', 'B');
  ctx_thes.create_relation('mythes', 'A', 'SYN', 'C');

--> A, B, and C are synonyms

  ctx_thes.drop_relation('mythes', 'A', 'SYN');

--> B and C are synonyms, but A is not in the ring

or PT to remove preferred term status, but leave the phrase in the ring:

  ctx_thes.create_relation('mythes', 'B', 'SYN', 'C');
  ctx_thes.create_relation('mythes', 'B', 'USE', 'A');

--> A, B, and C are synonyms, and A is the preferred term

  ctx_thes.drop_relation('mythes', 'A', 'PT');

--> A, B, and C are synonyms, but the ring no longer has any preferred term

Finally, you can use <lang>: to remove translations:

  ctx_thes.drop_relation('mythes', 'dog', 'french:', 'chien');

relphrase is actually optional. If omitted, all relations of that type are removed from the phrase. For instance:

  ctx_thes.drop_relation('mythes','A', 'BT');

drops all broader term relations of A.

  ctx_thes.drop_relation('mythes','A', 'french:');

drops all french translations of A.

 

Maintenance Security

One last security note for all the above functions -- only the owner of the thesaurus or ctxsys is allowed to modify the thesaurus, although all users can use a thesaurus during query.

 

PL/SQL Table Expansion

The thesaurus is really meant to be used during a contains query, with the thesaurus operators. However, 8.1.5 does include a simple PL/SQL expansion API. The functions work just like their query counterparts, and return the expansion as a single string. For instance, given the thesaurus:

  animal
    NT dog
    NT cat
    NT bird

and the PL/SQL:

  declare
    exp varchar2(80);
  begin
    exp := ctx_thes.nt('animal',1,'mythes');
    dbms_output.put_line(exp);
  end;

would print out something like:

  {animal}|{dog}|{cat}|{bird}

which allows you to see the expansion. However, the single string expansion must be parsed if you want the individual terms, for use in a web interface or your own thesaurus dumping mechanism.

8.1.6 adds PL/SQL table expansion for such applications. Expansions are placed into a PL/SQL table like this:

  type exp_rec is record (
    xrel    varchar2(12),
    xlevel  number,
    xphrase varchar2(256)
  );
  type exp_tab is table of exp_rec index by binary_integer;

xrel holds the relation of the input phrase to the output phrase (BT, NT, SYN, etc). The input phrase is also in the table, with xrel of PHRASE. For SYN and TRSYN expansions, the preferred term has xrel of PT.

xlevel tells you the distance from the input phrase to the output phrase. This is used mainly for hierarchical (BT*/NT*) expansions. For instance, in a thesaurus like:

  A
   NT B
  B
   NT C
  C
   NT D

and an NT expansion on A for 3 levels, the resulting table looks like:

  xrel   xlevel xphrase
  ------ ------ -------
  PHRASE   0       A
  NT       1       B
  NT       2       C
  NT       3       D

For non-hierarchical expansions, xlevel is 1 for all expanded terms, except the phrase itself, which is always 0, and translations of synonyms in TRSYN, which always have xlevel of 2.

xphrase is the output phrase itself. This will include the qualifier in parentheses, if one exists for the phrase.

The table API is like the existing string API, but instead of returning the expansion, it takes a table as the first argument. Our example above, using the new API, looks like:

  declare
    exp ctx_thes.exp_tab;
  begin 
    ctx_thes.nt(exp, 'animal', 1, 'mythes');
    for i in 1..exp.count loop
      dbms_output.put_line(lpad(' ',2*exp(i).xlevel)||
                           exp(i).xrel||' '||exp(i).xphrase);
    end loop;
  end;

and prints out:

  PHRASE animal
    NT dog
    NT cat
    NT bird

To complete the API, 8.1.6 adds two new expansion functions:

  sn(<phrase>, [<tname>]); 

returns the scope note for the given phrase, and

 thes_tt(<restab>, [<tname>])

returns thesaurus top terms into a PL/SQL table. This is different from TT. TT returns the top term for a given phrase. THES_TT searches the entire thesaurus and returns all top terms. A top term for THES_TT is defined as a term which has narrower terms but no broader terms. This is designed to be used in browsing applications, which may wish to start with the thesaurus top terms. Note that because THES_TT searches the entire thesaurus, it may take some time to complete for large thesauri. If the application needs the thesaurus top terms frequently, we strongly suggest caching the results.
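 
A minimal sketch of thes_tt, assuming restab is the same ctx_thes.exp_tab type used by the other table expansion calls:

  declare
    tops ctx_thes.exp_tab;
  begin
    ctx_thes.thes_tt(tops, 'mythes');
    for i in 1..tops.count loop
      dbms_output.put_line(tops(i).xphrase);
    end loop;
  end;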


Chapter 3 : Oracle 8.1.7

Oracle8i interMedia Text 8.1.7 - Technical Overview

Garrett Kaminaga, Product Development, Database and Application Server Technologies, Oracle HQ

This is a technical overview of the improvements found in interMedia Text, version 8.1.7. This is intended for an audience familiar with version 8.1.6. If you are new to interMedia Text, please start with the 8.1.5, then 8.1.6 technical overviews.

This document is only a summary of new features and internal changes. As always, please refer to the 8.1.7 interMedia Text documentation addendum for detailed information.

 


 

Table of Contents

 


 

Overview of 8.1.7 Improvements

The improvements in version 8.1.6 of interMedia Text were mainly in the areas of structured document support and heterogeneous document sets. While 8.1.7 includes some improvements which extend heterogeneous document set handling, the focus is performance and ease-of-use.

For performance, 8.1.7 offers a new indextype providing fast mixed queries on small, simple text fields, prefix indexing for faster wildcard searching, and many internal tweaks -- suggested by exhaustive lab analysis on real-world data -- that improve query performance.

For ease-of-use, 8.1.7 has migrated entirely to the oracle shadow process, eliminating the need for extproc setup. It also provides a new datastore for easy multi-column concatenation, primary-key-less indexing, and many other changes designed to reduce the amount of work needed to create and maintain an interMedia Text application.

Other interesting features include a fuzzy score passthrough, a new tokenizing document service, and German composite query improvements.


Upgrading 8.1.6 to 8.1.7

Automated Upgrade

Manual Upgrade From 8.1.5

  1. run ?/ctx/admin/upgrade/s0801060.sql as SYS.
    This grants new, needed database privileges to ctxsys
  2. run ?/ctx/admin/upgrade/u0801060.sql as CTXSYS.
    This upgrades the ctxsys schema to 8.1.6

Manual Upgrade From 8.1.6

  1. run ?/ctx/admin/upgrade/s0801070.sql as SYS.
    This grants new, needed database privileges to ctxsys
  2. run ?/ctx/admin/upgrade/u0801070.sql as CTXSYS.
    This upgrades the ctxsys schema to 8.1.7
  3. run ?/ctx/admin/dr0typec.pkh as CTXSYS.
    This creates the CTXCAT index type, which is new for 8.1.7
  4. run ?/ctx/admin/dr0pkh.sql as CTXSYS.
    This re-creates all public and private package headers
  5. run ?/ctx/admin/dr0plb.sql as CTXSYS.
    This re-creates all public and private package bodies
  6. run ?/ctx/admin/dr0type.plb as CTXSYS.
    This re-creates the CONTEXT index type body
  7. run ?/ctx/admin/dr0typec.plb as CTXSYS.
    This creates the CTXCAT index type body, new for 8.1.7
  8. run ?/ctx/admin/dr0itypc.sql as CTXSYS.
    This creates the bindings and index type for CTXCAT, new for 8.1.7
  9. check for any invalid ctxsys objects and alter compile as needed

Upgrading Existing Indexes

A utility script -- ?/ctx/sample/util/dr817rx.sql -- is provided to rebuild existing $X indexes. It should be run as ctxsys, and should be passed the index owner and index name, as in:

   
 sqlplus ctxsys/ctxsys
  ...
  SQL> @dr817rx scott myindex

Please ensure that ctxsys' TEMP tablespace has sufficient space to re-create the $X index.

 


 

Catalog Index

What is the Catalog Index?

The context index type is designed for collections of news articles, technical reports, resumes -- relatively large, coherent documents. This does not describe the majority of database text, which is usually short text fragments -- names, item descriptions, addresses, etc.

While a context index can be applied to these text fragments, it's usually overkill. Filtering, section searching, relevance ranking, and the rest of the rich feature set that makes sense when applied to document collections are less useful when applied to text fragments. Furthermore, the maintenance demands of a context index -- asynchronous DML, optimization -- are too troublesome for documents that are just a few words long. As a result, these text fragments are usually unindexed or b-tree indexed, limiting search to exact equality or prefix searches with LIKE.

Text fragments need text searching. Since the context index is not optimal for fragments, we designed one that is -- a catalog index. Using a different index structure and sacrificing some of the context index's flexibility and feature set, the catalog index provides text indexing that is free from maintenance overhead and text searching that has structured query support and faster response time than the context index.

The description of the catalog index type in this section is a high-level overview. If, after reading this section, you find yourself wondering why ctxcat has a particular limitation, or how it all works, please refer to "Ctxcat Index Internals". If you would like to get an idea of ctxcat indexing speed, required space, or query performance, please see "Ctxcat Performance Overview". Both are separate documents which should be available from wherever you got this document.

Creating a CTXCAT Index

As an example in this chapter of the overview, we'll use an on-line auction site which holds its items in a table like this:
  create table auction (
    item_id           NUMBER   PRIMARY KEY,  -- auction item identifier
    item_desc         VARCHAR2(80),          -- free-form item description
    price             NUMBER,                -- current price of the item
    end_date          DATE                   -- end time of the auction
  );
We would like to catalog-index the ITEM_DESC field. The catalog index is a domain index type just like the context index. This means that to create a catalog index, simply use create index and specify the catalog index type (CTXCAT):
  create index auctionx on auction(item_desc)
  indextype is ctxsys.ctxcat;
You can specify preferences using a parameters clause, just like the context index. However, the keywords you can use in a ctxcat index parameters string are limited to:
  INDEX SET   LEXER   MEMORY   STOPLIST   STORAGE   WORDLIST 
Index set is new and specific to ctxcat -- we'll discuss this a bit later. ctxcat does not support the datastore, filter, or section group keywords, so it is limited to plain-text, unstructured data stored directly in database columns. Language, format, and charset columns are also not supported, meaning that the data must be homogenous. By extension, this also disallows use of the multi-language stoplist and multi-language lexer.

The CTXCAT Query Language

Once the ctxcat index is created, you query using the operator catsearch instead of contains:
  select item_desc from auction
   where catsearch(item_desc, 'oracle', null)>0;
This finds all rows in auction which have "oracle" in the item_desc text. Like contains, the first argument is the text column, and the second is the query string. The third argument we'll talk about below -- just ignore it for now. There is no score label argument, because catsearch does not provide scoring -- it is a simple boolean search.

The catsearch operator uses a much simpler query language than contains, with the intent that web-style queries can be passed through to catsearch without extra parsing. The query language rules are:

  • Multiple words are treated as an AND.
  • Vertical bar is used for OR.
  • A leading plus sign on words is ignored and discarded
  • A leading minus sign excludes words (must have included words)
  • Double quotes delimit phrases.
  • Parentheses group operations.
Or, in comparison:
  catsearch Query   Equivalent contains query
  ---------------   -------------------------
  A B               A & B
  A | B             A | B
  +A -B             A ~ B
  "A B"             A B
  "A B" C           (A B) & C
  (A B) | C         (A & B) | C
AND, OR, AND-NOT, and PHRASE are the only functions currently available in the catsearch query language. catsearch does not support any of the other features found in the context query language such as accum, within, fuzzy, stem, wildcard, thesaurus, near, etc.

Index Maintenance

Unlike context indexes, ctxcat indexes require no special maintenance. DML is completely transactional. On an insert, the new row is immediately indexed and the information inserted into $I. Newly inserted rows can be found by catsearch queries immediately -- even before they are committed. Deletes and updates are also transactional. The transactional DML processing means that a ctxcat index will slow down DML on the base table. This is similar to the effect of a b-tree index, but the impact of a ctxcat index is much larger.

Optimization is also unnecessary for ctxcat indexes. The index structure ensures that it is always largely optimal.

Structured Query Support

The ctxcat index also provides structured query support, so that query criteria for other columns can be efficiently combined with the text search. Let's continue our auction example. A plausible requirement is to allow people to limit their auction search to a price range. We could certainly add this criterion to the top-level query:
  select item_desc from auction
   where catsearch(item_desc, 'oracle', null) > 0
     and price < 15
and it will work, but may be slow as the text results are individually post-filtered to verify the price range. It would be much more efficient to process both the text and range criteria at once -- just like a concatenated b-tree index on two columns is faster than two separate indexes when querying on two column criteria. ctxcat indexes support something similar using index sets.

We start by creating an index set:

  ctx_ddl.create_index_set('aucset'); 
An index set, like a stoplist, is only a holding object. We have to add content in order to define its behavior. Index sets hold indexes. Each index is an ordered list of base table columns which will be used in mixed query. In our case, we want to perform a mixed query with price criteria, so we add an index on price to the index set:
  ctx_ddl.add_index('aucset','price'); 
The index set can take up to 99 indexes, each made up of ordered lists of columns from the base table. However, there are two important restrictions on these columns:
  1. only allowable types are NUMBER, DATE, and CHAR/VARCHAR2 with maximum length <= 30 bytes
  2. NULL values are not allowed. A NULL in a user column will cause an index error and the row will not be indexed.
So far we have created our index set and added a "price" index. Now we use this index set when creating our index:
  create index auctionx on auction(item_desc)
  indextype is ctxsys.ctxcat
  parameters ('index set aucset')
ctxcat includes the "price" column of the base table in the text index. This is not the same as the widely-used context technique of concatenating multiple columns' data using the user_datastore. ctxcat does it quite differently -- see the separate technical description document for more details.

Once price is part of the $I index, we can reference it in the third parameter of catsearch:

  select item_desc from auction
   where catsearch(item_desc, 'oracle', 'price < 15') > 0
which allows the ctxcat index to process both the text query and structured predicate at the same time.

Note that you can't put just any structured criteria in there. It is restricted and simplified. The restrictions are:

  • The left-hand side (the column name) must be a column named in at least one of the indexes of the index set
  • The left-hand side must be a plain column name -- no functions, expressions, etc.
  • The operator is limited to: <, <=, =, >=, >, BETWEEN, and IN
  • The right-hand side must be composed of literal values -- no functions, expressions, other columns, subselects, etc.
  • Multiple criteria can be combined with AND. OR is not supported.
So all these are supported:
  catsearch(text, 'dog', 'foo > 15')
  catsearch(text, 'dog', 'bar = ''SMITH''')
  catsearch(text, 'dog', 'foo between 1 and 15')
  catsearch(text, 'dog', 'foo = 1 and abc = 123')
And these are NOT supported:
  catsearch(text, 'dog', 'upper(bar) = ''A''')
  catsearch(text, 'dog', 'bar LIKE ''A%''')
  catsearch(text, 'dog', 'foo = abc')
  catsearch(text, 'dog', 'foo = 1 or abc = 3')
Of course, this is still a database, so if you really need them you can always move these more sophisticated criteria out of the catsearch and back in the main select -- it may be a bit slower, but the query can be processed.
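 
For instance, a sketch of moving an unsupported LIKE criterion back into the main select (the LIKE predicate here is hypothetical):

  select item_desc from auction
   where catsearch(item_desc, 'oracle', 'price < 15') > 0
     and item_desc like '%antique%';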

You can also use the third argument to order the results:

  select item_desc from auction
   where catsearch(item_desc, 'oracle', 'order by price') > 0
Note that this query does not have an order by clause. The rows will come back from catsearch in price order without it. In fact, adding an order by clause will actually slow the query down. The third catsearch argument is not integrated with the oracle SQL query processor, so the query processor will not be aware that the rows coming back from catsearch are already in order. As a result, it will go through an unneeded SORT ORDER BY.

You can combine structured criteria and order by, and also use the ASC and DESC keywords, as long as all columns are going in the same direction:

  catsearch(text, 'dog', 'order by foo desc, bar desc')
  catsearch(text, 'dog', 'foo = 15 order by bar')
Some combinations of structured criteria and order by cannot be processed efficiently by the ctxcat index. When this happens, the query will fail with the error:
  DRG-10844: index cannot execute this structured predicate efficiently 
This can happen when, for instance:
  • a column not in the index set is referenced
  • order by columns are going in different directions
  • the order by column list is not found in the index set
as well as other, more arcane occasions. When this error occurs, you can simply move parts of the structured criteria or order by out of the third argument and back in the main select.

Creating Catalog Indexes -- Further Notes

The catalog index does not support document services (highlighting, markup, themes, gists) or query services (explain, query feedback, browse words). Support may be added in a future release.

The catalog index does not support theme indexing -- if given a lexer preference with INDEX_THEMES set, it will ignore this setting.

In order to support Japanese, the catalog index must be built with prefix indexing -- this feature is new for 8.1.7 and is covered later in this document.

The catalog index has its own set of system parameters for setting preference defaults. This allows an installation to have one set of default preferences for context indexes, and another set for ctxcat indexes. The new system parameters are:

  DEFAULT_CTXCAT_LEXER
  DEFAULT_CTXCAT_STOPLIST
  DEFAULT_CTXCAT_WORDLIST
  DEFAULT_CTXCAT_STORAGE
  DEFAULT_CTXCAT_INDEX_SET
While you can specify a wordlist preference for ctxcat indexes, most of the attributes do not apply, since the catsearch query language does not support wildcarding, fuzzy, and stemming. The only attribute of the wordlist preference that is useful is PREFIX_INDEX, for Japanese data.

ctxcat uses a $I table to store the text information, just like context. However, ctxcat $I tables have many more rows and take up much more space than context $I tables for the same data. Ensure that the indexing session has sufficient rollback and adequate storage space.

Index memory can be decreased to reduce rollback usage. ctxcat indexes have a different structure than context indexes, and do not produce sub-optimal indexes with a low index memory setting. ctxcat indexing time is also unaffected by the indexing memory unless set very low.

The ctxcat index will create several regular (b-tree) indexes on its $I table. Because they are created in a PL/SQL stored procedure owned by CTXSYS, these b-tree indexes are created using the temporary tablespace of CTXSYS. Ensure that this is set to a tablespace with sufficient working space.

For further technical and quantitative information, see the separate documents "Ctxcat Index Internals" and "Ctxcat Performance Overview".

How to Choose

In most cases, the choice between ctxcat and context indexes will be easy to make. The limitations of ctxcat preclude its use in many situations, while its zero maintenance and structured query support make it clearly superior in others. Use ctxcat for text fragments and context for information retrieval. ctxcat and context are complements, not competitors.

The possible overlap is on medium-size (500 - 2000 bytes) unstructured plain-text documents which need very simple text searching and structured query support -- resumes which need to be keyword-searched and ordered by date, for example. The application needs to balance context's indexing speed and storage efficiency against ctxcat's structured query support. Unless the performance of the context index is unacceptable, the context index should probably be preferred in such cases. While the application may be able to bear ctxcat's limitations now, context offers more flexibility if the application requirements expand in the future.

Finally, if you simply cannot decide, you can always elect to create both. Because they are different index types, it is possible to create both a context and a ctxcat index on the same column.

Datastore Improvements

Multi-Column Datastore

Many customers use the user datastore to concatenate multiple table columns into a single virtual document, for more efficient mixed query. Some cases where user datastore is used in this way are good candidates for the ctxcat index type, discussed above, but many will need to stick with the context index for its more powerful query language and more extensive feature set. (The question of ctxcat vs. context section search is addressed further in "Ctxcat Performance Overview")

Since concatenation of multiple base table columns is such a popular use of the user datastore, we've created a datastore specifically for this purpose: the multi_column_datastore.

Using the multi column datastore is easier than doing the same thing in the user datastore -- you simply create your preference:

  ctx_ddl.create_preference('mymulti','MULTI_COLUMN_DATASTORE'); 
then set your column list:
  ctx_ddl.set_attribute('mymulti', 'COLUMNS', 'column1, column2, column3'); 
create the index using the multi-datastore:
  create index myindex on mytable(columnx)
  indextype is ctxsys.context
  parameters('datastore mymulti')
and at indexing time, the engine will create a virtual document from all the specified columns like this:
  <column1>
  column 1 contents here
  </column1>
  <column2>
  column2 contents here
  </column2>
  etc.
Note that the indexed column is NOT automatically included in the document. Only the specified column list columns are used to create the document. This format is ideal for sectioning with the basic section group, although a section group is NOT automatically created when creating or using a multi-column datastore preference.

The column list can contain expressions, function calls, etc.:

  ctx_ddl.set_attribute('mymulti', 'COLUMNS',
                        'col1, col2 + 5, someowner.somefunction(col3) col3');
The SQL is executed as user ctxsys, so functions should be fully qualified with owner name.

The column list supports most atomic column types. NUMBERs and DATEs are converted to text using the default format mask. TO_CHAR can be used in the column list to use a different format mask. RAW and BLOB values are not converted to hex, but are directly concatenated as binary data. LONG, LONG RAW, NCHAR, NCLOB, object and collection columns are not supported.

Because the column list can contain expressions, there are no triggers created on the columns in the column list -- updates must be propagated by application code or through user-defined triggers.
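 
One common approach is a trigger that touches the indexed column whenever a column in the column list changes, so the row is picked up at the next sync. A sketch, reusing the hypothetical mytable/columnx names from above:

  create or replace trigger mytable_multi_trg
  before update of column1, column2, column3 on mytable
  for each row
  begin
    -- re-assigning the indexed column marks the row as pending DML
    :new.columnx := :new.columnx;
  end;
  /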

Because the column list can contain function calls, and because the executing user is CTXSYS, the multi datastore must be securely controlled. Only CTXSYS is allowed to create preferences using MULTI_COLUMN_DATASTORE, although once created any user can use it to create an index. If this is too restrictive, the DBA can create a stored procedure owned by ctxsys which calls create_preference. This stored procedure can then be called by any user to create MULTI_COLUMN_DATASTORE preferences.
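 
A hedged sketch of such a wrapper (the procedure and argument names are hypothetical), created as CTXSYS:

  create or replace procedure create_multi_pref (
    pref_name in varchar2,
    col_list  in varchar2
  ) as
  begin
    ctx_ddl.create_preference(pref_name, 'MULTI_COLUMN_DATASTORE');
    ctx_ddl.set_attribute(pref_name, 'COLUMNS', col_list);
  end;
  /
  grant execute on create_multi_pref to public;

The DBA can, of course, grant execute to a specific role or user instead of public.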

MULTI_COLUMN_DATASTORE is implemented entirely in C, so it avoids the PL/SQL and temporary lob overhead of the user datastore. As a result, it indexes faster, but is less powerful, since it is limited to expressions on the columns of the base table. The user_datastore is still required for more complex logic or for concatenating data from multiple tables.

Safe Callouts Allowed in USER_DATASTORE

interMedia Text no longer uses the safe callout (extproc) mechanism for indexing in 8.1.7. As a result, user_datastores can now call safe callouts, which was not allowed before.

Optional FILE/URL Security

FILE and URL datastores access files on the database machine. This may be undesirable for more sensitive sites, since any user can browse the file system accessible to the oracle user using these datastores. For those sites, 8.1.7 introduces the FILE_ACCESS_ROLE system parameter. The DBA can set this system parameter to the name of a database role. If set, any user attempting to create an index using FILE or URL datastores must have this role, or the index creation will fail. For instance, if the DBA does:
  ctx_adm.set_parameter('FILE_ACCESS_ROLE','WHITEHAT'); 
then when user SCOTT tries to do:
  create index foox on foo(text) indextype is ctxsys.context
  parameters('datastore ctxsys.file_datastore')

the engine will test if SCOTT has the role WHITEHAT. If he does, then the create index will proceed as normal. If not, then the create index will fail.

Note that this is checked only at create index time. Setting this parameter or granting/revoking the named role has no effect on existing indexes using the file datastore.

Username/Password Support in URL Datastore

FTP protocol URL's in the URL datastore now support embedded username and password, as in:
  ftp://username:password@ftp.hostname.com/dir/file.doc 

Filter Improvements

PL/SQL Procedure filter

The user filter object currently employs an external executable which is invoked once for each document. This can be slow because each invocation is burdened by shell overhead and startup costs. The PL/SQL procedure filter replaces the external executable interface with a PL/SQL interface, not unlike the user datastore. With the integration of indexing into the oracle shadow process, safe callout user filters are now supportable. Furthermore, a PL/SQL interface allows PL/SQL or Java stored procedure filtering, which is useful in a wide variety of situations.

When creating a procedure filter, you must specify the procedure name in the PROCEDURE attribute:

  ctx_ddl.create_preference('myfilt','procedure_filter');
  ctx_ddl.set_attribute('myfilt','procedure','normalize');
The procedure filter is subject to the same security constraints as the user datastore. The procedure must be owned by CTXSYS, and any user using the preference to create an index must have been granted EXECUTE permission on the stored procedure.

By default, the filter procedure should have two parameters. The first parameter is a BLOB and is the incoming binary version, provided by the datastore. The second parameter is a CLOB and is the outgoing text version produced by the filter. So, from the example above, we are expecting a procedure like this:

  PROCEDURE NORMALIZE(IN BLOB, IN OUT NOCOPY CLOB) 
The input and output types can be changed, using the INPUT_TYPE and OUTPUT_TYPE attributes. For instance:
  ctx_ddl.set_attribute('myfilt','input_type','clob');
  ctx_ddl.set_attribute('myfilt','output_type','varchar2');
specifies a signature of (IN CLOB, IN OUT NOCOPY VARCHAR2) for the filter procedure.

In addition to BLOB, CLOB, and VARCHAR2, both input and output type can be set to FILE. When input is set to FILE, the incoming document will be spooled to a file and the filename passed in as the first argument. When output is set to FILE, interMedia Text will pass a filename in as the second argument. Your filter procedure is expected to place the filtered content into that file. The type of the procedure parameter should be VARCHAR2. This facility is useful when the filter procedure is actually a safe callout, because dealing with OS files is a lot simpler than accessing LOBs.

The filter procedure can also request that the rowid, charset, or format columns be passed in. For instance:

  ctx_ddl.set_attribute('myfilt','ROWID_PARAMETER','TRUE') 
tells interMedia Text to pass the rowid in as the first parameter, followed by the input, followed by the output. ROWID_PARAMETER, CHARSET_PARAMETER, and FORMAT_PARAMETER are all independent. The order is rowid, then format, then charset, but only the requested parameters are passed. For instance:
  ctx_ddl.create_preference('myfilt','procedure_filter');
  ctx_ddl.set_attribute('myfilt','procedure','normalize');
  ctx_ddl.set_attribute('myfilt','input_type','clob');
  ctx_ddl.set_attribute('myfilt','output_type','varchar2');
  ctx_ddl.set_attribute('myfilt','rowid_parameter','TRUE');
  ctx_ddl.set_attribute('myfilt','charset_parameter','TRUE');
is expecting CTXSYS.NORMALIZE to have the signature
  (IN ROWID, IN VARCHAR2, IN CLOB, IN OUT NOCOPY VARCHAR2) 
The first parameter is the document's rowid, the second is the value of the document's charset column, the third is the incoming document content, and the fourth is the outgoing document content.

Note that the INSO filter still uses the external executable method of filtering, but this may change to a safe callout in a future release.

Section Group Improvements

8.1.7 contains no changes to section groups or section searching. However, this area will get a lot more attention in the next release.

Lexer Improvements

Mixed Case and Composite

The 8.1.7 basic lexer now allows MIXED_CASE to be FALSE when COMPOSITE is set. In past versions, setting composite forced a mixed-case index.

Improved Composite Indexing/Query

Previously, when using German or Dutch decompounding, any word more than 5 characters long was sent through decompounding at index time. At query time, each query term was decompounded and a search done on the composites. This resulted in poor precision -- the excessive decompounding and normalization meant that any single query term could match many other words, with no way to limit the search to exact match.

8.1.7 limits index decompounding to real compound words, and eliminates decompounding at query time. This means that queries are now by default exact match, so are much more precise.

Stoplist Improvements

Multi-stoplist

interMedia Text 8.1.6 introduced the multi-lexer, which allowed storing documents of different languages in a single table. However, the index was limited to a single stoplist, so cross-language homographs (words from different languages which are spelled the same but mean different things - e.g. DIE in German and English, OR in French and English) could not be added to the stoplist, or the stoplist could unintentionally eliminate meaningful words.

Version 8.1.7 introduces a multi-language stoplist. You create a multi-language stoplist preference using the MULTI_STOPLIST object:

  ctx_ddl.create_preference('mystop','MULTI_STOPLIST'); 
When adding stopwords, each stopword's language must be identified as the second parameter to add_stopword:
  ctx_ddl.add_stopword('mystop', 'Die', 'german');
  ctx_ddl.add_stopword('mystop', 'Or', 'english');
At indexing time, the language column of each document is examined, and only the stopwords for that language are eliminated. At query time, the session language setting determines the active stopwords, just like it determines the active lexer when using the multi-lexer.

Wordlist Improvements

Prefix Indexing

Wildcard queries in 8.1.6 are handled by expanding the input pattern to matching tokens in $I using LIKE, then running a query on an equivalence of those matches (if you aren't familiar with equivalence, think of it as an OR). For instance, a query like
  contains(text, 'wood%') 
would first go through a query like:
  select distinct token_text
    from dr$index$i
   where token_text like 'WOOD%'
to expand the pattern. Let's say it returns wood, wooden, woody, and woods. The query would then be re-written to something like
  contains(text, 'wood = wooden = woody = woods') 
This is problematic for three reasons.
  1. The expansion step may take a long time if there are many matching terms or if the terms have many rows in $I.
  2. The pattern may match too many words, and the query will fail with "query too complex".
  3. If there are many individual tokens, it will take a long time to read and merge all their information.
So, for those applications which need faster wildcard searching, 8.1.7 adds prefix indexing. When prefix indexing is on:
  ctx_ddl.create_preference('mywordlist', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('mywordlist','PREFIX_INDEX', 'YES');
each token is chopped into multiple prefixes at index time, and each prefix indexed as token type 6. For example, a document with "TOKEN" would result in the following $I rows:
  TOKEN_TEXT      TOKEN_TYPE TOKEN_INFO
  --------------- ---------- ----------
  TOKEN           0          DOC 1 OFFSET 1
  T               6          DOC 1 OFFSET 1
  TO              6          DOC 1 OFFSET 1
  TOK             6          DOC 1 OFFSET 1
  TOKE            6          DOC 1 OFFSET 1
  TOKEN           6          DOC 1 OFFSET 1
When a prefix wildcard query is executed:
  contains(text, 'TOK%') 
interMedia Text simply removes the trailing wildcard and looks up the information for token "TOK" type 6. No LIKE expansion or multiple token merging is needed. The result is that prefix queries are executed much more quickly and there is no limit on the number of expansions. Incidentally, this is why TOKEN, type 6 must be indexed -- this allows a search for TOKEN% to find TOKEN.

The price for the query performance improvement is increased index size and time. This can be controlled using optional settings for minimum and maximum length. For instance, with:

 ctx_ddl.set_attribute('mywordlist','PREFIX_MIN_LENGTH', 3);  ctx_ddl.set_attribute('mywordlist','PREFIX_MAX_LENGTH', 4); 
Only prefixes between 3 and 4 characters in length are indexed. The token TOKEN results in
  TOKEN_TEXT      TOKEN_TYPE TOKEN_INFO
  --------------- ---------- ----------
  TOKEN           0          DOC 1 OFFSET 1
  TOK             6          DOC 1 OFFSET 1
  TOKE            6          DOC 1 OFFSET 1
A wildcard search whose pattern is below the min length or above the max length resorts to the old style of resolution (i.e. expand the pattern using LIKE and merge the results using equivalence).

Prefix indexing should be especially helpful in Japanese environments, where queries are often translated under the covers to prefix queries, and whose V-GRAM tokens are fairly short to begin with.

Prefix indexing is supported for ctxcat indexes. However, catsearch does not have a wildcard operator, so setting it is not usually useful. If ctxcat indexing with Japanese VGRAM lexer, though, prefix indexing MUST be on -- it is required for query processing.

Prefix indexing and substring indexing (introduced in 8.1.6) are separate and independent. Prefix indexing speeds up right-truncated wildcard queries only (e.g. ABC%), while substring indexing speeds up left- and double-truncated wildcard queries (e.g. %ABC, %ABC%). Prefix indexing gets a speed boost by eliminating expansion and equivalence merging. Substring indexing gets a speed boost by eliminating a full-table scan on $I to do the expansion -- it still has to do equivalence merging on the expanded list.

Index Creation and Maintenance

Full Migration to Shadow Process

In current versions of interMedia Text, query code is linked into the oracle shadow process, while indexing and document services code is invoked through extproc. In 8.1.7, all functions have been migrated to the oracle shadow process, eliminating the need for extproc setup. As a side effect, this allows user datastores and PL/SQL procedure filters to reference safe callouts.

Commit After Every Error

Errors encountered during indexing are recorded in the dr$index_error table. However, in current versions of interMedia Text, these inserts are committed only at index flush time. In some cases, the error table fills up, runs out of extents, and rolls back, which means that the index creation fails, but no error messages are left in the error table to examine and determine the cause of failure. 8.1.7 uses autonomous transactions to commit each index error message immediately, avoiding this problem.
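 
After a create index reports problems, the failed rows can be reviewed through the public errors view rather than querying dr$index_error directly -- this sketch assumes the standard CTX_USER_INDEX_ERRORS view:

  select err_timestamp, err_textkey, err_text
    from ctx_user_index_errors
   order by err_timestamp;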

Keyless Index

The ConText Option, with its relatively low degree of integration with the kernel, relied on a primary key value in order to join the text results to the base table. The migration to extensible indexing in 8.1.5 eliminated the need for a primary key, because under the covers rows were identified by rowid. However, a primary key constraint on the table was still required before you could create a context index on it.

8.1.7 removes this restriction, so that you can now create a context or ctxcat index on a table without a primary key constraint. Document service requests for this index default to rowid input for the textkey argument. Note that creating a primary-key-less index may preclude some future operations, if a database-independent method of row identification is ever required.
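A quick sketch to illustrate (table, index, and data are hypothetical):

  create table notes (text varchar2(200));          -- note: no primary key
  insert into notes values ('keyless index example');
  commit;

  create index notes_idx on notes(text) indextype is ctxsys.context;

  -- document services identify rows by rowid when there is no primary key
  select rowid from notes;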

Meta Column Replacement

interMedia Text 8.1.6 added new language, format, and charset columns for heterogeneous document sets. However, once set in create index, the column name could not be changed in alter index. You could always simply drop the index and re-create it, specifying the meta-data columns, but in 8.1.7 we have added meta-column support to alter index replace:
  alter index <indexname> rebuild parameters ('replace language column <columnname>') 
which is a little easier. Note that this is a usability enhancement only -- the index data is still dropped and rebuilt from scratch, just like any replace operation.
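For example, if a (hypothetical) index myindex should now use the column doc_lang as its language column:

  alter index myindex rebuild parameters ('replace language column doc_lang')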

Full Optimization with No Garbage

FULL mode optimization removes deleted document information and defragments rows at the same time. Some installations have no deletes or updates, but would still like to use FULL optimization because its MAXTIME parameter allows it to optimize in time-limited batches. Unfortunately, since it was intended to be run to clean up deleted document information, if there were no deleted documents, FULL mode would return without doing anything. This has been changed in 8.1.7 so that FULL mode optimization runs, even if there are no deleted documents.
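For example, a time-limited full optimization of a hypothetical index myindex, capped at 30 minutes, could be run as:

  begin
    ctx_ddl.optimize_index('myindex', 'FULL', maxtime => 30);  -- maxtime is in minutes
  end;
  /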

Single Token Optimization

A new optimization mode - token optimization - allows optimization of just a single token's information. The syntax is:
  alter index <indexname> rebuild online parameters ('optimize token <token>'); 
or, in PL/SQL:
  ctx_ddl.optimize_index(<indexname>, 'TOKEN', token=>'<token>') 
This allows a site to always keep the most frequently queried tokens optimal.
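For example, to keep the information for a heavily queried term compact (index and token are illustrative; give the token as it is stored in the index, usually uppercase):

  alter index myindex rebuild online parameters ('optimize token ORACLE');

or, from PL/SQL:

  begin
    ctx_ddl.optimize_index('myindex', 'TOKEN', token => 'ORACLE');
  end;
  /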

Query Improvements

Fuzzy Score Passthrough

The FUZZY operator (represented by the question mark (?)) expands the input term to words which are spelled similarly. It can increase query recall when query terms or words in the indexed documents are frequently misspelled. However, it could also expand to hundreds of words, leading to low precision and long-running queries.

The size of the fuzzy expansion can be controlled at index time through wordlist attributes introduced in version 8.1.5. FUZZY_NUMRESULTS limits the expansion to no more than the specified number of terms. FUZZY_SCORE limits the expansion size by fuzzy score. Fuzzy score is a measure of the degree of similarity of the input term to each result term. This ranges from 50 to 80, with 80 meaning identical. For instance, ?grisam might, internally, expand to:

 Word    Fuzzy Score
  ------- -----------
  Grisham          73
  Grissom          66
  Gresham          65
  Graham           56
Setting FUZZY_SCORE limits the expansion terms to those with a fuzzy score equal to or greater than the specified value. If in our example we set FUZZY_SCORE to 70, ?Grisham would expand only to Grisham.

Query-time control over these limits was introduced in version 8.1.6, although this feature was not documented in the official documentation nor in the interMedia Text 8.1.6 Technical Overview. The FUZZY function was expanded with two parameters for specifying the fuzzy score threshold or the numresults limit:

 FUZZY(term [, fuzzy_score [, fuzzy_numresults]])
For instance,
 FUZZY(term)
uses the index-time wordlist attribute limits. However,
 FUZZY(term, 65)
overrides the index-time setting of FUZZY_SCORE and uses 65 as the fuzzy score threshold. Similarly,
 FUZZY(term,,15)
overrides the index-time setting of FUZZY_NUMRESULTS and instead limits the size of the expansion to 15 terms.

Although you could limit the terms by fuzzy score, the contains score of the FUZZY function still reflected only the term frequency of the expanded word -- there was no way to find documents containing words similar to X, and order them by similarity to the input term. So, in 8.1.7, a new parameter has been added to the FUZZY function:

 FUZZY(term [, fuzzy_score [, fuzzy_numresults [, weight]]])
When the weight parameter is set to "weight" or "w", the FUZZY function weights each expanded term by ((3 * fuzzyscore) - 140) / 10 -- a factor proportional to its fuzzy score. For instance, our example above, FUZZY(grisam), would normally be re-written as:
 grisham = grissom = gresham = graham
with the weight parameter, this becomes
 (grisham * 7.9) = (grissom * 5.8) = (gresham * 5.5) = (graham * 2.8)
And the more similar terms, having higher weights, would tend towards higher scores. Let's see this in practice. We'll start with a simple table, indexed with FUZZY_NUMRESULTS set to 500 and FUZZY_SCORE set to 50. The query:
 select score(1), text from mytable 
   where contains(text, 'FUZZY(Grisam)',1) > 0
   order by score(1) desc
results in:
  SCORE TEXT
  ----- -----------------------
     31 John Grisham
     23 Gus Grissom
     21 Douglas Gresham
     11 Heather Graham
which allows us to easily pick out the best matches.

It is important to remember that fuzzy weighting is not simply a normalized fuzzy score -- it is a weighting of the regular, inverse-document frequency score. Because of this, it is possible for less similar terms to outscore more similar terms, if the less similar terms occur frequently in the document. For instance, if we add the document "Graham Graham Graham" to the corpus, the weighted query returns:

  SCORE TEXT
  ----- -----------------------
     37 Graham Graham Graham
     35 John Grisham
     26 Gus Grissom
     24 Douglas Gresham
     12 Heather Graham
Consequently, weighted fuzzy is most effective when applied to indexes on small, short text fields (like names), where term repetition in a single document is rarer than in longer documents.

 

$I Query in Cost Estimate

interMedia Text is integrated with the cost-based optimizer. Contains queries are passed to interMedia Text for estimation of selectivity and cost, if the text index has been analyzed. In 8.1.5 and 8.1.6, analyze index scans all rows of the $I table and calculates the document counts for the most unselective tokens. An average measure is used for all other tokens.

One problem with this approach is that the statistics are gathered at analyze time, and not updated, so over time they become increasingly out-of-date. Furthermore, because of limited statistic storage space, accurate counts are stored for only a few hundred tokens - everything else uses the average measure, which is inaccurate. Finally, because it does so much work, analyze itself is a time-consuming operation.

In 8.1.7, analyze has been changed to do very little - it now simply records the fact that it has been run. At query time, $I is directly queried to get the token document count. The statistics are more accurate, always up-to-date, and analyze index takes seconds instead of hours. Although the cost and selectivity estimation takes longer, the cost estimation queries are only $X index scans, and are the same ones that the query itself will run, so they essentially pre-cache the blocks that the query will read. Overall query time does not degrade.

 

Improved Performance on Large Datasets

We have made some internal changes to the code to optimize query-time disk access. This should result in improved text query performance for large datasets. The exact improvements implemented are too detailed for the scope of this document.

Knowledge Base Improvements

User-Defined Knowledge Bases

Version 8.1.7 allows you to load your own knowledge base, and create a completely new knowledge base from scratch. This allows a company with an extensive Spanish thesaurus, for instance, to generate a Spanish knowledge base, and use that when generating theme indexes, in ABOUT query, etc.

The steps are:

  1. Load a custom thesaurus using ctxload
  2. Set NLS_LANG so that the language portion is the target language. The charset portion must be a single-byte character set.
  3. Compile the loaded thesaurus using ctxkbtc
        ctxkbtc -user ctxsys/ctxsys -name <thesaurus> 

Limitations:

  • Oracle supplies knowledge bases in French and English only. You must provide your own thesaurus for any other language.
  • This process only works for single-byte character sets. You cannot create a knowledge base for languages which can be expressed only in multi-byte character sets. If the database character set is multi-byte, such as UTF-8, NLS_LANG must be set to a compatible single-byte character set when compiling the thesaurus.
  • It may not be possible to create effective knowledge bases for languages with extensive compounding, such as Dutch or German, or with many word forms, such as Hebrew or Arabic.
  • You can have at most one knowledge base per NLS language.

During knowledge base compilation, you can also now specify a stoplist:

        ctxkbtc -user ctxsys/ctxsys -name <thesaurus> -stoplist <stoplist>
The words of this stoplist are marked as low-information words that are prevented from becoming or contributing to a theme.
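Putting the steps together for, say, a Spanish knowledge base -- the thesaurus name, file name, and stoplist are hypothetical, and NLS_LANG is assumed to already be set to a Spanish, single-byte setting such as SPANISH_SPAIN.WE8ISO8859P1:

  ctxload -user ctxsys/ctxsys -thes -name spanish_kb -file spanish_thes.txt
  ctxkbtc -user ctxsys/ctxsys -name spanish_kb -stoplist spanish_stoplist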


Document Services Improvements

Tokens Document Service

TOKENS is a new document service which returns the words of a document as seen by the indexing engine. The document is fetched, filtered, sectioned, and lexed, and the tokens placed in a result table or PL/SQL table. This is useful for applications which may want to implement classification, clustering, or routing.

The result table form of the call is:

  ctx_doc.tokens(<index_name>, <textkey>, <restab>, <query_id>);
The result table should have the form
   
  QUERY_ID  NUMBER          -- The query id
  TOKEN     VARCHAR2(64)    -- the lexed token
  OFFSET    NUMBER          -- the character offset of the token
  LENGTH    NUMBER          -- the original length of the token before lexing
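For example, the result-table form could be exercised like this (the table name is arbitrary; 'myindex' and textkey '1' are illustrative):

  create table token_results (
    query_id  number,
    token     varchar2(64),
    offset    number,
    length    number
  );

  exec ctx_doc.tokens('myindex', '1', 'token_results', 1);

  select token, offset, length
    from token_results
   where query_id = 1
   order by offset;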

The PL/SQL in-memory form is:
 type token_rec is record (
    token   varchar2(64),
    offset  number,
    length  number
   );
   type token_tab is table of token_rec index by binary_integer;

NUM_THEMES Limit

CTX_DOC.THEMES and CTX_DOC.GIST have both been extended with a new numeric argument, NUM_THEMES. This argument limits the number of themes or POV gists generated. For instance:
  ctx_doc.themes('myidx','1','ttab',num_themes=>12)
will populate the "ttab" table with the 12 strongest themes, instead of the up-to-50 themes normally generated.


Thesaurus Improvements

In-Memory Thesaurus Build

ctxload thesaurus loading has been re-written to construct the thesaurus in memory and then array-insert the phrase information. Loading a large thesaurus is now much faster than in previous versions.

Create/Drop/Alter Translation

The ctx_thes API has been expanded with the following calls for maintaining translation information in an existing thesaurus:
  ctx_thes.create_translation(<tname>, <phrase>, <language>, <translation>)
  ctx_thes.drop_translation(<tname>, <phrase>, <language>, <translation>)
  ctx_thes.update_translation(<tname>, <phrase>, <language>, <translation>, <new_translation>)
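For example (thesaurus name, phrase, and translations are hypothetical):

  begin
    ctx_thes.create_translation('mythes', 'cat', 'FRENCH', 'chat');
    ctx_thes.update_translation('mythes', 'cat', 'FRENCH', 'chat', 'chatte');
    ctx_thes.drop_translation('mythes', 'cat', 'FRENCH', 'chatte');
  end;
  /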

Has_Relation

A relation testing procedure has been added for applications which may need to test that a relation exists without actually doing the expansion:
 ctx_thes.has_relation(<phrase>,<rel>,<tname>)
rel can be a single thesaural relation, or a comma-separated list of relations, except PT. The function returns TRUE if the phrase has any of the relations in the list. For instance:
 ctx_thes.has_relation('cat','BT,BTG')
returns TRUE if the phrase "cat" in the DEFAULT thesaurus has any broader terms or broader generic terms.

Chapter 4 : Oracle 9.0

Oracle Text 9.0.1 - Technical Overview

Garrett Kaminaga, Product Development, Database and Application Server Technologies, Oracle HQ

This is a technical overview of the improvements found in Oracle Text (the product formerly known as interMedia Text) version 9.0.1. This is intended for an audience fairly familiar with previous versions of the product.

If you are new to the product, you should familiarize yourself with the product before reading this document. The official documentation, especially the new Application Developer's Guide, is a good place to start, as are the interMedia Text technical overviews for 8.1.5, 8.1.6, and 8.1.7, in order.

This document is only a summary of new features and internal changes. As always, please refer to the official documentation for detailed information.


Overview of 9.0.1 Improvements

Three big new features stand out in the set of 9.0.1 improvements: local partitioning, XML path searching, and document classification. With local partitioning, large text indexes are easier to manage and faster to create, maintain, and query. XML path searching enables sophisticated queries which can reference and leverage the embedded structure of XML documents. The new CTXRULE indextype is the inverse of the ConText indextype -- instead of using a text query to find documents, you use a document to find queries.

The rest of the improvements are smaller changes, but were driven by user feedback, and make using the product a little easier, a little faster, or a little more robust. These include new, more sophisticated lexers for Japanese and Korean, an INSO filter timeout, per-row logging during indexing to find faulty documents, transportable tablespace support, and an on-demand filter.

 



Upgrading 8.1.7 to 9.0.1

Manual Upgrade From 8.1.7

Oracle Text is now integrated with the Database Migration Assistant (DBMA), so the Text upgrade should happen as part of the overall database upgrade procedure. This section details the manual upgrade process, in case something goes wrong or you choose to not use DBMA.

This process assumes that you are starting with a valid 8.1.7 ctxsys data dictionary, and that the system is not being used for index, DML, or query. Use SQL*Plus to run all scripts.

1. run ?/ctx/admin/s0900010.sql as SYS
this grants new, needed database privileges to ctxsys

2. run ?/ctx/admin/u0900010.sql as CTXSYS
this upgrades the ctxsys schema to 9.0.1

It is not necessary to run anything else. The u0900010.sql script calls other scripts to upgrade the indextypes (t0900010.sql), recompile all packages (dr0pkh.sql and dr0plb.sql), and recreate the shared library (dr0ulib.sql).

The script is not expected to encounter errors other than ORA-955 ("name is already used by an existing object", for public synonyms). After the script is complete, all ctx-created objects owned by ctxsys should be VALID.

For iMT veterans who may be wondering, the path is not a typo. The upgrade scripts, which used to be in their own sub-directory, are now found in the ?/ctx/admin directory.

SQL*Net Listener Setup

interMedia Text 8.1.7 shifted all code to trusted callouts, eliminating the need for SQL*Net listener configuration (and there was great rejoicing).

Sorry, extproc is back again. But just for one new document service function -- ctx_doc.ifilter, the on-demand INSO filtering call (this is covered below). If you don't use this function, you don't need to set up the listener.

A brief description of setting up the listener is below; complete details are in the Oracle 9i Server Administrator's Guide.

1. Add an entry to the tnsnames.ora:

 

       extproc_connection_data = 
         (DESCRIPTION = 
            (ADDRESS = (PROTOCOL = ipc)
                       (KEY = DBSID))
            (CONNECT_DATA = (SID = ep_agt1)))

DBSID is the database SID. ep_agt1 can be named anything. extproc_connection_data should not be changed.

2. Add the following to the listener SID_LIST:

 

       (SID_DESC = (SID_NAME = ep_agt1)
                   (ORACLE_HOME = /oracle)
                   (ENVS = LD_LIBRARY_PATH=/oracle/ctx/lib)
                   (PROGRAM = extproc))

ep_agt1 matches the CONNECT_DATA SID for extproc_connection_data in the tnsnames.ora. The PROGRAM section tells the Net8 listener to start the external procedure process. The ENVS section, which is shown here for UNIX, will ensure that the environment includes ?/ctx/lib in LD_LIBRARY_PATH. This is needed so that indexing can use the INSO filters.

3. Since the extproc_connection_data ADDRESS section specifies ipc, make sure that the ADDRESS_LIST of listener.ora accepts ipc connections.

A quick way to test the Net8 configuration is to do:

 

    exec ctx_adm.test_extproc;

from SQL*Plus, as ctxsys. If you get the error:

 

    DRG-50704: Net8 listener is not running or cannot start external procedures

then things aren't set up correctly. OTN and Oracle Support have helpful debugging guides for extproc setup, compiled problem-by-problem, and paid for in past headaches.

Upgrading Existing Indexes

 

Nothing needs to be done to upgrade existing text indexes. These should still work without any problems in 9.0.1. However, we made ctxcat DML a bit more efficient, so you might choose to re-constitute the triggers for ctxcat indexes. All you have to do is rename the index:

 

  alter index <indexname> rename to <newname>

You can, of course, rename it back to the original name if you choose.


Local Partitioned Indexes

Introduction
Creating Local Partitioned Indexes

Local Partitioned Indexes

A local partitioned index is a partitioned index on a partitioned table with a one-to-one mapping of index partitions to table partitions. It's like splitting up a large table and index into a set of smaller tables and indexes.

Local partitioned indexes have been around since partitioning support was first added to the database. However, this support was limited to b-tree and bitmap indexes -- domain indexes (including ConText indexes) could only be global indexes. The kernel's extensible indexing framework has been enhanced in 9i, allowing local partitioned text indexes for the first time.

Why Local Partitioned Indexes?

We won't discuss why table partitioning is good for large data sets -- that's covered pretty well in the general database Concepts manual. Instead, we'll list some reasons why a local partitioned text index is better than a global index:

 

  • Easier Table Maintenance
    With a global index, partition-level DDL invalidates the entire index. You could truncate one partition, and spend days rebuilding the index, even though the data in the other partitions hasn't changed. A local partitioned index truncates the corresponding index partition and the rest of the index doesn't even notice. This makes managing large sets of data much easier. This is discussed further below in the section "Partitioned Table Maintenance".
  • Parallel Sync
    Oracle Text allows only one session to sync a given index at a time. Users with large datasets have found this too limiting -- the system can't keep up with huge amounts of incoming data. With local partitioning, the index partitions are independent, and can be sync'ed in parallel. We discuss this further in the section "Local Partitioned Index Maintenance".
  • Query Performance
    A local partitioned ConText index can take advantage of partition pruning and partition iteration for faster mixed query and sorted query. In contrast, a global index wastes time producing rowids in other partitions which are simply filtered out by the query plan. Query is examined in the section "Query Considerations".

Creating a Local Partitioned Index

Let's use the following partitioned table as an example:

 

  create table ptest (id number, text varchar2(80))
    partition by range (id)
    (partition p1 values less than (100),
     partition p2 values less than (200),
     partition p3 values less than (300));

To create a local partitioned index, simply specify the LOCAL keyword:

 

  create index ptestx on ptest(text)
    indextype is ctxsys.context
    local;

The local keyword must be after the indextype clause. You can stick a parameters clause before or after the local keyword:

 

  create index ptestx on ptest(text)
    indextype is ctxsys.context
    parameters ('filter ctxsys.inso_filter')
    local

 

  create index ptestx on ptest(text)
    indextype is ctxsys.context
    local
    parameters ('filter ctxsys.inso_filter')

are both OK. By default, the index partitions will have the same name as the corresponding table partitions, so in our examples above the index partitions are named p1, p2, and p3. You can override this by specifying individual partition clauses:

 

  create index ptestx on ptest(text)
    indextype is ctxsys.context
    local
    (partition larry, partition moe, partition curly)
    parameters ('filter ctxsys.inso_filter')

The partition clauses must come immediately after the local keyword. The partition clauses correspond to the table partitions in order. For instance, our first table partition is p1, so, in our example above, its index partition will be named larry, because that's the first partition clause in the set. Because it's done solely by order, this means that you can't just name a few partitions here -- you must specify a partition clause for each table partition. If your table has 100 partitions, you must put 100 partition clauses here.

Each partition clause can optionally have a parameters clause:

 

  create index ptestx on ptest(text)
    indextype is ctxsys.context
    local
    (partition larry, 
     partition moe parameters ('memory 5m'),
     partition curly)
    parameters ('filter ctxsys.inso_filter')

These partition-level parameters clauses can be used to specify index memory and storage attributes on a per-partition basis. We'll discuss the difference between the index-level and partition-level parameters clauses below.

Limitations

There are a few limitations to be aware of when creating local partitioned ConText indexes:

  • Base table must be RANGE partitioned
    Other partitioning methods, such as HASH, COMPOSITE, or LIST, are not supported for local indexes. You can still create a global index on tables partitioned by these methods, though.
  • ConText indextype only
    In this version, only the ConText indextype supports local partitioning. You will just get an error message if you try to create a local CTXCAT or CTXRULE index.
  • Index name must be 21 bytes or less
    Due to the table naming scheme used for local partitioned indexes (more on this below), the names of local partitioned indexes cannot exceed 21 bytes in length.
  • Maximum number of partitions is 9999
    Again due to the table naming scheme, the index (and therefore the table) cannot have more than 9999 partitions.

Local Partitioned Indexing Process

Now let's take a closer look at what happens under the covers when a local partitioned index is created:

 

  1. Create the index global meta-data
  2. For each partition in the table, do the following:
    1. create the index partition meta-data
    2. create the index partition internal tables
    3. populate the index data
    4. create the $X b-tree

First, the index global meta-data is created. Like a regular ConText index, it parses the index-level parameters clause, determines the preferences to use for all classes, then deep-copies the preference values to index meta-data.

In this version, the index meta-data is shared by all index partitions. This means that you cannot, for instance, specify a Japanese lexer for partition 1 and a Basic lexer for partition 2. The documents in the table are assumed to be partitioned by contiguous numeric qualities like date or id, rather than by document type. The existing heterogeneous data support (charset, format, and language column) can be used to work around this limitation.

Since all the index partitions share common meta-data, the partition parameters clause does not allow anything except MEMORY and STORAGE -- use the index-level parameters clause to specify everything else.

Once the index meta-data is created, we iterate over the table partitions. (Parallel index creation is covered below) For each table partition, an index partition is created. Creating an index partition is not that different from creating a regular ConText index. There are four stages: meta-data, internal table creation, population, and b-tree index.

The partition meta-data is usually just a single row in dr$index_partition, visible through the view CTX_USER_INDEX_PARTITIONS, which has the columns:

 

IXP_ID
This is a numeric ID for index partition. It is Text-assigned, so does not correspond to the index partition object id. The value ranges from 1 to 9999, and is unique within an index. The values are designed to be contiguous, so as index partitions are dropped and added, values will get re-used. Because most partition operations take in an index partition name, you generally won't care about the ixp_id with one exception -- the ixp_id makes up part of the internal tablenames.

 

IXP_INDEX_OWNER
IXP_INDEX_NAME
The owner and name of the index to which this index partition belongs.

 

IXP_INDEX_PARTITION_NAME
The index partition name. This is the same as the partition name in user_ind_partitions. Use this name to identify the index partition in all index partition operations.

 

IXP_TABLE_OWNER
IXP_TABLE_NAME
IXP_TABLE_PARTITION_NAME
The base table partition for this index.

 

IXP_DOCID_COUNT
IXP_STATUS
Like idx_docid_count and idx_status for normal text indexes, only now these apply on a partition level, so the values for these columns in CTX_USER_INDEXES can be ignored.
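For example, to check the status and document count of each partition of an index named PTESTX:

  select ixp_index_partition_name, ixp_docid_count, ixp_status
    from ctx_user_index_partitions
   where ixp_index_name = 'PTESTX'
   order by ixp_id;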

The second step in creating the index partition is internal table creation. Each index partition gets its own set of DR$ tables. The naming scheme is somewhat different -- DR#<indexname><ixp_id><suffix>. For instance, index partition p1 from our example above will have a $I table named DR#PTESTX0001$I. The # instead of $ avoids collisions with global ConText index tablenames. The addition of ixp_id is the reason why local partitioned index names are limited to 21 bytes instead of the 25 bytes for global indexes, and the four-digit decimal representation of the index partition id imposes the limit of 9999 index partitions.

The tablespace and storage for these internal tables are determined mainly by the storage preference specified in the index-level parameters clause. However, you can override this by using the partition parameters clause. Modifying our example above:

 

  create index ptestx on ptest(text)
    indextype is ctxsys.context
    local
    (partition larry, 
     partition moe parameters ('storage p2_storage'),
     partition curly)
    parameters ('storage global_storage')

The internal tables for larry and curly will be created using the global_storage preference, while moe will be created using p2_storage. Specifying a per-partition storage preference uses a little more index meta-data (stored in table dr$index_values).

The third step in creating the index partition is population. This is the real work -- reading the documents, inverting them, and dumping the index data to the $I table. This is the same as a global index population, except that the datastore will use a partition-extended tablename; we are indexing only a particular partition of the base table here.

Index memory size determination works like storage preference determination. First, we check if MEMORY was specified in the partition-level parameters clause. If not, we then check MEMORY in the index-level parameters clause. If not there, then we take system parameter DEFAULT_INDEX_MEMORY.

The fourth and final stage of index partition creation is the creation of the $X index on the $I table. The tablespace and storage for this index comes from the same storage preference used for internal table creation.

After these four stages are complete, it goes to the next table partition and starts all over. When all table partitions have been indexed, the index creation completes, and you can do contains queries.

Parallel Local Partitioned Index Creation

Normally the index partitions are created serially -- one at a time. More powerful systems may be able to create the index faster by creating the partitions in parallel (inter-partition parallelism). This is not as simple as adding a PARALLEL clause to the create index statement, principally because the PARALLEL clause is currently ignored for local partitioned text indexes. However, even if it were supported, it wouldn't create the partitions in parallel. With local b-tree indexes, specifying a PARALLEL degree still creates the index partitions serially, though each partition is parallelized (intra-partition parallelism).

The strategy for creating a local partitioned index with inter-partition parallelism is:

 

  1. create an empty index
  2. rebuild the index partitions in parallel

Normally, step 1 is accomplished using the UNUSABLE keyword on create index. Unfortunately, this is not supported for domain indexes -- you'll just get syntax errors. Instead, you can use the Text way of creating an empty index -- the NOPOPULATE keyword:

 

  create index ptestx on ptest(text)
    indextype is ctxsys.context
    local
    parameters ('NOPOPULATE');

This will create the index meta-data, then create partition meta-data, tables, and $X index for each partition. The partitions are processed serially, but since we're not doing index population, this should take only a few seconds per partition.

For step 2, rebuilding the index partitions in parallel, the database has a nifty utility -- DBMS_PCLXUTIL -- which uses DBMS_JOB to kick off multiple sessions which rebuild the index partitions. Unfortunately, this does not work with domain indexes, because the jobs that it submits include the PARALLEL clause, which is not supported for domain indexes, and raises a syntax error.

Instead, you have to manually submit the rebuild jobs. The easiest way to do this is to simply open multiple SQL*Plus sessions and, in each one, run scripts that have statements like:

 

  alter index ptestx rebuild partition p1;

Issuing alter index rebuild partition like this will drop the internal tables, re-create them, do the indexing, then re-create the $X index. So, there's a little extra work involved because we've created the tables already during create index, but it's not that wasteful. You can specify an index memory size like this:

 

  alter index ptestx rebuild partition p1 parameters ('replace memory 50M');
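If you would rather not drive several SQL*Plus windows by hand, one possible approach -- just a sketch, not a supported utility, and workable only because the submitted statements avoid the PARALLEL clause that trips up DBMS_PCLXUTIL -- is to submit the rebuilds yourself through DBMS_JOB (job_queue_processes must be high enough for the jobs to run concurrently):

  declare
    jid binary_integer;
  begin
    for p in (select partition_name
                from user_ind_partitions
               where index_name = 'PTESTX') loop
      dbms_job.submit(jid,
        'execute immediate ''alter index ptestx rebuild partition ' ||
        p.partition_name || ''';');
    end loop;
    commit;   -- the jobs only start once the submits are committed
  end;
  /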

We are working with extensible framework development to simplify parallel local domain index creation for 9i Release 2.


Local Partitioned Index

Query Considerations

Query Syntax

There is no difference between local partitioned text index queries and global text index queries -- the syntax is the same. However, local partitioning can change the performance characteristics of certain types of queries. We'll examine this in this section using the following test table:

 

  create table ptest(a number not null, b varchar2(128))
  partition by range(a)
  (
    partition p1  values less than (10000),
    partition p2  values less than (20000),
    partition p3  values less than (30000),
    partition p4  values less than (40000),
    partition p5  values less than (50000),
    partition p6  values less than (60000),
    partition p7  values less than (70000),
    partition p8  values less than (80000),
    partition p9  values less than (90000),
    partition p10 values less than (100000)
  );

populated with synthetic data. There are 100,000 rows. Every row has the token EVRY. TM07 occurs every tenth row, between 1 and 7 times within the row (to give a score range). HSPC occurs every 100 rows, and KSPC occurs every 1000 rows.

We will be comparing two runs -- one with a global ConText index on this table, and one with a local partitioned index. The timing numbers given are for comparison against each other only, and do not represent tuned, publishable benchmark numbers.

SCORE

The contains score is calculated from several statistics, including number of documents containing the query token, and number of documents in the corpus. In a local partitioned index, each partition is independent, so these numbers are partition-scope. That is, it uses the number of documents in the partition that contain the token, and the number of documents in the table partition. As a result, the SCORE values for any given row that you get from a query on a local partitioned index may be different than if the index were a global index. However, we do not believe that the partition-scope of the score will have noticeable adverse effects for most applications.

Simple Text Queries

Each partition of a local partitioned ConText index is an almost-independent index. If many index partitions have to be scanned to fulfill a query, local index query time can be much slower than global index query time on the same data, because essentially you are scanning many indexes.

A plain text query like this:

 

  select * from ptest where contains(b, 'HSPC')>0;

with a global index shows a single domain index scan:

 

  1.1 DOMAIN INDEX  PTESTX

The local partitioned index, on the other hand, includes a partition iterator:

 

  1.1 PARTITION RANGE ALL
    2.1 DOMAIN INDEX  PTESTX

This shows that the query plan is looping over the index partitions. For each partition, it is doing a separate domain index scan.

Each domain index scan incurs some fixed amount of overhead, plus a query time which is proportional to the amount of indexed data and number of hits. The partitions of the local index have much less data than the global index, so that portion of the query time is lessened. However, the overhead is much more because the same fixed overhead is incurred once per partition. This slowdown increases as the number of partitions increases.

The following table illustrates this effect. We ran successive tests with the same data at different levels of partitioning -- for instance, the first test partitioned the 100,000 rows into 10 partitions of 10,000 rows, the second test partitioned the 100,000 rows into 50 partitions of 2,000 rows, etc. up to 1000 partitions. We created a local ConText index each time, and ran the query above, for HSPC. HSPC occurs every 100th row, so there's 1000 hits overall, and a minimum of 1 hit per partition. Times given below are in seconds (minutes:seconds for the longer runs) and are for selecting all 1000 rows.

 

                       # of partitions in ptest
                   10      50     100     250      500     1000
  global index   1.14    1.11    1.14    1.11     1.12     1.18
  local index    1.34   11.37   21.23   50.42  2:25.48  5:01.82

As the number of partitions increases, the overhead for scanning all those index partitions swamps the query time -- at 1000 partitions, the local index is about 250 times as slow.

Local partitioning may not be looking too good so far. Keep in mind, though, that this is the worst case scenario -- where local partitioning is weakest. After all, it makes no sense to split up your index unless you're going to leverage that split -- and this is what we turn to next.

Also, these times are for selecting all 1000 hits. The runtimes would be faster if you were interested in only the first 10 hits, for instance. In the 50 partition case, for instance, the query can get 10 hits from the first partition alone, and would stop without scanning the other 49 partitions. This would yield times very close to global index.

Partition Pruning

We've seen that local partitioned indexes have the disadvantage when all the partitions have to be scanned; the flip side of this is that they have a distinct advantage when you only have to scan a handful of partitions.

The query class that most obviously benefits from local partitioning is mixed text and structured, with structured condition on the partition key. Finding news articles from this week that contain some phrase, for instance. With global indexes, mixed queries work by getting all the text hits, then, for each hit, evaluating the structured condition. If most of the text hits don't match the structured condition, ConText does a lot of wasted work.

With local indexes, performance can be improved using partition pruning. Partition pruning acts as a first-level filter, avoiding query on those partitions that will never meet the structured criteria -- this results in a higher text hit to structured hit ratio, and less wasted work in ConText. For instance:

 

  select /*+ FIRST_ROWS(50) */ 12345 
    from ptest
   where contains(b, 'EVRY') > 0
     and a between 90000 and 100000
     and rownum < 50;

With a global ConText index, the query plan is:

 

  1.1 COUNT STOPKEY
    2.1 TABLE ACCESS BY GLOBAL INDEX ROWID PTEST
      3.1 DOMAIN INDEX  PTESTX

The ConText index produces a list of rowids in the table which contain the token EVRY. Since every row contains this token, this results in 100,000 rowids, in docid order. For each of those rowids, the kernel does a table access by rowid to get the value of column A. In this testcase, the rowids returned by the ConText index happen to come back in ascending A order. As a result, the ConText index wastes time producing 90,000 rowids that are just thrown away by the structured condition filter. (This is only for a domain index scan. This is not true of a functional invocation, though that has its own problems, discussed in "Sort by Partition Key" below.)

With a local partitioned ConText index, the query plan does not look much different from the global index:

 

  1.1 COUNT STOPKEY
    2.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
      3.1 DOMAIN INDEX  PTESTX

And, in fact, it works pretty much the same -- ask the ConText index for a list of rowids containing EVRY, then for each rowid, do a table access to get the value of A, and ensure that the value is between 90,000 and 100,000.

The difference in this case is that partition pruning has occurred. Since we have a local partitioned index, instead of one big index, we have 10 small indexes. The optimizer is smart enough to know that 9 of them will never get any hits (because of the structured condition on A, the partition key) so it only asks p10. All of the rowids produced by p10 fulfill the structured condition, so none are thrown away.

Time for global index: 4.07. Time for local partitioned index: 0.09

The example above prunes to a single index partition, but this technique can still be useful with multiple partitions. The query plan simply iterates over the selected index partitions. For instance:

 

  select /*+ FIRST_ROWS(50) */ 12345 
    from ptest
   where contains(b, 'EVRYS') > 0
     and a between 80000 and 100000
     and rownum < 50;

Involves two partitions, using the query plan:

 

1.1 COUNT STOPKEY
  2.1 PARTITION RANGE ITERATOR 
    3.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
      4.1 DOMAIN INDEX  PTESTX

EVRYS is a special token just for this test, which occurs in every row from 1-79999, and then every 400th row after that. This lopsided distribution is to force the partition query to hit two partitions.

Time for global index: 3.59. Time for local partitioned index: 0.20

We are scanning multiple partitions, so there's increased overhead, as the section "Simple Text Queries" above illustrated. The query takes about twice as long as the pruned-to-single-partition query above (which took 0.09). However, scanning two index partitions is still a lot better than producing all the text hits and filtering out the ones that don't match the range.

True partition pruning happens when the query plan is built, so the structured condition must use literal values. If the structured condition uses bind variables, the optimizer will be forced to produce a plan that iterates over all index partitions. However, our informal experiments showed that at execution time there was no performance difference between the pruned query plan and the iterator query plan. Evidently, the SQL execution engine's partition iterator recognizes that certain partitions will not yield hits, and skips over them without opening an index scan. So, using bind variables should not be cause for alarm. If you want to be sure, you can use literal values, or use partition-extended tablenames in your select.

Score Sorting

Score-sorted queries are generally slower on local partitioned indexes than on global indexes. The need to scan multiple partitions vs. a single index makes it slower just to produce the results, as we saw above in "Simple Text Queries", but the major factor is the sorting difference.

With a global index, ConText can "push down" the sort condition and return the rows in sorted order:

 

  select /*+ FIRST_ROWS(50) */ 12345 
    from ptest
   where contains(b, 'TM07', 1)>0
     and rownum < 50
   order by score(1) desc;

 

  1.1 COUNT STOPKEY
    2.1 TABLE ACCESS BY GLOBAL INDEX ROWID PTEST
      3.1 DOMAIN INDEX  PTESTX

The contains produces the 10,000 rows containing TM07, sorts them internally, then returns them in order.

A local partitioned index, on the other hand, is split up, so we cannot produce a list of all the hits and sort internally. Instead, the local index scans every partition individually, then does a regular SORT ORDER BY on the amassed results:

 

  select * from (
    select /*+ FIRST_ROWS(50) */ 12345 from ptest
     where contains(b, 'TM07', 1)>0
     order by score(1) desc
  ) where rownum < 50

 

  1.1 COUNT STOPKEY
    2.1 VIEW
      3.1 SORT ORDER BY STOPKEY
        4.1 PARTITION RANGE ALL
          5.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
            6.1 DOMAIN INDEX  PTESTX

The SORT ORDER BY makes the query much slower. The global index query runs in 1.19 sec, while the local index takes 7.87 seconds. Also, the query syntax has to be changed to incorporate a view, because of the different semantics introduced by using a sort instead of internal ordering.

Because of this effect, if your application does a lot of text-only, score-sorted queries, local indexes are not a good choice. If you have a score-sorted mixed query that can do partition pruning, however, local indexes can offer performance advantages even with the sort order by. For instance, taking the same TM07 text query, but adding a two-partition condition:

 

  select /*+ FIRST_ROWS(50) */ 12345 
    from ptest
   where contains(b, 'TM07', 1)>0
     and a between 80000 and 100000
     and rownum < 50
   order by score(1) desc;

The global index does a lot of wasted work producing and sorting rows that get filtered out by the text condition. The global index here takes 3.61 sec, while the local index finishes in only 1.31 sec, thanks to partition pruning.

Even better, if your structured condition prunes down to a single partition, then the score sorting WILL be pushed down:

  select /*+ FIRST_ROWS(50) */ 12345 
    from ptest
   where contains(b, 'TM07', 1)>0
     and a between 90000 and 100000
     and rownum < 50
   order by score(1) desc;

 

  1.1 COUNT STOPKEY
    2.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
      3.1 DOMAIN INDEX PTESTX

Note the absence of the SORT ORDER BY. For this query you get the benefits of both partition elimination and ConText score sorting. The time for the global index is 3.51 seconds, while the local index takes only 0.56 seconds. Again, the order by score pushdown occurs only when you prune to a single partition.

Sort by Partition Key

Another situation where local partitioning can improve performance is sort by partition key. Selecting news articles that contain some phrase, with most recent articles first, for instance. Or, in our example:

 

  select /*+ FIRST_ROWS(50) INDEX(ptest ptestx) */ 12345 from ptest
   where contains(b, 'TM07') > 0
   order by a

We've chosen TM07 here as a medium-selectivity text query, which is the most problematic for this query class. A very selective text query produces few hits which can be sorted very quickly, while a very unselective text query produces so many hits that it's faster to do an ordered b-tree scan, invoking contains as a function for each row -- you'll only have to do a few to fill the 50-row buffer. Medium-selectivity text queries produce too many rows to sort quickly, but too few rows to drive off the b-tree. (the example sql includes an INDEX hint to force driving from the domain index scan)

With a global index, the ConText index produces all the hits, then does a SORT ORDER BY to sort them by the structured column. In this case, we have to produce and sort 10,000 rows:

 

  1.1 SORT ORDER BY
    2.1 TABLE ACCESS BY GLOBAL INDEX ROWID PTEST
      3.1 DOMAIN INDEX  PTESTX

This takes 2.34 sec for the first 50 rows.

With a local partitioned index, it is more efficient to do an ordered partition iteration. The base table is range-partitioned, so the value of A for any row in partition p1 is guaranteed to be less than the value of A for any row in partition p2. Therefore, any hits from partition p1 are guaranteed to be first in the result set when you order by A, and so it is correct to produce, sort, and return the hits from partition p1 before touching any other partition. This is an ordered partition iteration, and the query plan looks like this:

 

  1.1 PARTITION RANGE ALL
    2.1 SORT ORDER BY
      3.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
        4.1 DOMAIN INDEX  PTESTX

The SORT ORDER BY is under the partition iterator. This indicates that it is sorting on a partition level, rather than a global level. This is actually more efficient, because it is faster to sort smaller batches (sort is O(n lg n) if you can recall those dimly-remembered computer science lectures), but the primary savings in this case are because we can saturate our 50-hit buffer with the results from the first partition alone. This query takes only 0.20 sec for the first 50 hits, something like 1200% faster than global's 2.34 sec.

On the other hand, with very selective queries, you may have to hit multiple partitions, and we have already seen how this can put the local index at a disadvantage. Changing the query to KSPC (1/1000 selectivity) forces the partition iterator to hit 5 of 10 partitions, and its performance -- 0.34 sec -- is more than the global index's 0.30 sec. However, the ordered iteration may still be a good all-around plan -- it's much faster on the queries that are now slow, and a little slower on the queries that are already fast, so you should see an overall improvement as long as the number of partitions is not large.

One note of caution: do not use an inline view to do the top-n hit buffering. Something like this:

 

  select * from (
    select /*+ FIRST_ROWS(50) INDEX(ptest ptestx) */ 12345 from ptest
     where contains(b, 'TM07') > 0
      order by a
  ) where rownum < 50
Produces a very different query plan:

 

  1.1 COUNT STOPKEY
    2.1 VIEW
      3.1 SORT ORDER BY STOPKEY
        4.1 PARTITION RANGE ALL
          5.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
            6.1 DOMAIN INDEX  PTESTX

Note that the SORT ORDER BY is above the partition iterator. This means a global sort, and the timing for this -- 2.63 sec -- is slower than the global index. This seems to be a quirk in query plan parsing and optimization, and is currently being investigated (see bug 1800113).

Parallel Query

While local b-tree index queries can be run in parallel, parallel contains queries are not supported in this version. To illustrate, let's modify our test table with a numeric column C and its own local b-tree index:

 

  alter table ptest add (c number);
  begin for i in 1..100000 loop
    update ptest set c = mod(i, 113) where a = i;
  end loop; end;
  /
  create index ptestc on ptest(c) local;

We can use the PARALLEL_INDEX hint to force parallel execution of this query:

 

  select /*+ INDEX(ptest ptestc) PARALLEL_INDEX(ptest, ptestc, 4) */ 12345
    from ptest
   where c = 111
     and a between 20000 and 100000;

 

  1.1 PARTITION RANGE ITERATOR                   PARALLEL_COMBINED_WITH_PARENT
    2.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST  PARALLEL_TO_SERIAL
      3.1 INDEX RANGE SCAN PTESTC NON-UNIQUE     PARALLEL_COMBINED_WITH_PARENT

Unfortunately, parallel query is not supported by the extensible indexing framework in this version:

 

  select /*+ PARALLEL_INDEX(ptest, ptestx, 4) */ 12345
    from ptest
   where contains(b, 'KSPC')>0
     and a between 20000 and 100000;

 

  1.1 PARTITION RANGE ITERATOR
    2.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
      3.1 DOMAIN INDEX  PTESTX

Note the lack of any parallel comments in the plan. This capability might be added by the kernel's extensible indexing framework development in some future version.

 

Query Considerations Summary

These tests show that local partitioned indexes are much faster than global indexes as long as you are leveraging the partitioned nature of the base table, either through significant partition pruning or ordered partition iteration. Plain text queries are much slower because of the overhead incurred by scanning multiple partitions -- especially if the number of partitions is large.

You might therefore conclude that, if your application were doing plain text queries or mixed queries that do not involve the partition key, that global indexes would be much better. However, do not overlook the maintenance advantage of local partitioned indexes (discussed in the following sections). Overall application performance (including admin and indexing time) might still be better with local indexes, especially if the number of partitions is small.


Local Partitioned Index

Table Partition Operations
Partitioned Table Maintenance
Index Maintenance
Query Services
Document Services

Table Partition Operations

A major benefit of using partitioned tables is easier maintenance. To quote the concepts manual, "partitions enable data management operations like data loads, index creation, and data purges at the partition level, rather than on the entire table, resulting in significantly reduced times for these operations".

But what about the indexes? When you do partition-level DDL, a global ConText index built on the table is a monolithic entity, and knows nothing about table partitions. The only option is for the whole index to become unusable, so you have to rescan the entire table and rebuild the index from scratch -- even though most of the data has not changed.

A local partitioned index, on the other hand, can take more limited and appropriate action. If you truncate a table partition, for instance, the index simply truncates the corresponding index partition's internal tables automatically -- no extra rebuild time required. At worst, just the affected index partition is marked unusable. That partition must be rebuilt (which takes less time than a full rebuild because it has just a portion of the data) but the other index partitions are still available and can be queried. Rebuilding index partitions is covered below in the section "Rebuilding UNUSABLE Index Partitions".

We'll now examine the effects of the table partition operations individually.

RENAME PARTITION

 

  alter table ptest rename partition p5 to pnew;

This renames a table partition. It has no effect on the index. In fact, the index is not even notified when this happens. The index partition name does not change to match the table partition name. To rename an index partition, use alter index:

 

  alter index ptestx rename partition p5 to pnew; 

TRUNCATE PARTITION

 

  alter table ptest truncate partition p5;

This removes all data in a table partition. The internal tables of the affected index partition are truncated. The index partition remains valid and usable.

DROP PARTITION

 

  alter table ptest drop partition p5;

This removes the data and definition of a table partition. The corresponding index partition's meta-data and internal tables are dropped. The remaining index partitions are not affected. The index partition id's ( ixp_id in view ctx_user_index_partitions), which are normally contiguous, will have a hole. This doesn't have much effect, and the id will get reused if partitions are added (through ADD or SPLIT).

ADD PARTITION

 

  alter table ptest add partition p11 values less than (110000);

This creates a new table partition. This new table partition will be empty, because with RANGE partitioning (the only method currently supported) you can add partitions only at the end of the range, where there will be no data. The ConText index automatically creates a new index partition, creating the meta-data and internal tables. The new index partition will be valid and can be queried, although there is no data, because the underlying table partition is empty.

The name of the new index partition will be the same as the new table partition name. If that name is already used by an existing index partition, a system-generated name is assigned instead.

You cannot control the storage characteristics of the new index partition. It will use the default storage values for the index. You can override this by rebuilding the partition.

MOVE PARTITION

 

  alter table ptest move partition p5 tablespace testing;

This reconstructs the table partition's data. For example, the command above moves p5 to a new tablespace. Because ROWIDs and object ids might change because of this operation, the corresponding index partition will be marked as UNUSABLE. The internal tables will still exist and contain the same data, but the index partition cannot be queried until it is rebuilt. The index has to be rebuilt from scratch even though the data has not changed. This is because there is no way to determine the "old" rowid of each row, so it is not possible to simply rebuild the $R and $K tables.

SPLIT PARTITION

 

  alter table ptest split partition p5 at (45000)
    into (partition p5_1, partition p5_2);

This takes an existing partition and splits it into two adjacent partitions. The ConText index first drops the existing partition's meta-data and internal tables, then creates two new index partitions. The internal tables for these new index partitions will be created, but they will be empty.

After the operation, the space taken up by the old index partition is free, and the two new index partitions are empty and UNUSABLE -- they cannot be queried until they are rebuilt. (see Rebuilding UNUSABLE Index Partitions, below)

The two new partitions will have the same name as their table partitions. If an existing index partition already has that name, a system-generated name is assigned instead.

You cannot control the storage characteristics of the two new index partitions. They will use the default storage values for the index. You can override this when rebuilding the partitions.

MERGE PARTITION

 

  alter table ptest merge partitions p2, p3
    into partition p2and3;
This takes two neighbor partitions and merges them into a single partition. It's the opposite of SPLIT. The local ConText index drops the meta-data and internal tables for the two old index partitions, then creates a new index partition, with empty internal tables. The new index partition is UNUSABLE and cannot be queried until it is rebuilt. (see Rebuilding UNUSABLE Index Partitions, below)

The new index partition will have the same name as the table partition. If an existing index partition already has that name, a system-generated name is assigned instead.

You cannot control the storage characteristics of the new index partition. It will use the default storage values for the index. You can override this when rebuilding the partition.

EXCHANGE PARTITION

 

  alter table ptest exchange partition p7 with table p7_ext;

This is the most involved of the partition operations. It takes a specified table partition and a non-partitioned table and swaps their data. If you do not specify INCLUDING INDEXES, as above, or you explicitly exclude indexes:

 

  alter table ptest exchange partition p7 with table p7_ext
    excluding indexes;

then the corresponding index partition is marked as UNUSABLE and it must be rebuilt. If a ConText index exists on the non-partitioned table, then that is also marked as UNUSABLE and must be rebuilt.

If you include indexes:

 

  alter table ptest exchange partition p7 with table p7_ext
    including indexes;

Then the index partition will be automatically rebuilt. If there's a ConText index on the non-partitioned table, that will also be rebuilt. Since you will be re-creating two ConText indexes, this may take some time.

The local index and the index on the non-partitioned table are rebuilt with their existing preferences. The exchange will not synchronize or reconcile divergent preferences, so you might find that queries on the exchanged data return different results afterwards.

Rebuilding UNUSABLE Index Partitions

Certain table partition operations cause associated index partitions to become UNUSABLE. When an index partition is UNUSABLE, it cannot be queried. Queries that involve that partition -- including global queries where no partition pruning occurs -- will fail. For example:

 

  alter table ptest move partition p5 tablespace testing;
  select * from ptest where contains(b, 'KSPC')>0;
  --> gets ORA-29954: domain index partition is marked LOADING/FAILED/UNUSABLE

Partition-pruned queries will still work, though:

 

  select * from ptest partition(p4) where contains(b, 'KSPC')>0;
  --> OK

You can find which partitions are UNUSABLE by querying USER_IND_PARTITIONS:

 

  select index_name, partition_name
    from user_ind_partitions
   where status != 'USABLE'

Once the UNUSABLE partitions have been identified, you must rebuild them using alter index:

 

  alter index ptestx rebuild partition p5;

Note that multiple partitions of a given index can be queried or even rebuilt at the same time. Alternatively, you can use:

 

  alter table ptest modify partition p5 rebuild unusable local indexes;

which will go out itself and find all UNUSABLE partitions in all local indexes for partition p5, and rebuild them.
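
If many partitions or indexes are affected, you can also drive the rebuilds from USER_IND_PARTITIONS yourself. A minimal PL/SQL sketch, using the query shown above:

  begin
    for c1 in (select index_name, partition_name
                 from user_ind_partitions
                where status != 'USABLE') loop
      execute immediate 'alter index ' || c1.index_name ||
                        ' rebuild partition ' || c1.partition_name;
    end loop;
  end;
  /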

Partition rebuild is also useful to reset storage characteristics of partitions which are implicitly created through split, merge, or add; these use the default storage values for the index when first created:

 

  alter index ptestx rebuild partition p5 
    parameters ('replace storage p5_store');

The amount of index memory to use when rebuilding the index can also be specified in the parameters string, although the REPLACE keyword must be used:

 

  alter index ptestx rebuild partition p5 
    parameters ('replace memory 5m');

Concurrency and Table Partition Operations

During a table partition operation, the entire domain index is unusable. For instance, if in session 1 we do:

 

  alter table ptest move partition p5 tablespace testing;

and, while it is in progress, in session 2 we query:

 

  select * from ptest where contains(b, 'KSPC')>0;

you would see the query plan:

 

  1.1 TABLE ACCESS FULL PTEST

and the query itself would get an error:

 

  DRG-10599: column is not indexed

Even partition-specific queries that do not touch the affected partitions will fail with this error. This is a quirk of the extensible indexing framework -- local b-trees and bitmap indexes do not have this same limitation. This is currently being investigated by extensible indexing framework development.

Most table partition operations are relatively quick, so the window where queries are unavailable should be small. However, be aware of this effect during potentially long-running partition operations -- especially EXCHANGE PARTITION INCLUDING INDEXES, which will force two ConText index rebuilds.

Local Partitioned Index Maintenance

Local partitioned ConText indexes are not fundamentally different from global ConText indexes -- they still have to be sync'ed to reflect new data, and still should be optimized once in a while to defragment. The only real difference is that a local index is made up of many sets of internal tables; each set is independent, so maintenance is on a partition level.

SYNC

CTX_DDL.SYNC_INDEX has been extended to take a partition name:

 

    PROCEDURE sync_index(
      idx_name  in  varchar2 default NULL,
      memory    in  varchar2 default NULL,
      part_name in  varchar2 default NULL
   );

Actually, the memory parameter is also new, and lets you pass in the index memory size to use for the sync.

For local indexes, you must specify both an index name and index partition name in the idx_name and part_name parameters. The ConText index will then scan dr$pending for rowids covered by that index partition, and update just that index partition's internal tables.
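
For example, to sync just partition p5 of the local index ptestx (the names from the earlier examples) with 2M of index memory, a sketch:

  exec ctx_ddl.sync_index('ptestx', '2M', 'p5');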

part_name must be specified when idx_name is a local index. This means that, in this version at least, you must know which index partitions need to be sync'ed, and you must issue one sync_index call for each of those. The view CTX_USER_PENDING has been extended with a PND_PARTITION_NAME column that shows the index partition name for each row, so you can find which partitions need sync with:

 

  select distinct pnd_index_name, pnd_partition_name
    from ctx_user_pending;

We are investigating enhancing sync_index in future versions to enable it to find and sync all out-of-date partitions without user specification.
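
Until such an enhancement exists, one workaround is a small PL/SQL loop over that view; a sketch, assuming the pending rows all belong to local indexes:

  begin
    for c1 in (select distinct pnd_index_name, pnd_partition_name
                 from ctx_user_pending) loop
      ctx_ddl.sync_index(idx_name  => c1.pnd_index_name,
                         part_name => c1.pnd_partition_name);
    end loop;
  end;
  /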

Only one session can sync a given index partition at a time. However, you can sync different partitions of the same index at the same time. Just open multiple sessions -- either separate SQL*Plus windows, or, even better, use several DBMS_JOBs and multiple SNP slaves. This can improve ingest rates if data flows to multiple partitions, but it won't help something like date partitioning, where new records are all going into the same partition.

OPTIMIZE

Like sync, optimize is also now done at the partition level; so, like SYNC_INDEX, OPTIMIZE_INDEX has been extended to take a partition name:

 

    PROCEDURE optimize_index(
      idx_name  in  varchar2,
      optlevel  in  varchar2,
      maxtime   in  number    default null,
      token     in  varchar2  default null,
      part_name in  varchar2  default null
    );

For local indexes, you must specify both an index name and index partition name in the idx_name and part_name parameters. Only one session can optimize a given index partition at a time, but you can optimize different index partitions of the same index at the same time.

Each index partition records its own optimization state (it is not in a user view, but is stored in dr$index_partition), so running a time-limited FULL optimization on an index partition will not interfere with any other index partition. When FULL optimize on partition 1 runs out of time, the next FULL optimize on that partition will pick up where it left off, no matter what you do to the other index partitions.
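
For example, a time-limited FULL optimize of a single index partition might look like this (a sketch using the index and partition names from the earlier examples):

  exec ctx_ddl.optimize_index('ptestx', 'FULL', maxtime => 5, part_name => 'p5');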

Query Services

All query services (CTX_QUERY package) that require an index name -- COUNT_HITS, EXPLAIN, HFEEDBACK, and BROWSE_WORDS -- have been extended with a part_name argument. When specifying a local index, you must specify an index partition. The operation will be run against the internal tables of that partition only. So, for instance, you can count the hits in a particular partition, but to count hits in the whole table, you must call count_hits multiple times -- once per partition -- and manually add up the individual return values.

We are investigating multi-partition versions of these calls for future release. The results can be calculated only on a per-partition level, so there is a performance impact, but the main stumbling block is that, other than COUNT_HITS, merging multiple partition results is somewhat complicated.
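
Until then, a whole-table figure can be assembled manually. A minimal sketch for COUNT_HITS, summing the per-partition counts for the partitioned index from the earlier examples:

  declare
    total number := 0;
  begin
    for c1 in (select partition_name
                 from user_ind_partitions
                where index_name = 'PTESTX') loop
      total := total + ctx_query.count_hits('ptestx', 'KSPC', TRUE,
                                            c1.partition_name);
    end loop;
    dbms_output.put_line('total hits: ' || total);
  end;
  /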

Document Services

There are no changes to document services (CTX_DOC package) for local partitioned indexes. The document services operate on a per-document basis, specifying the document by rowid or primary key. The index is not even used, except to find out the base table and indexed column, and to know which indexing objects and attribute values to use. So, the document services don't really care whether the index is local or global.

 

 



New XML Features

Path Indexing

XML Path Searching

Every new version of Oracle Text/interMedia Text/ConText Option has added new structured document searching features, so we've built up a lot of functionality -- 8.1.7 has nested section search, doctype disambiguation, attribute value searching, automatic section indexing, and more. But with the industry embracing XML, demand is growing for even more sophisticated features which are beyond the capabilities of the current section architecture.

9.0.1 introduces a new section type and new query operators which support an XPath-like query language. ConText indexes with XML path searching are able to perform sophisticated section searches that were impossible in previous versions.

The Path Section Group

In order to use XML path searching, the index must be created with the new path section group:

 

  exec ctx_ddl.create_section_group('mypathgroup','PATH_SECTION_GROUP');

 

  create index myindex on mytable(doc)
    indextype is ctxsys.context
    parameters ('SECTION GROUP mypathgroup');

You can add only SPECIAL sections to a path section group; you do not have to add ZONE, FIELD, or ATTR sections, and it does not support STOP sections in this version. Most of the time, you won't even need to create your own -- just use the default one:

 

  create index myindex on mytable(doc)
    indextype is ctxsys.context
    parameters ('SECTION GROUP ctxsys.path_section_group');

The path section group is somewhat like the auto section group in that it automatically indexes all tags and attributes it encounters. For query, you can still use WITHIN, but the path section group also supports the more powerful INPATH and HASPATH operators.

 

Path Section Internals

Now we'll talk a little about how path sections are indexed, and how they differ from zone and attribute sections. You can skip this section if you're just interested in usage.

We'll illustrate the difference in index data with this simple doc:

 

  <OUTER><INNER ATTR="attrvalue">text</INNER></OUTER>

The auto section group produces the following in $I:

  TOKEN_TEXT  TOKEN_TYPE             TOKEN_INFO
  ----------  ---------------------  ----------------------
  OUTER       2 (zone section)       DOC 1 START 0 LENGTH 1
  INNER       2 (zone section)       DOC 1 START 0 LENGTH 1
  INNER@ATTR  5 (attribute section)  DOC 1 START 0 LENGTH 1
  ATTRVALUE   4 (attribute value)    DOC 1 POS 1
  TEXT        0 (plain text)         DOC 1 POS 1

A simple within query like:

 

  select id from mytable where contains(doc, 'text WITHIN inner')>0

can be fulfilled by fetching the info for word TEXT and section INNER, then looping through each word offset of TEXT, and checking to see if it is between INNER START and START + LENGTH. A simple attribute query like:

 

  attrvalue WITHIN inner@attr

(we'll just write the text query for brevity) can be fulfilled in much the same way, but using type 4 and 5 tokens instead of types 0 and 2.

This section type cannot support more complex queries. For instance, equal sections like INNER and OUTER are a problem. The query:

 

  (text WITHIN outer) WITHIN inner

matches this document. The index data records the same offsets for INNER and OUTER, so it is impossible to tell if INNER is inside OUTER or vice versa. Another problem is attribute value sensitive section search. For document:

 

  <SHIPMENT>
    <MEDIA TYPE="dvd">Ice Blue Eyes</MEDIA>
    <MEDIA TYPE="book">The Monopoly Companion</MEDIA>
  </SHIPMENT>

(We won't include the xml declaration and DOCTYPE stuff -- just pretend that they are there) If you want to find shipments that include the DVD "Ice Blue Eyes", you'd have to include both section and attribute criteria:

 

  find documents where "Ice Blue Eyes" occurs within a "MEDIA" section
    whose "TYPE" attribute has the value "dvd"

Unfortunately, attribute values and sections (types 4 and 5) are completely separate from normal text and sections (types 0 and 2). There is no way to link an attribute section to the particular zone section occurrence in which it occurs.

The path section group solves both problems by indexing the document like:

  TOKEN_TEXT  TOKEN_TYPE             TOKEN_INFO
  ----------  ---------------------  ------------------------------
  OUTER       7 (path section)       DOC 1 START 0 LENGTH 1 LEVEL 1
  INNER       7 (path section)       DOC 1 START 0 LENGTH 1 LEVEL 2
  INNER@ATTR  8 (path attr section)  DOC 1 START 0 LENGTH 1 OCC 1
  ATTRVALUE   4 (attribute value)    DOC 1 POS 1
  TEXT        0 (plain text)         DOC 1 POS 1

Zone sections and attribute sections (types 2 and 5) have been replaced with path sections and path attribute sections (types 7 and 8). Each occurrence of a path section carries a LEVEL value which indicates the nesting level of the tag. The root tag gets a level of 1, its children get a level of 2, their children get a level of 3, etc. Including level information solves the equal section problem, because we can now tell which tag is the outer tag.

Path attribute sections have an OCC part which links the attribute section occurrence to a path section occurrence. Here, INNER@ATTR has an OCC of 1 because it occurs in the first occurrence of INNER. The ability to correlate attributes and tags solves attribute value sensitive section searching.

 



New XML Features

Path Query

Querying XML Documents

The query interface is through SQL selects, so your XML queries return entire documents, not just selected parts of them. While we work on better extraction features for future release, you can explore using the new XMLType, which has extraction methods. Just use extraction methods in the select list and contains in the where clause.
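
A hedged sketch of that combination, assuming a table movies with an XMLType column doc that already has a ConText index (hypothetical names):

  select m.doc.extract('//TITLE/text()').getStringVal() title
    from movies m
   where contains(m.doc, 'dragon INPATH(//TITLE)') > 0;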

The INPATH Operator

The way path sections are indexed enables more complicated section searches, but the WITHIN operator is not expressive enough to handle them. Instead of

 

  <text query> WITHIN <section name>      (e.g. Ice WITHIN MEDIA   )

indexes with a path section group use the INPATH operator in queries:

 

  <text query> INPATH(<path expression>)  (e.g. Ice INPATH(//MEDIA) )

but it functions in much the same way, limiting the scope of the text query to certain parts of the indexed documents. The parentheses around the path expression are required. The path expression is more than just a simple section name -- it is a mini query, with a specialized query language. The next section explores the path query language in more detail.

You can still use the WITHIN operator even if you are using the path section group. There should be no difference in behavior between the path section group and the auto section group for WITHIN queries.

Path Query Language

The Text path query language is based on XPath, and we will probably continue to use XPath as a guide for future development, but it is NOT XPath. Not all the XPath operators exist in the Text path query language, for instance. Also, the Text path query language operators are case-insensitive, while XPath's are strictly lower-case. There are other semantic differences covered below. Just don't make assumptions about the path language based on XPath expectations.

Tags

When specifying tags in path queries, you must write them exactly as they appear in the document in order for them to match. There are two common mistakes to avoid.

First, tag names are case-sensitive so the query "title" does not match the tag <TITLE> or the tag <Title>. It will match only <title>.

Second, there is no namespace support in this version. Take the fragments:

 

  DOC 1    <A xmlns:ORCL="http://www.oracle.com/"><ORCL:B> 
  DOC 2    <A xmlns:ORACLE="http://www.oracle.com/"><ORACLE:B> 

<ORCL:B> in DOC 1 is the same tag as <ORACLE:B> in DOC 2, because their namespace prefixes resolve to the same URI. However, when querying for these tags, you must specify them as written in the document: "ORCL:B" to find the tag in DOC 1, and "ORACLE:B" to find it in DOC 2. "B" alone will not find either tag, nor will something like "http://www.oracle.com/:B". Future versions will probably add more sophisticated namespace support.

Top-Level Tag

The simplest INPATH query string is a single tag:

 

  perro INPATH(TITLE)

Like a WITHIN query, this query finds perro where it occurs between <TITLE> and </TITLE>. However, unlike a WITHIN query, <TITLE> must be the top-level tag. Take these two documents:

 

  DOC 1    <TITLE>Clifford El Gran Perro Colorado</TITLE>
  DOC 2    <BOOK><TITLE>Años De Perro</TITLE></BOOK>

The query

 

  perro WITHIN TITLE

will find both documents, but the INPATH query will find only document 1. It does not match document 2 because there the TITLE tag has a level of 2.

What's really happening is that no level for the query node is specified, so it uses the default context, which is always the top level for INPATH queries. You can explicitly specify the top level context with slash:

 

  perro INPATH(/TITLE)

or explicitly specify the default context using dot:

 

  perro INPATH(./TITLE)

both are equivalent to the query without the slash. All examples from here will include the top level slash for readability.

Any-Level Tag

A double slash indicates "any number of levels down". So, the query:

 

  perro INPATH(//TITLE)

is looking for perro inside a TITLE tag that occurs at the top level or any level below. In other words, this query is equivalent to:

 

  perro WITHIN TITLE

and finds both DOC 1 and DOC 2.

Ancestor/Descendant Searching

A child tag is a tag which is enclosed within another tag. For instance, in:

 

  DOC 2    <BOOK><TITLE>Años De Perro</TITLE></BOOK>

TITLE is a child of BOOK. We can find this document using the any-level tag searching, as in the previous section. But what if the corpus also contained:

 

  DOC 3    <MOVIE><TITLE>Mi vida como un perro</TITLE></MOVIE>

In order to find only books with perro in the title, we need to limit the search to title tags whose parent is a book tag:

 

  perro INPATH(/BOOK/TITLE)

Reading the path left to right, we are looking for a top-level BOOK tag with a direct-child TITLE tag, which matches only DOC 2.

The single slash is direct parentage. The query above will not find:

 

  DOC 4    <BOOK><DESCRIPTION>
             <TITLE>Años De Perro</TITLE>
           </DESCRIPTION></BOOK> 

Because here TITLE is not a direct child of BOOK. TITLE's direct parent is DESCRIPTION, whose parent is BOOK -- TITLE is a grand-child of BOOK. To find this doc, you can use the any-level slashes:

 

  perro INPATH(/BOOK//TITLE)

Reading the path left to right, we are looking for a top-level BOOK tag with some descendant TITLE tag. This query will match both DOC 2 and DOC 4. Note that this is not the same as:

 

  ((perro WITHIN TITLE) WITHIN BOOK)

First, the INPATH query restricts BOOK to the top-level. Second, equal sections are not confused. That is, the query:

 

  ((perro WITHIN BOOK) WITHIN TITLE)

would match DOC 4, but the query:

 

  perro INPATH(/TITLE//BOOK)

would not. Path sections know that TITLE is a child of BOOK, even though they occur at the same text offsets.

Finally, if you wanted to match only DOC 4 and not DOC 2 -- that is, you want to match TITLE only if it is a grandchild of BOOK, and not a child or great-grandchild, etc. -- you can use the single-level wildcard:

 

  perro INPATH(/BOOK/*/TITLE)

The * matches exactly one level, so this path query filters out DOC 2.

You can combine these ancestor/descendant elements for even more complicated queries:

 

  felis INPATH(//kingdom/*/*/order/family//genus)

Attributes

You can search within an attribute value using the syntax <tag>/@<attribute>:

 

  perro INPATH(//MOVIE/@SPANISHTITLE)

matches:

 

  DOC 5    <MOVIE SPANISHTITLE="Mi vida como un perro">My Life As A Dog</MOVIE>

and is equivalent to the query:

 

  perro WITHIN MOVIE@SPANISHTITLE

One limitation resulting from how attributes are indexed is that all attributes must specify their direct-parent tags. The following:

 

  perro INPATH(//@TITLE)
  perro INPATH(A/*/@TITLE)

are not allowed, because the tag for the title attribute is not specified:

 

  select * from doc where contains(text, 'perro INPATH(//@TITLE)')>0;
                                         *
  ERROR at line 1:
  ORA-20000: Oracle Text error:
  DRG-50951: Unable to resolve element name for attribute TITLE

Existence Test

The square brackets are used to impose a condition on a node without changing the path context. For instance, the query:

 

  monopoly INPATH(/auction[image])

is looking for monopoly inside a top-level auction tag which has an image tag as a direct child. The search for monopoly occurs within the entirety of <auction> and </auction>, and not just within <image> and </image>. This document will match:

 

  <auction>Sailing Monopoly <image src="...">pic</image></auction>

but will not match:

 

  <auction>Sailing Monopoly</auction>    

because there is no image element. The default context inside a test element is the tag to which it is applied, so

 

  monopoly INPATH(/auction[image])

is actually the same as:

 

  monopoly INPATH(/auction[./image])

You need the dot to reference the default context. Without the dot:

 

  monopoly INPATH(/auction[/image])

it would mean top-level image tag. This is not supported, and will result in a syntax error.

The existence test for image will match only if image exists and is a direct child. It does not match:

 

  <auction>Sailing Monopoly<desc><image src="...">pic</image></desc></auction>

because here image is not a direct child of auction. You can match this document using the any-level wildcard, instead:

 

  monopoly INPATH(/auction[.//image])
You can also test for attribute existence:

 

  monopoly INPATH(/auction[@reserve])

The test node can be combined with other operators for interesting searches:

 

  monopoly INPATH(/auction[.//image]/title)

The test node does not change context, so the /title applies to /auction rather than /auction//image -- this query finds auctions where monopoly occurs inside a direct-child title tag, but only if the auction has an image tag in it somewhere. For instance, this document matches:

 

  <auction>
    <title>Sailing Monopoly</title>
    <description>
       New Sailing Monopoly with custom pewter tokens from USAOpoly
       <image src="...">here is a picture</image>
    </description>
  </auction>

To test for non-existence, use the NOT operator:

 

  monopoly INPATH(/auction[not(taxinfo)])

this query looks for monopoly within an auction element that does not have a direct child taxinfo. The NOT operator is case-insensitive in our path query language. In XPath it only works in lowercase.

Attribute and Tag Value Equality

The test operator is capable of more than simple existence testing. More useful is attribute value testing, which constrains nodes by the value of their attributes. For instance, given a document like:

 

  <MOVIE>
    <TITLE LANGUAGE="German">Tiger und Dragon</TITLE>
    <TITLE LANGUAGE="French">Tigre et Dragon</TITLE>
    <TITLE LANGUAGE="Spanish">Tigre y Dragón</TITLE>
    <TITLE LANGUAGE="Mandarin">Wo hu cang long</TITLE>
    <TITLE LANGUAGE="English">Crouching Tiger, Hidden Dragon</TITLE>
  </MOVIE>

the query:

 

  dragon INPATH(//TITLE)

will search all language titles. To limit the search to just English titles, you can add an attribute value equality test:

 

  dragon INPATH(//TITLE[@LANGUAGE = "English"])

Only equality and inequality (using !=) are supported. Range searches are not supported in this version. The left-hand side must be an attribute or tag, while the right-hand side must be a literal. The query:

 

  gato INPATH(//A[@B = @C])

is not allowed, nor is something like

 

  gato INPATH(//A["dog" = "cat"])

Only string literals are allowed. Numeric literals, such as

 

  tora INPATH(//MOVIE[@PRICE = 5])

will raise a syntax error; the value must be written as a string literal instead, e.g. @PRICE = "5". This also means that numbers are not normalized -- the string-literal form of the query will not match:

 

  <MOVIE PRICE="5.0">Tora! Tora! Tora!</MOVIE>

because the string 5 is not equal to the string 5.0, although numerically they are equal.

The equality test is not strict equality -- it uses "contains-equality". Two text fragments are contains-equal if the lexer produces identical index info. Some of the significant ways that this deviates from strict equality are:

 

  • contains equality is probably case insensitive
    Unless the lexer has MIXED_CASE on, it would consider foo and FOO to be equal strings.

  • lexer transformation can make strings equal
    The lexer can be set to normalize word forms. Base letter, for instance, converts accented characters to non-accented forms. A word with u-umlaut (Müller) is contains-equal to the same word with an unaccented u (Muller) in this case.

  • lexer ignores whitespace and non-alphabetic, non-join characters
    The lexer deals in word offsets. In a phrase like WORD1 WORD2, the word offset of WORD2 is always 1 greater than the word offset of WORD1 -- it doesn't matter how many spaces or newlines there are between them. Also, any non-alphabetic, non-join character is converted to whitespace (and subsequently ignored). This can confuse names, with Chase Matthew being contains-equal to Chase, Matthew, or phrases, with fruit-plants being contains-equal to fruit, plants.

  • stopwords in the query string are considered any word wildcard
    Any stopword appearing in the query text fragment will match any single word in the document's text fragment. This means that the query Paris in the spring would be contains-equal to the document Paris: avoid during spring.

The rules for contains-equality seem complex, but they work the same way as regular text queries matching document text -- you've probably internalized these rules already. One significant difference between equality and contains, though, is that the equality test always makes sure that the number of words in the attribute value is the same as the number of words in the query string.

 

  dragon INPATH(//TITLE[@LANGUAGE = "French"])

does not match any of these fragments:

 

  <TITLE LANGUAGE="Canadian French">dragon</TITLE>
  <TITLE LANGUAGE="French Colloquial">dragon</TITLE>
  <TITLE LANGUAGE="Medieval French Colloquial">dragon</TITLE>

Although each LANGUAGE attribute value has the word French, there are extra words. These would match a contains in the attribute value, but they do not meet the "same number of words" equality criteria.

While docu-head people use a lot of attributes in their DTD's, data-heads prefer child tags. For instance, a docu-head might write:

 

  <MOVIE YEAR="2001" TITLE="Moulin Rogue">...

While a data-head would prefer:

 

  <MOVIE>
    <YEAR>2001</YEAR>
    <TITLE>Moulin Rouge</TITLE>
    ...

To match the data-head version, you can use equality testing on tag values:

 

  moulin INPATH(//MOVIE[YEAR = "2001"])

Tag value equality uses contains-equality just like attribute value testing.

Inequality is also supported in both attribute and tag value equality, using the != operator:

 

  moulin INPATH(//MOVIE[@YEAR != "2000"])
  moulin INPATH(//MOVIE[YEAR != "2000"])

Note that inequality implies existence. The queries above do not match

 

  <MOVIE>Moulin Rouge</MOVIE>

Because the MOVIE tag does not have a YEAR attribute or YEAR child element. To test for non-existence, use the NOT operator.

AND and OR Operators

You can use boolean AND and OR to combine existence or equality predicates in a test. Say you have documents like:

 

  <MOVIE>
    <TITLE>Big Trouble in Little China</TITLE>
    <ACTORS>
      <ACTOR>Kurt Russell</ACTOR>
      <ACTOR>Kim Cattrall</ACTOR>
    </ACTORS>
    <DVD>2 DISCS</DVD>
  </MOVIE>

and you want to find movies with china in the title starring Kurt Russell and Kim Cattrall that are available on DVD:

 

  china INPATH(/MOVIE[DVD and 
               .//ACTOR = "Kurt Russell" and .//ACTOR = "Kim Cattrall"]/TITLE)

You can use parentheses for precedence:

 

  blue INPATH(/video[DVD and (discount or @rating = "4")]) 

AND and OR are case-insensitive in our path query language. In XPath they must be lowercase.

Nested INPATH

Nested INPATH operators are allowed, but the two are independent -- the default context of an INPATH is always top level. For instance:

 

  (perro INPATH(A)) INPATH(B)

will never hit any documents, because both INPATH's are looking for top-level tags, and, except for invalid documents, a document cannot have two different top-level tags.

The HASPATH Operator

The HASPATH operator is not a path query language operator; it's a ConText query language operator like INPATH. INPATH is used when you want to search for a text query within a path. HASPATH is used when all you want to do is test for path existence; it takes a path as its only argument, and returns 100 for a document if the path exists, 0 otherwise.

 

  select id from documents
   where contains(doc, 'HASPATH(/movie/dvd)')>0;

will return all documents where the top-level tag is a movie element which has a dvd element as a direct child.

HASPATH can also do tag value equality tests:

 

  HASPATH(//A = "dog")

Attribute value equality tests and AND and OR operators are not currently supported. You can use the ConText query language AND and OR, with multiple HASPATHs to achieve the same effect. Instead of:

 

  HASPATH(A and B)

write:

 

  HASPATH(A) and HASPATH(B)

HASPATH can return false hits when there are empty sections. Path sections are recorded with level information, but not true parentage. As a result, a document like:

 

  <A>
   <B>
    <C></C>
   </B>
   <D>
    <E></E>
   </D>
  </A>

is matched by the query:

 

  HASPATH(//B/E)

Since we do not have real parent information, we cannot detect that E is not the child of B. The index tells us only that E and B surround the same word offsets, and that E is a third-level tag and B is a second-level tag. Normally this indicates that E is a child of B. In this boundary case it does not. This limitation only applies to empty sections like this -- any words in the document would ensure correct behavior.

 



 

New XML Features

Highlighting
XMLType Indexing

Highlighting

Highlighting with the INPATH and HASPATH operators is not supported in this version. You can still highlight and markup regular words, and WITHIN queries, but use of the path operators will result in an error message. We are working on support for a future release.

XMLType Indexing

Oracle 9i introduces a new datatype for storing XML -- the XMLType. This is a core database feature, and you can find out about the type and its usage in the XML features manual.

You can create a ConText index on this type, but you need a few database privileges first:

1. the user creating the index must have query rewrite:

 

       grant query rewrite to <user>

Without this privilege, the create index will fail with:

 

       ORA-01031: insufficient privileges

<user> should be the user creating the index. The schema that owns the index (if different) does not need the grant.

2. query_rewrite_enabled should be true, and query_rewrite_integrity should be trusted. You can add them to the init.ora:

 

 
       query_rewrite_enabled=true
       query_rewrite_integrity=trusted

or turn it on for the session:

 

       alter session set query_rewrite_enabled=true;
       alter session set query_rewrite_integrity=trusted;

Without these, queries will fail with:

 

       DRG-10599: column is not indexed

These privileges are needed because under the covers a ConText index on an XMLType column is actually a function-based index on the getclobval() method of the type. These are the standard grants you need to use function-based indexes, as covered in the general Oracle documentation. However, unlike function-based b-tree indexes, you do not need to calculate statistics.

When an XMLType column is detected, and no section group is specified in the parameters string, the default system examines the new system parameter DEFAULT_XML_SECTION, and uses the section group specified there. At install time this system parameter is set to CTXSYS.PATH_SECTION_GROUP, which is the default path sectioner. The default filter system parameter for XMLType is DEFAULT_FILTER_TEXT, which probably means that the INSO filter is not engaged by default.
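
If you prefer a different default for XMLType columns, the system parameter can be changed as CTXSYS -- a sketch, substituting whichever section group you want:

  exec ctx_adm.set_parameter('DEFAULT_XML_SECTION', 'CTXSYS.AUTO_SECTION_GROUP');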

Other than the database privileges and the special default section group system parameter, indexes on XMLType columns work like any other ConText index.

Here is a simple example:

 

  connect ctxsys/ctxsys
  grant query rewrite to xtest;

  connect xtest/xtest
  create table xtest(doc sys.xmltype);
  insert into xtest values (sys.xmltype.createxml('<A>simple</A>'));

  create index xtestx on xtest(doc)
    indextype is ctxsys.context;

  alter session set query_rewrite_enabled = true;
  alter session set query_rewrite_integrity = trusted;

  select a.doc.getclobval() from xtest a
    where contains(doc, 'simple INPATH(A)')>0;

 

 



The CTXRULE Indextype

The CTXRULE Indextype

ctxrule is a new indextype designed for classification, which is the inverse of information retrieval. In traditional IR applications, for which the ConText indextype is designed, you index a set of documents and find documents with a text query; in classification, you index a set of queries, and find queries with a document. This inversion enables a new kind of application.

Take news, for example. New articles come in off the wire all the time. Say you want to get emailed any new news articles that mention Oracle. You could do something like this:

 

  1. set up a text table and a text index on the table
  2. as documents come off the wire, insert them into the text table
  3. periodically sync
  4. periodically run a query like this:
      select article_id from articles
       where contains(article_text, 'oracle')>0
         and article_date > :last_time_I_ran_this_query
    
  5. email the results

There are several problems with this approach. First, there can be a lot of lag time between when a document comes in and when it gets emailed out. You have to first wait for the periodic sync to index the document, then wait for the periodic query to run and find it. More importantly, though, this process is not scalable as the number of stored queries increases -- the system spends a lot of time running queries that don't hit any new documents.

With ctxrule, the system looks like this:

 

  1. set up a query table and a ctxrule index on the queries:
      create table queries (
        username      varchar2(10),
        query_string  varchar2(80)
      );
    
      insert into queries values ('JSMITH', 'Oracle');
    
      create index queryx on queries(query_string)
        indextype is ctxsys.ctxrule;
    
  2. as documents come off the wire, run a query like this:
      select username from queries
       where matches(query_string, :article_text)>0
    
  3. email the article to each matched username

Instead of indexing the article text and periodically running stored queries, index the query strings, and use incoming document text to query the queries. The lag problem gets solved because we don't have to wait for a sync and for a periodic query -- the classification gets done in one step instead, and as part of a query. The scalability problem gets solved because you don't have to run every query all the time -- the index on the queries allows you to find only the ones that match the given document.

Now that you know what ctxrule is -- a classification engine that takes a given document and matches it against stored queries -- you need to know what it isn't.

 

  • ctxrule does not do automatic classification. In some systems, the engine works with just documents -- each document is analyzed and classified in some static hierarchy, like a library card catalog. With ctxrule, you can't start with nothing -- you need to provide the hierarchy, and the rules that govern classification before processing documents. The existing theme extraction features can be used for automatic classification, though.

  • ctxrule is not a clustering engine. Some classification systems work by examining a corpus of documents, and arranging them into groups of related documents based on discovered common traits. This is not what ctxrule does. ctxrule operates on single documents, and matches them against queries.

  • ctxrule is not a learning system. In some systems, you collect documents into groups. The system routes new documents to these groups, and you train the system by approving or rejecting its guesses. This is not how ctxrule works. The set of ctxrule queries are user-specified. It does not change those rules over time or have a feedback process.

  • ctxrule is not psychic. The indexed queries must be explicit and concrete, such as words or phrases. You can't use fuzzy criteria like "Financial News" or "Funny Stories". However, you can use ABOUT and theme searching, which may provide some of this functionality.

  • ctxrule query sets are not hierarchies. You can have a table of queries that is arranged in some kind of Yahoo-like hierarchy, but ctxrule sees it as a table and will treat each row as equal and independent. It will not follow hierarchical rules like parent-child uniqueness.

Even if ctxrule is not a bleeding-edge, fancy learning system, it's still an interesting indextype that enables new kinds of text applications. We'll now examine the index and its use in the following sections.

Creating a CTXRULE Index

To create a ctxrule index, specify the ctxrule type in the indextype clause:

 

    create index queryx on queries(query_string)
      indextype is ctxsys.ctxrule;

The supported datatypes that you can index are VARCHAR2 and CLOB. Since you are indexing stored query strings, VARCHAR2 will probably be sufficient.

The parameters clause can be used to specify indexing objects:

 

    create index queryx on queries(query_string)
      indextype is ctxsys.ctxrule
     parameters ('wordlist my_wordlist')

ctxrule supports the same parameters clause options that ConText does, but with a twist: the section group and filter preferences are inverted and applied at query time rather than at indexing time. When you specify a section group in a ConText index, it is applied to the document during indexing. In ctxrule's case, the document during indexing is a query string, and it wouldn't make sense to apply a section group. Instead, the section group is applied to the document that you use in matches, at query time, since the query document could be XML or some other structured format.

ctxrule has its own set of system parameters for defaulting some of the preferences:

 

    DEFAULT_CTXRULE_LEXER
    DEFAULT_CTXRULE_STOPLIST
    DEFAULT_CTXRULE_WORDLIST
    DEFAULT_CTXRULE_STORAGE

ctxrule does not support local partitioning in this version.

Querying with Matches

ctxrule indexes are queried using the matches operator, passing the entire document text as the query string. Assuming your document text is stored in a bind variable :doc_text, the query:

 

  select * from queries
   where matches(query_string, :doc_text)>0

will find matching queries in the queries table.

Putting it all together for a simple example:

 

    create table queries (
      query_id      number,
      query_string  varchar2(80)
    );

    insert into queries values (1, 'oracle');
    insert into queries values (2, 'larry or ellison');
    insert into queries values (3, 'oracle and text');
    insert into queries values (4, 'market share');

    create index queryx on queries(query_string)
      indextype is ctxsys.ctxrule;

    select query_id from queries
     where matches(query_string, 
                   'Oracle announced that its market share in databases 
                    increased over the last year.')>0

This query will return queries 1 (the word oracle appears in the document) and 4 (the phrase market share appears in the document), but not 2 (neither the word larry nor the word ellison appears) and not 3 (the word text does not appear in the document, so the AND condition is not satisfied).

The document text used in a matches query can be VARCHAR2 or CLOB. It does not accept BLOB input in this version, so you cannot match filtered documents directly. Instead, you must filter the binary content to CLOB using the INSO filter. Assuming the document data is in bind variable :doc_blob:

 

  declare
    doc_text clob;
  begin
    -- create a temporary CLOB to hold the document text
    dbms_lob.createtemporary(doc_text, TRUE, DBMS_LOB.SESSION);

    -- call ctx_doc.ifilter to filter the BLOB to CLOB data
    ctx_doc.ifilter(:doc_blob, doc_text);

    -- now do the matches query using the CLOB version
    for c1 in (select * from queries where matches(query_string, doc_text)>0)
    loop
      -- do what you need to do here
    end loop;

    dbms_lob.freetemporary(doc_text);
  end;

ifilter is discussed later in this document. Similarly, to match files, you need to get the text into a CLOB. If your file is text in the database character set, you can create a BFILE and load it to a CLOB using the function dbms_lob.loadfromfile, or you can use UTL_FILE to read the file into a temp CLOB locator. If your file needs INSO filtering, you can load the file into a BLOB instead, and call ctx_doc.ifilter as above.
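
For example, a sketch of matching a flat file that is already text in the database character set, using dbms_lob.loadfromfile and a hypothetical directory object DOC_DIR:

  declare
    src  bfile := bfilename('DOC_DIR', 'article.txt');
    txt  clob;
  begin
    -- load the file contents into a temporary CLOB
    dbms_lob.createtemporary(txt, TRUE, dbms_lob.session);
    dbms_lob.fileopen(src, dbms_lob.file_readonly);
    dbms_lob.loadfromfile(txt, src, dbms_lob.getlength(src));
    dbms_lob.fileclose(src);

    -- use the CLOB as the matches document
    for c1 in (select query_id from queries
                where matches(query_string, txt) > 0) loop
      null;  -- route the article for each matched query here
    end loop;

    dbms_lob.freetemporary(txt);
  end;
  /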

The Rule Language

The language used for the indexed queries is a simplified form of the ConText query language which is used in contains. The rule language supports the following:

  Operator    Example
  ---------   ----------------
  AND         dog and cat
  OR          dog or cat
  PHRASE      dog sled
  ABOUT       about(dogs)
  NEAR        dog ; cat
  STEM        $dog
  WITHIN      dog within title
  THESAURUS   SYN(dog)

The expansion operators -- about, stem, thesaurus -- are index-time snapshots. For instance, suppose that your default thesaurus looks like this:

 

  cat
    SYN kitty

when you index the query:

 

  SYN(cat)

when the create index is complete, the matches query:

 

  matches(query_string, 'kitty')>0

will hit that indexed rule. This is because at index time the thesaurus was used to expand SYN(cat) to {CAT}|{KITTY}. If the thesaurus then changes to:

 

  cat
    SYN kitty
    SYN gato

The already-indexed rule will not be affected. The matches query:

 

  matches(query_string, 'gato')>0

is not going to hit the query. In order to reflect the new thesaurus, the query has to be reindexed:

 

  update queries set query_string = query_string where query_id = 123;
  exec ctx_ddl.sync_index('queryx');

CTXRULE Mechanics

Now we'll discuss the internal details of how the ctxrule index works. If you're just interested in usage, you can skip to the next section, "Performance".

The ctxrule indexing process is:

 

     +-----------+                +----------+                +-----------+ 
     |           | query strings  |          | query string   |   query   |
     | datastore |--------------->|  lexer   |--------------->|   parser  |
     |           |                |          |<---------------|           |
     +-----------+                +----------+  parse tree    +-----------+
          ^                            |          
          |                            |                
          |column data                 |rules      
          |                            |                
          |                            V                
     +==========+                 +----------+          
     |          |                 |          |          
     | database |<----------------|  engine  |
     |          |    rules        |          |
     +==========+                 +----------+            

The datastore delivers query strings to the lexer, which passes them whole to the query parser. The parser -- in this case, a simplified version of the parser used for ConText contains queries -- creates a parse tree and sends this back to the lexer. The lexer normalizes the tokens, breaks the parse tree into rules (more on what a rule is below), and sends these to the engine. The engine builds up an inverted index of documents by rule, and stores it in the index tables.

ctxrule uses $I, $R, $K, and $N tables just like the ConText index. The major difference is in the structure of the $I table, which includes a new TOKEN_EXTRA column:

 

    TOKEN_TEXT    VARCHAR2(64)
    TOKEN_TYPE    NUMBER
    TOKEN_FIRST   NUMBER
    TOKEN_LAST    NUMBER
    TOKEN_COUNT   NUMBER
    TOKEN_INFO    BLOB
    TOKEN_EXTRA   VARCHAR2(4000)

The TOKEN_EXTRA column is not part of the $X index column list.

Instead of indexing document lists by token, ctxrule indexes query lists by rule. A rule is one of the tokens in the query string, called the filing term, stored in TOKEN_TEXT, plus conjunctive criteria stored in TOKEN_EXTRA.

The idea is that if a document has token X, a select on $I by TOKEN_TEXT = X is a good first-pass filter -- you may get some that don't match the document, but you will not miss any potentially-matching rules. Each rule returned by this select is a partial match -- if the TOKEN_EXTRA criteria can be fulfilled by the other tokens in the document, then the rule is a full match.

We'll illustrate token_extra with a simple example, using the queries:

 

  QUERY_ID  QUERY_STRING
  --------  ------------
    1       oracle
    2       larry or ellison
    3       text and oracle
    4       market share

Query 1 is a single word query. A document is a full match if it contains the word oracle. In this case, matching TOKEN_TEXT alone is sufficient, so TOKEN_EXTRA is NULL:

 

  QUERY_STRING         TOKEN_TEXT    TOKEN_EXTRA
  ----------------     ----------    -----------
  oracle               ORACLE        (null)

Query 2 is an OR query. A document is a full match if it contains the word larry or the word ellison. This can be reduced to two single-word queries, each of which has TOKEN_EXTRA NULL:

 

  QUERY_STRING         TOKEN_TEXT    TOKEN_EXTRA
  ----------------     ----------    -----------
  larry or ellison     LARRY         (null)
                       ELLISON       (null)

Query 3 is an AND query. A document must have both text and oracle to be a full match. The engine will choose one of these as the filing term, and place the other in the TOKEN_EXTRA criteria:

 

  QUERY_STRING         TOKEN_TEXT    TOKEN_EXTRA
  ----------------     ----------    -----------
  text and oracle      TEXT          {ORACLE}

Documents that contain the word text will pull this rule up as a partial match. The query engine will then examine the TOKEN_EXTRA criteria, see that it requires the presence of the word oracle, check if the document contains that word, and judge the rule a full match if so.

Query 4 is a phrase. The engine will use the first word of the phrase as the filing term, and the whole phrase as the TOKEN_EXTRA:

 

  QUERY_STRING         TOKEN_TEXT    TOKEN_EXTRA
  ----------------     ----------    -----------
  market share         MARKET        {MARKET} {SHARE}

Putting it all together, our $I table looks something like this:

 

  TOKEN_TEXT  TOKEN_EXTRA       TOKEN_INFO
  ----------  ----------------- --------------------------------------
  ORACLE      (null)            DOC 1
  LARRY       (null)            DOC 2
  ELLISON     (null)            DOC 2
  TEXT        {ORACLE}          DOC 3
  MARKET      {MARKET} {SHARE}  DOC 4

Now that we have our index let's take a look at the query process:

 

 +-----------+           +----------+ filtered   +-----------+ 
 |           | document  |          | doc text   |           |
 | SQL Query |---------->|  filter  |----------->| sectioner |---------+
 |           |           |          |            |           |--+      |
 +-----------+<-------+  +----------+            +-----------+  |      | 
                      |                                         |      |
                      |               section offsets           |      |
              results |       +---------------------------------+      |
                      |       |                                        |
                      |       V                                        |
 +==========+         |  +----------+            +-----------+         |
 |          |         +--|          |            |           |         |
 |  index   |----------->|  query   |<-----------|   lexer   |<--------+
 |  tables  | index data |  engine  |   tokens   |           |   plain
 +==========+            +----------+            +-----------+   text

The document is passed through the filter, sectioner, and lexer to tokenize it. The query engine then iterates over the unique tokens in the document. For each token, it selects rows in $I which have that token as TOKEN_TEXT. The TOKEN_EXTRA for each of those rules is evaluated. If satisfied, then the query list is added to a result buffer. The result buffer for all the tokens is merged to eliminate duplication, and returned to the query session.

Let's walk through the query:

 

  select query_id from queries
   where matches(query_string,
                 'Oracle boss Larry Ellison reported increased market share
                  of text applications')>0

First the filter, sectioner, and lexer tokenize the document:

 

  TOKEN_TEXT  TOKEN_INFO
  ----------  ----------
  ORACLE      OFF 1
  BOSS        OFF 2
  LARRY       OFF 3
  ...

Next the query engine iterates over each unique token. The first token is ORACLE, so the query engine executes:

 

  select token_extra, ...
    from dr$queryx$i
   where token_text = 'ORACLE' ...

This pulls up:

 

  TOKEN_TEXT  TOKEN_EXTRA       TOKEN_INFO
  ----------  ----------------- --------------------------------------
  ORACLE      (null)            DOC 1

the TOKEN_EXTRA is NULL, so this is a full match, and we add DOC 1 to the result buffer.

Next we move to the second token, BOSS. This pulls up no rows in $I, so we move immediately to the third token, LARRY, which results in:

 

  TOKEN_TEXT  TOKEN_EXTRA       TOKEN_INFO
  ----------  ----------------- --------------------------------------
  LARRY       (null)            DOC 2

Again TOKEN_EXTRA is NULL, so this is a full match, and we add DOC 2 to the result buffer. The buffer now contains DOC 1 and DOC 2.

On the fourth token, ELLISON, we match DOC 2 again. However, DOC 2 is already in the result buffer, and so we discard this hit as a duplicate.

When we get to MARKET, we have a non-null TOKEN_EXTRA:

 

  TOKEN_TEXT  TOKEN_EXTRA       TOKEN_INFO
  ----------  ----------------- --------------------------------------
  MARKET      {MARKET} {SHARE}  DOC 4

To verify that the phrase market share exists, the query engine accesses the occurrence lists for MARKET and SHARE, and finds a case where the word offset of SHARE is one more than that of MARKET. Since the document token information is all in memory, this is quick and requires no disk access. The TOKEN_EXTRA is fulfilled, and DOC 4 is added to the result buffer.

On reaching TEXT, the rule:

 

  TOKEN_TEXT  TOKEN_EXTRA       TOKEN_INFO
  ----------  ----------------- --------------------------------------
  TEXT        {ORACLE}          DOC 3

is returned by the select. The engine quickly checks that ORACLE exists in the in-memory token list, and adds DOC 3 to the result buffer.

Once the engine has processed all tokens in the document, the merged result list is resolved from docid to rowid (using the $R table just like ConText) and the rows returned to the query session.

Performance

The typical ctxrule query does a lot more work than your typical contains query. The document has to be filtered, sectioned, and lexed. The $I table has to be accessed many more times -- once per unique word in the document. Each TOKEN_EXTRA has to be evaluated. So the expected response time of a ctxrule query is generally longer than that of a contains query.

ctxrule query performance depends mainly on the size of the document. As the size of the document increases, there are more unique words, each of which results in a $I query. Performance is also affected by the number of unique rules indexed. As this increases, so does the number of partial matches per $I query, and the engine needs more time to evaluate all the TOKEN_EXTRA criteria. Furthermore, the number of rows returned by the select increases, so there is $R overhead plus the time to do whatever your application does with the results. However, the number of unique rules has less impact on query performance than the size of the document:

                                     Document Size
  Indexed
  Queries     143   525  1034  2108  4023  8151  16636  32933  64337
  -------    ----  ----  ----  ----  ----  ----  -----  -----  -----
      100    0.09  0.16  0.25  0.42  0.73  1.32   2.51   4.10   7.15
      500    0.09  0.17  0.26  0.46  0.80  1.46   2.84   4.73   7.57
     1000    0.10  0.18  0.28  0.51  0.84  1.55   2.84   4.58   7.87
     2000    0.11  0.20  0.30  0.56  0.99  1.72   3.27   4.80   8.51
     4000    0.16  0.22  0.34  0.71  1.25  2.15   4.28   6.20  10.20
     8000    0.13  0.28  0.44  0.87  1.68  2.83   5.78   8.04  13.96
    16000    0.15  0.33  0.62  1.17  2.30  3.77   8.34  10.40  19.16

These timings are informal times done on a workstation. They are not tuned benchmark numbers, and should be used for intra-chart comparison only. The documents and rules used are derived from real-world corpora.

When looking to improve performance, there's not much you can do about the size of your document, so your ability to decrease response time for a particular document is limited. However, you can get overall throughput improvement by partitioning and parallelism. Instead of a single session doing a matches query on each document as it comes in, round-robin the documents to separate sessions -- or even separate databases on separate machines.

CTXRULE Index Maintenance

ctxrule indexes require sync and optimize, just like a ConText index. Simply use the ctx_ddl.sync_index and ctx_ddl.optimize_index calls, passing a ctxrule index name instead of a ConText index name.

CTXRULE Query and Document Services

Document (CTX_DOC) and query (CTX_QUERY) services are not supported against ctxrule indexes.

 

 



Miscellaneous New Features

Deprecated Features

The alter index interface for SYNC and OPTIMIZE is no longer supported. We have discouraged the use of alter index for SYNC and OPTIMIZE since the PL/SQL interface for these was added in 8.1.6. Failures during sync or optimize can invalidate the index when using the alter index interface, forcing a complete index rebuild. The PL/SQL API does not carry this risk.

You can no longer create CTXCAT indexes on CLOB columns. Although existing CTXCAT indexes on CLOB columns created in 8.1.7 are still valid and can be queried, creating new ones is not allowed and will result in an error message.

CTXSRV has been deprecated since 8.1.6 -- avoid using this. Instead, use CTX_DDL.SYNC_INDEX called from a DBMS_JOB.

CTXLOAD should be used only for thesaurus import and export -- its loading and updating features are deprecated and should not be used.

BLOB_LOC/CLOB_LOC OUTPUT_TYPE in USER_DATASTORE

When the user datastore is employed with an OUTPUT_TYPE of CLOB or BLOB, the datastore creates a temporary lob, and the user datastore procedure is expected to copy the document content into it. The system of copying into a temporary lob was designed for user datastores where the document is being assembled from parts or its content is being processed before indexing.

However, some applications use the user datastore as a multiplexer, where each document is stored in exactly one of several locations (an arc relation, for you ERD nuts). For instance, you might have separate tables to store movies, images, and text so that you could record media-specific attributes:

 

  create table MOVIE (
    movieid     NUMBER,
    description CLOB,
    encoding    VARCHAR2(30),
    data        BLOB
  );

  create table IMAGE (
    imageid     NUMBER,
    annotation  CLOB,
    data        BLOB
  );

  create table DOCUMENT (
    docid       NUMBER,
    author      VARCHAR2(80),
    title       VARCHAR2(150),
    text        CLOB
  );

but still want to be able to efficiently search across all media types -- for movies, search the description, for images, search the annotation, and for documents, search the text. You could do this with the user datastore. First create a combined media table:

 

  create table MEDIA (
    mediatype   VARCHAR2(10),
    mediaid     NUMBER,
    searchcol   CHAR(1)
  );

then create a multiplexing user datastore procedure, which reads the media type of each row, and fetches the data from the appropriate location:

 

  create or replace procedure myds(rid in rowid, dataout in out nocopy clob)
  is
    mlob clob;
  begin
    for c1 in (select mediatype, mediaid from media where rowid = rid) loop
      if (c1.mediatype = 'MOVIE') then
        select description into mlob from movie where movieid = c1.mediaid;
      elsif (c1.mediatype = 'IMAGE') then
        select annotation into mlob from image where imageid = c1.mediaid;
      elsif (c1.mediatype = 'DOCUMENT') then
        select text into mlob from document where docid = c1.mediaid;
      end if;
      DBMS_LOB.COPY(dataout, mlob, dbms_lob.getlength(mlob));
    end loop;
  end;

However, in this case, that last DBMS_LOB.COPY is wasteful -- since we already have a lob locator with the data, there's no reason to spend I/O copying the exact same thing to the temp lob. It would be more efficient to simply pass this lob locator back to the indexing engine.

For this class of application, the user datastore object has been extended with two new OUTPUT_TYPE values: CLOB_LOC and BLOB_LOC. These settings tell the user datastore to by-pass temporary lob creation because the user datastore procedure will pass back a permanent lob locator instead.

To use CLOB_LOC, our example user datastore procedure needs to be modified to get rid of the dbms_lob.copy and instead pass back the lob locator in dataout:

 

  create or replace procedure myds(rid in rowid, dataout in out nocopy clob)
  is
  begin
    for c1 in (select mediatype, mediaid from media where rowid = rid) loop
      if (c1.mediatype = 'MOVIE') then
        select description into dataout from movie where movieid = c1.mediaid;
      elsif (c1.mediatype = 'IMAGE') then
        select annotation into dataout from image where imageid = c1.mediaid;
      elsif (c1.mediatype = 'DOCUMENT') then
        select text into dataout from document where docid = c1.mediaid;
      end if;
    end loop;
  end;

and, in our preference creation, set output_type to CLOB_LOC:

 

  begin
    ctx_ddl.create_preference('myds_pref','USER_DATASTORE');
    ctx_ddl.set_attribute('myds_pref', 'PROCEDURE', 'myds');
    ctx_ddl.set_attribute('myds_pref', 'OUTPUT_TYPE', 'CLOB_LOC');
  end;

Now our user datastore is passing back locators instead of doing copies, so we should see an improvement in indexing performance.

One limitation is that the output_type setting applies to all rows being indexed -- you cannot have the datastore pass in a temp locator for some rows and not pass one in for others. In our example above, we might want to append author and title information for documents, but still pass back permanent lobs for movies and images.

You can accomplish this to a limited extent in the user datastore procedure. It can create a temporary lob and store the locator in a persistent place, such as a package variable. When you need to use it, trim the temp lob, construct the document content, and set the OUT variable to the temp lob locator.
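
Here is a minimal sketch of that technique, assuming the MEDIA, MOVIE, IMAGE, and DOCUMENT tables above; the package name myds_pkg and procedure name myds3 are made up for illustration and are not part of the product. Documents get author and title prepended in a reused session-duration temporary lob, while movies and images continue to pass back their permanent locators:

 

  create or replace package myds_pkg as
    tlob clob;   -- session-duration temporary lob, reused across rows
  end myds_pkg;
  /

  create or replace procedure myds3(rid in rowid, dataout in out nocopy clob)
  is
    buf varchar2(400);
  begin
    for c1 in (select mediatype, mediaid from media where rowid = rid) loop
      if (c1.mediatype = 'DOCUMENT') then
        -- build a composite document in the reusable temp lob
        if (myds_pkg.tlob is null) then
          dbms_lob.createtemporary(myds_pkg.tlob, TRUE, dbms_lob.session);
        else
          dbms_lob.trim(myds_pkg.tlob, 0);
        end if;
        for c2 in (select author, title, text from document
                    where docid = c1.mediaid) loop
          buf := '<AUTHOR>'||c2.author||'</AUTHOR><TITLE>'||c2.title||'</TITLE>';
          dbms_lob.writeappend(myds_pkg.tlob, length(buf), buf);
          if (c2.text is not null) then
            dbms_lob.append(myds_pkg.tlob, c2.text);
          end if;
        end loop;
        dataout := myds_pkg.tlob;
      elsif (c1.mediatype = 'MOVIE') then
        select description into dataout from movie where movieid = c1.mediaid;
      elsif (c1.mediatype = 'IMAGE') then
        select annotation into dataout from image where imageid = c1.mediaid;
      end if;
    end loop;
  end;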

IGNORE Format Column Value

Some customers would like to have a single repository table for all media, including non-text data like images and sound files. When a context index is created on these rows, at best the INSO filter reduces them to empty files and you waste filtering time; at worst, there is no filtering and you end up with a lot of garbage tokens.

To deal with these, a new IGNORE value has been added to the format column. When the format column is set to IGNORE, the document content is blocked in the datastore -- its content is never accessed, and it is indexed as an empty document. It is assigned a docid and the rowid is in $R, but it has zero impact on $I because no tokens are generated, and near-zero impact on query time -- the only effect is a slightly larger $R table.

Here's IGNORE in action:

 

  create table myfiles (
    format varchar2(10),
    filen  varchar2(80)
  );

  insert into myfiles values ('TEXT',  '/doc/README');
  insert into myfiles values ('BINARY','/doc/imt901.doc');
  insert into myfiles values ('IGNORE','/doc/banner.jpg');

  create index myfilesx on myfiles(filen)
   indextype is ctxsys.context
   parameters ('datastore ctxsys.file_datastore
                   filter ctxsys.inso_filter
                   format column format');

The first document is marked as TEXT, so it bypasses INSO filtering. The second document is marked as BINARY, so it gets INSO filtered before lexing. The third document is marked as IGNORE, so it is indexed as an empty document. The file /doc/banner.jpg is not opened or accessed during the create index.

The IGNORE value is caught by the datastore interface, so all intervening layers do minimal work. If you have a user datastore or procedure filter, for instance, the procedure will not get executed at all for the IGNORE documents.

The IGNORE value is also useful for working around problem documents. If you find that your index cannot be created because of faulty documents, you can mark those documents as IGNORE, which should allow the index to be created. When the problem is resolved, you can change the value to TEXT or BINARY as appropriate, and force re-indexing by updating the indexed column to itself. In previous versions, the row had to be deleted, which is usually unacceptable because you lose data and affect other parts of the app.
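
For example, picking on the myfiles table from above (the specific rows are only illustrative), the workaround might look like:

 

  -- create index failed on this document; mark it IGNORE and retry
  update myfiles set format = 'IGNORE' where filen = '/doc/imt901.doc';
  commit;

  -- later, once the document is repaired, restore the real format value and
  -- force re-indexing by updating the indexed column to itself
  update myfiles set format = 'BINARY' where filen = '/doc/imt901.doc';
  update myfiles set filen = filen where filen = '/doc/imt901.doc';
  commit;

  exec ctx_ddl.sync_index('myfilesx');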

INSO Timeout

The INSO filter will occasionally hang or loop on certain documents. In the past, this would bring indexing to a halt. While we continually work with INSO corporation to improve filter quality, we have added a timeout so that such cases can fail individually, allowing the indexing to continue.

You can set timeout as an attribute of the inso filter:

 

  begin
    ctx_ddl.create_preference('myinso','inso_filter');
    ctx_ddl.set_attribute('myinso','timeout','600');
  end;

The unit of TIMEOUT is seconds. The default is 120, if TIMEOUT is not specified. (Note: bug 1819825 makes the default no timeout, but this will be addressed in the first 9.0.1 patchset) If filtering has not progressed within that time period, ctxhx kills itself and logs an indexing error. Those documents (which can be found by scanning ctx_user_index_errors after create index or sync) can be tested further, perhaps by manually invoking ctxhx.

The timeout interval is not an upper limit on document filtering time -- a document can take longer than the timeout interval. ctxhx checks the file size every timeout interval seconds. If the file size is growing, then ctxhx continues. If it is not growing, ctxhx kills itself. This means that documents that simply take a long time to filter can be filtered. It also means that infinite loops in INSO that continually write to the output file cannot be caught by timeout.

Do not set timeout too low. There is currently no way to override the INSO timeout on a per-document basis. If INSO could eventually finish filtering a timed-out document but simply fails to make progress within each interval, there is no way to get that document indexed. The best you can do is manually filter it to text and replace the binary contents with the text contents. We believe that documents which make no progress after 2 minutes are very rare.

CTX_DOC.IFILTER

The user datastore is often used to construct virtual structured documents by concatenating several columns of the base table. However, this technique can't really be used when one or more of those columns is binary data, such as a Word document. The binary data must be converted to text in the datastore procedure before concatenation.

Unfortunately, there is no way to do this in previous versions. The obvious attempt -- calling ctx_doc.filter inside the user datastore procedure -- is not workable. The ctx_doc calls are trusted callouts, as is indexing, and our architecture does not currently allow recursive calls. Even if supported, however, the ctx_doc functions require an index, so you would have to create a dummy index on the BLOB column just to be able to call ctx_doc.filter. This would just take up space.

So for 9i we've created a new document service CTX_DOC.IFILTER. It takes a passed-in BLOB, filters it to HTML using the INSO filter, and appends the result to the passed-in CLOB:

 

  PROCEDURE ifilter(
    data   IN            blob,
    text   IN OUT nocopy clob
  );

No index is needed, since you pass in the BLOB directly. It appends, so you can lay out your structured columns first, then call ifilter to append the BLOB's data -- or call it multiple times for multiple BLOBs. Here's an example of ctx_doc.ifilter used in a user datastore procedure:

 

  create procedure ifds(rid in rowid, doc in out nocopy clob)
  is
    buf varchar2(120) := null;
  begin
    for c1 in (select id, title, document
                 from pdfdocs
                where rowid = rid)
    loop
      buf := buf || '<ID>'||c1.id||'</ID>';
      buf := buf || '<TITLE>'||c1.title||'</TITLE>';
      buf := buf || '<DOCTEXT>';
      dbms_lob.writeappend(doc, length(buf), buf);

      ctx_doc.ifilter(c1.document, doc);

      buf := '</DOCTEXT>';
      dbms_lob.writeappend(doc, length(buf), buf);
    end loop;
  end;

The text is always HTML in this version -- a plaintext option may be added in the future. Because ctx_doc.ifilter always appends to the existing data in the text clob, make sure you pass in an empty clob or trim the clob before the first call. Unlike the in-memory doc services, ifilter does not automatically allocate a temporary lob; passing in NULL or a non-writeable lob in the text parameter will result in an error.

ifilter is implemented using a safe callout -- this means that SQL*Net must be configured for extproc and running in order to use this feature. ifilter uses the safe callout so that it can be called from a user datastore procedure, which runs as a trusted callout.

CTXHX will be invoked using the default built-in timeout of 120 seconds. There is no way to override this in this version.

ifilter explicitly calls the INSO filter. There is no way to have ifilter use the charset, procedure, or user filters.

UTF-16 Support

The INSO and charset filters currently convert documents to the database character set when needed. The conversion is done using Oracle NLS functions. Since 9i NLS improvements add UTF-16 character set support ( AL16UTF16, AL16UTF16LE), these filters get UTF-16 conversion implicitly.

However, we have also added UTF-16 endian detection. If the document character set specification is UTF16AUTO, then the filter will examine the first two bytes of the data and choose big-endian or little-endian UTF-16 conversion as appropriate. If the marker bytes are 0xFE, 0xFF, then these two bytes are stripped out and big-endian conversion is used. If the marker bytes are 0xFF, 0xFE, then these two bytes are stripped out and little-endian conversion is used. Anything else assumes big-endian conversion and leaves the first two bytes intact.

An example of using UTF-16 auto-detection:

 

  create table conv (
    format  varchar2(10),
    charset varchar2(30),
    text    varchar2(80)
  );

  insert into conv values ('TEXT', 'UTF16AUTO',    '/doc/msunicode.txt');
  insert into conv values ('TEXT', 'UTF16AUTO',    '/doc/unixunicode.txt');

  create index convx on conv(text)
    indextype is ctxsys.context
    parameters ('datastore ctxsys.file_datastore
                    filter ctxsys.inso_filter
                    format column format
                   charset column charset');

Japanese Lexer

Written Japanese is problematic for full-text indexing because words are not space-delimited; sentences are written in one long string of kana and kanji. Human readers can easily pick out the words, but this is much harder to do mechanically.

The existing JAPANESE_VGRAM_LEXER solves this problem by not indexing words at all. Instead, it splits up a string of characters into overlapping bits. A sequence of ABCD (where A, B, C, and D are kana/kanji) might be indexed as three "tokens": AB, BC, and CD.

While this allows full-text retrieval, it has some problems. Index space is wasted on "tokens" that span word boundaries, because they might never get queried -- and if they do get queried, it's a false hit. Query is slowed by the need to combine index data from multiple tokens for each word, and these tokens are somewhat unselective, because they are character sequences instead of full words.

9i includes the new JAPANESE_LEXER, which is a hybrid lexer based on the CHINESE_LEXER that uses a lexicon to segment written Japanese into coherent words, falling back on vgrams only for unrecognized sequences. The segmentation should make Japanese text query faster and more precise, while the vgramming preserves recall. Using the JAPANESE_LEXER is pretty easy, because it has no attributes to fiddle with in this version:

 

  exec ctx_ddl.create_preference('myjlexer','japanese_lexer');

  create index myidx on mytable(text)
    indextype is ctxsys.context
    parameters ('lexer myjlexer');

The lexicon is an Oracle-modified version of the EDR lexicon, licensed from the European Language Resource Association. This is not a case of poor American geography skills -- just a funny consequence of licensing issues. The lexicon is not user-modifiable in this version.

New Chinese Lexicon

Written Chinese poses the same problem as Japanese: lack of whitespace. Previous versions have two lexers for Chinese: the CHINESE_VGRAM_LEXER, for vgram indexing, and the CHINESE_LEXER, for segmentation. 9i includes an update to the lexicon that the CHINESE_LEXER uses, which should improve its accuracy.

Korean Morphological Lexer

Completing our tour of East Asia, we have a new Korean lexer. Unlike Chinese or Japanese, Korean is mostly whitespace delimited. However, like German, it has a large number of compound words. Since words change form when combined in compounds, finding word boundaries is quite difficult. Furthermore, verbs have a large number of inflectional forms -- literally thousands.

The new lexer, KOREAN_MORPH_LEXER, uses a newly-licensed technology, 'HAM'. This analyzes text a sentence at a time, then uses morphology and grammar lexicons to identify word boundaries and normalize word form. Its word boundary detection is more accurate than the existing KOREAN_LEXER. The word form normalization improves recall without having to expand query terms to all possible forms.

The KOREAN_MORPH_LEXER has a lot of attributes to play with. They control what to index and what to throw away -- for instance, NUMBER controls whether numbers are indexed or discarded. Generally the defaults are fine, so the following will work:

 

  exec ctx_ddl.create_preference('myklexer','korean_morph_lexer');

  create index myidx on mytable(text)
    indextype is ctxsys.context
    parameters ('lexer myklexer');

The KOREAN_MORPH_LEXER uses four lexicons, all in ?/ctx/data/kolx:

 

  drk2sdic.dat - system dictionary
  drk2gram.dat - grammar dictionary
  drk2udic.dat - user dictionary
  drk2xdic.dat - stopword dictionary

The system dictionary should not be modified, but the grammar, user, and stopword dictionaries can be user-edited for extension and customization. The rules for editing each one can be found in the files themselves (albeit in Korean). After modification, it is best to bounce the database or flush the shared pool to ensure that the changes get reloaded into memory.

ALL Language for Multi-Stoplist

Version 8.1.7 introduced the multi-language stoplist. One drawback to this functionality was that stopwords that appear in multiple languages had to be added multiple times. Foreign-language documents in the real world sometimes contain passages in English (we refuse to learn anything else). In order to stop THE in such cases, you had to do:

 

  ctx_ddl.add_stopword('globallist','the','French');
  ctx_ddl.add_stopword('globallist','the','German');
  ctx_ddl.add_stopword('globallist','the','Swedish');
  ...

which quickly uses up the 4096 available stopword slots.

For this case, we've added a new ALL language for stopwords:

 

  ctx_ddl.add_stopword('globallist','the','ALL');

An ALL-language stopword works for any and all document languages, while taking up just one stopword slot.

CTXCAT Wildcard Searching

Wildcard has been added to the CTXCAT query language, using asterisk. For instance:

 

  catsearch(desc, 'monopol*', '')>0

does the same thing as:

 

  contains(desc, 'monopol%')>0

Left-truncated wildcards are not supported. They are stripped off and the remainder of the pattern is expanded.

 

  catsearch(desc, '*monopol*', '')>0

is the same as the above query for monopol*. Wildcards in the middle of a token are supported:

 

  catsearch(desc, 'mon*pol*', '')>0

CTXCAT queries do not perform well when the number of tokens in queries is very large. If you expect to use CTXCAT wildcard a lot, we highly recommend also using prefix indexing; wildcard queries will be one prefix token instead of multiple normal tokens.
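
As a sketch of that combination (the item table and column here are hypothetical), prefix indexing is switched on through a BASIC_WORDLIST preference, which is then named in the ctxcat index parameters:

 

  begin
    ctx_ddl.create_preference('mywordlist','BASIC_WORDLIST');
    ctx_ddl.set_attribute('mywordlist','PREFIX_INDEX','TRUE');
    ctx_ddl.set_attribute('mywordlist','PREFIX_MIN_LENGTH','3');
    ctx_ddl.set_attribute('mywordlist','PREFIX_MAX_LENGTH','8');
  end;
  /

  create index itemsx on items(description)
    indextype is ctxsys.ctxcat
    parameters ('wordlist mywordlist');

With this in place, a query like catsearch(description, 'monopol*', '') can be satisfied from a single prefix token, at the cost of a somewhat larger index.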

Transportable Tablespace

Transportable tablespace is now supported by all three indextypes, as long as all internal objects are in the tablespace set. With transportable tablespace, you can do your indexing on a large machine, then quickly push the results to multiple query server databases for query and DML scalability. To illustrate, we'll create a table in tablespace datat, and a text index on it with internal objects in tablespace indext, controlled by the storage preference:

 

  create table transport(text varchar2(80))
    tablespace datat;

  begin
    ctx_ddl.create_preference('mystore','basic_storage');
    ctx_ddl.set_attribute('mystore','i_table_clause','tablespace indext');
    ctx_ddl.set_attribute('mystore','k_table_clause','tablespace indext');
    ctx_ddl.set_attribute('mystore','n_table_clause','tablespace indext');
    ctx_ddl.set_attribute('mystore','r_table_clause','tablespace indext');
    ctx_ddl.set_attribute('mystore','i_index_clause','tablespace indext');
  end;
  /

  create index textidx on transport(text)
   indextype is ctxsys.context
   parameters ('storage mystore');

Imagine that the table has a lot of data, and that the index takes a long time to create and populate the internal tables. Now make the two tablespaces read only:

 

  alter tablespace datat read only;
  alter tablespace indext read only;

then export the two tablespaces:

 

  exp "'/ as sysdba'" transport_tablespace=y tablespaces=datat,indext 
    file=transport.dmp

In order to do this, the text table and all secondary objects for domain indexes have to be in the tablespace set. Since our text table is in datat and all secondary objects for the text index are in indext, we meet this requirement.

Now we use an OS copy or FTP to get the datafiles for datat and indext to the target system, and import:

 

  imp "'/ as sysdba'" transport_tablespace=y 
    datafiles=/oracle/home/dbs/datat.dbf,/oracle/home/dbs/indext.dbf
    tablespaces=datat,indext file=transport.dmp

The text table and text index are created on the target database very quickly. The index does not have to be rebuilt -- all the data in the datafiles is useable, since the table data rowids are not changing. All that needs to be done is a little bit of meta-data cleanup.

User datastore procedures and preferences are not preserved by this, so if you intend to do sync or indexing on the target system, be sure to manually bring those over. Queries should be fine without them.

If you have rows in dr$pending for the text index when you do the transportable tablespace export, those ROWIDs will materialize in dr$pending during the import, so you won't miss anything. At worst you'll have to do a sync on the target system.

If you are refreshing a target system, then the tablespaces already exist. They must be dropped before importing. Normally you can do:

 

  drop tablespace datat including contents;

However, this will not work if the tablespace contains tables with domain indexes, or domain index secondary objects -- either datat or indext from our examples -- you'll get an ORA-29857 error message. You must first drop the domain indexes manually:

 

  drop index textidx;

then drop the tablespaces.

Rowid Logging

When something goes wrong during indexing -- a core dump or a hang, for instance -- you can usually work around it by removing or IGNORE-ing the faulty documents. However, in previous versions it was very difficult to find out which documents were faulty.

To make this easier, 9i has a new rowid printing event. When set, the rowid of each row will be printed to the logfile. For instance:

 

  create table logtest(a varchar2(80));
  insert into logtest values ('the quick brown fox jumps over the lazy dog');
  insert into logtest values ('Mr. Jock, TV quiz PhD, bags few lynx');
  
  exec ctx_output.start_log('mylog');
  exec ctx_output.add_event(CTX_OUTPUT.EVENT_INDEX_PRINT_ROWID);

  create index logtestx on logtest(a)
   indextype is ctxsys.context;

the log shows:

 

  16:29:00 05/14/01 populate index: CTXSYS.LOGTESTX
  16:29:00 05/14/01 Begin document indexing
  16:29:00 05/14/01 INDEXING ROWID AAABklAACAAACynAAA
  16:29:00 05/14/01 INDEXING ROWID AAABklAACAAACynAAB
  ...

The ROWID message is printed just after the datastore has selected the row from the base table, but just before it is read or processed. This means that if a problem occurs, the last ROWID printed in the log is the problem document, and should be removed.

There is currently only one event. If you have a suggestion as to other events that could be useful, let us know.

VER_CODE Column in CTX_VERSION

The view CTX_VERSION has been changed so that it now has two columns: VER_DICT, which shows the version of the CTXSYS data dictionary, and VER_CODE, which is actually a trusted callout that queries the linked-in code for the text version. This should solve the problem of determining which Text patchset is installed, although this has already been mostly solved with unified RDBMS and Text patchsets in 8.1.7.
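
A quick way to see both values:

 

  select ver_dict, ver_code from ctx_version;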

 

Chapter 5 : Oracle 9.2.0

Oracle Text 9.2.0 - Technical Overview

This is a technical overview of the improvements found in Oracle Text version 9.2.0. This is intended for an audience fairly familiar with previous versions of the product.

If you are new to the product, you should familiarize yourself with the product before reading this document. The official documentation, especially the Application Developer's Guide, is a good place to start, as are the interMedia Text technical overviews for 8.1.5, 8.1.6, 8.1.7, and 9.0.1 in order. (Someday I'll get around to creating an updated overall technical overview which merges these together)

The goal of this overview is to supplement the documentation in three ways:

 

  1. Collect all the new features -- and only the new features -- in one document so that expert users can catch up just by reading this instead of reading the whole doc and mentally diff'ing.

  2. Sketch each new feature without covering every option, bell, and whistle. Give the reader enough information to decide if the new feature merits further investigation in the main doc.

  3. Include information, observations, and notes on each feature which may not be suitable for documentation. For instance, opinions on the pros and cons of feature X, the internal mechanisms of feature Y, speculation of how feature Z could be useful, etc.

 



 




Upgrading 9.0.1 to 9.2.0

Oracle Text is integrated with the Database Migration Assistant (DBMA), so the Text upgrade should happen as part of the overall database upgrade procedure. This section details the manual upgrade process, in case something goes wrong or you choose to not use DBMA.

This process assumes that you are starting with a valid 9.0.1 ctxsys data dictionary, and that the system is not being used for index, DML, or query. Use SQL*Plus to run all scripts.

1. run ?/ctx/admin/s0902000.sql as SYS
this grants new, needed database privileges to ctxsys

2. run ?/ctx/admin/u0902000.sql as CTXSYS
this upgrades the ctxsys schema to 9.2.0

It is not necessary to run anything else. The u0902000.sql script calls other scripts to upgrade the indextypes (t0902000.sql), recompile all packages (dr0pkh.sql and dr0plb.sql), and recreate the shared library (dr0ulib.sql).

The script is not expected to encounter errors other than ORA-955 ("name is already used by an existing object", for public synonyms). After the script is complete, all ctx-created objects owned by ctxsys should be VALID.

Oracle Text also uses the new DBMS_REGISTRY facility, using the component ID CONTEXT. However, since we do not currently have a validation procedure, a return status of "VALID" from DBMS_REGISTRY.STATUS does not guarantee installation validity. The version for DBMS_REGISTRY.VERSION is generated from VER_DICT of CTX_VERSION. You might notice that this says 9.2.0.1.0 -- in the past that would have been the version for the first patchset, but this is the official and expected version for the production release.

 



Index Objects

URITYPE Indexes

The URIType and its subclasses are now indexable. As an example:

 

  create table tutype(url sys.httpuritype);
  insert into tutype values (
     sys.httpuritype.createuri('http://www.oracle.com')
  );
  create index tutypex on tutype(url) indextype is ctxsys.context;
  select * from tutype where contains(url, 'oracle')>0;

URIType indexes use DEFAULT_FILTER_TEXT and DEFAULT_SECTION_HTML. So, with the default defaults, the above index would be using the NULL_FILTER and the HTML_SECTION_GROUP.

Behind the scenes, Text is using the GETBLOB method to fetch the contents of the URI. If you run into problems creating the index, a good sanity check is something like:

 

  select dbms_lob.getlength(a.<uritypecolumn>.getblob())
    from <basetable> a;

which checks if the rows are readable outside of Text.

INSO_FILTER Attribute TIMEOUT_TYPE

Version 9.0.1 added the TIMEOUT attribute to INSO_FILTER. This checked the size of the output file every TIMEOUT seconds and killed the filter if no progress was detected. If a file filtered very slowly, the inso filter could run for a long time -- even exceeding TIMEOUT seconds, as long as the output file was continually growing in size.

The new TIMEOUT_TYPE attribute allows specification of the timeout style as HEURISTIC (old 9.0.1 behavior) or FIXED (kill after TIMEOUT seconds no matter what). The default is HEURISTIC to maintain backward compatibility. While using FIXED timeout is more predictable, it may cause some long-filtering documents to be excluded from the index.
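
For example, a preference that kills the filter unconditionally after five minutes could look like this (the preference name is arbitrary):

 

  begin
    ctx_ddl.create_preference('myinso','INSO_FILTER');
    ctx_ddl.set_attribute('myinso','TIMEOUT','300');
    ctx_ddl.set_attribute('myinso','TIMEOUT_TYPE','FIXED');
  end;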

BASIC_LEXER Attribute INDEX_STEMS

By default, stemming is a query expansion operation. At query time, it takes the input word, generates all possible forms of the word, does a query to get token_info for each form, then merges the results. This can be inefficient. To ensure correctness, the stemmer must generate all forms of the input term -- even strange or really uncommon forms. Since every form results in a query on $I, each form that does not appear in the index just wastes time. The forms which do appear in the index incur token info merging overhead.

The new INDEX_STEMS attribute of BASIC_LEXER provides a better method -- index-time stemming. During indexing, each word is stemmed to its base form. This base form is added to the index (as a type 9 token) along with the original form. For example:

 

  create table stest(text varchar2(80));
  insert into stest values ('RAN');
  insert into stest values ('RUNNING');
  commit;

We'll use the English derivational stemmer in the wordlist and lexer, and create the index:

 

  exec ctx_ddl.create_preference('stemming_lexer','basic_lexer');
  exec ctx_ddl.set_attribute('stemming_lexer','index_stems','derivational');
  exec ctx_ddl.create_preference('deriv_wordlist','basic_wordlist');
  exec ctx_ddl.set_attribute('deriv_wordlist','stemmer','derivational');
  create index stestx on stest(text) indextype is ctxsys.context
    parameters ('lexer stemming_lexer wordlist deriv_wordlist');

The $I shows the base form indexed as type 9, and the original forms as type 0:

 

  select token_text, token_type from dr$stestx$i;

  RAN         0
  RUN         9
  RUNNING     0

A query for $word is then just a matter of reducing word to its base form and returning the results for that one type 9 token. This makes stem query faster at the cost of index time and space. To prove that it is not doing the expansion here, we can remove the type 0 rows from the index:

 

  delete from dr$stestx$i where token_type != 9;
  select text from stest where contains(text, '$run')>0;

  RAN
  RUNNING

Under certain circumstances, the stem operator is forced to fall back to expansion behavior. If the stem operator output is the input to other expansion operators, such as ?($(word)) then the stemming must be done as an expansion in order to get the input to fuzzy. Another case is when the query-time stemmer does not match the index-time stemmer. For instance, if in the example above we had used the default wordlist (which uses English stemmer) then the stem query would be done through expansion for correctness.

CTX_QUERY.EXPLAIN is the best way to examine the query plan and determine if stem form or expansion is being used.

USER_LEXER

The USER_LEXER lets you write and plug in your own lexer to handle languages or other indexing techniques that Text does not support. You specify two stored functions -- one for index, one for query -- that take in text, lex into tokens, and pass back the tokens in an XML-esque document.

For our example, we'll create a very simple Hawaiian lexer. It will uppercase the input and remove single quotes (glottal stop or okina).

First, we write the index lexer procedure:

 

  create or replace procedure ul_idx_proc(
    text      in     varchar2, 
    tokens    in out nocopy varchar2,
    locneeded in     boolean
  ) is
  ...

(see Appendix B for actual code samples). Like user datastore procedures, user lexer procedures must be owned by ctxsys and the indexing user must have EXECUTE permission. For simplicity, our example will be done entirely in the ctxsys schema.

Our example is also written entirely in PL/SQL. You can write lexers using stored Java, or an external language like C or Perl, using the extproc mechanisms (see the PL/SQL User's Guide).

We've used VARCHAR2 for "text" and "tokens" for simplicity, but CLOB is also possible here for real-world implementations. You can name the parameters whatever you want -- for readability, I'll refer to them by the parameter names I've given them in this example, instead of "parameter 1".

Given input "imua i'olani", our user lexer will output in "tokens":

 

  <tokens>
    <word>IMUA</word>
    <word>IOLANI</word>
  </tokens>

The Text Reference contains the DTD and restrictions for this output format. It's XML-esque, in the sense that it is XML, but with additional restrictions. Note that because it's based on XML, those tag names are case-sensitive.

"locneeded" is TRUE when "tokens" output needs to include location data -- the character offset and length of each token. locneeded will be TRUE most of the time -- only stopword lexing does not require location data.

 

  <tokens>
    <word off="0" len="4">IMUA</word>
    <word off="5" len="7">IOLANI</word>
  </tokens>

Providing location data when locneeded is FALSE, or omitting location data when locneeded is TRUE results in a lexer error.
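
To make this concrete, here is a minimal sketch of what the body of ul_idx_proc could look like for our Hawaiian lexer. This is not the Appendix B code -- it is only an illustration, with deliberately simplistic whitespace-only word breaking, that uppercases, strips the okina, and emits the format shown above:

 

  create or replace procedure ul_idx_proc(
    text      in     varchar2,
    tokens    in out nocopy varchar2,
    locneeded in     boolean
  ) is
    pos  number := 1;                   -- current scan position (1-based)
    len  number := nvl(length(text), 0);
    spos number;                        -- start position of the current word
    wrd  varchar2(32767);               -- word as it appears in the source
    tok  varchar2(32767);               -- normalized token
  begin
    tokens := '<tokens>';
    while pos <= len loop
      -- skip whitespace
      while pos <= len and
            substr(text, pos, 1) in (' ', chr(9), chr(10), chr(13)) loop
        pos := pos + 1;
      end loop;
      exit when pos > len;
      spos := pos;
      -- scan to the end of the word
      while pos <= len and
            substr(text, pos, 1) not in (' ', chr(9), chr(10), chr(13)) loop
        pos := pos + 1;
      end loop;
      wrd := substr(text, spos, pos - spos);
      tok := upper(replace(wrd, '''', ''));   -- uppercase, drop the okina
      if locneeded then
        tokens := tokens || '<word off="' || to_char(spos - 1) ||
                  '" len="' || to_char(length(wrd)) || '">' || tok || '</word>';
      else
        tokens := tokens || '<word>' || tok || '</word>';
      end if;
    end loop;
    tokens := tokens || '</tokens>';
  end;

A real implementation would take CLOB input, handle punctuation, and follow the exact DTD in the Text Reference.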

ul_idx_proc will be called:

  • once for each document during indexing (location data needed)
  • once for each stopword during index metadata loading
  • once for each document services call which requires lexing (location data needed)

Next, we create a similar query lexer procedure:

 

  create or replace procedure ul_qry_proc(
    text   in     varchar2, 
    wilds  in     ctx_ulexer.wildcard_tab,
    tokens in out nocopy varchar2
  ) is
  ...

"wildcards" is an input table of numbers which tells you which characters of the passed-in text are wildcard characters. The output format of tokens is similar to that of the index procedure. ul_qry_proc will be called at query time, once per query term.

Now we create the user lexer:

 

  ctx_ddl.create_preference('hawaiian_lexer','user_lexer');
  ctx_ddl.set_attribute('hawaiian_lexer','index_procedure','ul_idx_proc');
  ctx_ddl.set_attribute('hawaiian_lexer','input_type','varchar2');
  ctx_ddl.set_attribute('hawaiian_lexer','query_procedure','ul_qry_proc');

and use it in creating an index, like any other lexer:

 

  create table htest(text varchar2(80));
  insert into htest values ('imua i''olani');
  create index htestx on htest(text) indextype is ctxsys.context
    parameters ('lexer hawaiian_lexer');

  select token_text from dr$htestx$i;

  IMUA
  IOLANI

This is an extremely simple example. Your user lexer can also do composite words, detect sentence and paragraph boundaries, and do custom number parsing. See the Text Reference for full details.

 



Index Maintenance

 

Create Index Online

Version 9.2.0 implements support for create index online, which allows DML on the base table during an index creation:

 

  create index myindex on mytable(mycolumn)
    indextype is ctxsys.context
    parameters ('datastore mydatastore')
    online;

Alter index rebuild online for "replace" and "resume" is also supported. These should improve usability for installations that cannot afford to delay changes during a long-running index creation or full rebuild.

The DML that happens during indexing gets logged to the table dr$online_pending. At the end of the index creation, these pending DMLs are processed, removing deleted rowids from $R and $K, and moving new rowids to dr$pending. These changes are therefore not actually added to the index until the next sync.

Keep the following limitations in mind when using online:

 

  • ONLINE is supported for CONTEXT indexes only

  • Although you can do DML on the base table, you cannot query during a create index online

  • ONLINE cannot be used in conjunction with PARALLEL. In this version you can build an index in parallel or online, but not both. We may add support for parallel online in a future version.

  • ONLINE is not supported for local indexes. You cannot create a local index online, nor rebuild a partition online.

  • There is a small window at the very beginning of a create index online (during our creation of index meta-data) where a concurrent DML will fail with an error message

  • There are small windows at the beginning and end of a create index online where an exclusive lock on the table is required. This means that the create index cannot be started if another session has an uncommitted transaction on the base table. It also means that at the end of indexing, the process will block until all open transactions on the base table are committed or rolled back (maybe that should be rollbacked?).

Parallel Create Index

The parallel create index in previous versions used dbms_jobs and could be applied only when creating global indexes on partitioned tables. The parallel support has been rewritten in this version, and now supports creating an index on a non-partitioned table in parallel.

When you issue a create index like:

 

  create index parx on parbase(text) parallel 4

under the covers, the base table scan is split up by a parallel query:

 

  select /*+ PARALLEL(base 4) */ rowid, text
    from parbase base;

The database's parallel query facility splits the base table into portions, spawns slave processes, and assigns a different portion to each slave. Each slave text-indexes the rows in its portion.

The method of slicing the base table into portions is determined by the database's parallel query facility and is not under your (or Text's) direct control. The same is true of the number of slave processes actually spawned, which depends on machine capabilities, system load, init.ora settings, and other factors. It may not match the degree of parallelism requested. See Notes on Parallel Degree for more information.

The parallel servers are P servers -- on my Solaris workstation, they show up named ora_p<slave#>_<SID> -- like ora_p001_TEXT.

If logging is on, you'll get a separate logfile for each slave process, with the slave number appended to the logfile name. For instance, a parallel 2 create index where logging is to file "LOG" will create "LOG_0" and "LOG_1" files. Be careful though -- each parallel operation opens and closes these files. So, if you do a parallel index create, followed by a parallel sync later on, the logfiles from the create can get overwritten, because each sync slave will re-open its assigned logfile, thereby erasing the existing contents. It's probably a good idea to back up the logfile or change the logfile name (through end_log then start_log) between parallel operations.

Also now supported are:

 

  • alter index ... rebuild partition ... parallel
  • alter index ... rebuild ... parallel
  • alter index ... rebuild parameters ('resume') ... parallel
  • alter index ... rebuild parameters ('replace') ... parallel

Keep the following limitations in mind when using parallel:

 

  • PARALLEL is supported for CONTEXT indexes only

  • PARALLEL cannot be used in conjunction with ONLINE. In this version you can build an index in parallel or online, but not both. We may add support for parallel online in a future version.

  • PARALLEL is not supported for local index creation. If you specify parallel for a local index creation, the clause is ignored and the index creation is done serially. It still has the effect of setting the default parallel degree for queries, which may have a noticeable performance effect on queries (see Parallel Query on Local Indexes). To build a local index in parallel, use DBMS_PCLXUTIL.

Notes on Parallel Degree

The new parallel operations -- parallel create index, alter index, sync index, and optimize index -- use parallel query under the covers. For parallel create index/alter index/sync, if you specify parallel degree of N, and you get N servers started, only N/2 of those servers are actually doing indexing. If you are logging, you will only see N/2 logfiles. The other N/2 servers are doing the work of partitioning the base table. So, if you want 4 servers actually indexing, you should specify a parallel degree of 8.

Parallel optimize does not use the same method. You specify N, it starts N slaves, and all N of those slaves are doing optimize. However, the workload across those N may be unequal, and some parallel servers may not actually have any work to do -- you may see <N logfiles.

The settings of various database parameters also affect the actual degree of parallelism. With improper settings, you could specify parallel 8 and only get 4 slaves. The important ones are:

 

  • PARALLEL_ADAPTIVE_MULTI_USER: this should be false, since we are explicitly setting the parallel degree through hints, under the covers

     

  • PARALLEL_MAX_SERVERS: this bounds the number of P servers that can be spawned. So, for create index, should be >= 2*N, where N = number of servers you want to be indexing, and for optimize index, should be N.

     

  • PARALLEL_THREADS_PER_CPU: PARALLEL_THREADS_PER_CPU * (# of CPU's) should be >= specified parallel degree.

     

See the Database Performance Guide, Concepts Manual, or Data Warehousing Guide for general information on how to set up your database for parallel query.

 

DBMS_PCLXUTIL Support

The UNUSABLE keyword is now supported for create index. This will create the index meta-data and then immediately exit the indexing. With support for rebuild partition in parallel, you can now use DBMS_PCLXUTIL to get parallel local index creation.

For instance, to create a local index on ptest, doing 3 partitions at a time, building each partition in parallel 2, you'd do:

 

  create index ptestx on ptest(text)
    indextype is ctxsys.context
    local unusable;

  exec dbms_pclxutil.build_part_index(3, 2, 'PTEST', 'PTESTX', TRUE);

Exchange Partition

alter table exchange partition quickly swaps data between a non-partitioned table and a table partition -- it doesn't physically move the data, it simply reassigns data segments. Using exchange partition, you can quickly coalesce several independent tables into one large partitioned table. You can have multiple site databases collect info, transportable tablespace them into a master database, and use exchange partition to get a monolithic table for data mining, for example. When exchange partition is used with the "including indexes" clause, the same fast switching is supposed to happen with the non-partitioned table's index and the partitioned table's local index partition.

In 9.0.1, an exchange partition including indexes for context indexes forced a rebuild of both the index on the non-partitioned table and the index on the table partition. This has been rewritten so that exchange partition now does data switching, making it much faster.

As an example, let's examine what happens when you do:

 

  create table npt(id number, text varchar2(80));
  insert into npt values(1, 'cat');
  create index npx on npt(text) indextype is ctxsys.context;

  create table pt(id number, text varchar2(80))
    partition by range(id)
    (partition p1 values less than (10));
  insert into pt values (1, 'dog');
  create index px on pt(text) indextype is ctxsys.context local;

  alter table pt exchange partition p1 with table npt
    including indexes;

First, Text will create a temporary table for each internal table. The $I temporary table, for instance, will be DRMNPX$I. These temporary tables are partitioned, with one partition. Then, it issues exchange partition commands to:

  1. swap non-partitioned $I with temporary $I
  2. swap partitioned $I with temporary $I
  3. swap temporary $I with non-partitioned $I

for each internal table, except for $K. $K cannot be partitioned, so instead we physically move rows between the three $K tables. This data moving will probably take the majority of the time for most exchange operations. Finally, the temporary tables are dropped.

In order to use exchange partition including indexes, the internal tables cannot be partitioned. If, using the storage preference, you have partitioned any internal table, you'll get an error message if you try to do exchange partition including indexes. Similarly, if the two indexes do not use the same indexing objects, you will get an error message during exchange. For these cases, you will have to use exchange partition EXCLUDING INDEXES and then manually rebuild the non-partitioned index and the partitioned index's index partition.

Parallel Sync

Sync of context indexes can now be run in parallel. Specify the desired number of slaves in the new parameter parallel_degree in ctx_ddl.sync_index. Like create index in parallel, this uses the parallel query mechanism under the covers -- the driving parallelized query is a join of dr$pending and the base table. See Notes on Parallel Degree for more information.
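
For example, using named notation (the index name and memory setting are placeholders):

 

  begin
    ctx_ddl.sync_index(
      idx_name        => 'myindex',
      memory          => '50M',
      parallel_degree => 4
    );
  end;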

Unlike serial sync, a snapshot too old error during a parallel sync will cause a failure, so if possible minimize the possibility of such errors by running parallel syncs during periods where DML activity on the base table is low.

Parallel sync is not supported for any other indextype.

New Sync Locking Mechanism

9.0.1 sync locks each row in the base table during indexing. A long-running sync operation can therefore inconveniently block users' access to base table rows for long periods of time. The locking algorithm has been rewritten in 9.2.0 to lock these rows only at the end of indexing. This greatly reduces the window in which user access is blocked. A side effect of this change is that updates during sync processing may result in garbage in $I. The sync process will index the old version of the document, notice the update at lock time, and be forced to automatically invalidate the just-indexed document. We do not believe that this side effect will be noticeable or even encountered at most installations.

See Appendix C if you are interested in an in-depth technical overview of the new sync locking mechanism.

Optimize Logging

Optimization now writes periodic progress updates to the CTX_OUTPUT log. If greater granularity is desired, the new event CTX_OUTPUT.EVENT_OPT_PRINT_TOKEN will print each token as it is being optimized.
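
For example, to get per-token logging during an optimize run (logfile and index names are arbitrary):

 

  exec ctx_output.start_log('optlog');
  exec ctx_output.add_event(ctx_output.event_opt_print_token);
  exec ctx_ddl.optimize_index('myindex','FULL');
  exec ctx_output.end_log;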

Token Optimize Token Type

TOKEN mode optimization does a FULL optimize for a single token. In previous versions this was limited to type 0 (normal text) tokens, so you could not affect field section text tokens or zone section tokens, for instance. This version adds a new token_type argument to ctx_ddl.optimize_index which allows specification of the token type to optimize. For instance, to optimize the zone section HEADLINE:

 

  ctx_ddl.optimize_index('myindex', 'TOKEN', token=>'HEADLINE', token_type=>2);

If you don't know the token type number, you can use the new CTX_REPORT.TOKEN_TYPE function to translate a mnemonic name to token type. For instance:

 

  ctx_ddl.optimize_index('myindex', 'TOKEN', token=>'HEADLINE', 
     token_type=>ctx_report.token_type('myindex','ZONE SEC'));
or
  ctx_ddl.optimize_index('myindex', 'TOKEN', token=>'HEADLINE', 
     token_type=>ctx_report.TOKEN_TYPE_ZONE_SEC);

See CTX_REPORT for more information on valid inputs to CTX_REPORT.TOKEN_TYPE.

New Full Optimize Method

FULL optimize of context indexes has been rewritten, and has the following two enhancements:

Will not rewrite already-optimized rows. 9.0.1 optimize deletes every row in $I, removes garbage, defragments, then re-inserts them. This causes a lot of excess rollback and redo activity when run on already-optimized indexes. The new optimization method is designed to leave such rows alone.

Optimizes all $N rows. 9.0.1 optimize was limited to removing garbage for maximum 16,000 docids, and would remove garbage only for those docids that were marked in $N. So, if you did a time-limited full optimize, then deleted 10,000 docids, the next time-limited full optimize would not see those new deletes, and would not remove their information from $I. The new optimization method reads the full contents of $N at the start of each full optimize, and removes all information for those docids. Marking and removal of rows from $N is still capped, so you will see $N count decreasing 16,000 at a time -- under the covers, we're really optimizing everything, so this should not be cause for concern.

Parallel Full Optimize

FULL optimize of context indexes can now be run in parallel. Specify the desired number of slaves in the new parameter parallel_degree in ctx_ddl.optimize_index. Like create index in parallel, this uses the parallel query mechanism under the covers, so the actual degree of parallelism achieved will depend on machine capabilities, system load, init.ora parameters, etc. (see Notes on Parallel Degree) The partitioning of the load is done by hash value on token, which means that a token's info will never get split across slaves, and that some slaves may be slower than others (if they have large tokens).

A parallel time-limited full optimize may result in certain $I rows being optimized more than once. For instance, let's say you do a parallel 2 full optimize, and, when time runs out, slave 1 is up to DOG, while slave 2 is up to DOKUMENT. Because there is no guarantee that all rows between DOG and DOKUMENT have been optimized, the saved state is the minimum, which is DOG. The next full optimize will therefore re-optimize any rows between DOG and DOKUMENT that slave 2 has already examined. However, with the new FULL optimize method, this should have a minimal impact if those rows are still optimal.

FAST optimize has not been parallelized, and parallel FULL optimize is not supported for any other indextype.

MARK_FAILED

While we have not been able to reproduce such a situation in-house, customers have reported that occasionally a create index or alter index fails and leaves the context index in an unusable state. The database believes that the index is in "LOADING" or "INPROGRESS" state and blocks any attempt to recover through alter index. The only recourse is to drop and recreate the index.

For this situation 9.2.0 adds CTX_ADM.MARK_FAILED, a call to force the index from LOADING to FAILED, thus unblocking alter index.

This is a last-resort failsafe and is not considered a fully public part of our API. This procedure can be dangerous if run improperly. It should be run only under direction from support or development, and only when no other session is touching the index.

 



Query

 

Optimizer Statistics Tuning

The algorithms for estimating cost and selectivity of context queries have been tweaked. This should result in optimal plans being generated more often when using the CBO.

Query Template

Query templates are an alternative to the existing query languages. Instead of passing a query string to contains or catsearch, you pass a structured document which contains the query string in a tagged element. Other elements in this document control aspects of query processing.

The simplest form of query template is:

 

  <query>
    <textquery>cat or dog</textquery>
  </query>

You just use the document as the query string, so a complete query looks something like:

 

  select id from mydocuments
   where contains(text, 
                  '<query>
                    <textquery>cat or dog</textquery>
                   </query>')>0;

though in real life the query document will most likely be generated by the application into a pl/sql or bind variable and passed to the contains. You don't need to do anything to be able to use query templates -- Oracle Text recognizes that the query string starts with "<query>" and automatically shifts to query template parsing mode. If it doesn't see that opening tag, the default query parser is used, behaving like previous versions.
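
For example, here is a sketch of building the template into a PL/SQL variable and binding it into the contains (reusing the mydocuments table from above):

 

  declare
    qtmpl varchar2(2000);
    cnt   number;
  begin
    qtmpl := '<query>
                <textquery grammar="context">cat or dog</textquery>
              </query>';
    select count(*) into cnt
      from mydocuments
     where contains(text, qtmpl) > 0;
    dbms_output.put_line('matching documents: ' || cnt);
  end;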

You can control two things using query templates in this version: grammar and score datatype. An attribute of textquery, grammar can be set to either "context" or "ctxcat", and specifies the query language of the textquery. This allows you to use the ctxcat language for queries on context indexes, or the context language for queries on ctxcat indexes:

 

  select id from mydocuments
   where contains(text, 
                  '<query>
                    <textquery grammar="ctxcat">cat | dog</textquery>
                   </query>')>0;

The datatype attribute of the empty score element controls whether the score() function returns an integer, or a floating point number:

 

  select score(1), id from mydocuments
   where contains(text, 
                  '<query>
                    <textquery>cat or dog</textquery>
                    <score datatype="float"/>
                   </query>', 1)>0;

  SCORE(1)         ID
  --------  ---------
     4.403        103
     7.217        101

We plan to add more features to query templates in future releases.

The query template language is XML-like, but not XML. Some differences:

 

  • You don't need to use entities. This makes it easier to use some of our operators, such as:

     

    <query>
      <textquery> foo & bar </textquery>
    </query>
    
    instead of
    <query>
      <textquery> foo &amp; bar </textquery>
    </query>
    
  • tag names, attribute names, and allowed attribute values are case-insensitive

  • The parsing is stripped down for performance, so it's not robust. Form and validity mistakes may not raise errors -- they may just be ignored. For instance, if you add an unallowed space in the attribute=value pair:

     

    <query>
      <textquery grammar = "ctxcat"> cat frog </textquery>
    </query>
    

    you won't get an error -- but the query will be interpreted as a context language query string, meaning "the phrase 'cat frog'" instead of "cat and frog". You can even add invalid elements or attributes like "<foo>" without affecting query processing.

Parallel Query on Local Indexes

Thanks to extensible indexing enhancements, you can now parallelize contains queries on local partitioned context indexes. Parallel query is not supported for non-partitioned context indexes, or non-context indexes (e.g. ctxcat). Even for context indexes, the usage is a little tricky, and the performance may not necessarily be faster for your application.

To examine parallel query, we'll use an "articles" table which is partitioned by numeric "art_id" column, and which has a local partitioned context index on "art_text", called "artx". A simple query like:

 

  select art_id from articles
   where contains(art_text, 'myquery')>0

Uses the query plan:

 

PARTITION RANGE ALL
  TABLE ACCESS BY LOCAL INDEX ROWID ARTICLES
    DOMAIN INDEX  ARTX

This is a serial query -- it iterates over the index partitions, and does a contains query on each one. The contains query on partition 2 is not started until the contains query on partition 1 is complete.

In order to parallelize, you need to adjust the parallel degree on the index:

 

  alter index artx parallel 4;

This alter index statement only changes a bit of data in the SYS dictionary tables, so it takes a second or so to run. The parallel degree can also be specified in the create index statement, but, since the parallel keyword is ignored for local indexes (see Parallel Create Index), it is not seen often. Once the index parallel degree is changed, the query plan shows parallel execution:

 

PARTITION RANGE ALL   PARALLEL_COMBINED_WITH_PARENT
  TABLE ACCESS BY LOCAL INDEX ROWID ARTICLES  PARALLEL_TO_SERIAL
    DOMAIN INDEX  ARTX

This is, for the most part, the only way to get this query parallelized. The table parallel degree can also trigger parallelism, but only in limited circumstances (See Appendix D). There is no hint you can use to force parallel query -- the PARALLEL_INDEX hint does not work for domain index scans. Similarly, the NOPARALLEL_INDEX hint does not work -- once you've changed the index's parallel degree, ALL multi-partition queries run in parallel. To get serial execution, you have to alter the index again, with "parallel 1".

To process the contains query, the database will start up some number of parallel slaves -- it will try to start 4 slaves (the degree of parallelism on our index) but the actual number started will depend on the number of slave processes already running for other sessions, machine capabilities, init.ora parameters, etc. On Solaris, each slave is a database background process like ora_p001_<SID>. The slaves are each assigned a few partitions. They iterate over these partitions and do a contains query on each one while the parent process merges the results.

Parallel queries will deliver the most benefit for queries involving several partitions, that need all results (this includes score-sorted queries), and that are run on multi-processor machines. Parallel execution may actually slow down the query when run on a small machine, or when optimizing for response time, or system throughput (See Appendix D).

Reverse Docid Sorting

The natural order of contains queries (that is, the order of rows returned from a query in the absence of an order by clause) is docid order. 9.2.0 queries reverse this, so that query results are now returned in reverse docid order. This is on automatically for all contains queries in 9.2.0 -- you don't need to do anything to turn it on.

In order to return hits in reverse docid order, each term node in the query pre-decompresses the docid information from the $I row into an array, then walks the array backward. In order to save memory, the array does not include word position information, which is instead re-read from the $I data.

We have run numerous benchmark tests, and have not shown any noticeable performance impact of this extra processing.

The intent is to provide rough reverse-date sorting. If documents are inserted in chronological order, and sync'ed in chronological order, then reverse docid order is reverse chronological order. In practice, however, docid order can get out of sync with chronological order, so don't rely on reverse docid if you need accurate order.

Some of the ways docid order can get out of sync with chronological order:

 

  • When the index is initially created, the main scan on the base table is done without an order by. If this scan does not return the documents in chronological order, the docids will not be assigned in order.

  • Whenever a row is updated and sync'ed, it gets a new, higher docid. This means that updating an old row will pop it to the front of reverse docid-ordered hitlists. This also applies to DML which occurs during a create index online.

  • During sync, rows are processed in rowid order -- if rowid order does not mirror chronology, docid order will also be off.

  • Docid order is only meaningful per-partition of a local partitioned index. If the table is not partitioned chronologically, docids could mirror chronology for a particular partition's results, but not for the table overall.

  • Docids assigned during a parallel index create or parallel sync are not likely to be in chronological order. Docids are assigned to each slave in blocks, and there is no control over how the database partitions the workload to each slave. Rows that are close chronologically can be assigned to different slaves and end up with docids that are far apart. Furthermore, within each slave there is no control over the order in which documents are indexed and assigned docids.

  • If a base table row is locked during a sync, that row is invalidated and instead gets indexed during the next sync. It will get a higher docid than any other row in its sync batch.

 



Miscellaneous Other Features

 

CTXCAT Theme Indexing

Theme indexing for CTXCAT is now supported. The indexed themes can be used when ABOUT queries are issued through catsearch, using the new query template facility. Since ctxcat documents are not usually very long, you should set PROVE_THEMES to FALSE for the lexer preference.
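
Here is a sketch of how that might fit together (table, column, and preference names are hypothetical); the ABOUT query is issued through a query template using the context grammar:

 

  begin
    ctx_ddl.create_preference('theme_lexer','BASIC_LEXER');
    ctx_ddl.set_attribute('theme_lexer','INDEX_THEMES','TRUE');
    ctx_ddl.set_attribute('theme_lexer','PROVE_THEMES','FALSE');
  end;
  /

  create index itemx on items(description)
    indextype is ctxsys.ctxcat
    parameters ('lexer theme_lexer');

  select * from items
   where catsearch(description,
                   '<query>
                      <textquery grammar="context">about(animals)</textquery>
                    </query>', '')>0;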

CTXXPATH Indextype

This version includes a new indextype -- ctxxpath. Although this indextype is created in Text install, and uses Text code, it really doesn't have anything to do with Text searching. A ctxxpath index, which can be built only on sys.xmltype columns, is used to speed up certain queries using the existsnode method (See the Application Developer's Guide - XDB).

Classification Training

There are three steps to setting up a basic document classification application:

 

  1. Group related sample documents together
  2. For each group, write rules that explain why the documents belong in the group
  3. Using the rule set, classify incoming documents into appropriate groups

Version 9.0.1 introduced the ctxrule index type. This automated step 3 of the process, but it was up to the user to manually write the routing rules. The new CTX_CLS.TRAIN automates step 2; it statistically analyzes document groups and automatically generates ctxrule-compatible rules.

To use training, first gather a sample set of documents -- you can even use the real document table, if you like. The document table MUST have a numeric column with all positive, integer key values less than 4 billion. It doesn't have to be the primary key, but it should be unique and allow fast access by id value. We'll call this column the doc_id column, though you can name it whatever you want.

Next, create a context index on the table. You can use whatever preferences you want, but TRAIN will generate rules based on the tokens produced, so for best results the lexer preference should match the lexer preference you intend to use for the ctxrule index:

 

  create table docs (doc_id number primary key, text clob);
  <<populate the table>>
  create index docsx on docs(text) indextype is ctxsys.context;

TRAIN will consider themes for its generated rules if they are present in the index, but theme indexing is not required. TRAIN will not consider any token types other than 0 and 1, so it cannot discover classification based on, for instance, subject codes in a non-visible field section. This may be added in future releases.

Now create a table to hold the document groups. It's a many-to-many normalization table, so it needs one column to hold a document id, and one column to hold a category id. You can name the table and the columns whatever you like, and have extra columns if you want. However, the doc_id and cat_id columns both have to be numeric, limited to positive integer values less than 4 billion. It's assumed that you have a category table out there somewhere, but it is not actually used during training, so technically you don't need one. Training will perform best if you have a normal b-tree index on the category id column (or a composite key index where category id is the first column).

 

  create table cat_doc(
    cd_cat_id number, 
    cd_doc_id number, 
    primary key (cd_cat_id, cd_doc_id)
  ) organization index;
  <<organize your sample documents into groups>>
  <<populate cat_doc table>>

Final setup step: create a result table. Recall that the output of TRAIN is a set of ctxrule-compatible rules. So the result table must have three columns: a numeric column to hold the cat_id, a varchar2(4000) column to hold the rule text, and a numeric column to hold the confidence level -- a score telling you how well the produced rule approximates the grouping:

 

  create table rules(
    rule_cat_id     number,
    rule_text       varchar2(4000),
    rule_confidence number
  );

Now we run TRAIN, passing it the names of the tables and columns we've set up:

 

  begin
    ctx_cls.train(
      index_name => 'docsx',
      docid      => 'doc_id',
      cattab     => 'cat_doc',
      catdocid   => 'cd_doc_id',
      catid      => 'cd_cat_id',
      restab     => 'rules',
      rescatid   => 'rule_cat_id',
      resquery   => 'rule_text',
      resconfid  => 'rule_confidence'
    );
  end;
  /

If logging is on, progress will be shown in the logfile. When it completes, the rule table will be filled with rules. As a last step, you can create a ctxrule index on the result table.
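
A sketch of that last step, plus a classification query (the MATCHES operator is the query operator for ctxrule indexes; the bind variable stands in for a hypothetical incoming document):

  create index rulesx on rules(rule_text) indextype is ctxsys.ctxrule;

  select rule_cat_id
    from rules
   where matches(rule_text, :incoming_doc) > 0;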

It may not be possible to generate rules in some situations; you won't always get rows in the rules table. Also, this is a statistical process, so the rules only approximate your grouping -- if you classified the sample documents using the produced ruleset, you may not get exactly your cat_doc table back.

See Appendix E for a closer look at the training process, including coverage of RULE_CLASSIFIER preferences.

CTX_REPORT Package

The CTX_REPORT functions provide information about text indexes for diagnosis and maintenance. The procedures analyze the index, generate a report, then write the report to a CLOB. For convenience, most have a function form which allocates a call-duration temp clob and returns it, so you can do something like:

 

  set long 64000
  set pages 0
  set heading off
  spool myindex_report

  select ctx_report.describe_index('myindex') from dual;

  spool off;

You can see sample output for some ctx_report functions in Appendix F, if you want to get an idea of what's included in each report. Now let's take a look function by function:

DESCRIBE_INDEX is a human-readable description of index metadata, including the name of the base table, the objects used for each class, and the values of their attributes. It's useful when you don't know, don't remember, or wish to verify an index's settings.

DESCRIBE_POLICY is like DESCRIBE_INDEX, but for policies.

CREATE_INDEX_SCRIPT generates a sql script which includes preference creation PL/SQL blocks and a create index statement to re-create a given index. It's great when you need a testcase, but be aware that it won't include external requirements like a user datastore procedure.

CREATE_POLICY_SCRIPT is like CREATE_INDEX_SCRIPT, but for policies.

INDEX_SIZE lists the internal index tables and, using queries on sys data dictionary tables and DBMS_SPACE, shows their allocated and used sizes.

INDEX_STATS analyzes the $I table and calculates statistics like the number of unique tokens, the 10 most frequent tokens, average size of token_info per token, etc. It can also calculate statistics like estimated row fragmentation, garbage size, and most fragmented tokens.

INDEX_STATS is heavy-duty. In order to calculate the stats, it needs to scan the entire $I table, which can take a long time for large indexes. You can track its progress in the ctx_output logfile, if logging is on. Because it's expected to run for long periods of time, it's not suitable for calling in a select list, and there is no function variant.

It also needs to compile a lot of information, for which it uses a session temporary table. Insertions to the temporary table are committed periodically, so a single call to INDEX_STATS ends the current transaction implicitly. This also means that you cannot pass a lob locator, because lob locators cannot span transactions. Instead, pass in NULL and let ctx_report allocate a session-duration temp clob. Something like:

 

  declare
    x clob := null;
  begin
    ctx_report.index_stats('tdrbprx21',x);
    insert into output values (x);
    commit;
    dbms_lob.freetemporary(x);
  end;
  /
  select * from output;

TOKEN_INFO decodes a token's info in $I. The report shows which documents contain the token, the word offsets where the token occurs in each document, and statistical information like average number of occurrences per document. It's good for verifying the index contents during diagnosis of problems.

TOKEN_TYPE translates a mnemonic label to a numeric token type. It has two arguments: an index name, and the mnemonic label. The index name must be provided in order to look up metadata for field section types. Valid inputs for the mnemonic are:

   Input                 Meaning                                Type Returned
   -----                 -------                                -------------
   THEME                 theme token                            1
   ZONE SEC              zone section                           2
   ATTR TEXT             text that occurs in an attribute       4
   ATTR SEC              attribute section                      5
   PREFIX                prefix token                           6
   PATH SEC              path section                           7
   PATH ATTR             path attribute section                 8
   STEM                  stem form token                        9
   FIELD <name> TEXT     text token in field section <name>     16-79
   FIELD <name> PREFIX   prefix token in field section <name>   616-679
   FIELD <name> STEM     stem token in field section <name>     916-979

for example:

 

  typenum := ctx_report.token_type('myindex', 'field author text');

The FIELD* types will fetch the index meta-data from the dr$ tables, so if you're going to be calling these often, you might want to cache the values to avoid repeated lookups.

If you are just interested in one of the static types, it does not need the index metadata, and you can pass NULL as the index name, if you like. Actually, you're probably better off using one of the package constants:

 

  TOKEN_TYPE_TEXT      constant number := 0;
  TOKEN_TYPE_THEME     constant number := 1;
  TOKEN_TYPE_ZONE_SEC  constant number := 2;
  TOKEN_TYPE_ATTR_TEXT constant number := 4;
  TOKEN_TYPE_ATTR_SEC  constant number := 5;
  TOKEN_TYPE_PREFIX    constant number := 6;
  TOKEN_TYPE_PATH_SEC  constant number := 7;
  TOKEN_TYPE_PATH_ATTR constant number := 8;
  TOKEN_TYPE_STEM      constant number := 9;

Policies

This version re-introduces policies. A policy is a synthetic object like a preference. You can think of it as a context index minus the index data -- that is, it's just a collection of index objects and attribute settings.

You create a policy with ctx_ddl.create_policy, specifying preferences for each class, just like the parameters clause of create index:

 

  begin
    ctx_ddl.create_policy('german_filtered',
       filter        => 'ctxsys.inso_filter',
       section_group => 'ctxsys.html_section_group',
       lexer         => 'my_german_lexer',
       stoplist      => 'd_stoplist',
       wordlist      => 'd_wordlist'
    );
  end;

Since there is no base table, there is no need for a datastore preference. Other than this, it works like create index.

The defaulting system works for policy creation. For instance, if you specify an INSO_FILTER for filter, and NULL for section_group then the HTML section group will kick in automatically.

Preference values are deep-copied, so you can modify or even drop the preferences after create_policy without affecting the policy.

You can update an existing policy using ctx_ddl.update_policy and drop a policy using ctx_ddl.drop_policy.
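
For instance, assuming update_policy takes the same parameter names as create_policy:

  begin
    ctx_ddl.update_policy('german_filtered', lexer => 'my_other_german_lexer');
  end;
  /
  exec ctx_ddl.drop_policy('german_filtered');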

Policies use the same data dictionary tables as indexes, so you can view existing policies using the views CTX_USER_INDEXES, CTX_USER_INDEX_OBJECTS, and CTX_USER_INDEX_VALUES. Or, preferably, try out the new function CTX_REPORT.DESCRIBE_POLICY. Either way, you'll see that under the covers, policies are indexes created on the table dr$policy_tab. On a side note, this table has a language column so that you can use multi-lexer preferences.

So what are policies for? Not much in this version. They are used in the XPath function ora:contains, which can be used in the sys.xmltype methods (see the Application Developer's Guide - XDB). In the future, we intend to use policies for things like direct document services and true functional invocation. Watch this space.

JDeveloper Text Wizard

The Text Wizard is an add-in for JDeveloper which provides a menu-driven interface for generating simple Text applications. You fire it up, answer some prompts, and it generates SQL scripts for creating the preferences and index, scripts for sync, and JSP's for web query.

The Text Wizard will soon be available on OTN.

 


 

 

Chapter 6 : Oracle 10g

Oracle Text 10g - Technical Overview

This is a technical overview of the improvements found in Oracle Text version 10.1.0.2. This is intended for an audience fairly familiar with previous versions of the product.

 

If you are new to the product, you should familiarize yourself with the product before reading this document. The official documentation, especially the Application Developer's Guide, is a good place to start, as are the interMedia Text technical overviews for 8.1.5, 8.1.6, 8.1.7, 9.0.1, and 9.2.0 in order. (I really need to get around to creating an updated overall technical overview which merges these together)

The goal of this overview is to supplement the documentation in three ways:

  1. Collect all the new features -- and only the new features -- in one document so that expert users can catch up just by reading this instead of reading the whole doc and mentally diff'ing.

  2. Sketch each new feature without covering every option, bell, and whistle. Give the reader enough information to decide if the new feature merits further investigation in the main doc.

  3. Include information, observations, and notes on each feature which may not be suitable for documentation. For instance, opinions on the pros and cons of feature X, the internal mechanisms of feature Y, speculation of how feature Z could be useful, etc.

If, while reading this document, you think to yourself, "I wish it could do X," or "there's a better way to do this," or (hopefully not) "that's really dumb," please tell us -- we're always eager to hear from the people who actually use our product!  Oracle personnel can email us or our product help mailing list.  External customers -- contact support and have them log an enhancement or drop us an email.



Table of Contents

 



Upgrading to 10g

Oracle Text upgrade should happen automatically as part of the normal database upgrade.  This section details the manual upgrade process, in case something goes wrong or for the incorrigibly curious.

Oracle Text is now consistent with the database upgrade process in naming convention, run procedure, and registry integration:

  • The naming convention of the upgrade scripts in previous versions was the version to which you are upgrading.  That is, to upgrade to 9.2.0, you ran u0902000.sql.  In 10g, the naming convention is the version from which you are upgrading.  To upgrade to 10g from 9.2, u0902000.sql gets run.

  • In 9.2, there were two scripts to run -- an s script run as SYS, and a u script run as ctxsys.  Now, there is a single script -- ctxu920.sql, for instance -- that should be run as SYS.  Internally, the script does an ALTER SESSION SET CURRENT_SCHEMA when needed.  In 9.2.0, if jumping multiple versions -- upgrading from 8.1.7 to 9.2.0, for instance -- you had to run a pair of scripts for each version step -- s/u0901000 to go from 8.1.7 to 9.0.1, then s/u0902000 to go to 9.2.0.  In 10g, the earlier scripts call the later scripts, so you run only one script, even for a multiple-version jump.

  • As in 9.2.0, Oracle Text supports the DBMS_REGISTRY facility, so that you can query the registry views to see if Oracle Text is installed, and its version.  In 10g, we've added a simple validation procedure which checks that all expected database objects owned by CTXSYS are present and VALID.  This is not an exhaustive check of data dictionary validity, so a VALID status in the registry does not absolutely guarantee successful install, but it is a reasonable first check.

So, the manual upgrade procedure really boils down to a single step:

  1. run ctxu<ver>.sql as SYS
where <ver> is the version FROM which you are upgrading.  For instance, to upgrade from 9.2.0 to 10g, run ctxu920.sql.  If you are coming from 8.1.7, you would run ctxu817.sql instead.
  

Manual Post-upgrade Steps

Due to the security changes documented below, manual post-upgrade steps may be required to keep an existing system operational.  The biggest potential problem is a user datastore procedure owned by ctxsys which depends upon database objects in other users' schemas.  ctxsys's newly lowered database privileges will prevent this procedure from working, which in turn will prevent the index from sync'ing new data.  The quickest solution is to manually grant any needed permissions to ctxsys.  However, the recommended solution -- which applies also to those existing user datastores which continue to work -- is to recreate the user datastore as a non-ctxsys-owned procedure and modify the index metadata.  Step-by-step instructions are below.



Security Changes

CTXSYS Database Privileges

CTXSYS no longer has ALL privileges.  It is a normal user with only CONNECT and RESOURCE

In previous versions ctxsys was an über-user with ALL database permissions.  To enhance security in 10g, we've reworked the product infrastructure so that ctxsys needs far fewer permissions.  By default, ctxsys now has only CONNECT, RESOURCE, and some specific SELECT privs on database data dictionary views.  Most systems can even revoke CONNECT and RESOURCE from ctxsys and continue to operate, although this is not actually supported.

This change should not be visible to end users, and even expert OT users shouldn't need to care, except for one wrinkle: the effect on user datastores in existing systems after upgrade.  Consider a simple user datastore created in 9.2 like this:

conn text/iscool
create table foo (a varchar2(10), b varchar2(10));

conn ctxsys/ctxsys
-- 9.2-style user datastore: owned by ctxsys, reading another user's table
create or replace procedure foo_udp(r in rowid, c in out nocopy clob) is
  buf varchar2(200);
begin
  for c1 in (select a, b from text.foo where rowid = r) loop
    buf := '<A>'||c1.a||'</A><B>'||c1.b||'</B>';
    dbms_lob.writeappend(c, length(buf), buf);
  end loop;
end;
/
grant execute on foo_udp to text;

conn text/iscool
exec ctx_ddl.create_preference('myud','user_datastore');
exec ctx_ddl.set_attribute('myud','procedure','foo_udp');

The user datastore procedure selects from text's table "foo".  The procedure is definer's rights, so it executes as ctxsys.  In 9.2, ctxsys has ALL privileges, so the select is allowed.  However, in 10g, ctxsys is a normal user, and no longer has implicit permission to select from other users' tables.  The datastore procedure will fail when executed.  There are four simple solutions to this problem:

  1. as text, grant select on foo to ctxsys with grant option

  2. make ctxsys.foo_udp invoker's rights (this works because the effective user during indexing is the index owner -- see below)

  3. recreate the user datastore procedure under the text schema, grant execute to ctxsys, and redefine ctxsys.foo_udp to call text.foo_udp

  4. RECOMMENDED recreate the user datastore procedure under the text schema, and update the index metadata to use this new procedure.  This depends on two new 10g features: user datastore procedures owned by users other than ctxsys, and rebuild index metadata.  Step-by-step instructions, with an example, are given below.

Invoker's Rights CTX Packages

The public CTX packages  are now invoker's rights packages.

In previous versions of the product, all public CTX packages were definer's rights.  The package code emulated invoker's rights by walking up the PL/SQL call stack.  This usually works, but does not support invoker's rights callers.  For instance, consider a wrapper which creates preferences:

conn usera/usera
create or replace procedure cr_ctx_pref(pre_name in varchar2, obj_type in varchar2)
  authid current_user
is begin
  ctx_ddl.create_preference(pre_name, obj_type);
end;
/
grant execute on cr_ctx_pref to public;
conn userb/userb
exec usera.cr_ctx_pref('myfilter','inso_filter');

userb is calling usera.cr_ctx_pref.  This is an invoker's rights procedure, so the effective user is userb.  This in turn calls ctx_ddl.create_preference.  The effective user emulation code looks at the call stack and sees USERA.CR_CTX_PREF is the caller.  It does not know that this is an invoker's rights package, so assumes that the effective user is USERA; this creates the preference USERA.MYFILTER, rather than USERB.MYFILTER.

In 10g, ctx_ddl has invoker's rights, and the package code determines current effective user from the session environment.  With this method, in the example above, the preference is created correctly as USERB.MYFILTER.

However, one hole still exists in name resolution: we don't honor ALTER SESSION SET CURRENT_SCHEMA.  If the example above does:

conn userb/userb
alter session set current_schema = usera
exec usera.cr_ctx_pref('myfilter','inso_filter');

One might reasonably expect the preference to be created as USERA.MYFILTER, but it is instead created as USERB.MYFILTER.  Current schema support may be added in a future release, but for now this is expected behavior.  To work around this limitation, the call should use owner.object syntax:

conn userb/userb
exec usera.cr_ctx_pref('usera.myfilter','inso_filter');

This will work as long as userb has the ability to create preferences owned by another user, as described below.

One last note.  Invoker determination has effects beyond name resolution, and this change adds support for invoker's rights for those effects, as well.  For instance, during a ctx_query.count_hits operation, to ensure security, the effective user must be able to do a select count(*) where contains -- in other words, we check that the effective user has select privs on the base table of the index.  Since ctx_query is now invoker's rights, you can have globally useful application code owned by a schema without requiring that schema to have permissions on invoker's objects.  For instance, consider an invoker's rights procedure that aggregates count_hits across multiple partitions:

conn usera/usera
create or replace function ach(
  qry   in varchar2, 
  idx   in varchar2,
  parts in sys.odcivarchar2list
) return number
  authid current_user is
  total number := 0;
begin
  for i in 1..parts.count loop
    total := total + ctx_query.count_hits(idx, qry, part_name=>parts(i));
  end loop;
  return total;
end;
/

Any user can make use of this function without granting usera select on the tables actually being queried.
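
For instance, a caller with select access to the base table (the index and partition names here are invented) could do:

declare
  total number;
begin
  total := usera.ach('oracle & text', 'userc.myindex',
                     sys.odcivarchar2list('part01','part02','part03'));
  dbms_output.put_line('total hits: '||total);
end;
/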

Owner's Rights Operations

The effective user during indexing and query is now the index owner.

The effective user during indexing and query in previous versions was ctxsys.  In 10g, the effective user is the index owner.  This is true for all index operations: SQL operations like create index and select .. contains, as well as PL/SQL operations such as ctx_ddl.sync_index, ctx_query.count_hits and ctx_doc.filter -- anywhere where either the index tables or the base table are being touched.

Keep in mind that owner's rights is slightly different from the more familiar invoker's rights or definer's rights models.  Consider a query executed by usera like this:

select * from table where userb.function(a)>0;

The effective user within "function" is either userb, if function is definer's rights, or usera, if the function is invoker's rights.  However, replace userb.function with ctxsys.contains:

select * from table where ctxsys.contains(a,'foo')>0;

and the effective user inside "contains" is neither usera, the invoker, nor the definer ctxsys.  Instead, it is the owner of the index, which may be some third user.

While this is subtle and interesting in a geeky way, the change to owner's rights is of little interest to the Text user except for user datastore procedures (see below) and temp tablespace.  In 9.2, since the effective user was ctxsys when creating the index, it used ctxsys's temp tablespace.  In 10g, it will use the index owner's temp tablespace.

User Datastore Restrictions

The restrictions on user datastore procedures have been relaxed.  Now, any user can own the user datastore procedure.

In 9.2, the effective user during indexing was ctxsys.  ctxsys was a super-user with ALL permissions, so malicious user datastore code could wreak havoc.  To encourage DBA review of user datastore code, only ctxsys could own user datastore procedures.  This made installation of text applications more complex because the ctxsys password had to be known to create the user datastore procedure.

In 10g, thanks to owner's rights during indexing, actions in the user datastore during text indexing are limited by the index owner's permissions -- it is no longer possible for a malicious user to hijack text indexing to give himself more permissions or to do anything that cannot already be done through direct SQL calls.  This eliminates the need for the security restriction on user datastore ownership, so it has been relaxed.  Now, any user can own the user datastore procedure.  In fact, Text development recommends that all user datastore procedures should be owned by non-ctxsys users -- ctxsys should not own application code or objects.  Information on migrating existing user datastores can be found below.

This relaxation is not limited to user datastores.  Any procedure used in a preference attribute -- procedure filter, user lexer, etc. -- can now be owned by any user.

The elimination of the restriction does present a small change in behavior when defining a preference.  In previous versions, since the user datastore procedure had to be owned by ctxsys, something like:

exec ctx_ddl.set_attribute('myuds','procedure','proc');

meant ctxsys.proc.  Now, however, ctx_ddl will follow ctx name resolution and store a fully qualified name in preference metadata.  To avoid potential problems, Text development recommends always using OWNER.PROCEDURENAME style when creating user datastore preferences.
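
In other words, something like this (the schema and procedure names are illustrative):

exec ctx_ddl.set_attribute('myuds','procedure','appowner.foo_udp');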

One last security note: owner's rights does not abrogate your responsibility to think carefully about the security ramifications of user datastore procedures.  For instance, a malicious user can still define an invoker's rights user datastore procedure and trick a DBA user into creating an index using it.  In this procedure the malicious user effectively gains DBA permissions.  As with any arbitrary submitted code, user datastores should always be reviewed for security.

Migrating Existing User Datastores

With the new security regime, Text development recommends changing existing indexes with user datastores to non-ctxsys-owned procedures.  Here is a step-by-step procedure to modify an existing text index.  This will not trigger a complete index rebuild, so should have minimal burden for production systems.

  1. create a new user datastore procedure

    The new procedure should be owned by the index owner, or by whatever application code schema is appropriate for your application.  It should duplicate the behavior of the existing user datastore procedure.  If you no longer have the user datastore procedure creation script, you can simply duplicate the existing user datastore procedure -- use ctx_report.describe_index to see which stored procedure is currently being used, then select from user_source as ctxsys to get the code.
  2. create a replacement user datastore preference

    If you still have the user datastore preference used when you created the index, you can simply update that preference by using ctx_ddl.set_attribute. Otherwise, you should create a new user datastore preference using the new procedure name as the PROCEDURE attribute value. As a shortcut, you can use ctx_report.create_index_script on the existing index, and adapt the section of the generated code which creates the user datastore preference.
  3. modify the index metadata

    Now rebuild the index to replace the user datastore:
    alter index myindex rebuild
    parameters('replace metadata datastore newuserdatastorepref');
    Don't worry -- this will NOT reindex the documents. The new replace metadata syntax only updates the index metadata. It should take seconds to run rather than hours.

Cross-User CTX Operations

DBA users can now create/modify/drop CTX objects owned by other users, and sync/optimize other users' indexes.

Since 8.1.5, Text restricted operations on synthetic objects -- preferences, section groups, stoplists, etc. -- to the object owner; usera could create new preferences for himself, set attributes and drop those preferences, but could not touch userb's objects. This is too restrictive for some applications, which do all administration -- including text administration -- from one account which may not be the same as the application code owner schema.  In 10g, DBAs with the CREATE ANY TABLE privilege are allowed to create/modify/drop preferences owned by any user.  This applies to all synthetic objects -- preferences, stoplists, section groups, index sets, and policies.  Yes, it's a little strange to need CREATE ANY TABLE, but there are no database permissions for CTX objects (or for synthetic objects in general) so we picked the closest database priv.

sync and optimize, which in previous versions were limited to the index owner, have also been expanded.  Mimicking the old, deprecated ALTER INDEX method of sync and optimize, the PL/SQL interface now allows any user with ALTER ANY INDEX to sync or optimize any index.

CTX_OUTPUT Restrictions

CTX_OUTPUT.START_LOG  filenames can no longer contain slashes and are limited to 256 bytes

To improve security, log file names can no longer contain slashes, and are limited to 256 bytes in length.  Additionally, an existing log file will be read and format-verified before being overwritten by a new logfile.



Indexing Objects

MULTI_COLUMN_DATASTORE Attribute FILTER

The MULTI_COLUMN_DATASTORE can handle a mix of text and binary columns, automatically sending the binary columns through the INSO filter

The multi-column datastore in 9i worked by simply concatenating the columns together; it was therefore limited to columns which could be converted implicitly to text.  The new FILTER attribute of the multi-column datastore expands support to binary data -- you specify which columns have binary data, and those columns are sent through the INSO filter before concatenation.

FILTER is a comma-separated list of filter indicators which parallel the columns in the column list.  If a column needs filtering, use a Y in its position.  For instance:

ctx_ddl.create_preference('myfil','multi_column_datastore');
ctx_ddl.set_attribute('myfil','columns','cola, colb, colc');
ctx_ddl.set_attribute('myfil','filter', 'Y,N,N');
cola will be filtered; colb and colc will not.  Currently the INSO filtering in the datastore is hard-coded to a fixed timeout of 120 seconds and HTML output.  Additionally, a column either gets filtered for every row or for no rows -- there is no capability for per-row filter indication, like the format column in the general create index.

MULTI_COLUMN_DATASTORE Attribute DELIMITER

The tags that the MULTI_COLUMN_DATASTORE places between columns are now optional

The multi-column datastore places, by default, HTML-like tags around each column.  The output seen by the filter looks like, for instance:

<columna>
these are the contents of columna
</columna>
<columnb>
these are the contents of columnb
</columnb>

In 10g, the new DELIMITER attribute can be set to COLUMN_NAME_TAG, which preserves this behavior, or NEWLINE, which uses a newline instead of these tags.  NEWLINE can be used in situations where the tags are not needed or may be confusing -- columns with XML fragments which are already tagged, or a single value split into multiple columns (like master-detail, but horizontally).
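
For instance, a sketch of a datastore that stitches two fragment columns together without tags (the preference and column names are invented):

ctx_ddl.create_preference('mystitch','multi_column_datastore');
ctx_ddl.set_attribute('mystitch','columns','frag1, frag2');
ctx_ddl.set_attribute('mystitch','delimiter','NEWLINE');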

INSO_FILTER Attribute OUTPUT_FORMATTING

 

The INSO filter is now capable of producing unformatted (non-HTML) output, which is faster to produce than HTML output.

In previous versions, the INSO filter transformed binary documents into (supposedly) WYSIWYG HTML.  In 10g, the INSO filter has a new OUTPUT_FORMATTING boolean which, when set to FALSE, produces unformatted plain text instead.  This is helpful mostly because our tests show that the INSO filter can produce the plain text faster than the HTML -- up to 30 times faster (except for PDF -- there is no performance improvement for PDF because PDF is already using plain text output, as of 9i).  However, it is also helpful in cases where the HTML tags were undesired.  For instance, you can now use the INSO filter with the NULL_SECTION_GROUP.  Or use it with AUTO_SECTION_GROUP and not worry about adding all the HTML tags as stop sections.  Or use it in a user datastore (using policy-based document services) that is supposed to output XML, without worrying about validity.

And did I mention that it is up to 30 times faster?
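
Something like this (the table, index, and preference names are invented) turns it on:

exec ctx_ddl.create_preference('fast_inso','inso_filter');
exec ctx_ddl.set_attribute('fast_inso','output_formatting','FALSE');

create index bindocs_idx on bindocs(doc) indextype is ctxsys.context
  parameters('filter fast_inso section group ctxsys.null_section_group');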

MAIL_FILTER

The MAIL_FILTER converts multipart mail messages into a single concatenated text document.

The new MAIL_FILTER object takes email messages and converts them to text documents.  It knows the conventions of RFC-822 and 2045-2049, which is a fancy way of saying that it can deal with MIME.  The short explanation is that the mail filter splits the mail into parts, converts each part to text, then concatenates them together.  In more detail, the basic operation of the mail filter is:

  • separate the headers from the body
  • parse the headers (that is, glue together multi-line headers, handle non-ascii character set encoding, remember details of Content-Type, Content-Transfer-Encoding, etc.)
  • if there are headers which the user wants to preserve, transform them into tagged format (<SUBJECT>blah blah blah</SUBJECT>)
  • split the body into parts
  • for each body part:
    • if the body part is itself multipart, recursively parse it
    • otherwise, if the body part is encoded, decode it
    • if the body part needs filtering, filter it
    • convert the textual body part to the database character set
    • append the body part text to the output

Preserving Header Fields for Section Search

By default the mail filter throws away the mail headers.  You can nominate certain header fields to preserve in the output by using the INDEX_FIELDS attribute, which takes a colon-separated list (why colon-separated instead of comma-separated?  Colon is guaranteed not to be part of a mail header field name, while comma is not).  For instance:

ctx_ddl.create_preference('mymf','mail_filter');
ctx_ddl.set_attribute('mymf','index_fields','subject:from');
...
Input document                                Output Document
--------------                                -------------------------------
Subject: Oracle Text is too fast!             <SUBJECT>Oracle Text is too fast!</SUBJECT>
To: otlist@oracle.com                         <FROM>test.user@someplace.com</FROM>
From: test.user@someplace.com
Content-Type: text/plain; charset=US-ASCII
X-Return-Path: blah blah blah blah blah

The Mail Filter Configuration File

The mail filter does most of the heavy lifting automatically -- splitting up multi-level multi-part messages, converting from the message character set(s) to the database character set, handling RFC-2047 encoded strings, decoding from base64 and quoted printable, etc.  However, its behavior is partially user-adjustable, through the mail filter configuration file.

The mail filter configuration file resides in ?/ctx/config.  By default, it is named drmailfl.txt, but you can create your own file and specify its name in the system parameter MAIL_FILTER_CONFIG_FILE (use ctx_adm.set_parameter).  The configuration applies to all mail filters in the system -- we do not support index-specific configuration in this version.  The file should be in the database character set -- if you have more than one database in more than one character set, ensure that the mail filter configuration file is legal in both charsets, or use two files -- one for each database.

So what's in the mail filter configuration file?  Two parts -- content-type handling specification, and character set name mapping.

Configuring Content-Type Handling

In the content-type handling section, you specify what the mail filter should do with various content-types.  You have three choices:

  • INCLUDE - concatenate the part to the output
  • IGNORE - skip over the part
  • INSOFILTER - INSO filter the part, then concatenate to the output

All you do is list the content-type (type or subtype), some whitespace, then the handling for that content type on the right-hand side. For instance, let's take a look at the default config file:

[behavior]
text               INCLUDE
application        IGNORE
application/msword INSOFILTER
image              IGNORE
audio              IGNORE
video              IGNORE
model              IGNORE

All subtypes of text -- for instance text/plain -- are simply concatenated to the output (charset filtering to the database character set happens automatically).  Most subtypes of application -- application/binkydrawing, for instance -- are discarded.  The specific subtype application/msword, however, gets INSO filtered, then concatenated to the output.  Images, audio, video and 3D models are simply discarded -- they are not useful for text indexing. 

Tuning INSO Filtering

The INSO filter in the mail filter can be tuned.  The INSO_TIMEOUT and INSO_OUTPUT_FORMATTING attributes of the mail filter work just like the TIMEOUT and OUTPUT_FORMATTING attributes of the INSO_FILTER, and their values control the INSO invocations within the mail filter.  Currently timeout type is hardcoded to FIXED.
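
A sketch, reusing the mail filter preference from above (the values are arbitrary):

ctx_ddl.set_attribute('mymf','inso_timeout','60');
ctx_ddl.set_attribute('mymf','inso_output_formatting','FALSE');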

Adding character-set mappings

The second part of the mail filter configuration file is an area for mapping external character set names to oracle character set names.  By default the mail filter can map IANA character set names (e.g. US-ASCII) to oracle character set names (e.g. US7ASCII).  In this section you can add additional mappings -- external name on the left, oracle name on the right.  For instance, in the default file:

[charsets]
UTF-16      AL16UTF16
UTF-16BE    AL16UTF16
UTF-16LE    AL16UTF16LE

Changing the Mail Filter Configuration File

When changing the mail filter configuration file, it's best to do an "alter system flush shared_pool" to force reloading of the index metadata (and thereby re-reading of the new mail filter config file).

MDATA Sections

The MDATA section is used for handling non-tokenized metadata, with isolated transactional updates.

Metadata in Oracle Text 9i is generally handled through the field section.  For instance, the author of a document may be coded like this:

<author>William Shakespeare</author>

Usually you would see a field section defined for author, and queries like:

(get thee to a nunnery) & (william shakespeare within author)

This works pretty well, but has some limitations:

  • the metadata gets tokenized
    • This prevents exact matching.  A search for "william shakespeare within author" will also hit a document authored by william shakespeare, jr.
    • Browsing does not handle multi-word values.  Using ctx_query.browse_words, you can find out that you have authors with "william" in their name, but not that "william shakespeare" is an author.
    • Special characters serve as token breakers.  For instance, say you have part numbers in a section.  A part number XZ-456 would get tokenized into XZ and 456.  This is now indistinguishable from, say, XZ+456, or XZ-999 (if stop numbers is enabled).  To prevent this, you'll have to add printjoins to the lexer, which may compromise the main body of the documents.
    • Phrase queries are slightly less efficient than single-word queries.  A query for the phrase "William Shakespeare" involves fetching two tokens, merging their document occurrences, then doublechecking the word occurrences for each of the documents in the intersect.  A theoretical query for WilliamShakespeare, on the other hand, involves fetching one token, and the word occurrence information is irrelevant.  Additionally, the posting lists for william and shakespeare will tend to be longer than the posting list for WilliamShakespeare.  In short, fewer tokens make queries a bit faster because there's less work to be done.
  • the metadata can't be changed in isolation
    • let's face it -- usually field section data comes from another column, and the document being indexed is synthetic, produced by pregeneration, a user datastore, or the multi-column datastore.  When the underlying relational data changes, the only way to propagate the change is to invalidate the existing document, then reindex the entire document.  This makes the document unfindable between the time it is invalidated and the time it is sync'ed, and the unneeded reindexing of the document's main body causes excess fragmentation.

10g offers the MDATA section, which is not tokenized and can be changed, transactionally, without synching the entire document. 

Defining MDATA Sections

To add an MDATA section to a section group, call:

ctx_ddl.add_mdata_section(<sectiongroup>,<sectionname>,<tag>)

The semantics and restrictions are the same as zone or field sections -- you can map multiple tags to one section name, section names can't conflict between section types, you can add them to basic, html, xml, and news section groups only, etc.  The one difference is that you can have up to 100 mdata sections.
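
A minimal sketch (the section group, section, and tag names are invented):

begin
  ctx_ddl.create_section_group('mysg', 'basic_section_group');
  ctx_ddl.add_mdata_section('mysg', 'author', 'author');
  ctx_ddl.add_mdata_section('mysg', 'partno', 'part_number');
end;
/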

How MDATA is Indexed

Unlike normal tokens, MDATA values are plucked out of the text by the sectioner and passed directly to the inversion engine, bypassing the lexer.  Let's look at an example, which will contrast normal text, field sections, and mdata sections:

doG <MDATA>  T.S. Elliot  </MDATA> HoUse <FIELD>  T.S. Elliot </FIELD>

This text comes out of the filter and gets processed by the section group.  The normal text (the words which occur outside of tags) gets sent to the lexer.  Both field and mdata section tags get consumed by the section group.  But field section text gets sent to the lexer, while mdata text is consumed by the section group.  So the lexer sees:

doG  HoUse   T.S. Elliot

Let's assume that we're using the default, basic lexer.  This breaks up the text into tokens by looking for non-alphanumerics, then uppercases each resulting token.  Normal text gets a type of 0 and the field section text gets a type of (let's say) 16.  So the engine will produce the following from the lexer (assuming default stoplist, where "S" is a stopword):

DOG      0  DOC1 POS1
HOUSE    0  DOC1 POS2
T       16  DOC1 POS1
ELLIOT  16  DOC1 POS3

Note that because the lexer never sees the MDATA text, you can hit phrases around it -- "dog house" as a query will hit this document, even though "dog" and "house" in the text are separated by an mdata section; MDATA is totally invisible to text search.  Normally, you'd see metadata all at the top of a document or all at the bottom, so most users won't even notice, but it's an interesting aside for detail-oriented types.

Back in the sectioner, the MDATA text gets stripped of leading and trailing whitespace and truncated to 64 bytes.  The result is then sent directly to the engine as a single value.  For our example, the resulting TOKEN_TEXT value will be:

T.S. Elliot

MDATA values get token types between 400 and 499, so we can expect our TOKEN_TYPE to be 400.  As for TOKEN_INFO, MDATA values don't have word position information -- they either exist or don't exist.  So our entry in $I for this is:

T.S. Elliot  400  DOC1

This is very different from how the field section was indexed.

MDATA sections can repeat -- each unique value gets indexed.  So a document like:

<MDATA>value 1</MDATA>
<MDATA>value 2</MDATA>
<MDATA>value 1</MDATA>
<MDATA>value 3</MDATA>
produces
value 1  400 DOC 1
value 2  400 DOC 1
value 3  400 DOC 1

Querying MDATA Sections

MDATA is queried using the new MDATA operator.  The syntax is MDATA(section, value), as in:

where contains(text, 'mdata(author, William Shakespeare)')>0

Pretty easy, but mdata searches have some differences from other elements in the contains language.

First, unlike WITHIN searches, MDATA searches are equality searches -- you'll hit only the exact queried value, no more, no less.  Take our example document and the comparable queries:

Elliot WITHIN field
MDATA(mdata, Elliot)

The first query hits the document, because the token "Elliot" is in between the field start tag and end tag.  The second query does not hit the document, because the mdata value is "T.S. Elliot", not "Elliot".  On the other hand, the field within is unable to search for "Elliot" and NOT hit "Elliot Ness", while the mdata search does this implicitly.  It all comes down to how you want to search your data.  Field sections should be used for cases where word search is needed (titles, for instance) and MDATA sections should be used where the section contents need to be treated as a single value.

Second, the MDATA value does not go through the lexer at index time, and neither does the query value.  This means that the query must match the indexed term exactly, without token normalization.  For instance:

t.s. elliot WITHIN field
MDATA(mdata, t.s. elliot)

The first query hits the document.  The input tokens (t, s, elliot) go through the lexer, which uppercases them.  The document text also got uppercased by the lexer at index time, so the tokens hit the document.  The second query, on the other hand, does not hit the document.  "t.s. elliot" does not match "T.S. Elliot".  This may seem limiting at first, but it makes possible features like section-specific case insensitivity.  And, you could always perform symmetric normalization yourself -- just uppercase the value in the document and in your query.  This is fairly easy to do if, like most apps, the queries and documents are generated (user/multi-column datastore, and a user query parser, e.g.).
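
A sketch of that do-it-yourself normalization, assuming the document generator uppercases the mdata values and the application builds the contains string (the table, section, and bind names are invented):

select id
  from docs
 where contains(text, 'mdata(author, '||upper(:author_name)||')') > 0;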

Third, MDATA searches are not affected by stopwords.  MDATA(artist, The The) is perfectly legal.  In our example, the word "S" was a stopword, so:

T.S. Elliot WITHIN field

is actually looking for T <anyword> ELLIOT in between field start and end tags.  This matches the document

<FIELD>T.R. Elliot</FIELD>

while the comparable MDATA search -- MDATA(mdata, T.S. Elliot) -- would not.

Finally, MDATA searches cannot use expansion operators in this version.  The value in the mdata operator is used directly as the query value -- any characters like %, $, ?, etc. are considered part of the value.  So the query

MDATA(mdata, T.S. Ell%)

does not find our example document.  It would, however, match the document

<MDATA>T.S. Ell%</MDATA>

Isolated MDATA Updates

Using new ctx_ddl functions, you can add and remove mdata values from an already-indexed document.  For example, let's say our documents have mdata to record the clubs to which a student belongs.  An example document:

<NAME>Bruce Aznable</NAME>
<GROUP>Soccer</GROUP>
<GROUP>Cooking</GROUP>

Bruce belongs to the soccer and cooking clubs.  Now he joins the drama club.  With field sections, we would have to reindex the document by updating it, and sync-ing.  With mdata sections, we simply do:

ctx_ddl.add_mdata('myidx','group','Drama','ABCD1234')

The parameters are: index name, mdata section name, new mdata section value, then rowid of the row which we're modifying.  This updates just the mdata for just this document, without syncing.  After this call, a query like

mdata(group, Drama)

Will pull up Bruce.  Similarly, if Bruce decides to stop cooking:

ctx_ddl.remove_mdata('myidx','group','Cooking','ABCD1234')

There is no update_mdata.  If the soccer club changes their name to the football club, you'll have to:

ctx_ddl.remove_mdata('myidx','group','Soccer','ABCD1234');
ctx_ddl.add_mdata('myidx','group','Football','ABCD1234');

Which means that to do an update you'll have to know the existing (old) values.  That shouldn't be too much of a problem, especially if these are called from trigger code.

Under the covers, the add and remove calls insert "delta rows" into $I.  For instance, the add Drama call above results in:

TOKEN  TYPE  FIRST  LAST COUNT  INFO
Drama  -400   1234  1234     1  null

Note the negative token type, which makes it a delta row.  At query time, each mdata operator opens a cursor to read its delta rows, which are then woven into the token info for the normal rows.  The end effect is that the delta rows change the hitlist without having to modify existing $I rows.  Note that the add and remove calls do not commit, which allows them to make transactional changes:

ctx_ddl.add_mdata('myidx','group','Football','ABCD1234');
contains(text, 'mdata(group,Football)')>0   -- pulls up the row ABCD1234
rollback;
contains(text, 'mdata(group,Football)')>0   -- does not pull up the row ABCD1234

The delta rows are designed to be transient -- optimize (full or rebuild) integrates them permanently by rewriting the normal rows.  Because each mdata operator opens a cursor, you probably don't want to have hundreds of mdata operators in your query.
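
For instance, to fold the delta rows back into the regular $I rows:

exec ctx_ddl.optimize_index('myidx', 'FULL');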

One last note on the mdata updates -- if you have a lot of values or a lot of rows, there is a multiple-value/multiple-rowid version:

ctx_ddl.add_mdata('myidx','group',sys.odcivarchar2list('Football','Rugby'),sys.odciridlist('ABCD1234','EFGH5678'));

In each list you can pass several values.  The call adds each value in the list to each rowid in the list.  This should be faster than several add calls.

MDATA Tricks

A companion document for advanced users is available on OTN ( http://otn.oracle.com/products/text/htdocs/mdata_tricks.htm) It's a more technical, in-depth look at some tricks you can do with MDATA -- getting around mdata limitations, handling dynamic token normalization, mixed structured range criteria, and efficient sort by structured column.

BASIC_LEXER Attribute NEW_GERMAN_SPELLING

The basic lexer can automatically handle  some of the differences due to new German spelling

Over the last decade or so, German spelling reform has changed the way German words are spelled.  From an IR perspective, this is as irritating as the Brits misspelling words like "harbour centre" -- it's harder to find occurrences of a word because the computer can't easily tell that the pre-reform spelling of a word (Photo) is the same as the post-reform spelling (Foto).  The NEW_GERMAN_SPELLING attribute of the lexer, when set to TRUE, will detect old spellings and transform them to the new spellings.  For example:

Document: Känguruh daß Potential Schifffahrt
...
exec ctx_ddl.create_preference('ngslx','basic_lexer');
exec ctx_ddl.set_attribute('ngslx','new_german_spelling','true');
...
TOKEN_TEXT in $I:
DAß
DASS
KÄNGURU
KÄNGURUH
POTENTIAL
POTENZIAL
SCHIFFFAHRT

The old spellings are indexed twice -- once in the old form, and once in the new form.  Words already correctly spelled -- Schifffahrt in our example -- are indexed only in new form.  During query, any entered old spellings are transformed to the new spelling:

contains(text, 'Schiffahrt')>0

hits the document.  The old spelling (with 2 f's) is not in our $I.  However, at query time, it's automatically converted to the new spelling (3 f's), which does exist.  This means that you can query in old or new spelling and hit documents which use either the old or new spelling.  Querying for just the new spelling or just the old spelling is not currently possible.

So if old spellings are converted to new spellings at query time, why are the old spellings indexed?  For wildcard.  The basic lexer cannot convert partial old spellings to new spellings, so a query like Potent% must expand on old spellings.  On the other hand, a query for Schiffa% will not hit this document, even though it matches the old spelling of "Schifffahrt", which is in our document.  So it really only works as long as your documents are completely old spelling, or completely new spelling.

This feature does not implement all Rechtschreibreform changes.  Single word to multi-word changes, in particular, are not handled.  For instance, Radfahren should now be written as two words -- Rad fahren, and sequences like 8jährig get hyphenated to 8-jährig, which normally would be lexed as two words.  However, the NGS basic lexer does not currently support this.  It also does not support capitalization changes -- Swimming-pool (old) to Swimming-Pool (new) for instance.  And, it's a little too aggressive on ß changes -- it thinks that spaß should be spelled "spass" -- but these would normally be handled through alternate spelling, which is usually on for German indexing.

There is also possibly inconsistent behavior when combining new German spelling with base letter.  The base letter transform is applied only to the new spelling.
Consider a lexer with new German spelling and base letter processing the word "Känguruh".  We know that this gets indexed as "Känguruh" and "Känguru" because of new German spelling.  But after this only the new spelling gets base lettered, so we index "Känguruh" and "Kanguru". This produces a small hole in the query results.  Compare this to a lexer with only base letter on:

QUERY:       Känguruh  Kanguruh  Känguru  Kanguru
NGS + BL        hit     no hit     hit       hit
BL ONLY         hit       hit     no hit   no hit

With only base letter, we can hit on Känguruh or Kanguruh, but not the new spellings.  With new German spelling on, we can hit on the new spellings, but we lose the ability to hit on "Kanguruh".  At first glance this may not seem so bad, since the native German speaker would not query for "Kanguruh" -- they'd never just drop the umlaut from the a.  However, consider "Varieté".  Under the new German spelling rules, this becomes "Varietee":

QUERY:       Varieté   Variete   Varietee
NGS + BL       hit      no hit      hit
BL ONLY        hit        hit     no hit

Each lexer has a problem: NGS+BL won't hit on "Variete", which the average German may type, since é is not in the German alphabet (in the same way that an American might type "manana" instead of "mañana"), and the BL ONLY won't hit on the new spelling, Varietee.

Alternate spelling coexists a bit better.  A lexer with alternate spelling German and new German spelling indexes "Kaenguruh" as "Kaenguruh" and "Känguru", and can find this document by any of the four queries "Kaenguruh", "Kaenguru", "Känguruh" or "Känguru". 

When alternate spelling German AND base letter AND new German spelling are on, you still suffer from the foreign base letter query problem discussed above.  Other than that, the query semantics are not surprising. 

BASIC_LEXER Attribute OVERRIDE_BASE_LETTER

OVERRIDE_BASE_LETTER allows you to switch the direction of alternate spelling vowel transforms

9.2 allowed users to set BASE_LETTER and ALTERNATE_SPELLING at the same time.  Unfortunately, this produces linguistically undesirable behavior in some cases, due to our normalization method.  Consider three documents:

DOC1: Brüder
DOC2: Bruder
DOC3: Brueder

In Text, each word is sent through the alternate speller first.  The alternate spelling transformer looks for two-letter alternate spellings and transforms them to single letters.  The original form is also saved for wildcard matching, in case there are changes.  So our table would get lexed:

       ORIGINAL      ALT. SPELL
DOC1:  Brüder    ->  Brüder
DOC2:  Bruder    ->  Bruder
DOC3:  Brueder   ->  Brueder Brüder

Now the token is sent through base lettering.  Note that if the alternate speller saved an original form, that form does NOT go through the base lettering.  Only the transformed version goes through.  So our table is:

       ORIGINAL      ALT. SPELL          BASE LETTER
DOC1:  Brüder    ->  Brüder          ->  Bruder
DOC2:  Bruder    ->  Bruder          ->  Bruder
DOC3:  Brueder   ->  Brueder Brüder  ->  Brueder Bruder

When someone searches for "Brüder", it goes through the same process -- becoming a search for "Bruder".  This matches all the documents, because the form "Bruder" was indexed for all.  The problem is that this is incorrect.  "Bruder" and "Brüder" are different words.  Searching for "Brüder" should match "Brüder" and "Brueder", but not "Bruder".  Searching for "Bruder" should match only "Bruder".

To get such semantics, 10g has the OVERRIDE_BASE_LETTER attribute.  This reverses the alternate spelling transform so that instead of changing two letters to one, it changes one letter to two (vowels only -- ß is always transformed to ss).  Let's go through our table again:

       ORIGINAL      ALT. SPELL
DOC1:  Brüder    ->  Brüder Brueder
DOC2:  Bruder    ->  Bruder
DOC3:  Brueder   ->  Brueder

Note that the alternate spelling now hits doc 1 instead of doc 3.  Now the base letter:

       ORIGINAL      ALT. SPELL          BASE LETTER
DOC1:  Brüder    ->  Brüder Brueder  ->  Brüder Brueder
DOC2:  Bruder    ->  Bruder          ->  Bruder
DOC3:  Brueder   ->  Brueder         ->  Brueder

The ü in doc 1 is not changed to u because the saved original form is not sent through the base letterer -- only the transformed version.  Now we get the semantics we want.  At query time, "Bruder" is unchanged by alternate spelling or base lettering, and matches only doc 2.  "Brüder" and "Brueder", on the other hand, get transformed to "Brueder", which is the normalized form of docs 1 and 3. 

So it's a little confusing under the covers, but the short story is that if you're indexing with BASE_LETTER and ALTERNATE_SPELLING on, you're probably better off with OVERRIDE_BASE_LETTER also on.
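
A sketch of such a lexer preference (the preference name is made up):

exec ctx_ddl.create_preference('de_lex','basic_lexer');
exec ctx_ddl.set_attribute('de_lex','alternate_spelling','german');
exec ctx_ddl.set_attribute('de_lex','base_letter','true');
exec ctx_ddl.set_attribute('de_lex','override_base_letter','true');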

JAPANESE_LEXER Attribute DELIMITER

The DELIMITER attribute signals the Japanese lexer to honor common Japanese printjoins

The Japanese lexer attempts to split Japanese text into words.  Normally, it breaks on any non-Japanese characters.  ABCD&EFG, for instance (assuming ABCD and EFG are Japanese characters) gets split automatically at the &, even if DE is a word (you can add D&E to the lexicon to prevent this).  Middle dot and forward slash are often used in Japanese between related words -- similar to the function of hyphen in English compound words.  Setting the DELIMITER attribute to ALL effectively declares these two characters as printjoins.  So ABC/DEF gets lexed as ABC/DEF instead of ABC and DEF.  It's a little confusing, because by turning DELIMITER on, you're actually saying don't delimit, so just think of it as a join setting.
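As a quick sketch (the preference name is hypothetical):

exec ctx_ddl.create_preference('jlex','japanese_lexer');
exec ctx_ddl.set_attribute('jlex','delimiter','all');
...
create index .. parameters ('lexer jlex');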

User-extensible Japanese & Chinese Lexicons

The lexicons used by the Japanese and Chinese lexers are extensible.

The Japanese and Chinese lexers are driven by lexicons -- think of them as big word lists.  In 10g you can extend and override the lexicons with your own words -- neologisms, company-specific terminology, etc.  Here's how to do it:

  1. convert the existing lexicon to text.  You do this by using ctxlc:
       ctxlc -ja -ocs ja16euc > jalex
    This command takes the Japanese lexicon in ?/ctx/data/jalx and converts it to text in the ja16euc character set. The result gets directed to standard out, so we've used UNIX piping to direct it to a file called jalex.  The output character set (ocs) is whatever NLS character set name you want to work in.  For Chinese, use the -zht switch instead of -ja.  Don't worry about the switch being called zht -- the lexicon is not tied to traditional or simplified -- that's all controlled by -ocs.
  2. Edit the lexicon file to add your own terms, delete existing terms, etc.

  3. Compile your wordlist to binary:

        ctxlc -ja -ics ja16euc -n -i jalex
    -ics is the character set of the input file, and -i is the text file to compile to binary. (-n is needed because we're creating new files, rather than modifying old ones). This will generate new lexicon files in the current directory:
        drold.dat droli.dat drolk.dat
  4. Now back up the files in ?/ctx/data/jalx (or ?/ctx/data/zhlx) if you wish, then replace them with the newly-generated files. You will have to append JA or ZHT to the name of each file to conform to the naming convention.  For instance:

        mv drold.dat $ORACLE_HOME/ctx/data/jalx/droldJA.dat
  5. Flush the shared pool to force reload of the new lexicon. This is not technically needed right now, but may be in the future.

Chinese Lexers New Supported Character Sets

The Chinese lexers now support new character sets.

The Chinese lexers (VGRAM and lexical) now support any database character set for Chinese, including the newer ZHS32GB18030 character set.

WORLD_LEXER

The WORLD_LEXER automatically handles multi-language corpora and documents by Unicode codepoint detection.

The new WORLD_LEXER indexes multi-lingual documents and corpora on unicode databases.  It processes a document by looking at the codepoints of the characters, and following an appropriate segmentation strategy:

  • Words made of Chinese, Japanese, and Thai characters get bigram segmented -- they are split into overlapping sequences of 2 characters.  Japanese kana is vgram segmented.
  • Words made of Arabic characters get whitespace segmented, then post-processed with Arabic-specific simple normalization and decompounding:
    • Removal of vowel marks
    • Separation of definite articles, pronoun suffixes, prepositions, conjunctions, and particles from the base word.  For instance, the Arabic translation for "his teacher" is written as a single sequence.  The World lexer separates this into "his" and "teacher"
    • case-folding, verb stemming and gender normalization are not currently handled
  • Everything else (Latin characters, Cyrillic, Greek, etc) gets whitespace segmented

Since the segmentation strategy is determined by codepoint, you do not need to identify the language of the document, like you do with the multi-lexer.  Furthermore, you can index multi-lingual documents -- the world lexer will shift segmentation strategy as needed to deal with the multiple languages.  In keeping with its automatic nature, the world lexer has no attributes to worry about.  To use it, just create a preference:

ctx_ddl.create_preference('mylex','world_lexer');
...
create index .. parameters ('lexer mylex');

On the query side, you don't have to worry about query-time language settings. There's only one lexer, unlike the multi-lexer's collection of multiple lexers.

On the other hand, the world lexer's lack of attributes means that it has no language-specific features.  You cannot do fancy things like German decompounding, Japanese segmentation, or Swedish alternate spelling.  You can't define printjoins, use themes, or do stem-form indexing.  So while the multi-lexer is more work to set up, it is a lot more flexible.  Here's a short comparison between the world lexer and the multi-lexer:


Feature                                         MULTI_LEXER                              WORLD_LEXER
----------------------------------------------  ---------------------------------------  --------------------------------
Definition and setup                            use ctx_ddl API to set up individual     no programming needed
                                                lexers and collect them into a
                                                multi-lexer
Multi-lingual corpora (mono-lingual documents)  Supported                                Supported
Multi-lingual documents                         Not Supported                            Supported
Database charset                                Any                                      Unicode only (AL32UTF8 and UTF8)
Set attributes of each language                 Supported                                Not Supported
Document language ID                            User-identified, per document            Not needed
Query language identification                   User-identified, per query               Not needed
Arabic                                          Keyword search only                      Arabic-specific features
Chinese/Japanese                                VGRAM or Segmentation                    VGRAM only

Japanese Stemming

Japanese Stemming is now supported when using the JAPANESE_LEXER

The JAPANESE setting for stemming in the basic wordlist implements Japanese inflection expansion.  It pretty much works like stemming in other languages -- enter in a form of a word and it finds all inflection forms of that word in the documents.  However, there are some differences:

  • Japanese stemming is supported only for the segmentation lexer
  • Stemming is supported only for verbs and adjectives.  Since this covers Japanese-specific cases like noun-verbs and adjective-verbs (unten suru, shizuka datta), and since Japanese nouns generally don't have plural forms, this should be sufficient in practice.
  • The input to stemming must start with kanji and end with a regular kana ending.  For instance, TANOshikatta, where TANO is a kanji and the rest is written in kana

The Japanese stemmer is homegrown technology (unlike our other stemmers which we license).  If you have any feedback on its behavior, let us know.

Japanese Fuzzy

Japanese Fuzzy is now supported for kana fuzzy matching.

The JAPANESE setting for fuzzy match in the basic wordlist implements Japanese fuzzy.  The input at query time should be a kana sequence.  The fuzzy matcher will expand this to similar kana strings in the document corpora using character similarity clusters.  This should work with both the segmentation and vgram Japanese lexers.  It's probably of most use when finding inconsistent Japanization of imported words.  For instance, BAIOLIN or VAIOLIN for violin.
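For reference, a wordlist preference that turns on both the Japanese stemmer and Japanese fuzzy might be sketched like this (the preference name is hypothetical, and some releases spell the fuzzy setting JAPANESE_VGRAM -- check the documentation for your version):

exec ctx_ddl.create_preference('jwl','basic_wordlist');
exec ctx_ddl.set_attribute('jwl','stemmer','japanese');
exec ctx_ddl.set_attribute('jwl','fuzzy_match','japanese');
...
create index .. parameters ('wordlist jwl');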



Index Creation and Maintenance

Direct Path Inserts During Index Creation

Index creation should be noticeably faster due to the use of direct path inserts.

During an index creation, Text inverts the documents and collects the token lists in memory.  When memory is full, it flushes the memory to the index tables.  In 9i, we did this using normal SQL inserts.  In 10g, we've switched to direct path inserts.  This is much faster, and you can even avoid undo/redo log use if you've created your index tables with nologging.  Currently this happens only during an index creation -- sync still uses SQL for insertion.  This preserves your ability to sync and query at the same time -- if sync were to use direct path loading, the $X index would go invalid for small windows of time, and concurrent queries would fail.
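For example, a storage preference along these lines (the tablespace name is hypothetical) creates $I with NOLOGGING, so the direct path flushes skip redo:

exec ctx_ddl.create_preference('mystore','basic_storage');
exec ctx_ddl.set_attribute('mystore','i_table_clause','tablespace textidx nologging');
...
create index .. parameters ('storage mystore');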

Parallel Local Domain Text Index Creation

Local domain text indexes can now be built in parallel using the PARALLEL keyword in the create index statement

Local domain text indexes in 9i, like b-tree indexes, can be built in parallel using the DBMS_PCLXUTIL helper package.  In 10g you can also use the parallel keyword in the create index statement.  The index gets built with inter-partition parallelism -- the system will work on multiple partitions at one time, with no more than one slave per partition.  You can still use DBMS_PCLXUTIL for intra- and inter-/intra- partition parallelism.
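A sketch of the simple form, for a local index on an already-partitioned base table:

create index myidx on mytab(text)
indextype is ctxsys.context
local
parallel 2;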

Metadata-only Index Rebuild

The METADATA keyword in alter index rebuild changes the index metadata without reindexing.

Normally a change to the index metadata necessitates a rebuild.  For instance, changing the lexer to turn base_letter on makes the existing index data invalid -- the documents have to be relexed or queries will get inconsistent results.  However, some changes don't require a rebuild.  Changing the stemmer setting, switching from single-lexer to multi-lexer, changing the user datastore procedure name (while keeping the same code), etc.  For these situations, 10g has the METADATA keyword, which changes a rebuild operation from a complete index rebuild to a metadata-only index rebuild.

Let's say you want to change from a single-language lexer to the world lexer.  First you create your replacement preference:

exec ctx_ddl.create_preference('rlex','world_lexer');

Then rebuild your index using the REPLACE operation, and include the METADATA keyword:

alter index myidx rebuild parameters ('replace metadata lexer rlex');

This removes the existing lexer and plugs in the replacement lexer, but leaves the index tables alone.

The main caveat is that this operation gives you enough rope to hang yourself.  If you go and do something inconsistent with the existing index data -- like changing the stopword list or turning on prefix indexing -- it won't stop you.  So you should either be certain that your changes are backward compatible with the existing data, or accept possible query inconsistencies.

Another caveat, possibly obvious, is that while you can change the storage preference using the METADATA keyword, it has no effect because the index tables are not touched by this operation.  It would affect a future "alter index .. rebuild;" command, though, since at that time it will read the updated metadata and rebuild the index tables.

The ODCI interface does not allow partitioned indexes to be rebuilt. You can use replace metadata on individual partitions to, for instance, change the sync interval. But you can't alter index rebuild for the overall index -- you'll get the error ORA-14086: a partitioned index may not be rebuilt as a whole. However, we have a PL/SQL procedure which works around ODCI.  Just call it with the index name and the parameters string, like this:

exec ctx_ddl.replace_index_metadata('myidx','replace metadata lexer rlex');

Automatic Sync/Sync on Commit

The SYNC keyword of create index parameters can set up a dbms job or force sync after every commit

To keep their context indexes up-to-date, customers usually have a dbms_job which periodically wakes up and syncs the index.  The new SYNC keyword can do this step for you at create index time:

create index myidx on mytable(mycolumn)
indextype is ctxsys.context
parameters ('sync (every "SYSDATE+5/1440")');

The create index code will automatically create a new dbms_scheduler job that syncs this index every 5 minutes.  The double-quoted interval that follows EVERY is passed directly through to dbms_scheduler -- please refer to the PL/SQL Packages and Types Reference for information on the syntax and examples of use.  You can also set memory and parallel details of the sync call:

create index myidx on mytable(mycolumn)
indextype is ctxsys.context
parameters ('sync (every "SYSDATE+5/1440" MEMORY 15M PARALLEL 2)');

Of course this is mainly a convenience feature -- you could do all this yourself.

But what if every 5 minutes isn't good enough for you?  What if you really really need super up-to-the-minute updates?  Then either use transactional queries or the second option for the sync: COMMIT:

create index myidx on mytable(mycolumn)
indextype is ctxsys.context
parameters ('sync (on commit)');

That's all there is -- with COMMIT you cannot set memory or parallel.  In this mode, the syncs happen automatically when inserts or updates to the base table are committed.   Three caveats:

  1. since usually a transaction has only one or two document changes, you can expect this to increase fragmentation and thus decrease query performance
  2. since only one session can sync at a time this will serialize concurrent commits
  3. the sync is actually done in a two step process. First the transaction commits, then the sync occurs. This means that, in the window of time between the transaction commit and the sync completion, the documents are visible to other users but not searchable.

Sync on commit is not mutually exclusive with transactional query. You can have both of them on if you want. But if you were to choose one, which to choose? The next section, on transaction query, draws a more detailed comparison, but the short answer is that most customers who need up-to-date query should use EVERY sync and transactional query, instead of ON COMMIT sync.

The last mode of SYNC is MANUAL, which maintains current behavior -- that is, don't sync on commit, don't create a scheduler job.
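For completeness, the manual form is spelled the same way as the others:

create index myidx on mytable(mycolumn)
indextype is ctxsys.context
parameters ('sync (manual)');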

Transactional Queries

Transactional queries use per-row evaluation to deliver statement-level consistent contains queries.

Customers sometimes tell us that they need immediate results -- when they insert or update a row, it needs to be searchable right away.  The context index, like most information retrieval products, does not do this by default.  Deletes are immediate, but inserts and updates are asynchronous; the document is unavailable until the next sync.  Practical experience shows that this is usually acceptable.  The requirement for "right away" search is, in most systems, more theoretical than practical -- if the application can live without it, or find some alternative way of fulfilling it, the app generally runs better.  But for those situations where it is unavoidable, context indexes now have the option of being fully transactional.

To use this new feature, specify the transactional keyword during create index:

create index myidx on mytable(mycolumn)
indextype is ctxsys.context
parameters ('transactional');

You can also alter index rebuild .. replace metadata transactional to make an existing index transactional. 

Updates and inserts on a transactional index get logged in dr$pending, like normal indexes, but the rowids also get logged in a second table, dr$unindexed.  During query, the $I produces its hits, then each rowid in dr$unindexed is evaluated, individually, to see if it is a match -- that is, we fetch the document text, filter it, section it, invert it, then run the contains query on that document's data -- somewhat like a highlight operation.  The set of rowids from dr$unindexed that match is combined with the results from the $I scan and the merged set is returned from contains.

Like dr$pending, dr$unindexed is fully transactional -- you can savepoint, rollback, etc. and the effects are immediate.  So you can do stuff like:

select count(*) from table where contains(text, 'someword')>0;    -- 0 hits
insert into table values ('someword');
select count(*) from table where contains(text, 'someword')>0;    -- 1 hit (the one we just entered)
rollback;
select count(*) from table where contains(text, 'someword')>0;    -- 0 hits

So why do we still need sync?  The cost for evaluating a single unindexed rowid can be fairly high -- depends on your index.  Perhaps a user datastore procedure needs to be executed.  Maybe the INSO filter will have to be run.  Even the presence of one unindexed rowid can slow down query response time.  As dr$unindexed gets more and more rows, the queries will get slower and slower.   A sync indexes the documents, adds the data to the $I index, then removes them from dr$unindexed.  Since searching $I is more scalable than evaluating documents one at a time, your queries should be faster after the sync.  So best practice (if you absolutely MUST have instant searching) is to have transactional query and a reasonable periodic sync -- as infrequently as possible to control fragmentation, but frequent enough that dr$unindexed doesn't get hairy.
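That combination can be declared in a single create index call -- a sketch, using an arbitrary 10-minute interval:

create index myidx on mytable(mycolumn)
indextype is ctxsys.context
parameters ('transactional sync (every "SYSDATE+10/1440")');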

If some of your queries need instant searching, but others don't (like count_hits estimates, etc) you can turn off the dr$unindexed scan at the session level:

exec ctx_query.disable_transactional_query := TRUE;

Future contains queries will not scan dr$unindexed.  To turn transactionality back on, set the variable to FALSE.  A bit confusing to set to false to turn something on, but think of it as turning off the override.

How does transactional query compare to sync on commit?

                                                         Transactional                           Sync on Commit
-------------------------------------------------------  --------------------------------------  -----------------------------------------
can query documents immediately after insert or update   YES                                     NO
can query documents immediately after commit             YES                                     YES (small window where documents are
                                                                                                  visible but not queryable through contains)
impacts query speed                                       YES -- slower because dr$unindexed     NO -- not directly, but increased
                                                          must be scanned                        fragmentation can lead to slower queries
increases index fragmentation                             NO                                      YES

Transactional query is slower because of dr$unindexed, true, but the sync on commit, unless dealing with large transactions, will increase fragmentation, which may slow down queries anyway.  And transactional gives you more immediate gratification than sync on commit -- you can query right after the update or insert, instead of having to wait for the commit.  So I think in general transactional is better than sync on commit.  It would be much better than sync for those systems where documents change very frequently when they are young -- for instance, an editorial system where writers are saving drafts and making edits.  Transactional would keep up with the frequent changes, then the companion periodic sync would capture the reasonably "final" version, while sync on commit would have to index every saved version, leading to dramatically increased fragmentation.

However, transactional query and sync on commit are not mutually exclusive. If you want to turn both of them on, you can. Personally, I think sync EVERY with transactional is better, because you get the benefits of transactional consistency without the fragmentation of sync commit, but your application mileage may vary.

Token Type Optimization

The TOKEN_TYPE mode of optimize optimizes all tokens of a specific type

The new TOKEN_TYPE mode does a FULL optimize, but limits it to tokens of a specific token type.  You can think of it as an expansion of the TOKEN mode, which targets a specific (token, token_type) pair.  You invoke it like a TOKEN optimize, but specify TOKEN_TYPE for the mode:

exec ctx_ddl.optimize_index('myidx','TOKEN_TYPE',token_type=>1);

This optimizes all theme tokens (type 1).  For mode you can use the symbol CTX_DDL.OPTLEVEL_TOKEN_TYPE instead of a literal 'TOKEN_TYPE', if you want.  For the token_type parameter, you can use the numeric token type code, if you know it, or ctx_report.token_type if you don't (see example below). Like TOKEN optimize, TOKEN_TYPE runs fully to completion, so don't specify maxtime.  It optimizes all tokens of the given type, so don't specify a token parameter, either.  You can run TOKEN_TYPE optimization in parallel, so you may specify a parallel_degree parameter if you wish.

Behind the scenes, the token_type optimize goes through almost the same code as a FULL optimize -- read the $N table, scan the $I table, rewrite the rows by removing garbage and defragging small rows.  The main difference is that the TOKEN_TYPE mode will simply ignore any row that is not of the specified type.

This mode is probably most useful for keeping field sections and mdata sections up-to-date.  For instance, suppose you have an mdata section named "symbol" that holds stock symbols, on which you want to maintain speedy performance.  Periodically, run something like:

exec ctx_ddl.optimize_index('myidx','TOKEN_TYPE',
       token_type=>ctx_report.token_type('myidx','MDATA symbol'));

This will keep all values of the MDATA section symbol defragged and optimal.

Rebuild Optimization

The REBUILD mode optimizes an index by building a fresh shadow copy of the index, then swapping it with the existing index.

FULL optimize updates the $I table in place.  This can be slow, since we must use SQL inserts and deletes, which are slower than direct path writing.  On top of that, the $X index slows down any DML on $I.  The new REBUILD mode of optimize is designed to get around these problems.  You invoke it like a fast optimize, but specify REBUILD for the mode:

exec ctx_ddl.optimize_index('myidx','REBUILD');

There is currently no symbol CTX_DDL.OPTLEVEL_REBUILD, so you'll just have to write out the literal string REBUILD.  Rebuild optimize optimizes the entire index in one shot, so there is no need to specify maxtime, token, and token_type parameters -- you'll just get a syntax error.  Rebuild optimize, like FULL, can be run in parallel, so you may specify a parallel_degree parameter if you wish.

So what does it do?  Broadly, it creates an empty $I, copies the contents of current $I to the new $I, optimizing each row as it copies, then swaps the new $I and the old $I.  This can be faster than FULL because FULL has to insert rows using SQL, then delete old rows, all the time updating $X.  REBUILD can insert using direct path load, does not have to delete old rows, and does not have to update a b-tree index.

Here's what REBUILD optimize does, specifically:

  1. Create two shadow $I tables.  Shadow 1 is named DR$<idx>MI, while shadow 2 is named DR$<idx>NI.  Both will use the storage clause for $I, so if the initial extent on that storage clause is really high, you could run out of space here.  The MI table will be partitioned, but have only one partition.  It needs to be partitioned so that later we can do a swap with the existing $I.  We use special internal calls so that this works even on standard edition, which does not have partitioning.  Obviously this precludes your use of storage preferences which would partition the $I table.

  2. Create a log table DR$<idx>$L -- this table is used to monitor changes to the existing $I table during the optimization.  It gets no storage clause, so uses the index owner's default tablespace's default storage.

  3. Create a copy of the current $N table, named DR$<idx>MN.  This serves as a snapshot of deleted docids.

  4. Create a monitor trigger, DR$<idx>$T, which logs the rowid of all inserts to the existing $I in the $L table

  5. Copy the existing $I table to the shadow I tables.  This is done in one complex SQL statement -- roughly,

    insert into <shadow I>
    select * from TABLE(optimize_index(CURSOR(select * from $I)))

    Let's break this down. The innermost part is the CURSOR. This selects all the rows from $I and passes them through the optimize_index function.  The optimize function reads the snapshot $N table, then takes the $I rows coming from the cursor and optimizes them -- removing garbage rows, defragging small rows, etc.  It uses essentially the same code as the FULL optimize.  The table function returns optimized $I rows which are then shunted by the insert into statement into the shadow tables.  (Mostly to shadow 1.  Shadow 2 gets those rows with really long token_info, for technical reasons).  The insert into uses the efficient direct path insertion, so the I/O costs of populating the shadow tables are minimal.  (You do not need to have NOLOGGING in your I_TABLE_CLAUSE in your storage preference, but it might help performance if you do, or if it is part of the tablespace default). At the end of this statement, the shadow tables hold an optimized version of the existing $I table data.

  6. Copy all rows from shadow 2 to shadow 1

  7. Create a shadow $X index on shadow 1

  8. Copy any new rows in $I to shadow 1.  While we were doing the optimize, the $I might have changed.  Perhaps there was a concurrent sync, or ctx_ddl.add_mdata.  Our shadow 1 does not have these new rows.  Luckily, their rowids are logged in $L (remember the monitor trigger created in step 4?).  So it's a simple matter to copy the new rows from $I to shadow 1 -- insert into shadow1 select $I.* from $I, $L where $L.logged_rowid = $I.rowid.

  9. Swap shadow 1 and $I, using alter table exchange partition

  10. We've optimized all the docids in the snapshot $N table, but there could have been deletes or updates while we were doing the optimize, so we can't just truncate the $N table.  We need to delete all the docids in our snapshot table from the $N.  Logically, that's what we do here.   Only practically, it is easier to create a shadow $N table (DR$<idx>NN) using create table as select current $N table MINUS snapshot $N table.

  11. Swap the second shadow N table and the $N table

  12. Drop all shadow objects, which have old data in them

The implication of steps 8 and 10 is that after a rebuild optimize, your index may not be totally optimal -- DML that happens during the optimize cannot be optimized.  But this kind of thing is unavoidable, and no interesting index stays pristine for very long, anyway.

So, when do you use FULL vs REBUILD vs TOKEN?  Use FULL and its time-limited operation for continuing maintenance, like a cron job that kicks off every night for 4 hours.  This can be supplemented by a cron that runs some TOKEN or TOKEN_TYPE optimizes for continuing targeted maintenance, keeping especially dynamic field sections, mdata sections, or text tokens optimal.  Use REBUILD for periodic maintenance -- like running it once a month or so, if FULL is unable to keep up with daily activity.
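For instance, the nightly cron job might run something like this (the 240-minute limit is arbitrary):

exec ctx_ddl.optimize_index('myidx','FULL', maxtime=>240);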

Improved Parallel Optimize Mechanism

The internal mechanisms used for parallelization of optimize_index have been improved.

The method of parallelization used for optimize_index in 9i was susceptible to incomplete coverage if insufficient parallel slaves were available.  In 10g we've rewritten the underlying mechanisms so that it adapts to the number of parallel slaves actually spawned, and, in the extreme case, can fall back to serial optimization seamlessly.

 



Query

Internal Optimizations

Query I/O performance has been improved

Text queries now use low-level database calls to do index table reads.  This makes the I/O faster, which makes your text queries faster.  You don't need to do anything -- it's automatic.

MDATA Operator

The MDATA operator searches MDATA sections.

This is a cross-reference.  MDATA is a new operator in our query language that does equality search on tagged document metadata.  Read all about it above.

 

Transactional Queries

Text queries can now have statement-level consistency

This is a cross-reference.  Queries on indexes created with the TRANSACTIONAL keyword have statement-level consistency -- inserts and updates are immediately searchable.  Read all about it above.

Reverse Docid Sorting

The default ordering of text queries is, once again, forward docid.  Reverse docid is available as an option.

In 9.2.0, the natural order of rows returned from a Text index, in the absence of an order by clause, was reverse docid order.  The intent was to provide a rough last-in-first-out sorting, so that generally the most recent documents were returned first.  For various technical reasons, we've reverted this in 10g to forward docid order.  To provide backward compatibility for existing applications relying on reverse docid, we have an internal switch.  The switch modifies the index metadata so that text queries return results in reverse docid order:

exec drvxmd.set_reverse_docid_switch('indexowner','indexname','ON');

This turns reverse docid ordering ON for the index indexowner.indexname.  It simulates 9i reverse docid ordering, except in the cases of progressive relaxation or transactional query -- this switch is intended as a backward-compatibility coping mechanism only, and those are new query models.

If your application relies on reverse docid, you really should think about moving to a different model.  The current recommended solution for date-sorted text query is a local partitioned text index on a date-partitioned base table.  Experimental and customer data show that this is generally the most-performant solution with the current version of the product.  Alternatively, MDATA can be used for conjunctive partitioning -- see the companion document 10g MDATA Tricks on OTN.

COUNT_HITS Result Caching

Certain contains queries remember the total number of documents hit by the text query so that subsequent count_hits can use the cached result

When you do a text search on the web these days, you usually get about a page of "top" results and an estimate of the number of hits.  Personally I think the number of hits is useless and cliched, but people seem to like it.  Since 8.1.5 we've had ctx_query.count_hits to deliver the estimated hit count.  However, even in estimate mode, count_hits re-runs the query, which means you usually end up running the query twice -- once to get the "top" hits and once to get the total hitcount.

10g introduces a hit count cache.  When you do the contains to get the top hits, behind the scenes Oracle Text remembers the total number of hits.  When you then do the count_hits, instead of re-running the query, we just get the cached hitcount.

This technique does not always guarantee correct results, so there are a few restrictions.

  • The order should be contains, then count_hits -- count_hits won't cache results for a later contains
  • The contains must be one that produces all text hits.  For instance, a query like:

    select * 
      from (select /*+FIRST_ROWS(50)*/ * from table 
             where contains(text,'dog',1)>0
             order by score(1) desc)
     where rownum <= 50
    
    actually produces all text hits internally, even though only 50 are returned. A query like:
    select * from table where contains(text, 'dog')>0 and rownum <= 10
    
    does not produce all hits.  Nor does a mixed query that uses function invocation, for example.
  • The query strings must match.  If you do a contains for "dog and ENG within LANGUAGE" and then a count_hits for "dog", it won't use the cached results.

  • The count_hits call must use exact => false

  • There cannot be any sync completions, index rebuilds, partition operations, etc. between issuing the contains and issuing the count_hits.

You don't have to worry about correctness. If you fail to meet a restriction, the count_hits just switches automatically to the old behavior of re-running the query -- you'll always get correct results, not random answers or error messages.
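In PL/SQL, the follow-up call is just the usual estimate-mode count_hits -- a sketch, with placeholder index and query names:

declare
  hits number;
begin
  -- run the top-N contains query first (as in the example above), then:
  hits := ctx_query.count_hits(index_name => 'myidx',
                               text_query => 'dog',
                               exact      => FALSE);
  dbms_output.put_line('estimated hits: '||hits);
end;
/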

Progressive Relaxation

Progressive relaxation mechanically rewrites and re-executes queries to increase recall as needed

Progressive relaxation is a technique that uses query variation to maximize precision while guaranteeing recall.  Think of the user-entered query as simply a sequence of words.  You can put these words together in a variety of ways to form text queries, some more precise than others.  You run the variants sequentially, from most precise to least precise, until you get the number of hits you need.  Let's look at a concrete example.

Let's say your clothing website has a small text field labeled "search".  An end user enters "leather bomber jacket".  In the best case, there is a product with this exact phrase.  So first, you try running it as a phrase: "leather bomber jacket".  If it hits, you stop.  Otherwise, you relax the query.  Maybe the three words appear, but not in that order: "leather and bomber and jacket".  If you still don't have enough hits, you could try for an accum: "leather,bomber,jacket" then perhaps a stem query: $leather and $bomber and $jacket, etc.  The idea is that you keep relaxing the precision until you get the recall you need.

Obviously you can do this yourself with PL/SQL code.  But why do all that work when there's the new progressive relaxation feature?  All you need to do is specify the series of queries you want to run.  Oracle Text handles the mechanics of running each one in order until you get the number of hits you need.  Our example above would be written:
select /*+FIRST_ROWS(10)*/ item_id
  from stuff_for_sale
 where contains(description,
'<query><textquery><progression>
<seq>leather bomber jacket</seq>
<seq>leather and bomber and jacket</seq>
<seq>leather, bomber, jacket</seq>
<seq>$leather and $bomber and $jacket</seq>
</progression></textquery></query>')>0
and rownum <= 10;

There may be overlap between the variants.  For instance, you could have one document that has the phrase "leather bomber jacket".  That's not enough to satisfy our 10-row limit, so we move to the second variant -- "leather and bomber and jacket".  Well, this also matches the document that we already have.  Will we get the document twice?  No. The progressive relaxation code keeps track of produced results and eliminates duplicates.

Notice the FIRST_ROWS and the rownum limit.  This allows Oracle Text to stop after producing the number of rows you need.  If you get 10 hits on the phrase variant, there's no point in running the relaxed versions, right? 

If you need to know which variant hit, you can select the score.  Like accum, progressive relaxation partitions the score into ranges.  Our example has four queries, so each query gets assigned a range of 25 points (100/4). So if a particular document has score 78, this is in the top range (76-100), so we know that it was hit by query 1 (the phrase).

To make things even easier, the variants can be mechanically generated for you.  We can rewrite our example above as:

select /*+FIRST_ROWS(10)*/ item_id
  from stuff_for_sale
 where contains(description,
'<query><textquery> leather bomber jacket
<progression>
<seq><rewrite>transform((TOKENS, "", "", " "))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "", "", " and "))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "", "", " , "))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "$", "", " and "))</rewrite></seq>
</progression></textquery></query>')>0
and rownum <= 10;

Let's break down the first rule:

<rewrite>transform((TOKENS, "", "", " "))</rewrite>

The transform operator takes in four arguments.  For the first argument, you should pass in TOKENS, meaning you want to split the input into constituent tokens.  The next three arguments are concatenation inputs: what you want prepended to each token, what you want appended to each token, and what you want in between.  So we want nothing in front, nothing in back, and a space in between: the phrase "leather bomber jacket".   The second rule puts nothing in front, nothing in back, and AND in between: leather and bomber and jacket.  And so on.

With rewrite, your front-end query application does not need any fancy code -- just prepend the "<query><textquery>" and append the "<progression><seq>...", both of which are static strings, and the Oracle Text query engine does the rewriting.

The syntax is pretty fussy.  You need the double parens for transform, and you actually need the whitespace between the arguments.  If you wrote it like this:

<rewrite>transform((TOKENS,"",""," "))</rewrite>

You'd get a parse error or possibly worse.  So don't do it.

The major limitation is that you can have only the transform inside the rewrite tag. You can't do progressive section expansion like

<rewrite>transform((TOKENS, "", "", " and ")) within title</rewrite>
<rewrite>transform((TOKENS, "", "", " and ")) within head</rewrite>
<rewrite>transform((TOKENS, "", "", " and ")) within body</rewrite>

or add extraneous limiters like:

<rewrite>transform((TOKENS, "", "", " and ")) and is_a_book</rewrite>

Future versions may have this capability, but for now you'll have to do it manually (non-rewrite progression/seq).  Also, highlighting does not support query template, so you'll have to determine which rule hit, transform from rule to text query, if needed, then send the text query to highlighting.  Future versions of the product should be able to support passing the query template directly to ctx_doc.markup.

COUNT Score Algorithm

The COUNT score algorithm is a simple occurrence-based relevance ranking.

The default score algorithm is kind of complex.  Say you search for "dog".  The score you get for a document containing the word depends on the number of times dog appears in the document, but also the number of documents in the collection that contain dog, and the number of indexed documents.  So it's difficult to predict the expected score for a particular document given a particular query.  This can get problematic if you're trying to do something fancy.

The COUNT score algorithm makes most scores predictable.  It counts the number of occurrences and returns that as the score.  So if you query for "dog" and "dog" appears 14 times in the document, your score is 14.  You specify the count score algorithm using the algorithm attribute of the score tag in the query template:

select score(1), id from table where contains(text, 
'<query><textquery>dog</textquery>
<score algorithm="COUNT"/></query>',1)>0;

Count score takes into account position-based limiters.  For instance, if you query for "cat dog" and "cat" appears 18 times, "dog" appears 5 times, and "cat dog" appears 3 times, then your score is 3.  If you query for "cat within BOLD" (assuming BOLD is a zone section) and "cat" appears 20 times, but only 5 times in bold, then your score is 5.

Most operators already have defined numeric behavior -- for instance, AND returns the minimum score of its children -- so the expected score for even a complex query on a given document is predictable.  The only things to watch out for are those operators which redefine the score -- like FUZZY, which can return a score based on spelling nearness; NEAR, whose score depends on the term distance; and ACCUM, which remaps scores to ranges based on the number of distinct children hit.

Is COUNT score better than the default algorithm?  There's no blanket answer.  It really depends on your application and the data.  Since it's pretty painless, why not try it out with some sample queries and compare the precision/recall?  Additionally, there are things which COUNT can do that the default algorithm cannot.  For instance, stay stable over product versions and document corpus changes.  Evaluate queries like "document must have at least 5 occurrences of dog": dog > 4.  And COUNT scores are fairly comparable across partitions.  Even across indexes in some cases.
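For instance, the "at least 5 occurrences of dog" query above might be written like this sketch (table and column names are placeholders):

select id from docs where contains(text,
'<query><textquery>dog > 4</textquery>
<score algorithm="COUNT"/></query>',1)>0;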

Query Language Specification

You can specify the  sub-lexer to use for query lexing without changing the NLS_LANG setting.

8.1.7 introduced the multi-lexer, which routes documents and queries to one of a collection of language-specific lexers.  On the index side each document was identified with a language tag.  On the query side, the nls_language setting determined the sub-lexer to use.  For single sessions that have requests in multiple languages -- like you find in web applications, where the session is shared by several front-end sessions -- the nls_language has to be reset before each request.  In 10g there's an easier way: the LANG attribute in the query template.

To demonstrate, let's set up a simple table with one "english" row and one "german" row:

create table ltest(l varchar2(10), t varchar2(80));
insert into ltest values ('en','foo bar apples');
insert into ltest values ('d', 'foobar äpfel');

Now our multi-lexer.  The only difference between our "english" and "german" lexer is going to be that the german lexer has : defined as a skipjoin:

ctx_ddl.create_preference('elx','basic_lexer');
ctx_ddl.create_preference('dlx','basic_lexer');
ctx_ddl.set_attribute('dlx','skipjoins',':');
ctx_ddl.create_preference('mlx','multi_lexer');
ctx_ddl.add_sub_lexer('mlx','default','elx');
ctx_ddl.add_sub_lexer('mlx','german','dlx','xxz');

We'll create the index using "l" as the language column, "mlx" as the lexer, and a wordlist preference that uses the AUTO stemmer setting -- a sketch of that create index follows (the wordlist preference name is hypothetical):
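exec ctx_ddl.create_preference('mywl','basic_wordlist');
exec ctx_ddl.set_attribute('mywl','stemmer','auto');

create index ltest_idx on ltest(t)
indextype is ctxsys.context
parameters ('lexer mlx language column l wordlist mywl');

With that index in place, when I do: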

select * from ltest where contains(t, 'foo:bar')>0;

I get the english row. My session language is english, which does not have a specific sub-lexer, so the default lexer gets used. That breaks "foo:bar" at the colon, so this becomes a query for the phrase "FOO BAR", which matches only the english row. I could set the session language to german, but let's use the new-fangled way:

select * from ltest where contains(t,
'<query><textquery lang="german">foo:bar</textquery></query>')>0;

the LANG attribute of textquery overrides my session language, and tells the multi-lexer to use the german lexer. This lexer has colon as a skipjoin, so "foo:bar" becomes "FOOBAR". This matches the german row.

The lang attribute has two caveats: first, you must specify a language name or abbreviation. Notice in our multi-lexer setup we defined xxz as an alternate value for german. If you tried to use lang="xxz" in the query above, you would get the english row -- the alternate value is not supported at query time.

Second, the language specification does not affect the stemmer or fuzzy matcher. Remember that we indexed using the AUTO stemmer. If we did:

select * from ltest where contains(t,
'<query><textquery lang="german">$Apfel</textquery></query>')>0;

we would get no results -- the stemmer picks up only the session language, which is english, and uses the english stemmer. Since "Apfel" is not an English word, this probably expands to something like Apfel=Apfels, which hits nothing. However, if I change the session language to german and issue a non-template query $Apfel, the german stemmer gets used, and expands Apfel to, among others, its plural of Äpfel. This hits the german row.



Classification and Clustering

RULE_CLASSIFIER Attribute PRUNE_LEVEL

The PRUNE_LEVEL attribute of the RULE_CLASSIFIER sets the aggressiveness of decision tree pruning.

The RULE_CLASSIFIER examines groups of documents and creates ctxrule rules. The process is actually:

  1. Create a full decision tree. Conceptually, the tree routes documents using a series of token tests. If the token is in the document, then we follow the left branch. If the token is not in the document, then we follow the right branch. For instance, let's use the tree:
         dog
        /   \
       /     \
      cat   frog
     /  \   /  \
    G1  G2 G3  G4
    
    and the document:
     the dog sits on the wall
    
    We start at the root, which is DOG. This document has that token, so we follow the left branch. Next is a test for CAT. This document does not have this token, so we follow the right branch, leading to the result G2, which means that this document belongs in group 2. In reality, the mechanics are a bit different, but this is a valid conceptual framework.
    The decision trees created in this first step are perfect. That is, every document that belongs in the group gets routed to the group by the tree. Every document that does not belong to the group is excluded by the tree.
  2. The tree is pruned. Perfect trees are problematic for two reasons. First, they result in large rules which would decrease performance of routing. Second, they are too sensitive. They may exclude new documents which really should be part of the group. To avoid these problems, some leaf nodes of the tree are removed. This results in a more relaxed tree which empirically produces better routing.

The pruning in step 2 can now be influenced with the PRUNE_LEVEL attribute of the RULE_CLASSIFIER. You set PRUNE_LEVEL to an abstract measure of aggressiveness, from 0 to 100. The higher the PRUNE_LEVEL, the more aggressive the pruning. The resulting rules will be less strict, meaning that the group gets fewer false negatives and more false positives than rules produced by a less aggressive prune setting.

Like most things in the classification/clustering space, you probably need to play with it to really get a feel for the setting that's best for your situation; there's no hard-and-fast, best-for-everyone rule of thumb.
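Setting it is the usual attribute call -- for example, a more aggressive prune (the preference name is hypothetical):

exec ctx_ddl.create_preference('my_rule_cls','rule_classifier');
exec ctx_ddl.set_attribute('my_rule_cls','prune_level',90);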

SVM_CLASSIFIER

The SVM_CLASSIFIER implements support vector machine classification.

The SVM_CLASSIFIER is a new object for classifier preferences which implements support vector machine classification. The existing classifier, RULE_CLASSIFIER, builds decision trees and populates a rule table with inspectable queries. SVM_CLASSIFIER instead implements a complex statistical model with rules that are not human-readable. However, the method for using SVM_CLASSIFIER is nearly the same as the RULE_CLASSIFIER (see the 9.2.0 Technical Overview for an example of using the RULE_CLASSIFIER).

To use the SVM_CLASSIFIER, first create a classifier preference:

exec ctx_ddl.create_preference('mysvm','svm_classifier');

You can also set attributes like any other preference -- see the documentation for the full list of attributes and their meanings. Once you have the SVM classifier preference, simply call the new generic train API:

PROCEDURE train (
  index_name in varchar2, -- the name of the index holding the documents
  docid      in varchar2, -- the name of the docid column
  cattab     in varchar2, -- the name of the category table
  catdocid   in varchar2, -- the name of the docid column in the category table
  catid      in varchar2, -- the name of the category id column in the category table
  restab     in varchar2, -- the name of the result table
  preference in varchar2  -- the classifier preference
);

exec ctx_cls.train('myindex','did','cat_tab','did','cid','svmtab','mysvm');

Like RULE classifier, SVM classifier populates a table of rules.  However, the result table structure for the SVM classifier is slightly different than for the RULE classifier:

create table svmtab (
  cat_id number,
  type   number,
  rule   blob
);

Although the rule is a non-inspectable BLOB, you can still create a CTXRULE index on the output for routing documents.  However, you must provide the classifier preference during ctxrule index creation:

create index svmx on svmtab(rule)
indextype is ctxsys.ctxrule
parameters ('classifier mysvm');

The theory behind SVM is beyond the scope of this document. See technet for further information if you are curious. Broadly, it provides higher-accuracy classification, although the rules are non-inspectable and non-modifiable, so they cannot be corrected by hand if they turn out to be inaccurate. Like most things in classification and clustering, you really need to try it out to see what works best for your situation.

Clustering

Clustering organizes a set of documents into a hierarchy of groups of similar documents.

Clustering can be seen as the "next step up" in a progression of automatic organization feature sets. 9i introduced the ctxrule index type. You wrote a set of queries, and incoming documents were routed to the queries which matched. If you didn't know how to write the best queries for ctxrule, version 9iR2 introduced classification. Given groups of documents, classification determines which words or themes best characterize each group, and creates queries that can be fed into ctxrule for routing of future documents. Now, if you don't even know how to organize your corpus into groups to feed into classification, version 10g introduces clustering to do it for you. Given a set of documents, clustering organizes them into a hierarchy of groups of similar documents. This can then be fed into classification, and that fed to ctxrule, so you can go from unorganized corpus to routing system with minimal human work.

The clustering process is the unsupervised classification of patterns into groups (clusters). A cluster is simply defined such that any document is "more similar" to other documents in its cluster than it is to documents not in its cluster.

"More similar" basically means "sharing more attributes". As with training, attributes may consist of simple words (or tokens), word stems, and themes where supported. We can distinguish two main clustering techniques: hierarchical and partitional. The first one produces a nested series of partitions and the second produces only one. For hierarchical clustering Oracle uses a top-down partition approach and k-means for partitional clustering. Next we'll take a look at how to use k-means clustering

KMEAN Clustering

How to use the KMEAN clustering method

k-means is an example of a partitional clustering algorithm: it produces a single partition of the data set (sometimes called flat output). k-means is a trade-off between quality and scalability.

The k-means algorithm works in "feature space", where each dimension is represented by one of the features (such as a word or theme) from the document set. It assigns k centers, or "prototypes" -- one for each cluster. It then calculates the distance in feature space from each prototype to each document, and assigns each document to the cluster whose prototype is closest. It then recalculates each cluster's prototype as the mean of the documents assigned to it, and repeats the process.

 while not done {
    for each document d {
        find cluster c whose centroid is most similar to d
        assign d to cluster c
    }
    for each cluster c {
        compute the centroid of c using its documents
    }
 }

Now let's take a look at how to use the clustering features in Oracle Text:

Start with a table of documents. This is just like setting up for a context index -- the documents don't have to be stored in the document table -- you can use file datastore, user datastore, etc. if you want. The only restriction is that the document table should have a numeric unique identifier column. Our example is fairly simple:

create table news (pk number primary key, title varchar2(2000));
insert into news values(1,'NASA Announces major Mars rover finding');
insert into news values(2,'Mars rover maneuvers for rim shot');
insert into news values(3,'NASA Mars Odyssey THEMIS image: typical crater');
insert into news values(4,'Mars rovers ready to cruise');

Next, create a context index on the document table, using appropriate preferences. The preferences affect the tokenization of the documents, and thus the potential shared features upon which the clusters are based. For instance, turning themes ON in the lexer may result in clusters that are based on a theme. Having field or mdata sections in the section group can lead to clusters based on tagged metadata. In our example, we'll just go with the defaults:

create index news_idx on news(title)
indextype is ctxsys.context;

The index does not have to be populated; if you are not interested in searching the corpus, create the index NOPOPULATE to save time and disk space.
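A sketch of that variant (you would use it in place of the create index above):

create index news_idx on news(title)
indextype is ctxsys.context
parameters ('nopopulate');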

Next, we create the result tables. Clustering has two result tables -- the document assignment table, which tells you which documents belong to which clusters, and the cluster description table, which holds the cluster hierarchy. Both can be named anything you want, but they must have the following structure:

create table clu_restab (
        docid      number,    -- the document id
        clusterid  number,    -- the cluster id
        score      number     -- the score of the document in the cluster
);
create table clu_clusters (
        clusterid       number,          -- the cluster id
        descript        varchar2(4000),  -- a description of the cluster
        label           varchar2(200),   -- a suggested cluster label
        sze             number,          -- number of documents in the cluster
        quality_score   number,          -- score of the cluster, higher is better
        parent          number           -- the id of the parent cluster.  Negative means no parent.
);

Now we're ready to cluster.  The procedure for clustering is ctx_cls.clustering:

PROCEDURE clustering (
  index_name   in varchar2,              -- the name of the text index
  docid        in varchar2,              -- the name of the document id column in the collection table
  doctab_name  in varchar2,              -- the name of the document assignment table for clusters
  clstab_name  in varchar2,              -- the name of the cluster description table
  pref_name    in varchar2 default NULL  -- the name of the cluster preference, if any
);

The pref_name argument can take a clustering preference. The clustering preference has all sorts of knobs which you can use to attempt to influence the clustering output:

Attribute       Datatype  Default  Min  Max     Description
--------------  --------  -------  ---  ------  ------------------------------------------------------------------
MAX_DOCTERMS    integer   50       10   8192    the maximum number of distinct terms representing one document
MAX_FEATURES    integer   3000     1    500000  the maximum number of overall distinct features
THEME_ON        boolean   FALSE                 Consider themes as features? (INDEX_THEMES must be on in lexer)
TOKEN_ON        boolean   TRUE                  Consider tokens as features? (INDEX_TOKENS must be on in lexer)
STEM_ON         boolean   FALSE                 Consider stem tokens as features? (INDEX_STEM must be on in lexer)
MEMORY_SIZE     integer   500      10   4000    Approximate memory size for clustering, in MB
SECTION_WEIGHT  integer   2        0    100     the occurrence multiplier for adding a term in a field section as a
                                                normal term.  For example, by default, the term cat in "<A>cat</A>"
                                                is a field section term and is treated as a normal term with
                                                occurrence equal to 2, but you can specify that it be treated as a
                                                normal term with a weight up to 100.  SECTION_WEIGHT is only
                                                meaningful when the index policy specifies a field section.
CLUSTER_NUM     integer   200      2    20000   the maximum number of clusters to be generated

We'll set a maximum of 2 clusters in our example to illustrate:

exec ctx_ddl.create_preference('ycluster','kmean_clustering');
exec ctx_ddl.set_attribute('ycluster','token_on','yes');
exec ctx_ddl.set_attribute('ycluster','cluster_num',2);

Now we call the procedure:

exec ctx_cls.clustering('news_idx','pk','clu_restab','clu_clusters','ycluster');

In 10gR1, the cluster description table may contain some internal nodes, which are not useful and have zero size. For clarity, we can simply delete them:

delete from clu_clusters where sze=0;
commit;

Now let's take a look at the output, starting with the clusters:

select clusterid cid, label, descript from clu_clusters;

CID LABEL        DESCRIPT
--- ------------ ------------------------------------------------------------
  2 CRUISE       CRUISE,ROVERS,MARS,READY
  3 ANNOUNCES    ANNOUNCES,MANEUVERS,THEMIS,RIM,NASA,ROVER,MARS,SHOT,MAJOR,FI
                 NDING,IMAGE,TYPICAL,ODYSSEY,CRATER

Cluster labeling is a difficult problem. Oracle Text picks the most important word from the description column as the label for a cluster. The description column provides more information about the content of the cluster.

Next the documents assigned to each cluster (a document can be in more than one cluster):

select clusterid, score, title
  from clu_restab, news
 where clu_restab.docid = news.pk
 order by clusterid, score desc;

The following script shows both cluster and document information:

set serverout on size 1000000
begin
  for csr1 in (select clusterid, label from clu_clusters) loop
     dbms_output.put_line('Cluster ID : '||csr1.clusterid||', Label: '||csr1.label);
     for csr2 in (select n.title, c.score
                  from news n, clu_restab c
                  where n.pk = c.docid
                  and c.clusterid = csr1.clusterid
                  and c.score > 0.01
                  order by score desc) loop
        dbms_output.put_line('Clust:'||csr1.clusterid||' Score:'||csr2.score||': '||csr2.title);
     end loop;
  end loop;
end;
/

TEXTK Clustering

TEXTK clustering is an experimental clustering algorithm.

TEXTK is an UNDOCUMENTED, second clustering method available for experimental use. If you're curious, you can try both KMEAN_CLUSTERING and TEXTK_CLUSTERING objects and determine which is better for your needs. The interface for using TEXTK_CLUSTERING is the same as the above section, but, during clustering preference creation, specify the TEXTK_CLUSTERING object instead:

exec ctx_ddl.create_preference('ycluster','textk_clustering');

The textk clustering has the same attributes as kmean clustering, plus two additional attributes:

Attribute        Datatype  Default  Min  Max  Description
---------------  --------  -------  ---  ---  --------------------------------------------------
MIN_SIMILARITY   integer   1        1    20   the minimum similarity score for each leaf cluster
HIERARCHY_DEPTH  integer   1        1    20   the maximum depth of the cluster hierarchy

 

 



Miscellaneous Other Features

Policy-Based CTX_DOC Services

The policy-based services allow text processing of ad hoc data, without a document table or text index.

The doc services (ctx_doc.filter, ctx_doc.tokens, ctx_doc.themes, etc.) take in an index name and a document identifier (primary key or rowid).  That limits their operation to indexed data, by which we mean that there has to be a record in a table, and that table has to have a text index on it.  This restriction is too limiting for some customers, who just want to do text processing on ad hoc data.  9.0.1 introduced the ctx_doc.ifilter command, which filters ad hoc binary data to text.  While this served the immediate need, it was limited to just filtering, and only INSO filtering.  10g introduces policy-based document services, which generalize the ifilter model to all document services.

The policy services are similar to the in-memory version of the regular services.  Other than the name of the function, the only difference is that the second argument is the data to be processed, rather than a document identifier.  Let's compare two filter calls which do the exact same thing:

declare
  output clob;
begin
  ctx_doc.filter('myindex','1',output);
end;

declare
  output clob;
  input blob;
begin
  select data into input from mytable where id = 1;
  ctx_doc.policy_filter('myindex',input,output);
end;

In both cases we are filtering "data" from row 1, using whatever filter is defined for index "myindex", and placing the result in local CLOB variable "output".  The only difference is that in the first block of code we pass in the document id, and the document content is read from the table, while in the second block we pass in the document content directly.  This particular example is not too interesting, but the point is that you could get the data from anywhere -- it could come from another table, be passed in from a webpage or uploaded document, or be generated by some other procedure.  You no longer need to actually store data in a table in order to do text processing.

Some miscellaneous notes on these:

  • Policy variants are available for all doc services: themes, tokens, gist, filter, highlight, and markup.
  • As the example shows, even though these are called policy-based services, the first argument can be either an index name or a policy name. When using an index, the index's datastore is ignored -- you are passing in the data.
  • The data passed in can be a VARCHAR2, CLOB, BLOB, or BFILE.
  • There are no result-table versions of the policy services -- only in-memory results.
  • In addition to the document data itself, you can also pass in values for "language", "format", and "charset", in case your policy or index uses the multi-lexer, charset filter, etc.  Just specify it using the appropriate parameter name, as in:

      ctx_doc.policy_filter('myindex',input,output,language=>'French');

Since policy_filter replicates ctx_doc.ifilter, ifilter is now deprecated and should not be used in new code.
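
For example (a hedged sketch -- the policy name, the choice of CTXSYS.INSO_FILTER, and the "uploaded_docs" staging table are illustrative assumptions, not from the product documentation), you can create a standalone policy and filter ad hoc binary data without any index at all:

exec ctx_ddl.create_policy('my_filter_policy', filter => 'ctxsys.inso_filter');

declare
  input  blob;
  output clob;
begin
  -- "uploaded_docs" is a hypothetical staging table holding raw documents
  select content into input from uploaded_docs where id = 42;
  ctx_doc.policy_filter('my_filter_policy', input, output);
  dbms_output.put_line('plain text length: '||dbms_lob.getlength(output));
end;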

INPATH and HASPATH Highlighting

INPATH and HASPATH are now supported in highlighting

Nuff said.
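
For instance (a hedged sketch -- it assumes "myindex" was built with a path section group, so INPATH queries are valid, and that document '1' is XML), the in-memory highlight call now accepts path operators:

declare
  htab ctx_doc.highlight_tab;
begin
  ctx_doc.highlight('myindex', '1', 'oracle INPATH (//title)', htab);
  for i in 1..htab.count loop
    dbms_output.put_line('offset '||htab(i).offset||', length '||htab(i).length);
  end loop;
end;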

CTX_REPORT XML Output

CTX_REPORT calls can produce XML-formatted output.

The reports produced by CTX_REPORT are formatted text by default.  We've now added a new, optional last parameter (report_format) to each call that takes a format selector.  Currently you can ask for 'TEXT' or 'XML'.  When you set the output format to XML, you get the report formatted in XML style -- suitable for passing through XSLT for presentation, or through XMLType extraction if you just want a particular piece of data.  Let's check out an example:

select ctx_report.index_size('myidx',null,'XML') from dual;

<CTXREPORT>
  <INDEX_SIZE>
    <SIZE_INDEX_NAME>
      USERA.MYIDX
    </SIZE_INDEX_NAME>
    <SIZE_OBJECT>
      <SIZE_OBJECT_NAME>
        USERA.DR$MYIDX$I
      </SIZE_OBJECT_NAME>
      <SIZE_OBJECT_TYPE>
        TABLE
      </SIZE_OBJECT_TYPE>
...

I've added indentation -- you don't get any in the real output.  It's all the same to a machine, which is probably the consumer of these things.  Humans can continue to use the formatted text versions.
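
For example (a hedged sketch -- the XPath is extrapolated from the sample output above, so verify it against your own report before relying on it), a machine consumer could pull a single value out of the report with the standard XMLType functions:

select extractValue(xmltype(ctx_report.index_size('myidx', null, 'XML')),
                    '/CTXREPORT/INDEX_SIZE/SIZE_OBJECT[1]/SIZE_OBJECT_NAME') as i_table
  from dual;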

Unfortunately we don't have formal DTDs or schemas for the output -- if you intend to use the XML output, the best thing to do is run it, check out the output, then extrapolate your XPaths (or whatever it is you are doing) from that.

CTX_OUTPUT Events Parallel Propagation

Active CTX_OUTPUT events are now propagated to parallel slaves.

9iR2 introduced logfile events -- you set events and the ctx_output logfile gets more information, like the token currently being optimized, the rowid of the row currently being indexed, etc.  This is mostly useful when debugging problems.  Unfortunately, these events were not propagated during parallel operations -- parallel sync, parallel optimize, etc.  So you would only get basic information in the logfile of each parallel slave.
In 10g the active events are propagated to the parallel slaves, so if you have the print rowid event on when you execute a parallel sync, you'll see the rowids printed in each parallel slave.
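
For example (a hedged sketch -- it assumes an index "myidx" and a version of sync_index that accepts the parallel_degree argument), turn on the rowid event before a parallel sync and the rowids will show up in each slave's logfile:

exec ctx_output.start_log('sync_log');
exec ctx_output.add_event(ctx_output.event_index_print_rowid);
exec ctx_ddl.sync_index('myidx', parallel_degree => 2);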

CTX_OUTPUT.EVENT_INDEX_PRINT_TOKEN

The CTX_OUTPUT.EVENT_INDEX_PRINT_TOKEN event prints out tokens during the dump phase of indexing

The new EVENT_INDEX_PRINT_TOKEN ctx_output event will cause create index and sync to print out each token as it is being written to $I.  For instance:

exec ctx_output.start_log('idx');
exec ctx_output.add_event(ctx_output.event_index_print_token);
exec ctx_ddl.sync_index('myidx');

Then in the log you will see something like:

17:02:59 01/15/04 Writing index data ($I) to database.
17:02:59 01/15/04 WRITING TOKEN AACHEN, 0
17:02:59 01/15/04 WRITING TOKEN AARDVARK, 0
17:02:59 01/15/04 WRITING TOKEN ABACUS, 0

This is very useful for targeted optimization.  With careful logging and a bit of logfile parsing code, you can determine which tokens are sub-optimal because of the sync, and do token-mode optimize on just those tokens.  Or, you could monitor vocabulary in new documents and detect new, previously-unknown tokens -- if you have mdata or field sections, you could use this technique to send alerts when new authors or stock symbols pop up, for example.
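
For example (a hedged sketch -- "AARDVARK" is just a stand-in for a token your log parsing has flagged as fragmented), a targeted token-mode optimize looks like this:

exec ctx_ddl.optimize_index('myidx', 'TOKEN', token => 'AARDVARK');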

Tracing

The tracing facility allows low-level high-resolution timing of events for advanced diagnosis.

The new 10g tracing facility profiles internal Oracle Text operations.  We believe that this is mostly going to be used under support direction when diagnosing performance problems.  The idea is that you add a trace in your session, which activates timing or counting code that we've added around some of our internal operations.  After the overall operation (create index, select) is complete, you can look at these results to see, for instance, where time is being spent during a slow query.
You add a trace using:

exec ctx_output.add_trace(CTX_OUTPUT.TRACE_QRY_I_ROWS);
exec ctx_output.add_trace(CTX_OUTPUT.TRACE_QRY_R_TIME);

Here we are telling the engine to count the number of $I rows that are read, and time the reads from $R.  Note that logging does not have to be on to add traces.  Now we do a query:

select * from mytable where contains(text, 'apple')>0;

During query processing, the traces are counting and timing.  After the query, we can see the values of the traces by looking at the view ctx_trace_values:

select * from ctx_trace_values;

TRC_ID TRC_VALUE
------ ---------
     8         1
    10        63

This tells us that 1 $I row was read (TRACE_QRY_I_ROWS = 8) and that docid-to-rowid resolution took 63 microseconds.  In addition to getting the trace values from the view, we can get them programmatically:

exec :n := ctx_output.get_trace_value(CTX_OUTPUT.TRACE_QRY_R_TIME);

or we can dump the values to the logfile:

exec ctx_output.log_traces;

When we run the next query, the traces will simply accumulate.  For instance, if we ran the same query again, and queried ctx_trace_values, we would see 2 for trace 8 and something like 128 for trace 10.  This way, you can get an aggregate time over a set of operations.  Alternatively, you can zero out the traces as desired:

exec ctx_output.reset_trace(8);
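
To stop collecting a trace entirely, rather than just zeroing its value, remove it with the companion call to add_trace (shown here for the $I row counter):

exec ctx_output.remove_trace(CTX_OUTPUT.TRACE_QRY_I_ROWS);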

Here are the traces available in this version:

Trace                      ID  Meaning
TRACE_IDX_USER_DATASTORE    1  Time to execute the user datastore procedure
TRACE_IDX_INSO_FILTER       2  Time to execute the INSO filter
TRACE_QRY_XX_TIME           3  Time to execute the select on $X
TRACE_QRY_XF_TIME           4  Time to fetch from $X
TRACE_QRY_X_ROWS            5  Number of rows fetched from $X
TRACE_QRY_IF_TIME           6  Time to fetch from $I
TRACE_QRY_IR_TIME           7  Time to read oversized $I LOBs
TRACE_QRY_I_ROWS            8  Number of $I rows read
TRACE_QRY_I_SIZE            9  Total size of $I rows read
TRACE_QRY_R_TIME           10  Time to read the $R table
TRACE_QRY_CON_TIME         11  Overall contains time
TRACE_QRY_K_TIME           12  Time to read from $K (function invocation)


One caveat: like logging, tracing is not currently supported on MTS.

Query Log Analysis

The query log analysis package allows you to see what your users are looking for.

Many search sites track the queries that users enter -- you can see what people commonly search for, tune your optimization, etc.  10g introduces a product-level feature for tracking queries and analyzing the logfiles.  It all starts with logging the queries.  In sessions that run text queries, start a query log:

exec ctx_output.start_query_log('qlog');

This works like the existing ctx_output.start_log, so we're creating a logfile named "qlog" in the log directory.  The difference is that only queries will be logged, in a special output format.  Query logging and regular logging are mutually exclusive -- you can do normal logging or query logging, but not both at the same time.  Like normal logging, query logging is not currently supported on MTS.  Once the log is started, subsequent select .. where contains statements (only contains -- not count_hits) will add entries to the logfile.  If you take a look, the entries look like little XML fragments:

<QuerySet><TimeStamp>14:27:25 01/15/04 </TimeStamp><IndexName>MYINDEX</IndexName>
<Query>apple</Query><ReturnHit>Yes</ReturnHit><QueryTag></QueryTag></QuerySet>
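
When you have collected enough queries in a session, stop query logging with the companion call to start_query_log:

exec ctx_output.end_query_log;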

So let's say you set this up, and now your logfile has several hundred queries in it.  Now you want to analyze the logfile.  ctx_report.query_log_summary reads a logfile, compiles the top n queries, and returns the results in a PL/SQL table -- each item in the table tells you the query string (query) and the number of times it was run (times).  By default, it tells you the 50 most frequently-run queries that hit documents:

declare
  otab ctx_report.query_table;
begin
  ctx_report.query_log_summary('qlog','myindex',otab);
  for i in 1..otab.count loop
    dbms_output.put_line(otab(i).query || ' run '||otab(i).times||' times');
  end loop;
end;

This must be run as CTXSYS, or you'll get an ORA-942 error.  This might produce output like:

apple run 10 times
aardvark run 5 times
...

You can change the size of the output buffer using argument row_num.  For instance, to get the top 100 results:

ctx_report.query_log_summary('qlog','myindex',otab,row_num=>100);

There are also two boolean parameters which affect the output.  most_freq selects which queries you get in the report.  If TRUE, the default, you get the most frequently executed <row_num> queries.  If FALSE, you get the least frequently executed <row_num> queries instead.  has_hit selects whether you get queries that had hits, or queries that had no hits.  Using has_hit => FALSE lets you see which queries had 0 results.  If your users are frequently querying for "raincoats" and finding nothing, either your store doesn't carry raincoats (and maybe you should start) or you need to massage the query to get more recall -- stemming or thesaurus, perhaps.
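
For example (a hedged sketch, run as CTXSYS like the earlier block), this reports the 20 most frequently executed queries that returned no hits:

declare
  otab ctx_report.query_table;
begin
  ctx_report.query_log_summary('qlog', 'myindex', otab,
                               row_num => 20, most_freq => TRUE, has_hit => FALSE);
  for i in 1..otab.count loop
    dbms_output.put_line(otab(i).query||' ('||otab(i).times||' times, no hits)');
  end loop;
end;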

COPY_POLICY

CTX_DDL.COPY_POLICY creates a clone of an existing policy.

The new copy policy call clones a policy -- a new policy is created with the same attribute settings as an existing policy.  Syntax:

ctx_ddl.copy_policy('existing_policy','new_policy');

This was implemented for internal customers, but it could be useful when prototyping -- say you want to determine the effect of an index metadata change.  Create a copy of the existing index (you can pass either an index name or a policy name as "existing_policy"), make your change to the clone, and test it, perhaps using the new policy-based document services.

The invoking user must be the owner of both policies, or have the CREATE ANY INDEX database privilege.
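
A hedged sketch of that prototyping workflow (the clone name, the "my_new_lexer" preference, and the sample text are illustrative assumptions): clone the index metadata, change the lexer on the clone, and inspect the effect with the policy-based document services, all without touching the real index:

exec ctx_ddl.copy_policy('myindex', 'my_test_policy');
exec ctx_ddl.update_policy('my_test_policy', lexer => 'my_new_lexer');

declare
  tokens ctx_doc.token_tab;
begin
  ctx_doc.policy_tokens('my_test_policy', 'Some sample text to tokenize', tokens);
  for i in 1..tokens.count loop
    dbms_output.put_line(tokens(i).token);
  end loop;
end;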

Stretch Viewer Visualization

The Stretch Viewer applet creates a visual representation of a hierarchy.

The StretchViewer visualization applet takes a tree structure described in a particular format and displays it in the browser. The StretchViewer jar file is located in $ORACLE_HOME/ctx/sample/stretch. Create a directory "classes_g" in your web server document directory (for example, "htdocs" for Apache) and copy the jar file into it.

Data format

  1. Every node must have an absolute path name. In the example below, Categories is the root node; child nodes are separated by '/'.
  2. textcolor -- an RGB value such as #336699 can be used instead of the named colors blue, red, magenta, and orange.
  3. url -- the URL associated with a leaf node.

Here is an example file, test.txt, that defines the tree structure:

Categories textcolor=red
Categories/External websites textcolor=blue
Categories/External websites/Oracle   url=http://www.oracle.com textcolor=black
Categories/External websites/OTN url=http://otn.oracle.com textcolor=black
Categories/Internal websites textcolor=blue
Categories/Internal websites/Benri url=http://benri.us.oracle.com textcolor=black
Categories/Internal websites/Aria url=http://aria.us.oracle.com textcolor=black

You invoke the applet using the following HTML snippet. Make sure the file path and machine names are correct:

<object classid="clsid:8AD9C840-044E-11D1-B3E9-00805F499D93"
        width="100%" height="500" align="baseline"
        codebase="http://bit.ly/2gcgDe">
 <param name="code" value="oracle.apps.iam.am.stretchviewer.StretchViewer.class"/>
 <param name="ImageDirectory" value="images/"/>
 <param name="codebase" value="http://yourmachinename/classes_g"/>
 <param name="DATA" value="http://yourmachinename/test.txt"/>
 <param name="DATALENGTH" value="66308"/>
 <param name="MODE" value="horizontal"/>
 <param name="BACKGROUND" value="white"/>
 <param name="STICKCOLOR" value="lightGray"/>
 <param name="AUTOEXPAND" value="true"/>
 <param name="CONTRACTABLE" value="true"/>
 <param name="TARGET" value="_top"/>
 </object>

The object classid and codebase depend on your current Java plug-in. We recommend using Sun's plug-in, which works in most browsers, including IE.

The parameter codebase specifies the location of the StretchViewer jar file in your web server. The rest of the parameters allow you to change the look and feel of the applet.

In-Memory Replay Banner