Oracle interMedia Text 8.1.5 - Technical Overview

Oracle8i interMedia Text 8.1.5 - Technical Overview

Garrett Kaminaga, Product Development, Database and Application Server Technologies, Oracle HQ

interMedia integrates all of the features and functions of the former Oracle ConText product with image, audio and video, and geographic location services for Web Content Management applications built with the Oracle8i Internet Platform.

interMedia Text has been completely re-architected, enhanced and tightly integrated with Oracle8i to provide up to an order of magnitude better search performance, greater ease of use and new capabilites like thematic searches. Please refer to the documentation for detailed information.

Table of Contents

  1. Introduction to interMedia text
  2. Installation and Setup
  3. Indexing
  4. The Preference System
  5. Datastores
  6. Filters
  7. Section Groups
  8. Lexers
  9. Other Indexing Classes
  10. DML
  11. Optimization
  12. Querying
  13. Query Operators (Part 1)
  14. Query Operators (Part 2)
  15. Thesaurus
  16. Query Tuning
  17. Query Feedback
  18. Highlighting
  19. Linguistic Extraction

Introduction to interMedia Text

What does it do?

interMedia Text extends Oracle8i by indexing any text or documents stored in Oracle8i, in operating system flat files or URLs. It also enables content-based queries, (e.g., find text and documents which contain this word) using familiar, standard SQL. It enables file-based text applications to use Oracle8i to manage text and documents in an integrated fashion with traditional relational information.

The easiest way to grasp what it does is to see it in action...

How do you use it?

Here is a quick example. We'll start by creating a table to hold our documents. For now, we'll use a small varchar2 column to hold the document text. We'll also need a primary key to identify each document.

    create table docs (id number primary key, text varchar2(80));
Nothing special about that. Let's populate it with two example documents:
    insert into docs values (1, 'first document');
    insert into docs values (2, 'second document');
    commit;
Now comes the fun part -- we build a Text index:
    create index doc_index on docs(text)
    indextype is ctxsys.context;

This is, for the most part, familiar DDL -- the difference is the indextype clause, which instructs Oracle to build a Text index instead of a regular, b-tree index.

Once the index has been created, we can run content-based queries on our huge database of two documents using the contains function:

    select id from docs
     where contains(text, 'first') > 0;

This will find all rows in docs where the text column contains the word "first", (which is document 1, for those not following closely). The > 0 part is necessary to make it legal Oracle SQL, which does not support boolean return values for functions (at this time).

Integration Architecture

interMedia Text is more tightly integrated with Oracle8i. For those of you familiar with previous versions of ConText, notice that-- in the previous example, no ctxsrv was required. The result is a specialized "domain index" which works like any b-tree index. It's all transparent to the user -- Oracle8i is simply extended to support text. Unlike previous versions of Oracle ConText, there are no servers to start up, there is no query rewrite, and index creation is done through familiar SQL rather than through a custom PL/SQL interface.

Installation and Setup

Migrating from Previous Versions

Due to significant architectural changes, interMedia Text is not backward-compatible with previous versions of Oracle ConText. Existing installations and applications need to be migrated to work with Oracle8i. There is a manual which details the steps of this process. Please remember that some of those steps need to be completed on the existing system before you install Oracle8i.

Installation

By default, the Oracle interMedia installation procedure installs interMedia Text. interMedia Text can also be installed separately, if you choose a custom install.

Important Files

?/bin/ctxsrv This may be renamed ctxsrv8 or some such. Oracle8i still supports the ctxsrv server. However, the only valid personality mask is M. You need to run it only when you want background DML. Batch DML (sync) does not require a running server. We'll go into DML in more depth later.
?/ctx/admin Data dictionary scripts. See below for details.
?/ctx/admin/defaults Language-specific default preferences. See below, and an upcoming issue on the preference system for more details.
?/ctx/data This directory has data files needed by interMedia Text, which includes the linguistic lexicon files, the Xerox stemming files, and the Korean lexer dictionary. If you get strange internal errors in linguistics, stemming, etc. check this directory.
?/ctx/bin This directory is for user-defined filters, which we'll talk about later. This directory should also contain the ctxhx program which is used by the INSO filter.
?/ctx/lib This directory has platform-specific shared libraries and template files used by the INSO filter. It also contains the safe callout used by indexing and document services -- libctxx8.so (name will change from platform to platform -- look for "ctxx")

Data Dictionary Installation

interMedia Text is integrated with the Oracle Database Creation Assistant (DBCA) so the ctxsys data dictionary should be installed with this tool. If a manual installation is required please follow these steps.

  1. Before starting, make sure that:
    • The interMedia Text software tree is installed
    • The database does not have a ctxsys user
    • Current directory is ?/ctx/admin
    • You can sqlplus internal

  2. Create the ctxsys user. Pass it the ctxsys password, default tablespace, and temporary tablespace as arguments.
             sqlplus internal @dr0csys <password> <def_tblspc> <tmp_tblspc>
    
  3. Install the data dictionary:
             sqlplus ctxsys/<password> @dr0inst <ctxx_library>
    
    The argument is the full path to the ctxx library, for example:
             sqlplus ctxsys/<password> @dr0inst /some_directory/app/oracle/product/8.1.5/ctx/lib/libctxx8.so
    
    On Unix, you must not use the environment symbol $ORACLE_HOME. You have to use the actual physical path.
     
  4. Install appropriate language-specific default preferences. There are forty-odd scripts in ?/ctx/admin/defaults which create language- specific default preferences. They are named in the form drdefXX.sql, where XX is the language code (from the Server Reference Manual).

    To install the US defaults, for instance:

             sqlplus ctxsys/<password> @defaults/drdefus.sql
    
    We'll talk more about this in the upcoming preference system issue.

    After these steps, interMedia Text should be installed and working.

Post-Installation Setup

If this database was an existing ConText site, make sure to remove text_enable from the init.ora. It is no longer used in Oracle8i, and will actually prevent Oracle8i from operating properly -- you'll get errors like "cannot find package DR_REWRITE".

Finally, ensure that the Net8 listener is running and is configured to invoke external procedures. A brief description of the process is below, with complete details are in the Oracle8i Server Administrator's Guide.

  1. Add an entry to the tnsnames.ora:
           extproc_connection_data = 
             (DESCRIPTION = 
                (ADDRESS = (PROTOCOL = ipc)
                           (KEY = DBSID))
                (CONNECT_DATA = (SID = ep_agt1)))
    
    DBSID is the database SID. ep_agt1 can be named anything. extproc_connection_data should not be changed.

  2. Add the following to the listener SID_LIST:
           (SID_DESC = (SID_NAME = ep_agt1)
                       (ORACLE_HOME = /oracle)
                       (ENVS = LD_LIBRARY_PATH=/oracle/ctx/lib)
                       (PROGRAM = extproc))
    
    ep_agt1 matches the CONNECT_DATA SID for extproc_connection_data in the tnsnames.ora.
    The PROGRAM section tells the Net8 listener to start the external procedure process.
    The ENVS section, which is shown here for UNIX, will ensure that the environment includes ?/ctx/lib in LD_LIBRARY_PATH. This is needed so that indexing can use the INSO filters.
    On NT, you may need to have ORACLE_HOME set in this section as well.

  3. Since the extproc_connection_data ADDRESS section specifies ipc, make sure that the ADDRESS_LIST of listener.ora accepts ipc connections.

    A quick way to test the Net8 configuration is to do:

        exec ctx_output.start_log('log')
    
    from SQL*Plus. If you get the error:
        DRG-50704: Net8 listener is not running or cannot start external procedures
    
    then things aren't set up correctly. Some of the things to check:
    • listener is not running
    • listener.ora is not configured for extproc
    • need to reload the listener
    • tnsnames.ora is not configured for extproc

Indexing

Creating Indexes

The first step in using interMedia Text is to create a Text index. Without a b-tree index, value queries are slower; without a Text index, contains queries are simply not possible. As we've seen, the index is created using the create index command:

  create index INDEXNAME on TABLE(COLUMN)
  indextype is ctxsys.context

Unlike previous versions of ConText, there is no separate policy creation step. We'll talk more about this when we discuss the preference system. View indexing is not allowed in Oracle8i, consistent with regular b-tree indexes. Parallel index creation is also not supported in this first Oracle8i version.

Composite indexes are not supported -- only one column is allowed in the column list. This column must be one of the following types: CHAR, VARCHAR, VARCHAR2, LONG, LONG RAW, BLOB, CLOB, BFILE. Date, number, and nested table columns cannot be indexed. Object columns also cannot be indexed, but their attributes can be, provided they are atomic datatypes.

The table must also have a primary key constraint. This is needed mainly for identifying the documents for document services, and may be used in the future for other purposes. Composite primary keys are supported, up to 16 columns.

The issuing user does not need the ctxapp role to create an index. If the user has Oracle grants to create a b-tree index on the column, then they have sufficient permission to create a Text index.

Unlike previous versions of ConText, the issuing owner, table owner, and index owner can all be different users, just like regular b-tree indexes.

IMPORTANT: If a syntax error occurs in the create index statement, the index is still created. This is different from regular b-tree indexes; before you reissue the corrected statement, you must drop the failed index first.

If an error occurs during actual indexing (e.g. you run out of tablespace) then you can pick up where you left off (after correcting the problem, of course) using alter index:

  alter index INDEXNAME rebuild parameters ('resume')

Storage clauses to the create index statement are not used. Index storage parameters are provided using the STORAGE object, which we will discuss later. Partitioning clauses are also not supported at this time, although you can manually partition the index tables if you wish, again using the STORAGE object.

Once the index is created, any export will include the index definition. At import time, imp will re-create the index by issuing the create index statement.

The Indexing Pipeline

Oracle8i detects that this is a Text index and performs the text indexing. The index is created using a pipeline of steps - "The Indexing Pipeline".

The diagram below shows the indexing pipeline:


 +-----------+           +----------+ filtered   +-----------+ 
 |           | doc data  |          | doc text   |           |
 | datastore |---------->|  filter  |----------->| sectioner |---------+
 |           |           |          |            |           |--+      |
 +-----------+           +----------+            +-----------+  |      | 
      ^                                                         |      |
      |                               section offsets           |      |
      |column data            +---------------------------------+      |
      |                       |                                        |
      |                       V                                        |
 +==========+            +----------+            +-----------+         |
 |          |            |          |            |           |         |
 | database |<-----------|  engine  |<-----------|   lexer   |<--------+
 |          |  index data|          |   tokens   |           |   plain
 +==========+            +----------+            +-----------+   text

Let's step through this pipeline, stage-by-stage. Each stage will be covered in depth later, so they'll be described here briefly.

Start with the datastore. This stage loops over the rows of the table and reads the data out of the column. Unlike previous versions, the rows are not read in any particular order. The datastore passes out document data. Usually, this will simply be the column data, but some datastores use the column data as a pointer to the document data. The URL_DATASTORE, for instance, uses the column data as a URL, does a GET, and passes out the returned data.

The filter takes the document data and transforms it to some kind of text representation. This is needed when storing binary documents such as Word or Acrobat files. Unlike previous versions, the output of the filter does not have to be plain text -- it can be a text format such as XML or HTML.

The sectioner, new for Oracle8i, takes the output from the filter, and converts it to plain text. There are different sectioners for different text formats, including XML and HTML. Conversion to plain text includes detecting important section tags, removing "invisible" information, and reformatting the text.

The lexer takes the plain text from the sectioner and splits it into discrete tokens. We have lexers for whitespace-delimited languages, and specialized lexers for Asian languages, where segmentation is quite a bit more complex. The basic lexer also includes theme functionality to build unified text/theme indexes.

Finally, Oacle8i takes all the tokens from the lexer, the section offsets from the sectioner, and a list of low-information words called the stoplist, and builds an inverted index. An inverted index stores tokens, and the documents in which these tokens occur. For instance, our example from issue #1:

  insert into docs values (1, 'first document');
  insert into docs values (2, 'second document');
would produce an inverted index like this:
  DOCUMENT  ---> doc 1 position 2, doc 2 position 2
  FIRST     ---> doc 1 position 1
  SECOND    ---> doc 2 position 1
Each document is assigned an identifier called a docid, which is used in the inverted index. The document primary key isn't stored.

Logging

The time needed to create an index is a function of the amount of text being indexed. Oracle8i can create a log file which can be monitored during indexing. start_log starts the logging, and should be called before issuing the create index, in the same session:

  ctx_output.start_log('mylog');
This will create a mylog file in ?/ctx/log which can be used to monitor indexing progress. The directory can be changed by ctxsys, using the system parameter LOG_DIRECTORY. See documentation for details.

IMPORTANT: In this first release of interMedia ?/ctx/log is not created at install time. You may get errors like "unable to open file". Simply have the DBA create this directory.

Logging is halted when the session exits or when end_log is called:

  ctx_output.end_log;

Errors During Indexing

Processing errors encountered during indexing -- filter errors, file cannot be read, etc. -- do not bring indexing to a halt. Instead, the error is logged and indexing continues to the next file.

You can see these logged errors in the ctx_user_index_errors view. It will tell you the rowid of the failed document, and the error which was encountered. You need to manually empty this table when you are finished reviewing the errors:

    delete from ctx_user_index_errors;

The Preference System

Classes, Objects, Preferences

The preference system allows you to customize Text indexing. Each stage of the indexing pipeline is really an interface into which customized objects can be plugged in. These stages are called "classes", and they can be found in the ctx_classes view:
DATASTORE Reads column data and returns document data
FILTER Reads document data and returns formatted document text
SECTION_GROUP Reads formatted document text and returns section offsets and plain document text
LEXER Reads plain document text and breaks it into tokens
WORDLIST Contains query expansion defaults
STOPLIST Contains a list of low-information words
STORAGE Index table storage parameters

Each class has one or more "objects" which are like behavioral templates. The DATASTORE class, for instance, has the following objects:

DIRECT_DATASTORE Passes column data directly as document data
DETAIL_DATASTORE Concatenates multiple rows of a detail table to construct document data
FILE_DATASTORE Interprets column data as a filename. Reads file and passes file contents as document data
URL_DATASTORE Interprets column data as a URL. Performs a GET and passes return as document data
USER_DATASTORE Invokes a stored procedure to synthesize document data

Objects can be found in the ctx_objects view.

But objects are only templates, and cannot be used directly in an index. Instead, you create a "preference" from one of these object templates, customize it by setting "attributes", then use the preferences to create the index. Let's take a closer look:

Creating Preferences

You create a preference using ctx_ddl.create_preference, specifying an object to use as template:

    ctx_ddl.create_preference('mypref', 'FILE_DATASTORE');
This creates the preference mypref, using the FILE_DATASTORE object. If you are familiar with previous versions, the object names have been changed, and there is no comment argument.

This is done in PL/SQL, so surround it with "begin" and "end", or use "exec" if you are in SQL*Plus. Also, in order to create preferences, the issuing owner must have the ctxapp role.

Once the preference is created, we can customize it using set_attribute: For our example, let's say all our files were in /docs. We can simply set the PATH attribute to communicate this to interMedia:

    ctx_ddl.set_attribute('mypref', 'PATH', '/docs');

If you are familiar with previous versions, set_attribute now comes *after* create_preference, which allows us to do attribute-level validation, and necessitates the first argument, which is the preference name.

This attribute is a text attribute, so it is set to a text value. There are also number attributes and boolean attributes, which should be set to 'TRUE' or 'FALSE' (or 'YES'/'NO'). The attributes for each object and their types can be found in view ctx_object_attributes or in the documentation.

Some attributes have a list of allowed values. The allowed values can be found in the view ctx_object_attribute_lov. Unlike previous versions, LOV attributes can be set using a descriptive "tag" rather than a numeric value.

Non-preference Classes

Preferences are used for all classes except the SECTION_GROUP and STOPLIST classes. These classes use specialized objects and have their own API. We'll discuss this in the respective upcoming issues.

Using Preferences

Once you've created your preferences, you can build all kinds of customized indexes by mixing and matching them. Because all objects of a class adhere to a common interface, any filter can be hooked up to any datastore, etc. Got a bunch of Korean Word 97 files? No problem -- hook up a FILE_DATASTORE preference, an INSO_FILTER preference, and a KOREAN_LEXER preference.

You plug in your preferences using the parameters clause of create index:

    create index doc_index on docs(text)
      indextype is ctxsys.context
      parameters ('datastore mypref')
This tells create index to use mypref as the datastore. Each of the classes has a parameters keyword to name a preference. The keyword is the same as the class name, except for SECTION_GROUP, whose keyword is the phrase "section group". You can specify multiple classes by simply tacking the keyword-preference pairs on the end of the string:
    parameters('datastore mypref filter myfilter')

Any user can use any preference. To specify a preference in another user's schema, simply add the owner name in the parameters string:

    parameters('datastore kenny.mypref')

Unlike previous versions, the index makes a copy of the preference settings, which means that after use in an index, the preference can be modified or dropped without affecting the index. Note that only the preference owner can modify or drop it.

The Default System

In the example above, only the datastore class is set. Since preferences were not specified for the other classes in the parameters string, they get their values from the default system.

The default system uses system parameters. System parameters are name- value pairs which apply to the interMedia Text installation as a whole -- kind of our version of init.ora. They can be seen in the view ctx_parameters.

The values for the parameters used by the default system are preference names. For instance, in our example we have not specified a lexer preference. interMedia gets the value of the system parameter DEFAULT_LEXER, and uses the preference named there. This allows the administrator to set the default behavior for the system.

The set of parameters used by the default system are:

DEFAULT_DATASTORE The default datastore preference. At install time (the default default) it is set to CTXSYS.DEFAULT_DATASTORE, a preference which uses the DIRECT_DATASTORE object.
DEFAULT_FILTER_BINARY This is the filter preference to use when indexing binary datatypes such as RAW, LONG RAW, or BLOB. At install time it is set to CTXSYS.INSO_FILTER, which uses the INSO_FILTER object.
DEFAULT_FILTER_FILE This is the filter preference to use when indexing files, either BFILE or the FILE_DATASTORE. At install time it is set to CTXSYS.INSO_FILTER.
DEFAULT_FILTER_TEXT This is the filter preference to use when indexing text datatypes such as CHAR, VARCHAR2, or CLOB. At install time it is set to CTXSYS.NULL_FILTER, which does no filtering.
DEFAULT_SECTION_HTML This is the section group to use when the filter class uses the INSO filter, or the datastore uses the URL_DATASTORE. The INSO filter filters binary files to HTML (see filters). At install time it is set to CTXSYS.HTML_SECTION_GROUP, which merely converts the HTML to plain text.
DEFAULT_SECTION_TEXT This is the section group to use in all other cases. At install time it is set to CTXSYS.NULL_SECTION_GROUP, which does no sectioning.
DEFAULT_LEXER This is the default lexer preference. At install time it is set to CTXSYS.DEFAULT_LEXER. The definition of that preference is language-specific. See below for more details.
DEFAULT_WORDLIST This is the default wordlist preference. At install time it is set to CTXSYS.DEFAULT_WORDLIST. The definition of that preference is language-specific. See below for more details.
DEFAULT_STOPLIST This is the default stoplist. At install time it is set to CTXSYS.DEFAULT_STOPLIST. The contents of that stoplist are language-specific. See below for more details.
DEFAULT_STORAGE This is the default storage preference. At install time it is set to CTXSYS.DEFAULT_STORAGE, which has no settings.

Language-Specific Defaults

The LEXER, WORDLIST, and STOPLIST classes are the most affected by the language of the documents. Every installation gets DEFAULT_LEXER, DEFAULT_WORDLIST, and DEFAULT_STOPLIST preferences, but the definition of these preferences will depend on the language chosen at install time.

?/ctx/admin/defaults holds language-specific scripts for each supported language. Based on the language chosen, DBCA runs the matching language-specific script, which creates the default preferences most appropriate for that language.

The result is that a German instance gets a DEFAULT_LEXER which is case- sensitive and does German decompounding, a DEFAULT_STOPLIST which has common German words, and a DEFAULT_WORDLIST with German stemming. A Japanese instance gets different defaults, more tailored to Japanese.

Browsing the Data Dictionary

Here is a list of views useful for browsing the interMedia Text data dictionary: ctx_classes Available classes ctx_objects Available objects for each class ctx_object_attributes Attribute names and types for each object ctx_object_attribute_lov Allowed values for each LOV object attribute ctx_preferences Preferences ctx_preference_values Attribute values of each preference ctx_user_index_objects Objects used by the user's indexes ctx_user_index_values Attribute values for the user's indexes

Datastores

The DATASTORE class

The datastore class is responsible for reading the column data from the database, getting the document data based on the column data, and passing that up to the filter.

It is important to keep in mind that some form of the document data must be in a table. Datastore objects which access external resources such as files and web pages still require a table of pointers to those resources.

Datastore preferences are plugged into indexes using the datastore keyword:

    create index
    ...
    parameters ('datastore mydstore');
The datastore class has five objects.

DIRECT_DATASTORE

The direct datastore is the simplest case -- it assumes that document data is stored in the indexed column. It is so simple it has no attributes to customize. There should be no reason to create a preference based on the DIRECT_DATASTORE object -- CTXSYS.DEFAULT_DATASTORE can be used for any index.

IMPORTANT: If you are indexing BFILEs, make sure the user ctxsys has READ permission on the BFILE directory.

FILE_DATASTORE

The file datastore reads the column data as a filename. It opens the file and returns the contents as the document data. The indexed column cannot be LONG or LOB types. The files must be accessible from the database host machine.

This object has one attribute: PATH. PATH is optional, and specifies the directory where the documents are stored.

IMPORTANT: If PATH is specified, the column data must be simple file names; if not specified, the column data must be full path names. You cannot mix and match -- for instance, with PATH of /A, and a column of B/C.TXT it will NOT find /A/B/C.TXT.

On some platforms, PATH can specify multiple directories. If you do this, make sure that your documents have unique names -- if a document C.TXT is in directories /A and /B, and /A is in the PATH before /B, only /A/C.TXT will be indexed.

There is not much of a difference between this object and BFILEs. It's a matter of choice. Like BFILE's, when the file changes, the row will not be automatically reindexed. You should update the column value to itself to trigger a reindex.

URL_DATASTORE

The URL datastore reads the column data as a URL. It performs a GET and returns the contents as the document data. The URLs for the documents must in the table -- the URL datastore does not crawl. The indexed column cannot be LONG or LOB types.

http, ftp, and file protocols are supported. The URL datastore also has multi-threaded read (on most platforms) to maximize throughput. It is very customizable, and has several attributes:
TIMEOUT Read request timeout in seconds
MAXTHREADS Maximum number of threads to use for multithreaded read
URLSIZE Maximum length of an URL
MAXURLS Maximum number of URL's to read at one time
MAXDOCSIZE Maximum length of document
HTTP_PROXY HTTP proxy
FTP_PROXY FTP proxy
NO_PROXY Domain for proxy exclusion

Like the file datastore, if the web page changes, then you should manually update the column value to itself to trigger the reindex.

DETAIL_DATASTORE

Many legacy applications store text line-by-line, in multiple rows of a detail table. The detail datastore constructs documents by concatenating the text of these lines into a single, coherent document.

The detail datastore must be told the specifics of the detail table:
DETAIL_TABLE Name of the detail table
DETAIL_KEY The foreign key column(s) in the detail table
DETAIL_LINENO The sequence column in the detail table
DETAIL_TEXT The text column in the detail table

For instance, let's say the master and detail tables look like this:

    the_master                   my_detail

    ID     TITLE                 M_ID   SEQ  LINETEXT
    1      Grate Expectations    1      1    It was the best of times
                                 1      2    it was the worst of times.
"That's G-R-A-T-E Expectations, also by Edmund Wells."

The attributes of the preference should be set like this:
DETAIL_TABLE my_detail
DETAIL_KEY m_id
DETAIL_LINENO seq
DETAIL_TEXT linetext

There is one final attribute: BINARY. By default, the detail datastore treats each row as a separate line, and, when concatenating the contents, will automatically stick a newline at the end of the text for each row. BINARY set to TRUE suppresses this. In our example, we should set BINARY to FALSE or simply leave it unset.

The index itself should to be built on the master table:

    create index myidx on master(somecolumn)
      indextype is ctxsys.context
      parameters ('datastore my_detail')
The master table must have a primary key column, just like any other table indexed by interMedia Text. This primary key column is used to find the corresponding detail rows, where detail.fk = master.pk.

The indexed column can be any column of allowable type -- the contents are not used by interMedia. If you can afford the space, you can add a dummy char(1) column called "text" or "body", to make the queries more readable:

    select author, title from master
     where contains(text, 'best & worst') > 0; 

If detail rows are added, removed, or changed without changing the master text column, you should manually update the text column to itself to trigger a reindex.

For those of you familiar with previous versions of ConText, this is similar to the master detail new datastore. The old master detail datastore is no longer supported

USER_DATASTORE

The user datastore is new for Oracle8i. You write a stored procedure which, given a rowid, synthesizes a document. interMedia calls this procedure whenever document data is required. Let's look at an example. Say we have a table like this:

  articles
    id       number
    author   varchar2(80)
    title    varchar2(120)
    text     clob
and we want to automatically have author and title be part of the indexed document text. We can write a stored procedure following the user datastore interface:
    create procedure myproc(rid in rowid, tlob in out NOCOPY clob) is
      offset number := 1;
    begin
      for c1 in (select author, title, text from articles
                  where rowid = rid)
      loop
        append_varchar_to_lob(tlob, c1.title, offset);
        append_varchar_to_lob(tlob, 'by '||c1.author, offset);
        dbms_lob.append(tlob, c1.text);        
      end loop;
    end;
This procedure takes in a rowid and a temporary clob locator, and concatenates all the articles columns into the temp clob, This assumes a helper procedure append_varchar_to_lob has been written.

Note the use of the NOCOPY in the procedure signature. This is very important to achieve best performance and to reduce TEMP space usage.

To ensure that the DBA has signed off on the code, only ctxsys-owned stored procedures are allowed for the user datastore. Thus, we need to do something like this as ctxsys:

    create procedure s_myproc(rid in rowid, tlob in out NOCOPY clob) is
    begin
      appowner.myproc(rid, tlob);
    end;
And, we need to make sure that the index owner can execute the stub procedure, so:
    grant execute on s_myproc to appowner

Now, back as the app owner, we create the preference, setting the PROCEDURE attribute to the name of the ctxsys stub procedure:

  ctx_ddl.create_preference('myud', 'user_datastore');
  ctx_ddl.set_attribute('myud', 'procedure', 's_myproc');
When we then create the index on articles(text) using this preference, interMedia indexing will see author and title in the document text.

The user datastore can be used for any on-the-fly document synthesis, including more complex master-detail relationships, nested tables, text preprocessing, or multicolumn concatenation, like the example.

There are four constraints on the procedures used in user datastores:

  1. They must be owned by ctxsys
  2. They must be executable by the index owner
  3. They cannot issue DDL or transaction control statements like "commit"
  4. They cannot be safe callouts or call safe callouts
If you change the stored procedure, indexes based upon it will not be notified, so you should manually recreate such indexes. interMedia cannot tell what you are doing in the stored procedure, so if the stored procedure makes use of other columns, and those column values change, the row will not be reindexed. The row is only reindexed when the indexed column changes.

When using the USER_DATASTORE, you may run into OCI_INVALID_HANDLE errors. This is bug 881904, and is fixed in patch release 8.1.5.1.

Filters

The FILTER Class

The filter class takes the document data provided by the datastore class, and filters it to readable text, passing it to the sectioner. With Oracle8i, this does not have to be plain text -- it can be a text format such as HTML.

Filter preferences are plugged into indexes using the filter keyword:

    create index
    ...
    parameters ('filter myfilter');
The filter class has three objects.

NULL_FILTER

The null filter is used when the document contents are not in a binary format. It simply passes text from the datastore to the sectioner. The default CTXSYS.NULL_FILTER preference can be used in any index to employ the null filter.

CHARSET_FILTER

The charset filter is new for Oracle8i. It converts documents from a foreign character set to the database character set. This is most useful for Japanese customers, who have to deal with two widely-used and incompatible character sets (and one infrequently-used incompatible character set).

The charset filter has one attribute, CHARSET, which takes the NLS name of the source character set. A list of NLS charsets can be found in the Oracle National Language Support Guide. The destination character set is always the database character set, so does not need to be specified.

Additionally, you can specify JAAUTO, which is a custom setting for Japanese character set auto-detection. Oracle8i will automatically detect Shift-JIS, JIS7, or EUC for each document and convert it to the database charset if needed.

USER_FILTER

The user filter is a method for plugging in your own filters. You write a filtering program, place it in ?/ctx/bin, and the indexing engine will invoke it to filter each document. Here's an example -- we'll create an uppercase filter which will uppercase every word.

We start by creating a program to do the filtering -- in this case, we'll write a perl script:

  #!/usr/local/bin/perl

  open(IN, $ARGV[0]);
  open(OUT, ">".$ARGV[1]);

  while ()
  {
    tr/a-z/A-Z/;
    print OUT;
  }

  close(IN);
  close(OUT);
This perl script converts a file to uppercase.

User filter programs like this take two arguments. The first argument is the name of the input file. The second argument is the name of the output file. Our filter reads the contents of the input file, filters it, and writes the output to the output file.

the program (called upcase.pl) is placed in ?/ctx/bin. Ensure that it's executable by the oracle operating-system user. Now the preference is created:

    ctx_ddl.create_preference('mypref', 'USER_FILTER');
    ctx_ddl.set_attribute('mypref', 'EXECUTABLE', 'upcase.pl');
When this preference is used in an index, the indexing engine will invoke the user filter program for each document when the text is required.

INSO_FILTER

The INSO filter automatically recognizes and filters over a hundred different formats, including Word 97 and Acrobat. The full list can be found in the documentation.

The INSO filter does not have any attributes at this time, so the CTXSYS.INSO_FILTER preference can be used in any index needing filtering.

IMPORTANT: This filter outputs HTML, not plain text. Make sure you employ the HTML section group or all these tags will be indexed. The default system will employ the HTML section group when the INSO filter is detected.

The INSO filter uses code from Inso, "the world leader in filtering and viewer technology," and Adobe. This code is not integrated code directly, it instead follows the user filter architecture. The user filter executable for the INSO filter is ctxhx.

ctxhx requires shared libraries and other files (.flt and .tpt files) located in ?/ctx/lib. The installer should copy the correct platform- specific files to this directory. Additionally, ensure that the external procedure agent has ?/ctx/lib in the LD_LIBARY_PATH environment variable (PATH on NT). This can be done using the ENVS section in the listener.ora. On NT you may need to have ORACLE_HOME set in the ENVS section, as well.See Installation section above for details.

If you are encountering problems using this filter, it can be run by hand. First get a formatted binary file. Let's call it testfile.doc. You do:

    ctxhx testfile.doc testfile.txt
This should create the file testfile.txt with the HTML representation. The error messages you get, if any, should help you determine what's wrong.

The INSO filter is supported only on Solaris, HP-UX, AIX, and NT. Other platforms are not able to use this filter at this time.

Where's the HTML filter ?

Some of you familiar with previous versions of ConText may be wondering about the HTML filter. In 8i, this code has been moved to the new sectioner class, which allows any filter to spit out HTML if it so desires.

Section Groups

The SECTION_GROUP Class

The section group class is new for Oracle8i, and incorporates functionality that was part of the wordlist and filter classes in ConText. It takes a text format, such as XML or HMTL, as input. It is unique in having two outputs -- the section boundaries, which go to the engine, and the plaintext, which goes to the lexer.

Creating Section Groups

Section groups are not created using create_preference -- they have their own API in ctx_ddl:

    ctx_ddl.create_section_group('mygroup','html_section_group');
The first argument is the name of the new section group. The second argument is the section group type. This specifies the input text format, and tells the sectioner the rules for detecting section boundaries and transforming to plaintext. Each section group type is covered briefly below.

An empty section group can transform formatted test to plaintext, but will not index any section boundaries. You must first tell it which sections to index by adding sections to the group.

Sections have three important attributes: TAG, NAME, and TYPE. TAG tells the section group how to recognize the section. Because the section group already knows the format, you do not need to specify start and end tags; a section in an HTML section group with TAG of B knows that the section starts with <b> and ends with </b>, for instance. Tags are unique across the sections of a section group.

NAME is how you refer to the section in queries. You may want to name the B tag BOLD, for readability. Multiple tags can be mapped to the same name -- they are then treated as instances of the same section. H1, H2, H3 can all be treated as instances of the HEADLINE section. We recommend avoiding non- alphanumeric characters -- such as underscore -- in section names. Using these characters will force you to escape the names in queries.

TYPE is totally new for 8i. There are three different types of sections: ZONE, SPECIAL, and FIELD.

Section Types

ZONE sections are like sections from previous versions. interMedia records where in the document the section start and end tags occur. WITHIN queries check that the hit words occur between the start and end word offset.

If a ZONE section repeats, each instance is treated separately in query semantics. See Examples, below. ZONE sections can enclose other sections, including themselves, and can be enclosed by other sections. ZONE sections are added like this:

    ctx_ddl.add_zone_section('groupname', 'sectionname', 'tag');
SPECIAL sections are so named because they are not recognized by tags. There are two special sections in Oracle8i -- SENTENCE and PARAGRAPH -- both recognized by punctuation in the lexer. They index the start and end of sentences and paragraphs, respectively. You add special sections like this:
    ctx_ddl.add_special_section('groupname', 'sectionname');
No tag argument is needed. There are only two allowed values for section name: SENTENCE, and PARAGRAPH.

ZONE and SPECIAL sections index section start and end word offsets, but do nothing to the words in the section. FIELD sections, on the other hand, extract their contents and index them separately from other words in the document. WITHIN queries are run on this separate, smaller index. This makes field section query faster than zone section query -- up to three times as fast in tests we've conducted -- especially when the section tags occur in every document.

This speed comes at the cost of flexibility. FIELD sections are meant for non-repeating, non-overlapping sections. If a field section repeats, it is treated as a continuation of the section, not a distinct instance. If a field section is overlapped by itself or by another field section, it is implicitly closed at the point where the other section starts. Also, there is a maximum of 64 field sections in any section group. This is 64 section names, not tags. Remember that you can map multiple tags to the same section name. You add field sections like this:

    ctx_ddl.add_field_section('groupname', 'sectionname', 'tag');
To illustrate, let's work through a couple of examples to illustrate the different types of sections and the impact on query semantics. We'll use the following document as an example:
    <A>rat</A><A>ox</A>
    <B>tiger rabbit</B>
    <C>dragon<C>snake</C></C>
This is an XML-style markup, but without a DTD, so we will use the basic section group type:
    ctx_ddl.create_section_group('mygroup','basic_section_group');
Let's start with ZONE sections:
    ctx_ddl.add_zone_section('mygroup', 'asec', 'a');
    ctx_ddl.add_zone_section('mygroup', 'bsec', 'b');
    ctx_ddl.add_zone_section('mygroup', 'csec', 'c');
This tells the section group to recognize A tags as the section ASEC, B tags as the section BSEC, etc. To do section queries, we use the WITHIN operator. Remember to use the section name and not the tag:
    contains(text, 'rat within asec') > 0
This finds the document.
    contains(text, 'tiger within asec') > 0
This does not find the document. Although it has "tiger", it does not occur in the ASEC. If instead of the original setup we had mapped A and B to the same section name:
    ctx_ddl.add_section('mygroup', 'asec', 'a');
    ctx_ddl.add_section('mygroup', 'asec', 'b');
Then both:
    contains(text, 'rat within asec') > 0
    contains(text, 'tiger within asec') > 0
would find the document, because both A and B are treated as ASEC.

An important facet of ZONE section query semantics is that each instance is treated distinctly. That is, a query like this:

    contains(text, '(tiger and rabbit) within bsec') > 0
finds the document, but a query like this:
    contains(text, '(rat and ox) within asec') > 0
does not find the document. Although the document has "rat" and has "ox", and "rat" is in ASEC, and "ox" is in ASEC, "rat" and "ox" are not within the SAME ASEC. Note that
    contains(text, '(dragon and snake) within csec') > 0
finds the document, since they are both in the outer CSEC, even though the inner CSEC contains only "snake".

Special sections follow the same semantics, so a query like this:

    contains(text, '(jumbo and shrimp) within sentence') > 0
is looking for documents with "jumbo" and "shrimp" in the SAME sentence.

Now let's look at field sections. What if in setup we made ASEC a field section instead of a zone section:

    ctx_ddl.add_field_section('mygroup', 'asec', 'a');
Unlike zone sections, each instance of a field section is considered a continuation of the section. The query
    contains(text, '(rat and ox) within asec') > 0
finds the document, although it didn't when ASEC was a zone section. The field section simply stitches the two instances together. Field sections work best on sections like BSEC, which does not repeat like ASEC nor nest like CSEC.

One last subtlety with field sections. If ASEC is a zone section, then

    contains(text, 'rat') > 0
finds the document -- it contains the word rat. However, remember that field sections work by extracting the document contents and indexing them separately. Thus, if ASEC were a field section, this query would not find the document. "rat" is in ASEC, which is separate from the document contents.

You can, however, change this by making the field section visible. There is an optional boolean fourth argument to add_field_section. If this boolean argument is set to true:

    ctx_ddl.add_field_section('mygroup', 'asec', 'a', TRUE);
then the field section is visible, and the section contents are visible to non-within queries, just like zone sections. This is accomplished by double- indexing the word -- once as part of the extracted section, and once as part of the body, so this option has index space cost.

Using Section Groups

Section groups are plugged into indexes using the section group keyword:

    create index
    ...
    parameters ('section group mygroup');
Unlike previous versions, you do not need to set an attribute in the wordlist object. You also no longer need to set STARTJOINS and ENDJOINS in the lexer. However, the WHITESPACE, NEWLINE, and PUNCTUATIONS settings in the basic lexer affect sentence and paragraph boundary recognition, which then impact special sections. There are five section group types.

NULL_SECTION_GROUP

The null section group is used when there is no section information to be extracted -- it simply passes the text through to the lexer. Only special sections can be added to the null section group.

BASIC_SECTION_GROUP

The basic section group is for simple XML-style markup without going full- bore on the DTD stuff. It can be used in custom situations where full XML is unnecessary. Tags are in the form .... Entities and tag attributes are not supported. The only processing it does for plaintext is removal of the markup tags.

HTML_SECTION_GROUP

The HTML section group is for HTML. Good name, huh? It knows the rules for HTML, including ISO Latin-1 entity translation, HTML to plaintext conversion, and

omission. It knows HTML 4.0 tags, and has code to deal with unknown tags.

SCRIPT and STYLE contents, and comments are removed. Contents of the TITLE section are preserved. META tag information indexing is a known customer requirement for a future version.

For those familiar with previous versions, there is no need for KEEP_TAG anymore -- this is automatic when you add the tag to the section group.

XML_SECTION_GROUP

The XML section group is for XML. It does no processing to plaintext except for entity translation and tag removal. It can handle non-system, non-parameter entities in the internal DTD. It cannot access external DTD's in this version.

NEWS_SECTION_GROUP

The news section group handles newsgroup-style postings -- RFC-1036 in particular. This format is used in newsgroups and is similar to the one used for e-mail. Note that while RFC-1036 defines the set of allowable header fields, Oracle does not require this -- any header fields can be used.

Messages in this format consist of a header section, a blank line, and a body section. The news section group recognizes <beginning of line>TAG: as the start of sections, and <end of line> as the end of the section. It stops looking for this pattern when the blank line is encountered.

Translation to plaintext consists of removing all header lines which are not important to section searching. A message like this:

  From:  me
  To:    you
  X-ref: 5!oowj

  hello!  How are you!
with only the from and to tags in the section group, will look like this to the indexing engine:
  <from section start>me<from section end>
  <to section start>you<to section end>
  hello!  How are you!

Lexers

The LEXER Class

The lexer class gets plaintext from the sectioner and splits it into words.

Lexer preferences are plugged into indexes using the lexer keyword:

    create index
    ...
    parameters ('lexer mylex');
The lexer class has five objects.

BASIC_LEXER

The basic lexer can be used for most European languages. It is programmed with default rules for splitting whitespace-delimited text into words. You can modify its parsing behavior to some extent with the lexical attributes.

Ordinarily, non-alphanumeric characters end a word. SQL*Plus is lexed as "SQL" and "Plus". The JOINS attributes allow you to modify this behavior by declaring sets of non-alphabetic characters to treat as valid word letters.

PRINTJOINS are non-alphanumeric characters which are treated as valid characters, and become part of the word. Declaring PRINTJOINS of "*" lets SQL*Plus be lexed as "SQL*Plus".

SKIPJOINS are like PRINTJOINS, but they do not become part of the word. Declaring SKIPJOINS of "*" lets SQL*Plus be lexed as "SQLPlus".

STARTJOINS are like PRINTJOINS, but they only come at the start of the word. If they occur in the middle of a word, a new word is started. Having "*" as STARTJOINS lets SQL*Plus be lexed as "SQL" and "*Plus". Multiple STARTJOINS are allowed at the start of a word.

ENDJOINS are like STARTJOINS, but for the end of the word. They implicitly end a word, too. So, "*" as ENDJOINS lexes SQL*Plus as "SQL*" and "Plus". Multiple ENDJOINS are allowed at the end of a word. STARTJOINS and ENDJOINS used to be important for section searching in previous versions, but with the new sectioner class, they are no longer needed for that purpose.

Each of these four JOINS attributes is a set of characters. Setting PRINTJOINS to "_%*" means that "_", "%", and "*" are all PRINTJOINS.

NUMJOIN is a single character which is the numeric join character. NUMGROUP is the single numeric group character. They are defaulted based on your NLS_LANG setting. For US, NUMJOIN is "." and NUMGROUP is ",".

Finally, CONTINUATION are those characters which indicate line continuation. These and the following newline are removed from the text. Default value is hyphen and backslash.

Then there are three attributes important for sentence/paragraph indexing. PUNCTUATIONS are the set of punctuation marks (?!. by default). WHITESPACE is a set of additional whitespace characters other than space and tab, which you get for free and cannot change. NEWLINE can be set to NEWLINE or CARRIAGE_RETURN. This controls whether lines are ended by \n or \r\n.

A sentence is then recognized as a PUNCTUATION followed by one or more WHITESPACE. A paragraph is a PUNCTUATION followed by a NEWLINE. There are other patterns which are variations on this, but that's the basic idea.

The third set of basic lexer attributes controls term normalization -- the process of converting words to a standard form. BASE_LETTER is a boolean attribute which, if set to YES, will convert accented characters to their unaccented forms.

By default, each word is uppercased during indexing so that queries are case-independent. MIXED_CASE is a boolean, which, if set to YES, does not do this. This makes indexing and queries case-sensitive.

ALTERNATE_SPELLING is an attribute which, if set, uses letter transformation rules for common alternate spellings. It can be set to GERMAN, DANISH, or SWEDISH. In GERMAN mode, for instance, A-umlaut gets transformed to AE. Since the lexer is used at both indexing and query time, query for a word containing A-umlaut will find the word spelled with A-umlaut or AE.

Finally, COMPOSITE controls word decompounding. In German, for instance, multiple words are often combined into a single string. This makes lexing difficult, because the words are no longer whitespace-delimited. If COMPOSITE is set, then the lexer breaks up these compound words, which allows interMedia to index and query as usual. COMPOSITE can be set to GERMAN or DUTCH. Unset or DEFAULT means no decompounding.

The last set of basic lexer attributes control text and theme indexing. New for 8i, the basic lexer can generate and index themes for each document, in addition to splitting it up into words. This merges the functionality found in the separate THEME LEXER and BASIC LEXER of previous versions.

INDEX_TEXT is a boolean which controls word indexing. INDEX_THEMES controls theme indexing. Setting INDEX_THEMES to NO makes it a text-only index. Setting INDEX_TEXT to NO makes it a theme-only index. They cannot, of course, be both NO. Indexing themes takes longer and uses a bit more index space, but improves the efficacy of ABOUT queries. We'll talk more about this when we discuss ABOUT query.

The CTXSYS.DEFAULT_LEXER preference has theme indexing ON for the English language-specific scripts.

JAPANESE_VGRAM_LEXER

Japanese is not a whitespace-delimited language (except in textbooks) so it is very difficult for computers to pick out individual words. The Japanese V-gram lexer gets around this problem by indexing overlapping clusters of characters. The Japanese word ABC might be decomposed into AB and BC, for instance. Query terms are also decomposed this way. Thus, the contains query is not looking for words, but for patterns of characters.

The Japanese lexer does not have any attributes in this version.

CHINESE_VGRAM_LEXER

Chinese is also not whitespace-delimited, so a similar solution has been applied for Chinese, as well. The Chinese v-gram lexer also does not have any attributes.

CHINESE_LEXER

New for Oracle8i is a segmenting Chinese lexer, which can actually recognize some Chinese words as whole entities. The rest of the text is still broken into v-grams, but this should be more space-efficient and have faster query than the pure v-gram method. The Chinese segmentation lexer does not have any attribtues.

KOREAN_LEXER

Korean is whitespace-delimited, but has problems with verbs, which can have thousands of different forms. Our Korean lexer is a lexicon-driven engine (using a third-party 3-soft lexicon) which simply eliminates verbs from indexing. New for Oracle8i is the ability to eliminate adverbs and adjectives, do various form conversions and perform morphological and segmentation decomposition.

Other Indexing Classes

The STOPLIST Class

Not every word in a document is worth indexing. Linguistic lubricant like prepositions and conjunctions are important for language understanding, but are not very useful for information retrieval -- they are very common and so convey very little information about the document by their presence. Most of the time, it's not worth the space to index these words.

The stoplist class holds a list of these words, called stop words. During indexing, the engine consults the stoplist and filters out the stop words. The stoplist class does not use the create_preference API. Instead, it has its own API:

    ctx_ddl.create_stoplist('mylist');
Words are added using add_stopword:
    ctx_ddl.add_stopword('mylist', 'the');
Here the article "THE" has been added to the stoplist. You can see the words in a stoplist using the ctx_stopwords view. A stoplist can have a maximum of 4095 stopwords. Each can be up to 64 bytes in length.

Stopwords are case-sensitive in a stoplist, but if the index is not case- sensitive, the stopwords will not be case-sensitive. So, you could add "THE" and "the" to a stoplist, but if the index is not case-sensitive, there would be no difference -- you'd just be wasting a stopword slot.

Stoplists are plugged into indexes using the stoplist keyword:

    create index
    ...
    parameters ('stoplist mylist');
New for Oracle8i, stopwords can be added to the index without re-indexing:
    alter index myidx rebuild parameters ('add stopword AND');
This adds "AND" as a new stopword after indexing. Remember that the index makes its own copy of everything, so this does not affect the stoplist used to create the index.

The default stoplist, CTXSYS.DEFAULT_STOPLIST, is language-specific, and default stoplists are provided for most European languages.

Other Stop Objects

Stoplists can also hold two other kinds of objects: stop themes and stop classes. Stop classes are classes of words, rather than individual words. The only stop class available in Oracle8i is NUMBERS, which stops all numbers from being indexed. Alphanumeric words, such as 123.456.789 and abc231 are still indexed. This behavior depends on the lexer to recognize numbers.

Stop themes are used only during theme indexing. The entries added as stop themes are themed, and these themes AND THEIR CHILDREN are not indexed. For instance, say you enter "biology" as a stop theme. This will prevent themes like "Physical Sciences:Biology:Genetics" and "Physical Sciences: Biology:Botany" from being indexed. This is just an example, and does not actually reflect the knowledge base.

Stop themes are used only during theme indexing, so adding "biology" as a stop theme will not stop the word "biology" or biology-associated words from being indexed.

The WORDLIST Class

The wordlist class is an odd class because it has no effect on indexing. Instead, it holds fuzzy and stem expansion settings used at query time.

Wordlist preferences are plugged into indexes using the wordlist keyword:

    create index
    ...
    parameters ('wordlist mylist');
There is only one wordlist object: BASIC_WORDLIST, with the following attributes:

STEMMER This attribute is set to the default stemming type. Stemming is an expansion of a word to different forms. The stem expansion of GO might include GO, GOING, WENT, and GONE. The rules for this vary from language to language. (See expansion query operators).

STEMMER can be set to one of the following: ENGLISH (inflectional), (English) DERIVATIONAL, DUTCH, FRENCH, GERMAN, ITALIAN, SPANISH, and NULL, which means no stemming. The CTXSYS.DEFAULT_WORDLIST setting is language-specific.

FUZZY_MATCH This attribute is set to the default type of fuzzy match. Fuzzy match is an expansion technique designed to find words close in form, such as mistyped or mis-OCR'ed versions.

FUZZY_MATCH can be set to one of the following: GENERIC, JAPANESE_VGRAM, KOREAN, CHINESE_VGRAM, ENGLISH, DUTCH, FRENCH, GERMAN, ITALIAN, SPANISH, and OCR. The CTXSYS.DEFAULT_WORDLIST setting is language-specific.

FUZZY_SCORE This is the default score floor for fuzzy expansions. New for Oracle8i is the ability to limit the fuzzy expansion to the best matches. The fuzzy score is a measure of how close the expanded word is to the query word -- higher is better. Setting fuzzy score means that fuzzy expansions below this score will not be produced. You can set this from 0 to 80.
FUZZY_NUMRESULTS This is the default maximum number of fuzzy expansions. Setting fuzzy numresults limits the fuzzy expansion to a certain number of the best matching words. You can set this up to 5000.

The STORAGE Class

The storage class holds storage parameters for the index tables. The Text index is made up of five oracle objects, so the single storage clause in the create index statement is insufficient to specify storage. Storage preferences are plugged into indexes using the storage keyword:

    create index
    ...
    parameters ('storage mystore');
There is only one storage object: BASIC_STORAGE, with the following attributes:

I_TABLE_CLAUSE This attribute is the storage parameters for the I table, which is the main index table. There will be at least one row per unique word in your index -- probably multiple rows per word. Each row has a BLOB which tries to stay inline, making the row a bit large. The value is simply tacked on to the end of the create table statement, so you could set it like this:
      ctx_ddl.set_attribute('mystore', 'I_TABLE_CLAUSE', 
                            'tablespace foo storage (initial 1K)');
K_TABLE_CLAUSE This attribute is the same, but for the K table, which is an IOT for rowid to docid conversion. There will be one row per row in your base table. Each row consists of a number and a rowid.
R_TABLE_CLAUSE This attribute is the same, but for the R table, which is a BLOB table for docid to rowid conversion. There will be only a few rows in this table, and will have a large BLOB in each row.
N_TABLE_CLAUSE This attribute is the same, but for the N table, which has invalid docids waiting for garbage collection. The number of rows in this table will depend on the frequencies of update and delete DML, and of garbage collection optimization. Each row is short, holding two numbers.
I_INDEX_CLAUSE This attribute is the same, but for the unique index on the I table.

Indexing Memory

Those of you familiar with previous versions may be wondering where the index memory setting is. For new users, interMedia builds an inverted index, so documents are not added to the index one at a time, like a b-tree index. Instead, information from multiple documents is buffered up and periodically flushed to the index during indexing. In previous versions, you could set the size of this buffer in the analogue of the storage class. Setting this buffer higher reduces index fragmentation but uses more memory.

In Oracle8i, index memory can be set directly in the parameters string. This allows you to set index memory on a per-operation basis. The keyword used is MEMORY:

    create index
    ...
    parameters ('memory 2M');

Here a paltry 2 megabytes of memory is used for indexing. If not specified, the system parameter DEFAULT_INDEX_MEMORY is consulted for the systemwide default index memory. At install time, this is 12M.

The amount of memory cannot exceed system parameter MAX_INDEX_MEMORY. This allows the system administrator to disallow outrageous amounts of index memory.

The MEMORY keyword is also allowed in the alter index parameter string for batch DML (background DML uses DEFAULT_INDEX_MEMORY). This allows you to have different memory sizes for create index and DML, something which is not possible in previous versions. You simply specify a different memory amount in the create index parameter string then in the alter index parameter string.

DML

DML Processing

Most text search engines have asynchronous document updates and additions -- changes to the index are usually done in batches, after and separately from the document changes. This is due to three aspects of inverted indexing:

  1. Text indexing a single document is a lot of work. The values are usually long, and at the very least you have to break it into words. Depending on the objects, you may also have to do web page requests, filtering, and HTML parsing.
  2. Inverted indexes, composed of lists of documents by word, are best updated in batches of documents at a time. Updating the index one document at a time leads to either word lists one document long or a lot of appending to existing lists.
  3. Most text applications are fairly static, having relatively lower DML frequency, and less need for full consistency semantics than traditional transaction processing systems.
interMedia faces the same challenges. Here is what it does for each DML on an indexed column:

INSERT The document rowid is placed into a queue, dr$pending, for later addition to the text index. Queries before this DML is processed will not find the new document contents.
UPDATE The old document contents are invalidated immediately, and the document rowid is placed into the dr$pending queue for later reindexing. Queries before this DML is processed will not find the old contents, but neither will it find the new contents.
DELETE The old document contents are invalidated immediately. No further DML processing is required -- queries will no longer find the document.

Thus, interMedia has synchronous invalidation but asynchronous addition. This extends to transactions, as well:

  delete from table where contains(text, 'delword') > 0;

  select * from table where contains(text, 'delword') > 0;
  --> you get no rows returned

  rollback;

  select * from table where contains(text, 'delword') > 0;
  --> you get the rows returned
and:
  insert into table values (1, 'insword');
 
  select * from table where contains(text, 'insword') > 0
  --> you do not get the row

  commit;

  select * from table where contains(text, 'insword') > 0
  --> you still do not get the row
The synchronous invalidation is new to this version. Also new is that the DML notification is done through integrated kernel code, and not through triggers as in previous versions.

Processing Additions

So how do you process your inserts and updates when they are sitting in the queue? interMedia provides two methods: sync and background.

Sync is the manual way -- you control when and how it's invoked. Each invocation processes all pending inserts and updates for an index:

    alter index myindex rebuild online parameters ('sync')
The ONLINE keyword is very important. Without this, during the sync operation queries are blocked and DML on the base table fails. Make sure to use it for sync operations.

Background DML requires you to start a ctxsrv background daemon in the operating system:

    ctxsrv -user ctxsys/<password>
Once started, the daemon polls the dr$pending queue for DML, and automatically processes additions as they become available.

You can use either or both -- it's largely a matter of your application requirements. Background DML scans for DML constantly. This leads to new additions being indexed automatically and quickly. However, it also tends to process documents in smaller batches, which increases index fragmentation. Sync DML is done at the user's request, so usually the batches are larger and thus there's less index fragmentation. A happy compromise could possibly be reached by invoking sync through dbms_sql in a dbms_job which runs at longer intervals.

The DML Queues

The dr$pending queue holds the documents waiting to be indexed. It should be queried through the ctx_user_pending view, which makes it more readable. Additionally, there is a dr$waiting queue which is used when documents are waiting to be indexed and they are updated again. The drq_inprog, drq_batches, and drq_batchno tables from previous versions are all no longer needed.

Optimization

Index Fragmentation

The Text index is an inverted index, so essentially consists of words and lists of documents which contain that word. When new documents are added to this index, these lists have to be extended. Instead of appending to these lists, more rows for the word are created. However, this can lead to many rows with sub-optimal, short lists -- called "index fragmentation". For instance, say we have the following in our index:

    DOG   DOC 1 DOC 3 DOC 5
Now a new document with "dog" comes along. Simply add it to the index as a new row:
    DOG   DOC 1 DOC 3 DOC 5
    DOG   DOC 7
Subsequent DML will also create new rows:
    DOG   DOC 1 DOC 3 DOC 5
    DOG   DOC 7
    DOG   DOC 9
    DOG   DOC 11
This is why background DML generally produces more index fragmentation than spot invocation of sync: processing DML less frequently in larger batches allows newly added rows to have longer lists, which reduces the number of rows in the index table.

Keep in mind that a certain amount of fragmentation is expected in 8i. interMedia will try to keep index table rows below 4000 bytes long, to inline the LOB and to speed up index row merges used during phrase search, AND, section search, etc. However, frequent DML will probably result in much more fragmentation than this.

Document Invalidation

During an update or delete, the old document information must be removed from the index so that queries will no longer hit it. However, because the inverted index consists of words and rows of occurrences, this would entail finding all words in the old version of the document (which may not be available), and removing an occurrence from the lists of those rows in the index table. This is too much work to do synchronously, and deletes must be synchronous in the integrated model.

Instead, interMedia marks the old document contents as invalid and does not touch the index. The mark signals queries to remove the document from any query result sets. However, this leaves the old information in the index, taking up space in the index tables.

Optimization

To solve these potential problems in the index, you run optimization. Optimization has two modes: FAST and FULL. FAST optimization targets fragmentation only:

    alter index myindex rebuild online parameters ('optimize fast');
The ONLINE keyword is important -- without it, queries are blocked and DML fails with an error. FAST optimization runs through the whole index table and glues fragmented rows together, to a maximum of 4000 bytes for a single row. This reduces the number of rows in the index table.

FULL optimization does both defragmentation and garbage collection, which removes the old information left over after document invalidation:

    alter index myindex rebuild online parameters ('optimize full');
It loops through each row of the index table, but, unlike fast optimization, cracks open each row's list, removing old document information. It also glues separate rows together where possible.

Because garbage collection is more involved and time-consuming than defragmentation, FULL optimization does not have to be run on the whole table at one time. Instead, you can run it for a limited period of time:

    ... parameters ('optimize full maxtime 5')
which means run for a maximum of 5 minutes. At the end of 5 minutes, it will stop. The next time you run optimization, it will pick up where it left off. This lets you do a little of bit optimization each night during off times, and ensure that it is not loading the system when the off time ends.

Garbage collection is needed only after document invalidation, which happens only for delete and updates. If your document table is append-only, meaning that documents are inserted and never changed or removed, then it should suffice to run only defragmentation.

Optimization Concurrency

Unlike previous versions, optimization (either mode) does not block background DML processing. However, because Oracle prevents alter index operations from running concurrently, sync invocations will be prevented.

Querying

The Basics

At this point all aspects of building an index and keeping it up to date have been discussed. However, indexes are a means to enabling content- based queries.

IMPORTANT: Unlike a value query, which is slower without a b-tree index, contains queries are completely disallowed without a Text index.

You do a query using the contains operator:

    select id
      from texttab
     where contains(textcol, 'query') > 0
The first argument to contains is the name of the text column. The second is the text query, which is limited to 2000 bytes in length. It returns a number, which indicates the strength of match. 0 means not a match, so it is constrained to > 0. The result is that this query finds the id's of all rows in texttab where the textcol has the word "query".

Scoring

Of course, now you're wondering how to use that return value in the select list and order by clause. An operator is provided just for that purpose:

    select id, score(1)
      from texttab
     where contains(textcol, 'query', 1) > 0
     order by score(1) desc
The "1" in the score and contains operators is called the "contains label" and it's a number which links the score in the select list to the contains in the where clause -- it can be any number, so long as it matches.

Score can be between 0 and 100, but the top-scoring document in the query will not necessarily have a score of 100 -- scoring is relative, not absolute. This means that scores are not comparable across indexes, or even across different queries on the same index. Score for each document is computed using the standard Salton formula:

    3f(1+log(N/n))
Where f is the frequency of the search term in the document, N is the total number of rows in the table, and n is the number of rows which contain the search term. This is converted into an integer in the range 0 - 100.

Query operators like AND and OR operate on the scores of their operands.

Where Can I Use contains?

Unlike the transparent query rewrite of previous versions, you can include a contains clause in any SQL statement using any tool which can issue oracle SQL. Here is a short list of new places where contains queries can pop up:

  • subqueries
  • virtual tables
  • PL/SQL
  • View definitions
  • DML (insert as select, e.g.)

There is no longer a limitation on the length of the contains SQL statement, although there is a limit on the query term length.

Other Query Methods

For those of you familiar with previous versions, there is no two-step or text cursors (n-memory query). If you absolutely must have result tables, use insert as select. Instead of text cursors, use real PL/SQL cursors. To sort by score, use an order by clause with a FIRST_ROWS hint. The hint will avoid an oracle SORT ORDER BY.

However, interMedia still supports count_hits. count_hits is a fast way to get the hit count for a text query, in PL/SQL:

    num_hits := ctx_query.count_hits('indexname','query',TRUE);

The first argument is the name of the index. This should be a string, so use quotes if you are using a literal value. The second argument is the query string, and the third is a boolean toggle called "exact". If exact is TRUE, then the result is accurate. If exact is FALSE, then the result is only an upper bound. The actual count could be lower. exact FALSE is faster, because it doesn't screen out invalid documents (see issues on DML and optimization for more details).

You could also do "select count(*)" with a contains clause, and this is recommended when you have structured conditions, but count_hits is faster for text-only queries.

Simple Queries

The simplest query is a single word:

    contains(text, 'dog') > 0
You can escape your word with curlies:
    contains(text, '{dog}') > 0
This is useful when you want to query on a reserved word or your word has special characters which are query operators. The list of reserved words and the query operator characters are in the documentation.

You can query on a phrase just as easily: