Oracle interMedia Text 8.1.7 - Technical Overview

Oracle8i interMedia Text 8.1.7 - Technical Overview

Garrett Kaminaga, Product Development, Database and Application Server Technologies, Oracle HQ

This is a technical overview of the improvements found in interMedia Text, version 8.1.7. This is intended for an audience familiar with version 8.1.6. If you are new to interMedia Text, please start with the 8.1.5, then 8.1.6 technical overviews.

This document is only a summary of new features and internal changes. As always, please refer to the 8.1.7 interMedia Text documentation addendum for detailed information.


Table of Contents


Overview of 8.1.7 Improvements

The improvements in version 8.1.6 of interMedia Text were mainly in the areas of structured document support and heterogenous document sets. While 8.1.7 includes some improvements which extend heterogenous document set handling, the focus is performance and ease-of-use.

For performance, 8.1.7 offers a new indextype which offers fast mixed query for small, simple text fields, prefix indexing for faster wildcard searching, and many internal tweaks -- suggested by exhaustive lab analysis on real-world data -- that improve query performance.

For ease-of-use, 8.1.7 has migrated entirely to the oracle shadow process, eliminating the need for extproc setup. It also provides a new datastore for easy multi-column concatenation, primary-key-less indexing, and many other changes designed to reduce the amount of work needed to create and maintain an interMedia Text application.

Other interesting features include a fuzzy score passthrough, a new tokenizing document service, and German composite query improvements.


Upgrading 8.1.6 to 8.1.7

Automated Upgrade

interMedia Text is not integrated with Oracle's Database Migration Assistant tool (DBMA) in this version. If you have a prior version of interMedia Text installed, you must upgrade manually.

Manual Upgrade From 8.1.5

This process assumes that you are starting with a valid 8.1.5 ctxsys data dictionary, and that the system is not being used for index, DML, or query. Please use SQL*Plus to run all scripts, and NOT svrmgrl.
  1. run ?/ctx/admin/upgrade/s0801060.sql as SYS.
    This grants new, needed database privileges to ctxsys
  2. run ?/ctx/admin/upgrade/u0801060.sql as CTXSYS.
    This upgrades the ctxsys schema to 8.1.6
Now follow the steps below for upgrade from 8.1.6 to 8.1.7.

Manual Upgrade From 8.1.6

This process assumes that you are starting with a valid 8.1.6 ctxsys data dictionary, and that the system is not being used for index, DML, or query. Please use SQL*Plus to run all scripts, and NOT svrmgrl.
  1. run ?/ctx/admin/upgrade/s0801070.sql as SYS.
    This grants new, needed database privileges to ctxsys
  2. run ?/ctx/admin/upgrade/u0801070.sql as CTXSYS.
    This upgrades the ctxsys schema to 8.1.7
  3. run ?/ctx/admin/dr0typec.pkh as CTXSYS.
    This creates the CTXCAT index type, which is new for 8.1.7
  4. run ?/ctx/admin/dr0pkh.sql as CTXSYS.
    This re-creates all public and private package headers
  5. run ?/ctx/admin/dr0plb.sql as CTXSYS.
    This re-creates all public and private package bodies
  6. run ?/ctx/admin/dr0type.plb as CTXSYS.
    This re-creates the CONTEXT index type body
  7. run ?/ctx/admin/dr0typec.plb as CTXSYS.
    This creates the CTXCAT index type body, new for 8.1.7
  8. run ?/ctx/admin/dr0itypc.sql as CTXSYS.
    This creates the bindings and index type for CTXCAT, new for 8.1.7
  9. check for any invalid ctxsys objects and alter compile as needed

Upgrading Existing Indexes

Nothing needs to be done to upgrade existing 8.1.6 context indexes. These should still work without any problems in 8.1.7. However, you may choose to re-build the $X index on the $I table. Indexes built with 8.1.7 will have a $X index comprised of 5 columns. Indexes that were built with 8.1.5 or 8.1.6 have a $X index comprised of 3 columns. While a $X with the old definition will still work, we believe the new $X index offers better performance for large data sets.

A utility script -- ?/ctx/sample/util/dr817rx.sql -- is provided to rebuild existing $X indexes. It should be run as ctxsys, and should be passed the index owner and index name, as in:

  sqlplus ctxsys/ctxsys
  ...
  SQL> @dr817rx scott myindex
Please ensure that ctxsys' TEMP tablespace has sufficient space to re-create the $X index.


Catalog Index

What is the Catalog Index?

The context index type is designed for collections of news articles, technical reports, resumes -- relatively large, coherent documents. This does not describe the majority of database text, which is usually short text fragments -- names, item descriptions, addresses, etc.

While a context index can be applied to these text fragments, it's usually overkill. Filtering, section searching, relevance ranking, and the rest of the rich feature set that makes sense when applied to document collections are less useful when applied to text fragments. Furthermore, the maintenance demands of a context index -- asynchronous DML, optimization -- are too troublesome for documents that are just a few words long. As a result, these text fragments are usually unindexed or b-tree indexed, limiting search to exact equality or prefix searches with LIKE.

Text fragments need text searching. Since the context index is not optimal for fragments, we designed one that is -- a catalog index. Using a different index structure and sacrificing some of the context index's flexibility and feature set, the catalog index provides text indexing that is free from maintenance overhead and text searching that has structured query support and faster response time than the context index.

The description of the catalog index type in this section is a high-level overview. If, after reading this section, you find yourself wondering why ctxcat has a particular limitation, or how it all works, please refer to "Ctxcat Index Internals". If you would like to get an idea of ctxcat indexing speed, required space, or query performance, please see "Ctxcat Performance Overview". Both are separate documents which should be available from wherever you got this document.

Creating a CTXCAT Index

As an example in this chapter of the overview, we'll use an on-line auction site which holds its items in a table like this:
  create table auction (
    item_id           NUMBER   PRIMARY KEY,  -- auction item identifier
    item_desc         VARCHAR2(80),          -- free-form item description
    price             NUMBER,                -- current price of the item
    end_date          DATE                   -- end time of the auction
  );
We would like to catalog-index the ITEM_DESC field. The catalog index is a domain index type just like the context index. This means that to create a catalog index, simply use create index and specify the catalog index type (CTXCAT):
  create index auctionx on auction(item_desc)
  indextype is ctxsys.ctxcat;
You can specify preferences using a parameters clause, just like the context index. However, the keywords you can use in a ctxcat index parameters string are limited to:
  • INDEX SET
  • LEXER
  • MEMORY
  • STOPLIST
  • STORAGE
  • WORDLIST
Index set is new and specific to ctxcat -- we'll discuss this a bit later. ctxcat does not support the datastore, filter, or section group keywords, so it is limited to plain-text, unstructured data stored directly in database columns. Language, format, and charset columns are also not supported, meaning that the data must be homogenous. By extension, this also disallows use of the multi-language stoplist and multi-language lexer.

The CTXCAT Query Language

Once the ctxcat index is created, you query using the operator catsearch instead of contains:
  select item_desc from auction 
   where catsearch(item_desc, 'oracle', null)>0;
This finds all rows in auction which have "oracle" in the item_desc text. Like contains, the first argument is the text column, and the second is the query string. The third argument we'll talk about below -- just ignore it for now. There is no score label argument, because catsearch does not provide scoring -- it is a simple boolean search.

The catsearch operator uses a much simpler query language than contains, with the intent that web-style queries can be passed through to catsearch without extra parsing. The query language rules are:

  • Multiple words are treated as an AND.
  • Vertical bar is used for OR.
  • A leading plus sign on words is ignored and discarded
  • A leading minus sign excludes words (must have included words)
  • Double quotes delimit phrases.
  • Parentheses group operations.
Or, in comparison:
  catearch Query  Equivalent contains query
  --------------  -------------------------
  A B             A & B
  A | B           A | B
  +A -B           A ~ B
  "A B"           A B
  "A B" C         (A B) & C 
  (A B) | C       (A & B) | C
AND, OR, AND-NOT, and PHRASE are the only functions currently available in the catsearch query language. catsearch does not support any of the other features found in the context query language such as accum, within, fuzzy, stem, wildcard, thesaurus, near, etc.

Index Maintenance

Unlike context indexes, ctxcat indexes require no special maintenance. DML is completely transactional. On an insert, the new row is immediately indexed and the information inserted into $I. Newly inserted rows can be found by catsearch queries immediately -- even before they are committed. Deletes and updates are also transactional. The transactional DML processing means that a ctxcat index will cause DML on the base table to slow down. This is just the effect of a b-tree index, but the impact of a ctxcat index is much larger.

Optimization is also unnecessary for ctxcat indexes. The index structure ensures that it is always largely optimal.

Structured Query Support

The ctxcat index also provides structured query support, so that query criteria for other columns can be efficiently combined with the text search. Let's continue our auction example. A plausible requirement is to allow people to limit their auction search to a price range. We could certainly add this criterion to the top-level query:
  select item_desc from auction 
   where catsearch(item_desc, 'oracle', null) > 0
     and price < 15
and it will work, but may be slow as the text results are individually post- filtered to verify the price range. It would be much more efficient to process both the text and range criteria at once - just like a concatenated b-tree index on two columns is faster than two separate indexes when querying on two column criteria. ctxcat indexes support something similar using index sets.

We start by creating an index set:

  ctx_ddl.create_index_set('aucset');
An index set, like a stoplist, is only a holding object. We have to add content in order to define its behavior. Index sets hold indexes. Each index is an ordered list of base table columns which will be used in mixed query. In our case, we want to perform a mixed query with price criteria, so we add an index on price to the index set:
  ctx_ddl.add_index('aucset','price');
The index set can take up to 99 indexes, each made up of ordered lists of columns from the base table. However, there are two important restrictions on these columns:
  • only allowable types are NUMBER, DATE, and CHAR/VARCHAR2 with maximum length <= 30 bytes
  • NULL values are not allowed. A NULL in a user column will cause an index error and the row will not be indexed.
So far we have created our index set and added a "price" index. When we use this index set when we create our index:
  create index auctionx on auction(item_desc)
  indextype is ctxsys.ctxcat
  parameters ('index set aucset')
ctxcat includes the "price" column of the base table in the text index. This is not the same as the widely-used context technique of concatenating multiple columns' data using user_datastore. ctxcat does it quite differently -- see technical description document more details.

Once price is part of the $I index, we can reference it in the third parameter of catsearch:

  select item_desc from auction 
   where catsearch(item_desc, 'oracle', 'price < 15') > 0
which allows the ctxcat index to process both the text query and structured predicate at the same time.

Note that you can't put just any structured criteria in there. It is restricted and simplified. The restrictions are:

  1. left-hand side (the column name) must be a column named in at least one of the indexes of the index set
  2. left-hand side must be a plain column name no functions, expressions, etc.
  3. operator is limited to: <, <=, =, >=, >, BETWEEN, and IN
  4. right-hand side must be composed of literal values no functions, expressions, other columns, subselects, etc.
  5. tiple criteria can be combined with AND. OR is not supported.
  catsearch(text, 'dog', 'foo > 15')
  catsearch(text, 'dog', 'bar = ''SMITH''')
  catsearch(text, 'dog', 'foo between 1 and 15')
  catsearch(text, 'dog', 'foo = 1 and abc = 123')
And these are NOT supported:
  catsearch(text, 'dog', 'upper(bar) = ''A''')
  catsearch(text, 'dog', 'bar LIKE ''A%''')
  catsearch(text, 'dog', 'foo = abc')
  catsearch(text, 'dog', 'foo = 1 or abc = 3')
Of course, this is still a database, so if you really need them you can always move these more sophisticated criteria out of the catsearch and back in the main select -- it may be a bit slower, but the query can be processed.

You can also use the third argument to order the results:

  select item_desc from auction 
   where catsearch(item_desc, 'oracle', 'order by price') > 0
Note that this query does not have an order by clause. The rows will come back from catsearch in price without it. In fact, adding an order by clause will actually slow the query down. The third catsearch argument is not integrated with the oracle SQL query processor, so it will not be aware that the rows coming back from catsearch are already in order. As a result, it will go through an unneeded SORT ORDER BY.

You can combine structured criteria and order by, and also use the ASC and DESC keywords, as long as all columns are going in the same direction:

  catsearch(text, 'dog', 'order by foo desc, bar desc')
  catsearch(text, 'dog', 'foo = 15 order by bar')
Some combinations of structured criteria and order by might not be able to be processed efficiently by the catsearch index. When this happens, the query will fail with the error:
  DRG-10844: index cannot execute this structured predicate efficiently
This can happen when, for instance:
  • a column not in the index set is referenced
  • order by columns are going in different directions
  • the order by column list is not found in the index set
as well as other, more arcane occasions. When this error occurs, you can simply move parts of the structured criteria or order by out of the third argument and back in the main select.

Creating Catalog Indexes -- Further Notes

The catalog index does not support document services (highlighting, markup, themes, gists) or query services (explain, query feedback, browse words). Support may be added in a future release.

The catalog index does not support theme indexing -- if given a lexer preference with INDEX_THEMES set, it will ignore this setting.

In order to support Japanese, the catalog index must be built with prefix indexing -- this feature is new for 8.1.7 and is covered later in this document.

The catalog index has its own set of system parameters for setting preference defaults. This allows an installation to have one set of default preferences for context indexes, and another set for ctxcat indexes. The new system parameters are:

  • DEFAULT_CTXCAT_LEXER
  • DEFAULT_CTXCAT_STOPLIST
  • DEFAULT_CTXCAT_WORDLIST
  • DEFAULT_CTXCAT_STORAGE
  • DEFAULT_CTXCAT_INDEX_SET
While you can specify a wordlist preference for ctxcat indexes, most of the attributes do not apply, since the catsearch query language does not support wildcarding, fuzzy, and stemming. The only attribute of the wordlist preference that is useful is PREFIX_INDEX, for Japanese data.

ctxcat uses a $I table to store the text information, just like context. However, ctxcat $I tables have many more rows and take up much more space than context $I tables for the same data. Ensure that the indexing session has sufficient rollback and adequate storage space.

Index memory can be decreased to reduce rollback usage. ctxcat indexes have a different structure than context indexes, and do not produce sub-optimal indexes with a low index memory setting. ctxcat indexing time is also unaffected by the indexing memory unless set very low.

The ctxcat index will create several regular (b-tree) indexes on its $I table. Because they are created in a PL/SQL stored procedure owned by CTXSYS, these b-tree indexes are created using the temporary tablespace of CTXSYS. Ensure that this is set to a tablespace with sufficient working space.

For further technical and quantitative information, see the separate documents "Ctxcat Index Internals" and "Ctxcat Performance Overview".

How to Choose

In most cases, the choice between ctxcat and context indexes will be easy to make. The limitations of ctxcat preclude its use in many situations, while its zero maintenance and structured query support make it clearly superior in others. Use ctxcat for text fragments and context for information retrieval. ctxcat and context are complements, not competitors.

There is possible overlap is on medium-size (500 - 2000 bytes) unstructured plain text documents which need very simple text searching and structured query support -- resumes which need to be keyword searched and ordered by date, for example. The application needs to balance context's indexing speed and storage efficiency against ctxcat's structured query support. Unless the performance of the context index is unacceptable, the context index should probably be preferred in such cases. While the application may be able to bear ctxcat's limitations now, context offers more flexibility if the application requirements expand in the future.

Finally, if you simply cannot decide, you can always elect to create both. Because they are different index types, it is possible to create both a context and a ctxcat index on the same column.


Datastore Improvements

Multi-Column Datastore

Many customers use the user datastore to concatenate multiple table columns into a single virtual document, for more efficient mixed query. Some cases where user datastore is used in this way are good candidates for the ctxcat index type, discussed above, but many will need to stick with the context index for its more powerful query language and more extensive feature set. (The question of ctxcat vs. context section search is addressed further in "Ctxcat Performance Overview")

Since concatenation of multiple base table columns is such a popular use of the user datastore, we've created a datastore specifically for this purpose: the multi_column_datastore.

Using the multi column datastore is easier than doing the same thing in the user datastore -- you simply create your preference:

  ctx_ddl.create_preference('mymulti','MULTI_COLUMN_DATASTORE');
then set your column list:

ctx_ddl.set_attribute('mymulti', 'COLUMNS', 'column1, column2, column3'); create the index using the multi-datastore:

  create index myindex on mytable(columnx)
  indextype is ctxsys.context
  parameters('datastore mymulti')
and at indexing time, the engine will create a virtual document from all the specified columns like this:
  
  column 1 contents here
  
  
  column2 contents here
  
  etc.
Note that the indexed column is NOT automatically included in the document. Only the specified column list columns are used to create the document. This format is ideal for sectioning with the basic section group, although a section group is NOT automatically created when creating or using a multi- column datastore preference.

The column list can contain expressions, function calls, etc.:

  ctx_ddl.set_attribute('mymulti', 'COLUMNS', 
                        'col1, col2 + 5, someowner.somefunction(col3) col3');
The SQL is executed as user ctxsys, so functions should be fully qualified with owner name.

The column list supports most atomic column types. NUMBERs and DATEs are converted to text using the default format mask. TO_CHAR can be used in the column list to use a different format mask. RAW and BLOB values are not converted to hex, but are directly concatenated as binary data. LONG, LONG RAW, NCHAR, NCLOB, object and collection columns are not supported.

Because the column list can contain expressions, there are no triggers created on the columns in the column list - updates must be propagated by application code or through user-defined triggers.

Because the column list can contain function calls, and because the executing user is CTXSYS, the multi datastore must be securely controlled. Only CTXSYS is allowed to create preferences using MULTI_COLUMN_DATASTORE, although once created any user can use it to create an index. If this is too restrictive, the DBA can create a stored procedure owned by ctxsys which calls create_preference. This stored procedure can then be called by any user to create MULTI_COLUMN_DATASTORE preferences.

MULTI_COLUMN_DATASTORE is implemented entirely in C, so avoids the PL/SQL and temporary lob overhead of user datastore. As a result, it indexes faster, but is less powerful since it is limited to expressions on the columns of the base table. The user_datastore is required for more complex logic or multiple table data concatenation.

Safe Callouts Allowed in USER_DATASTORE

interMedia Text no longer uses the safe callout (extproc) mechanism for indexing in 8.1.7. As a result, user_datastores can now call safe callouts, which was not allowed before.

Optional FILE/URL Security

FILE and URL datastores access files on the database machine. This may be undesirable for more sensitive sites, since any user can browse the file system accessible to the oracle user using these datastores. For those sites, 8.1.7 introduces the FILE_ACCESS_ROLE system parameter. The DBA can set this system parameter to the name of a database role. If set, any user attempting to create an index using FILE or URL datastores must have this role, or the index creation will fail. For instance, if the DBA does:
  ctx_adm.set_parameter('FILE_ACCESS_ROLE','WHITEHAT');
then when user SCOTT tries to do:
  create index foox on foo(text) indextype is ctxsys.context
  parameters('datastore ctxsys.file_datastore')

the engine will test if SCOTT has the role WHITEHAT. If he does, then the create index will proceed as normal. If not, then the create index will fail. Note that this is checked only at create index time. Setting this parameter or granting/reovking the named role has no effect on existing indexes using the file datastore.

Username/Password Support in URL Datastore

FTP protocol URL's in the URL datastore now support embedded username and password, as in:
  ftp://username:password@ftp.hostname.com/dir/file.doc


Filter Improvements

PL/SQL Procedure filter

The user filter object currently employs an external executable which is invoked once for each document. This can be slow because each invocation is burdened by shell overhead and startup costs. The PL/SQL procedure filter replaces the external executable interface with a PL/SQL interface, not unlike the user datastore. With the integration of indexing into the oracle shadow process, safe callout user filters are now supportable. Furthermore, a PL/SQL interface allows PL/SQL or Java stored procedure filtering, which is useful in a wide variety of situations.

When creating a procedure filter, you must specify the procedure name in the PROCEDURE attribute:

  ctx_ddl.create_preference('myfilt','procedure_filter');
  ctx_ddl.set_attribute('myfilt','procedure','normalize');
The procedure filter is subject to the same security constraints as the user datastore. The procedure must be owned by CTXSYS, and any user using the preference to create an index must have been granted EXECUTE permission on the stored procedure.

By default, the filter procedure should have two parameters. The first parameter is a BLOB and is the incoming binary version, provided by the datastore. The second parameter is a CLOB and is the outgoing text version produced by the filter. So, from the example above, we are expecting a procedure like this:

  PROCEDURE NORMALIZE(IN BLOB, IN OUT NOCOPY CLOB)
The input and output types can be changed, using the INPUT_TYPE and OUTPUT_TYPE attributes. For instance:
  ctx_ddl.set_attribute('myfilt','input_type','clob');
  ctx_ddl.set_attribute('myfilt','output_type','varchar2');
specifies a signature of (IN CLOB, IN OUT NOCOPY VARCHAR2) for the filter procedure.

In addition to BLOB, CLOB, and VARCHAR2, both input and output type can be set to FILE. When input is set to FILE, the incoming document will be spooled to a file and the filename passed in as the first argument. When output is set to FILE, interMedia Text will pass a filename in as the second argument. Your filter procedure is expected to place the filtered content into that file. The type of the procedure parameter should be VARCHAR2. This facility is useful when the filter procedure is actually a safe callout, because dealing with OS files is a lot simpler than accessing LOBs.

The filter procedure can also request that the rowid, charset, or format columns be passed in. For instance:

  ctx_ddl.set_attribute('myfilt','ROWID_PARAMETER','TRUE')
tells interMedia Text to pass the rowid in as the first parameter, followed by the input, followed by the output. ROWID_PARAMETER, CHARSET_PARAMETER, and FORMAT_PARAMETER are all independent. The order is rowid, then format, then charset, but the filter procedure is passed minimal parameters. For instance:
  ctx_ddl.create_preference('myfilt','procedure_filter');
  ctx_ddl.set_attribute('myfilt','procedure','normalize');
  ctx_ddl.set_attribute('myfilt','input_type','clob');
  ctx_ddl.set_attribute('myfilt','output_type','varchar2');
  ctx_ddl.set_attribute('myfilt','rowid_parameter','TRUE')
  ctx_ddl.set_attribute('myfilt','charset_parameter','TRUE')
is expecting CTXSYS.NORMALIZE to have the signature
  (IN ROWID, IN VARCHAR2, IN CLOB, IN OUT NOCOPY VARCHAR2)
The first parameter is the document's rowid, the second is the value of the document's charset column, the third is the incoming document content, and the fourth is the outgoing document content.

Note that the INSO filter still uses the external executable method of filtering, but this may change to a safe callout in a future release.


Section Group Improvements

8.1.7 contains no changes to section groups or section searching. However, this area will get a lot more attention in the next release.


Lexer Improvements

Mixed Case and Composite

8.1.7 basic lexer now allows the MIXED_CASE to be FALSE when COMPOSITE is set. In past versions, setting composite forced a mixed case index.

Improved Composite Indexing/Query

Previously, when using German or Dutch decompounding, any word more than 5 characters long was sent through decompounding at index time. At query time, each query term was decompounded and a search done on the composites. This resulted in poor precision -- the excessive decompounding and normalization meant that any single query term could match many other words, with no way to limit the search to exact match.

8.1.7 limits index decompounding to real compound words, and eliminates decompounding at query time. This means that queries are now by default exact match, so are much more precise.


Stoplist Improvements

Multi-stoplist

interMedia Text 8.1.6 introduced the multi-lexer, which allowed storing documents of different languages in a single table. However, the index was limited to a single stoplist, so cross-language homographs (words from different languages which are spelled the same but mean different things - e.g. DIE in German and English, OR in French and English) could not be added to the stoplist, or the stoplist could unintentionally eliminate meaningful words.

Version 8.1.7 introduces a multi-language stoplist. You create a multi- language stoplist preference using the MULTI_STOPLIST object:

  ctx_ddl.create_preference('mystop','MULTI_STOPLIST');
When adding stopwords, each stopword's language must be identified as the second parameter to add_stopword:
  ctx_ddl.add_stopword('mystop', 'Die', 'german');
  ctx_ddl.add_stopword('mystop','Or','english');
At indexing time, the language column of each document is examined, and only the stopwords for that language are eliminated. At query time, the session language setting determines the active stopwords, just like it determines the active lexer when using the multi-lexer.


Wordlist Improvements

Prefix Indexing

Wildcard queries in 8.1.6 are handled by expanding the input pattern to matching tokens in $I using LIKE, then running a query on an equivalence of those matches (if you aren't familiar with equivalence, think of it as an OR). For instance, a query like
  contains(text, 'wood%')
would first go through a query like:
  select distinct token_text
    from dr$index$i
   where token_text like 'WOOD%'
to expand the pattern. Let's say it returns wood, wooden, woody, and woods. The query would then be re-written to something like
  contains(text, 'wood = wooden = woody = woods')
This is problematic for three reasons.
  • The expansion step may take a long time if there are many matching terms or if the terms have many rows in $I.
  • The pattern may match too many words, and the query will fail with "query too complex".
  • If there are many individual tokens, it will take a long time to read and merge all their information.
So, for those applications which need faster wildcard searching, 8.1.7 adds prefix indexing. When prefix indexing on:
  ctx_ddl.create_preference('mywordlist', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('mywordlist','PREFIX_INDEX', 'YES');
each token is chopped into multiple prefixes at index time, and each prefix indexed as token type 6. For example, a document with "TOKEN" would result in the following $I rows:
  TOKEN_TEXT      TOKEN_TYPE TOKEN_INFO
  --------------- ---------- ----------
  TOKEN           0          DOC 1 OFFSET 1
  T               6          DOC 1 OFFSET 1
  TO              6          DOC 1 OFFSET 1
  TOK             6          DOC 1 OFFSET 1
  TOKE            6          DOC 1 OFFSET 1
  TOKEN           6          DOC 1 OFFSET 1
When a prefix wildcard query is executed:
  contains(text, 'TOK%')
interMedia Text simply removes the trailing wildcard and looks up the information for token "TOK" type 6. No LIKE expansion or multiple token merging is needed. The result is that prefix queries are executed much more quickly and there is no limit on the number of expansions. Incidentally, this is why TOKEN, type 6 must be indexed -- this allows a search for TOKEN% to find TOKEN.

The price for the query performance improvement is increased index size and time. This can be controlled using optional settings for minimum and maximum length. For instance, with:

 ctx_ddl.set_attribute('mywordlist','PREFIX_MIN_LENGTH', 3);
 ctx_ddl.set_attribute('mywordlist','PREFIX_MAX_LENGTH', 4);
Only prefixes between 3 and 4 characters in length are indexed. The token TOKEN results in
  TOKEN_TEXT      TOKEN_TYPE TOKEN_INFO
  --------------- ---------- ----------
  TOKEN           0          DOC 1 OFFSET 1
  TOK             6          DOC 1 OFFSET 1
  TOKE            6          DOC 1 OFFSET 1
A wildcard search whose pattern is below the min length or above the max length resorts to the old style of resolution (i.e. expand the pattern using LIKE and merge the results using equivalence).

Prefix indexing should be especially helpful in Japanese environments, where queries are often translated under the covers to prefix queries, and whose V-GRAM tokens are fairly short to begin with.

Prefix indexing is supported for ctxcat indexes. However, catsearch does not have a wildcard operator, so setting it is not usually useful. If ctxcat indexing with Japanese VGRAM lexer, though, prefix indexing MUST be on -- it is required for query processing.

Prefix indexing and substring indexing (introduced in 8.1.6) are separate and independent. Prefix indexing speeds up right-truncated wildcard queries only (e.g. ABC%), while substring indexing speeds up left- and double-truncated wildcard queries (e.g. %ABC, %ABC%). Prefix indexing gets a speed boost by eliminating expansion and equivalence merging. Substring indexing gets a speed boost by eliminating a full-table scan on $I to do the expansion -- it still has to do equivalence merging on the expanded list.


Index Creation and Maintenance

Full Migration to Shadow Process

In current versions of interMedia Text, query code is linked into the oracle shadow process, while indexing and document services code is invoked through extproc. In 8.1.7, all functions have been migrated to the oracle shadow process, eliminating the need for extproc setup. As a side effect, this allows user datastores and PL/SQL procedure filters to reference safe callouts.

Commit After Every Error

Errors encountered during indexing are recorded in the dr$index_error table. However, in current versions of interMedia Text, these inserts are committed only at index flush time. In some cases, the error table fills up, runs out of extents, and rolls back, which means that the index creation fails, but no error messages are left in the error table to examine and determine the cause of failure. 8.1.7 uses autonomous transactions to commit each index error message immediately, avoiding this problem.

Keyless Index

The ConText Option, with its relatively low degree of integration with the kernel, relied on a primary key value in order to join the text results to the base table. The migration to extensible indexing in 8.1.5 eliminated the need for a primary key, because under the covers rows were identified by rowid. However, a primary key constraint on the table was still required before you could create a context index on it.

8.1.7 removes this restriction, so that you can now create a context or ctxcat index on a table without a primary key constraint. Document service requests for this index default to rowid input for the textkey argument. Note that creating a primary-key-less index may preclude some future operations, if a database-independent method of row identification is ever required.

Meta Column Replacement

interMedia Text 8.1.6 added new language, format, and charset columns for heterogeneous document sets. However, once set in create index, the column name could not be changed in alter index. You could always simply drop the index and re-create it, specifying the meta-data columns, but in 8.1.7 we have added meta-column support to alter index replace:
  alter index  rebuild parameters ('replace language column ')
which is a little easier. Note that this is a usability enhancement only -- the index data is still dropped and rebuilt from scratch, just like any replace operation.

Full Optimization with No Garbage

FULL mode optimization removes deleted document information and defragments rows at the same time. Some installations have no deletes or updates, but would still like to use FULL optimization because its MAXTIME parameter allows it to optimize in time-limited batches. Unfortunately, since it was intended to be run to clean up deleted document information, if there were no deleted documents, FULL mode would return without doing anything. This has been changed in 8.1.7 so that FULL mode optimization runs, even if there are no deleted documents.

Single Token Optimization

A new optimization mode - token optimization - allows optimization of just a single token's information. The syntax is:
  alter index  rebuild online parameters ('optimize token ');
or, in PL/SQL:
  ctx_ddl.optimize_index(, 'TOKEN', token=>'')
This allows a site to always keep the most frequently queried tokens optimal.


Query Improvements

Fuzzy Score Passthrough

The FUZZY operator (represented by the question mark (?)) expands the input term to words which are spelled similarly. It can increase query recall when query terms or words in the indexed documents are frequently misspelled. However, it could also expand to hundreds of words, leading to low precision and long-running queries.

The size of the fuzzy expansion can be controlled at index time through wordlist attributes introduced in version 8.1.5. FUZZY_NUMRESULTS limits the expansion to no more than the specified number of terms. FUZZY_SCORE limits the expansion size by fuzzy score. Fuzzy score is a measure of the degree of similarity of the input term to each result term. This ranges from 50 to 80, with 80 meaning identical. For instance, ?grisam might, internally, expand to:

  Word    Fuzzy Score
  ------- -----------
  Grisham          73
  Grissom          66
  Gresham          65
  Graham           56
Setting FUZZY_SCORE limits the expansion terms to those with a fuzzy score equal to or greater than the specified value. If in our example we set FUZZY_SCORE to 70, ?Grisham would expand only to Grisham.

Query-time control over these limits was introduced in version 8.1.6, although this new feature was not documented in the official documentation nor in the interMedia Text 8.1.6 Technical Overview. The FUZZY function was expanded with two parameters for specification of fuzzy score threshhold or numresults limit:

  FUZZY(term [, fuzzy_score [, fuzzy_numresults]])
For instance,
  FUZZY(term)
uses the index-time wordlist attribute limits. However,
  FUZZY(term, 65)
overrides the index-time setting of FUZZY_SCORE and uses 65 as the fuzzy score threshhold. Similarly,
  FUZZY(term,,15)
overrides the index-time setting of FUZZY_NUMRESULTS and instead limits the size of the expansion to 15 terms.

Although you could limit the terms by fuzzy score, the contains score of the FUZZY function still reflected only the term frequency of the expanded word -- there was no way to find documents containing words similar to X, and order them by similarity to the input term. So, in 8.1.7, a new parameter has been added to the FUZZY function:

  FUZZY(term [, fuzzy_score [, fuzzy_numresults [, weight]]])
When the weight parameter is set to "weight" or "w", the FUZZY function weights each expanded term by ((3 * fuzzyscore) - 140)/10) -- a factor proportional to its fuzzy score. For instance, our example above, FUZZY(grisam), would normally be re-written as:
  grisham = grissom = gresham = graham
with the weight parameter, this becomes
  (grisham * 7.9) = (grissom * 5.8) = (gresham * 5.5) = (graham * 2.8)
And the more similar terms, having higher weights, would tends towards higher scores. Let's see this in practice. We'll start with a simple table, indexed with FUZZY_NUMRESULTS set to 500 and FUZZY_SCORE set to 50. The query:
  select score(1), text from mytable 
   where contains(text, 'FUZZY(Grisam)',1) > 0
   order by score(1) desc
results in:
  SCORE TEXT
  ----- -----------------------
      4 Heather Graham
      4 Gus Grissom
      4 Douglas Gresham
      4 John Grisham
The record most similar to our input term -- John Grisham -- cannot be distinguished from all the other records, because their scores are the same. If instead we do:
  select score(1), text from mytable 
   where contains(text, 'FUZZY(Grisam,,,w)',1) > 0
   order by score(1) desc
The results come back in decreasing similarity order:
  SCORE TEXT
  ----- -----------------------
     31 John Grisham
     23 Gus Grissom
     21 Douglas Gresham
     11 Heather Graham
which allows us to easily pick out the best matches.

It is important to remember that fuzzy weighting is not simply a normalized fuzzy score -- it is a weighting of the regular, inverse-document frequency score. Because of this, it is possible for less similar terms to outscore more similar terms, if the less similar terms occur frequently in the document. For instance, if we add the document "Graham Graham Graham" to the corpus, the weighted query returns:

  SCORE TEXT
  ----- -----------------------
     37 Graham Graham Graham
     35 John Grisham
     26 Gus Grissom
     24 Douglas Gresham
     12 Heather Graham
Consequently, weighted fuzzy is most effective when applied to indexes on small, short text fields (like names), where term repetition in a single document is rarer than in longer documents.

$I Query in Cost Estimate

interMedia Text is integrated with the cost-based optimizer. Contains queries are passed to interMedia Text for estimation of selectivity and cost, if the text index has been analyzed. In 8.1.5 and 8.1.6, analyze index scans all rows of the $I table and calculates the document counts for the most unselective tokens. An average measure is used for all other tokens.

One problem with this approach is that the statistics are gathered at analyze time, and not updated, so over time they become increasingly out-of- date. Furthermore, because of limited statistic storage space, accurate counts are stored for only a few hundred tokens - everything else uses the average measure, which is inaccurate. Finally, because it does so much work, analyze itself is a time-consuming operation.

In 8.1.7, analyze has been changed to do very little - it now simply records the fact that it has been run. At query time, $I is directly queried to get the token document count. The statistics are more accurate, always up-to-date, and analyze index takes seconds instead of hours. Although the cost and selectivity estimation takes longer, the cost estimation queries are only $X index scans, and are the same that the query itself will run, so essentially they cache the blocks that the query will read. Overall query time does not degrade.

Improved Performance on Large Datasets

We have made some internal changes to the code to optimize query-time disk access. This should result in improved text query performance for large datasets. The exact improvements implemented are too detailed for the scope of this document.


Knowledge Base Improvements

User-Defined Knowledge Bases

Version 8.1.5 introduced knowledge base extension: you load a custom thesaurus then use the knowledge base compiler (ctxkbtc) to add these new terms to the provided knowledge base. However, you could only extend the existing English knowledge base. Version 8.1.6 introduced a French knowledge base, and a THEME_LANGUAGE attribute to basic lexer to select the knowledge base to use for theme indexing, but user customization was still limited to extension of the provided knowledge bases.

Version 8.1.7 allows you to load your own knowledge base, and create a completely new knowledge base from scratch. This allows a company with an extensive Spanish thesaurus, for instance, to generate a Spanish knowledge base, and use that when generating theme indexes, in ABOUT query, etc.

The steps are:

  1. Load a custom thesaurus using ctxload
  2. Set NLS_LANG so that the language portion is the target language The charset portion must be a single-byte character set.
  3. Compile the loaded thesaurus using ctxkbtc
         ctxkbtc -user ctxsys/ctxsys -name 
A new knowledge base will be compiled from the loaded thesaurus. To now use in an index, specify the NLS_LANG language as the THEME_LANGUAGE attribute value for the lexer.

Limitations:

  • Oracle supplies knowledge bases in French and English only. You must provide your own thesaurus for any other language,
  • This process only works for single-byte character sets. You cannot create a knowledge base for languages which can be expressed only in multi-byte character sets. If the database character set is multi-byte, such as UTF-8, NLS_LANG must be set to a compatible single-byte character set when compiling the thesaurus.
  • It may not be possible to create effective knowledge bases for languages with extensive compounding, such as Dutch or German, or with many word forms, such as Hebrew or Arabic.
  • You can have at most one knowledge base per NLS language.

During knowledge base compilation, you can also now specify a stoplist:

         ctxkbtc -user ctxsys/ctxsys -name  -stoplist 
The words of this stoplist are marked as low-information words that are prevented from becoming or contributing to a theme.


Document Services Improvements

Tokens Document Service

TOKENS is a new document service which returns the words of a document as seen by the indexing engine. The document is fetched, filtered, sectioned, and lexed, and the tokens placed in a result table or PL/SQL table. This is useful for applications which may want to implement classification, clustering, or routing.

The result table form of the call is:

  ctx_doc.tokens(, , , );
The result table should have the form
  QUERY_ID  NUMBER          -- The query id
  TOKEN     VARCHAR2(64)    -- the lexed token
  OFFSET    NUMBER          -- the character offset of the token
  LENGTH    NUMBER          -- the original length of the token before lexing
The PL/SQL in-memory form is:
  ctx_doc.tokens(, , );
where is of the type ctx_doc.token_tab:
  type token_rec is record (
    token   varchar2(64),
    offset  number,
    length  number
   );
   type token_tab is table of token_rec index by binary_integer;
The output will contain only text tokens - theme tokens, section information, etc. are not output. Stopwords are not output, either. If composite is on, compounds will be produced, so there may be several tokens at the same offset.

NUM_THEMES Limit

CTX_DOC.THEMES and CTX_DOC.GIST have both been extended with a new numeric argument, NUM_THEMES. This argument limits the number of themes or POV gists generated. For instance:
  ctx_doc.themes('myidx','1','ttab',num_themes=>12)
will populate the "ttab" table with the 12 strongest themes, instead of the up-to-50 themes normally generated.


Thesaurus Improvements

In-Memory Thesaurus Build

ctxload thesaurus loading has been re-written to construct the thesaurus in-memory then array insert the phrase information. Loading a large thesaurus is now much faster than in previous versions.

Create/Drop/Alter Translation

The ctx_thes API has been expanded with the following calls for maintaining translation information in an existing thesaurus:
  ctx_thes.create_translation(, , , )
  ctx_thes.drop_translation(, , , )
  ctx_thes.update_translation(, , ,
                              , )

Has_Relation

A relation testing procedure has been added for applications which may need to test that a relation exists without actually doing the expansion:
  ctx_thes.has_relation(,,)
rel can be a single thesaural relation, or a comma-separated list of relations, except PT. The function returns TRUE if the phrase has any of the relations in the list. For instance:
  ctx_thes.has_relation('cat','BT,BTG')
returns TRUE if the phrase "cat" in the DEFAULT thesaurus has any broader terms or broader generic terms.
E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy