The CTXCAT Index Type - what is it, why should I use it?

Contents

Introduction

Oracle 8i introduced the CONTEXT index type - a "domain" index that was used for fast retrieval of unstructured text.

In Oracle 8i Release 3 (Version 8.1.7), Oracle introduced a new type of index - the "catalog" or CTXCAT index type. The name derives from the main purpose of the index type - online catlogs.

When should I use a CTXCAT index?

CTXCAT indexes work best when text is in "small chunks" - maybe a couple of lines maximum - and searches need to restrict and/or sort the result set according to certain structured criteria - usually numbers or dates.

For example, lets consider an on-line auction site. Each item for sale has a short description, a current bid price and dates for the start and end of the auction. A user might want to see all the records with "antique cabinet" in the description, with a current bid price less than $500. Since he's particularly interested in new items, he wants the results sorted by auction start time.

Such a search would be fairly inefficient using a normal CONTEXT index. The kernel would have to find all the records that matched the text search, then restrict the set to those with the correct price (which requires the use of a different index), and then sort the results using a third index.

By including structured information such as price and date within the CTXCAT index, we are able to make this search very much more efficient.

When is a CTXCAT index NOT suitable?

The query language with CTXCAT is considerably simpler than for CONTEXT indexes. Basically, you can search for phrases and words (with wild cards if required), using AND and OR operators, but that's about it. If your application needs more complex text retrieval featurs - such as stemming, thesaurus, fuzzy matching and so on, you should be using a CONTEXT index.

There are also differences in the time and space needed to create the index. CTXCAT indexes take quite a bit longer to create - and use considerably more disk space - than CONTEXT indexes. If you are tight on disk space, you should consider carefully whether CTXCAT indexes are appropriate for you.

Any other differences?

Yes - in DML processing (updates, inserts and deletes). CTXCAT indexes are transactional. Where a CONTEXT index uses a "deferred indexing" method - the actual index updates don't take place until an index SYNC is performed - CTXCAT indexes work much more like a normal (b-tree) index. When you commit your changes, all necesary changes to the CTXCAT indexes take place before the commit returns.

How do I create a CTXCAT index?

The simplest case is just
                                       
 CREATE INDEX indexname ON table(column) 
  INDEXTYPE IS CTXCAT;

                                    

However, if you do this you're not taking advantage of the ability to include structured information within the index itself.

Let's consider our online auction site mentioned previously:

                                       
 create table auction (

    item_id           NUMBER   PRIMARY KEY,  -- auction item identifier
    item_desc         VARCHAR2(80),          -- free-form item description
    price             NUMBER,                -- current price of the item
    start_time        DATE                   -- end time of the auction
    end_time          DATE                   -- end time of the auction
  )

                                    

A CTXCAT index (like a CONTEXT index) is a "domain" index. Therefore it supports the "PARAMETERS" clause. A number of possible parameter settings are shared with CONTEXT indexes. These are: LEXER, MEMORY, STOPLIST, STORAGE and WORDLIST (no other CONTEXT parameters are supported). Howevwer, the most important parameter is a new one: INDEX SET.

INDEX SET defines the structured columns that are to be included in the CTXCAT index. Let's say I want to create an index on the item_desc column, but I need to be able to limit my search results by price, and sort by start_time. I do this by creating a new INDEX SET, and adding the structured columns to it. New (for 8.1.7) procedures in the CTX_DDL package allow me to do this:

                                       
 ctx_ddl.create_index_set('auction_set');
  ctx_ddl.add_index       ('auction_set', 'price');
  ctx_ddl.add_index       ('auction_set', 'start_time');

                                    

Note that the item_desc column is NOT part of the INDEX SET. item_desc is only mentioned when we come to create the actual index:

                                       
 CREATE INDEX auction_index ON auction (item_desc) 
  INDEXTYPE IS CTXCAT
  PARAMETERS ('INDEX SET auction_set');

                                    

And how do I do a search?

First of all, we'll be using the CATSEARCH operator instead of the CONTAINS operator used for a CONTEXT index. Let's ignore the structured columns for now and look at the basic query syntax. I want to find all auction items which contains the words "toy" and "dog" but not the phrase "live animal":

                                       
SELECT item_id, item_desc
 FROM auction

 WHERE CATSEARCH (item_desc, '(toy dog) | "live animal"', null) > 0;

                                    

A few points to note:

  • ANDed terms do not need the word AND or even an "&" operator. AND is assumed between any adjacent terms.
  • NOT is represented by "|" (OR, not used here, is represented by "|")
  • Parentheses can be used to group terms
  • Double quotes are used to surround a phrase (otherwise "live animal" would have been read as "live AND animal".

The "null" in the query above is a placeholder for a structured clause. There is no default - if you don't provide any structured clause you MUST put "null" in here.

The structured clause allows you to restrict, or sort, your results. If I wanted to extend the query above to find only items costing less than $100, I could use

                                       
 WHERE CATSEARCH (item_desc, '(toy dog) | "live animal"', 'price < 100') > 0

                                    
and if I wanted the results with the newest items first, I could use
                                       
 WHERE CATSEARCH (item_desc, '(toy dog) | "live animal"', 
                   'price < 100 order by start_time desc') > 0

                                    

Note that there are restrictions on what can be used in the structured clause. Of course any columns that are mentioned MUST be included in the index set. Additionally, you cannot use functions, expressions, sub-selects and so on. For full details of the restrictions, see the 8.1.7 documentation addendum.

Some combinations of structured criteria and order by might not be able to be processed efficiently by the catsearch index. When this happens, the query will fail with the error:

                                       
 DRG-10844: index cannot execute this structured predicate efficiently

                                    

although note that the most common reason for getting this message is that you have failed to include a referenced column in the INDEX SET!

Have much index space is used?

The amount of space used by a CTXCAT index will vary according to how many structured columns are included in the index set. Without getting too deep into the technical detail, each structured column requires both more space in the main index table, AND an extra b-tree index on the index table itself.

However, the figures tend to be large.

Tests show that in an auction table as above, with two structured columns in the index set, the final amount of storage required is between 10 and 15 times the size of the base table. This can be compared to CONTEXT indexes, which are typically 25-100% of the base table size.

How much time to create the index?

Again, this varies considerably depending on the number of structured columns in the index set. As an example, creating a CTXCAT index on 5 million short records, with three structured columns, took me approximately 4 hours on a multi-CPU Sun Solaris server with 2GB of memory.

It is worth noting that the creation of the B-Tree indexes will be considerably quicker if the SORT_AREA_SIZE kernel parameter is increased. The default is 64K - a very low figure. Increasing this to 1MB will have a very significant effect - I've seen index-creation times reduce by a factor of 10 by doing this. However note the specified amount of memory will be used by EVERY process connecting, so great care should be taken with this parameter if the database is shared with many other users. It might be possible to increase it just for index creation, then reduce it later.


Last Modified: 18 May 2001 roger.ford@oracle.com



In-Memory Replay Banner