OTN: Concatenated datastore for Oracle Text The Concatenated Datastore - A Utility for Oracle Text

Download the kit

Give us feedback

Contents

Introduction

The Concatenated Datastore is an additional datastore for Oracle Text. It provides for extremely fast searching over multiple columns.

It does this by building a user datastore for you, hiding the complexity of creating a PL/SQL procedure to concatenate the data, and creates the appropriate section groups.

Numeric columns are encoded in such a way that you can do range searches on them. Operators supported are greater than, less than and between.

An update trigger is automatically added so that the concatenated datastore index is automatically updated when any column changes.

An Example

Let's assume we have a table consisting of a primary key, two text columns, and a numeric colum. We'll use a variation on the EMP table:
 Name                                     Null?    Type
 ---------------------------------------- -------- ------------
 EMPNO                                    NOT NULL NUMBER(4)
 ENAME                                             VARCHAR2(10) 
 JOB                                               VARCHAR2(9)
 DEPTNO                                            NUMBER(2)

Now let's say I want to find someone whose name contains the word 'Ford', job contains the word 'Marketing' and department number is 20. The normal way to do this would be:

  SELECT empno, ename FROM emp
    WHERE CONTAINS (ename, 'Ford') > 0
    AND   CONTAINS (job, 'Marketing') > 0
    AND   DEPTNO = 20
  /

In order for Oracle to process this query, it must do lookups in three indexes and combine the results. This is never as efficient as using a single index. Instead, after we've created a concatenated datastore, we could do

  SELECT empno, ename FROM EMP
    WHERE CONTAINS (ename, 
    'Ford within ename and Marketing within job and 20 within deptno') > 0

This only requires a single index access and may be very much faster.

How Do I Define The Concatenated Datastore?

The example above could have been created as follows:
  exec ctx_cd.Create_CDstore('my_cdstore', 'emp')
  exec ctx_cd.Add_Column    ('my_cdstore', 'ename')
  exec ctx_cd.Add_Column    ('my_cdstore', 'job')
  exec ctx_cd.Add_Column    ('my_cdstore', 'deptno');

So we just have to supply a name for the cdstore and the table name, then add each column in turn. There are some more options to the Add_Column procedure which will be covered later.

Then when we create our index we need to specify 'my_cdstore' as the datastore preference and the section group. (The column on which the index is created may be one of the columns in the concatenated datastore, or or any other column in the table which is not of column type LONG, BLOB, BFILE, RAW or LONG RAW.)

  CREATE INDEX myindex ON emp(ename) INDEXTYPE IS ctxsys.context
    PARAMETERS ('datastore my_cdstore section group my_cdstore');

How Does it Work?

A user datastore is created which does the concatenation, and the required section group is defined. The datastore and section groups (which have the same name as your cdstore) must be specified in the CREATE INDEX call.

During indexing, the specified columns are dynamically concatenated into a text block that looks like this:

  <ENAME>Roger Ford</ENAME>
  <JOB>Technical Marketing Manager</JOB>
  <DEPTNO>20</DEPTNO>

Creating an Update Trigger

If your table is static, the above is all you need. However, if your data is likely to change, you must make sure that Oracle Text "notices" when any of the columns involved have changed. A trigger for this purpose can be created which will update the indexed column whenever any of the other columns are modified:
  exec ctx_cd.Add_Update_Trigger('my_cdstore', 'ename')
This will create a trigger of the form:
  create or replace trigger MY_CDSTORE
    before update on username.EMP
    for each row
    begin
      :new.ENAME:= :old.ENAME;
    end;
This trigger will not work on a LOB or LONG column, so columns of these types will cause errors. If you really need to create the index on one of these column types, you will need to write your own trigger. You should also note that the trigger above fires when any column in the row is updated - not just the indexed columns. If other columns change more often than the indexed columns, you may want to write a more sophisticated trigger.

What Column Types Can I Use?

The concatenated columns may be of any column types apart from the binary types: BLOB, BFILE, RAW and LONG RAW.

Numeric and date columns can be indexed either as plain text, or in an encoded form which allows range searching. To utilise this, you must specify maximum and minimum values for the column when calling ctx_cd.add_column, and must use special processing in the query to translate a range search into the encoded equivalent.

CLOB columns are unlimited in size. All other columns - including LONG - are limited to 32KB. The index itself can be created on any of the concatenated columns, or on a totally different column. This only really affects the column name used in the CONTAINS clause of the query.

Note, however, the restriction on column types in the Add_Update_Trigger call in the section above. It is therefore generally better to create the index on a VARCHAR2 column.

Can I Add More Sections?

Yes, there is nothing to stop you from adding more sections to the section group defined (section group 'my_cdstore' in the example above).

However, since only one section group can be specified in an index, you cannot use a predefined section group such as 'CTXSYS.HTML_SECTION_GROUP'. You would have to manually add the necessary sections to the 'my_cdstore' section group.

If you are using 8.1.6, you can use the AUTO_SECTION_GROUP rather than the one created for you by this package.

Installation

Installation is simple. Just get the file cdstore.sql and run it in SQL*Plus as the user who wants to use the concatenated datastore.

Note: In the version of the Concatenated Datastore code prior for database versions prior to 10g, the cdstore code lived in the CTXSYS schema. With the removal of DBA role from CTXSYS in 10g, we have moved the entire code into the user schema.

The necessary procedures (in the package ctx_cd) will be available to the user who installed the code. Other users must install their own version before they can use it.

Data Dictionary Views

Two new tables are installed in the user schema: CTX_CDSTORES and CTX_CDSTORE_COLS.

  SQL> desc ctx_cdstores
   Name                              Null?    Type
   --------------------------------- -------- ----------------------------
   CDSTORE_ID                                 NUMBER(10)
   CDSTORE_NAME                               VARCHAR2(30)
   OWNER                                      VARCHAR2(20)
   TABLE_NAME                                 VARCHAR2(30)

  SQL> desc ctx_cdstore_cols
   Name                              Null?    Type
   --------------------------------- -------- ----------------------------
   CDSTORE_ID                                 NUMBER(10)
   CDSTORE_NAME                               VARCHAR2(30)
   COL_NAME                                   VARCHAR2(30)
   SECTION_NAME                               VARCHAR2(30)
   VISIBLE                                    CHAR(1)
   COL_TYPE                                   VARCHAR2(4)
   MIN_INT                                    NUMBER(8)
   MAX_INT                                    NUMBER(8)
   MIN_DATE                                   DATE
   MAX_DATE                                   DATE

API Specifications

The package ctx_cd consists of several procedures
  procedure create_cdstore
    (
     cdstore_name varchar2,     /* Any legal identifier of your choice */
     table_name   varchar2      /* table to index (must already exist) */
    );

Creates an initial concatenated datastore definition.

  procedure add_column(
     cdstore_name varchar2,               /* Name of an existing concat. datastore               */
     column_name  varchar2,               /* Column to be added - must exist in specified table  */
     section_name varchar2 default null,  /* section name to use - defaults to col name          */
     visible      boolean  default true,  /* should this section be indexed as part of the 
                                             whole document, as well as within the section? 
                                             See the definition of ctx_ddl.add_section           */
     min_int      integer  default null,  /* Minimum value for an integer field                  */
     max_int      integer  default null); /* Maximum value for an integer field                  */

Adds a column to the list to be concatenated into the index.

There is an alternative version of this call for date fields:

  procedure add_column(
     cdstore_name varchar2,               /* Name of an existing concat. datastore               */
     column_name  varchar2,               /* Column to be added - must exist in specified table  */
     section_name varchar2 default null,  /* section name to use - defaults to col name          */
     visible      boolean  default true,  /* should this section be indexed as part of the
                                             whole document, as well as within the section?      
                                             See the definition of ctx_ddl.add_section           */
     min_date     date,                   /* Earliest date to index                              */
     max_date     date);                  /* Latest date to index                                */

Adds a date column to the list to be concatenated into the index.

  procedure add_update_trigger(
     cdstore_name  varchar2,       /* the concatenated datastore name                       */
     column_name   varchar2);      /* the column on which the index is, or will be, created */

Creates a trigger to force updates of the concatenated datastore

  procedure drop_cdstore(
     cdstore_name  varchar2);      /* name of the concatenated datastore to be dropped */

Drops the definition of a concatenated datastore

Range Searching

Numeric and date fields can be stored using a special encoding algorithm which stores numbers as groups of characters. This allows range searching - EQUALS, GREATER THAN OR EQUALS, LESS THAN OR EQUALS, or BETWEEN - on integer values, or the day part of a date (fractions, and the time part of a date field, are not indexed - fractional numbers are rounded to the nearest integer). To take advantage of this range searching, you must:
  • Specify minimum and maximum values for your numeric or date fields when calling ctx_cd.Add_Column.
  • Call a special function to generate the necessary CONTAINS clause at query time.
The smaller the range chosen between minimum and maximum values, the more compact the index will be, and the more efficient the indexes. The maximum range between min and max is 159,999, or in date terms, about 438 years.

The following table shows the effect of selecting different ranges. The average number of tokens that must be indexed (for a single encoded value within that range) increases as shown. Each token is three characters long.

RangeInternal BaseAverage Tokens
15 2 4
80 3 8
255 4 16
624 5 25
1295 6 36
2400 7 49
4095 8 64
6560 9 81
9999 10 100
14640 11 121
RangeInternal BaseAverage Tokens
20735 12 144
28560 13 169
38415 14 196
50624 15 225
65535 16 256
83520 17 289
104975 18324
130320 19361
159999 20400

For most efficient use of indexing, you should choose the minimum value from the first column which suits your needs. Do not, for example, specify min_int => 0 and max_int => 81 as this will force the system to use an average 16 tokens per entry, rather than 8 if you had specified max_int => 80.

Query Processing

For character fields and non-range numeric fields (those specified without min_int and max_int in the ctx_cd.Add_Column call), the application can do a simple section search like:
     ... CONTAINS (colname, 'searchword WITHIN section') > 0;
However, for range queries, an encoded search string must be used. We therefore provide two extra functions to generate the required strings:

For integer range searches:

  function int_contains
    ( cdstore_name     varchar2,/* name of the concatenated datastore                   */
      column_name      varchar2,/* the column in which to search                        */
      int_value        number,  /* in range min_int..max_int                            */
      other_int_value  number,  /* in range min_int..max_int (only needed for BETWEEN)  */
      operator         varchar2 /* 'E' ==>  = p_integer 
                                   'G' ==> >= p_integer
                                   'L' ==> <= p_integer
                                   'B' ==> >= p_integer and
                                           <= p_other_integer                           */
    ) 
      return              varchar2; /* guaranteed to fit in varchar2(4000)              */
For date range searches:
  function date_contains
    ( cdstore_name     varchar2,/* name of the concatenated datastore                   */
      column_name      varchar2,/* the column in which to search                        */
      date_value       date,    /* in range min_date..max_date                          */
      other_date_value date,    /* in range min_date..max_int (only needed for BETWEEN) */
      operator         varchar2 /* 'E' ==>  = p_integer 
                                   'G' ==> >= p_integer
                                   'L' ==> <= p_integer
                                   'B' ==> >= p_integer and
                                           <= p_other_integer                        */
    ) 
      return              varchar2; /* guaranteed to fit in varchar2(4000)              */
These functions can be used to generate a bind variable which can be used in a CONTAINS clause. For example if I want to do a search where SALARY <= 1000 and hiredate >= '1-JAN-97', I might use the following PL/SQL code:
  declare 
    cstring varchar2(4000);
    cursor c1 is
      select ename from emp where contains (indexcol, cstring) > 0;
  begin
    cstring := ctx_cd.int_contains ('empcd', 'salary', 1000, null, 'L');
    cstring :=  cstring || ' AND ' ||
              ctx_cd.date_contains ('empcd', 'hiredate', '1-JAN-97', null, 'G');
    for emprec in c1
    loop
      dbms_output.put_line (emprec.ename);
    end loop;
  end;
  /

Out of Range Values

On indexing, the values to be indexed are checked against the specified maximum and minimum values. Any data outside this range will cause the indexing of that row to fail. An entry will appear in CTX_USER_INDEX_ERRORS as normal.

This enables us to assert that all indexed rows are within the specified range. This means that we never have to reject a search value entered by a user. All searches are valid - it's just that some (such as a "less than" search for a value less than the minimum) will never return any results. This eases application development, as the application never has to check the entered values or return "out of range" errors to the user.

Of course the application developer can choose to warn the user if they search out of the range, but this is not required.

E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy