| OTN: Concatenated datastore for Oracle Text
The Concatenated Datastore
- A Utility for Oracle Text
Download the kit
Give
us feedback
Contents
Introduction
The Concatenated Datastore is an additional datastore for Oracle Text.
It provides for extremely fast searching over multiple columns.
It does this by building a user datastore for you, hiding the complexity of
creating a PL/SQL procedure to concatenate the data, and creates the appropriate
section groups.
Numeric columns are encoded in such a way that you can do range searches on
them. Operators supported are greater than, less than and between.
An update trigger is automatically added so that the concatenated datastore
index is automatically updated when any column changes.
An Example
Let's assume we have a table consisting of a primary key, two text columns, and
a numeric colum. We'll use a variation on the EMP table:
Name Null? Type
---------------------------------------- -------- ------------
EMPNO NOT NULL NUMBER(4)
ENAME VARCHAR2(10)
JOB VARCHAR2(9)
DEPTNO NUMBER(2)
Now let's say I want to find someone whose name contains the word 'Ford',
job contains the word 'Marketing' and department number is 20. The normal way
to do this would be:
SELECT empno, ename FROM emp
WHERE CONTAINS (ename, 'Ford') > 0
AND CONTAINS (job, 'Marketing') > 0
AND DEPTNO = 20
/
In order for Oracle to process this query, it must do lookups in three indexes
and combine the results. This is never as efficient as using a single index.
Instead, after we've created a concatenated datastore, we could do
SELECT empno, ename FROM EMP
WHERE CONTAINS (ename,
'Ford within ename and Marketing within job and 20 within deptno') > 0
This only requires a single index access and may be very much faster.
How Do I Define The Concatenated Datastore?
The example above could have been created as follows:
exec ctx_cd.Create_CDstore('my_cdstore', 'emp')
exec ctx_cd.Add_Column ('my_cdstore', 'ename')
exec ctx_cd.Add_Column ('my_cdstore', 'job')
exec ctx_cd.Add_Column ('my_cdstore', 'deptno');
So we just have to supply a name for the cdstore and the table name, then
add each column in turn. There are some more options to the Add_Column procedure
which will be covered later.
Then when we create our index we need to specify 'my_cdstore' as the datastore
preference and the section group. (The column on which the index is created
may be one of the columns in the concatenated datastore, or or any other column
in the table which is not of column type LONG, BLOB,
BFILE, RAW or LONG RAW.)
CREATE INDEX myindex ON emp(ename) INDEXTYPE IS ctxsys.context
PARAMETERS ('datastore my_cdstore section group my_cdstore');
How Does it Work?
A user datastore is created which does the concatenation, and the required section
group is defined. The datastore and section groups (which have the same name as
your cdstore) must be specified in the CREATE INDEX call.
During indexing, the specified columns are dynamically concatenated into a
text block that looks like this:
<ENAME>Roger Ford</ENAME>
<JOB>Technical Marketing Manager</JOB>
<DEPTNO>20</DEPTNO>
Creating an Update Trigger
If your table is static, the above is all you need. However, if your data is likely
to change, you must make sure that Oracle Text "notices" when any of the columns
involved have changed. A trigger for this purpose can be created which will update
the indexed column whenever any of the other columns are modified:
exec ctx_cd.Add_Update_Trigger('my_cdstore', 'ename')
This will create a trigger of the form:
create or replace trigger MY_CDSTORE
before update on username.EMP
for each row
begin
:new.ENAME:= :old.ENAME;
end;
This trigger will not work on a LOB or LONG column, so columns of these types
will cause errors. If you really need to create the index on one of these column
types, you will need to write your own trigger. You should also note that the
trigger above fires when any column in the row is updated - not just the
indexed columns. If other columns change more often than the indexed columns,
you may want to write a more sophisticated trigger.
What Column Types Can I Use?
The concatenated columns may be of any column types apart from the
binary types: BLOB, BFILE, RAW and LONG RAW.
Numeric and date columns can be indexed either as plain text, or in an encoded
form which allows range searching. To utilise this, you must specify maximum
and minimum values for the column when calling ctx_cd.add_column,
and must use special processing in the query to translate a range search into
the encoded equivalent.
CLOB columns are unlimited in size. All other columns - including
LONG - are limited to 32KB. The index itself can be created on
any of the concatenated columns, or on a totally different column. This only
really affects the column name used in the CONTAINS clause of the query.
Note, however, the restriction on column types in the Add_Update_Trigger call
in the section above. It is therefore generally better to create
the index on a VARCHAR2 column.
Can I Add More Sections?
Yes, there is nothing to stop you from adding more sections to the section
group defined (section group 'my_cdstore' in the example above).
However, since only one section group can be specified in an index, you cannot
use a predefined section group such as 'CTXSYS.HTML_SECTION_GROUP'. You would
have to manually add the necessary sections to the 'my_cdstore' section group.
If you are using 8.1.6, you can use the AUTO_SECTION_GROUP rather
than the one created for you by this package.
Installation
Installation is simple. Just get the file cdstore.sql and run it in SQL*Plus as the
user who wants to use the concatenated datastore.
Note: In the version of the Concatenated Datastore code prior for database
versions prior to 10g, the cdstore code lived in the CTXSYS schema. With the
removal of DBA role from CTXSYS in 10g, we have moved the entire code into
the user schema.
The necessary procedures (in the package ctx_cd) will be available to the user
who installed the code. Other users must install their own version before they
can use it.
Data Dictionary Views
Two new tables are installed in the user schema:
CTX_CDSTORES and CTX_CDSTORE_COLS.
SQL> desc ctx_user_cdstores
Name Null? Type
--------------------------------- -------- ----------------------------
CDSTORE_ID NUMBER(10)
CDSTORE_NAME VARCHAR2(30)
OWNER VARCHAR2(20)
TABLE_NAME VARCHAR2(30)
SQL> desc ctx_user_cdstore_cols
Name Null? Type
--------------------------------- -------- ----------------------------
CDSTORE_ID NUMBER(10)
CDSTORE_NAME VARCHAR2(30)
COL_NAME VARCHAR2(30)
SECTION_NAME VARCHAR2(30)
VISIBLE CHAR(1)
COL_TYPE VARCHAR2(4)
MIN_INT NUMBER(8)
MAX_INT NUMBER(8)
API Specifications
The package ctx_cd consists of several procedures
procedure create_cdstore
(
cdstore_name varchar2, /* Any legal identifier of your choice */
table_name varchar2 /* table to index (must already exist) */
);
Creates an initial concatenated datastore definition.
procedure add_column(
cdstore_name varchar2, /* Name of an existing concat. datastore */
column_name varchar2, /* Column to be added - must exist in specified table */
section_name varchar2 default null, /* section name to use - defaults to col name */
visible boolean default true, /* should this section be indexed as part of the
whole document, as well as within the section?
See the definition of ctx_ddl.add_section */
min_int integer default null, /* Minimum value for an integer field */
max_int integer default null); /* Maximum value for an integer field */
Adds a column to the list to be concatenated into the index.
There is an alternative version of this call for date fields:
procedure add_column(
cdstore_name varchar2, /* Name of an existing concat. datastore */
column_name varchar2, /* Column to be added - must exist in specified table */
section_name varchar2 default null, /* section name to use - defaults to col name */
visible boolean default true, /* should this section be indexed as part of the
whole document, as well as within the section?
See the definition of ctx_ddl.add_section */
min_date date, /* Earliest date to index */
max_date date); /* Latest date to index */
Adds a date column to the list to be concatenated into the index.
procedure add_update_trigger(
cdstore_name varchar2, /* the concatenated datastore name */
column_name varchar2); /* the column on which the index is, or will be, created */
Creates a trigger to force updates of the concatenated datastore
procedure drop_cdstore(
cdstore_name varchar2); /* name of the concatenated datastore to be dropped */
Drops the definition of a concatenated datastore
Range Searching
Numeric and date fields can be stored using a special encoding
algorithm which stores numbers as groups of characters. This allows
range searching - EQUALS, GREATER THAN OR EQUALS,
LESS THAN OR EQUALS, or BETWEEN - on integer values,
or the day part of a date (fractions, and the time part of a date field,
are not indexed - fractional numbers are rounded to the nearest integer).
To take advantage of this range searching, you must:
-
Specify minimum and maximum values for your numeric or date fields when
calling
ctx_cd.Add_Column.
- Call a special function to generate the necessary CONTAINS clause at query time.
The smaller the range chosen between minimum and maximum values, the
more compact the index will be, and the more efficient the indexes.
The maximum range between min and max is 159,999, or in date terms,
about 438 years.
The following table shows the effect of selecting different ranges.
The average number of tokens that must be indexed (for a single
encoded value within that range) increases as shown. Each token
is three characters long.
| Range | Internal Base | Average Tokens |
| 15 | 2 | 4 |
| 80 | 3 | 8 |
| 255 | 4 | 16 |
| 624 | 5 | 25 |
| 1295 | 6 | 36 |
| 2400 | 7 | 49 |
| 4095 | 8 | 64 |
| 6560 | 9 | 81 |
| 9999 | 10 | 100 |
| 14640 | 11 | 121 |
|
| Range | Internal Base | Average Tokens |
| 20735 | 12 | 144 |
| 28560 | 13 | 169 |
| 38415 | 14 | 196 |
| 50624 | 15 | 225 |
| 65535 | 16 | 256 |
| 83520 | 17 | 289 |
| 104975 | 18 | 324 |
| 130320 | 19 | 361 |
| 159999 | 20 | 400 |
|
For most efficient use of indexing, you should choose
the minimum value from the first column which suits your
needs. Do not, for example, specify min_int => 0
and max_int => 81 as this will force the system
to use an average 16 tokens per entry, rather than 8 if you
had specified max_int => 80.
Query Processing
For character fields and non-range numeric fields (those
specified without min_int
and max_int in the ctx_cd.Add_Column
call), the application can do a simple section search like:
... CONTAINS (colname, 'searchword WITHIN section') > 0;
However, for range queries, an encoded search string must be used.
We therefore provide two extra functions to generate the required
strings:
For integer range searches:
function int_contains
( cdstore_name varchar2,/* name of the concatenated datastore */
column_name varchar2,/* the column in which to search */
int_value number, /* in range min_int..max_int */
other_int_value number, /* in range min_int..max_int (only needed for BETWEEN) */
operator varchar2 /* 'E' ==> = p_integer
'G' ==> >= p_integer
'L' ==> <= p_integer
'B' ==> >= p_integer and
<= p_other_integer */
)
return varchar2; /* guaranteed to fit in varchar2(4000) */
For date range searches:
function date_contains
( cdstore_name varchar2,/* name of the concatenated datastore */
column_name varchar2,/* the column in which to search */
date_value date, /* in range min_date..max_date */
other_date_value date, /* in range min_date..max_int (only needed for BETWEEN) */
operator varchar2 /* 'E' ==> = p_integer
'G' ==> >= p_integer
'L' ==> <= p_integer
'B' ==> >= p_integer and
<= p_other_integer */
)
return varchar2; /* guaranteed to fit in varchar2(4000) */
These functions can be used to generate a bind variable which can be used in
a CONTAINS clause. For example if I want to do a search where SALARY <= 1000
and hiredate >= '1-JAN-97', I might use the following PL/SQL code:
declare
cstring varchar2(4000);
cursor c1 is
select ename from emp where contains (indexcol, cstring) > 0;
begin
cstring := ctx_cd.int_contains ('empcd', 'salary', 1000, null, 'L');
cstring := cstring || ' AND ' ||
ctx_cd.date_contains ('empcd', 'hiredate', '1-JAN-97', null, 'G');
for emprec in c1
loop
dbms_output.put_line (emprec.ename);
end loop;
end;
/
Out of Range Values
On indexing, the values to be indexed are checked against the
specified maximum and minimum values. Any data outside this
range will cause the indexing of that row to fail. An entry
will appear in CTX_USER_INDEX_ERRORS as normal.
This enables us to assert that all indexed rows are within the
specified range. This means that we never have to reject a search
value entered by a user. All searches are valid - it's just that
some (such as a "less than" search for a value less than the
minimum) will never return any results. This eases application
development, as the application never has to check the entered
values or return "out of range" errors to the user.
Of course the application developer can choose to warn the user
if they search out of the range, but this is not required.
|