Oracle interMedia Text 8.1.5 - Technical Overview
Oracle8i interMedia Text 8.1.5 - Technical Overview
Garrett Kaminaga, Product Development, Database and Application Server Technologies, Oracle HQ
interMedia integrates all of the features and functions of the former Oracle
ConText product with image, audio and video, and geographic location services for Web
Content Management applications built with the Oracle8i Internet Platform.
interMedia Text has been completely re-architected, enhanced and
tightly integrated with Oracle8i to provide up to an order of magnitude better
search performance, greater ease of use and new capabilites like thematic searches. Please
refer to the documentation for detailed information.
Table of Contents
- Introduction to interMedia text
- Installation and Setup
- Indexing
- The Preference System
- Datastores
- Filters
- Section Groups
- Lexers
- Other Indexing Classes
- DML
- Optimization
- Querying
- Query Operators (Part 1)
- Query Operators (Part 2)
- Thesaurus
- Query Tuning
- Query Feedback
- Highlighting
- Linguistic Extraction
Introduction to interMedia Text
What does it do?
interMedia Text extends Oracle8i by indexing any text
or documents stored in Oracle8i, in operating system flat files or URLs. It
also enables content-based queries, (e.g., find text and documents which contain
this word) using familiar, standard SQL. It enables file-based text applications
to use Oracle8i to manage text and documents in an integrated fashion
with traditional relational information.
The easiest way to grasp what it does is to see it in action...
How do you use it?
Here is a quick example. We'll start by creating a table to hold our documents.
For now, we'll use a small varchar2 column to hold the document text. We'll
also need a primary key to identify each document.
create table docs (id number primary key, text varchar2(80));
Nothing special about that. Let's populate it with two example documents:
insert into docs values (1, 'first document');
insert into docs values (2, 'second document');
commit;
Now comes the fun part -- we build a Text index:
create index doc_index on docs(text)
indextype is ctxsys.context;
This is, for the most part, familiar DDL -- the difference is the indextype
clause, which instructs Oracle to build a Text index instead of a regular, b-tree
index.
Once the index has been created, we can run content-based queries on our huge
database of two documents using the contains function:
select id from docs
where contains(text, 'first') > 0;
This will find all rows in docs where the text column contains the word "first",
(which is document 1, for those not following closely). The > 0 part is necessary
to make it legal Oracle SQL, which does not support boolean return values for
functions (at this time).
Integration Architecture
interMedia Text is more tightly integrated with Oracle8i.
For those of you familiar with previous versions of ConText, notice that-- in
the previous example, no ctxsrv was required. The result is a specialized "domain
index" which works like any b-tree index. It's all transparent to the user --
Oracle8i is simply extended to support text. Unlike previous versions
of Oracle ConText, there are no servers to start up, there is no query rewrite,
and index creation is done through familiar SQL rather than through a custom
PL/SQL interface.
Installation and Setup
Migrating from Previous Versions
Due to significant architectural changes, interMedia Text
is not backward-compatible with previous versions of Oracle ConText. Existing
installations and applications need to be migrated to work with Oracle8i.
There is a manual which details the steps of this process. Please remember that
some of those steps need to be completed on the existing system before
you install Oracle8i.
Installation
By default, the Oracle interMedia installation procedure installs
interMedia Text. interMedia Text can also be installed separately,
if you choose a custom install.
Important Files
| ?/bin/ctxsrv |
This may be renamed ctxsrv8 or some such. Oracle8i still supports the
ctxsrv server. However, the only valid personality mask is M. You need to
run it only when you want background DML. Batch DML (sync) does not require
a running server. We'll go into DML in more depth later. |
| ?/ctx/admin |
Data dictionary scripts. See below for details. |
| ?/ctx/admin/defaults |
Language-specific default preferences. See below, and an upcoming issue
on the preference system for more details. |
| ?/ctx/data |
This directory has data files needed by interMedia Text,
which includes the linguistic lexicon files, the Xerox stemming files, and
the Korean lexer dictionary. If you get strange internal errors in linguistics,
stemming, etc. check this directory. |
| ?/ctx/bin |
This directory is for user-defined filters, which we'll talk about later.
This directory should also contain the ctxhx program which is used by the
INSO filter. |
| ?/ctx/lib |
This directory has platform-specific shared libraries and template files
used by the INSO filter. It also contains the safe callout used by indexing
and document services -- libctxx8.so (name will change from platform to
platform -- look for "ctxx") |
Data Dictionary Installation
interMedia Text is integrated with the Oracle Database Creation
Assistant (DBCA) so the ctxsys data dictionary should be installed with this
tool. If a manual installation is required please follow these steps.
- Before starting, make sure that:
- The interMedia Text software tree is installed
- The database does not have a ctxsys user
- Current directory is ?/ctx/admin
- You can sqlplus internal
- Create the ctxsys user. Pass it the ctxsys password, default tablespace,
and temporary tablespace as arguments.
sqlplus internal @dr0csys <password> <def_tblspc> <tmp_tblspc>
- Install the data dictionary:
sqlplus ctxsys/<password> @dr0inst <ctxx_library>
The argument is the full path to the ctxx library, for example:
sqlplus ctxsys/<password> @dr0inst /some_directory/app/oracle/product/8.1.5/ctx/lib/libctxx8.so
On Unix, you must not use the environment symbol $ORACLE_HOME.
You have to use the actual physical path.
- Install appropriate language-specific default preferences. There are forty-odd
scripts in ?/ctx/admin/defaults which create language- specific default preferences.
They are named in the form drdefXX.sql, where XX is the language code (from
the Server Reference Manual).
To install the US defaults, for instance:
sqlplus ctxsys/<password> @defaults/drdefus.sql
We'll talk more about this in the upcoming preference system issue.
After these steps, interMedia Text should be installed
and working.
Post-Installation Setup
If this database was an existing ConText site, make sure to remove text_enable
from the init.ora. It is no longer used in Oracle8i, and will actually prevent
Oracle8i from operating properly -- you'll get errors like "cannot find package
DR_REWRITE".
Finally, ensure that the Net8 listener is running and is configured to invoke
external procedures. A brief description of the process is below, with complete
details are in the Oracle8i Server Administrator's Guide.
- Add an entry to the tnsnames.ora:
extproc_connection_data =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = ipc)
(KEY = DBSID))
(CONNECT_DATA = (SID = ep_agt1)))
DBSID is the database SID. ep_agt1 can be named anything. extproc_connection_data
should not be changed.
- Add the following to the listener SID_LIST:
(SID_DESC = (SID_NAME = ep_agt1)
(ORACLE_HOME = /oracle)
(ENVS = LD_LIBRARY_PATH=/oracle/ctx/lib)
(PROGRAM = extproc))
ep_agt1 matches the CONNECT_DATA SID for extproc_connection_data in the tnsnames.ora.
The PROGRAM section tells the Net8 listener to start the external procedure
process.
The ENVS section, which is shown here for UNIX, will ensure that the environment
includes ?/ctx/lib in LD_LIBRARY_PATH. This is needed so that indexing can
use the INSO filters.
On NT, you may need to have ORACLE_HOME set in this section as well.
- Since the extproc_connection_data ADDRESS section specifies ipc, make sure
that the ADDRESS_LIST of listener.ora accepts ipc connections.
A quick way to test the Net8 configuration is to do:
exec ctx_output.start_log('log')
from SQL*Plus. If you get the error:
DRG-50704: Net8 listener is not running or cannot start external procedures
then things aren't set up correctly. Some of the things to check:
- listener is not running
- listener.ora is not configured for extproc
- need to reload the listener
- tnsnames.ora is not configured for extproc
Indexing
Creating Indexes
The first step in using interMedia Text is to create a Text
index. Without a b-tree index, value queries are slower; without a Text index,
contains queries are simply not possible. As we've seen, the index is created
using the create index command:
create index INDEXNAME on TABLE(COLUMN)
indextype is ctxsys.context
Unlike previous versions of ConText, there is no separate policy creation
step. We'll talk more about this when we discuss the preference system. View
indexing is not allowed in Oracle8i, consistent with regular b-tree indexes.
Parallel index creation is also not supported in this first Oracle8i
version.
Composite indexes are not supported -- only one column is allowed in the column
list. This column must be one of the following types: CHAR, VARCHAR, VARCHAR2,
LONG, LONG RAW, BLOB, CLOB, BFILE. Date, number, and nested table columns cannot
be indexed. Object columns also cannot be indexed, but their attributes can
be, provided they are atomic datatypes.
The table must also have a primary key constraint. This is needed mainly for
identifying the documents for document services, and may be used in the future
for other purposes. Composite primary keys are supported, up to 16 columns.
The issuing user does not need the ctxapp role to create an index. If the
user has Oracle grants to create a b-tree index on the column, then they have
sufficient permission to create a Text index.
Unlike previous versions of ConText, the issuing owner, table owner, and index
owner can all be different users, just like regular b-tree indexes.
IMPORTANT: If a syntax error occurs in the create index statement, the index
is still created. This is different from regular b-tree indexes; before you
reissue the corrected statement, you must drop the failed index first.
If an error occurs during actual indexing (e.g. you run out of tablespace)
then you can pick up where you left off (after correcting the problem, of course)
using alter index:
alter index INDEXNAME rebuild parameters ('resume')
Storage clauses to the create index statement are not used. Index storage
parameters are provided using the STORAGE object, which we will discuss later.
Partitioning clauses are also not supported at this time, although you can manually
partition the index tables if you wish, again using the STORAGE object.
Once the index is created, any export will include the index definition. At
import time, imp will re-create the index by issuing the create index statement.
The Indexing Pipeline
Oracle8i detects that this is a Text index and performs the text indexing.
The index is created using a pipeline of steps - "The Indexing Pipeline".
The diagram below shows the indexing pipeline:
+-----------+ +----------+ filtered +-----------+
| | doc data | | doc text | |
| datastore |---------->| filter |----------->| sectioner |---------+
| | | | | |--+ |
+-----------+ +----------+ +-----------+ | |
^ | |
| section offsets | |
|column data +---------------------------------+ |
| | |
| V |
+==========+ +----------+ +-----------+ |
| | | | | | |
| database |<-----------| engine |<-----------| lexer |<--------+
| | index data| | tokens | | plain
+==========+ +----------+ +-----------+ text
Let's step through this pipeline, stage-by-stage. Each stage will be covered in
depth later, so they'll be described here briefly.
Start with the datastore. This stage loops over the rows of the table and
reads the data out of the column. Unlike previous versions, the rows are not
read in any particular order. The datastore passes out document data. Usually,
this will simply be the column data, but some datastores use the column data
as a pointer to the document data. The URL_DATASTORE, for instance, uses the
column data as a URL, does a GET, and passes out the returned data.
The filter takes the document data and transforms it to some kind of text
representation. This is needed when storing binary documents such as Word or
Acrobat files. Unlike previous versions, the output of the filter does not have
to be plain text -- it can be a text format such as XML or HTML.
The sectioner, new for Oracle8i, takes the output from the filter, and converts
it to plain text. There are different sectioners for different text formats,
including XML and HTML. Conversion to plain text includes detecting important
section tags, removing "invisible" information, and reformatting the text.
The lexer takes the plain text from the sectioner and splits it into discrete
tokens. We have lexers for whitespace-delimited languages, and specialized lexers
for Asian languages, where segmentation is quite a bit more complex. The basic
lexer also includes theme functionality to build unified text/theme indexes.
Finally, Oacle8i takes all the tokens from the lexer, the section offsets
from the sectioner, and a list of low-information words called the stoplist,
and builds an inverted index. An inverted index stores tokens, and the documents
in which these tokens occur. For instance, our example from issue #1:
insert into docs values (1, 'first document');
insert into docs values (2, 'second document');
would produce an inverted index like this:
DOCUMENT ---> doc 1 position 2, doc 2 position 2
FIRST ---> doc 1 position 1
SECOND ---> doc 2 position 1
Each document is assigned an identifier called a docid, which is used in the inverted
index. The document primary key isn't stored.
Logging
The time needed to create an index is a function of the amount of text being
indexed. Oracle8i can create a log file which can be monitored during
indexing. start_log starts the logging, and should be called before issuing
the create index, in the same session:
ctx_output.start_log('mylog');
This will create a mylog file in ?/ctx/log which can be used to monitor indexing
progress. The directory can be changed by ctxsys, using the system parameter LOG_DIRECTORY.
See documentation for details.
IMPORTANT: In this first release of interMedia ?/ctx/log is not created
at install time. You may get errors like "unable to open file". Simply have
the DBA create this directory.
Logging is halted when the session exits or when end_log is called:
ctx_output.end_log;
Errors During Indexing
Processing errors encountered during indexing -- filter errors, file cannot
be read, etc. -- do not bring indexing to a halt. Instead, the error is logged
and indexing continues to the next file.
You can see these logged errors in the ctx_user_index_errors view. It will
tell you the rowid of the failed document, and the error which was encountered.
You need to manually empty this table when you are finished reviewing the errors:
delete from ctx_user_index_errors;
The Preference System
Classes, Objects, Preferences
The preference system allows you to customize Text indexing. Each stage of
the indexing pipeline is really an interface into which customized objects can
be plugged in. These stages are called "classes", and they can be found in the
ctx_classes view:
| DATASTORE |
Reads column data and returns document data |
| FILTER |
Reads document data and returns formatted document text |
| SECTION_GROUP |
Reads formatted document text and returns section offsets and plain document
text |
| LEXER |
Reads plain document text and breaks it into tokens |
| WORDLIST |
Contains query expansion defaults |
| STOPLIST |
Contains a list of low-information words |
| STORAGE |
Index table storage parameters |
Each class has one or more "objects" which are like behavioral templates.
The DATASTORE class, for instance, has the following objects:
| DIRECT_DATASTORE |
Passes column data directly as document data |
| DETAIL_DATASTORE |
Concatenates multiple rows of a detail table to construct document data
|
| FILE_DATASTORE |
Interprets column data as a filename. Reads file and passes file contents
as document data |
| URL_DATASTORE |
Interprets column data as a URL. Performs a GET and passes return as
document data |
| USER_DATASTORE |
Invokes a stored procedure to synthesize document data |
Objects can be found in the ctx_objects view.
But objects are only templates, and cannot be used directly in an index. Instead,
you create a "preference" from one of these object templates, customize it by
setting "attributes", then use the preferences to create the index. Let's take
a closer look:
Creating Preferences
You create a preference using ctx_ddl.create_preference, specifying an object
to use as template:
ctx_ddl.create_preference('mypref', 'FILE_DATASTORE');
This creates the preference mypref, using the FILE_DATASTORE object. If you are
familiar with previous versions, the object names have been changed, and there
is no comment argument.
This is done in PL/SQL, so surround it with "begin" and "end", or use "exec"
if you are in SQL*Plus. Also, in order to create preferences, the issuing owner
must have the ctxapp role.
Once the preference is created, we can customize it using set_attribute: For
our example, let's say all our files were in /docs. We can simply set the PATH
attribute to communicate this to interMedia:
ctx_ddl.set_attribute('mypref', 'PATH', '/docs');
If you are familiar with previous versions, set_attribute now comes *after*
create_preference, which allows us to do attribute-level validation, and necessitates
the first argument, which is the preference name.
This attribute is a text attribute, so it is set to a text value. There are
also number attributes and boolean attributes, which should be set to 'TRUE'
or 'FALSE' (or 'YES'/'NO'). The attributes for each object and their types can
be found in view ctx_object_attributes or in the documentation.
Some attributes have a list of allowed values. The allowed values can be found
in the view ctx_object_attribute_lov. Unlike previous versions, LOV attributes
can be set using a descriptive "tag" rather than a numeric value.
Non-preference Classes
Preferences are used for all classes except the SECTION_GROUP and STOPLIST
classes. These classes use specialized objects and have their own API. We'll
discuss this in the respective upcoming issues.
Using Preferences
Once you've created your preferences, you can build all kinds of customized
indexes by mixing and matching them. Because all objects of a class adhere to
a common interface, any filter can be hooked up to any datastore, etc. Got a
bunch of Korean Word 97 files? No problem -- hook up a FILE_DATASTORE preference,
an INSO_FILTER preference, and a KOREAN_LEXER preference.
You plug in your preferences using the parameters clause of create index:
create index doc_index on docs(text)
indextype is ctxsys.context
parameters ('datastore mypref')
This tells create index to use mypref as the datastore. Each of the classes has
a parameters keyword to name a preference. The keyword is the same as the class
name, except for SECTION_GROUP, whose keyword is the phrase "section group". You
can specify multiple classes by simply tacking the keyword-preference pairs on
the end of the string:
parameters('datastore mypref filter myfilter')
Any user can use any preference. To specify a preference in another user's
schema, simply add the owner name in the parameters string:
parameters('datastore kenny.mypref')
Unlike previous versions, the index makes a copy of the preference settings,
which means that after use in an index, the preference can be modified or dropped
without affecting the index. Note that only the preference owner can modify
or drop it.
The Default System
In the example above, only the datastore class is set. Since preferences were
not specified for the other classes in the parameters string, they get their
values from the default system.
The default system uses system parameters. System parameters are name- value
pairs which apply to the interMedia Text installation as a whole
-- kind of our version of init.ora. They can be seen in the view ctx_parameters.
The values for the parameters used by the default system are preference names.
For instance, in our example we have not specified a lexer preference. interMedia
gets the value of the system parameter DEFAULT_LEXER, and uses the preference
named there. This allows the administrator to set the default behavior for the
system.
The set of parameters used by the default system are:
| DEFAULT_DATASTORE |
The default datastore preference. At install time (the default default)
it is set to CTXSYS.DEFAULT_DATASTORE, a preference which uses the DIRECT_DATASTORE
object. |
| DEFAULT_FILTER_BINARY |
This is the filter preference to use when indexing binary datatypes such
as RAW, LONG RAW, or BLOB. At install time it is set to CTXSYS.INSO_FILTER,
which uses the INSO_FILTER object. |
| DEFAULT_FILTER_FILE |
This is the filter preference to use when indexing files, either BFILE
or the FILE_DATASTORE. At install time it is set to CTXSYS.INSO_FILTER.
|
| DEFAULT_FILTER_TEXT |
This is the filter preference to use when indexing text datatypes such
as CHAR, VARCHAR2, or CLOB. At install time it is set to CTXSYS.NULL_FILTER,
which does no filtering. |
| DEFAULT_SECTION_HTML |
This is the section group to use when the filter class uses the INSO filter,
or the datastore uses the URL_DATASTORE. The INSO filter filters binary
files to HTML (see filters). At install time it is set to CTXSYS.HTML_SECTION_GROUP,
which merely converts the HTML to plain text. |
| DEFAULT_SECTION_TEXT |
This is the section group to use in all other cases. At install time it
is set to CTXSYS.NULL_SECTION_GROUP, which does no sectioning. |
| DEFAULT_LEXER |
This is the default lexer preference. At install time it is set to CTXSYS.DEFAULT_LEXER.
The definition of that preference is language-specific. See below for more
details. |
| DEFAULT_WORDLIST |
This is the default wordlist preference. At install time it is set to
CTXSYS.DEFAULT_WORDLIST. The definition of that preference is language-specific.
See below for more details. |
| DEFAULT_STOPLIST |
This is the default stoplist. At install time it is set to CTXSYS.DEFAULT_STOPLIST.
The contents of that stoplist are language-specific. See below for more
details. |
| DEFAULT_STORAGE |
This is the default storage preference. At install time it is set to CTXSYS.DEFAULT_STORAGE,
which has no settings. |
Language-Specific Defaults
The LEXER, WORDLIST, and STOPLIST classes are the most affected by the language
of the documents. Every installation gets DEFAULT_LEXER, DEFAULT_WORDLIST, and
DEFAULT_STOPLIST preferences, but the definition of these preferences will depend
on the language chosen at install time.
?/ctx/admin/defaults holds language-specific scripts for
each supported language. Based on the language chosen, DBCA runs the matching
language-specific script, which creates the default preferences most appropriate
for that language.
The result is that a German instance gets a DEFAULT_LEXER which is case- sensitive
and does German decompounding, a DEFAULT_STOPLIST which has common German words,
and a DEFAULT_WORDLIST with German stemming. A Japanese instance gets different
defaults, more tailored to Japanese.
Browsing the Data Dictionary
Here is a list of views useful for browsing the interMedia Text
data dictionary: ctx_classes Available classes ctx_objects Available objects
for each class ctx_object_attributes Attribute names and types for each object
ctx_object_attribute_lov Allowed values for each LOV object attribute ctx_preferences
Preferences ctx_preference_values Attribute values of each preference ctx_user_index_objects
Objects used by the user's indexes ctx_user_index_values Attribute values for
the user's indexes
Datastores
The DATASTORE class
The datastore class is responsible for reading the column data from the database,
getting the document data based on the column data, and passing that up to the
filter.
It is important to keep in mind that some form of the document data must be
in a table. Datastore objects which access external resources such as files
and web pages still require a table of pointers to those resources.
Datastore preferences are plugged into indexes using the datastore keyword:
create index
...
parameters ('datastore mydstore');
The datastore class has five objects.
DIRECT_DATASTORE
The direct datastore is the simplest case -- it assumes that document data
is stored in the indexed column. It is so simple it has no attributes to customize.
There should be no reason to create a preference based on the DIRECT_DATASTORE
object -- CTXSYS.DEFAULT_DATASTORE can be used for any index.
IMPORTANT: If you are indexing BFILEs, make sure the user ctxsys has READ
permission on the BFILE directory.
FILE_DATASTORE
The file datastore reads the column data as a filename. It opens the file
and returns the contents as the document data. The indexed column cannot be
LONG or LOB types. The files must be accessible from the database host machine.
This object has one attribute: PATH. PATH is optional, and specifies the directory
where the documents are stored.
IMPORTANT: If PATH is specified, the column data must be simple file names;
if not specified, the column data must be full path names. You cannot mix and
match -- for instance, with PATH of /A, and a column of B/C.TXT it will NOT
find /A/B/C.TXT.
On some platforms, PATH can specify multiple directories. If you do this,
make sure that your documents have unique names -- if a document C.TXT is in
directories /A and /B, and /A is in the PATH before /B, only /A/C.TXT will be
indexed.
There is not much of a difference between this object and BFILEs. It's a matter
of choice. Like BFILE's, when the file changes, the row will not be automatically
reindexed. You should update the column value to itself to trigger a reindex.
URL_DATASTORE
The URL datastore reads the column data as a URL. It performs a GET and returns
the contents as the document data. The URLs for the documents must in the table
-- the URL datastore does not crawl. The indexed column cannot be LONG or LOB
types.
http, ftp, and file protocols are supported. The URL datastore also has multi-threaded
read (on most platforms) to maximize throughput. It is very customizable, and
has several attributes:
| TIMEOUT |
Read request timeout in seconds |
| MAXTHREADS |
Maximum number of threads to use for multithreaded read |
| URLSIZE |
Maximum length of an URL |
| MAXURLS |
Maximum number of URL's to read at one time |
| MAXDOCSIZE |
Maximum length of document |
| HTTP_PROXY |
HTTP proxy |
| FTP_PROXY |
FTP proxy |
| NO_PROXY |
Domain for proxy exclusion |
Like the file datastore, if the web page changes, then you should manually
update the column value to itself to trigger the reindex.
DETAIL_DATASTORE
Many legacy applications store text line-by-line, in multiple rows of a detail
table. The detail datastore constructs documents by concatenating the text of
these lines into a single, coherent document.
The detail datastore must be told the specifics of the detail table:
| DETAIL_TABLE |
Name of the detail table |
| DETAIL_KEY |
The foreign key column(s) in the detail table |
| DETAIL_LINENO |
The sequence column in the detail table |
| DETAIL_TEXT |
The text column in the detail table |
For instance, let's say the master and detail tables look like this:
the_master my_detail
ID TITLE M_ID SEQ LINETEXT
1 Grate Expectations 1 1 It was the best of times
1 2 it was the worst of times.
"That's G-R-A-T-E Expectations, also by Edmund Wells."
The attributes of the preference should be set like this:
| DETAIL_TABLE |
my_detail |
| DETAIL_KEY |
m_id |
| DETAIL_LINENO |
seq |
| DETAIL_TEXT |
linetext |
There is one final attribute: BINARY. By default, the detail datastore treats
each row as a separate line, and, when concatenating the contents, will automatically
stick a newline at the end of the text for each row. BINARY set to TRUE suppresses
this. In our example, we should set BINARY to FALSE or simply leave it unset.
The index itself should to be built on the master table:
create index myidx on master(somecolumn)
indextype is ctxsys.context
parameters ('datastore my_detail')
The master table must have a primary key column, just like any other table indexed
by interMedia Text. This primary key column is used to find the
corresponding detail rows, where detail.fk = master.pk.
The indexed column can be any column of allowable type -- the contents are
not used by interMedia. If you can afford the space, you can add a dummy
char(1) column called "text" or "body", to make the queries more readable:
select author, title from master
where contains(text, 'best & worst') > 0;
If detail rows are added, removed, or changed without changing the master
text column, you should manually update the text column to itself to trigger
a reindex.
For those of you familiar with previous versions of ConText, this is similar
to the master detail new datastore. The old master detail datastore is no longer
supported
USER_DATASTORE
The user datastore is new for Oracle8i. You write a stored procedure which,
given a rowid, synthesizes a document. interMedia calls this procedure
whenever document data is required. Let's look at an example. Say we have a
table like this:
articles
id number
author varchar2(80)
title varchar2(120)
text clob
and we want to automatically have author and title be part of the indexed document
text. We can write a stored procedure following the user datastore interface:
create procedure myproc(rid in rowid, tlob in out NOCOPY clob) is
offset number := 1;
begin
for c1 in (select author, title, text from articles
where rowid = rid)
loop
append_varchar_to_lob(tlob, c1.title, offset);
append_varchar_to_lob(tlob, 'by '||c1.author, offset);
dbms_lob.append(tlob, c1.text);
end loop;
end;
This procedure takes in a rowid and a temporary clob locator, and concatenates
all the articles columns into the temp clob, This assumes a helper procedure append_varchar_to_lob
has been written.
Note the use of the NOCOPY in the procedure signature. This is very
important to achieve best performance and to reduce TEMP space usage.
To ensure that the DBA has signed off on the code, only ctxsys-owned stored
procedures are allowed for the user datastore. Thus, we need to do something
like this as ctxsys:
create procedure s_myproc(rid in rowid, tlob in out NOCOPY clob) is
begin
appowner.myproc(rid, tlob);
end;
And, we need to make sure that the index owner can execute the stub procedure,
so:
grant execute on s_myproc to appowner
Now, back as the app owner, we create the preference, setting the PROCEDURE
attribute to the name of the ctxsys stub procedure:
ctx_ddl.create_preference('myud', 'user_datastore');
ctx_ddl.set_attribute('myud', 'procedure', 's_myproc');
When we then create the index on articles(text) using this preference, interMedia
indexing will see author and title in the document text.
The user datastore can be used for any on-the-fly document synthesis, including
more complex master-detail relationships, nested tables, text preprocessing,
or multicolumn concatenation, like the example.
There are four constraints on the procedures used in user datastores:
- They must be owned by ctxsys
- They must be executable by the index owner
- They cannot issue DDL or transaction control statements like "commit"
- They cannot be safe callouts or call safe callouts
If you change the stored procedure, indexes based upon it will not be notified,
so you should manually recreate such indexes. interMedia cannot tell what
you are doing in the stored procedure, so if the stored procedure makes use of
other columns, and those column values change, the row will not be reindexed.
The row is only reindexed when the indexed column changes.
When using the USER_DATASTORE, you may run into OCI_INVALID_HANDLE
errors. This is bug 881904, and is fixed in patch release 8.1.5.1.
Filters
The FILTER Class
The filter class takes the document data provided by the datastore class,
and filters it to readable text, passing it to the sectioner. With Oracle8i,
this does not have to be plain text -- it can be a text format such as HTML.
Filter preferences are plugged into indexes using the filter keyword:
create index
...
parameters ('filter myfilter');
The filter class has three objects.
NULL_FILTER
The null filter is used when the document contents are not in a binary format.
It simply passes text from the datastore to the sectioner. The default CTXSYS.NULL_FILTER
preference can be used in any index to employ the null filter.
CHARSET_FILTER
The charset filter is new for Oracle8i. It converts documents from a foreign
character set to the database character set. This is most useful for Japanese
customers, who have to deal with two widely-used and incompatible character
sets (and one infrequently-used incompatible character set).
The charset filter has one attribute, CHARSET, which takes the NLS name of
the source character set. A list of NLS charsets can be found in the Oracle
National Language Support Guide. The destination character set is always the
database character set, so does not need to be specified.
Additionally, you can specify JAAUTO, which is a custom setting for Japanese
character set auto-detection. Oracle8i will automatically detect Shift-JIS,
JIS7, or EUC for each document and convert it to the database charset if needed.
USER_FILTER
The user filter is a method for plugging in your own filters. You write a
filtering program, place it in ?/ctx/bin, and the indexing engine will invoke
it to filter each document. Here's an example -- we'll create an uppercase filter
which will uppercase every word.
We start by creating a program to do the filtering -- in this case, we'll
write a perl script:
#!/usr/local/bin/perl
open(IN, $ARGV[0]);
open(OUT, ">".$ARGV[1]);
while ()
{
tr/a-z/A-Z/;
print OUT;
}
close(IN);
close(OUT);
This perl script converts a file to uppercase.
User filter programs like this take two arguments. The first argument is the
name of the input file. The second argument is the name of the output file.
Our filter reads the contents of the input file, filters it, and writes the
output to the output file.
the program (called upcase.pl) is placed in ?/ctx/bin. Ensure that it's executable
by the oracle operating-system user. Now the preference is created:
ctx_ddl.create_preference('mypref', 'USER_FILTER');
ctx_ddl.set_attribute('mypref', 'EXECUTABLE', 'upcase.pl');
When this preference is used in an index, the indexing engine will invoke the
user filter program for each document when the text is required.
INSO_FILTER
The INSO filter automatically recognizes and filters over a hundred different
formats, including Word 97 and Acrobat. The full list can be found in the documentation.
The INSO filter does not have any attributes at this time, so the CTXSYS.INSO_FILTER
preference can be used in any index needing filtering.
IMPORTANT: This filter outputs HTML, not plain text. Make sure you employ
the HTML section group or all these tags will be indexed. The default system
will employ the HTML section group when the INSO filter is detected.
The INSO filter uses code from Inso, "the world leader in filtering and viewer
technology," and Adobe. This code is not integrated code directly, it instead
follows the user filter architecture. The user filter executable for the INSO
filter is ctxhx.
ctxhx requires shared libraries and other files (.flt and .tpt files) located
in ?/ctx/lib. The installer should copy the correct platform- specific files
to this directory. Additionally, ensure that the external procedure agent has
?/ctx/lib in the LD_LIBARY_PATH environment variable (PATH on NT). This can
be done using the ENVS section in the listener.ora. On NT you may need to have
ORACLE_HOME set in the ENVS section, as well.See Installation section above
for details.
If you are encountering problems using this filter, it can be run by hand.
First get a formatted binary file. Let's call it testfile.doc. You do:
ctxhx testfile.doc testfile.txt
This should create the file testfile.txt with the HTML representation. The error
messages you get, if any, should help you determine what's wrong.
The INSO filter is supported only on Solaris, HP-UX, AIX, and NT. Other platforms
are not able to use this filter at this time.
Where's the HTML filter ?
Some of you familiar with previous versions of ConText may be wondering about
the HTML filter. In 8i, this code has been moved to the new sectioner class,
which allows any filter to spit out HTML if it so desires.
Section Groups
The SECTION_GROUP Class
The section group class is new for Oracle8i, and incorporates functionality
that was part of the wordlist and filter classes in ConText. It takes a text
format, such as XML or HMTL, as input. It is unique in having two outputs --
the section boundaries, which go to the engine, and the plaintext, which goes
to the lexer.
Creating Section Groups
Section groups are not created using create_preference -- they have their
own API in ctx_ddl:
ctx_ddl.create_section_group('mygroup','html_section_group');
The first argument is the name of the new section group. The second argument is
the section group type. This specifies the input text format, and tells the sectioner
the rules for detecting section boundaries and transforming to plaintext. Each
section group type is covered briefly below.
An empty section group can transform formatted test to plaintext, but will
not index any section boundaries. You must first tell it which sections to index
by adding sections to the group.
Sections have three important attributes: TAG, NAME, and TYPE. TAG tells the
section group how to recognize the section. Because the section group already
knows the format, you do not need to specify start and end tags; a section in
an HTML section group with TAG of B knows that the section starts with <b> and
ends with </b>, for instance. Tags are unique across the sections of a section
group.
NAME is how you refer to the section in queries. You may want to name the
B tag BOLD, for readability. Multiple tags can be mapped to the same name --
they are then treated as instances of the same section. H1, H2, H3 can all be
treated as instances of the HEADLINE section. We recommend avoiding non- alphanumeric
characters -- such as underscore -- in section names. Using these characters
will force you to escape the names in queries.
TYPE is totally new for 8i. There are three different types of sections: ZONE,
SPECIAL, and FIELD.
Section Types
ZONE sections are like sections from previous versions. interMedia records
where in the document the section start and end tags occur. WITHIN queries check
that the hit words occur between the start and end word offset.
If a ZONE section repeats, each instance is treated separately in query semantics.
See Examples, below. ZONE sections can enclose other sections, including themselves,
and can be enclosed by other sections. ZONE sections are added like this:
ctx_ddl.add_zone_section('groupname', 'sectionname', 'tag');
SPECIAL sections are so named because they are not recognized by tags. There are
two special sections in Oracle8i -- SENTENCE and PARAGRAPH -- both recognized
by punctuation in the lexer. They index the start and end of sentences and paragraphs,
respectively. You add special sections like this:
ctx_ddl.add_special_section('groupname', 'sectionname');
No tag argument is needed. There are only two allowed values for section name:
SENTENCE, and PARAGRAPH.
ZONE and SPECIAL sections index section start and end word offsets, but do
nothing to the words in the section. FIELD sections, on the other hand, extract
their contents and index them separately from other words in the document. WITHIN
queries are run on this separate, smaller index. This makes field section query
faster than zone section query -- up to three times as fast in tests we've conducted
-- especially when the section tags occur in every document.
This speed comes at the cost of flexibility. FIELD sections are meant for
non-repeating, non-overlapping sections. If a field section repeats, it is treated
as a continuation of the section, not a distinct instance. If a field section
is overlapped by itself or by another field section, it is implicitly closed
at the point where the other section starts. Also, there is a maximum of 64
field sections in any section group. This is 64 section names, not tags. Remember
that you can map multiple tags to the same section name. You add field sections
like this:
ctx_ddl.add_field_section('groupname', 'sectionname', 'tag');
To illustrate, let's work through a couple of examples to illustrate the different
types of sections and the impact on query semantics. We'll use the following document
as an example:
<A>rat</A><A>ox</A>
<B>tiger rabbit</B>
<C>dragon<C>snake</C></C>
This is an XML-style markup, but without a DTD, so we will use the basic section
group type:
ctx_ddl.create_section_group('mygroup','basic_section_group');
Let's start with ZONE sections:
ctx_ddl.add_zone_section('mygroup', 'asec', 'a');
ctx_ddl.add_zone_section('mygroup', 'bsec', 'b');
ctx_ddl.add_zone_section('mygroup', 'csec', 'c');
This tells the section group to recognize A tags as the section ASEC, B tags as
the section BSEC, etc. To do section queries, we use the WITHIN operator. Remember
to use the section name and not the tag:
contains(text, 'rat within asec') > 0
This finds the document.
contains(text, 'tiger within asec') > 0
This does not find the document. Although it has "tiger", it does not occur in
the ASEC. If instead of the original setup we had mapped A and B to the same section
name:
ctx_ddl.add_section('mygroup', 'asec', 'a');
ctx_ddl.add_section('mygroup', 'asec', 'b');
Then both:
contains(text, 'rat within asec') > 0
contains(text, 'tiger within asec') > 0
would find the document, because both A and B are treated as ASEC.
An important facet of ZONE section query semantics is that each instance is
treated distinctly. That is, a query like this:
contains(text, '(tiger and rabbit) within bsec') > 0
finds the document, but a query like this:
contains(text, '(rat and ox) within asec') > 0
does not find the document. Although the document has "rat" and has "ox", and
"rat" is in ASEC, and "ox" is in ASEC, "rat" and "ox" are not within the SAME
ASEC. Note that
contains(text, '(dragon and snake) within csec') > 0
finds the document, since they are both in the outer CSEC, even though the inner
CSEC contains only "snake".
Special sections follow the same semantics, so a query like this:
contains(text, '(jumbo and shrimp) within sentence') > 0
is looking for documents with "jumbo" and "shrimp" in the SAME sentence.
Now let's look at field sections. What if in setup we made ASEC a field section
instead of a zone section:
ctx_ddl.add_field_section('mygroup', 'asec', 'a');
Unlike zone sections, each instance of a field section is considered a continuation
of the section. The query
contains(text, '(rat and ox) within asec') > 0
finds the document, although it didn't when ASEC was a zone section. The field
section simply stitches the two instances together. Field sections work best on
sections like BSEC, which does not repeat like ASEC nor nest like CSEC.
One last subtlety with field sections. If ASEC is a zone section, then
contains(text, 'rat') > 0
finds the document -- it contains the word rat. However, remember that field sections
work by extracting the document contents and indexing them separately. Thus, if
ASEC were a field section, this query would not find the document. "rat" is in
ASEC, which is separate from the document contents.
You can, however, change this by making the field section visible. There is
an optional boolean fourth argument to add_field_section. If this boolean argument
is set to true:
ctx_ddl.add_field_section('mygroup', 'asec', 'a', TRUE);
then the field section is visible, and the section contents are visible to non-within
queries, just like zone sections. This is accomplished by double- indexing the
word -- once as part of the extracted section, and once as part of the body, so
this option has index space cost.
Using Section Groups
Section groups are plugged into indexes using the section group keyword:
create index
...
parameters ('section group mygroup');
Unlike previous versions, you do not need to set an attribute in the wordlist
object. You also no longer need to set STARTJOINS and ENDJOINS in the lexer. However,
the WHITESPACE, NEWLINE, and PUNCTUATIONS settings in the basic lexer affect sentence
and paragraph boundary recognition, which then impact special sections. There
are five section group types.
NULL_SECTION_GROUP
The null section group is used when there is no section information to be
extracted -- it simply passes the text through to the lexer. Only special sections
can be added to the null section group.
BASIC_SECTION_GROUP
The basic section group is for simple XML-style markup without going full-
bore on the DTD stuff. It can be used in custom situations where full XML is
unnecessary. Tags are in the form .... Entities and tag attributes
are not supported. The only processing it does for plaintext is removal of the
markup tags.
HTML_SECTION_GROUP
The HTML section group is for HTML. Good name, huh? It knows the rules for
HTML, including ISO Latin-1 entity translation, HTML to plaintext conversion,
and
omission. It knows HTML 4.0 tags, and has code to deal with unknown tags.
SCRIPT and STYLE contents, and comments are removed. Contents of the TITLE
section are preserved. META tag information indexing is a known customer requirement
for a future version.
For those familiar with previous versions, there is no need for KEEP_TAG anymore
-- this is automatic when you add the tag to the section group.
XML_SECTION_GROUP
The XML section group is for XML. It does no processing to plaintext except
for entity translation and tag removal. It can handle non-system, non-parameter
entities in the internal DTD. It cannot access external DTD's in this version.
NEWS_SECTION_GROUP
The news section group handles newsgroup-style postings -- RFC-1036 in particular.
This format is used in newsgroups and is similar to the one used for e-mail.
Note that while RFC-1036 defines the set of allowable header fields, Oracle
does not require this -- any header fields can be used.
Messages in this format consist of a header section, a blank line, and a body
section. The news section group recognizes <beginning of line>TAG: as the start
of sections, and <end of line> as the end of the section. It stops looking for
this pattern when the blank line is encountered.
Translation to plaintext consists of removing all header lines which are not
important to section searching. A message like this:
From: me
To: you
X-ref: 5!oowj
hello! How are you!
with only the from and to tags in the section group, will look like this to the
indexing engine:
<from section start>me<from section end>
<to section start>you<to section end>
hello! How are you!
Lexers
The LEXER Class
The lexer class gets plaintext from the sectioner and splits it into words.
Lexer preferences are plugged into indexes using the lexer keyword:
create index
...
parameters ('lexer mylex');
The lexer class has five objects.
BASIC_LEXER
The basic lexer can be used for most European languages. It is programmed
with default rules for splitting whitespace-delimited text into words. You can
modify its parsing behavior to some extent with the lexical attributes.
Ordinarily, non-alphanumeric characters end a word. SQL*Plus is lexed as "SQL"
and "Plus". The JOINS attributes allow you to modify this behavior by declaring
sets of non-alphabetic characters to treat as valid word letters.
PRINTJOINS are non-alphanumeric characters which are treated as valid characters,
and become part of the word. Declaring PRINTJOINS of "*" lets SQL*Plus be lexed
as "SQL*Plus".
SKIPJOINS are like PRINTJOINS, but they do not become part of the word. Declaring
SKIPJOINS of "*" lets SQL*Plus be lexed as "SQLPlus".
STARTJOINS are like PRINTJOINS, but they only come at the start of the word.
If they occur in the middle of a word, a new word is started. Having "*" as
STARTJOINS lets SQL*Plus be lexed as "SQL" and "*Plus". Multiple STARTJOINS
are allowed at the start of a word.
ENDJOINS are like STARTJOINS, but for the end of the word. They implicitly
end a word, too. So, "*" as ENDJOINS lexes SQL*Plus as "SQL*" and "Plus". Multiple
ENDJOINS are allowed at the end of a word. STARTJOINS and ENDJOINS used to be
important for section searching in previous versions, but with the new sectioner
class, they are no longer needed for that purpose.
Each of these four JOINS attributes is a set of characters. Setting PRINTJOINS
to "_%*" means that "_", "%", and "*" are all PRINTJOINS.
NUMJOIN is a single character which is the numeric join character. NUMGROUP
is the single numeric group character. They are defaulted based on your NLS_LANG
setting. For US, NUMJOIN is "." and NUMGROUP is ",".
Finally, CONTINUATION are those characters which indicate line continuation.
These and the following newline are removed from the text. Default value is
hyphen and backslash.
Then there are three attributes important for sentence/paragraph indexing.
PUNCTUATIONS are the set of punctuation marks (?!. by default). WHITESPACE is
a set of additional whitespace characters other than space and tab, which you
get for free and cannot change. NEWLINE can be set to NEWLINE or CARRIAGE_RETURN.
This controls whether lines are ended by \n or \r\n.
A sentence is then recognized as a PUNCTUATION followed by one or more WHITESPACE.
A paragraph is a PUNCTUATION followed by a NEWLINE. There are other patterns
which are variations on this, but that's the basic idea.
The third set of basic lexer attributes controls term normalization -- the
process of converting words to a standard form. BASE_LETTER is a boolean attribute
which, if set to YES, will convert accented characters to their unaccented forms.
By default, each word is uppercased during indexing so that queries are case-independent.
MIXED_CASE is a boolean, which, if set to YES, does not do this. This makes
indexing and queries case-sensitive.
ALTERNATE_SPELLING is an attribute which, if set, uses letter transformation
rules for common alternate spellings. It can be set to GERMAN, DANISH, or SWEDISH.
In GERMAN mode, for instance, A-umlaut gets transformed to AE. Since the lexer
is used at both indexing and query time, query for a word containing A-umlaut
will find the word spelled with A-umlaut or AE.
Finally, COMPOSITE controls word decompounding. In German, for instance, multiple
words are often combined into a single string. This makes lexing difficult,
because the words are no longer whitespace-delimited. If COMPOSITE is set, then
the lexer breaks up these compound words, which allows interMedia to
index and query as usual. COMPOSITE can be set to GERMAN or DUTCH. Unset or
DEFAULT means no decompounding.
The last set of basic lexer attributes control text and theme indexing. New
for 8i, the basic lexer can generate and index themes for each document, in
addition to splitting it up into words. This merges the functionality found
in the separate THEME LEXER and BASIC LEXER of previous versions.
INDEX_TEXT is a boolean which controls word indexing. INDEX_THEMES controls
theme indexing. Setting INDEX_THEMES to NO makes it a text-only index. Setting
INDEX_TEXT to NO makes it a theme-only index. They cannot, of course, be both
NO. Indexing themes takes longer and uses a bit more index space, but improves
the efficacy of ABOUT queries. We'll talk more about this when we discuss ABOUT
query.
The CTXSYS.DEFAULT_LEXER preference has theme indexing ON for the English
language-specific scripts.
JAPANESE_VGRAM_LEXER
Japanese is not a whitespace-delimited language (except in textbooks) so it
is very difficult for computers to pick out individual words. The Japanese V-gram
lexer gets around this problem by indexing overlapping clusters of characters.
The Japanese word ABC might be decomposed into AB and BC, for instance. Query
terms are also decomposed this way. Thus, the contains query is not looking
for words, but for patterns of characters.
The Japanese lexer does not have any attributes in this version.
CHINESE_VGRAM_LEXER
Chinese is also not whitespace-delimited, so a similar solution has been applied
for Chinese, as well. The Chinese v-gram lexer also does not have any attributes.
CHINESE_LEXER
New for Oracle8i is a segmenting Chinese lexer, which can actually
recognize some Chinese words as whole entities. The rest of the text is still
broken into v-grams, but this should be more space-efficient and have faster
query than the pure v-gram method. The Chinese segmentation lexer does not have
any attribtues.
KOREAN_LEXER
Korean is whitespace-delimited, but has problems with verbs, which can have
thousands of different forms. Our Korean lexer is a lexicon-driven engine (using
a third-party 3-soft lexicon) which simply eliminates verbs from indexing. New
for Oracle8i is the ability to eliminate adverbs and adjectives, do various
form conversions and perform morphological and segmentation decomposition.
Other Indexing Classes
The STOPLIST Class
Not every word in a document is worth indexing. Linguistic lubricant like
prepositions and conjunctions are important for language understanding, but
are not very useful for information retrieval -- they are very common and so
convey very little information about the document by their presence. Most of
the time, it's not worth the space to index these words.
The stoplist class holds a list of these words, called stop words. During
indexing, the engine consults the stoplist and filters out the stop words. The
stoplist class does not use the create_preference API. Instead, it has its own
API:
ctx_ddl.create_stoplist('mylist');
Words are added using add_stopword:
ctx_ddl.add_stopword('mylist', 'the');
Here the article "THE" has been added to the stoplist. You can see the words in
a stoplist using the ctx_stopwords view. A stoplist can have a maximum of 4095
stopwords. Each can be up to 64 bytes in length.
Stopwords are case-sensitive in a stoplist, but if the index is not case-
sensitive, the stopwords will not be case-sensitive. So, you could add "THE"
and "the" to a stoplist, but if the index is not case-sensitive, there would
be no difference -- you'd just be wasting a stopword slot.
Stoplists are plugged into indexes using the stoplist keyword:
create index
...
parameters ('stoplist mylist');
New for Oracle8i, stopwords can be added to the index without re-indexing:
alter index myidx rebuild parameters ('add stopword AND');
This adds "AND" as a new stopword after indexing. Remember that the index makes
its own copy of everything, so this does not affect the stoplist used to create
the index.
The default stoplist, CTXSYS.DEFAULT_STOPLIST, is language-specific, and default
stoplists are provided for most European languages.
Other Stop Objects
Stoplists can also hold two other kinds of objects: stop themes and stop classes.
Stop classes are classes of words, rather than individual words. The only stop
class available in Oracle8i is NUMBERS, which stops all numbers from
being indexed. Alphanumeric words, such as 123.456.789 and abc231 are still
indexed. This behavior depends on the lexer to recognize numbers.
Stop themes are used only during theme indexing. The entries added as stop
themes are themed, and these themes AND THEIR CHILDREN are not indexed. For
instance, say you enter "biology" as a stop theme. This will prevent themes
like "Physical Sciences:Biology:Genetics" and "Physical Sciences: Biology:Botany"
from being indexed. This is just an example, and does not actually reflect the
knowledge base.
Stop themes are used only during theme indexing, so adding "biology" as a
stop theme will not stop the word "biology" or biology-associated words from
being indexed.
The WORDLIST Class
The wordlist class is an odd class because it has no effect on indexing. Instead,
it holds fuzzy and stem expansion settings used at query time.
Wordlist preferences are plugged into indexes using the wordlist keyword:
create index
...
parameters ('wordlist mylist');
There is only one wordlist object: BASIC_WORDLIST, with the following attributes:
| STEMMER |
This attribute is set to the default stemming type. Stemming is an expansion
of a word to different forms. The stem expansion of GO might include GO,
GOING, WENT, and GONE. The rules for this vary from language to language.
(See expansion query operators).
STEMMER can be set to one of the following: ENGLISH (inflectional),
(English) DERIVATIONAL, DUTCH, FRENCH, GERMAN, ITALIAN, SPANISH, and NULL,
which means no stemming. The CTXSYS.DEFAULT_WORDLIST setting is language-specific.
|
| FUZZY_MATCH |
This attribute is set to the default type of fuzzy match. Fuzzy match
is an expansion technique designed to find words close in form, such as
mistyped or mis-OCR'ed versions.
FUZZY_MATCH can be set to one of the following: GENERIC, JAPANESE_VGRAM,
KOREAN, CHINESE_VGRAM, ENGLISH, DUTCH, FRENCH, GERMAN, ITALIAN, SPANISH,
and OCR. The CTXSYS.DEFAULT_WORDLIST setting is language-specific.
|
| FUZZY_SCORE |
This is the default score floor for fuzzy expansions. New for Oracle8i
is the ability to limit the fuzzy expansion to the best matches. The fuzzy
score is a measure of how close the expanded word is to the query word --
higher is better. Setting fuzzy score means that fuzzy expansions below
this score will not be produced. You can set this from 0 to 80. |
| FUZZY_NUMRESULTS |
This is the default maximum number of fuzzy expansions. Setting fuzzy
numresults limits the fuzzy expansion to a certain number of the best matching
words. You can set this up to 5000. |
The STORAGE Class
The storage class holds storage parameters for the index tables. The Text
index is made up of five oracle objects, so the single storage clause in the
create index statement is insufficient to specify storage. Storage preferences
are plugged into indexes using the storage keyword:
create index
...
parameters ('storage mystore');
There is only one storage object: BASIC_STORAGE, with the following attributes:
| I_TABLE_CLAUSE |
This attribute is the storage parameters for the I table, which is the
main index table. There will be at least one row per unique word in your
index -- probably multiple rows per word. Each row has a BLOB which tries
to stay inline, making the row a bit large. The value is simply tacked on
to the end of the create table statement, so you could set it like this:
ctx_ddl.set_attribute('mystore', 'I_TABLE_CLAUSE',
'tablespace foo storage (initial 1K)');
|
| K_TABLE_CLAUSE |
This attribute is the same, but for the K table, which is an IOT for
rowid to docid conversion. There will be one row per row in your base table.
Each row consists of a number and a rowid. |
| R_TABLE_CLAUSE |
This attribute is the same, but for the R table, which is a BLOB table
for docid to rowid conversion. There will be only a few rows in this table,
and will have a large BLOB in each row. |
| N_TABLE_CLAUSE |
This attribute is the same, but for the N table, which has invalid docids
waiting for garbage collection. The number of rows in this table will depend
on the frequencies of update and delete DML, and of garbage collection optimization.
Each row is short, holding two numbers. |
| I_INDEX_CLAUSE |
This attribute is the same, but for the unique index on the I table.
|
Indexing Memory
Those of you familiar with previous versions may be wondering where the index
memory setting is. For new users, interMedia builds an inverted index,
so documents are not added to the index one at a time, like a b-tree index.
Instead, information from multiple documents is buffered up and periodically
flushed to the index during indexing. In previous versions, you could set the
size of this buffer in the analogue of the storage class. Setting this buffer
higher reduces index fragmentation but uses more memory.
In Oracle8i, index memory can be set directly in the parameters string. This
allows you to set index memory on a per-operation basis. The keyword used is
MEMORY:
create index
...
parameters ('memory 2M');
Here a paltry 2 megabytes of memory is used for indexing. If not specified,
the system parameter DEFAULT_INDEX_MEMORY is consulted for the systemwide default
index memory. At install time, this is 12M.
The amount of memory cannot exceed system parameter MAX_INDEX_MEMORY. This
allows the system administrator to disallow outrageous amounts of index memory.
The MEMORY keyword is also allowed in the alter index parameter string for
batch DML (background DML uses DEFAULT_INDEX_MEMORY). This allows you to have
different memory sizes for create index and DML, something which is not possible
in previous versions. You simply specify a different memory amount in the create
index parameter string then in the alter index parameter string.
DML
DML Processing
Most text search engines have asynchronous document updates and additions --
changes to the index are usually done in batches, after and separately from
the document changes. This is due to three aspects of inverted indexing:
- Text indexing a single document is a lot of work. The values are usually
long, and at the very least you have to break it into words. Depending on
the objects, you may also have to do web page requests, filtering, and HTML
parsing.
- Inverted indexes, composed of lists of documents by word, are best updated
in batches of documents at a time. Updating the index one document at a time
leads to either word lists one document long or a lot of appending to existing
lists.
- Most text applications are fairly static, having relatively lower DML frequency,
and less need for full consistency semantics than traditional transaction
processing systems.
interMedia faces the same challenges. Here is what it does for each DML
on an indexed column:
| INSERT |
The document rowid is placed into a queue, dr$pending, for later addition
to the text index. Queries before this DML is processed will not find the
new document contents. |
| UPDATE |
The old document contents are invalidated immediately, and the document
rowid is placed into the dr$pending queue for later reindexing. Queries
before this DML is processed will not find the old contents, but neither
will it find the new contents. |
| DELETE |
The old document contents are invalidated immediately. No further DML
processing is required -- queries will no longer find the document. |
Thus, interMedia has synchronous invalidation but asynchronous addition.
This extends to transactions, as well:
delete from table where contains(text, 'delword') > 0;
select * from table where contains(text, 'delword') > 0;
--> you get no rows returned
rollback;
select * from table where contains(text, 'delword') > 0;
--> you get the rows returned
and:
insert into table values (1, 'insword');
select * from table where contains(text, 'insword') > 0
--> you do not get the row
commit;
select * from table where contains(text, 'insword') > 0
--> you still do not get the row
The synchronous invalidation is new to this version. Also new is that the DML
notification is done through integrated kernel code, and not through triggers
as in previous versions.
Processing Additions
So how do you process your inserts and updates when they are sitting in the
queue? interMedia provides two methods: sync and background.
Sync is the manual way -- you control when and how it's invoked. Each invocation
processes all pending inserts and updates for an index:
alter index myindex rebuild online parameters ('sync')
The ONLINE keyword is very important. Without this, during the sync operation
queries are blocked and DML on the base table fails. Make sure to use it for sync
operations.
Background DML requires you to start a ctxsrv background daemon in the operating
system:
ctxsrv -user ctxsys/<password>
Once started, the daemon polls the dr$pending queue for DML, and automatically
processes additions as they become available.
You can use either or both -- it's largely a matter of your application requirements.
Background DML scans for DML constantly. This leads to new additions being indexed
automatically and quickly. However, it also tends to process documents in smaller
batches, which increases index fragmentation. Sync DML is done at the user's
request, so usually the batches are larger and thus there's less index fragmentation.
A happy compromise could possibly be reached by invoking sync through dbms_sql
in a dbms_job which runs at longer intervals.
The DML Queues
The dr$pending queue holds the documents waiting to be indexed. It should be queried
through the ctx_user_pending view, which makes it more readable. Additionally,
there is a dr$waiting queue which is used when documents are waiting to be indexed
and they are updated again. The drq_inprog, drq_batches, and drq_batchno tables
from previous versions are all no longer needed.
Optimization
Index Fragmentation
The Text index is an inverted index, so essentially consists of words and
lists of documents which contain that word. When new documents are added to
this index, these lists have to be extended. Instead of appending to these lists,
more rows for the word are created. However, this can lead to many rows with
sub-optimal, short lists -- called "index fragmentation". For instance, say
we have the following in our index:
DOG DOC 1 DOC 3 DOC 5
Now a new document with "dog" comes along. Simply add it to the index as a new
row:
DOG DOC 1 DOC 3 DOC 5
DOG DOC 7
Subsequent DML will also create new rows:
DOG DOC 1 DOC 3 DOC 5
DOG DOC 7
DOG DOC 9
DOG DOC 11
This is why background DML generally produces more index fragmentation than spot
invocation of sync: processing DML less frequently in larger batches allows newly
added rows to have longer lists, which reduces the number of rows in the index
table.
Keep in mind that a certain amount of fragmentation is expected in 8i. interMedia
will try to keep index table rows below 4000 bytes long, to inline the LOB and
to speed up index row merges used during phrase search, AND, section search,
etc. However, frequent DML will probably result in much more fragmentation than
this.
Document Invalidation
During an update or delete, the old document information must be removed from
the index so that queries will no longer hit it. However, because the inverted
index consists of words and rows of occurrences, this would entail finding all
words in the old version of the document (which may not be available), and removing
an occurrence from the lists of those rows in the index table. This is too much
work to do synchronously, and deletes must be synchronous in the integrated
model.
Instead, interMedia marks the old document contents as invalid and
does not touch the index. The mark signals queries to remove the document from
any query result sets. However, this leaves the old information in the index,
taking up space in the index tables.
Optimization
To solve these potential problems in the index, you run optimization. Optimization
has two modes: FAST and FULL. FAST optimization targets fragmentation only:
alter index myindex rebuild online parameters ('optimize fast');
The ONLINE keyword is important -- without it, queries are blocked and DML fails
with an error. FAST optimization runs through the whole index table and glues
fragmented rows together, to a maximum of 4000 bytes for a single row. This reduces
the number of rows in the index table.
FULL optimization does both defragmentation and garbage collection, which
removes the old information left over after document invalidation:
alter index myindex rebuild online parameters ('optimize full');
It loops through each row of the index table, but, unlike fast optimization, cracks
open each row's list, removing old document information. It also glues separate
rows together where possible.
Because garbage collection is more involved and time-consuming than defragmentation,
FULL optimization does not have to be run on the whole table at one time. Instead,
you can run it for a limited period of time:
... parameters ('optimize full maxtime 5')
which means run for a maximum of 5 minutes. At the end of 5 minutes, it will stop.
The next time you run optimization, it will pick up where it left off. This lets
you do a little of bit optimization each night during off times, and ensure that
it is not loading the system when the off time ends.
Garbage collection is needed only after document invalidation, which happens
only for delete and updates. If your document table is append-only, meaning
that documents are inserted and never changed or removed, then it should suffice
to run only defragmentation.
Optimization Concurrency
Unlike previous versions, optimization (either mode) does not block background
DML processing. However, because Oracle prevents alter index operations from
running concurrently, sync invocations will be prevented.
Querying
The Basics
At this point all aspects of building an index and keeping it up to date have
been discussed. However, indexes are a means to enabling content- based queries.
IMPORTANT: Unlike a value query, which is slower without a b-tree index, contains
queries are completely disallowed without a Text index.
You do a query using the contains operator:
select id
from texttab
where contains(textcol, 'query') > 0
The first argument to contains is the name of the text column. The second is the
text query, which is limited to 2000 bytes in length. It returns a number, which
indicates the strength of match. 0 means not a match, so it is constrained to
> 0. The result is that this query finds the id's of all rows in texttab where
the textcol has the word "query".
Scoring
Of course, now you're wondering how to use that return value in the select
list and order by clause. An operator is provided just for that purpose:
select id, score(1)
from texttab
where contains(textcol, 'query', 1) > 0
order by score(1) desc
The "1" in the score and contains operators is called the "contains label" and
it's a number which links the score in the select list to the contains in the
where clause -- it can be any number, so long as it matches.
Score can be between 0 and 100, but the top-scoring document in the query
will not necessarily have a score of 100 -- scoring is relative, not absolute.
This means that scores are not comparable across indexes, or even across different
queries on the same index. Score for each document is computed using the standard
Salton formula:
3f(1+log(N/n))
Where f is the frequency of the search term in the document, N is the total number
of rows in the table, and n is the number of rows which contain the search term.
This is converted into an integer in the range 0 - 100.
Query operators like AND and OR operate on the scores of their operands.
Where Can I Use contains?
Unlike the transparent query rewrite of previous versions, you can include
a contains clause in any SQL statement using any tool which can issue oracle
SQL. Here is a short list of new places where contains queries can pop up:
- subqueries
- virtual tables
- PL/SQL
- View definitions
- DML (insert as select, e.g.)
There is no longer a limitation on the length of the contains SQL statement,
although there is a limit on the query term length.
Other Query Methods
For those of you familiar with previous versions, there is no two-step or
text cursors (n-memory query). If you absolutely must have result tables, use
insert as select. Instead of text cursors, use real PL/SQL cursors. To sort
by score, use an order by clause with a FIRST_ROWS hint. The hint will avoid
an oracle SORT ORDER BY.
However, interMedia still supports count_hits. count_hits is a fast
way to get the hit count for a text query, in PL/SQL:
num_hits := ctx_query.count_hits('indexname','query',TRUE);
The first argument is the name of the index. This should be a string, so use
quotes if you are using a literal value. The second argument is the query string,
and the third is a boolean toggle called "exact". If exact is TRUE, then the
result is accurate. If exact is FALSE, then the result is only an upper bound.
The actual count could be lower. exact FALSE is faster, because it doesn't screen
out invalid documents (see issues on DML and optimization for more details).
You could also do "select count(*)" with a contains clause, and this is recommended
when you have structured conditions, but count_hits is faster for text-only
queries.
Simple Queries
The simplest query is a single word:
contains(text, 'dog') > 0
You can escape your word with curlies:
contains(text, '{dog}') > 0
This is useful when you want to query on a reserved word or your word has special
characters which are query operators. The list of reserved words and the query
operator characters are in the documentation.
You can query on a phrase just as easily:
|