Oracle interMedia Text 8.1.5 - Technical Overview
Oracle8i interMedia Text 8.1.5 - Technical Overview
Garrett Kaminaga, Product Development, Database and Application Server Technologies, Oracle HQ
interMedia integrates all of the features and functions of the former Oracle
ConText product with image, audio and video, and geographic location services for Web
Content Management applications built with the Oracle8i Internet Platform.
interMedia Text has been completely re-architected, enhanced and
tightly integrated with Oracle8i to provide up to an order of magnitude better
search performance, greater ease of use and new capabilites like thematic searches. Please
refer to the documentation for detailed information.
Table of Contents
- Introduction to interMedia text
- Installation and Setup
- Indexing
- The Preference System
- Datastores
- Filters
- Section Groups
- Lexers
- Other Indexing Classes
- DML
- Optimization
- Querying
- Query Operators (Part 1)
- Query Operators (Part 2)
- Thesaurus
- Query Tuning
- Query Feedback
- Highlighting
- Linguistic Extraction
Introduction to interMedia Text
What does it do?
interMedia Text extends Oracle8i by indexing any text
or documents stored in Oracle8i, in operating system flat files or URLs. It
also enables content-based queries, (e.g., find text and documents which contain
this word) using familiar, standard SQL. It enables file-based text applications
to use Oracle8i to manage text and documents in an integrated fashion
with traditional relational information.
The easiest way to grasp what it does is to see it in action...
How do you use it?
Here is a quick example. We'll start by creating a table to hold our documents.
For now, we'll use a small varchar2 column to hold the document text. We'll
also need a primary key to identify each document.
create table docs (id number primary key, text varchar2(80));
Nothing special about that. Let's populate it with two example documents:
insert into docs values (1, 'first document');
insert into docs values (2, 'second document');
commit;
Now comes the fun part -- we build a Text index:
create index doc_index on docs(text)
indextype is ctxsys.context;
This is, for the most part, familiar DDL -- the difference is the indextype
clause, which instructs Oracle to build a Text index instead of a regular, b-tree
index.
Once the index has been created, we can run content-based queries on our huge
database of two documents using the contains function:
select id from docs
where contains(text, 'first') > 0;
This will find all rows in docs where the text column contains the word "first",
(which is document 1, for those not following closely). The > 0 part is necessary
to make it legal Oracle SQL, which does not support boolean return values for
functions (at this time).
Integration Architecture
interMedia Text is more tightly integrated with Oracle8i.
For those of you familiar with previous versions of ConText, notice that-- in
the previous example, no ctxsrv was required. The result is a specialized "domain
index" which works like any b-tree index. It's all transparent to the user --
Oracle8i is simply extended to support text. Unlike previous versions
of Oracle ConText, there are no servers to start up, there is no query rewrite,
and index creation is done through familiar SQL rather than through a custom
PL/SQL interface.
Installation and Setup
Migrating from Previous Versions
Due to significant architectural changes, interMedia Text
is not backward-compatible with previous versions of Oracle ConText. Existing
installations and applications need to be migrated to work with Oracle8i.
There is a manual which details the steps of this process. Please remember that
some of those steps need to be completed on the existing system before
you install Oracle8i.
Installation
By default, the Oracle interMedia installation procedure installs
interMedia Text. interMedia Text can also be installed separately,
if you choose a custom install.
Important Files
| ?/bin/ctxsrv |
This may be renamed ctxsrv8 or some such. Oracle8i still supports the
ctxsrv server. However, the only valid personality mask is M. You need to
run it only when you want background DML. Batch DML (sync) does not require
a running server. We'll go into DML in more depth later. |
| ?/ctx/admin |
Data dictionary scripts. See below for details. |
| ?/ctx/admin/defaults |
Language-specific default preferences. See below, and an upcoming issue
on the preference system for more details. |
| ?/ctx/data |
This directory has data files needed by interMedia Text,
which includes the linguistic lexicon files, the Xerox stemming files, and
the Korean lexer dictionary. If you get strange internal errors in linguistics,
stemming, etc. check this directory. |
| ?/ctx/bin |
This directory is for user-defined filters, which we'll talk about later.
This directory should also contain the ctxhx program which is used by the
INSO filter. |
| ?/ctx/lib |
This directory has platform-specific shared libraries and template files
used by the INSO filter. It also contains the safe callout used by indexing
and document services -- libctxx8.so (name will change from platform to
platform -- look for "ctxx") |
Data Dictionary Installation
interMedia Text is integrated with the Oracle Database Creation
Assistant (DBCA) so the ctxsys data dictionary should be installed with this
tool. If a manual installation is required please follow these steps.
- Before starting, make sure that:
- The interMedia Text software tree is installed
- The database does not have a ctxsys user
- Current directory is ?/ctx/admin
- You can sqlplus internal
- Create the ctxsys user. Pass it the ctxsys password, default tablespace,
and temporary tablespace as arguments.
sqlplus internal @dr0csys <password> <def_tblspc> <tmp_tblspc>
- Install the data dictionary:
sqlplus ctxsys/<password> @dr0inst <ctxx_library>
The argument is the full path to the ctxx library, for example:
sqlplus ctxsys/<password> @dr0inst /some_directory/app/oracle/product/8.1.5/ctx/lib/libctxx8.so
On Unix, you must not use the environment symbol $ORACLE_HOME.
You have to use the actual physical path.
- Install appropriate language-specific default preferences. There are forty-odd
scripts in ?/ctx/admin/defaults which create language- specific default preferences.
They are named in the form drdefXX.sql, where XX is the language code (from
the Server Reference Manual).
To install the US defaults, for instance:
sqlplus ctxsys/<password> @defaults/drdefus.sql
We'll talk more about this in the upcoming preference system issue.
After these steps, interMedia Text should be installed
and working.
Post-Installation Setup
If this database was an existing ConText site, make sure to remove text_enable
from the init.ora. It is no longer used in Oracle8i, and will actually prevent
Oracle8i from operating properly -- you'll get errors like "cannot find package
DR_REWRITE".
Finally, ensure that the Net8 listener is running and is configured to invoke
external procedures. A brief description of the process is below, with complete
details are in the Oracle8i Server Administrator's Guide.
- Add an entry to the tnsnames.ora:
extproc_connection_data =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = ipc)
(KEY = DBSID))
(CONNECT_DATA = (SID = ep_agt1)))
DBSID is the database SID. ep_agt1 can be named anything. extproc_connection_data
should not be changed.
- Add the following to the listener SID_LIST:
(SID_DESC = (SID_NAME = ep_agt1)
(ORACLE_HOME = /oracle)
(ENVS = LD_LIBRARY_PATH=/oracle/ctx/lib)
(PROGRAM = extproc))
ep_agt1 matches the CONNECT_DATA SID for extproc_connection_data in the tnsnames.ora.
The PROGRAM section tells the Net8 listener to start the external procedure
process.
The ENVS section, which is shown here for UNIX, will ensure that the environment
includes ?/ctx/lib in LD_LIBRARY_PATH. This is needed so that indexing can
use the INSO filters.
On NT, you may need to have ORACLE_HOME set in this section as well.
- Since the extproc_connection_data ADDRESS section specifies ipc, make sure
that the ADDRESS_LIST of listener.ora accepts ipc connections.
A quick way to test the Net8 configuration is to do:
exec ctx_output.start_log('log')
from SQL*Plus. If you get the error:
DRG-50704: Net8 listener is not running or cannot start external procedures
then things aren't set up correctly. Some of the things to check:
- listener is not running
- listener.ora is not configured for extproc
- need to reload the listener
- tnsnames.ora is not configured for extproc
Indexing
Creating Indexes
The first step in using interMedia Text is to create a Text
index. Without a b-tree index, value queries are slower; without a Text index,
contains queries are simply not possible. As we've seen, the index is created
using the create index command:
create index INDEXNAME on TABLE(COLUMN)
indextype is ctxsys.context
Unlike previous versions of ConText, there is no separate policy creation
step. We'll talk more about this when we discuss the preference system. View
indexing is not allowed in Oracle8i, consistent with regular b-tree indexes.
Parallel index creation is also not supported in this first Oracle8i
version.
Composite indexes are not supported -- only one column is allowed in the column
list. This column must be one of the following types: CHAR, VARCHAR, VARCHAR2,
LONG, LONG RAW, BLOB, CLOB, BFILE. Date, number, and nested table columns cannot
be indexed. Object columns also cannot be indexed, but their attributes can
be, provided they are atomic datatypes.
The table must also have a primary key constraint. This is needed mainly for
identifying the documents for document services, and may be used in the future
for other purposes. Composite primary keys are supported, up to 16 columns.
The issuing user does not need the ctxapp role to create an index. If the
user has Oracle grants to create a b-tree index on the column, then they have
sufficient permission to create a Text index.
Unlike previous versions of ConText, the issuing owner, table owner, and index
owner can all be different users, just like regular b-tree indexes.
IMPORTANT: If a syntax error occurs in the create index statement, the index
is still created. This is different from regular b-tree indexes; before you
reissue the corrected statement, you must drop the failed index first.
If an error occurs during actual indexing (e.g. you run out of tablespace)
then you can pick up where you left off (after correcting the problem, of course)
using alter index:
alter index INDEXNAME rebuild parameters ('resume')
Storage clauses to the create index statement are not used. Index storage
parameters are provided using the STORAGE object, which we will discuss later.
Partitioning clauses are also not supported at this time, although you can manually
partition the index tables if you wish, again using the STORAGE object.
Once the index is created, any export will include the index definition. At
import time, imp will re-create the index by issuing the create index statement.
The Indexing Pipeline
Oracle8i detects that this is a Text index and performs the text indexing.
The index is created using a pipeline of steps - "The Indexing Pipeline".
The diagram below shows the indexing pipeline:
+-----------+ +----------+ filtered +-----------+
| | doc data | | doc text | |
| datastore |---------->| filter |----------->| sectioner |---------+
| | | | | |--+ |
+-----------+ +----------+ +-----------+ | |
^ | |
| section offsets | |
|column data +---------------------------------+ |
| | |
| V |
+==========+ +----------+ +-----------+ |
| | | | | | |
| database |<-----------| engine |<-----------| lexer |<--------+
| | index data| | tokens | | plain
+==========+ +----------+ +-----------+ text
Let's step through this pipeline, stage-by-stage. Each stage will be covered in
depth later, so they'll be described here briefly.
Start with the datastore. This stage loops over the rows of the table and
reads the data out of the column. Unlike previous versions, the rows are not
read in any particular order. The datastore passes out document data. Usually,
this will simply be the column data, but some datastores use the column data
as a pointer to the document data. The URL_DATASTORE, for instance, uses the
column data as a URL, does a GET, and passes out the returned data.
The filter takes the document data and transforms it to some kind of text
representation. This is needed when storing binary documents such as Word or
Acrobat files. Unlike previous versions, the output of the filter does not have
to be plain text -- it can be a text format such as XML or HTML.
The sectioner, new for Oracle8i, takes the output from the filter, and converts
it to plain text. There are different sectioners for different text formats,
including XML and HTML. Conversion to plain text includes detecting important
section tags, removing "invisible" information, and reformatting the text.
The lexer takes the plain text from the sectioner and splits it into discrete
tokens. We have lexers for whitespace-delimited languages, and specialized lexers
for Asian languages, where segmentation is quite a bit more complex. The basic
lexer also includes theme functionality to build unified text/theme indexes.
Finally, Oacle8i takes all the tokens from the lexer, the section offsets
from the sectioner, and a list of low-information words called the stoplist,
and builds an inverted index. An inverted index stores tokens, and the documents
in which these tokens occur. For instance, our example from issue #1:
insert into docs values (1, 'first document');
insert into docs values (2, 'second document');
would produce an inverted index like this:
DOCUMENT ---> doc 1 position 2, doc 2 position 2
FIRST ---> doc 1 position 1
SECOND ---> doc 2 position 1
Each document is assigned an identifier called a docid, which is used in the inverted
index. The document primary key isn't stored.
Logging
The time needed to create an index is a function of the amount of text being
indexed. Oracle8i can create a log file which can be monitored during
indexing. start_log starts the logging, and should be called before issuing
the create index, in the same session:
ctx_output.start_log('mylog');
This will create a mylog file in ?/ctx/log which can be used to monitor indexing
progress. The directory can be changed by ctxsys, using the system parameter LOG_DIRECTORY.
See documentation for details.
IMPORTANT: In this first release of interMedia ?/ctx/log is not created
at install time. You may get errors like "unable to open file". Simply have
the DBA create this directory.
Logging is halted when the session exits or when end_log is called:
ctx_output.end_log;
Errors During Indexing
Processing errors encountered during indexing -- filter errors, file cannot
be read, etc. -- do not bring indexing to a halt. Instead, the error is logged
and indexing continues to the next file.
You can see these logged errors in the ctx_user_index_errors view. It will
tell you the rowid of the failed document, and the error which was encountered.
You need to manually empty this table when you are finished reviewing the errors:
delete from ctx_user_index_errors;
The Preference System
Classes, Objects, Preferences
The preference system allows you to customize Text indexing. Each stage of
the indexing pipeline is really an interface into which customized objects can
be plugged in. These stages are called "classes", and they can be found in the
ctx_classes view:
| DATASTORE |
Reads column data and returns document data |
| FILTER |
Reads document data and returns formatted document text |
| SECTION_GROUP |
Reads formatted document text and returns section offsets and plain document
text |
| LEXER |
Reads plain document text and breaks it into tokens |
| WORDLIST |
Contains query expansion defaults |
| STOPLIST |
Contains a list of low-information words |
| STORAGE |
Index table storage parameters |
Each class has one or more "objects" which are like behavioral templates.
The DATASTORE class, for instance, has the following objects:
| DIRECT_DATASTORE |
Passes column data directly as document data |
| DETAIL_DATASTORE |
Concatenates multiple rows of a detail table to construct document data
|
| FILE_DATASTORE |
Interprets column data as a filename. Reads file and passes file contents
as document data |
| URL_DATASTORE |
Interprets column data as a URL. Performs a GET and passes return as
document data |
| USER_DATASTORE |
Invokes a stored procedure to synthesize document data |
Objects can be found in the ctx_objects view.
But objects are only templates, and cannot be used directly in an index. Instead,
you create a "preference" from one of these object templates, customize it by
setting "attributes", then use the preferences to create the index. Let's take
a closer look:
Creating Preferences
You create a preference using ctx_ddl.create_preference, specifying an object
to use as template:
ctx_ddl.create_preference('mypref', 'FILE_DATASTORE');
This creates the preference mypref, using the FILE_DATASTORE object. If you are
familiar with previous versions, the object names have been changed, and there
is no comment argument.
This is done in PL/SQL, so surround it with "begin" and "end", or use "exec"
if you are in SQL*Plus. Also, in order to create preferences, the issuing owner
must have the ctxapp role.
Once the preference is created, we can customize it using set_attribute: For
our example, let's say all our files were in /docs. We can simply set the PATH
attribute to communicate this to interMedia:
ctx_ddl.set_attribute('mypref', 'PATH', '/docs');
If you are familiar with previous versions, set_attribute now comes *after*
create_preference, which allows us to do attribute-level validation, and necessitates
the first argument, which is the preference name.
This attribute is a text attribute, so it is set to a text value. There are
also number attributes and boolean attributes, which should be set to 'TRUE'
or 'FALSE' (or 'YES'/'NO'). The attributes for each object and their types can
be found in view ctx_object_attributes or in the documentation.
Some attributes have a list of allowed values. The allowed values can be found
in the view ctx_object_attribute_lov. Unlike previous versions, LOV attributes
can be set using a descriptive "tag" rather than a numeric value.
Non-preference Classes
Preferences are used for all classes except the SECTION_GROUP and STOPLIST
classes. These classes use specialized objects and have their own API. We'll
discuss this in the respective upcoming issues.
Using Preferences
Once you've created your preferences, you can build all kinds of customized
indexes by mixing and matching them. Because all objects of a class adhere to
a common interface, any filter can be hooked up to any datastore, etc. Got a
bunch of Korean Word 97 files? No problem -- hook up a FILE_DATASTORE preference,
an INSO_FILTER preference, and a KOREAN_LEXER preference.
You plug in your preferences using the parameters clause of create index:
create index doc_index on docs(text)
indextype is ctxsys.context
parameters ('datastore mypref')
This tells create index to use mypref as the datastore. Each of the classes has
a parameters keyword to name a preference. The keyword is the same as the class
name, except for SECTION_GROUP, whose keyword is the phrase "section group". You
can specify multiple classes by simply tacking the keyword-preference pairs on
the end of the string:
parameters('datastore mypref filter myfilter')
Any user can use any preference. To specify a preference in another user's
schema, simply add the owner name in the parameters string:
parameters('datastore kenny.mypref')
Unlike previous versions, the index makes a copy of the preference settings,
which means that after use in an index, the preference can be modified or dropped
without affecting the index. Note that only the preference owner can modify
or drop it.
The Default System
In the example above, only the datastore class is set. Since preferences were
not specified for the other classes in the parameters string, they get their
values from the default system.
The default system uses system parameters. System parameters are name- value
pairs which apply to the interMedia Text installation as a whole
-- kind of our version of init.ora. They can be seen in the view ctx_parameters.
The values for the parameters used by the default system are preference names.
For instance, in our example we have not specified a lexer preference. interMedia
gets the value of the system parameter DEFAULT_LEXER, and uses the preference
named there. This allows the administrator to set the default behavior for the
system.
The set of parameters used by the default system are:
| DEFAULT_DATASTORE |
The default datastore preference. At install time (the default default)
it is set to CTXSYS.DEFAULT_DATASTORE, a preference which uses the DIRECT_DATASTORE
object. |
| DEFAULT_FILTER_BINARY |
This is the filter preference to use when indexing binary datatypes such
as RAW, LONG RAW, or BLOB. At install time it is set to CTXSYS.INSO_FILTER,
which uses the INSO_FILTER object. |
| DEFAULT_FILTER_FILE |
This is the filter preference to use when indexing files, either BFILE
or the FILE_DATASTORE. At install time it is set to CTXSYS.INSO_FILTER.
|
| DEFAULT_FILTER_TEXT |
This is the filter preference to use when indexing text datatypes such
as CHAR, VARCHAR2, or CLOB. At install time it is set to CTXSYS.NULL_FILTER,
which does no filtering. |
| DEFAULT_SECTION_HTML |
This is the section group to use when the filter class uses the INSO filter,
or the datastore uses the URL_DATASTORE. The INSO filter filters binary
files to HTML (see filters). At install time it is set to CTXSYS.HTML_SECTION_GROUP,
which merely converts the HTML to plain text. |
| DEFAULT_SECTION_TEXT |
This is the section group to use in all other cases. At install time it
is set to CTXSYS.NULL_SECTION_GROUP, which does no sectioning. |
| DEFAULT_LEXER |
This is the default lexer preference. At install time it is set to CTXSYS.DEFAULT_LEXER.
The definition of that preference is language-specific. See below for more
details. |
| DEFAULT_WORDLIST |
This is the default wordlist preference. At install time it is set to
CTXSYS.DEFAULT_WORDLIST. The definition of that preference is language-specific.
See below for more details. |
| DEFAULT_STOPLIST |
This is the default stoplist. At install time it is set to CTXSYS.DEFAULT_STOPLIST.
The contents of that stoplist are language-specific. See below for more
details. |
| DEFAULT_STORAGE |
This is the default storage preference. At install time it is set to CTXSYS.DEFAULT_STORAGE,
which has no settings. |
Language-Specific Defaults
The LEXER, WORDLIST, and STOPLIST classes are the most affected by the language
of the documents. Every installation gets DEFAULT_LEXER, DEFAULT_WORDLIST, and
DEFAULT_STOPLIST preferences, but the definition of these preferences will depend
on the language chosen at install time.
?/ctx/admin/defaults holds language-specific scripts for
each supported language. Based on the language chosen, DBCA runs the matching
language-specific script, which creates the default preferences most appropriate
for that language.
The result is that a German instance gets a DEFAULT_LEXER which is case- sensitive
and does German decompounding, a DEFAULT_STOPLIST which has common German words,
and a DEFAULT_WORDLIST with German stemming. A Japanese instance gets different
defaults, more tailored to Japanese.
Browsing the Data Dictionary
Here is a list of views useful for browsing the interMedia Text
data dictionary: ctx_classes Available classes ctx_objects Available objects
for each class ctx_object_attributes Attribute names and types for each object
ctx_object_attribute_lov Allowed values for each LOV object attribute ctx_preferences
Preferences ctx_preference_values Attribute values of each preference ctx_user_index_objects
Objects used by the user's indexes ctx_user_index_values Attribute values for
the user's indexes
Datastores
The DATASTORE class
The datastore class is responsible for reading the column data from the database,
getting the document data based on the column data, and passing that up to the
filter.
It is important to keep in mind that some form of the document data must be
in a table. Datastore objects which access external resources such as files
and web pages still require a table of pointers to those resources.
Datastore preferences are plugged into indexes using the datastore keyword:
create index
...
parameters ('datastore mydstore');
The datastore class has five objects.
DIRECT_DATASTORE
The direct datastore is the simplest case -- it assumes that document data
is stored in the indexed column. It is so simple it has no attributes to customize.
There should be no reason to create a preference based on the DIRECT_DATASTORE
object -- CTXSYS.DEFAULT_DATASTORE can be used for any index.
IMPORTANT: If you are indexing BFILEs, make sure the user ctxsys has READ
permission on the BFILE directory.
FILE_DATASTORE
The file datastore reads the column data as a filename. It opens the file
and returns the contents as the document data. The indexed column cannot be
LONG or LOB types. The files must be accessible from the database host machine.
This object has one attribute: PATH. PATH is optional, and specifies the directory
where the documents are stored.
IMPORTANT: If PATH is specified, the column data must be simple file names;
if not specified, the column data must be full path names. You cannot mix and
match -- for instance, with PATH of /A, and a column of B/C.TXT it will NOT
find /A/B/C.TXT.
On some platforms, PATH can specify multiple directories. If you do this,
make sure that your documents have unique names -- if a document C.TXT is in
directories /A and /B, and /A is in the PATH before /B, only /A/C.TXT will be
indexed.
There is not much of a difference between this object and BFILEs. It's a matter
of choice. Like BFILE's, when the file changes, the row will not be automatically
reindexed. You should update the column value to itself to trigger a reindex.
URL_DATASTORE
The URL datastore reads the column data as a URL. It performs a GET and returns
the contents as the document data. The URLs for the documents must in the table
-- the URL datastore does not crawl. The indexed column cannot be LONG or LOB
types.
http, ftp, and file protocols are supported. The URL datastore also has multi-threaded
read (on most platforms) to maximize throughput. It is very customizable, and
has several attributes:
| TIMEOUT |
Read request timeout in seconds |
| MAXTHREADS |
Maximum number of threads to use for multithreaded read |
| URLSIZE |
Maximum length of an URL |
| MAXURLS |
Maximum number of URL's to read at one time |
| MAXDOCSIZE |
Maximum length of document |
| HTTP_PROXY |
HTTP proxy |
| FTP_PROXY |
FTP proxy |
| NO_PROXY |
Domain for proxy exclusion |
Like the file datastore, if the web page changes, then you should manually
update the column value to itself to trigger the reindex.
DETAIL_DATASTORE
Many legacy applications store text line-by-line, in multiple rows of a detail
table. The detail datastore constructs documents by concatenating the text of
these lines into a single, coherent document.
The detail datastore must be told the specifics of the detail table:
| DETAIL_TABLE |
Name of the detail table |
| DETAIL_KEY |
The foreign key column(s) in the detail table |
| DETAIL_LINENO |
The sequence column in the detail table |
| DETAIL_TEXT |
The text column in the detail table |
For instance, let's say the master and detail tables look like this:
the_master my_detail
ID TITLE M_ID SEQ LINETEXT
1 Grate Expectations 1 1 It was the best of times
1 2 it was the worst of times.
"That's G-R-A-T-E Expectations, also by Edmund Wells."
The attributes of the preference should be set like this:
| DETAIL_TABLE |
my_detail |
| DETAIL_KEY |
m_id |
| DETAIL_LINENO |
seq |
| DETAIL_TEXT |
linetext |
There is one final attribute: BINARY. By default, the detail datastore treats
each row as a separate line, and, when concatenating the contents, will automatically
stick a newline at the end of the text for each row. BINARY set to TRUE suppresses
this. In our example, we should set BINARY to FALSE or simply leave it unset.
The index itself should to be built on the master table:
create index myidx on master(somecolumn)
indextype is ctxsys.context
parameters ('datastore my_detail')
The master table must have a primary key column, just like any other table indexed
by interMedia Text. This primary key column is used to find the
corresponding detail rows, where detail.fk = master.pk.
The indexed column can be any column of allowable type -- the contents are
not used by interMedia. If you can afford the space, you can add a dummy
char(1) column called "text" or "body", to make the queries more readable:
select author, title from master
where contains(text, 'best & worst') > 0;
If detail rows are added, removed, or changed without changing the master
text column, you should manually update the text column to itself to trigger
a reindex.
For those of you familiar with previous versions of ConText, this is similar
to the master detail new datastore. The old master detail datastore is no longer
supported
USER_DATASTORE
The user datastore is new for Oracle8i. You write a stored procedure which,
given a rowid, synthesizes a document. interMedia calls this procedure
whenever document data is required. Let's look at an example. Say we have a
table like this:
articles
id number
author varchar2(80)
title varchar2(120)
text clob
and we want to automatically have author and title be part of the indexed document
text. We can write a stored procedure following the user datastore interface:
create procedure myproc(rid in rowid, tlob in out NOCOPY clob) is
offset number := 1;
begin
for c1 in (select author, title, text from articles
where rowid = rid)
loop
append_varchar_to_lob(tlob, c1.title, offset);
append_varchar_to_lob(tlob, 'by '||c1.author, offset);
dbms_lob.append(tlob, c1.text);
end loop;
end;
This procedure takes in a rowid and a temporary clob locator, and concatenates
all the articles columns into the temp clob, This assumes a helper procedure append_varchar_to_lob
has been written.
Note the use of the NOCOPY in the procedure signature. This is very
important to achieve best performance and to reduce TEMP space usage.
To ensure that the DBA has signed off on the code, only ctxsys-owned stored
procedures are allowed for the user datastore. Thus, we need to do something
like this as ctxsys:
create procedure s_myproc(rid in rowid, tlob in out NOCOPY clob) is
begin
appowner.myproc(rid, tlob);
end;
And, we need to make sure that the index owner can execute the stub procedure,
so:
grant execute on s_myproc to appowner
Now, back as the app owner, we create the preference, setting the PROCEDURE
attribute to the name of the ctxsys stub procedure:
ctx_ddl.create_preference('myud', 'user_datastore');
ctx_ddl.set_attribute('myud', 'procedure', 's_myproc');
When we then create the index on articles(text) using this preference, interMedia
indexing will see author and title in the document text.
The user datastore can be used for any on-the-fly document synthesis, including
more complex master-detail relationships, nested tables, text preprocessing,
or multicolumn concatenation, like the example.
There are four constraints on the procedures used in user datastores:
- They must be owned by ctxsys
- They must be executable by the index owner
- They cannot issue DDL or transaction control statements like "commit"
- They cannot be safe callouts or call safe callouts
If you change the stored procedure, indexes based upon it will not be notified,
so you should manually recreate such indexes. interMedia cannot tell what
you are doing in the stored procedure, so if the stored procedure makes use of
other columns, and those column values change, the row will not be reindexed.
The row is only reindexed when the indexed column changes.
When using the USER_DATASTORE, you may run into OCI_INVALID_HANDLE
errors. This is bug 881904, and is fixed in patch release 8.1.5.1.
Filters
The FILTER Class
The filter class takes the document data provided by the datastore class,
and filters it to readable text, passing it to the sectioner. With Oracle8i,
this does not have to be plain text -- it can be a text format such as HTML.
Filter preferences are plugged into indexes using the filter keyword:
create index
...
parameters ('filter myfilter');
The filter class has three objects.
NULL_FILTER
The null filter is used when the document contents are not in a binary format.
It simply passes text from the datastore to the sectioner. The default CTXSYS.NULL_FILTER
preference can be used in any index to employ the null filter.
CHARSET_FILTER
The charset filter is new for Oracle8i. It converts documents from a foreign
character set to the database character set. This is most useful for Japanese
customers, who have to deal with two widely-used and incompatible character
sets (and one infrequently-used incompatible character set).
The charset filter has one attribute, CHARSET, which takes the NLS name of
the source character set. A list of NLS charsets can be found in the Oracle
National Language Support Guide. The destination character set is always the
database character set, so does not need to be specified.
Additionally, you can specify JAAUTO, which is a custom setting for Japanese
character set auto-detection. Oracle8i will automatically detect Shift-JIS,
JIS7, or EUC for each document and convert it to the database charset if needed.
USER_FILTER
The user filter is a method for plugging in your own filters. You write a
filtering program, place it in ?/ctx/bin, and the indexing engine will invoke
it to filter each document. Here's an example -- we'll create an uppercase filter
which will uppercase every word.
We start by creating a program to do the filtering -- in this case, we'll
write a perl script:
#!/usr/local/bin/perl
open(IN, $ARGV[0]);
open(OUT, ">".$ARGV[1]);
while ()
{
tr/a-z/A-Z/;
print OUT;
}
close(IN);
close(OUT);
This perl script converts a file to uppercase.
User filter programs like this take two arguments. The first argument is the
name of the input file. The second argument is the name of the output file.
Our filter reads the contents of the input file, filters it, and writes the
output to the output file.
the program (called upcase.pl) is placed in ?/ctx/bin. Ensure that it's executable
by the oracle operating-system user. Now the preference is created:
ctx_ddl.create_preference('mypref', 'USER_FILTER');
ctx_ddl.set_attribute('mypref', 'EXECUTABLE', 'upcase.pl');
When this preference is used in an index, the indexing engine will invoke the
user filter program for each document when the text is required.
INSO_FILTER
The INSO filter automatically recognizes and filters over a hundred different
formats, including Word 97 and Acrobat. The full list can be found in the documentation.
The INSO filter does not have any attributes at this time, so the CTXSYS.INSO_FILTER
preference can be used in any index needing filtering.
IMPORTANT: This filter outputs HTML, not plain text. Make sure you employ
the HTML section group or all these tags will be indexed. The default system
will employ the HTML section group when the INSO filter is detected.
The INSO filter uses code from Inso, "the world leader in filtering and viewer
technology," and Adobe. This code is not integrated code directly, it instead
follows the user filter architecture. The user filter executable for the INSO
filter is ctxhx.
ctxhx requires shared libraries and other files (.flt and .tpt files) located
in ?/ctx/lib. The installer should copy the correct platform- specific files
to this directory. Additionally, ensure that the external procedure agent has
?/ctx/lib in the LD_LIBARY_PATH environment variable (PATH on NT). This can
be done using the ENVS section in the listener.ora. On NT you may need to have
ORACLE_HOME set in the ENVS section, as well.See Installation section above
for details.
If you are encountering problems using this filter, it can be run by hand.
First get a formatted binary file. Let's call it testfile.doc. You do:
ctxhx testfile.doc testfile.txt
This should create the file testfile.txt with the HTML representation. The error
messages you get, if any, should help you determine what's wrong.
The INSO filter is supported only on Solaris, HP-UX, AIX, and NT. Other platforms
are not able to use this filter at this time.
Where's the HTML filter ?
Some of you familiar with previous versions of ConText may be wondering about
the HTML filter. In 8i, this code has been moved to the new sectioner class,
which allows any filter to spit out HTML if it so desires.
Section Groups
The SECTION_GROUP Class
The section group class is new for Oracle8i, and incorporates functionality
that was part of the wordlist and filter classes in ConText. It takes a text
format, such as XML or HMTL, as input. It is unique in having two outputs --
the section boundaries, which go to the engine, and the plaintext, which goes
to the lexer.
Creating Section Groups
Section groups are not created using create_preference -- they have their
own API in ctx_ddl:
ctx_ddl.create_section_group('mygroup','html_section_group');
The first argument is the name of the new section group. The second argument is
the section group type. This specifies the input text format, and tells the sectioner
the rules for detecting section boundaries and transforming to plaintext. Each
section group type is covered briefly below.
An empty section group can transform formatted test to plaintext, but will
not index any section boundaries. You must first tell it which sections to index
by adding sections to the group.
Sections have three important attributes: TAG, NAME, and TYPE. TAG tells the
section group how to recognize the section. Because the section group already
knows the format, you do not need to specify start and end tags; a section in
an HTML section group with TAG of B knows that the section starts with <b> and
ends with </b>, for instance. Tags are unique across the sections of a section
group.
NAME is how you refer to the section in queries. You may want to name the
B tag BOLD, for readability. Multiple tags can be mapped to the same name --
they are then treated as instances of the same section. H1, H2, H3 can all be
treated as instances of the HEADLINE section. We recommend avoiding non- alphanumeric
characters -- such as underscore -- in section names. Using these characters
will force you to escape the names in queries.
TYPE is totally new for 8i. There are three different types of sections: ZONE,
SPECIAL, and FIELD.
Section Types
ZONE sections are like sections from previous versions. interMedia records
where in the document the section start and end tags occur. WITHIN queries check
that the hit words occur between the start and end word offset.
If a ZONE section repeats, each instance is treated separately in query semantics.
See Examples, below. ZONE sections can enclose other sections, including themselves,
and can be enclosed by other sections. ZONE sections are added like this:
ctx_ddl.add_zone_section('groupname', 'sectionname', 'tag');
SPECIAL sections are so named because they are not recognized by tags. There are
two special sections in Oracle8i -- SENTENCE and PARAGRAPH -- both recognized
by punctuation in the lexer. They index the start and end of sentences and paragraphs,
respectively. You add special sections like this:
ctx_ddl.add_special_section('groupname', 'sectionname');
No tag argument is needed. There are only two allowed values for section name:
SENTENCE, and PARAGRAPH.
ZONE and SPECIAL sections index section start and end word offsets, but do
nothing to the words in the section. FIELD sections, on the other hand, extract
their contents and index them separately from other words in the document. WITHIN
queries are run on this separate, smaller index. This makes field section query
faster than zone section query -- up to three times as fast in tests we've conducted
-- especially when the section tags occur in every document.
This speed comes at the cost of flexibility. FIELD sections are meant for
non-repeating, non-overlapping sections. If a field section repeats, it is treated
as a continuation of the section, not a distinct instance. If a field section
is overlapped by itself or by another field section, it is implicitly closed
at the point where the other section starts. Also, there is a maximum of 64
field sections in any section group. This is 64 section names, not tags. Remember
that you can map multiple tags to the same section name. You add field sections
like this:
ctx_ddl.add_field_section('groupname', 'sectionname', 'tag');
To illustrate, let's work through a couple of examples to illustrate the different
types of sections and the impact on query semantics. We'll use the following document
as an example:
<A>rat</A><A>ox</A>
<B>tiger rabbit</B>
<C>dragon<C>snake</C></C>
This is an XML-style markup, but without a DTD, so we will use the basic section
group type:
ctx_ddl.create_section_group('mygroup','basic_section_group');
Let's start with ZONE sections:
ctx_ddl.add_zone_section('mygroup', 'asec', 'a');
ctx_ddl.add_zone_section('mygroup', 'bsec', 'b');
ctx_ddl.add_zone_section('mygroup', 'csec', 'c');
This tells the section group to recognize A tags as the section ASEC, B tags as
the section BSEC, etc. To do section queries, we use the WITHIN operator. Remember
to use the section name and not the tag:
contains(text, 'rat within asec') > 0
This finds the document.
contains(text, 'tiger within asec') > 0
This does not find the document. Although it has "tiger", it does not occur in
the ASEC. If instead of the original setup we had mapped A and B to the same section
name:
ctx_ddl.add_section('mygroup', 'asec', 'a');
ctx_ddl.add_section('mygroup', 'asec', 'b');
Then both:
contains(text, 'rat within asec') > 0
contains(text, 'tiger within asec') > 0
would find the document, because both A and B are treated as ASEC.
An important facet of ZONE section query semantics is that each instance is
treated distinctly. That is, a query like this:
contains(text, '(tiger and rabbit) within bsec') > 0
finds the document, but a query like this:
contains(text, '(rat and ox) within asec') > 0
does not find the document. Although the document has "rat" and has "ox", and
"rat" is in ASEC, and "ox" is in ASEC, "rat" and "ox" are not within the SAME
ASEC. Note that
contains(text, '(dragon and snake) within csec') > 0
finds the document, since they are both in the outer CSEC, even though the inner
CSEC contains only "snake".
Special sections follow the same semantics, so a query like this:
contains(text, '(jumbo and shrimp) within sentence') > 0
is looking for documents with "jumbo" and "shrimp" in the SAME sentence.
Now let's look at field sections. What if in setup we made ASEC a field section
instead of a zone section:
ctx_ddl.add_field_section('mygroup', 'asec', 'a');
Unlike zone sections, each instance of a field section is considered a continuation
of the section. The query
contains(text, '(rat and ox) within asec') > 0
finds the document, although it didn't when ASEC was a zone section. The field
section simply stitches the two instances together. Field sections work best on
sections like BSEC, which does not repeat like ASEC nor nest like CSEC.
One last subtlety with field sections. If ASEC is a zone section, then
contains(text, 'rat') > 0
finds the document -- it contains the word rat. However, remember that field sections
work by extracting the document contents and indexing them separately. Thus, if
ASEC were a field section, this query would not find the document. "rat" is in
ASEC, which is separate from the document contents.
You can, however, change this by making the field section visible. There is
an optional boolean fourth argument to add_field_section. If this boolean argument
is set to true:
ctx_ddl.add_field_section('mygroup', 'asec', 'a', TRUE);
then the field section is visible, and the section contents are visible to non-within
queries, just like zone sections. This is accomplished by double- indexing the
word -- once as part of the extracted section, and once as part of the body, so
this option has index space cost.
Using Section Groups
Section groups are plugged into indexes using the section group keyword:
create index
...
parameters ('section group mygroup');
Unlike previous versions, you do not need to set an attribute in the wordlist
object. You also no longer need to set STARTJOINS and ENDJOINS in the lexer. However,
the WHITESPACE, NEWLINE, and PUNCTUATIONS settings in the basic lexer affect sentence
and paragraph boundary recognition, which then impact special sections. There
are five section group types.
NULL_SECTION_GROUP
The null section group is used when there is no section information to be
extracted -- it simply passes the text through to the lexer. Only special sections
can be added to the null section group.
BASIC_SECTION_GROUP
The basic section group is for simple XML-style markup without going full-
bore on the DTD stuff. It can be used in custom situations where full XML is
unnecessary. Tags are in the form .... Entities and tag attributes
are not supported. The only processing it does for plaintext is removal of the
markup tags.
HTML_SECTION_GROUP
The HTML section group is for HTML. Good name, huh? It knows the rules for
HTML, including ISO Latin-1 entity translation, HTML to plaintext conversion,
and
omission. It knows HTML 4.0 tags, and has code to deal with unknown tags.
SCRIPT and STYLE contents, and comments are removed. Contents of the TITLE
section are preserved. META tag information indexing is a known customer requirement
for a future version.
For those familiar with previous versions, there is no need for KEEP_TAG anymore
-- this is automatic when you add the tag to the section group.
XML_SECTION_GROUP
The XML section group is for XML. It does no processing to plaintext except
for entity translation and tag removal. It can handle non-system, non-parameter
entities in the internal DTD. It cannot access external DTD's in this version.
NEWS_SECTION_GROUP
The news section group handles newsgroup-style postings -- RFC-1036 in particular.
This format is used in newsgroups and is similar to the one used for e-mail.
Note that while RFC-1036 defines the set of allowable header fields, Oracle
does not require this -- any header fields can be used.
Messages in this format consist of a header section, a blank line, and a body
section. The news section group recognizes <beginning of line>TAG: as the start
of sections, and <end of line> as the end of the section. It stops looking for
this pattern when the blank line is encountered.
Translation to plaintext consists of removing all header lines which are not
important to section searching. A message like this:
From: me
To: you
X-ref: 5!oowj
hello! How are you!
with only the from and to tags in the section group, will look like this to the
indexing engine:
<from section start>me<from section end>
<to section start>you<to section end>
hello! How are you!
Lexers
The LEXER Class
The lexer class gets plaintext from the sectioner and splits it into words.
Lexer preferences are plugged into indexes using the lexer keyword:
create index
...
parameters ('lexer mylex');
The lexer class has five objects.
BASIC_LEXER
The basic lexer can be used for most European languages. It is programmed
with default rules for splitting whitespace-delimited text into words. You can
modify its parsing behavior to some extent with the lexical attributes.
Ordinarily, non-alphanumeric characters end a word. SQL*Plus is lexed as "SQL"
and "Plus". The JOINS attributes allow you to modify this behavior by declaring
sets of non-alphabetic characters to treat as valid word letters.
PRINTJOINS are non-alphanumeric characters which are treated as valid characters,
and become part of the word. Declaring PRINTJOINS of "*" lets SQL*Plus be lexed
as "SQL*Plus".
SKIPJOINS are like PRINTJOINS, but they do not become part of the word. Declaring
SKIPJOINS of "*" lets SQL*Plus be lexed as "SQLPlus".
STARTJOINS are like PRINTJOINS, but they only come at the start of the word.
If they occur in the middle of a word, a new word is started. Having "*" as
STARTJOINS lets SQL*Plus be lexed as "SQL" and "*Plus". Multiple STARTJOINS
are allowed at the start of a word.
ENDJOINS are like STARTJOINS, but for the end of the word. They implicitly
end a word, too. So, "*" as ENDJOINS lexes SQL*Plus as "SQL*" and "Plus". Multiple
ENDJOINS are allowed at the end of a word. STARTJOINS and ENDJOINS used to be
important for section searching in previous versions, but with the new sectioner
class, they are no longer needed for that purpose.
Each of these four JOINS attributes is a set of characters. Setting PRINTJOINS
to "_%*" means that "_", "%", and "*" are all PRINTJOINS.
NUMJOIN is a single character which is the numeric join character. NUMGROUP
is the single numeric group character. They are defaulted based on your NLS_LANG
setting. For US, NUMJOIN is "." and NUMGROUP is ",".
Finally, CONTINUATION are those characters which indicate line continuation.
These and the following newline are removed from the text. Default value is
hyphen and backslash.
Then there are three attributes important for sentence/paragraph indexing.
PUNCTUATIONS are the set of punctuation marks (?!. by default). WHITESPACE is
a set of additional whitespace characters other than space and tab, which you
get for free and cannot change. NEWLINE can be set to NEWLINE or CARRIAGE_RETURN.
This controls whether lines are ended by \n or \r\n.
A sentence is then recognized as a PUNCTUATION followed by one or more WHITESPACE.
A paragraph is a PUNCTUATION followed by a NEWLINE. There are other patterns
which are variations on this, but that's the basic idea.
The third set of basic lexer attributes controls term normalization -- the
process of converting words to a standard form. BASE_LETTER is a boolean attribute
which, if set to YES, will convert accented characters to their unaccented forms.
By default, each word is uppercased during indexing so that queries are case-independent.
MIXED_CASE is a boolean, which, if set to YES, does not do this. This makes
indexing and queries case-sensitive.
ALTERNATE_SPELLING is an attribute which, if set, uses letter transformation
rules for common alternate spellings. It can be set to GERMAN, DANISH, or SWEDISH.
In GERMAN mode, for instance, A-umlaut gets transformed to AE. Since the lexer
is used at both indexing and query time, query for a word containing A-umlaut
will find the word spelled with A-umlaut or AE.
Finally, COMPOSITE controls word decompounding. In German, for instance, multiple
words are often combined into a single string. This makes lexing difficult,
because the words are no longer whitespace-delimited. If COMPOSITE is set, then
the lexer breaks up these compound words, which allows interMedia to
index and query as usual. COMPOSITE can be set to GERMAN or DUTCH. Unset or
DEFAULT means no decompounding.
The last set of basic lexer attributes control text and theme indexing. New
for 8i, the basic lexer can generate and index themes for each document, in
addition to splitting it up into words. This merges the functionality found
in the separate THEME LEXER and BASIC LEXER of previous versions.
INDEX_TEXT is a boolean which controls word indexing. INDEX_THEMES controls
theme indexing. Setting INDEX_THEMES to NO makes it a text-only index. Setting
INDEX_TEXT to NO makes it a theme-only index. They cannot, of course, be both
NO. Indexing themes takes longer and uses a bit more index space, but improves
the efficacy of ABOUT queries. We'll talk more about this when we discuss ABOUT
query.
The CTXSYS.DEFAULT_LEXER preference has theme indexing ON for the English
language-specific scripts.
JAPANESE_VGRAM_LEXER
Japanese is not a whitespace-delimited language (except in textbooks) so it
is very difficult for computers to pick out individual words. The Japanese V-gram
lexer gets around this problem by indexing overlapping clusters of characters.
The Japanese word ABC might be decomposed into AB and BC, for instance. Query
terms are also decomposed this way. Thus, the contains query is not looking
for words, but for patterns of characters.
The Japanese lexer does not have any attributes in this version.
CHINESE_VGRAM_LEXER
Chinese is also not whitespace-delimited, so a similar solution has been applied
for Chinese, as well. The Chinese v-gram lexer also does not have any attributes.
CHINESE_LEXER
New for Oracle8i is a segmenting Chinese lexer, which can actually
recognize some Chinese words as whole entities. The rest of the text is still
broken into v-grams, but this should be more space-efficient and have faster
query than the pure v-gram method. The Chinese segmentation lexer does not have
any attribtues.
KOREAN_LEXER
Korean is whitespace-delimited, but has problems with verbs, which can have
thousands of different forms. Our Korean lexer is a lexicon-driven engine (using
a third-party 3-soft lexicon) which simply eliminates verbs from indexing. New
for Oracle8i is the ability to eliminate adverbs and adjectives, do various
form conversions and perform morphological and segmentation decomposition.
Other Indexing Classes
The STOPLIST Class
Not every word in a document is worth indexing. Linguistic lubricant like
prepositions and conjunctions are important for language understanding, but
are not very useful for information retrieval -- they are very common and so
convey very little information about the document by their presence. Most of
the time, it's not worth the space to index these words.
The stoplist class holds a list of these words, called stop words. During
indexing, the engine consults the stoplist and filters out the stop words. The
stoplist class does not use the create_preference API. Instead, it has its own
API:
ctx_ddl.create_stoplist('mylist');
Words are added using add_stopword:
ctx_ddl.add_stopword('mylist', 'the');
Here the article "THE" has been added to the stoplist. You can see the words in
a stoplist using the ctx_stopwords view. A stoplist can have a maximum of 4095
stopwords. Each can be up to 64 bytes in length.
Stopwords are case-sensitive in a stoplist, but if the index is not case-
sensitive, the stopwords will not be case-sensitive. So, you could add "THE"
and "the" to a stoplist, but if the index is not case-sensitive, there would
be no difference -- you'd just be wasting a stopword slot.
Stoplists are plugged into indexes using the stoplist keyword:
create index
...
parameters ('stoplist mylist');
New for Oracle8i, stopwords can be added to the index without re-indexing:
alter index myidx rebuild parameters ('add stopword AND');
This adds "AND" as a new stopword after indexing. Remember that the index makes
its own copy of everything, so this does not affect the stoplist used to create
the index.
The default stoplist, CTXSYS.DEFAULT_STOPLIST, is language-specific, and default
stoplists are provided for most European languages.
Other Stop Objects
Stoplists can also hold two other kinds of objects: stop themes and stop classes.
Stop classes are classes of words, rather than individual words. The only stop
class available in Oracle8i is NUMBERS, which stops all numbers from
being indexed. Alphanumeric words, such as 123.456.789 and abc231 are still
indexed. This behavior depends on the lexer to recognize numbers.
Stop themes are used only during theme indexing. The entries added as stop
themes are themed, and these themes AND THEIR CHILDREN are not indexed. For
instance, say you enter "biology" as a stop theme. This will prevent themes
like "Physical Sciences:Biology:Genetics" and "Physical Sciences: Biology:Botany"
from being indexed. This is just an example, and does not actually reflect the
knowledge base.
Stop themes are used only during theme indexing, so adding "biology" as a
stop theme will not stop the word "biology" or biology-associated words from
being indexed.
The WORDLIST Class
The wordlist class is an odd class because it has no effect on indexing. Instead,
it holds fuzzy and stem expansion settings used at query time.
Wordlist preferences are plugged into indexes using the wordlist keyword:
create index
...
parameters ('wordlist mylist');
There is only one wordlist object: BASIC_WORDLIST, with the following attributes:
| STEMMER |
This attribute is set to the default stemming type. Stemming is an expansion
of a word to different forms. The stem expansion of GO might include GO,
GOING, WENT, and GONE. The rules for this vary from language to language.
(See expansion query operators).
STEMMER can be set to one of the following: ENGLISH (inflectional),
(English) DERIVATIONAL, DUTCH, FRENCH, GERMAN, ITALIAN, SPANISH, and NULL,
which means no stemming. The CTXSYS.DEFAULT_WORDLIST setting is language-specific.
|
| FUZZY_MATCH |
This attribute is set to the default type of fuzzy match. Fuzzy match
is an expansion technique designed to find words close in form, such as
mistyped or mis-OCR'ed versions.
FUZZY_MATCH can be set to one of the following: GENERIC, JAPANESE_VGRAM,
KOREAN, CHINESE_VGRAM, ENGLISH, DUTCH, FRENCH, GERMAN, ITALIAN, SPANISH,
and OCR. The CTXSYS.DEFAULT_WORDLIST setting is language-specific.
|
| FUZZY_SCORE |
This is the default score floor for fuzzy expansions. New for Oracle8i
is the ability to limit the fuzzy expansion to the best matches. The fuzzy
score is a measure of how close the expanded word is to the query word --
higher is better. Setting fuzzy score means that fuzzy expansions below
this score will not be produced. You can set this from 0 to 80. |
| FUZZY_NUMRESULTS |
This is the default maximum number of fuzzy expansions. Setting fuzzy
numresults limits the fuzzy expansion to a certain number of the best matching
words. You can set this up to 5000. |
The STORAGE Class
The storage class holds storage parameters for the index tables. The Text
index is made up of five oracle objects, so the single storage clause in the
create index statement is insufficient to specify storage. Storage preferences
are plugged into indexes using the storage keyword:
create index
...
parameters ('storage mystore');
There is only one storage object: BASIC_STORAGE, with the following attributes:
| I_TABLE_CLAUSE |
This attribute is the storage parameters for the I table, which is the
main index table. There will be at least one row per unique word in your
index -- probably multiple rows per word. Each row has a BLOB which tries
to stay inline, making the row a bit large. The value is simply tacked on
to the end of the create table statement, so you could set it like this:
ctx_ddl.set_attribute('mystore', 'I_TABLE_CLAUSE',
'tablespace foo storage (initial 1K)');
|
| K_TABLE_CLAUSE |
This attribute is the same, but for the K table, which is an IOT for
rowid to docid conversion. There will be one row per row in your base table.
Each row consists of a number and a rowid. |
| R_TABLE_CLAUSE |
This attribute is the same, but for the R table, which is a BLOB table
for docid to rowid conversion. There will be only a few rows in this table,
and will have a large BLOB in each row. |
| N_TABLE_CLAUSE |
This attribute is the same, but for the N table, which has invalid docids
waiting for garbage collection. The number of rows in this table will depend
on the frequencies of update and delete DML, and of garbage collection optimization.
Each row is short, holding two numbers. |
| I_INDEX_CLAUSE |
This attribute is the same, but for the unique index on the I table.
|
Indexing Memory
Those of you familiar with previous versions may be wondering where the index
memory setting is. For new users, interMedia builds an inverted index,
so documents are not added to the index one at a time, like a b-tree index.
Instead, information from multiple documents is buffered up and periodically
flushed to the index during indexing. In previous versions, you could set the
size of this buffer in the analogue of the storage class. Setting this buffer
higher reduces index fragmentation but uses more memory.
In Oracle8i, index memory can be set directly in the parameters string. This
allows you to set index memory on a per-operation basis. The keyword used is
MEMORY:
create index
...
parameters ('memory 2M');
Here a paltry 2 megabytes of memory is used for indexing. If not specified,
the system parameter DEFAULT_INDEX_MEMORY is consulted for the systemwide default
index memory. At install time, this is 12M.
The amount of memory cannot exceed system parameter MAX_INDEX_MEMORY. This
allows the system administrator to disallow outrageous amounts of index memory.
The MEMORY keyword is also allowed in the alter index parameter string for
batch DML (background DML uses DEFAULT_INDEX_MEMORY). This allows you to have
different memory sizes for create index and DML, something which is not possible
in previous versions. You simply specify a different memory amount in the create
index parameter string then in the alter index parameter string.
DML
DML Processing
Most text search engines have asynchronous document updates and additions --
changes to the index are usually done in batches, after and separately from
the document changes. This is due to three aspects of inverted indexing:
- Text indexing a single document is a lot of work. The values are usually
long, and at the very least you have to break it into words. Depending on
the objects, you may also have to do web page requests, filtering, and HTML
parsing.
- Inverted indexes, composed of lists of documents by word, are best updated
in batches of documents at a time. Updating the index one document at a time
leads to either word lists one document long or a lot of appending to existing
lists.
- Most text applications are fairly static, having relatively lower DML frequency,
and less need for full consistency semantics than traditional transaction
processing systems.
interMedia faces the same challenges. Here is what it does for each DML
on an indexed column:
| INSERT |
The document rowid is placed into a queue, dr$pending, for later addition
to the text index. Queries before this DML is processed will not find the
new document contents. |
| UPDATE |
The old document contents are invalidated immediately, and the document
rowid is placed into the dr$pending queue for later reindexing. Queries
before this DML is processed will not find the old contents, but neither
will it find the new contents. |
| DELETE |
The old document contents are invalidated immediately. No further DML
processing is required -- queries will no longer find the document. |
Thus, interMedia has synchronous invalidation but asynchronous addition.
This extends to transactions, as well:
delete from table where contains(text, 'delword') > 0;
select * from table where contains(text, 'delword') > 0;
--> you get no rows returned
rollback;
select * from table where contains(text, 'delword') > 0;
--> you get the rows returned
and:
insert into table values (1, 'insword');
select * from table where contains(text, 'insword') > 0
--> you do not get the row
commit;
select * from table where contains(text, 'insword') > 0
--> you still do not get the row
The synchronous invalidation is new to this version. Also new is that the DML
notification is done through integrated kernel code, and not through triggers
as in previous versions.
Processing Additions
So how do you process your inserts and updates when they are sitting in the
queue? interMedia provides two methods: sync and background.
Sync is the manual way -- you control when and how it's invoked. Each invocation
processes all pending inserts and updates for an index:
alter index myindex rebuild online parameters ('sync')
The ONLINE keyword is very important. Without this, during the sync operation
queries are blocked and DML on the base table fails. Make sure to use it for sync
operations.
Background DML requires you to start a ctxsrv background daemon in the operating
system:
ctxsrv -user ctxsys/<password>
Once started, the daemon polls the dr$pending queue for DML, and automatically
processes additions as they become available.
You can use either or both -- it's largely a matter of your application requirements.
Background DML scans for DML constantly. This leads to new additions being indexed
automatically and quickly. However, it also tends to process documents in smaller
batches, which increases index fragmentation. Sync DML is done at the user's
request, so usually the batches are larger and thus there's less index fragmentation.
A happy compromise could possibly be reached by invoking sync through dbms_sql
in a dbms_job which runs at longer intervals.
The DML Queues
The dr$pending queue holds the documents waiting to be indexed. It should be queried
through the ctx_user_pending view, which makes it more readable. Additionally,
there is a dr$waiting queue which is used when documents are waiting to be indexed
and they are updated again. The drq_inprog, drq_batches, and drq_batchno tables
from previous versions are all no longer needed.
Optimization
Index Fragmentation
The Text index is an inverted index, so essentially consists of words and
lists of documents which contain that word. When new documents are added to
this index, these lists have to be extended. Instead of appending to these lists,
more rows for the word are created. However, this can lead to many rows with
sub-optimal, short lists -- called "index fragmentation". For instance, say
we have the following in our index:
DOG DOC 1 DOC 3 DOC 5
Now a new document with "dog" comes along. Simply add it to the index as a new
row:
DOG DOC 1 DOC 3 DOC 5
DOG DOC 7
Subsequent DML will also create new rows:
DOG DOC 1 DOC 3 DOC 5
DOG DOC 7
DOG DOC 9
DOG DOC 11
This is why background DML generally produces more index fragmentation than spot
invocation of sync: processing DML less frequently in larger batches allows newly
added rows to have longer lists, which reduces the number of rows in the index
table.
Keep in mind that a certain amount of fragmentation is expected in 8i. interMedia
will try to keep index table rows below 4000 bytes long, to inline the LOB and
to speed up index row merges used during phrase search, AND, section search,
etc. However, frequent DML will probably result in much more fragmentation than
this.
Document Invalidation
During an update or delete, the old document information must be removed from
the index so that queries will no longer hit it. However, because the inverted
index consists of words and rows of occurrences, this would entail finding all
words in the old version of the document (which may not be available), and removing
an occurrence from the lists of those rows in the index table. This is too much
work to do synchronously, and deletes must be synchronous in the integrated
model.
Instead, interMedia marks the old document contents as invalid and
does not touch the index. The mark signals queries to remove the document from
any query result sets. However, this leaves the old information in the index,
taking up space in the index tables.
Optimization
To solve these potential problems in the index, you run optimization. Optimization
has two modes: FAST and FULL. FAST optimization targets fragmentation only:
alter index myindex rebuild online parameters ('optimize fast');
The ONLINE keyword is important -- without it, queries are blocked and DML fails
with an error. FAST optimization runs through the whole index table and glues
fragmented rows together, to a maximum of 4000 bytes for a single row. This reduces
the number of rows in the index table.
FULL optimization does both defragmentation and garbage collection, which
removes the old information left over after document invalidation:
alter index myindex rebuild online parameters ('optimize full');
It loops through each row of the index table, but, unlike fast optimization, cracks
open each row's list, removing old document information. It also glues separate
rows together where possible.
Because garbage collection is more involved and time-consuming than defragmentation,
FULL optimization does not have to be run on the whole table at one time. Instead,
you can run it for a limited period of time:
... parameters ('optimize full maxtime 5')
which means run for a maximum of 5 minutes. At the end of 5 minutes, it will stop.
The next time you run optimization, it will pick up where it left off. This lets
you do a little of bit optimization each night during off times, and ensure that
it is not loading the system when the off time ends.
Garbage collection is needed only after document invalidation, which happens
only for delete and updates. If your document table is append-only, meaning
that documents are inserted and never changed or removed, then it should suffice
to run only defragmentation.
Optimization Concurrency
Unlike previous versions, optimization (either mode) does not block background
DML processing. However, because Oracle prevents alter index operations from
running concurrently, sync invocations will be prevented.
Querying
The Basics
At this point all aspects of building an index and keeping it up to date have
been discussed. However, indexes are a means to enabling content- based queries.
IMPORTANT: Unlike a value query, which is slower without a b-tree index, contains
queries are completely disallowed without a Text index.
You do a query using the contains operator:
select id
from texttab
where contains(textcol, 'query') > 0
The first argument to contains is the name of the text column. The second is the
text query, which is limited to 2000 bytes in length. It returns a number, which
indicates the strength of match. 0 means not a match, so it is constrained to
> 0. The result is that this query finds the id's of all rows in texttab where
the textcol has the word "query".
Scoring
Of course, now you're wondering how to use that return value in the select
list and order by clause. An operator is provided just for that purpose:
select id, score(1)
from texttab
where contains(textcol, 'query', 1) > 0
order by score(1) desc
The "1" in the score and contains operators is called the "contains label" and
it's a number which links the score in the select list to the contains in the
where clause -- it can be any number, so long as it matches.
Score can be between 0 and 100, but the top-scoring document in the query
will not necessarily have a score of 100 -- scoring is relative, not absolute.
This means that scores are not comparable across indexes, or even across different
queries on the same index. Score for each document is computed using the standard
Salton formula:
3f(1+log(N/n))
Where f is the frequency of the search term in the document, N is the total number
of rows in the table, and n is the number of rows which contain the search term.
This is converted into an integer in the range 0 - 100.
Query operators like AND and OR operate on the scores of their operands.
Where Can I Use contains?
Unlike the transparent query rewrite of previous versions, you can include
a contains clause in any SQL statement using any tool which can issue oracle
SQL. Here is a short list of new places where contains queries can pop up:
- subqueries
- virtual tables
- PL/SQL
- View definitions
- DML (insert as select, e.g.)
There is no longer a limitation on the length of the contains SQL statement,
although there is a limit on the query term length.
Other Query Methods
For those of you familiar with previous versions, there is no two-step or
text cursors (n-memory query). If you absolutely must have result tables, use
insert as select. Instead of text cursors, use real PL/SQL cursors. To sort
by score, use an order by clause with a FIRST_ROWS hint. The hint will avoid
an oracle SORT ORDER BY.
However, interMedia still supports count_hits. count_hits is a fast
way to get the hit count for a text query, in PL/SQL:
num_hits := ctx_query.count_hits('indexname','query',TRUE);
The first argument is the name of the index. This should be a string, so use
quotes if you are using a literal value. The second argument is the query string,
and the third is a boolean toggle called "exact". If exact is TRUE, then the
result is accurate. If exact is FALSE, then the result is only an upper bound.
The actual count could be lower. exact FALSE is faster, because it doesn't screen
out invalid documents (see issues on DML and optimization for more details).
You could also do "select count(*)" with a contains clause, and this is recommended
when you have structured conditions, but count_hits is faster for text-only
queries.
Simple Queries
The simplest query is a single word:
contains(text, 'dog') > 0
You can escape your word with curlies:
contains(text, '{dog}') > 0
This is useful when you want to query on a reserved word or your word has special
characters which are query operators. The list of reserved words and the query
operator characters are in the documentation.
You can query on a phrase just as easily:
contains(text, 'dog my cat') > 0
Unlike internet search engines, this searches for the phrase "dog my cat", not
just any document which has these words.
If your phrase contains a stopword:
contains(text, 'dog the cat') > 0
then the stopword ("the" in this case) is treated as a wildcard, matching any
word. This query would match all the following documents:
dog the cat
dog my cat
dog cat cat
but not this document:
dog cat frog
In other words, the stopword must match something. Stopwords alone disappear from
the query. The query:
contains(text, 'the & cat') > 0
is reduced to
contains(text, 'cat') > 0
Queries on just stopwords will return no rows.
Query Operators (Part 1)
Boolean Operators
Boolean operators are used to combine the results of subqueries using boolean
algebra. interMedia has three boolean operators:
AND (&)
OR (|) |
AND and OR are the basic boolean operators. However, the operands are
returning numeric scores, rather than boolean values, so we've implemented
AND as a minimum of its operand scores, and OR as a maximum of its operand
scores. This results in expected behavior -- if any operand of AND scores
0 (the word is not in the document) -- then the AND retuns 0 (the document
is not returned). You can use either word or symbol: 'dog and cat' is the
same as 'dog & cat'. |
| NOT (~) |
NOT is not the unary negator. Instead, it is "AND NOT". 'dog NOT cat'
returns all document which have "dog" except those which also have "cat".
The score returned is the score of the left child. |
Subqueries and Grouping
You can use parentheses for subqueries and precedence, as in:
(dog my cat AND fido) OR horse
Without parentheses, the operators follow a precedence hierarchy which can be
found in the documentation. Since AND is higher precedence than OR, the parens
are not actually needed in the example above. However, they are needed in this
query:
dog my cat AND (fido OR horse)
in order to override the precedence hierarchy.
Scoring Operators
The scoring operators operate on the scores of their operands. There are four
scoring operators:
| WEIGHT (*) |
The weighting operator multiplies a search term's score to make it more
or less important in the query. The multiplier can be from .1 to 10:
contains(text, '(dog*2) AND cat') > 0
This query looks for documents with "dog" and "cat", but the score for "dog"
is multiplied by 2. This makes documents with "dog" more likely to have
a higher score in the result set than those with "cat". WEIGHT does not
have a word equivalent -- you must use the character *. |
| THRESHOLD (>) |
The threshold operator filters out documents below a particular score.
'dog > 10' will result only those documents containing "dog" which score
higher than 10. You can also do it this way:
contains(text, 'dog') > 10
but threshold is implemented at a lower level, so will be faster. THRESHOLD
does not have a word equivalent -- you must use >. |
| MINUS (-) |
The minus operator takes the left operand score and subtracts the right
operand score. 'dog - cat' looks for documents with "dog", but those documents
which also have "cat" will probably score lower than those without. This
is not the same as NOT, which completely filters out the document if the
right operand search term is present. |
| ACCUM (,) |
Accumulate groups several words or phrases, and scores higher when more
of its operands are hit. 'dog, cat, frog' will score any document with all
three words higher than any document with two of the three, which will score
higher than any document with only one of the three. This is changed from
previous versions, where ACCUM merely added the scores of its operands.
You can either use the symbol , or the word ACCUM. |
Set Operators
For those of you familiar with previous versions, the MAXDOC (:) and FIRST/NEXT
(#) operators have been removed. You can use cursor fetching loops to get the
same effect.
Word Expansion Operators
Word expansion operators expand a word to find similar forms of the word in
the document set. For Oracle8i, the limit on maximum number of expansions has
been removed. However, the more words a term expands to, the slower the query
will run. There are five word expansion operators:
| WILDCARD (% _) |
You can use the SQL like wildcards % and _ to find words matching a pattern.
'do%', for instance, finds all documents with words beginning with do, such
as dog, door, etc. This is done via a like query on the index table, so
a wildcard query like '%do%' will be slow, as a full table scan of the index
table is required. |
| FUZZY (?) |
Fuzzy query finds words with similar form, using a proprietary algorithm.
This is useful for finding mis-typed or mis-OCR'd words. The fuzzy operator
is ?, as in '?dog'. You can change the rules used for fuzzy and limit the
expansion using the wordlist class. FUZZY has no word equivalent -- you
must use the ? operator. |
| STEM ($) |
Stem query finds words with the same stem form, using integrated Xerox
linguistic code. This is useful for finding "GOING" and "WENT" from "GO",
for instance. You can change the settings for stem expansion using the wordlist
class preferences. STEM has no word equivalent -- you must use the $ operator,
as in '$go'. |
| SOUNDEX (!) |
Soundex query finds words which sound alike. This no longer uses a separate
table, but is instead a specialized fuzzy expansion. SOUNDEX has no word
equivalent -- you must use the ! operator, as in '!dog'. |
| EQUIV (=) |
Equivalence is for manually inputting the various forms of a word, as
in 'dog = cat', searching for documents with 'dog' or 'cat', treating them
as different forms of the same word. Equiv only works on simple words --
you cannot equiv phrases. You can either use the symbol = or the word EQUIV.
|
Proximity Operator
Other than phrases, interMedia has one proximity operator -- NEAR,
which finds documents where the input phrases are close to each other. The closer
the words are to each other, the higher the score. NEAR has two forms. The shorthand
form uses the ; character;
dog ; cat ; boat
This query finds documents which have dog, cat, and boat, and scores the document
higher the closer they are to each other.
The function form has options to set range and directionality:
NEAR((dog,boat), 10, TRUE)
The first argument to NEAR is the list of words. In the example, we are looking
for dog and boat. The second argument is the maximum allowed span. In the example,
we are constraining it to within a 10-word span. That means there cannot be more
than 10 words between "dog" and "boat". This range can be up to 100. It is optional
-- if omitted, then it defaults to 100. The third argument is directionality.
If TRUE, as in the example, then the words have to appear in the order specified
in the list. If FALSE, the words can appear in any order, as long as they fit
within the specified word span. This can be omitted, in which case FALSE is the
default.
Query Operators (Part 2)
Within
The within operator limits a subquery to a particular section:
dog within title
will find all documents with "dog", but only if it occurs in the "title" section.
The name of the section is used rather than the tag. The "title" section has to
have defined in your section group -- you cannot specify arbitrary tags. If your
section name has non-alphanumeric characters, you should enclose the name in curlies:
dog within {my_title}
which is why it is't recommended to use non-alphanumeric characters in section
names. Within is high on the precedence hierarchy, so if you have a subquery,
it is safer to use parentheses to group it:
(dog and cat) within title
If you don't use parentheses, it will search for documents with "dog" anywhere
and with "cat" in the title section:
dog and cat within title == dog and (cat within title)
which is a different query.
If the section is a zone or special section, the subquery is constrained to
a particular instance of the section:
(dog and cat) within sentence
means dog and cat within the SAME sentence. This is different from:
(dog within sentence) and (cat within sentence)
for instance -- which means dog in any sentence, and cat in any sentence. Section
types and query semantics were covered in issue #7.
Within does not nest, so this is not allowed:
ABOUT is an advanced query operator. We take the input to ABOUT and do our
best to increase precision and recall. If your index has a theme component (INDEX_THEMES
is YES for BASIC_LEXER) then ABOUT does a theme query. It determines the theme
of the input term, and finds documents with the same theme. This allows a query
like this:
contains(text, 'about(canines)')
to find relevant documents even if they don't have the word "canines" in them
-- they might have "dogs" or "wolves" instead, for instance. It will also filter
out documents which have the word "canines" but are not significantly about canines,
such as an article on dentistry, for instance.
The input to ABOUT does not have to be a single term. It can be an unstructured
phrase:
about(japanese banking investments in indonesia)
This phrase will most likely have multiple themes -- banking, Japan, etc. The
documents retrieved by this query will have all those themes. This allows you
to use plainly-worded descriptive phrases instead of rigid query syntax and still
find relevant documents.
If your index does not have a theme component, then ABOUT will do a query
expansion to increase recall -- it will split the input into words, stem each,
then string it back together using accumulate:
about(go home now)
is transformed to
$go,$home,$now
Note that the purpose of ABOUT is to have interMedia do what it can to
find the best-fitting documents. It does not have a formulaic definition like
the other operators.
ABOUT can be combined with other operators:
contains(text, 'about(dogs) and $cat')
for instance. This allows you to do a combined theme and text search, which was
not as easy to do in previous versions.
Thesaurus Operators
The thesaurus operators allow you to expand a thesaurus phrase using a thesaurus.
The SYN operator, for instance, expands a word into synonyms:
SYN(dog) == {dog} | {mutt} | {canine}
The expansion uses a previously loaded thesaurus. This is usually a thesaurus
of your own creation -- interMedia does not install a thesaurus by default,
although a sample thesaurus is included in ?/ctx/sample/thes. You can see the
expansion by using the ctx_thes package functions:
declare
output varchar2(80);
begin
output := ctx_thes.syn('dog');
dbms_output.put_line(output);
end;
Thesaurus operators take simple thesaurus terms as input -- they cannot nest or
take expansions as input:
SYN($dog) <-- BAD
SYN(BT(dog)) <-- BAD
SQE
SQE's are Stored Query Expressions -- a macro operation. You store a query
string and give it a short name, using the ctx_query PL/SQL package:
ctx_query.store_sqe('cats', 'cat = feline = kitty');
The first argument is the name of the SQE -- "cats". The second argument is the
stored expression. When I use it in a query:
contains(text, 'SQE(cats)')
interMedia expands it to the stored query expression, making this equivalent
to:
contains(text, 'cat = feline = kitty')
SQE text can even refer to other SQE's:
ctx_query.store_sqe('animals', 'frog | hog | sqe(cats)');
Although if you set up a circular reference, you will get an error at query time.
There are fundamental changes to SQE in 8i. First, stored query expressions are
no longer tied to an index -- you can use any SQE in any query on any index. You
can even use other user's SQE's by prepending the owner name:
SQE(bob.cats)
Second, SQE partial results are no longer stored. Instead, SQE works as a macro
operation, replacing lexically during query parsing. Finally, there is no longer
the notion of session-level SQE -- store_sqe does a commit.
PL/SQL
For those of you familiar with previous versions, the PL/SQL operator (@)
has been removed. This is due mainly to stricter transactional restrictions
resulting from tighter integration with the kernel
Thesaurus
Overview of Thesaurus Functionality
interMedia provides a program to load formatted, file-based thesauri
into ctxsys- owned tables in the database. These thesauri can then be used at
query time to expand query terms for improved recall. interMedia also
provides PL/SQL functions for browsing the thesaurus while it's in the database,
and a program for exporting the thesaurus back out to a file.
These thesauri are usually customer-defined. interMedia provides a
sample thesaurus in ?/ctx/sample/thes, but this is not imported by default.
The Thesaurus File
The first step in thesaurus import is construction of the thesaurus file.
This is a list of terms in the thesaurus, each followed by their relationship
to other terms. For instance:
canine
SYN mutt
BT mammal
NT wolf
lizard
BT reptile
The first term in this thesaurus is "canine". The second line declares that "canine"
has a synonym, "mutt". The following lines declare that the broader term of "canine"
is "mammal", and a narrower term of "canine" is "wolf".
Terms, such as "canine" or "lizard" in our example above, must be on their
own line, and must be at the start of the line -- no whitespace is allowed between
the beginning of the line and the term.
Relationship lines, such as "SYN mutt", must have whitespace before them.
The relationship word (SYN) is not case-sensitive. You cannot have more than
one relationship word / word pair on a single line:
canine
Relationships cannot be on the same line as the term:
These rules are required for proper parsing of the thesaurus file. The more relaxed
rules in previous versions were not able to correctly parse certain cases, especially
when terms had common substrings with relationship words, such as "NT 4.0".
Thesaurus Relationships
The thesaurus file supports the following set of relationship words:
| SYN |
Synonymity. A SYN B means that A and B mean the same thing. A SYN B implies
B SYN A. A SYN B and B SYN C implies A SYN C. |
| UF |
Use For. Same as SYN. |
| PT |
Preferred term. A PT B means that B is the preferred term for A. There
can be only one preferred term for any thesaurus term. If multiple PT lines
are seen, latter PT lines override previous lines. |
| USE |
Same as PT. |
| SEE |
Same as PT. |
| BT |
Broader Term. A BT B means that B is a broader term of A. A BT B implies
B NT A (A is a narrower term of B). Terms can have multiple broader terms.
Circular references should not be introduced.
You can specify whole hierarchies under a single term using BTn:
dog
BT1 canine
BT2 mammal
BT3 vertebrate
BT4 animal
This isn't declaring multiple broader terms of dog, but instead the whole
hierarchy above dog -- canine is the first-level BT of dog, then mammal
is the second-level BT -- meaning that mammal is the BT of canine. BTn goes
up to BT16. |
| BTP |
Broader Term Partative. A BTP B means that B is a broader term of A in
the sense that A is a part of B. Hard drive BTP Computer, for instance --
Computer is a broader term of hard drive in the sense that a hard drive
is a part of a computer (NCA notwithstanding). A BTP B implies B NTP A,
but does not imply A BT B -- the partative hierarchy is separate from the
normal hierarchy. This relationship is sometimes called "meronymity", although
not by anyone with a life. |
| BTG |
Broader Term Generic. A BTG B means that B is a broader term of A in
the sense that B is a generic name for A. Rats BTG Rodents, for instance.
A BTG B implies B NTG A, but does not imply A BT B. This relationship is
sometimes called "holonymity". |
| BTI |
Broader Term Instance. A BTI B means that B is a broader term of A in
the sense that A is an instance of B. Cinderella BTI Fairy Tale, for instance.
A BTI B implies B NTI A, but does not imply A BT B. I don't know any fancy
name for this relationship. |
| NT |
Narrower Term. The opposite of BT. You can use NTn, just like BTn. |
| NTP |
Narrower Term Partative. The opposite of BTP. |
| NTG |
Narrower Term Generic. The opposite of BTG. |
| NTI |
Narrower Term Instance. The opposite of BTI. |
| RT |
Related Term. A RT B means that A and B are associated, but are not synonyms
and cannot be arranged into a hierarchy. A RT B implies B RT A. A RT B and
B RT C does not imply A RT C. |
| <lang>: |
Foreign language translation. <lang> can be any label you wish to
use for the language, 10 characters or less. This applies to the immediately
preceding term, even if it is a relationship word:
cat
french: chat
RT hat
french: chapeau
Here "chapeau" is the french translation for "hat", not "cat". |
| SN |
Scope note. You can attach a little note to a term as a comment. This
can be up to 2000 characters. If you need multiple lines, simply repeat
the SN keyword -- the text is concatenated:
cat
SN I had a cat once. He was brown and
SN grey and well-behaved.
|
Our thesaurus functionality also supports homographic disambiguation using
parenthesis:
mercury (god)
BT Greek mythology
mercury (element)
SYN quicksilver
as well as compound terms using the + sign:
thermometer
SYN temperature + instrument
Importing a Thesaurus File
Once you have the file, you need to import it into the interMedia data
dictionary. From the OS command line:
ctxload -user ctxsys/ctxsys -thes -name mythes -file mythes.txt
The login user can be any CTXAPP user -- it does not have to be ctxsys. -thes
is the toggle for thesaurus import. mythes is the name for the new thesaurus,
and mythes.txt is the file to import.
If a thesaurus of this name already exists, you will get an error and you
must drop the existing thesaurus before importing this one. Thesaurus names
are database-global.
Optionally, you can specify the thescase argument to make the thesaurus terms
case-sensitive:
ctxload ... -thescase Y
The default is N, meaning that the thesaurus is not case-sensitive. Mercury (the
god) and mercury (the element) are the same term if found in the thesaurus file.
Thesaurus Query Operators
Once the thesaurus is loaded, any user can use it in any query on any index
using the thesaurus operators. These mimic the relationship names for the most
part. For instance:
contains(text, 'SYN(dog, mythes)') > 0
will search for documents which contain any of the synonyms of "dog" defined in
the "mythes" thesaurus. The actual expansion uses OR like this:
{canine}|{doggie}|{mutt}|{dog}
Homographic disambiguators are not included in the expansion:
mercury (element)
SYN quicksilver
SYN(quicksilver) === {quicksilver}|{mercury}
Compound phrases are treated as a conjunctive:
thermometer
SYN temperature + instrument
SYN(thermometer) == {thermometer}|({temperature}&{instrument})
Thesaurus operators take simple thesaurus terms as input -- they cannot nest or
take expansions as input:
SYN($dog) <-- BAD
SYN(BT(dog)) <-- BAD
Here are the specific thesaurus functions:
| SYN( term [, thesname] ) |
The input term is always part of the output; if no synonyms are found
then term is the expansion. thesname is the name of the thesaurus to use.
If omitted, thesaurus DEFAULT is consulted. There is no DEFAULT thesaurus
installed automatically -- it is up to the user to load a thesaurus named
DEFAULT. |
| PT( term [, thesname] ) |
Preferred term. Only the preferred term is returned in the expansion.
|
BT( term [, level [,thesname]] )
BTP( term [, level [,thesname]] )
BTG( term [, level [,thesname]] )
BTI( term [, level [,thesname]] )
NT( term [, level [,thesname]] )
NTP( term [, level [,thesname]] )
NTG( term [, level [,thesname]] )
NTI( term [, level [,thesname]] )
|
Broader/Narrower terms of all types. The optional second argument is
a numeric scope indicating how many levels of hierarchy you want. A BT with
a level of 2, for instance, would search for the term, all broader terms
of the term, and all broader terms of those terms.
NOTE: the syntax demands that if you want to specify thesname
then you MUST specify level too, eg
'NT(term,99,mythes)' |
| TT( term [, thesname] ) |
Top term. Travel up the BT hierarchy to the root term. Only this root
term is searched. |
| RT( term [,thesname] ) |
Related term. Search for term and all related terms. |
| TR( term [, lang [, thesname]] ) |
Foreign language translation of the term. Lang argument should match
label used in the input thesaurus. If omitted all foreign language translations
will be searched. Note that the foreign language translation must have been
entered in the input thesaurus -- we don't do automatic translation. |
| TRSYN( term [, lang [, thesname]] ) |
Foreign language translation of term synonyms. This is functionally equivalent
to taking SYN of the term, then TR of each resulting term. Like TR, if the
target language is omitted, all available translations will be searched.
NOTE: the syntax demands that if you want to specify thesname
then you MUST specify lang too, eg
'TR(term,all,mythes)' and 'TRSYN(term,all,mythes)' |
Browsing a Loaded Thesaurus
There are numerous functions in the ctx_thes package for browsing the thesaurus
through PL/SQL. For the most part, they have the same names and arguments as
the query functions, but they return the expansion text. For instance:
declare
exp varchar2(4000);
begin
exp := ctx_thes.syn('dog','mythes');
end;
Here the synonyms of "dog" in the "mythes" thesaurus are returned. The expansion
is returned as a string, so exp will get the value:
{canine}|{doggie}|{mutt}|{dog}
The return can be up to 32000 characters in length.
Exporting a Thesaurus
Thesaurus export is also done through ctxload:
ctxload -thesdump -user ctxsys/ctxsys -name mythes -file mythes.txt
The exported version will be structurally similar to the imported version -- all
the words will be there in the right places, but it may not be exactly the same
-- in particular, it will probably be more verbose, specifying explicitly relationships
which were implied in the input thesaurus.
Thesaurus Standards
The phrase ISO-2788 or -5964 thesaurus standard is somewhat misleading. The
computing industry considers a "standard" to be a specification of behavior
or interface. These standards do not specify anything. If you are looking for
a thesaurus function interface, or a standard thesaurus file format, you won't
find it here. Instead, these are guidelines for thesaurus compilers -- compiler
being an actual human, not a program. They give general advice on how to build
and maintain a thesaurus, such as "Adverbs should not be used alone as indexing
terms".
What Oracle has done is taken the ideas in these guidelines and in ANSI Z39.19
-- an American version of ISO-2788 -- and used them as the basis for a specification
of our own creation.Therefore interMedia can handle all the inter-word
relationships mentioned in Z39.19, and even use the same abbreviations, but
the interface is our own. the interMedia file format looks a bit like
"Figure 1" of ISO-2788, but Oracle imposed its own rules to make it parseable.
interMedia can implement foreign language lookups like ISO-5964, but
the input thesaurus looks nothing like their examples. So, Oracle supports ISO-2788
relationships or ISO-2788 compliant thesauri.
Query Tuning
Analyzing Text Queries
In Oracle8i, tuning Text queries is not very different from tuning
regular SQL queries. This is demonstrated with an extended example, using this
table of magazine articles:
create table articles (
id number primary key, -- article id
issue number, -- issue number
title varchar2(80), -- article title
published date, -- publication date
text clob -- article text
);
Assume issue has a non-unique b-tree index called art_issx, and text has a Text
index called art_textx.
The first step in query tuning is analysis of the query plan. This is done
using explain plan. Note that the SQL explain plan command is different from
ctx_query.explain. explain plan works on a SQL query, and prints the SQL query
plan executed by oracle. ctx_query.explain works on the input to the contains
function, and prints out the Text query plan executed by interMedia.
ctx_query.
The Oracle Server Tuning manual details how to set up and run SQL explain
plan, so check there if you do not know how this is done. You don't need to
do anything special for interMedia queries -- thanks to tighter integration
with the kernel, the optimizer now understands what the contains clause means
and can generate plans for Text queries -- it's all transparent to the end user.
Here is a simple query:
select title from articles -- query 1
where contains(text, 'iced lolly') > 0
This will normally generate a query plan like this:
1.1 TABLE ACCESS BY INDEX ROWID ARTICLES
2.1 DOMAIN INDEX ART_TEXTX
The DOMAIN INDEX part is our equivalent of an INDEX RANGE SCAN -- it means that
the index (the Text index, in this case) is producing a stream of rowids for the
kernel. This is called rowsource invocation, and, like an index range scan, is
a fairly efficient way of finding matching rows.
Score
add score to the select list:
select score(1) score, title from articles -- query 2
where contains(text, 'iced lolly', 1) > 0
This generates the same query plan as above -- score does not appear in the query
plan. However, score does make query 2 execute a bit more slowly than query 1.
The SCORE function is called "ancillary data", and it is delivered separately
from the rowid stream. The kernel will ask for some rowids, then, in a separate
call, will give a rowid and ask for that row's score. This second call imposes
a bit of overhead, which is why query 2 is slower than query 1.
But how about a query like this:
select score(1) score, title from articles -- query 3
where contains(text, 'iced lolly', 1) > 0
order by published desc
1.1 SORT ORDER BY
2.1 TABLE ACCESS BY INDEX ROWID ARTICLES
3.1 DOMAIN INDEX ART_TEXTX
The kernel classifies the SORT ORDER BY as a "blocking operation". The presence
of a blocking operation means that Oracle8i cannot guarantee that it will
ask for scores in the same order as the produced rowid stream. It could produce
rowid ABC first, but get asked for its score last. This means that the scores
for all rows must be kept in memory, called "batch processing". If the hitlist
is large, this could use disk, and be slower than query 2 score resolution. Of
course, the order by would normally make query 3 slower anyway.
The presence of any blocking operation switches interMedia to batch
processing, even something like this:
select score(1) score, title from articles -- query 4
where contains(text, 'iced lolly', 1) > 0
order by score(1) desc
could use incremental processing, but still uses batch processing.
Now, query 4 is important in information retrieval, because it is very common
to show only the first few top-scoring hits in the hitlist. Web search engines
work this way, for instance. This case puts a premium on response time -- the
time to get the first page is critical, the time to get the second page is not
as important. Unfortunately, in the database world, the order by decreases response
time, because all the hits have to be produced and sorted before the first row
is displayed.
Because this is a common situation, interMedia has an alternative query
method which improves response time for score sorting:
select /*+ FIRST_ROWS */ score(1) score, title -- query 5
from articles
where contains(text, 'iced lolly', 1) > 0
order by score(1) desc
The addition of the FIRST_ROWS hint causes the kernel to notify interMedia
of the sort by score. interMedia produces the rowid stream in score-sorted
order, and the kernel eliminates the SORT ORDER BY -- similar to a b-tree index
range scan ordered by the key prefix:
1.1 TABLE ACCESS BY INDEX ROWID ARTICLES
2.1 DOMAIN INDEX ART_TEXTX
This is called "pushing down" the sort. Also, the elimination of the SORT ORDER
BY allows us to use incremental score processing instead of batch score processing.
The payoff of FIRST_ROWS is improved response time; the cost is decreased
throughput. interMedia cannot produce the hits in score-sorted order
naturally. Instead, it runs the Text portion internally, and remembers the top-scoring
500 or so rows. Because the set size is small and because we keep it all in
memory rather than disk, Oracle8i can sort it faster than a SORT ORDER BY. However,
when it runs out of hits in this batch, it has to re-execute the query to get
another batch. Thus, if you're fetching all the results, do the SORT ORDER BY;
if you just want the first few hits, use the FIRST_ROWS hint.
The FIRST_ROWS hint will not just affect interMedia -- the entire query
will be optimized for response time. This is not usually a problem, since you
want the whole query optimized for response time or throughput. However, if
you want to optimize the overall query one way and interMedia another
way, you can use the DOMAIN_INDEX hints. To have interMedia score-sorting,
but optimize the overall query for throughput:
select /*+ ALL_ROWS DOMAIN_INDEX_SORT */
Conversely, if you want the overall query optimized for response time, but you
don't want interMedia score-sorting:
select /*+ FIRST_ROWS DOMAIN_INDEX_NO_SORT */
Mixed Queries
So far, the discussion has looked at rowsource invocation for contains --
Oracle8i asks for a stream of rowids which match the contains. However,
there is an alternate invocation called "function invocation". In function invocation,
Oracle8i produces the rowid stream some other way -- a b-tree index,
a full table scan, etc. and then asks , for each rowid in that stream, if the
row matches the contains:
select /*+ FULL(articles) */ title -- query 6
from articles
where contains(text, 'face flannel') > 0
1.1 TABLE ACCESS FULL ARTICLES
Here a full table scan is forced, which makes Oracle8i go through the table
row by row, calling function contains for each one. Note that the contains function
invocation is not shown in the plan, just as a structured condition filter would
not be shown.
For the same number of rowids, function invocation is much slower than rowsource
invocation. So why have it? Consider a query like this:
select title from articles -- query 7
where contains(text, 'face flannel') > 0
and issue = 7
Recall that issue has a non-unique b-tree index. rowsource invocation could be
used - get a rowid stream from the Text index, and check each rowid to see if
issue = 7:
1.1 TABLE ACCESS BY INDEX ROWID ARTICLES -- plan 7
2.1 DOMAIN INDEX ART_TEXTX
On the other hand, the b-tree index could be used on issue - find all rows where
issue = 7, and call function contains individually for each of those:
1.1 TABLE ACCESS BY INDEX ROWID ARTICLES -- plan 8
2.1 INDEX RANGE SCAN ART_ISSX NON-UNIQUE
The query plan is driving from the index on issue, which implies function invocation
of contains. Why do that? Function invocation is slower than rowsource for the
same number of rowids.
If the number of articles in issue 7 is very small, and the number of articles
with "face flannel" in the text is very large (they were quite the rage in the
80's) then the number of rowids is not the same. It may be faster to drive from
the b-tree index and use function contains on a small number of rows than to
drive from the Text index and do a TABLE ACCESS for a large number of rows.
Of course, it is usually impossible to have a priori knowledge of the selectivities,
which makes it impractical to use hints to specify which index to use. Thanks
to integration with the cost-based optimizer, you don't have to hard-code it
-- let the CBO decide which method is more efficient.
Using the CBO
In order to choose correctly, the cost-based optimizer needs to know the selectivities
of the structured and Text conditions. You provide this information by generating
statistics:
analyze table articles
compute statistics
This will (eventually) generate stats for the table and all indexes, including
the Text index. You can use estimate statistics, as well -- is makes no difference
with respect to the Text index.
When interMedia is asked to compute statistics, it analyzes term frequencies
in the Text index. For the most common terms, we record the number of documents
which contain that term. It also calculates the average number of documents
per term to use for all other terms.
Given a Text query, and using these stats, it can compute an estimate of the
query selectivity. Because it doesn't record exact term frequency for every
term, and because a Text query can combine terms using AND, phrase, WITHIN,
etc. this selectivity is only a rough estimate, not an exact number.
The CBO, confronted with a query like query 7, can then use the b-tree stats
and the Text index stats and -- without hints -- can choose the best index to
use to drive the query.
If your b-tree data is not uniformly distributed, then histograms for the
data can help:
analyze table articles
compute statistics for column issue size 127
See Oracle SQL Reference and Server Tuning manuals for more information. Nothing
special is done for histograms -- it simply provides more accurate stats to the
CBO. Like any SQL query, you can use hints to override the CBO when it's wrong.
Rowid Bitmap Merges
The CBO can choose a third option for query 7 -- use both indexes, and perform
a rowid bitmap merge of the two results. You can force this using the INDEX_COMBINE
hint:
select /*+ INDEX_COMBINE(articles art_issx art_textx) */ -- query 9
title
from articles
where contains(text, 'face flannel') > 0
and issue = 7
which produces the following plan:
1.1 TABLE ACCESS BY INDEX ROWID ARTICLES -- plan 9
2.1 BITMAP CONVERSION TO ROWIDS
3.1 BITMAP AND
4.1 BITMAP CONVERSION FROM ROWIDS
5.1 SORT ORDER BY
6.1 DOMAIN INDEX ART_TEXTX
4.2 BITMAP CONVERSION FROM ROWIDS
5.1 INDEX RANGE SCAN ART_TEXTX NON-UNIQUE
This is the best performer when the result lists for the structured and contains
are roughly equal, somewhat large, and the intersection is small. In comparison,
plan 8 would be hobbled by function contains overhead, and plan 7 would require
unnecessary TABLE ACCESSes to check the structured condition. A similar plan is
used when bitmap indexes are employed.
Bitmap merge, like SORT ORDER BY, is a blocking operation, which, if you are
selecting SCORE, will force batch processing instead of incremental processing.
Query Feedback
Text Query Tweaking
Text queries rarely give you exactly what you want. Consider web search engines
like AltaVista. Sometimes there's too many documents. Sometimes you wonder why
a certain document turned up in the hitlist. Sometimes the document you wanted
to find isn't in the results at all.
Similar results could ocur with interMedia, so it provides two query
feedback methods which allow you to refine your Text query.
Explain
Explain is the interMedia analogue of explain plan, and let's you see
exactly what your query looks like. This lets you trim unintended wildcard expansions,
for instance, or add parentheses to correct precedence errors.
You need a result table, just like explain plan, with the following columns,
in the following order:
create table xres (
explain_id varchar2(30),
id number,
parent_id number,
operation varchar2(30),
options varchar2(30),
object_name varchar2(64),
position number
)
| explain_id |
A label which distinguishes one query plan from another. |
| id |
A step sequence number. This is an ascending sequence number assigned
to each step which ignores hierarchy. The root step of each plan has id
1, the second step has id 2, etc. |
| parent_id |
The id of the parent of this query plan step. The root step has parent_id
= 0. |
| operation |
The type of operation for this step. This can include WORD, AND, WITHIN,
etc. |
| options |
The expansion operators (fuzzy, wildcard, etc.) are transformed into
equivalence operators. The options column tells you what the original expansion
function was. options is also used to show arguments to NEAR. |
| object_name |
The operand of the query step. For WORD operations, this is the actual
word being searched. It also holds section names for SECTION operations,
weight values, etc. |
| position |
A sequence number to sequence children of the same parent. The first
child step will have position 1, the second will have position 2, etc. |
You then run explain for your query on your index:
ctx_query.explain('myindex', 'dog and ($run spot)', 'xres');
The first argument is the name of the index. The second argument is the query
string, and the third is the name of the result table. The query is explained
and the results inserted into the plan table. These results are not committed,
which mimics the behavior of SQL explain plan. You can then print it out in explain
plan style using this SQL:
select lpad(' ',2*(level-1))||level||'.'||position||' '||
operation||' '||
decode(options, null, null, options || ' ') ||
object_name plan
from xres
start with id = 1
connect by prior id = parent_id;
Which produces the following plan:
1.1 AND
2.1 WORD DOG
2.2 PHRASE
3.1 EQUIVALENCE ($) RUN
4.1 WORD RAN
4.2 WORD RUN
4.3 WORD RUNNING
4.4 WORD RUNS
3.2 WORD SPOT
The optional fourth and fifth arguments to explain, sharelevel and explain_id,
allow multiple explains to use the same result table. When sharelevel is 0, the
default, then the result table is automatically truncated before the explain is
run. When sharelevel is 1, then any rows in the result table with the same explain_id
are deleted before the explain is run, but other rows are not touched. Thus, multiple
explains can use the same result table as long as they use distinct explain_id's.
When using shared result tables, the start with and connect by clauses of
the example sql have to be modified to limit them to a specific explain_id:
start with id = 1 and explain_id = 'YOURID'
connect by prior id = parent_id and explain_id = 'YOURID'
Hierarchical Query Feedback
Explain lets you see exactly how your query is run, which allows you to adjust
your query to refine the results. Hierarchical query feedback goes one step
further and actually suggests other search terms you might want to try.
You start by creating a result table. This result table looks a bit like the
explain result table, but with the addition of nested tables for suggested alternative
terms:
create table fres (
feedback_id varchar2(30),
id number,
parent_id number,
operation varchar2(30),
options varchar2(30),
object_name varchar2(64),
position number,
bt_feedback ctxsys.ctx_feedback_type,
rt_feedback ctxsys.ctx_feedback_type,
nt_feedback ctxsys.ctx_feedback_type
)
nested table bt_feedback store as fres_bt
nested table rt_feedback store as fres_rt
nested table nt_feedback store as fres_nt
The bt_feedback, rt_feedback, and nt_feedback store suggested broader terms, related
terms, and narrower terms, respectively. Each is a nested table of ctx_feedback_type
objects, which stores the suggested term in the attribute "text". The other attributes
are not currently used, but are reserved for future use.
Now you run hfeedback on your query:
ctx_query.hfeedback('myindex', 'felines', 'fres');
hfeedback will then draw on the linguistics knowledge base to suggest broader,
narrower, and related terms. The results are inserted into the result table. Like
explain and SQL explain plan, these inserts are not committed implicitly. You
can select from the nested tables to get feedback:
select text from the(select nt_feedback from fres
where object_name = 'felines')
TEXT
----
cat leopards tiger
Be aware that when getting feedback for queries with ABOUT clauses, the input
is broken into themes, so you cannot simply select back. For instance, if your
query were "ABOUT(African safari)", there would be no row with object_name = 'African
safari' -- more likely, there would be separate rows for "Africa" and "safaris".
The feedback terms are obtained from the built-in linguistics knowledge base.
If the knowledge base has been extended, feedback can also draw on these extensions.
The results will depend on the contents of your index -- only worthwhile terms
are shown. "Lion" is a narrower term of feline, but my index does not have any
documents with "Lion" in them, so it does not appear in the narrower term table.
Note that for performance reasons, we do not screen out invalid documents, so
it is possible (but unlikely) to get feedback terms which would not produce
query hits.
The feedback for non-ABOUT portions of the query can contain alternate forms
of words, as well -- narrower terms for "learning" might include "education",
"educating", "educational", etc.
If the index is case-sensitive, then case-sensitive feedback terms are provided.
For instance, narrower terms for "databases" might include both "PL/SQL" and
"Pl/sql".
When the index is case-insensitive, queries are case-insensitive -- a contains
query for "oracle" will pull up the same records as "Oracle". However, feedback
uses the linguistic knowledge base, which is case-sensitive. So, even if your
index is case-insensitive, the feedback you get depends on the case of the query
-- the feedback for "oracle" will not be the same as the feedback for "Oracle".
You can also give hfeedback multiple terms in a query string:
ctx_query.hfeedback('myindex', 'felines and canines', 'fres');
hfeedback will generate query plan steps as in explain, but phrases are not broken
into individual words and expansion operators such as fuzzy, wildcard, soundex
and stem are ignored. Each word or phrase step will have bt, nt, and rt tables.
Operator steps like AND will not have anything in these tables.
Like explain, hfeedback has sharelevel and feedback_id arguments for result
table sharing.
Highlighting
Document Services
Highlighting and linguistic extraction have been combined into a unified document
services package called ctx_doc. A document service is any ad hoc operation
on the document text.
Filter
The filter command fetches the document contents and filters it to readable
text. This allows you to see what the indexing engines sees. You first make
a result table:
create table fres ( query_id number, document clob )
Now request a filter:
ctx_doc.filter('myindex', '123', 'fres');
The first argument is the
name of the index -- you can use USER.INDEX, but the issuing user must have
SELECT access to the base table. The second argument is the primary key value
of the document you wish to filter. If you have a composite primary key, use
commas to separate the key fields:
ctx_doc.filter('myindex', 'ABC,123', 'fres');
Alternatively, you can use the function ctx_doc.pkencode:
ctx_doc.filter('myindex', ctx_doc.pkencode('ABC','123'), 'fres');
The third argument to filter is the
name of the result table. interMedia will translate the primary key to
rowid, fetch the document contents by rowid, and filter to readable text. It
will insert a new row into the result table and the readable text output will
be placed in the document clob column.
Query_ID
There is an optional fourth argument -- query_id -- which can be any integer.
This will be the value of the query_id column for the row in the result table.
If you are sharing result tables across several filter requests, you can have
each request use a different query_id, and distinguish the outputs by this column.
filter does not pre-truncate or pre-delete rows in the result table, regardless
of the setting of query_id. The default value of query_id is 0.
Plaintext
Recall part of the indexing pipeline for a moment:
+-----------+ +----------+ filtered +-----------+ plain
| | doc data | | doc text | | text
| datastore |---------->| filter |----------->| sectioner |----->
| | | | | |
+-----------+ +----------+ +-----------+
There are two "readable text" outputs in this segment of the indexing chain --
the filtered text, output from the filter, and the plain text, output from the
sectioner. When using the INSO filter, for instance, the filtered text is HTML
and might have text like this:
<BODY>
<P>This is the first
<B>sentence.</B>
</BODY>
The HTML section group will transform this into plain text, removing tags and
reformatting whitespace, much like a browser might save to text file:
This is the first sentence.
By default, filter will provide the filtered text. An optional boolean parameter
"plaintext" (the fifth argument) allows you to request the output from the sectioner
instead:
ctx_doc.filter('myindex', '123', 'fres', plaintext=>TRUE);
The plaintext output can be used for human readers, while the filtered text output
can be used if you are sending the results to a browser. Note that if you are
using the null section group, the plain text and filter text are the same, so
the plaintext argument would have no effect.
The default value of plaintext is FALSE, meaning filter text is produced.
Highlight
The highlight service takes a query string, fetches the document contents,
and shows you which words in the document cause it to match the query. The result
table for highlight looks like this:
create table hres (
query_id number,
offset number,
length number
)
Now request the highlight:
ctx_doc.highlight('myindex', '123', 'dog | cat', 'hres');
Like filter, the first argument is the name of the index and the second is the
document primary key. The third argument is the text query, and the fourth is
the name of the result table.
interMedia will parse the query and determine which words in the document
match the query. It then creates rows in the result table, showing the character
offset and length of each matching word. Let's say that our document looks like
this for the highlight call above:
I have a black cat and a white dog.
"I" is character 1, per oracle convention, so "cat" is at character 16 and has
length 3. "dog" is at character 32 and has length 3. If you are using a multi-byte
character set, remember that these are character offsets and lengths, not byte
offsets and lengths. The hres table will look like this:
QUERY_ID OFFSET LENGTH
---------- ---------- ----------
0 16 3
0 32 3
A single row in the result table can cover multiple words in the source document
-- the example simply uses single words.
Highlight has query_id as the fourth argument, working just like it does in
the filter service. Offset and length in the result table can apply to either
the plain text or filter text version of the document. This is controlled by
the plaintext argument in the fifth position.
You can use the filter service output in conjunction with the highlight output
to display the matching words:
select dbms_lob.substr(fres.document, hres.length, hres.offset)
from fres, hres
If your documents are in a varchar or clob you can do something similar with the
base table instead of the filter output.
Markup
Markup takes the highlight service one step further, and produces a text version
of the document with the matching words marked up. The markup result table is
the same format as the filter result table:
create table mres (
query_id number,
document clob
)
Now call markup:
ctx_doc.markup('myindex', '123', 'cat | dog', 'mres');
And the result is:
I have a black <<<cat>>> and a white <<<dog>>>.
Like the other highlight services, markup has a query_id argument for result table
sharing, and a plaintext argument to specify markup of filter text or plain text.
You can also override the text used to markup the matching words. This is
broken into start and end tags. Start tag is the text which goes before the
matching word, end tag is the text which goes after the matching word. By default,
start tag is <<< and end tag is >>>. You can override this using the starttag
and endtag arguments:
ctx_doc.markup('myindex', '123', 'cat | dog', 'mres',
starttag => '-->', endtag => '<--');
I have a black -->cat<-- and a white -->dog<--.
The length limit on tags is 30 characters.
Highlight Navigation
Highlight navigation is a new feature for Oracle8i, and allows you
to markup your text so that an interpreting program can navigate between highlights.
Let's use as an example HTML output being sent to a browser. The start and end
tags could be overridden to display highlighted words in bold:
ctx_doc.markup('myindex', '123', 'cat | dog', 'mres',
starttag => '<b>', endtag => '</b>');
<b>cat</b> ... <b>dog</b>
Now you might want to have links next to each highlighted word which take you
to the previous or next highlight. You would first have to anchor and name each
highlight using the A tag:
<A HREF=h1><b>cat</b></a> ... <A HREF=h2><b>dog</b></a>
Note that the start tags for these two words are different -- cat has h1 and dog
has h2. You can do this using markup macros. Each highlighted word is numbered
in document occurrence order. If %CURNUM is in a tag, it is replaced with this
number. This lets you do the anchoring above using start and end tag definitions
like so:
ctx_doc.markup('myindex', '123', 'cat | dog', 'mres',
starttag => '<a name=h%CURNUM><b>',
endtag => '</b></a>');
But clickable links for highlight navigation have not been achieved. A links would
need to added before and after each highlight, like this:
<a href=h1><b>cat</b></a><a href=#h2>></a> ...
<a href=#h1><</a><a href=h2><b>dog</b></a>
The first problem is that start and end tag cannot be used for this. For highlight
navigation, the first highlight will not have a prev link, and the last highlight
will not have a next link. Start and end tags are added for every highlight, so
those can't be used.
Luckily, markup has prevtag and nexttag specifically for this. Prevtag is
added before starttag for every highlight except the first one. Nexttag is added
after endtag for every highlight except the last one.
The second problem is that it is necessary to reference the next highlight
number in nexttag, and the previous highlight number in prevtag. For this, two
other markup macros, %PREVNUM and %NEXTNUM are used
Putting it all together, highlight navigation is obtained like this:
ctx_doc.markup('myindex', '123', 'cat | dog', 'mres',
starttag => '<a name=h%CURNUM><b>',
endtag => '</b></a>',
prevtag => '<a href=#h%PREVNUM><</a>',
nexttag => '<a href=#h%NEXTNUM>></a>');
This gives highlighted words in bold, with < and > links next to each one to take
the reader to the previous or next highlighted word.
Tagset
The call is complex so three sets of common definitions for starttag, endtag,
prevtag, and nexttag have been defined. You can use these sets by using the
tagset argument of markup:
ctx_doc.markup('myindex', '123', 'cat | dog', 'mres',
tagset => 'HTML_NAVIGATE')
This will give you similar functionality to the call above. Here are the defined
tagsets:
TAGSET
-------------
TEXT_DEFAULT
starttag = <<<
endtag = >>>
HTML_DEFAULT
starttag = <b>
endtag = </b>
HTML_NAVIGATE
starttag = <A NAME=ctx%CURNUM><b>
endtag = </B></A>
prevtag = <A HREF=#ctx%PREVNUM><</A>
nexttag = <A HREF=#ctx%NEXTNUM>></A>
You cannot define your own tagsets in this version. The tagset default is TEXT_DEFAULT.
Any specified values for starttag, endtag, prevtag, and nexttag override the values
for the tagset, so something like this:
ctx_doc.markup('myindex', '123', 'cat | dog', 'mres',
tagset => 'TEXT_DEFAULT',
endtag => '<--')
would highlight like this:
<<<dog<--
That is, the endtag has been overridden, but the starttag definition from the
tagset is still valid.
Linguistic Extraction
Linguistic Extraction
With the INDEX_THEMES attribute of the basic lexer, and the ABOUT clause,
you can easily enable thematic indexing and searching for your document table.
The linguistic extraction routines in the document services enable you to generate
document themes or theme summaries, on-demand and per-document.
The big change from previous versions is the elimination of the serivces queue.
In ConText, linguistics was asynchronous -- you submitted a linguistic request,
then eventually the answer would appear in the result table. Vast improvements
in linguistic engine performance have allowed us to make these requests synchronous
in Oracle8i. Now you invoke the linguistic function, and when the function
returns, you know the data is in the result table.
Linguistic extraction works only on English documents.
Themes
Themes allow you to generate the themes of a document. First you need a result
table like this:
create table tres (
query_id number,
theme varchar2(80),
weight number
)
Then invoke the themes request:
ctx_doc.themes('myindex', '123', 'tres');
Like the highlighting calls, the first argument is the name of the index, the
second is the document primary key, and the third is the name of the result table.
interMedia will fetch the document text, filter it, run it through
the sectioner, then parse it for thematic content. It will generate up to 50
themes, and place the results in the result table, in the theme column, one
row per theme. The weight column is a number showing relative importance of
the theme. It is only comparable to other themes of the same document. You cannot
compare theme weights from different documents.
An optional fourth query_id argument lets you share result tables. Like highlighting,
themes will not pre-truncate or pre-delete the result table.
The optional fifth argument, full_themes, allows you to generate themes with
their full knowledge catalog hierarchy. The call above might generate a theme
of "cats", for instance. Setting full_themes on:
ctx_doc.themes('myindex', '123', 'tres', full_themes => TRUE);
would generate
:science and technology:hard sciences:life sciences:biology:zoology:
vertebrates:mammals:carnivores:felines:cats:
instead. If you are using full themes, the length of the theme column in your
result table should be increased to 2000 or so.
Gist
Gist allows you to generate a theme summary of the document text. This does
not generate an original abstract like a human would. Instead, it is composed
of selected paragraphs or sentences which contribute most to the themes of the
document. The gist result table looks like this:
create table gres (
query_id number,
pov varchar2(80),
gist clob
);
and the gist function looks like this:
ctx_doc.gist('myindex', '123', 'gres');
interMedia will fetch the document text, filter it, run it through
the sectioner, then parse it for thematic content. It will generate the themes,
then generate gists for those themes, one row per gist, with the gist column
consisting of those paragraphs which contributed to the theme. The theme is
contained in the pov (point-of-view) column. Additionally, it might generate
a GENERIC gist, which consists of those paragraphs which best represent the
overall themes of the document.
The optional fourth argument to gist is query_id, which allows you to share
result tables, like all other calls in document services.
The optional fifth argument, glevel, is a flag which allows you to specify
the gist level. If glevel is 'P', the default, the gists are composed of whole
paragraphs from the document text. This is called the paragraph-level gist.
If glevel is set to 'S', the gists are composed of individual sentences from
the document text. This is called the sentence-level gist.
The optional sixth argument, pov, allows you constrain output to just one
point-of-view. Usually, this is done to generate only the GENERIC gist. This
is case-sensitive, so specify GENERIC in uppercase. This is not a way to force
the gist to a particular point-of-view; you cannot specify pov's here which
would not normally get produced. For instance, if you have a document in which
"dogs" and "cats" are pov's, and you specify "frogs", you will get no output.
The optional seventh and eighth arguments, numparagraphs and maxpercent, allow
you to constrain the size of the generated gist. numparagraphs specifies the
maximum number of paragraphs (or sentences, if you are doing a sentence-level
gist) which can make up each generated gist. maxpercent is similar, but limits
it to a percentage of the source document. If both are specified, the smaller
is the effective limit.
The Extensible Knowledge Base
All theme based features in interMedia -- themes, ABOUT queries, gists,
ABOUT query highlighting, and hierarchical query feedback -- depend on an internal
knowledge base (KB) that is supplied with interMedia. The knowledge base
consists of a lexicon of English words and phrases organized in a hierarchy
of categories. For the first time in Oracle8i, you can extend and customize
this knowledge base by adding new terms or redefining existing ones.
Consider a simple example in the domain of coffee futures. Documents in this
domain may mention the term ICO (International Coffee Organization). interMedia
can identify the theme "ICO" in such documents, but it does not exist in the
supplied KB -- it is known that the documents are talking about the ICO, but
not what ICO is. ICO is not classified in the concept hierarchy, and is not
associated with terms such as "coffee" or "International Coffee Organization".
ICO would look like this in a full-themes listing:
:ICO:
if it showed up at all. Because it is an isolated term with no parents, it won't
be judged a strong theme.
Loading a Custom Thesaurus
What is needed is to do is tell interMedia what ICO is all about. The
first step is to create a simple thesaurus for the term:
international trading organizations
BT international trade and finance
International Coffee Organization
BT international trading organizations
RT coffee
ICO
USE International Coffee Organization
Now load the thesaurus using ctxload, just like any other thesaurus. Make sure
the thesaurus is loaded case-sensitive, since case matters for theme generation:
ctxload -user ctxsys/ctxsys -thes -thescase Y -name COFFEE -file
and run the KB compiler, ctxkbtc (a new executable in 8i):
ctxkbtc -user ctxsys/ctxsys -name COFFEE
This will generate a new, extended KB. ctxkbtc would interpret the thesaurus above
as the following commands:
add "international trading organizations" to the KB under
"international trade and finance"
add "International Coffee Organization" under
"international trading organizations"
make "coffee" a related term to "International Coffee Organization"
add "ICO" as an alternate form for "International Coffee Organization"
Note that "international trade and finance" is an existing category in the supplied
KB. All supplied categories are listed in the documentation. It is recommended
that new terms be linked to one of these categories for best results in theme
extraction. If new terms are kept completely disjoint from existing categories,
they might not be strong enough to qualify as themes.
There can be only one user extension per installation -- each run of ctxkbtc
affects all users. However, you can specify multiple thesauri (up to 16) in
the -name argument of ctxkbtc to add them all to the extended KB:
ctxkbtc -user ctxsys/ctxsys -name COFFEE TEA MILK
Because ctxkbtc affects the entire installation, we recommend that only interMedia
administrators should extend the knowledge base.
Using the Extended KB
Nothing special needs to be done to use the new KB, other than to close and
re-establish open sessions -- iMT will use the new KB automatically. If you
generate themes now, this time you will see:
:business and economics:commerce and trade:international trade and
finance:international trading organizations:International Coffee
Organization:
Note that ICO has been normalized to "International Coffee Organization", it now
has parents, and its weight is higher because it is a known term.
There are some size limits and other restrictions on thesaurus terms for extending
the KB. See the Oracle8i interMedia documentation for details.
|