Oracle Text 9.2.0 Technical Overview
Oracle Text 9.2.0
Technical Overview
This is a technical overview of the improvements found in Oracle Text version
9.2.0. This is intended for an audience fairly familiar with previous versions
of the product.
If you are new to the product, you should familiarize yourself with the
product before reading this document. The official documentation, especially
the Application Developer's Guide, is a good place to start, as are the
interMedia Text technical overviews for 8.1.5, 8.1.6, 8.1.7, and 9.0.1 in
order. (Someday I'll get around to creating an updated overall technical
overview which merges these together)
The goal of this overview is to supplement the documentation in three ways:
-
Collect all the new features -- and only the new features -- in one document
so that expert users can catch up just by reading this instead of reading the
whole doc and mentally diff'ing.
-
Sketch each new feature without covering every option, bell, and whistle. Give
the reader enough information to decide if the new feature merits further
investigation in the main doc.
- Include information, observations, and notes on each feature which may
not be suitable for documentation. For instance, opinions on the pros and cons
of feature X, the internal mechanisms of feature Y, speculation of how feature
Z could be useful, etc.
Table of Contents
Upgrading 9.0.1 to 9.2.0
Oracle Text is integrated with the Database Migration Assistant (DBMA),
so the Text upgrade should happen as part of the overall database upgrade
procedure. This section details the manual upgrade process, in case something
goes wrong or you choose to not use DBMA.
This process assumes that you are starting with a valid 9.0.1 ctxsys data
dictionary, and that the system is not being used for index, DML, or query.
Use SQL*Plus to run all scripts.
1. run ?/ctx/admin/s0902000.sql as SYS
this grants new, needed database privileges to ctxsys
2. run ?/ctx/admin/u0902000.sql as CTXSYS
this upgrades the ctxsys schema to 9.0.1
It is not necessary to run anything else. The u0902000.sql
script calls other scripts to upgrade the indextypes
(t0902000.sql), recompile all packages
(dr0pkh.sql and dr0plb.sql), and recreate the shared
library (dr0ulib.sql).
The script is not expected to encounter errors other than ORA-955 ("name is
already used by an existing object", for public synonyms). After the script
is complete, all ctx-created objects owned by ctxsys should be
VALID.
Oracle Text also uses the new DBMS_REGISTRY facility, using
the component ID CONTEXT. However, since we do not currently have
a validation procedure, a return status of "VALID" from
DBMS_REGISTRY.STATUS does not guarantee installation validity.
The version for DBMS_REGISTRY.VERSION is generated from
VER_DICT of CTX_VERSION. You might notice that this
says 9.2.0.1.0 -- in the past that would have been the version for the first
patchset, but this is the official and expected version for the production
release.
Index Objects
URITYPE Indexes
The URIType and its subclasses are now indexable. As an example:
create table tutype(url sys.httpuritype);
insert into tutype values (
sys.httpuritype.createuri('http://www.oracle.com')
);
create index tutypex on tutype(url) indextype is ctxsys.context;
select * from tutype where contains(url, 'oracle')>0;
URIType indexes use DEFAULT_FILTER_TEXT and DEFAULT_SECTION_HTML. So, with
the default defaults, the above index would be using the NULL_FILTER and the
HTML_SECTION_GROUP.
Behind the scenes, Text is using the GETBLOB method to fetch the contents
of the URI. If you run into problems creating the index, a good sanity check
is something like:
select dbms_lob.getlength(a.<uritypecolumn>.getblob())
from <basetable> a;
which checks if the rows are readable outside of Text.
INSO_FILTER Attribute TIMEOUT_TYPE
Version 9.0.1 added the TIMEOUT attribute to INSO_FILTER. This checked the
size of the output file every TIMEOUT seconds and killed the filter if
no progress was detected. If a file filtered very slowly, the inso
filter could run for a long time -- even exceeding TIMEOUT seconds, as long as
the output file was continually growing in size.
The new TIMEOUT_TYPE attribute allows specification of the timeout style as
HEURISTIC (old 9.0.1 behavior) or FIXED (kill after TIMEOUT seconds no matter
what). The default is HEURISTIC to maintain backward compatibility. While
using FIXED timeout is more predictable, it may cause some long-filtering
documents to be excluded from the index.
BASIC_LEXER Attribute INDEX_STEMS
By default, stemming is a query expansion operation. At query time, it takes
the input word, generates all possible forms of the word, does a query to get
token_info for each form, then merges the results. This can be inefficient.
To ensure correctness, the stemmer must generate all forms of the input term
-- even strange or really uncommon forms. Since every form results in a query
on $I, each form that does not appear in the index just wastes time. The forms
which do appear in the index incur token info merging overhead.
The new INDEX_STEMS attribute of BASIC_LEXER provides a better method --
index-time stemming. During indexing, each word is stemmed to its base form.
This base form is added to the index (as a type 9 token) along with the
original form. For example:
create table stest(text varchar2(80));
insert into stest values ('RAN');
insert into stest values ('RUNNING');
commit;
We'll use the English derivational stemmer in the wordlist and lexer, and
create the index:
exec ctx_ddl.create_preference('stemming_lexer','basic_lexer');
exec ctx_ddl.set_attribute('stemming_lexer','index_stems','derivational');
exec ctx_ddl.create_preference('deriv_wordlist','basic_wordlist');
exec ctx_ddl.set_attribute('deriv_wordlist','stemmer','derivational');
create index stestx on stest(text) indextype is ctxsys.context
parameters ('lexer stemming_lexer wordlist deriv_wordlist');
The $I shows the base form indexed as type 9, and the original forms as
type 0:
select token_text, token_type from dr$stestx$i;
RAN 0
RUN 9
RUNNING 0
A query for $word is then just a matter of reducing word to its base form and
returning the results for that one type 9 token. This makes stem query
faster at the cost of index time and space. To prove that it is not doing the
expansion here, we can remove the type 0 rows from the index:
delete from dr$stestx$i where token_type != 9;
select text from stest where contains(text, '$run')>0;
RAN
RUNNING
Under certain circumstances, the stem operator is forced to fall back to
expansion behavior. If the stem operator output is the input to other
expansion operators, such as ?($(word)) then the stemming must be done as an
expansion in order to get the input to fuzzy. Another case is when the
query-time stemmer does not match the index-time stemmer. For instance, if
in the example above we had used the default wordlist (which uses English
stemmer) then the stem query would be done through expansion for correctness.
CTX_QUERY.EXPLAIN is the best way to examine the query plan and determine if
stem form or expansion is being used.
USER_LEXER
The USER_LEXER lets you write and plug in your own lexer to handle languages
or other indexing techniques that Text does not support. You specify two
stored functions -- one for index,
one for query -- that take in text, lex into tokens, and pass back the tokens
in an XML-esque document.
For our example, we'll create a very simple Hawaiian lexer. It will
uppercase the input and remove single quotes (glottal stop or okina).
First, we write the index lexer procedure:
create or replace procedure ul_idx_proc(
text in varchar2,
tokens in out nocopy varchar2,
locneeded in boolean
) is
...
(see Appendix B for actual code samples). Like user
datastore procedures, user lexer procedures must be owned by ctxsys and the
indexing user must have EXECUTE permission. For simplicity, our example will
be done entirely in the ctxsys schema.
Our example is also written entirely in PL/SQL. You can write lexers using
stored Java, or an external language like C or Perl, using the extproc
mechanisms (see the PL/SQL User's Guide).
We've used VARCHAR2 for "text" and "tokens" for simplicity, but CLOB is also
possible here for real-world implementations. You can name the parameters
whatever you want -- for readability, I'll refer to them by the parameter
names I've given them in this example, instead of "parameter 1".
Given input "imua i'olani", our user lexer will output in "tokens":
<tokens>
<word>IMUA<word>
<word>IOLANI<word>
<tokens>
The Text Reference contains the DTD and restrictions for this output format.
It's XML-esque, in the sense that it is XML, but with additional restrictions.
Note that because it's based on XML, those tag names are case-sensitive.
"locneeded" is TRUE when "tokens" output needs to include location data -- the
character offset and length of each token. locneeded will be TRUE most of the
time -- only stopword lexing does not require location data.
<tokens>
<word off="0" len="4">IMUA<word>
<word off="5" len="7">IOLANI<word>
<tokens>
Providing location data when locneeded is FALSE, or omitting location data
when locneeded is TRUE results in a lexer error.
ul_idx_proc will be called:
- once for each document during indexing (location data needed)
- once for each stopword during index metadata loading
- once for each document services call which requires lexing
(location data needed)
Next, we create a similar query lexer procedure:
create or replace procedure ul_qry_proc(
text in varchar2,
wilds in ctx_ulexer.wildcard_tab,
tokens in out nocopy varchar2
) is
...
"wildcards" is an input table of numbers which tells you which characters of
the passed-in text are wildcard characters. The output format of tokens is
similar to that of the index procedure. ul_qry_proc will be called at
query time, once per query term.
Now we create the user lexer:
ctx_ddl.create_preference('hawaiian_lexer','user_lexer');
ctx_ddl.set_attribute('hawaiian_lexer','index_procedure','ul_idx_proc');
ctx_ddl.set_attribute('hawaiian_lexer','input_type','varchar2');
ctx_ddl.set_attribute('hawaiian_lexer','query_procedure','ul_qry_proc');
and use it in creating an index, like any other lexer:
create table htest(text varchar2(80));
insert into htest values ('imua i''olani');
create index htestx on htest(text) indextype is ctxsys.context
parameters ('lexer hawaiian_lexer');
select token_text from dr$htestx$i;
IMUA
IOLANI
This is an extremely simple example. Your user lexer can also do composite
words, detect sentence and paragraph boundaries, and do custom number parsing.
See the Text Reference for full details.
Index Maintenance
Create Index Online
Version 9.2.0 implements support for create index online, which allows DML on
the base table during an index creation:
create index myindex on mytable(mycolumn)
indextype is ctxsys.context
parameters ('datastore mydatastore')
online;
Alter index rebuild online for "replace" and "resume" is also supported.
These should improve usability for installations that cannot afford to delay
changes during a long-running index creation or full rebuild.
The DML that happens during indexing gets logged to the table
dr$online_pending. At the end of the index creation, these pending DMLs are
processed, removing deleted rowids from $R and $K, and moving new rowids to
dr$pending. These changes are therefore not actually added to the index until
the next sync.
Keep the following limitations in mind when using online:
- ONLINE is supported for CONTEXT indexes only
- Although you can do DML on the base table, you cannot query during
a create index online
- ONLINE cannot be used in conjunction with PARALLEL. In this version you
can build an index in parallel or online, but not both. We may add support
for parallel online in a future version.
- ONLINE is not supported for local indexes. You cannot create a local
index online, nor rebuild a partition online.
- There is a small window at the very beginning of a create index online
(during our creation of index meta-data) where a concurrent DML will fail
with an error message
- There are small windows at the beginning and end of a create index online
where an exclusive lock on the table is required. This means that the create
index cannot be started if another session has an uncommitted transaction on
the base table. It also means that at the end of indexing, the process will
block until all open transactions on the base table are committed or rolled
back (maybe that should be rollbacked?).
Parallel Create Index
The parallel create index in previous versions used dbms_jobs and could be
applied only when creating global indexes on partitioned tables. The parallel
support has been rewritten in this version, and now supports creating an index
on a non-partitioned table in parallel.
When you issue a create index like:
create index parx on parbase(text) parallel 4
under the covers, the base table scan is split up by a parallel query:
select /*+ PARALLEL(base 4) */ rowid, text
from parbase base;
The database's parallel query facility splits the base table into partitions,
spawns slave processes, and assigns a different partition to each slave. Each
slave text indexes the rows in its portition.
The method of slicing the base table into portitions is determined by the
database's parallel query facility and is not under your (or Text's) direct
control. So is the number of slave processes actually spawned, which depends
on machine capabilities, system load, init.ora settings, and other factors.
It may not match the degree of parallelism requested. See
Notes on Parallel Degree for more
information.
The parallel servers
are P servers -- on my Solaris workstation, they show up named
ora_p<slave#>_<SID> -- like ora_p001_TEXT.
If logging is on, you'll get a seperate logfile for each slave process, with
the slave number appended to the logfile name. For instance, a parallel 2
create index where logging is to file "LOG" will create "LOG_0" and "LOG_1"
files. Be careful though -- each parallel operation opens and closes these
files. So, if you do a parallel index create, followed by a parallel sync
later on, the logfiles from the create can get overwritten, because each sync
slave will re-open its assigned logfile, thereby erasing the existing contents.
It's probably a good idea to back up the logfile or change the logfile name
(through end_log then start_log) between parallel
operations.
Also now supported are:
- alter index ... rebuild partition ... parallel
- alter index ... rebuild ... parallel
- alter index ... rebuild parameters ('resume') ... parallel
- alter index ... rebuild parameters ('replace') ... parallel
Keep the following limitations in mind when using online:
- PARALLEL is supported for CONTEXT indexes only
- PARALLEL cannot be used in conjunction with ONLINE. In this version you
can build an index in parallel or online, but not both. We may add support
for parallel online in a future version.
- PARALLEL is not supported for local index creation. If you specify
parallel for a local index creation, the clause is ignored and the index
creation is done serially. It still has the effect of setting the default
parallel degree for queries, which may have a noticeable performance effect on
queries (see Parallel Query on Local Indexes).
To build a local index in parallel, use
DBMS_PCLXUTIL.
Notes on Parallel Degree
The new parallel operations -- parallel create index, alter index, sync index,
and optimize index -- use parallel query under the covers. For parallel
create index/alter index/sync, if you specify
parallel degree of N, and you get N servers started, only N/2 of
those servers are actually doing indexing. If you are logging, you will
only see N/2 logfiles. The other N/2 servers are doing the work of
partitioning the base table. So, if you want 4 servers actually indexing, you
should specify a parallel degree of 8.
Parallel optimize does not use the same method. You specify N, it starts N
slaves, and all N of those slaves are doing optimize. However, the workload
across those N may be unequal, and some parallel servers may not actually
have any work to do -- you may see <N logfiles.
The settings of various database parameters also affect the actual degree of
parallelism. With improper settings, you could specify parallel 8 and only
get 4 slaves. The important ones are:
- PARALLEL_ADAPTIVE_MULTI_USER: this should be false, since we are
explicitly setting the parallel degree through hints, under the covers
- PARALLEL_MAX_SERVERS: this bounds the number of P servers that
can be spawned. So, for create index, should be >= 2*N, where N = number
of servers you want to be indexing, and for optimize index, should be N.
- PARALLEL_THREADS_PER_CPU: PARALLEL_THREADS_PER_CPU * (# of CPU's)
should be >= specified parallel degree.
See the Database Performance Guide, Concepts Manual, or Data Warehousing Guide
for general information on how to set up your database for parallel query.
DBMS_PCLXUTIL Support
The UNUSABLE keyword is now supported for create index. This will create
the index meta-data and then immediately exit the indexing. With support for
rebuild partition in parallel, you can now use DBMS_PCLXUTIL to
get parallel local index creation.
For instance, to create a local index on ptest, doing 3 partitions at a time,
building each partition in parallel 2, you'd do:
create index ptestx on ptest(text)
indextype is ctxsys.context
local unusable;
exec dbms_pclxutil.build_part_index(3, 2, 'PTEST', 'PTESTX', TRUE);
Exchange Partition
alter table exchange partition quickly swaps data between a non-partitioned
table and a table partition -- it doesn't physically move the data, it simply
reassigns data segments. Using exchange partition, you can quickly coalesce
several independent tables into one large partitioned table. You can have
multiple site databases collect info, transportable tablespace them into a
master database, and use exchange partition to get a monolithic table for data
mining, for example. When exchange partition is used with the "including
indexes" clause, the same fast switching is supposed to happen with the
non-partitioned table's index and the partioned table's local index's partion.
In 9.0.1, an exchange partition including indexes for context indexes forced
a rebuild of both the index on the non-partitioned table and the index on the
table partition. This has been rewritten so that exchange partition now does
data switching, making it much faster.
As an example, let's examine what happens when you do:
create table npt(id number, text varchar2(80));
insert into npt values(1, 'cat');
create index npx on npt(text) indextype is ctxsys.context;
create table pt(id number, text varchar2(80))
partition by range(id)
(partition p1 values less than (10));
insert into pt values (1, 'dog');
create index px on pt(text) indextype is ctxsys.context local;
alter table pt exchange partition p1 with table npt
including indexes;
First, Text will create a temporary table for each internal table. The
$I temporary table, for instance, will be DRMNPX$I. These temporary tables
are partitioned, with one partition. Then, it issues exchange partition
commands to:
- swap non-partitioned $I with temporary $I
- swap partitioned $I with temporary $I
- swap temporary $I with non-partitioned $I
for each internal table, except for $K. $K cannot be partitioned, so instead
we physically move rows between the three $K tables. This data moving will
probably take the majority of the time for most exchange operations. Finally,
the temporary tables are dropped.
In order to use exchange partition including indexes, the internal tables
cannot be partitioned. If, using the storage preference, you have partitioned
any internal table, you'll get an error message if you try to do exchange
partition including indexes. Similarly, if the two indexes do not use the
same indexing objects, you will get an error message during exchange. For
these cases, you will have to use exchange partition EXCLUDING INDEXES and
then manually rebuild the non-partitioned index and the partitioned index's
index partition.
Parallel Sync
Sync of context indexes can now be run in parallel. Specify the
desired number of slaves in the new parameter parallel_degree in
ctx_ddl.sync_index. Like
create index in parallel, this
uses the parallel query mechanism under the covers -- the driving parallelized
query is a join of dr$pending and the base table. See
Notes on Parallel Degree for more
information.
Unlike serial sync, a snapshot too old error during a parallel sync will
cause a failure, so if possible minimize the possibility of such errors by
running parallel syncs during periods where DML activity on the base table
is low.
Parallel sync is not supported for any other indextype.
New Sync Locking Mechanism
9.0.1 sync locks each row in the base table during indexing. A long-running
sync operation can therefore inconveniently block users' access to base table
rows for long periods of time. The locking algorithm has been rewritten in
9.2.0 to lock these rows only at the end of indexing. This greatly reduces
the window in which user access is blocked. A side effect of this change is
that updates during sync processing may result in garbage in $I. The sync
process will index the old version of the document, notice the update at
lock time, and be forced to automatically invalidate the just-indexed document.
We do not believe that this side effect will be noticable or even encountered
at most installations.
See Appendix C if you are interested in an in-depth
technical overview of the new sync locking mechanism.
Optimize Logging
Optimization now writes periodic progress updates to the CTX_OUTPUT log. If
greater granularity is desired, the new event
CTX_OUTPUT.EVENT_OPT_PRINT_TOKEN
will print each token as it is being optimized.
Token Optimize Token Type
TOKEN mode optimization does a FULL optimize for a single token. In previous
versions this was limited to type 0 (normal text) tokens, so you could not
affect field section text tokens or zone section tokens, for instance. This
version adds a new token_type argument to
ctx_ddl.optimize_index which allows
specification of the token type to optimize. For instance, to optimize the
zone section HEADLINE:
ctx_ddl.optimize_index('myindex', 'TOKEN', token=>'HEADLINE', token_type=>2);
If you don't know the token type number, you can use the new
CTX_REPORT.TOKEN_TYPE function to
translate a mnemonic name to token type. For instance:
ctx_ddl.optimize_index('myindex', 'TOKEN', token=>'HEADLINE',
token_type=>ctx_report.token_type('myindex','ZONE SEC'));
or
ctx_ddl.optimize_index('myindex', 'TOKEN', token=>'HEADLINE',
token_type=>ctx_report.TOKEN_TYPE_ZONE_SEC);
See CTX_REPORT for more
information on valid inputs to CTX_REPORT.TOKEN_TYPE.
New Full Optimize Method
FULL optimize of context indexes has been rewritten, and has the following
two enhancements:
Will not rewrite already-optimized rows. 9.0.1 optimize deletes every row
in $I, removes garbage, defragments, then re-inserts them. This causes a lot
of excess rollback and redo activity when run on already-optimized indexes.
The new optimization method is designed to leave such rows alone.
Optimizes all $N rows. 9.0.1 optimize was limited to removing garbage
for maximum 16,000 docids, and would remove garbage only for those docids
that were marked in $N. So, if you did a time-limited full optimize, then
deleted 10,000 docids, the next time-limited full optimize would not see
those new deletes, and would not remove their information from $I. The new
optimization method reads the full contents of $N at the start of each
full optimize, and removes all information for those docids.
Marking and removal of rows from $N is still capped, so you will see $N count
decreasing 16,000 at a time -- under the covers, we're really optimizing
everything, so this should not be cause for concern.
Parallel Full Optimize
FULL optimize of context indexes can now be run in parallel. Specify the
desired number of slaves in the new parameter parallel_degree in
ctx_ddl.optimize_index. Like create index in parallel, this
uses the parallel query mechanism under the covers, so the actual degree of
parallelism achieved will depend on machine capabilities, system load, init.ora
parameters, etc. (see Notes on Parallel
Degree) The partitioning of the load is done by hash value on
token, which means that a token's info will never get split across slaves,
and that some slaves may be slower than others (if they have large tokens).
A parallel time-limited full optimize may result in certain $I rows being
optimized more than once. For instance, let's say you do a parallel 2 full
optimize, and, when time runs out, slave 1 is up to DOG, while slave 2 is
up to DOKUMENT. Because there is no guarantee that all rows between DOG and
DOKUMENT have been optimized, the saved state is the minimum, which is DOG.
The next full optimize will therefore re-optimize any rows between DOG and
DOKUMENT that slave 2 has already examined. However, with the
new FULL optimize method, this should have
a minimal impact if those rows are still optimal.
FAST optimize has not been parallelized, and parallel FULL optimize is not
supported for any other indextype.
MARK_FAILED
While we have not been able to reproduce such a situation in-house, customers
have reported that occasionally a create index or alter index fails and leaves
the context index in an unusable state. The database believes that the index
is in "LOADING" or "INPROGRESS" state and blocks any attempt to recover through
alter index. The only recourse is to drop and recreate the index.
For this situation 9.2.0 adds CTX_ADM.MARK_FAILED, a call to
force the index from LOADING to FAILED, thus unblocking alter index.
This is a last-resort failsafe and is not considered a fully public part of
our API. This procedure can be dangerous if run improperly. It should be run
only under direction from support or development, and only when no other
session is touching the index.
Query
Optimizer Statistics Tuning
The algorithms for estimating cost and selectivity of context queries have
been tweaked. This should result in optimal plans being generated more often
when using the CBO.
Query Template
Query templates are an alternative to the existing query languages. Instead
of passing a query string to contains or catsearch, you pass a structured
document which contains the query string in a tagged element. Other elements
in this document control aspects of query processing.
The simplest form of query template is:
<query>
<textquery>cat or dog</textquery>
</query>
You just use the document as the query string, so a complete query looks
something like:
select id from mydocuments
where contains(text,
'<query>
<textquery>cat or dog</textquery>
</query>')>0;
though in real life the query document will most likely be generated by the
application into a pl/sql or bind variable and passed to the contains. You
don't need to do anything to be able to use query templates -- Oracle Text
recognizes that the query string starts with "<query>" and automatically
shifts to query template parsing mode. If it doesn't see that opening tag,
the default query parser is used, behaving like previous versions.
You can control two things using query templates in this version: grammar and
score datatype. An attribute of textquery, grammar
can be set to either "context" or "ctxcat", and specifies the query language of
the textquery. This allows you to use the ctxcat language for queries on
context indexes, or the context language for queries on ctxcat indexes:
select id from mydocuments
where contains(text,
'<query>
<textquery grammar="ctxcat">cat | dog</textquery>
</query>')>0;
The datatype attribute of the empty score element
controls whether the score() function returns an integer, or a floating point
number:
select score(1), id from mydocuments
where contains(text,
'<query>
<textquery>cat or dog</textquery>
<score datatype="float"/>
</query>', 1)>0;
SCORE(1) ID
-------- ---------
4.403 103
7.217 101
We plan to add more features to query templates in future releases.
The query template language is XML-like, but not XML. Some differences:
- You don't need to use entities. This makes it easier to use some of our
operators, such as:
<query>
<textquery> foo & bar </textquery>
</query>
instead of
<query>
<textquery> foo & bar </textquery>
</query>
- tag names, attribute names, and allowed attribute values are
case-insensitive
- The parsing is stripped down for performance, so it's not robust.
Form and validity mistakes may not raise errors -- they may just be ignored.
For instance, if you add an unallowed space in the attribute=value pair:
<query>
<textquery grammar = "ctxcat"> cat frog </textquery>
</query>
you won't get an error -- but the query will be interpreted as a context
language query string, meaning "the phrase 'cat dog'" instead of "cat and dog".
You can even add invalid elements or attributes like "<foo>" without
affecting query processing.
Parallel Query on Local Indexes
Thanks to extensible indexing enhancements, you can now parallelize contains
queries on local partitioned context indexes. Parallel query is not supported
for non-partitioned context indexes, or non-context indexes (e.g. ctxcat).
Even for context indexes, the usage is a little tricky, and the performance
may not necessarily be faster for your application.
To examine parallel query, we'll use an "articles" table which is partitioned
by numeric "art_id" column, and which has a local partitioned context index
on "art_text", called "artx". A simple query like:
select art_id from articles
where contains(art_text, 'myquery')>0
Uses the query plan:
PARTITION RANGE ALL
TABLE ACCESS BY LOCAL INDEX ROWID ARTICLES
DOMAIN INDEX ARTX
This is a serial query -- it iterates over the index partitions, and does a
contains query on each one. Contains query on partion 2 is not started until
the contains query on partition 1 is complete.
In order to parallelize, you need to adjust the parallel degree on the index:
alter index artx parallel 4;
This alter index statement only changes a bit of data in the SYS dictionary
tables, so it takes a second or so to run. The parallel degree can also be
specified in the create index statement, but, since the parallel keyword is
ignroed for local indexes (see
Parallel Create Index), it is not seen
often.
Once the index parallel degree is changed, the query plan shows parallel
execution:
PARTITION RANGE ALL PARALLEL_COMBINED_WITH_PARENT
TABLE ACCESS BY LOCAL INDEX ROWID ARTICLES PARALLEL_TO_SERIAL
DOMAIN INDEX ARTX
This is, for the most part, the only way to get this query parallelized.
The table parallel degree can also trigger parallelism, but only in limited
circumstances (See Appendix D).
There is no hint you can use to force parallel query -- the
PARALLEL_INDEX hint does not work for domain index scans.
Similarly, the NOPARALLEL_INDEX hint does not work -- once you've
changed the index's parallel degree, ALL multi-partition queries run in
parallel. To get serial execution, you have to alter the index again,
with "parallel 1".
To process the contains query. the database will start up some number of
parallel
slaves -- it will try to start 4 slaves (the degree of parallelism on our
index) but the actual number started will depend on the number of slave
processes already running for other sessions, machine capabilities, init.ora
parameters, etc. On solaris, each slave is a database background process like
ora_p001_<SID>. The slaves are each assigned a few partitions. They iterate
over these partitions and do a contains query on each one while the parent
process merges the results.
Parallel queries will deliver the most benefit for queries involving several
partitions, that need all results (this includes score-sorted queries), and
that are run on multi-processor machines. Parallel execution may actually
slow down the query when run on a small machine, or when optimizing for
response time, or system throughput (See
Appendix D).
Reverse Docid Sorting
The natural order of contains queries (that is, the order of rows returned
from a query in the absence of an order by clause) is docid order. 9.2.0
queries reverse this, so that query results are now returned in reverse docid
order. This is on automatically for all contains queries in 9.2.0 -- you don't
need to do anything to turn it on.
In order to return hits in reverse docid order, each term node in the query
pre-decompresses the docid information from the $I row into an array, then
walks the array backward. In order to save memory, the array does not include
word position information, which is instead re-read from the $I data.
We have run numerous benchmark tests, and have not shown any noticeable
performance impact of this extra processing.
The intent is to provide rough reverse-date sorting. If documents are inserted
in chronological order, and sync'ed in chronological order, then reverse
docid order is reverse chronological order. In practice, however, docid order
can get out of sync with chronological order, so don't rely on reverse docid if
you need accurate order.
Some of the ways docid order can get out of sync with chronological order:
- When the index is initially created, the main scan on the base table is
done without an order by. If this scan does not return the documents in
chronological order, the docids will not be assigned in order.
- Whenever a row is updated and sync'ed, it gets a new, higher docid.
This means that updating an old row will pop it to the front of reverse
docid-ordered hitlists. This also applies to DML which occurs during a
create index online.
- During sync, rows are processed in rowid order -- if rowid order does
not mirror chronology, docid order will also be off
- Docid order is only meaningful per-partition of a local partitioned index.
If the table is not partitioned chronologically, docids could mirror chronology
for a particular partition's results, but not for the table overall.
- Docids assigned during a parallel index create or parallel sync are not
likely to be in chronologic order. Docids are assigned to each slave in
blocks, and there is no control over how the database partitions the workloard
to each slave. Rows that are close chronologically can be assigned to
different slaves and end up with docids that are far apart. Furthermore,
within each slave there is no control over the order in which documents are
indexed and assigned docids.
- If a base table row is locked during a sync, that row is invalidated and
instead gets indexed during the next sync. It will get a higher docid than
any other row in its sync batch.
Miscellaneous Other Features
CTXCAT Theme Indexing
Theme indexing for CTXCAT is now supported. The indexed themes can be used
when ABOUT queries are issued through catsearch, using the new
query template facility. Since ctxcat documents
are not usually very long, you should set PROVE_THEMES to FALSE for the lexer
preference.
CTXXPATH Indextype
This version includes a new indextype -- ctxxpath. Although
this indextype is created in Text install, and uses Text code, it
really doesn't have anything to do with Text searching. A ctxxpath
index, which can be built only on sys.xmltype columns, is used to
speed up certain queries using the existsnode method (See the
Application Developer's Guide - XDB).
Classification Training
There are three steps to setting up a basic document classification
application:
- Group related sample documents together
- For each group, write rules that explain why the documents belong in the
group
- Using the rule set, classify incoming documents into appropriate groups
Version 9.0.1 introduced the ctxrule index type. This automated step 3 of
the process, but it was up to the user to manually write the routing rules.
The new CTX_CLS.TRAIN automates step 2; it statistically
analyzes document groups and automatically generates ctxrule-compatible rules.
To use training, first gather a sample set of documents -- you can even use
the real document table, if you like. The document table MUST have a numeric
column with all positive, integer key values less than 4 billion. It doesn't
have to be the primary key, but it should be unique and allow fast access by
id value. We'll call this column the doc_id column, though you can name it
whatever you want.
Next, create a context index on the table. You can use whatever preferences
you want, but TRAIN will generate rules based on the tokens
produced, so for best results the lexer preference should match the lexer
preference you intend to use for the ctxrule index:
create table docs (doc_id number primary key, text clob);
<<populate the table>>
create index docsx on docs(text) indextype is ctxsys.context;
TRAIN will
consider themes for its generated rules if they are present in the index, but
theme indexing is not required. TRAIN will not consider any
token other than type 0 and 1. So it cannot discover classification based on
subject codes in a non-visible field section, for instance. This may be
added in future releases.
Now create a table to hold the document groups. It's a many-to-many
normalization table, so it needs one column to hold a document id, and one
column to hold a category id. You can name the table and the columns whatever
you like, and have extra columns if you want. However, the doc_id and cat_id
columns both have to be numeric, limited to positive integer values less than
4 billion. It's assumed that you have a category table out there somewhere,
but it is not actually used during training, so technically you don't need one.
Training will perform best if you have a normal b-tree index on the category
id column (or a composite key index where category id is the first column).
create table cat_doc(
cd_cat_id number,
cd_doc_id number,
primary key (cd_cat_id, cd_doc_id)
) organization index;
<<organize your sample documents into groups>>
<<populate cat_doc table>>
Final setup step: create a result table. Recall that the output of
TRAIN is a set of ctxrule-compatible rules. So the result table
must have three columns: a numeric column to hold the cat_id, a varchar2(4000)
column to hold the rule text, and a numeric column to hold the confidence level
-- a score telling you how well the produced rule approximates the grouping:
create table rules(
rule_cat_id number,
rule_text varchar2(4000),
rule_confidence number
);
Now we run TRAIN, passing it the names of the tables and columns
we've set up:
begin
ctx_cls.train(
index_name => 'docsx',
docid => 'doc_id',
cattab => 'doc_cat',
catdocid => 'cd_doc_id',
catid => 'cd_cat_id',
restab => 'rules',
rescatid => 'rule_cat_id',
resquery => 'rule_text',
resconfid => 'rule_confidence'
);
end;
/
If logging is on, progress will be shown in the logfile. When it completes,
the rule table will be filled with rules. As a last step, you can create a
ctxrule index on the result table.
It may not be possible to generate rules in some situations; you won't always
get rows in the rules table. Also, this is a statistical process, so the rules
only approximate your grouping -- if you classified the sample documents using
the produced ruleset, you may not get exactly your doc_cat table back.
See Appendix E for a closer look at the
training process, including coverage of RULE_CLASSIFIER
preferences.
CTX_REPORT Package
The CTX_REPORT functions provide information about text
indexes for
diagnosis and maintenance. The procedures analyze the index, generate a
report, then write the report to a CLOB. For convenience, most have a
function form which allocates a call-duration temp clob and returns it,
so you can do something like:
set long 64000
set pages 0
set heading off
spool myindex_report
select ctx_report.describe_index('myindex') from dual;
spool off;
You can see sample output for some ctx_report functions in
Appendix F, if you want to get an idea of
what's included in each report. Now let's take a look function by function:
DESCRIBE_INDEX is a human-readable description of index meta-
data, including the name of the base table, the objects used for each class,
and the values of their attributes. It's useful when you don't know, don't
remember, or wish to verify an index's settings.
(example output here)
DESCRIBE_POLICY is like DESCRIBE_INDEX, but for
policies.
CREATE_INDEX_SCRIPT generates a sql script which includes
preference creation PL/SQL blocks and a create index statement to re-create
a given index. It's great when you need a testcase, but be aware that it
won't include external requirements like a user datastore procedure.
(example output here)
CREATE_POLICY_SCRIPT is like CREATE_INDEX_SCRIPT,
but for policies.
INDEX_SIZE lists the internal index tables and, using queries on
sys data dictionary tables and DBMS_SPACE, shows their allocated
and used sizes.
(example output here)
INDEX_STATS analyzes the $I table and calculates statistics like
the number of unique tokens, the 10 most frequent tokens, average size of
token_info per token, etc. It can also calculate statistics like estimated
row fragmentation, garbage size, and most fragmented tokens.
(example output here)
INDEX_STATS is heavy-duty. In order to calculate the stats,
it needs to scan the entire $I table, which can take a long time for large
indexes. You can track its progress in the ctx_output logfile,
if logging is on. Because it's expected to run for long periods of time,
it's not suitable for calling in a select list, and there is no function
variant.
It also needs to compile a lot of information, for which it uses a session
temporary table. Insertions to the temporary table are committed periodically,
so a single call to INDEX_STATS ends the current transaction
implicitly. This also means that you cannot pass a lob locator, because
lob locators cannot span transactions. Instead, pass in NULL and let
ctx_report allocate a session-duration temp clob. Something like:
declare
x clob := null;
begin
ctx_report.index_stats('tdrbprx21',x);
insert into output values (x);
commit;
dbms_lob.freetemporary(x);
end;
/
select * from output;
TOKEN_INFO decodes a token's info in $I. The report shows which
documents contain the token, the word offsets where the token occurs in each
document, and statistical information like average number of occurrences per
document. It's good for verifying the index contents during diagnosis of
problems.
(example output here)
TOKEN_TYPE translates a mnemonic label to a numeric token type.
It has two arguments: an index name, and the mnemonic label. The index name
must be provided in order to lookup meta data for field section types.
Valid input for the mnemonic are:
| |
Input |
Meaning |
Type Returned |
| |
THEME |
theme token |
1 |
| |
ZONE SEC |
zone section |
2 |
| |
ATTR TEXT |
text that occurs in an attribute |
4 |
| |
ATTR SEC |
attribute section |
5 |
| |
PREFIX |
prefix token |
6 |
| |
PATH SEC |
path section |
7 |
| |
PATH ATTR |
path attribute section |
8 |
| |
STEM |
stem form token |
9 |
| |
FIELD <name> TEXT |
text token in field section <name> |
16-79 |
| |
FIELD <name> PREFIX |
prefix token in field section <name> |
616-679 |
| |
FIELD <name> STEM |
stem token in field section <name> |
916-979 |
for example:
typenum := ctx_report.token_type('myindex', 'field author text');
The FIELD* types will fetch the index meta-data from the dr$ tables, so if
you're going to be calling these often, you might want to cache the values
to avoid repeated lookups.
If you are just interested in one of the static types, it does not need the
index meta-data, and you can pass NULL a the index name, if you like.
Actually, you're probably better off using one of the package constants:
TOKEN_TYPE_TEXT constant number := 0;
TOKEN_TYPE_THEME constant number := 1;
TOKEN_TYPE_ZONE_SEC constant number := 2;
TOKEN_TYPE_ATTR_TEXT constant number := 4;
TOKEN_TYPE_ATTR_SEC constant number := 5;
TOKEN_TYPE_PREFIX constant number := 6;
TOKEN_TYPE_PATH_SEC constant number := 7;
TOKEN_TYPE_PATH_ATTR constant number := 8;
TOKEN_TYPE_STEM constant number := 9;
Policies
This version re-introduces policies. A policy is a synthetic object like a
preference. You can think of it as a context index minus the index data
-- that is, it's just a collection of index objects and attribute settings.
You create a policy with ctx_ddl.create_policy, specifying
preferences for each class, just like the parameters clause of create index:
begin
ctx_ddl.create_policy('german_filtered',
filter => 'ctxsys.inso_filter',
section_group => 'ctxsys.html_section_group',
lexer => 'my_german_lexer',
stoplist => 'd_stoplist',
wordlist => 'd_wordlist'
);
end;
Since there is no base table, there is no need for a datastore
preference. Other than this, it works like create index.
The defaulting system works for policy creation. For instance, if you specify
an INSO_FILTER for filter, and NULL for section_group
then the HTML section group will kick in automatically.
Preference values are deep-copied, so you can modify or even drop the
preferences after create_policy without affecting the policy.
You can update an existing policy using ctx_ddl.update_policy and
drop a policy using ctx_ddl.drop_policy.
Policies use the same data dictionary tables as indexes, so you can view
existing policies using the views CTX_USER_INDEXES,
CTX_USER_INDEX_OBJECTS, and CTX_USER_INDEX_VALUES.
Or, preferably, try out the new function
CTX_REPORT.DESCRIBE_POLICY. Either
way, You'll see that under
the covers, policies are indexes created on the table
dr$policy_tab. On a side note, this table has a language column
so that you can use multi-lexer preferences.
So what are policies for? Not much in this version. They are used in
the XPath function ora:contains, which can be used in the
sys.xmltype methods (see the Application Developer's Guide - XDB).
In the future, we intend to use policies for things like direct document
services and true functional invocation. Watch this space.
JDeveloper Text Wizard
The Text Wizard is an add-in for JDeveloper which provides a menu-driven
interface for generating simple Text applications. You fire it up, answer
some prompts, and it generates SQL scripts for creating the preferences and
index, scripts for sync, and JSP's for web query.
The Text Wizard will soon be available on OTN.
|