Oracle Text 9.0.1 Technical Overview
Oracle Text 9.0.1
Technical Overview
Garrett Kaminaga, Product Development, Database and Application Server Technologies, Oracle HQ
This is a technical overview of the improvements found in Oracle Text (the
product formerly known as interMedia Text) version 9.0.1. This is intended
for an audience fairly familiar with previous versions of the product.
If you are new to the product, you should familiarize yourself with the
product before reading this document. The official documentation, especially
the new Application Developer's Guide, is a good place to start, as are the
interMedia Text technical overviews for 8.1.5, 8.1.6, and 8.1.7, in order.
This document is only a summary of new features and internal changes. As
always, please refer to the official documentation for detailed information.
Table of Contents
Overview of 9.0.1 Improvements
Three big new features stand out in the set of 9.0.1 improvements: local
partitioning, XML path searching, and document classification. With local
partitioning, large text indexes are easier to manage and faster to create,
maintain, and query. XML path searching enables sophisticated queries which
can reference and leverage the embedded structure of XML documents. The new
CTXRULE indextype is the inverse of the ConText indextype -- instead of
using a text query to find documents, you use a document to find queries.
The rest of the improvements are smaller changes, but were driven by user
feedback, and make using the product a little easier, a little faster, or a
little more robust. These include new, more sophisticated lexers for Japanese
and Korean, an INSO filter timeout, per-row logging during indexing to find
faulty documents, transportable tablespace support, and an on-demand filter.
Upgrading 8.1.7 to 9.0.1
Manual Upgrade From 8.1.7
Oracle Text is now integrated with the Database Migration Assistant (DBMA),
so the Text upgrade should happen as part of the overall database upgrade
procedure. This section details the manual upgrade process, in case something
goes wrong or you choose to not use DBMA.
This process assumes that you are starting with a valid 8.1.7 ctxsys data
dictionary, and that the system is not being used for index, DML, or query.
Use SQL*Plus to run all scripts.
1. run ?/ctx/admin/s0900010.sql as SYS
this grants new, needed database privileges to ctxsys
2. run ?/ctx/admin/u0900010.sql as CTXSYS
this upgrades the ctxsys schema to 9.0.1
It is not necessary to run anything else. The u0900010.sql
script calls
other scripts to upgrade the indextypes (t0900010.sql),
recompile all packages
(dr0pkh.sql and dr0plb.sql), and recreate the shared
library (dr0ulib.sql).
The script is not expected to encounter errors other than ORA-955 ("name is
already used by an existing object", for public synonyms). After the script
is complete, all ctx-created objects owned by ctxsys should be
VALID.
For iMT veterans who may be wondering, the path is not a typo. The upgrade
scripts, which used to be in their own sub-directory, are now found in the
?/ctx/admin directory.
SQL*Net Listener Setup
interMedia Text 8.1.7 shifted all code to trusted callouts, eliminating the
need for SQL*Net listener configuration (and there was great rejoicing).
Sorry, extproc is back again. But just for one new document service
function -- ctx_doc.ifilter, the on-demand INSO filtering call
(this is covered below). If you don't use this
function, you don't need to
set up the listener.
A brief description of setting up the listener is below; complete details
are in the Oracle 9i Server Administrator's Guide.
1. Add an entry to the tnsnames.ora:
extproc_connection_data =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = ipc)
(KEY = DBSID))
(CONNECT_DATA = (SID = ep_agt1)))
DBSID is the database SID. ep_agt1 can be named anything.
extproc_connection_data should not be changed.
2. Add the following to the listener SID_LIST:
(SID_DESC = (SID_NAME = ep_agt1)
(ORACLE_HOME = /oracle)
(ENVS = LD_LIBRARY_PATH=/oracle/ctx/lib)
(PROGRAM = extproc))
ep_agt1 matches the CONNECT_DATA SID for
extproc_connection_data
in the tnsnames.ora. The PROGRAM section tells
the Net8 listener to
start the external procedure process. The ENVS section,
which
is shown here for UNIX, will ensure that the environment includes
?/ctx/lib in LD_LIBRARY_PATH. This is needed
so that indexing can
use the INSO filters.
3. Since the extproc_connection_data ADDRESS section specifies
ipc,
make sure that the ADDRESS_LIST of listener.ora
accepts ipc connections.
A quick way to test the Net8 configuration is to do:
exec ctx_adm.test_extproc;
from SQL*Plus, as ctxsys. If you get the error:
DRG-50704: Net8 listener is not running or cannot start external procedures
then things aren't set up correctly. OTN and Oracle Support have helpful
debugging guides for extproc setup, compiled problem-by-problem, and paid for
in past headaches.
Upgrading Existing Indexes
Nothing needs to be done to upgrade existing text indexes. These should
still work without any problems in 9.0.1. However, we made ctxcat DML a bit
more efficient, so you might choose to re-constitute the triggers for ctxcat
indexes. All you have to do is rename the index:
alter index <indexname> rename to <newname>
You can, of course, rename it back to the original name if you choose.
Local Partitioned Indexes
Introduction
Creating Local Partitioned Indexes
Local Partitioned Indexes
A local partitioned index is a partitioned index on a partitioned table
with a one-to-one mapping of index partitions to table partitions. It's like
splitting up a large table and index into a set of smaller tables and indexes.
Local partitioned indexes have been around since partitioning support was
first added to the database. However, this support was limited to b-tree and
bitmap indexes -- domain indexes (including ConText indexes) could only be
global indexes. Kernel's extensible indexing framework has been enhanced in
9i, allowing local partitioned text indexes for the first time.
Why Local Partitioned Indexes?
We won't discuss why table partitioning is good for large data sets --
that's covered pretty well in the general database Concepts manual. Instead,
we'll list some reasons why a local partitioned text index is better than a
global index:
- Easier Table Maintenance
With a global index, partition-level DDL invalidates the entire index. You
could truncate one partition, and spend days rebuilding the index, even
though the data in the other partitions hasn't changed. A local
partitioned index truncates the corresponding index partition and the rest
of the index doesn't even notice. This makes managing large sets of data
much easier. This is discussed further below in the section
"Partitioned Table Maintenance"
- Parallel Sync
Oracle Text allows only one session to sync a given index at a time. Users
with large datasets have found this too limiting -- the system can't keep
up with huge amounts of incoming data. With local partitioning, the index
partitions are independent, and can be sync'ed in parallel. We discuss
this further in the section
"Local Partitioned Index Maintenance".
- Query Performance
A local partitioned ConText index can take advantage of partition pruning
and partition iteration for faster mixed query and sorted query. In
contrast, a global index wastes time producing rowids in other partitions
which are simply filtered out by the query plan. Query is examined in the
section "Query Considerations".
Creating a Local Partitioned Index
Let's use the following partitioned table as an example:
create table ptest (id number, text varchar2(80))
partition by range (id)
(partition p1 values less than (100),
partition p2 values less than (200),
partition p3 values less than (300));
To create a local partitioned index, simply specify the LOCAL
keyword:
create index ptestx on ptest(text)
indextype is ctxsys.context
local;
The local keyword must be after the indextype clause.
You can stick a parameters clause before or after the
local keyword:
create index ptestx on ptest(text)
indextype is ctxsys.context
parameters ('filter ctxsys.inso_filter')
local
create index ptestx on ptest(text)
indextype is ctxsys.context
local
parameters ('filter ctxsys.inso_filter')
are both OK. By default, the index partitions will have the same name as the
corresponding table partitions, so in our examples above the index partitions
are named p1, p2, and p3. You can
override this by specifying individual
partition clauses:
create index ptestx on ptest(text)
indextype is ctxsys.context
local
(partition larry, partition moe, partition curly)
parameters ('filter ctxsys.inso_filter')
The partition clauses must come immediately after the
local keyword. The
partition clauses correspond to the table partitions in order.
For instance,
our first table partition is p1, so, in our example above, its
index partition
will be named larry, because that's the first partition clause in
the set.
Because it's done solely by order, this means that you can't just name a few
partitions here -- you must specify a partition clause for each table clause.
If your table has 100 partitions, you must put 100 partition clauses here.
Each partition clause can optionally have a parameters clause:
create index ptestx on ptest(text)
indextype is ctxsys.context
local
(partition larry,
partition moe parameters ('memory 5m'),
partition curly)
parameters ('filter ctxsys.inso_filter')
These partition-level parameters clauses can be used to specify index memory
and storage attributes on a per-partition basis. We'll discuss the difference
between the index-level and partition-level parameters clauses below.
Limitations
There are a few limitations to be aware of when creating local partitioned
ConText indexes:
- Base table must be
RANGE partitioned
Other partitioning methods, such as HASH,
COMPOSITE, or LIST, are not
supported for local indexes. You can still create a global index on
tables partitioned by these methods, though.
- ConText indextype only
In this version, only the ConText indextype supports local partitioning.
You will just get an error message if you try to create a local
CTXCAT or CTXRULE index.
- Index name must be 21 bytes or less
Due to the table naming scheme used for local partitioned indexes (more
on this below), the names of local partitioned indexes cannot exceed 21
bytes in length.
- Maximum number of partitions is 9999
Again due to the table naming scheme, the index (and therefore the table)
cannot have more than 9999 partitions.
Local Partitioned Indexing Process
Now let's take a closer look at what happens under the covers when a local
partitioned index is created:
- Create the index global meta-data
- For each partition in the table, do the following:
- create the index partition meta-data
- create the index partition internal tables
- populate the index data
- create the $X b-tree
First, the index global meta-data is created. Like a regular ConText index,
it parses the index-level parameters clause, determines the preferences to use
for all classes, then deep-copies the preference values to index meta-data.
In this version, the index meta-data is shared by all index partitions. This
means that you cannot, for instance, specify a Japanese lexer for partition 1
and a Basic lexer for partition 2. The documents in the table are assumed to
be partitioned by contiguous numeric qualities like date or id, rather than
by document type. The existing heterogeneous data support (charset, format,
and language column) can be used to work around this limitation.
Since all the index partitions share common meta-data, the partition
parameters clause does not allow anything except MEMORY and
STORAGE -- use
the index-level parameters clause to specify everything else.
Once the index meta-data is created, we iterate over the table partitions.
(Parallel index creation is covered below)
For each table partition, an index
partition is created. Creating an index partition is not that different from
creating a regular ConText index. There are four stages: meta-data, internal
table creation, population, and b-tree index.
The partition meta-data is usually just a single row in
dr$index_partition,
visible through the view CTX_USER_INDEX_PARTITIONS, which has the
columns:
-
IXP_ID
-
This is a numeric ID for index partition. It is Text-assigned, so does not
correspond to the index partition object id. The value ranges from 1 to
9999, and is unique within an index. The values are designed to be
contiguous, so as index partitions are dropped and added, values will get
re-used. Because most partition operations take in an index partition
name, you generally won't care about the
ixp_id with one
exception --
the ixp_id makes up part of the internal tablenames.
-
IXP_INDEX_OWNER
IXP_INDEX_NAME
-
The owner and name of the index to which this index partition belongs.
-
IXP_INDEX_PARTITION_NAME
-
The index partition name. This is the same as the partition name in
user_ind_partitions. Use this name to identify the index
partition in
all index partition operations.
-
IXP_TABLE_OWNER
IXP_TABLE_NAME
IXP_TABLE_PARTITION_NAME
-
The base table partition for this index.
-
IXP_DOCID_COUNT
IXP_STATUS
-
Like
idx_docid_count and idx_status for normal
text indexes, only now these apply
on a partition level, so the values for these columns in
CTX_USER_INDEXES
can be ignored.
The second step in creating the index partition is internal table creation.
Each index partition gets its own set of DR$ tables. The naming scheme is
somewhat different --
DR#<indexname><ixp_id><suffix>. For instance,
index
partition p1 from our example above will have a $I table named
DR#PTESTX0001$I.
The # instead of $ avoids collisions with global ConText index tablenames.
The addition of ixp_id is the reason why local partitioned index
names are
limited to 21 bytes instead of the 25 bytes for global indexes, and the four-
digit decimal representation of the index partition id imposes the limit of
9999 index partitions.
The tablespace and storage for these internal tables are determined mainly
by the storage preference specified in the index-level parameters clause.
However, you can override this by using the partition parameters clause.
Modifying our example above:
create index ptestx on ptest(text)
indextype is ctxsys.context
local
(partition larry,
partition moe parameters ('storage p2_storage'),
partition curly)
parameters ('storage global_storage')
The internal tables for larry and curly will be
created using the global_storage preference, while
moe will be created using p2_storage.
Specifying a per-partition storage preference uses a little more index meta-
data (stored in table dr$index_values).
The third step in creating the index partition is population. This is the
real work -- reading the documents, inverting them, and dumping the index
data to the $I table. This is the same as a global index population, except
that the datastore will use a partition-extended tablename; we are indexing
only a particular partition of the base table here.
Index memory size determination works like storage preference determination.
First, we check if MEMORY was specified in the partition-level
parameters
clause. If not, we then check MEMORY in the index-level
parameters clause.
If not there, then we take system parameter DEFAULT_INDEX_MEMORY.
The fourth and final stage of index partition creation is the creation
of the $X index on the $I table. The tablespace and storage for this index
comes from the same storage preference used for internal table creation.
After these four stages are complete, it goes to the next table partition
and starts all over. When all table partitions have been indexed, the index
creation completes, and you can do contains queries.
Parallel Local Partitioned Index Creation
Normally the index partitions are created serially -- one at a time. More
powerful systems may be able to create index faster by creating the partitions
in parallel (inter-partition parallelism). This is not as simple as adding a
PARALLEL clause to the create index statement. Principally this
is because
we currently ignore it for local partitioned text indexes. However, even if
it were supported, it wouldn't create the partitions in parallel. With local
b-tree indexes, specifying a PARALLEL degree still creates the
index partitions
serially, though each partition is parallelized (intra-partition parallelism).
The strategy for creating a local partitioned index with inter-partition
parallelism is:
- create an empty index
- rebuild the index partitions in parallel
Normally, step 1 is accomplished using the UNUSABLE keyword on
create index.
Unfortunately, this is not supported for domain indexes -- you'll just get
syntax errors. Instead, you can use the Text way of creating an empty index --
the NOPOPULATE keyword:
create index ptestx on ptest(text)
indextype is ctxsys.context
local
parameters ('NOPOPULATE');
This will create the index meta-data, then create partition meta-data, tables,
and $X index for each partition. The partitions are processed serially, but
since we're not doing index population, this should take only a few seconds
per partition.
For step 2, rebuilding the index partitions in parallel, the database has
a nifty utility -- DBMS_PCLXUTIL -- which uses
DBMS_JOB to kick off multiple
sessions which rebuild the index partitions. Unfortunately, this does not work
with domain indexes, because the jobs that it submits include the
PARALLEL
clause, which is not supported for domain indexes, and raises a syntax error.
Instead, you have to manually submit the rebuild jobs. The easiest way
to do this is to simply open multiple SQL*Plus sessions and, in each one,
run scripts that have statements like:
alter index ptestx rebuild partition p1;
Issuing alter index rebuild partition like this will drop the internal
tables, re-create them, do the indexing, then re-create the $X index. So,
there's a little extra work involved because we've created the tables already
during create index, but it's not that wasteful. You can specify an index
memory size like this:
alter index ptestx rebuild partition p1 parameters ('replace memory 50M');
We are working with extensible framework development to simplify parallel
local domain index creation for 9i Release 2.
Local Partitioned Index
Query Considerations
Query Syntax
There is no difference between local partitioned text index queries and
global text index queries -- the syntax is the same. However, local
partitioning can change the performance characteristics of certain types of
queries. We'll examine this in this section using the following test table:
create table ptest(a number not null, b varchar2(128))
partition by range(a)
(
partition p1 values less than (10000),
partition p2 values less than (20000),
partition p3 values less than (30000),
partition p4 values less than (40000),
partition p5 values less than (50000),
partition p6 values less than (60000),
partition p7 values less than (70000),
partition p8 values less than (80000),
partition p9 values less than (90000),
partition p10 values less than (100000)
);
populated with synthetic data. There are 100,000 rows. Every row has the
token EVRY. TM07 occurs every tenth row, between 1
and 7 times within the
row (to give a score range). HSPC occurs every 100 rows, and
KSPC occurs
every 1000 rows.
We will be comparing two runs -- one with a global ConText index on this
table, and one with a local partitioned index. The timing numbers given
are for comparison against each other only, and do not represent tuned,
publishable benchmark numbers.
SCORE
The contains score is calculated from several statistics, including number of
documents containing the query token, and number of documents in the corpus.
In a local partitioned index, each partition is independent, so these numbers
are partition-scope. That is, it uses the number of documents in the partition
that contain the token, and the number of documents in the table partition. As
a result, the SCORE values for any given row that you get from a
query on a
local partitioned index may be different than if the index were a global index.
However, we do not believe that the partition-scope of the score will have
noticeable adverse effects for most applications.
Simple Text Queries
Each partition of a local partitioned ConText index is an almost-independent
index. If many index partitions have to be scanned to fulfill a query, local
index query time can be much slower than global index query time on the same
data, because essentially you are scanning many indexes.
A plain text query like this:
select * from ptest where contains(b, 'HSPC')>0;
with a global index shows a single domain index scan:
1.1 DOMAIN INDEX PTESTX
The local partitioned index, on the other hand, includes a partition iterator:
1.1 PARTITION RANGE ALL
2.1 DOMAIN INDEX PTESTX
This shows that the query plan is looping over the index partitions. For each
partition, it is doing a separate domain index scan.
Each domain index scan incurs some fixed amount of overhead, plus a query
time which is proportional to the amount of indexed data and number of hits.
The partitions of the local index have much less data than the global index, so
that portion of the query time is lessened. However, the overhead is much more
because the same fixed overhead is incurred once per partition. This slowdown
increases as the number of partitions increases.
The following table illustrates this effect. We ran successive tests with
the same data at different levels of partitioning -- for instance, the first
test partitioned the 100,000 rows into 10 partitions of 10,000 rows, the second
test partitioned the 100,000 rows into 50 partitions of 2,000 rows, etc. up to
1000 partitions. We created a local ConText index each time, and ran the query
above, for HSPC. HSPC occurs every 100th row, so
there's 1000 hits overall,
and a minimum of 1 hit per partition. Times given below are in minutes:seconds
and are to select all 1000 rows
|
# of partitions in ptest |
|
10 |
50 |
100 |
250 |
500 |
1000 |
| global index |
1.14 |
1.11 |
1.14 |
1.11 |
1.12 |
1.18 |
| local index |
1.34 |
11.37 |
21.23 |
50.42 |
2:25.48 |
5:01.82 |
As the number of partitions increases, the overhead for scanning all those
index partitions swamps the query time -- at 1000 partitions, the local
index is about 250 times as slow.
Local partitioning may not be looking too good so far. Keep in mind, though,
that this is the worst case scenario -- where local partitioning is weakest.
After all, it makes no sense to split up your index unless you're going to
leverage that split -- and this is what we turn to next.
Also, these times are for selecting all 1000 hits. The runtimes would be
faster if you were interested in only the first 10 hits, for instance. In the
50 partition case, for instance, the query can get 10 hits from the first
partition alone, and would stop without scanning the other 49 partitions. This
would yield times very close to global index.
Partition Pruning
We've seen that local partitioned indexes have the disadvantage when all the
partitions have to be scanned; the flip side of this is that they have a
distinct advantage when you only have to scan a handful of partitions.
The query class that most obviously benefits from local partitioning is
mixed text and structured, with structured condition on the partition key.
Finding news articles from this week that contain some phrase, for instance.
With global indexes, mixed queries work by getting all the text hits, then,
for each hit, evaluating the structured condition. If most of the text hits
don't match the structured condition, ConText does a lot of wasted work.
With local indexes, performance can be improved using partition pruning.
Partition pruning acts as a first-level filter, avoiding query on those
partitions that will never meet the structured criteria -- this results in
a higher text hit to structured hit ratio, and less wasted work in ConText.
For instance:
select /*+ FIRST_ROWS(50) */ 12345
from ptest
where contains(b, 'EVRY') > 0
and a between 90000 and 100000
and rownum < 50;
With a global ConText index, the query plan is:
1.1 COUNT STOPKEY
2.1 TABLE ACCESS BY GLOBAL INDEX ROWID PTEST
3.1 DOMAIN INDEX PTESTX
The ConText index produces a list of rowids in the table which contain the
token EVRY. Since every row contains this token, this results in
100,000
rowids, in docid order. For each of those rowids, the kernel does a table
access by rowid to get the value of column A. In this testcase,
the rowids
resulted from the ConText index happen to come back in ascending A
order.
As a result,
the ConText index wastes time producing 90,000 rowids that are just thrown
away by the structured condition filter. (This is only for a domain index
scan. This is not true of a functional invocation, though that has its own
problems, discussed in "Sort by Partition Key"
below)
With a local partitioned ConText index, the query plan does not look much
different from the global index:
1.1 COUNT STOPKEY
2.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
3.1 DOMAIN INDEX PTESTX
And, in fact, it works pretty much the same -- ask the ConText index for a
list of rowids containing EVRY, then for each rowid, do a table
access to get
the value of A, and ensure that the value is between 90,000 and
100,000.
The difference in this case is that partition pruning has occurred. Since
we have a local partitioned index, instead of one big index, we have 10 small
indexes. The optimizer is smart enough to know that 9 of them will never get
any hits (because of the structured condition on A, the partition
key) so it
only asks p10. All of the rowids produced by p10
fulfill the structured
condition, so none are thrown away.
Time for global index: 4.07. Time for local partitioned index: 0.09
The example above prunes to a single index partition, but this technique can
still be useful with multiple partitions. The query plan simply iterates over
the selected index partitions. For instance:
select /*+ FIRST_ROWS(50) */ 12345
from ptest
where contains(b, 'EVRYS') > 0
and a between 80000 and 100000
and rownum < 50;
Involves two partitions, using the query plan:
1.1 COUNT STOPKEY
2.1 PARTITION RANGE ITERATOR
3.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
4.1 DOMAIN INDEX PTESTX
EVERYS is a special token just for this test, which occurs in
every row from
1-79999, and then every 400th row after that. This lopsided distribution is
to force the partition query to hit two partitions.
Time for global index: 3.59. Time for local partitioned index: 0.20
We are scanning multiple partitions, so there's increased overhead, as the
section "Simple Text Queries" above illustrated. The query takes about twice
as long as the pruned-to-single-partition query above (which took 0.09).
However, scanning two index partitions is still a lot better than producing all
the text hits and filtering out the ones that don't match the range.
In order for true partition pruning to happen, the query must be parsed,
and the structured condition must be literals. If the structured condition
uses bind variables, the optimizer will be forced to produce a plan that
iterates over all index partitions. However, our informal experiments showed
that at execution time there was no performance difference between the
pruned query plan and the iterator query plan. Evidently, the SQL execution
engine's partition iterator recognizes that certain partitions will not yield
hits, and skips over them without opening an index scan. So, using bind
variables should not be cause for alarm. If you want to be sure, you can
use literal values, or use partition-extended tablenames in your select.
Score Sorting
Score-sorted queries are generally slower on local partitioned indexes than
on global indexes. The need to scan multiple partitions vs. a single index
makes it slower just to produce the results, as we saw above in
"Simple Text Queries", but the major factor is
the sorting difference.
With a global index, ConText can "push down" the sort condition and return
the rows in sorted order:
select /*+ FIRST_ROWS(50) */ 12345
from ptest
where contains(b, 'TM07', 1)>0
and rownum < 50
order by score(1) desc;
1.1 COUNT STOPKEY
2.1 TABLE ACCESS BY GLOBAL INDEX ROWID PTEST
3.1 DOMAIN INDEX PTESTX
The contains produces the 10,000 rows containing TM07, sorts them
internally, then returns them in order.
A local partitioned index, on the other hand, is split up, so we cannot
produce a list of all the hits and sort internally. Instead, the local
index scans every partition individually, then does a regular
SORT ORDER BY
on the amassed results:
select * from (
select /*+ FIRST_ROWS(50) */ 12345 from ptest
where contains(b, 'TM07', 1)>0
order by score(1) desc
) where rownum < 50
1.1 COUNT STOPKEY
2.1 VIEW
3.1 SORT ORDER BY STOPKEY
4.1 PARTITION RANGE ALL
5.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
6.1 DOMAIN INDEX PTESTX
The SORT ORDER BY makes the query much slower. The global index
query runs
in 1.19 sec, while the local index takes 7.87 seconds. Also, the query syntax
has to be changed to incorporate a view, because of the different semantics
introduced by using a sort instead of internal ordering.
Because of this effect, if your application does a lot of text-only, score-
sorted queries, local indexes are not a good choice. If you have score-sorted
mixed query that can do partition pruning, however, local indexes can offer
performance advantages even with the sort order by. For instance, taking
the same TM07 text query, but adding a two-partition condition:
select /*+ FIRST_ROWS(50) */ 12345
from ptest
where contains(b, 'TM07', 1)>0
and a between 80000 and 100000
and rownum < 50
order by score(1) desc;
The global index does a lot of wasted work producing and sorting rows that get
filtered out by the text condition. The global index here takes 3.61 sec,
while the local index finishes in only 1.31 sec, thanks to partition pruning.
Even better, if your structured condition prunes down to a single partition,
then the score sorting WILL be pushed down:
select /*+ FIRST_ROWS(50) */ 12345
from ptest
where contains(b, 'TM07', 1)>0
and a between 90000 and 100000
and rownum < 50
order by score(1) desc;
1.1 COUNT STOPKEY
2.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
3.1 DOMAIN INDEX PTESTX
Note the absence of the SORT ORDER BY. For this query you get
the benefits of
both partition elimination and ConText score sorting. The time for the global
index is 3.51 seconds, while the local index takes only 0.56 seconds. Again,
the order by score pushdown occurs only when you prune to a single partition.
Sort by Partition Key
Another situation where local partitioning can improve performance is sort by
partition key. Selecting news articles that contain some phrase, with most
recent articles first, for instance. Or, in our example:
select /*+ FIRST_ROWS(50) INDEX(ptest ptestx) */ 12345 from ptest
where contains(b, 'TM07') > 0
order by a
We've chosen TM07 here as a medium-selectivity text query, which
is the most
problematic for this query class. A very selective text query produces few
hits which can be sorted very quickly, while a very unselective text query
produces so many hits that it's faster to do an ordered b-tree scan, invoking
contains as a function for each row -- you'll only have to do a few to fill
the 50-row buffer. Medium-selectivity text queries produce too many rows
to sort quickly, but too few rows to drive off the b-tree. (the example sql
includes an INDEX hint to force driving from the domain index
scan)
With a global index, the ConText index produces all the hits, then does a
SORT ORDER BY to sort them by the structured column. In this
case, we have to
produce and sort 10,000 rows:
1.1 SORT ORDER BY
2.1 TABLE ACCESS BY GLOBAL INDEX ROWID PTEST
3.1 DOMAIN INDEX PTESTX
This takes 2.34 sec for the first 50 rows.
With a local partitioned index, it is more efficient to do an ordered
partition iteration. The base table is range-partitioned, so the value of
A
for any row in partition p1 is guaranteed to be less than the
value of A for
any row in partition p2. Therefore, any hits from partition
p1 are guaranteed
to be first in the result set when you order by A, and so it is
correct to
produce, sort, and result the hits from partition p1 before touching any other
partition. This is an ordered partition iteration, and the query plan looks
like this:
1.1 PARTITION RANGE ALL
2.1 SORT ORDER BY
3.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
4.1 DOMAIN INDEX PTESTX
The SORT ORDER BY is under the partition iterator. This
indicates that it is
sorting on a partition level, rather than a global level. This is actually
more efficient, because it is faster to sort smaller batches (sort is O(n lg n)
if you can recall those dimly-remembered computer science lectures), but the
primary savings in this case are because we can saturate our 50-hit buffer
with the results from the first partition alone. This query takes only 0.20
sec for the first 50 hits, something like 1200% faster than global's 2.34 sec.
On the other hand, with very selective queries, you may have to hit multiple
partitions, and we have already seen how this can put the local index at a
disadvantage. Changing the query to KSPC (1/1000 selectivity)
forces the
partition iterator to hit 5 of 10 partitions, and its performance -- 0.34 sec
is more than the global index's 0.30 sec. However, the ordered iteration may
still be a good all-around plan -- it's much faster on the queries that are now
slow, and a little slower on the queries that are already fast, so you should
see an overall improvement as long as the number of partitions is not large.
One note of caution: do not use an inline view to do the top-n hit buffering.
Something like this:
select * from (
select /*+ FIRST_ROWS(50) INDEX(ptest ptestx) */ 12345 from ptest
where contains(b, 'TM07') > 0
order by a
) where rownum < 50
Produces a very different query plan:
1.1 COUNT STOPKEY
2.1 VIEW
3.1 SORT ORDER BY STOPKEY
4.1 PARTITION RANGE ALL
5.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
6.1 DOMAIN INDEX PTESTX
Note that the SORT ORDER BY is above the partition iterator.
This means a
global sort, and the timing for this -- 2.63 sec -- is slower than the global
index. This seems to be a quirk in query plan parsing and optimization, and
is currently being investigated (see bug 1800113).
Parallel Query
While local b-tree index queries can be run in parallel, parallel contains
queries are not supported in this version. To illustrate, let's modify our
test table with a numeric column C and its own local b-tree index:
alter table ptest add (c number);
begin for i in 1..100000 loop
update ptest set c = mod(i, 113) where a = i;
end loop; end;
create index ptestc on ptest(c) local;
We can use the PARALLEL_INDEX hint to force parallel execution of
this query:
select /*+ INDEX(ptest ptestc) PARALLEL_INDEX(ptest, ptestc, 4) */ 12345
from ptest
where c = 111
and a between 20000 and 100000;
1.1 PARTITION RANGE ITERATOR PARALLEL_COMBINED_WITH_PARENT
2.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST PARALLEL_TO_SERIAL
3.1 INDEX RANGE SCAN PTESTC NON-UNIQUE PARALLEL_COMBINED_WITH_PARENT
Unfortunately, parallel query is not supported by the extensible indexing
framework in this version:
select /*+ PARALLEL_INDEX(ptest, ptestx, 4) */ 12345
from ptest
where contains(b, 'KSPC')>0
and a between 20000 and 100000;
1.1 PARTITION RANGE ITERATOR
2.1 TABLE ACCESS BY LOCAL INDEX ROWID PTEST
3.1 DOMAIN INDEX PTESTX
Note the lack of any parallel comments in the plan. This capability might be
added by the kernel's extensible indexing framework development in some future
version.
Query Considerations Summary
These tests show that local partitioned indexes are much faster than global
indexes as long as you are leveraging the partitioned nature of the base table,
either through significant partition pruning or ordered partition iteration.
Plain text queries are much slower because of the overhead incurred by scanning
multiple partitions -- especially if the number of partitions is large.
You might therefore conclude that, if your application were doing plain text
queries or mixed queries that do not involve the partition key, that global
indexes would be much better. However, do not overlook the maintenance
advantage of local partitioned indexes (discussed in the following sections).
Overall application performance (including admin and indexing time) might still
be better with local indexes, especially if the number of partitions is small.
Local Partitioned Index
Table Partition Operations
Partitioned Table Maintenance
Index Maintenance
Query Services
Document Services
Table Partition Operations
A major benefit of using partitioned tables is easier maintenance. To quote
the concepts manual, "partitions enable data management operations like data
loads, index creation, and data purges at the partition level, rather than
on the entire table, resulting in significantly reduced times for these
operations".
But what about the indexes? When you do partition-level DDL, a global
ConText index built on the table is a monolithic entity, and knows nothing
about table partitions. The only option is for the whole index to become
unusable, so you have to rescan the entire table and rebuild the index from
scratch -- even though most of the data has not changed.
A local partitioned index, on the other hand, can take more limited and
appropriate action. If you truncate a table partition, for instance, the
index simply truncates the corresponding index partition's internal tables
automatically -- no extra rebuild time required. At worst, just the affected
index partition is marked unusable. That partition must be rebuilt (which
takes less time than a full rebuild because it has just a portion of the data)
but the other index partitions are still available and can be queried.
Rebuilding index partitions is covered below in the section
"Rebuilding UNUSABLE Index Partitions".
We'll now examine the effects of the table partition operations individually.
RENAME PARTITION
alter table ptest rename partition p5 to pnew;
This renames a table partition. It has no effect on the index. In fact,
the index is not even notified when this happens. The index partition name
does not change to match the table partition name. To rename an index
partition, use alter index:
alter index ptestx rename partition p5 to pnew;
TRUNCATE PARTITION
alter table ptest truncate partition p5;
This removes all data in a table partition. The internal tables of the
affected index partition are truncated. The index partition remains valid
and usable.
DROP PARTITION
alter table ptest drop partition p5;
This removes the data and definition of a table partition. The corresponding
index partition's meta-data and internal tables are dropped. The remaining
index partitions are not affected. The index partition id's (
ixp_id in view
ctx_user_index_partitions), which are normally contiguous, will
have a hole.
This doesn't have much effect, and the id will get reused if partitions are
added (through ADD or SPLIT).
ADD PARTITION
alter table ptest add partition p11 values less than (110000);
This creates a new table partition. This new table partition will be empty,
because with RANGE partitioning (the only method currently
supported) you can
add partitions only at the end of the range, where there will be no data.
The ConText index automatically creates a new index partition, creating the
meta-data and internal tables. The new index partition will be valid and can
be queried, although there is no data, because the underlying table partition
is empty.
The name of the new index partition will be the same as the new table
partition name. If that name is already used by an existing index partition,
a system-generated name is assigned instead.
You cannot control the storage characteristics of the new index partition.
It will use the default storage values for the index. You can override this
by rebuilding the partition.
MOVE PARTITION
alter table ptest move partition p5 tablespace testing;
This reconstructs the table partition's data. For example, the command above
moves p5 to a new tablespace. Because ROWIDs and object ids
might change
because of this operation, the corresponding index partition will be marked as
UNUSABLE. The internal tables will still exist and contain the
same data, but
the index partition cannot be queried until it is rebuilt. The index has to be
rebuilt from scratch even though the data has not changed. This is because
there is no way to determine the "old" rowid of each row, so it is not possible
to simply rebuild the $R and $K tables.
SPLIT PARTITION
alter table ptest split partition p5 at (45000)
into (partition p5_1, partition p5_2);
This takes an existing partition and splits it into two adjacent partitions.
The ConText index first drops the existing partition's meta-data and internal
tables, then creates two new index partitions. The internal tables for these
new index partitions will be created, but they will be empty.
After the operation, the space taken up by the old index partition is free,
and the two new index partitions are empty and UNUSABLE -- they
cannot be
queried until they are rebuilt. (see
Rebuilding UNUSABLE Index Partitions,
below)
The two new partitions will have the same name as their table partitions.
If an existing index partition already has that name, a system-generated name
is assigned instead.
You cannot control the storage characteristics of the two new index
partitions. They will use the default storage values for the index. You can
override this when rebuilding the partitions.
MERGE PARTITION
alter table ptest merge partitions p2, p3
into partition p2and3;
This takes two neighbor partitions and merges them into a single partition.
It's the opposite of SPLIT. The local ConText index drops the
meta-data and
internal tables for the two old index partitions, then creates a new index
partition, with empty internal tables. The new index partition is
UNUSABLE
and cannot be queried until it is rebuilt. (see
Rebuilding UNUSABLE Index Partitions, below)
The new index partition will have the same name as the table partition. If
an existing index partition already has that name, a system-generated name is
assigned instead.
You cannot control the storage characteristics of the new index partition.
It will use the default storage values for the index. You can override this
when rebuilding the partition.
EXCHANGE PARTITION
alter table ptest exchange partition p7 with table p7_ext;
This is the most involved of the partition operations. It takes a specified
table partition and a non-partitioned table and swaps their data. If you do
not specify INCLUDING INDEXES, as above, or you explicitly
exclude indexes:
alter table ptest exchange partition p7 with table p7_ext
excluding indexes;
then the corresponding index partition is marked as UNUSABLE and
it must be
rebuilt. If a ConText index exists on the non-partitioned table, then that
is also marked as UNUSABLE and must be rebuilt.
If you include indexes:
alter table ptest exchange partition p7 with table p7_ext
including indexes;
Then the index partition will be automatically rebuilt. If there's a ConText
index on the non-partitioned table, that will also be rebuilt. Since you will
be re-creating two ConText indexes, this may take some time.
The local index and the index on the non-partitioned table are rebuilt with
their existing preferences. It will not synchronize or reconcile divergent
preferences, so you might find queries on the data return different results
after the move.
Rebuilding UNUSABLE Index Partitions
Certain table partition operations cause associated index partitions to
become UNUSABLE. When an index partition is
UNUSABLE, it cannot be queried.
Queries that involve that partition -- including global queries where no
partition pruning occurs -- will fail. For example:
alter table ptest move partition p5 tablespace testing;
select * from ptest where contains(b, 'KSPC')>0;
--> gets ORA-29954: domain index partition is marked LOADING/FAILED/UNUSABLE
Partition-pruned queries will still work, though:
select * from ptest partition(p4) where contains(b, 'KSPC')>0;
--> OK
You can find which partitions are UNUSABLE by querying
USER_IND_PARTITIONS:
select index_name, partition_name
from user_ind_partitions
where status != 'USABLE'
Once the UNUSABLE partitions have been identified, you must
rebuild them using
alter index:
alter index ptestx rebuild partition p5;
Note that multiple partitions of a given index can be queried or even rebuilt
at the same time. Alternatively, you can use:
alter table ptest modify partition p5 rebuild unusable local indexes;
which will go out itself and find all UNUSABLE partitions in all
local indexes for partition p5, and rebuild them.
Partition rebuild is also useful to reset storage characteristics of
partitions which are implicitly created through split, merge, or add; these
use the default storage values for the index when first created:
alter index ptestx rebuild partition p5
parameters ('replace storage p5_store');
The amount of index memory to use when rebuilding the index can also be
specified in the parameters string, although the REPLACE keyword
must be used:
alter index ptestx rebuild partition p5
parameters ('replace memory 5m');
Concurrency and Table Partition Operations
During a table partition operation, the entire domain index is unusable.
For instance, if in session 1 we do:
alter table ptest move partition p5 tablespace testing;
and, while it is in progress, in session 2 we query:
select * from ptest where contains(b, 'KSPC')>0;
you would see the query plan:
1.1 TABLE ACCESS FULL PTEST
and the query itself would get an error:
DRG-10599: column is not indexed
Even partition-specific queries that do not touch the affected partitions will
fail with this error. This is a quirk of the extensible indexing framework --
local b-trees and bitmap indexes do not have this same limitation. This is
currently being investigated by extensible indexing framework development.
Most table partition operations are relatively quick, so the window where
queries are unavailable should be small. However, be aware of this effect
during potentially long-running partition operations -- especially
EXCHANGE PARTITION INCLUDING INDEXES, which will force two
ConText index rebuilds.
Local Partitioned Index Maintenance
Local partitioned ConText indexes are not fundamentally different from global
ConText indexes -- they still have to be sync'ed to reflect new data, and still
should be optimized once in a while to defragment. The only real difference
is that a local index is made up of many sets of internal tables; each set is
independent, so maintenance is on a partition level.
SYNC
CTX_DDL.SYNC_INDEX has been extended to take a partition name:
PROCEDURE sync_index(
idx_name in varchar2 default NULL,
memory in varchar2 default NULL,
part_name in varchar2 default NULL
);
Actually, the memory parameter is also new, and lets you pass in
the index memory size to use for the sync.
For local indexes, you must specify both an index name and index partition
name in the idx_name and part_name parameters. The
ConText index will then
scan dr$pending for rowids covered by that index partition, and
update just
that index partition's internal tables.
part_name must be specified when idx_name is a
local index. This means
that, in this version at least, you must know which index partitions need to
be sync'ed, and you must issue one sync_index call for each of those. The
view CTX_USER_PENDING has been extended with a
PND_PARTITION_NAME column that
shows the index partition name for each row, so you can find which partitions
need sync with:
select distinct pnd_index_name, pnd_partition_name
from ctx_user_pending;
We are investigating enhancing sync_index in future versions to
enable it to
find and sync all out-of-date partitions without user specification.
Only one session can sync a given index partition at a time. However, you
can sync different partitions of the same index at the same time. Just open
multiple sessions -- either separate SQL*Plus windows, or, even better, use
several DBMS_JOBs and multiple SNP slaves. This can improve ingest rates if
data flows to multiple partitions, but it won't help something like date
partitioning, where new records are all going into the same partition.
OPTIMIZE
Like sync, optimize is also now done at the partition level; so, like
SYNC_INDEX, OPTIMIZE_INDEX has been extended to
take a partition name:
PROCEDURE optimize_index(
idx_name in varchar2,
optlevel in varchar2,
maxtime in number default null,
token in varchar2 default null,
part_name in varchar2 default null
);
For local indexes, you must specify both an index name and index partition
name in the idx_name and part_name parameters.
Only one session can
optimize a given index partition at a time, but you can optimize different
index partitions of the same index at the same time.
Each index partition records its own optimization state (it is not in a user
view, but is stored in dr$index_partition) so running a time-
limited FULL
optimization on an index partition will not interfere with any other index
partition. When FULL optimize on partition 1 runs out of time,
the next
FULL optimize on that partition will pick up where it left off,
no matter what
you do to the other index partitions.
Query Services
All query services (CTX_QUERY package) that require an index
name --
COUNT_HITS, EXPLAIN, HFEEDBACK, and
BROWSE_WORDS -- have been extended with
a part_name argument. When specifying a local index, you must
specify an
index partition. The operation will be run against the internal tables of that
partition only. So, for instance, you can count the hits in a particular
partition, but to count hits in the whole table, you must call count_hits
multiple times -- once per partition -- and manually add up the individual
return values.
We are investigating multi-partition versions of these calls for future
release. The results can be calculated only on a per-partition level, so there
is a performance impact, but the main stumbling block is that, other than
COUNT_HITS, merging multiple partition results is somewhat
complicated.
Document Services
There are no changes to document services (CTX_DOC package) for
local
partitioned indexes. The document services operate on a per-document basis,
specifying the document by rowid or primary key. The index is not even used,
except to find out the base table and indexed column, and to know which
indexing objects and attribute values to use. So, the document services don't
really care whether the index is local or global.
New XML Features
Path Indexing
XML Path Searching
Every new version of Oracle Text/interMedia Text/ConText Option has added new
structured document searching features, so we've built up a lot of
functionality -- 8.1.7 has nested section search, doctype disambiguation,
attribute value searching, automatic section indexing, and more. But with the
industry embracing XML, demand is growing for even more sophisticated features
which are beyond the capabilities of the current section architecture.
9.0.1 introduces a new section type and new query operators which support an
XPath-like query language. ConText indexes with XML path searching are able
to perform sophisticated section searches that were impossible in previous
versions.
The Path Section Group
In order to use XML path searching, the index must be created with the new
path section group:
exec ctx_ddl.create_section_group('mypathgroup','PATH_SECTION_GROUP');
create index myindex on mytable(doc)
indextype is ctxsys.context
parameters ('SECTION GROUP mypathgroup');
You can add only SPECIAL sections to a path section group; you
do not have to
add ZONE, FIELD, or ATTR sections, and
it does not support STOP sections in
this version. Most of the time, you won't even need to create your own -- just
use the default one:
create index myindex on mytable(doc)
indextype is ctxsys.context
parameters ('SECTION GROUP ctxsys.path_section_group');
The path section group is somewhat like the auto section group in that it
automatically indexes all tags and attributes it encounters. For query, you
can still use WITHIN, but the path section group also supports
the more
powerful INPATH and HASPATH operators.
Path Section Internals
Now we'll talk a little about how path sections are indexed, and how they
differ from zone and attribute sections. You can skip this section if you're
just interested in usage.
We'll illustrate the difference in index data with this simple doc:
<OUTER><INNER ATTR="attrvalue">text</INNER></OUTER>
The auto section group produces the following in $I:
| TOKEN_TEXT |
TOKEN_TYPE |
TOKEN_INFO |
OUTER |
2 (zone section) |
DOC 1 START 0 LENGTH 1 |
INNER |
2 (zone section) |
DOC 1 START 0 LENGTH 1 |
INNER@ATTR |
5 (attribute section) |
DOC 1 START 0 LENGTH 1 |
ATTRVALUE |
4 (attribute value) |
DOC 1 POS 1 |
TEXT |
0 (plain text) |
DOC 1 POS 1 |
A simple within query like:
select id from mytable where contains(doc, 'text WITHIN inner')>0
can be fulfilled by fetching the info for word TEXT and section
INNER, then
looping through each word offset of TEXT, and checking to see if
it is between
INNER START and START + LENGTH. A simple attribute
query like:
attrvalue WITHIN inner@attr
(we'll just write the text query for brevity) can be fulfilled in much the same
way, but using type 4 and 5 tokens instead of types 0 and 2.
This section type cannot support more complex queries. For instance, equal
sections like INNER and OUTER are a problem. The
query:
(text WITHIN outer) WITHIN inner
matches this document. The index data records the same offsets for
INNER and
OUTER, so it is impossible to tell if INNER is
inside OUTER or vice versa.
Another problem is attribute value sensitive section search. For document:
<SHIPMENT>
<MEDIA TYPE="dvd">Ice Blue Eyes</MEDIA>
<MEDIA TYPE="book">The Monopoly Companion</MEDIA>
</SHIPMENTS>
(We won't include the xml declaration and DOCTYPE stuff -- just
pretend that
they are there) If you want to find shipments that include the DVD "Ice Blue
Eyes", you'd have to include both section and attribute criteria:
find documents where "Ice Blue Eyes" occurs within a "MEDIA" section
whose "TYPE" attribute has the value "dvd"
Unfortunately, attribute values and sections (types 4 and 5) are completely
separate from normal text and sections (types 0 and 2). There is no way to
link an attribute section to the particular zone section occurrence in which
it occurs.
The path section group solves both problems by indexing the document like:
| TOKEN_TEXT |
TOKEN_TYPE |
TOKEN_INFO |
OUTER |
7 (path section) |
DOC 1 START 0 LENGTH 1 LEVEL 1 |
INNER |
7 (path section) |
DOC 1 START 0 LENGTH 1 LEVEL 2 |
INNER@ATTR |
8 (path attr section) |
DOC 1 START 0 LENGTH 1 OCC 1 |
ATTRVALUE |
4 (attribute value) |
DOC 1 POS 1 |
TEXT |
0 (plain text) |
DOC 1 POS 1 |
Zone sections and attribute sections (types 2 and 5) have been replaced with
path sections and path attribute sections (types 7 and 8). Each occurrence
of a path section has a LEVEL bit which indicates the nesting
level of the tag.
The root tag gets a level of 1, its children get a level of 2, their children
get a level of 3, etc. Including level information solves the equal section
problem, because we can now tell which tag is the outer tag.
Path attribute sections have a OCC part which links the
attribute section
occurrence to a path section occurrence. Here, INNER@ATTR has an
OCC of 1
because it occurs in the first occurrence of INNER. The ability
to correlate
attributes and tags solves the attribute value sensitive section search.
New XML Features
Path Query
Querying XML Documents
The query interface is through SQL selects, so your XML queries return entire
documents, not just selected parts of them. While we work on better extraction
features for future release, you can explore using the new
XMLType, which has
extraction methods. Just use extraction methods in the select list and
contains in the where clause.
The INPATH Operator
The way path sections are indexed enables more complicated section searches,
but the WITHIN operator is not expressive enough to handle them. Instead of
<text query> WITHIN <section name> (e.g. Ice WITHIN MEDIA )
indexes with a path section group use the INPATH operator in queries:
<text query> INPATH(<path expression>) (e.g. Ice INPATH(//MEDIA) )
but it functions in much the same way, limiting the scope of the text query to
certain parts of the indexed documents. The parentheses around the path
expression are required. The path expression is more than just a simple
section name -- it is a mini query, with a specialized query language. The
next section explores the path query language in more detail.
You can still use the WITHIN operator even if you are using the
path section
group. There should be no difference in behavior between the path section
group or auto section group when using WITHIN queries.
Path Query Language
The Text path query language is based on XPath, and we will probably continue
to use XPath as a guide for future development, but it is NOT XPath. Not all
the XPath operators exist in the Text path query language, for instance. Also,
the Text path query language operators are case-insensitive, while XPath's are
strictly lower-case. There are other semantic differences covered below. Just
don't make assumptions about the path language based on XPath expectations.
Tags
When specifying tags in path queries, you must specify it exactly as it
appears in the document in order for it to match. There are two commonly-made
mistakes you should avoid.
First, tag names are case-sensitive so the query "title" does
not match the
tag <TITLE> or the tag <Title>. It
will match only <title>.
Second, there is no namespace support in this version. Take the fragments:
DOC 1 <A xmlns:ORCL="/index.html"><ORCL:B>
DOC 2 <A xmlns:ORACLE="/index.html"><ORACLE:B>
<ORCL:B> in DOC 1 is the same tag as
<ORACLE:B> in DOC 2, because their
namespace tags normalize to the same URI. However, when querying for these
tags, you must specify it as written in the document, so "ORCL:B"
to find the
tag in doc 1, and "ORACLE:B" to find it in doc 2.
"B" alone will not find
either tag, nor will something like "http://www.oracle.com/:B".
Future
versions will probably add more sophisticated namespace support.
Top-Level Tag
The simplest INPATH query string is a single tag:
perro INPATH(TITLE)
Like a WITHIN query, this query finds perro where it occurs between
<TITLE> and
</TITLE>. However, unlike a WITHIN query,
<TITLE> must be the top-level tag. Take
these two documents:
DOC 1 <TITLE>Clifford El Gran Perro Colorado</TITLE>
DOC 2 <BOOK><TITLE>Años De Perro</TITLE></BOOK>
The query
perro WITHIN TITLE
will find both documents, but the INPATH query will find only
document 1. It
does not match document 2 because there the TITLE tag has a
level of 2.
What's really happening is that no level for the query node is specified, so
it uses the default context, which is always the top level for
INPATH queries.
You can explicitly specify the top level context with slash:
perro INPATH(/TITLE)
or explicitly specify the default context using dot:
perro INPATH(./TITLE)
both are equivalent to the query without the slash. All examples from here
will include the top level slash for readability.
Any-Level Tag
A double slash indicates "any number of levels down". So, the query:
perro INPATH(//TITLE)
is looking for perro inside a TITLE tag that occurs
at the top level or any
level below. In other words, this query is equivalent to:
perro WITHIN TITLE
and finds both DOC 1 and DOC 2.
Ancestor/Descendant Searching
A child tag is a tag which is enclosed within another tag. For instance, in:
DOC 2 <BOOK><TITLE>Años De Perro</TITLE></BOOK>
TITLE is a child of BOOK. We can find this
document using the any-level
tag searching, as in the previous section. But what if the corpus also
contained:
DOC 3 <MOVIE><TITLE>Mi vida como un perro</TITLE></MOVIE>
In order to find only books with perro in the title, we need to
limit the
search to title tags whose parent is a book tag:
perro INPATH(/BOOK/TITLE)
Reading the path right-to-left, we are looking for a top-level
BOOK tag with
a child TITLE tag, which matches only DOC 2.
The single slash is direct parentage. The query above will not find:
DOC 4 <BOOK><DESCRIPTION>
<TITLE>Años De Perro</TITLE>
</DESCRIPTION></BOOK>
Because here TITLE is not a direct child of BOOK.
TITLE's direct parent is
DESCRIPTION, whose parent is BOOK --
TITLE is a grand-child of BOOK.
To find this doc, you can use the any-level slashes:
perro INPATH(/BOOK//TITLE)
Reading the path right-to-left, we are looking for a top-level
BOOK tag with
some descendant TITLE tag. This query will match both
DOC 3 and DOC 4.
Note that this is not the same as:
((perro WITHIN TITLE) WITHIN BOOK)
First, the INPATH query restricts BOOK to the
top-level. Second, equal
sections are not confused. That is, the query:
((perro WITHIN BOOK) WITHIN TITLE)
would match DOC 4, but the query:
perro INPATH(/TITLE//BOOK)
would not. Path sections know that TITLE is a child of
BOOK, even though
they occur at the same text offsets.
Finally, if you wanted to match only DOC 4 and not
DOC 3 -- that is, you want
to match TITLE only if it is a grandchild of BOOK,
and not a child or great
grandchild, etc. -- you can use the single level wildcard:
perro INPATH(/BOOK/*/TITLE)
The * matches exactly one level, so this path query filters out
DOC 3.
You can combine these ancestor/descendant elements for even more complicated
queries:
felis INPATH(//kindgom/*/*/order/family//genus)
Attributes
You can search within an attribute value using the syntax
<tag>/@<attribute>:
perro INPATH(//MOVIE/@SPANISHTITLE)
matches:
DOC 5 <MOVIE SPANISHTITLE="Mi vida como un perro">My Life As A Dog</MOVIE>
and is equivalent to the query:
perro WITHIN MOVIE@SPANISHTITLE
One limitation resulting from how attributes are indexed is that all attributes
must specify their direct-parent tags. The following:
perro INPATH(//@TITLE)
perro INPATH(A/*/@TITLE)
are not allowed, because the tag for the title attribute is not specified:
select * from doc where contains(text, 'perro INPATH(//@TITLE)')>0;
*
ERROR at line 1:
ORA-20000: Oracle Text error:
DRG-50951: Unable to resolve element name for attribute TITLE
Existence Test
The square brackets are used to impose a condition on a node without changing
the path context. For instance, the query:
monopoly INPATH(/auction[image])
is looking for monopoly inside a top-level auction
tag which has an image
tag as a direct child. The search for monopoly occurs within the
entirety
of <auction> and </auction>, and not
just within <image> and </image>. This
document will match:
<auction>Sailing Monopoly <image src="...">pic</image></auction>
but will not match:
<auction>Sailing Monopoly</auction>
because there is no image element. The default context inside a
test element
is the tag to which it is applied, so
monopoly INPATH(/auction[image])
is actually the same as:
monopoly INPATH(/auction[./image])
You need the dot to reference the default context. Without the dot:
monopoly INPATH(/auction[/image])
it would mean top-level image tag. This is not supported, and will result in
a syntax error.
The existence test for image will match only if image exists
and is a
direct child. It does not match:
<auction>Sailing Monopoly<desc><image src="...">pic</image></desc></auction>
because here image is not a direct child of auction.
You can match this
document using the any-level wildcard, instead:
monopoly INPATH(/auction[.//image])
You can also test for attribute existence:
monopoly INPATH(/auction[@reserve])
The test node can be combined with other operators for interesting searches:
monopoly INPATH(/auction[.//image]/title)
The test node does not change context, so the /title applies to
/auction
rather than /auction//image -- this query finds auctions where
monopoly
occurs inside a direct-child title tag, but only if the auction
has an
image tag in it somewhere. For instance, the doc:
<auction>
<title>Sailing Monopoly</title>
<description>
New Sailing Monopoly with custom pewter tokens from USAOpoly
<image src="...">here is a picture</image>
</description>
</auction>
To test for non-existence, use the NOT operator:
monopoly INPATH(/auction[not(taxinfo)])
this query looks for monopoly within an auction element that does not have
a direct child taxinfo. The NOT operator is
case-insensitive in our path
query language. In XPath it only works in lowercase.
Attribute and Tag Value Equality
The test operator is capable of more than simple existence testing. More
useful is attribute value testing, which contrains nodes by the value of their
attributes. For instance, given a document like:
<MOVIE>
<TITLE LANGUAGE="German">Tiger und Dragon</TITLE>
<TITLE LANGUAGE="French">Tigre et Dragon</TITLE>
<TITLE LANGUAGE="Spanish">Tigre y Dragón</TITLE>
<TITLE LANGUAGE="Mandarin">Wo hu cang long</TITLE>
<TITLE LANGUAGE="English">Crouching Tiger, Hidden Dragon</TITLE>
</MOVIE>
the query:
dragon INPATH(//TITLE)
will search all language titles. To limit the search to just English titles,
you can add an attribute value equality test:
dragon INPATH(//TITLE[@LANGUAGE = "English"])
Only equality and inequality (using !=) are supported. Range
searches are
not supported in this version. The left-hand side must be an attribute or tag,
while the right-hand side must be a literal. The query:
gato INPATH(//A[@B = @C])
is not allowed, nor is something like
gato INPATH(//A["dog" = "cat"]
Only string literals are allowed. Numeric literals, such as
tora INPATH(//MOVIE[@PRICE = 5])
will raise a syntax error. This means that numbers are not normalized. The
query above will not match:
<MOVIE PRICE="5.0">Tora! Tora! Tora!</MOVIE>
because the string 5 is not equal to the string 5.0,
although numerically
they are equal.
The equality test is not strict equality -- it uses "contains-equality". Two
text fragments are contains-equal if the lexer produces identical index info.
Some of the significant ways that this deviates from strict equality are:
- contains equality is probably case insensitive
Unless the lexer has MIXED_CASE on, it would consider
foo and FOO to
be equal strings.
- lexer transformation can make strings equal
The lexer can be set to normalize word forms. Base letter, for instance,
converts accented characters to non-accented forms. A word with u-umlaut
(Müller) is contains-equal to the same word with an unaccented u
(Muller) in this case.
- lexer ignores whitespace and non-alphabetic, non-join characters
The lexer deals in word offsets. In a phrase like
WORD1 WORD2, the word
offset of WORD2 is always 1 greater than the word offset of
WORD1 -- it
doesn't matter how many spaces or newlines there are between them. Also,
any non-alphabetic, non-join character is converted to whitespace (and
subsequently ignored). This can confuse names, with
Chase Matthew being
contains-equal to Chase, Matthew, or phrases, with
fruit-plants being
contains-equal to fruit, plants.
- stopwords in the query string are considered any word wildcard
Any stopword appearing in the query text fragment will match any single
word in the document's text fragment. This means that the query
Paris in the spring would be contains-equal to the document
Paris: avoid during spring.
The rules for contains equality seem complex, but it works the same as regular
text queries hitting document text -- you've probably internalized these rules
already. One significant difference between equality and contains, though, is
that the equality test always makes sure that the number of words in the
attribute value is the same as the number of words in the query string.
dragon INPATH(//TITLE[@LANGUAGE = "French"])
does not match any of these fragments:
<TITLE LANGUAGE="Canadian French">dragon</TITLE>
<TITLE LANGUAGE="French Colloquial">dragon</TITLE>
<TITLE LANGUAGE="Medieval French Colloquial">dragon</TITLE>
Although each LANGUAGE attribute value has the word
French, there are extra
words. These would match a contains in the attribute value, but they do not
meet the "same number of words" equality criteria.
While docu-head people use a lot of attributes in their DTD's, data-heads
prefer child tags. For instance, a docu-head might write:
<MOVIE YEAR="2001" TITLE="Moulin Rogue">...
While a data-head would prefer:
<MOVIE>
<YEAR>2001</YEAR>
<TITLE>Moulin Rogue</TITLE>
...
To match the data-head version, you can use equality testing on tag values:
moulin INPATH(//MOVIE[YEAR = "2001"])
Tag value equality uses contains-equality just like attribute value testing.
Inequality is also supported in both attribute and tag value equality, using
the != operator:
moulin INPATH(//MOVIE[@YEAR != "2000"])
moulin INPATH(//MOVIE[YEAR != "2000"])
Note that inequality implies existence. The queries above do not match
<MOVIE>Moulin Rouge</MOVIE>
Because the MOVIE tag does not have a YEAR
attribute or YEAR child element. To
test for non-existence, use the NOT operator.
AND and OR Operators
You can use boolean AND and OR to combine
existence or equality predicates in
a test. Say you have documents like:
<MOVIE>
<TITLE>Big Trouble in Little China</TITLE>
<ACTORS>
<ACTOR>Kurt Russell</ACTOR>
<ACTOR>Kim Cattrall</ACTOR>
</ACTORS>
<DVD>2 DISCS</DVD>
</MOVIE>
and you want to find movies with china in the title starring
Kurt Russell
and Kim Cattrall that are available on DVD:
china INPATH(/MOVIE[DVD and
.//ACTOR = "Kurt Russell" and .//ACTOR = "Kim Cattrall"]/TITLE)
You can use parentheses for precedence:
blue INPATH(/video[DVD and (discount or @rating = "4")])
AND and OR are case-insensitive in our path query
language. In XPath they
must be lowercase.
Nested INPATH
Nested INPATH operators are allowed, but the two are independent -- the
default context of an INPATH is always top level. For instance:
(perro INPATH(A)) INPATH(B)
will never hit any documents, because both INPATH's are looking
for top-level
tags, and, except for invalid documents, a document cannot have two different
top-level tags.
The HASPATH Operator
The HASPATH operator is not a path query language operator;
it's a ConText
query language operator like INPATH. INPATH is used
when you want to search
for a text query within a path. HASPATH is used when all you want
to do is
test for path existence; it takes a path as its only argument, and returns 100
for a document if the path exists, 0 otherwise.
select id from documents
where contains(doc, 'HASPATH(/movie/dvd)')>0;
will return all documents where the top-level tag is a movie
element which
has a dvd element as a direct child.
HASPATH can also do tag value equality tests:
HASPATH(//A = "dog")
Attribute value equality tests and AND and
OR operators are not currently
supported. You can use the ConText query language AND and
OR, with multiple
HASPATHs to achieve the same effect. Instead of:
HASPATH(A and B)
write:
HASPATH(A) and HASPATH(B)
HASPATH can return false hits when there are empty sections.
Path sections
are recorded with level information, but not true parentage. As a result, a
document like:
<A>
<B>
<C></C>
</B>
<D>
<E></E>
</D>
</A>
is matched by the query:
HASPATH(//B/E)
Since we do not have real parent information, we cannot detect that
E is not
the child of B. The index tells us only that E and
B surround the same
word offsets, and that E is a third-level tag and B
is a second-level tag.
Normally this indicates that E is a child of B.
In this boundary case it
does not. This limitation only applies to empty sections like this -- any
words in the document would ensure correct behavior.
New XML Features
Highlighting
XMLType Indexing
Highlighting
Highlighting with the INPATH and HASPATH operators
is not supported in this
version. You can still highlight and markup regular words, and
WITHIN queries,
but use of the path operators will result in an error message. We are working
on support for a future release.
XMLType Indexing
Oracle 9i introduces a new datatype for storing XML -- the XMLType. This is
a core database feature, and you can find out about the type and its usage in
the XML features manual.
You can create a ConText index on this type, but you need a few database
privileges first:
1. the user creating the index must have query rewrite:
grant query rewrite to <user>
Without this privilege, the create index will fail with:
ORA-01031: insufficient privileges
<user> should be the user creating the index. The
schema that owns the
index (if different) does not need the grant.
2. query_rewrite_enabled should be true, and
query_rewrite_integrity should
be trusted. You can add them to the init.ora:
query_rewrite_enabled=true
query_rewrite_integrity=trusted
or turn it on for the session:
alter session set query_rewrite_enabled=true;
alter session set query_rewrite_integrity=trusted;
Without these, queries will fail with:
DRG-10599: column is not indexed
These privileges are needed because under the covers a ConText index on an
XMLType column is actually a function-based index on the
getclobval() method
of the type. These are the standard grants you need to use function-based
indexes, as covered in the general Oracle documentation. However, unlike
function-based b-tree indexes, you do not need to calculate statistics.
When an XMLType column is detected, and no section group is specified in the
parameters string, the default system examines the new system parameter
DEFAULT_XML_SECTION, and uses the section group specified there.
At install
time this system parameter is set to CTXSYS.PATH_SECTION_GROUP,
which is the
default path sectioner. The default filter system parameter for
XMLType is
DEFAULT_FILTER_TEXT, which probably means that the INSO filter
is not engaged
by default.
Other than the database privileges and the special default section group
system parameter, indexes on XMLType columns work like any other
ConText index.
Here is a simple example:
connect ctxsys/ctxsys
grant query rewrite to xtest;
connect xtest/xtest
create table xtest(doc sys.xmltype);
insert into xtest values (sys.xmltype.createxml('<A>simple</A>'));
create index xtestx on xtest(doc)
indextype is ctxsys.context;
alter session set query_rewrite_enabled = true;
alter session set query_rewrite_integrity = trusted;
select a.doc.getclobval() from xtest a
where contains(doc, 'simple INPATH(A)')>0;
The CTXRULE Indextype
The CTXRULE Indextype
ctxrule is a new indextype designed for classification, which is the inverse
of information retrieval. In traditional IR applications, for which the
ConText indextype is designed, you index a set of documents and find documents
with a text query; in classification, you index a set of queries, and find
queries with a document. This inversion enables a new kind of application.
Take news, for example. New articles come in off the wire all the time. Say
you want to get emailed any new news articles that mention Oracle. You could
do something like this:
- set up a text table and a text index on the table
- as documents come off the wire, insert them into the text table
- periodically sync
- periodically run a query like this:
select article_id from articles
where contains(article_text, 'oracle')>0
and article_date > :last_time_I_ran_this_query
- email the results
There are several problems with this approach. First, there can be a lot of
lag time between when a document comes in and when it gets emailed out. You
have to first wait for the periodic sync to index the document, then wait for
the periodic query to run and find it. More importantly, though, this process
is not scaleable as the number of stored queries increases -- the system spends
a lot of time running queries that don't hit any new documents.
With ctxrule, the system looks like this:
- set up a query table and a ctxrule index on the queries:
create table queries (
username varchar2(10),
query_string varchar2(80)
);
insert into queries values ('JSMITH', 'Oracle');
create index queryx on queries(query_string)
indextype is ctxsys.ctxrule;
- as documents come off the wire, run a query like this:
select username from queries
where matches(query_string, :article_text)>0
- email the article to each matched username
Instead of indexing the article text and periodically running stored queries,
index the query strings, and use incoming document text to query the queries.
The lag problem gets solved because we don't have to wait for a sync and for a
periodic query -- the classification gets done in one step instead, and as part
of a query. The scalability problem gets solved because you don't have to run
every query all the time -- the index on the queries allows you to find only
the ones that match the given document.
Now that you know what ctxrule is -- a classification engine that takes a
given document and matches it against stored queries -- you need to know what
it isn't.
-
ctxrule does not do automatic classification. In some systems, the engine
works with just documents -- each document is analyzed and classified in some
static hierarchy, like a library card catalog. With ctxrule, you can't start
with nothing -- you need to provide the hierarchy, and the rules that govern
classification before processing documents. The existing theme extraction
features can be used for automatic classification, though.
-
ctxrule is not a clustering engine. Some classification systems work by
examining a corpus of documents, and arranging them into groups of related
documents based on discovered common traits. This is not what ctxrule does.
ctxrule operates on single documents, and matches them against queries.
-
ctxrule is not a learning system. In some systems, you collect documents
into groups. The system routes new documents to these groups, and you train
the system by approving or rejecting its guesses. This is not how ctxrule
works. The set of ctxrule queries are user-specified. It does not change
those rules over time or have a feedback process.
-
ctxrule is not psychic. The indexed queries must be explicit and concrete,
such as words or phrases. You can't use fuzzy criteria like "Financial News"
or "Funny Stories". However, you can use ABOUT and theme searching, which
may provide some of this functionality.
-
ctxrule query sets are not hierarchies. You can have a table of queries that
is arranged in some kind of Yahoo-like hierarchy, but ctxrule sees it as a
table and will treat each row as equal and independent. It will not follow
hierarchical rules like parent-child uniqueness.
Even if ctxrule is not a bleeding-edge, fancy learning system, it's still an
interesting indextype that enables new kinds of text applications. We'll now
examine the index and its use in the following sections.
Creating a CTXRULE Index
To create a ctxrule index, specify the ctxrule type in the indextype clause:
create index queryx on queries(query_string)
indextype is ctxsys.ctxrule;
The supported datatypes that you can index are VARCHAR2 and
CLOB. Since you
are indexing stored query strings, VARCHAR2 will probably be
sufficient.
The parameters clause can be used to specify indexing objects:
create index queryx on queries(query_string)
indextype is ctxsys.ctxrule
parameters ('wordlist my_wordlist')
ctxrule supports the same parameters clause options that ConText does, but with
a twist: the section group and filter preferences are inverted and applied at
query time rather than at indexing time. When you specify a section group in
a ConText index, it is applied to the document during indexing. In ctxrule's
case, the document during indexing is a query string, and it wouldn't make
sense to apply a section group. Instead, the section group is applied to the
document that you use in matches, at query time, since the query document could
be XML or some other structured format.
ctxrule has its own set of system parameters for defaulting some of the
preferences:
DEFAULT_CTXRULE_LEXER
DEFAULT_CTXRULE_STOPLIST
DEFAULT_CTXRULE_WORDLIST
DEFAULT_CTXRULE_STORAGE
ctxrule does not support local partitioning in this version.
Querying with Matches
ctxrule indexes are queried using the matches operator, passing the entire
document text as the query string. Assuming your document text is stored in
a bind variable :doc_text, the query:
select * from queries
where matches(query_string, :doc_text)>0
will find matching queries in the queries table.
Putting it all together for a simple example:
create table queries (
query_id number,
query_string varchar2(80)
);
insert into queries values (1, 'oracle');
insert into queries values (2, 'larry or ellison');
insert into queries values (3, 'oracle and text');
insert into queries values (4, 'market share');
create index queryx on queries(query_string)
indextype is ctxsys.ctxrule;
select query_id from queries
where matches(query_string,
'Oracle announced that its market share in databases
increased over the last year.')>0
this query will return queries 1 (the word oracle appears in the
document)
and 4 (the phrase market share appears in the document) but not
2 (neither
the word larry nor the word ellison appears, and not
3 (there is no text
in the document, so it does not match the query).
The document text used in a matches query can be VARCHAR2 or
CLOB. It does
not accept BLOB input in this version, so you cannot match
filtered documents
directly. Instead, you must filter the binary content to CLOB
using the INSO
filter. Assuming the document data is in bind variable :doc_blob:
declare
doc_text clob;
begin
-- create a temporary CLOB to hold the document text
doc_text := dbms_lob.createtemporary(doc_text, TRUE, DBMS_LOB.SESSION);
-- call ctx_doc.ifilter to filter the BLOB to CLOB data
ctx_doc.ifilter(:doc_blob, doc_text);
-- now do the matches query using the CLOB version
for c1 in (select * from queries where matches(query_string, doc_text)>0)
loop
-- do what you need to do here
end loop;
dbms_lob.freetemporary(doc_text);
end;
ifilter is discussed later in this
document. Similarly, to match files, you
need to get the text into a CLOB. If your file is text in the
database
character set, you can create a BFILE and load it to a
CLOB using the function
dbms_lob.loadfromfile, or you can use
UTL_FILE to read the file into a temp
CLOB locator. If your file needs INSO filtering, you can load
the file into
a BLOB instead, and call ctx_doc.ifilter as above.
The Rule Language
The language used for the indexed queries is a simplified form of the ConText
query language which is used in contains. The rule language
supports the
following:
| Operator |
Example |
AND | dog and cat |
OR | dog or cat |
PHRASE | dog sled |
ABOUT | about(dogs) |
NEAR | dog ; cat |
STEM | $dog |
WITHIN | dog within title |
THESAURUS | SYN(dog) |
The expansion operators -- about, stem, thesaurus -- are index-time snapshots.
For instance, suppose that your default thesaurus looks like this:
cat
SYN kitty
when you index the query:
SYN(cat)
when the create index is complete, the matches query:
matches(query_string, 'kitty')>0
will hit that indexed rule. This is because at index time the thesaurus was
used to expand SYN(cat) to {CAT}|{KITTY}. If the
thesaurus then changes to:
cat
SYN kitty
SYN gato
The already-indexed rule will not be affected. The matches query:
matches(query_string, 'gato')>0
is not going to hit the query. In order to reflect the new thesaurus, the
query has to be reindexed:
update queries set query_string = query_string where query_id = 123;
exec ctx_ddl.sync_index('queryx');
CTXRULE Mechanics
Now we'll discuss the internal details of how the ctxrule index works. If
you're just interested in usage, you can skip to the next section,
"Performance".
The ctxrule indexing process is:
+-----------+ +----------+ +-----------+
| | query strings | | query string | query |
| datastore |--------------->| lexer |--------------->| parser |
| | | |<---------------| |
+-----------+ +----------+ parse tree +-----------+
^ |
| |
|column data |rules
| |
| V
+==========+ +----------+
| | | |
| database |<----------------| engine |
| | rules | |
+==========+ +----------+
The datastore delivers query strings to the lexer, which passes them whole
to the query parser. The parser -- in this case, a simplified version of the
parser used for ConText contains queries -- creates a parse tree and sends this
back to the lexer. The lexer normalizes the tokens, breaks the parse tree into
rules (more on what a rule is below), and sends these to the engine. The
engine builds up an inverted index of documents by rule, and stores it in the
index tables.
ctxrule uses $I, $R, $K, and $N tables just like the ConText index. The
major difference is in the structure of the $I table, which includes a new
TOKEN_EXTRA column:
TOKEN_TEXT VARCHAR2(64)
TOKEN_TYPE NUMBER
TOKEN_FIRST NUMBER
TOKEN_LAST NUMBER
TOKEN_COUNT NUMBER
TOKEN_INFO BLOB
TOKEN_EXTRA VARCHAR2(4000)
The TOKEN_EXTRA column is not part of the $X index column list.
Instead of indexing document lists by token, ctxrule indexes query lists by
rule. A rule is one of the tokens in the query string, called the filing term,
stored in TOKEN_TEXT, plus conjunctive criteria stored in
TOKEN_EXTRA.
The idea is that if a document has token X, a select on $I by
TOKEN_TEXT = X
is a good first-pass filter -- you may get some that don't match the document,
but you will not miss any potentially-matching rules. Each rule returned by
this select is a partial match -- if the
TOKEN_EXTRA criteria can be fulfilled
by the other tokens in the document, then the rule is a full match.
We'll illustrate token_extra with a simple example, using the queries:
QUERY_ID QUERY_STRING
-------- ------------
1 oracle
2 larry or ellison
3 text and oracle
4 market share
Query 1 is a single word query. A document is a full match if it contains
the word oracle. In this case, matching TOKEN_TEXT
alone is sufficient, so
TOKEN_EXTRA is NULL:
QUERY_STRING TOKEN_TEXT TOKEN_EXTRA
---------------- ---------- -----------
oracle ORACLE (null)
Query 2 is an OR query. A document is a full match if it contains the word
larry or the word ellison. This can be reduced to
two single-word queries,
each of which has TOKEN_EXTRA NULL:
QUERY_STRING TOKEN_TEXT TOKEN_EXTRA
---------------- ---------- -----------
larry or ellison LARRY (null)
ELLISON (null)
Query 3 is an AND query. A document must have both text and
oracle to be
a full match. The engine will choose one of these as the filing term, and
place the other the TOKEN_EXTRA criteria:
QUERY_STRING TOKEN_TEXT TOKEN_EXTRA
---------------- ---------- -----------
text and oracle TEXT {ORACLE}
Documents that contain the word text will pull this rule up as a
partial
match. The query engine will then examine the TOKEN_EXTRA
criteria, see that
it requires the presence of the word oracle, check if the
document contains
that word, and judge the rule a full match if so.
Query 4 is a phrase. The engine will use the first word of the phrase as
the filing term, and the whole phrase as the TOKEN_EXTRA:
QUERY_STRING TOKEN_TEXT TOKEN_EXTRA
---------------- ---------- -----------
market share MARKET {MARKET} {SHARE}
Putting it all together, our $I table looks something like this:
TOKEN_TEXT TOKEN_EXTRA TOKEN_INFO
---------- ----------------- --------------------------------------
ORACLE (null) DOC 1
LARRY (null) DOC 2
ELLISON (null) DOC 2
TEXT {ORACLE} DOC 3
MARKET {MARKET} {SHARE} DOC 4
Now that we have our index let's take a look at the query process:
+-----------+ +----------+ filtered +-----------+
| | document | | doc text | |
| SQL Query |---------->| filter |----------->| sectioner |---------+
| | | | | |--+ |
+-----------+<-------+ +----------+ +-----------+ | |
| | |
| section offsets | |
results | +---------------------------------+ |
| | |
| V |
+==========+ | +----------+ +-----------+ |
| | +--| | | | |
| index |----------->| query |<-----------| lexer |<--------+
| tables | index data | engine | tokens | | plain
+==========+ +----------+ +-----------+ text
The document is passed through the filter, sectioner, and lexer to tokenize it.
The query engine then iterates over the unique tokens in the document. For
each token, it selects rows in $I which have that token as
TOKEN_TEXT. The
TOKEN_EXTRA for each of those rules is evaluated. If satisfied,
then the query
list is added to a result buffer. The result buffer for all the tokens is
merged to eliminate duplication, and returned to the query session.
Let's walk through the query:
select query_id from queries
where matches(query_string,
'Oracle boss Larry Ellison reported increased market share
of text applications')>0
First the filter, sectioner, and lexer tokenize the document:
TOKEN_TEXT TOKEN_INFO
---------- ----------
ORACLE OFF 1
BOSS OFF 2
LARRY OFF 3
...
Next the query engine iterates over each unique token. The first token is
ORACLE, so the query engine executes:
select token_extra, ...
from dr$queryx$i
where token_text = 'ORACLE' ...
This pulls up:
TOKEN_TEXT TOKEN_EXTRA TOKEN_INFO
---------- ----------------- --------------------------------------
ORACLE (null) DOC 1
the TOKEN_EXTRA is NULL, so this is a full match,
and we add DOC 1 to the
result buffer.
Next we move to the second token, BOSS. This pulls up no rows
in $I, so we
move immediately to the third token, LARRY, which results in:
TOKEN_TEXT TOKEN_EXTRA TOKEN_INFO
---------- ----------------- --------------------------------------
LARRY (null) DOC 2
Again TOKEN_EXTRA is NULL, so this is a full match,
and we add DOC 2 to the
result buffer. The buffer now contains DOC 1 and
DOC 2.
On the fourth token, ELLISON, we match DOC 2 again.
However, DOC 2 is
already in the result buffer, and so we discard this hit as a duplicate.
When we get to MARKET, we have a non-null
TOKEN_EXTRA:
TOKEN_TEXT TOKEN_EXTRA TOKEN_INFO
---------- ----------------- --------------------------------------
MARKET {MARKET} {SHARE} DOC 3
To verify that the phrase market share exists, the query engine
accesses the
occurrence lists for MARKET and SHARE, and finds a
case where the word offset
of SHARE is one more than that of MARKET. Since the
document token information
is all in memory, this is quick and requires no disk access. The
TOKEN_EXTRA
is fulfilled, and DOC 3 is added to the result buffer.
On reaching TEXT, the rule:
TOKEN_TEXT TOKEN_EXTRA TOKEN_INFO
---------- ----------------- --------------------------------------
TEXT {ORACLE} DOC 4
is returned by the select. The engine quickly checks that ORACLE
exists in the
in-memory token list, and adds DOC 4 to the result buffer.
Once the engine has processed all tokens in the document, the merged result
list is resolved from docid to rowid (using the $R table just like ConText) and
the rows returned to the query session.
Performance
The typical ctxrule query does a lot more work than your typical contains
query. The document has to be filtered, sectioned, and lexed. The $I table
has to be accessed many more times -- once per unique word in the document.
Each TOKEN_EXTRA has to be evaluated. So the expected response
time of a
ctxrule query is generally longer than that of a contains query.
ctxrule query performance depends mainly on the size of the document. As the
size of the document increases, there are more unique words, each of which
results in a $I query. Performance is also affected by number of unique rules
indexed. As this increases, so does the number of partial matches per $I query
and the engine needs more time to evaluate all the TOKEN_EXTRA
criteria.
Furthermore, the number of rows returned by the select increases, so there's
an $R overhead and the time to do whatever it is your application does with the
results. However, the number of unique rules has less impact on query
performance than size of the document:
Indexed Queries |
Document Size |
| |
143 |
525 |
1034 |
2108 |
4023 |
8151 |
16636 |
32933 |
64337 |
| 100 |
0.09 |
0.16 |
0.25 |
0.42 |
0.73 |
1.32 |
2.51 |
4.10 |
7.15 |
| 500 |
0.09 |
0.17 |
0.26 |
0.46 |
0.80 |
1.46 |
2.84 |
4.73 |
7.57 |
| 1000 |
0.10 |
0.18 |
0.28 |
0.51 |
0.84 |
1.55 |
2.84 |
4.58 |
7.87 |
| 2000 |
0.11 |
0.20 |
0.30 |
0.56 |
0.99 |
1.72 |
3.27 |
4.80 |
8.51 |
| 4000 |
0.16 |
0.22 |
0.34 |
0.71 |
1.25 |
2.15 |
4.28 |
6.20 |
10.20 |
| 8000 |
0.13 |
0.28 |
0.44 |
0.87 |
1.68 |
2.83 |
5.78 |
8.04 |
13.96 |
| 16000 |
0.15 |
0.33 |
0.62 |
1.17 |
2.30 |
3.77 |
8.34 |
10.40 |
19.16 |
These timings are informal times done on a workstation. They are not tuned
benchmark numbers, and should be used for intra-chart comparison only. The
documents and rules used are derived from real-world corpora.
When looking to improve performance, there's not much you can do about the
size of your document, so your ability to decrease response time for a
particular document is limited. However, you can get overall throughput
improvement by partitioning and parallel. Instead a single session doing
a matches query on each document as it comes in, round-robin the documents to
separate sessions -- or even separate databases on separate machines.
CTXRULE Index Maintenance
ctxrule indexes require sync and optimize, just like a ConText index. Simply
use the ctx_ddl.sync_index and ctx_ddl.optimize_index
calls, passing a ctxrule
index name instead of a ConText index name.
CTXRULE Query and Document Services
Document (CTX_DOC) and query (CTX_QUERY) services
are not supported against
ctxrule indexes.
Miscellaneous New Features
Deprecated Features
The alter index interface for SYNC and OPTIMIZE
is no longer supported. We
have discouraged the use of alter index for SYNC and
OPTIMIZE since the PL/SQL
interface for these was added in 8.1.6. Failures during sync or
optimize can
invalidate the index when using the alter index interface, forcing a complete
index rebuild. The PL/SQL API does not carry this risk.
You can no longer create CTXCAT indexes on CLOB
columns. Although existing
CTXCAT indexes on CLOB columns created in 8.1.7 are still valid
and can be
queried, creating new ones is not allowed and will result in an error message.
CTXSRV has been deprecated since 8.1.6 -- avoid using this.
Instead, use
CTX_DDL.SYNC_INDEX called from a DBMS_JOB.
CTXLOAD should be used only for thesaurus import and export --
its loading
and updating features are deprecated and should not be used.
BLOB_LOC/CLOB_LOC OUTPUT_TYPE in USER_DATASTORE
When the user datastore is employed with an OUTPUT_TYPE of
CLOB or BLOB, the
datastore creates a temporary lob, and the user datastore procedure is expected
to copy the document content into it. The system of copying into a temporary
lob was designed for user datastores where the document is being assembled from
parts or its content is being processed before indexing.
However, some applications use the user datastore as a multiplexer, where
each document is stored in exactly one of several locations (an arc relation,
for you ERD nuts). For instance, you might have separate tables to store
movies, images, and text so that you could record media-specific attributes:
create table MOVIE (
movieid NUMBER,
description CLOB,
encoding VARCHAR2(30),
data BLOB
);
create table IMAGE (
imageid NUMBER,
annotation CLOB,
data BLOB
);
create table DOCUMENT (
docid NUMBER,
author VARCHAR2(80),
title VARCHAR2(150),
text CLOB
);
but still want to be able to efficiently search across all media types -- for
movies, search the description, for images, search the annotation, and for
documents, search the text. You could do this with the user datastore. First
create a combined media table:
create table MEDIA (
mediatype VARCHAR2(10),
mediaid NUMBER,
searchcol CHAR(1)
);
then create a multiplexing user datastore procedure, which reads the media
type of each row, and fetches the data from the appropriate location:
create or replace procedure myds(rid in rowid, dataout in out nocopy clob)
is
mlob clob;
begin
for c1 in (select mediatype, mediaid from media where rowid = rid) loop
if (c1.mediatype = 'MOVIE') then
select description into mlob from movie where movieid = c1.mediaid;
elsif (c1.mediatype = 'IMAGE') then
select annotation into mlob from image where imageid = c1.mediaid;
elsif (c1.mediatype = 'MEDIA') then
select text into mlob from document where docid = c1.mediaid;
end if;
DBMS_LOB.COPY(dataout, mlob, dbms_lob.getlength(mlob));
end loop;
end;
However, in this case, that last DBMS_LOB.COPY is wasteful --
since we
already have a lob locator with the data, there's no reason to spend I/O
copying the exact same thing to the temp lob. It would be more efficient to
simply pass this lob locator back to the indexing engine.
For this class of application, the user datastore object has been extended
with two new OUTPUT_TYPE values: CLOB_LOC and
BLOB_LOC. These settings tell
the user datastore to by-pass temporary lob creation because the user datastore
procedure will pass back a permanent lob locator instead.
To use CLOB_LOC, our example user datastore procedure needs to
be modified to
get rid of the dbms_lob.copy and instead pass back the lob
locator in dataout:
create or replace procedure myds(rid in rowid, dataout in out nocopy clob)
is
begin
for c1 in (select mediatype, mediaid from media where rowid = rid) loop
if (c1.mediatype = 'MOVIE') then
select description into dataout from movie where imageid = c1.mediaid;
elsif (c1.mediatype = 'IMAGE') then
select annotation into dataout from image where docid = c1.mediaid;
elsif (c1.mediatype = 'MEDIA') then
select text into dataout from document where movieid = c1.mediaid;
end if;
end loop;
end;
and, in our preference creation, set output_type to CLOB_LOC:
begin
ctx_ddl.create_preference('myds_pref','USER_DATASTORE');
ctx_ddl.set_attribute('myds_pref', 'PROCEDURE', 'myds');
ctx_ddl.set_attribute('myds_pref', 'OUTPUT_TYPE', 'CLOB_LOC');
end;
Now our user datastore is passing back locators instead of doing copies, so we
should see an improvement in indexing performance.
One limitation is that the output_type setting applies to all
rows being
indexed -- you cannot have the datastore pass in a temp locator for some rows
and not pass one in for others. In our example above, we might want to append
author and title information for documents, but still pass back permanent lobs
for movies and images.
You can accomplish this to a limited extent in the user datastore procedure.
It can create a temporary lob and store the locator in a persistent place, such
as a package variable. When you need to use it, trim the temp lob, construct
the document content, and set the OUT variable to the temp lob locator.
IGNORE Format Column Value
Some customers would like to have a single repository table for all media.
This includes non-text data like images and soundfiles. When creating a
ConText index on these, at best INSO filters them to empty files and you waste
time. At worst, there's no filtering and you end up with a lot of garbage
tokens.
To deal with these, a new IGNORE value has been added to the
format column.
When the format column is set to IGNORE, the document content is
blocked in
the datastore -- its content is never accessed, and it is indexed as an empty
document. It is assigned a docid and the rowid is in $R, but it has zero
impact on $I because no tokens are generated, and near-zero impact on query
time -- the only effect is a slightly larger $R table.
Here's IGNORE in action:
create table myfiles (
format varchar2(10),
filen varchar2(80)
);
insert into myfiles values ('TEXT', '/doc/README');
insert into myfiles values ('BINARY','/doc/imt901.doc');
insert into myfiles values ('IGNORE','/doc/banner.jpg');
create index myfilesx on myfiles(filen)
indextype is ctxsys.context
parameters ('datastore ctxsys.file_datastore
filter ctxsys.inso_filter
format column format');
The first document is marked as TEXT, so it bypasses INSO
filtering. The
second document is marked as BINARY, so it gets INSO filtered
before lexing.
The third document is marked as IGNORE, so it is indexed as an
empty document.
The file /doc/banner.jpg is not opened or accessed during the
create index.
The IGNORE value is caught by the datastore interface, so all
intervening
layers do minimal work. If you have a user datastore or procedure filter,
for instance, the procedure will not get executed at all for the
IGNORE
documents.
The IGNORE value is also useful for working around problem
documents. If
you find that your index cannot be created because of faulty documents, you
can mark those documents as IGNORE, which should allow the index
to be created.
When the problem is resolved, you can change the value to TEXT or
BINARY as
appropriate, and force re-indexing by updating the indexed column to itself.
In previous versions, the row had to be deleted, which is usually unacceptable
because you lose data and affect other parts of the app.
INSO Timeout
The INSO filter will occasionally hang or loop on certain documents. In the
past, this would bring indexing to a halt. While we continually work with INSO
corporation to improve filter quality, we have added a timeout so that such
cases can fail individually, allowing the indexing to continue.
You can set timeout as an attribute of the inso filter:
begin
ctx_ddl.create_preference('myinso','inso_filter');
ctx_ddl.set_attribute('myinso','timeout','600');
end;
The unit of TIMEOUT is seconds. The default is 120, if
TIMEOUT is not
specified. (Note: bug 1819825 makes the default no timeout, but this will be
addressed in the first 9.0.1 patchset) If filtering has not progressed within
that time period, ctxhx
kills itself and logs an indexing error. Those documents (which can be found
by scanning ctx_user_index_errors after create index or sync) can be tested
further, perhaps by manually invoking ctxhx.
The timeout interval is not an upper limit on document filtering time -- a
document can take longer than the timeout interval. ctxhx checks
the file
size every timeout interval seconds. If the file size is growing, then
ctxhx continues. If it is not growing, ctxhx kills
itself. This means that
documents that simply take a long time to filter can be filtered. It also
means that infinite loops in INSO that continually write to the output file
cannot be caught by timeout.
Do not set timeout too low. There is currently no way to override the INSO
timeout on a per-document basis. If it turns out that INSO can complete
filtering a timed-out document, just not make any progress each interval, there
is no way to get that document indexed. The best you can do is manually filter
to text and replace the binary contents with text contents. We believe that
documents not progressing after 2 minutes are very rare.
CTX_DOC.IFILTER
The user datastore is often used to construct virtual structured documents
by concatenating several columns of the base table. However, this technique
can't really be used when one or more of those columns is binary data, such as
a Word document. The binary data must be converted to text in the datastore
procedure before concatenation.
Unfortunately, there is no way to do this in previous versions. The obvious
attempt -- calling ctx_doc.filter inside the user datastore
procedure -- is not
workable. The ctx_doc calls are trusted callouts, as is
indexing, and our
architecture does not currently allow recursive calls. Even if supported,
however, the ctx_doc functions require an index, so you would
have to create a
dummy index on the BLOB column just to be able to call
ctx_doc.filter. This would
just take up space.
So for 9i we've created a new document service CTX_DOC.IFILTER.
It takes a
passed-in BLOB, filters it to HTML using the INSO filter, and
appends the
result to the passed-in CLOB:
PROCEDURE ifilter(
data IN blob,
text IN OUT nocopy clob
);
No index is needed, since you pass in the BLOB directly. It
appends, so you
can lay out your structured columns first, then call ifilter to
append the
BLOB's data -- or call it multiple times for multiple
BLOBs. Here's an
example of ctx_doc.ifilter used in a user datastore procedure:
create procedure ifds(rid in rowid, doc in out nocopy clob)
is
buf varchar2(120) := null;
begin
for c1 in (select id, title, document
from pdfdocs
where rowid = rid)
loop
buf := buf || '<ID>'||c1.id||'</ID>';
buf := buf || '<TITLE>'||c1.title||'</TITLE>';
buf := buf || '<DOCTEXT>';
dbms_lob.writeappend(doc, length(buf), buf);
ctx_doc.ifilter(c1.document, doc);
buf := buf || '</DOCTEXT>';
dbms_lob.writeappend(doc, length(buf), buf);
end loop;
end;
The text is always HTML in this version -- a plaintext option may be added
in the future. Because ctx_doc.ifilter always appends to the
existing data in the
text clob, make sure you pass in an empty clob or trim the clob
before the
first call. Unlike the in-memory doc services, ifilter does not
automatically
allocate a temporary lob; passing in NULL or a non-writeable lob
in the text
parameter will result in an error.
ifilter is implemented using a safe callout -- this means that
SQL*Net must
be configured for extproc and running in order to use this
feature. ifilter
uses the safe callout so that it can be called from a user datastore procedure,
which runs as a trusted callout.
CTXHX will be invoked using the default built-in timeout of 120
seconds.
There is no way to override this in this version.
ifilter explicitly calls the INSO filter. There is no way to
have ifilter
use the charset, procedure, or user filters.
UTF-16 Support
The INSO and charset filters currently convert documents to the database
character set when needed. The conversion is done using Oracle NLS functions.
Since 9i NLS improvements add UTF-16 character set support (
AL16UTF16,
AL16UTF16LE), these filters get UTF-16 conversion implicitly.
However, we have also added UTF-16 endian detection. If the document
character set specification is UTF16AUTO, then the filter will
examine the
first two bytes of the data and choose big-endian or little-endian UTF-16
conversion as appropriate. If the marker bytes are 0xFE,
0xFF,
then these two
bytes are stripped out and little-endian conversion is used. If the marker
bytes are 0xFF, 0xFE, then these two bytes are
stripped out and
big-endian
conversion is used. Anything else assumes big-endian conversion and leaves
the first two bytes intact.
An example of using UTF-16 auto-detection:
create table conv (
format varchar2(10),
charset varchar2(30),
text varchar2(80)
);
insert into conv values ('TEXT', 'UTF16AUTO', '/doc/msunicode.txt');
insert into conv values ('TEXT', 'UTF16AUTO', '/doc/unixunicode.txt');
create index convx on conv(text)
indextype is ctxsys.context
parameters ('datastore ctxsys.file_datastore
filter ctxsys.inso_filter
format column format
charset column charset');
Japanese Lexer
Written Japanese is problematic for full-text indexing because words are not
space-delimited; sentences are written in one long string of kana and kanji.
Human readers can easily pick out the words, but this is much harder to do
mechanically.
The existing JAPANESE_VGRAM_LEXER solves this problem by not
indexing words
at all. Instead, it splits up a string of characters into overlapping bits.
A sequence of ABCD (where A, B, C, and D are kana/kanji) might be indexed as
three "tokens": AB, BC, and CD.
While this allows full-text retrieval, it has some problems. Index space is
wasted on "tokens" that span word boundaries, because they might never get
queried -- and if they do get queried, it's a false hit. Query is slowed by
the need to combine index data from multiple tokens for each word, and these
tokens are somewhat unselective, because they are character sequences instead
of full words.
9i includes the new JAPANESE_LEXER, which is a hybrid lexer
based on the
CHINESE_LEXER that uses a lexicon to segment written Japanese
into coherent
words, falling back on vgrams only for unrecognized sequences. The segmenta-
tion should make Japanese text query faster and more precise, while the
vgramming preserves recall. Using the JAPANESE_LEXER is pretty
easy, because
it has no attributes to fiddle with in this version:
exec ctx_ddl.create_preference('myjlexer','japanese_lexer');
create index myidx on mytable(text)
indextype is ctxsys.context
parameters ('lexer myjlexer');
The lexicon is an Oracle-modified version of the EDR lexicon, licensed from the
European Language Resource Association. This is not a case of poor American
geography skills -- just a funny consequence of licensing issues. The lexicon
is not user-modifiable in this version.
New Chinese Lexicon
Written Chinese poses the same problem as Japanese: lack of whitespace.
Previous versions have two lexers for Chinese: the
CHINESE_VGRAM_LEXER for
vgram indexing, and a CHINESE_LEXER, for segmentation, 9i
includes an update
to the lexicon that the CHINESE_LEXER uses, which should improve
its accuracy.
Korean Morphological Lexer
Completing our tour of East Asia, we have a new Korean lexer. Unlike Chinese
or Japanese, Korean is mostly whitespace delimited. However, like German,
it has a large number of compound words. Since words change form when combined
in compounds, finding word boundaries is quite difficult. Furthermore, verbs
have a large number of inflectional forms -- literally thousands.
The new lexer, KOREAN_MORPH_LEXER, uses a newly-licensed
technology, 'HAM'.
This analyzes text a sentence at a time, then uses morphology and grammar
lexicons to identify word boundaries and normalize word form. Its word
boundary detection is more accurate than the existing
KOREAN_LEXER. The word
form normalization improves recall without having to expand query terms to all
possible forms.
The KOREAN_MORPH_LEXER has a lot of attributes to play with.
They control
what to index and what to throw away -- for instance, NUMBER can
be set to
FALSE to index or to throw away numbers. Generally the defaults
are fine,
so the following will work:
exec ctx_ddl.create_preference('myjlexer',korean_morph_lexer');
create index myidx on mytable(text)
indextype is ctxsys.context
parameters ('lexer myjlexer');
The KOREAN_MORPH_LEXER uses four lexicons, all in
?/ctx/data/kolx:
drk2sdic.dat - system dictionary
drk2gram.dat - grammar dictionary
drk2udic.dat - user dictionary
drk2xdic.dat - stopword dictionary
The system dictionary should not be modified, but the grammar, user, and
stopword dictionaries can be user-edited for extension and customization. The
rules for editing each one can be found in the files themselves (albeit in
Korean). After modification, it is best to bounce the database or flush the
shared pool to ensure that the changes get reloaded into memory.
ALL Language for Multi-Stoplist
Version 8.1.7 introduced the multi-language stoplist. One drawback to this
functionality was that stopwords that appear in multiple languages had to be
added multiple times. Foreign-language documents in the real world sometimes
contain passages in English (we refuse to learn anything else). In order to
stop THE in such cases, you had to do:
ctx_ddl.add_stopword('globallist','the','French');
ctx_ddl.add_stopword('globallist','the','German');
ctx_ddl.add_stopword('globallist','the','Swedish');
...
which quickly uses up the 4096 available stopword slots.
For this case, we've added a new ALL language for stopwords:
ctx_ddl.add_stopword('globallist','the','ALL');
An ALL-language stopword works for any and all document languages, while taking
up just one stopword slot.
CTXCAT Wildcard Searching
Wildcard has been added to the CTXCAT query language, using asterisk. For
instance:
catsearch(desc, 'monopol*', '')>0
does the same thing as:
contains(desc, 'monopol%')>0
Left-truncated wildcards are not supported. They are stripped off and the
remainder of the pattern is expanded.
catsearch(desc, '*monopol*', '')>0
is the same as the above query for monopol*. Wildcards in the middle of a
token are supported:
catsearch(desc, 'mon*pol*', '')>0
CTXCAT queries do not perform well when the number of tokens in queries is
very large. If you expect to use CTXCAT wildcard a lot, we highly recommend
also using prefix indexing; wildcard queries will be one prefix token instead
of multiple normal tokens.
Transportable Tablespace
Transportable tablespace is now supported by all three indextypes, as long
as all internal objects are in the tablespace set. With transportable
tablespace, you can do your indexing on a large machine, then quickly push the
results to multiple query server databases for query and DML scalability. To
illustrate, we'll create a table in tablespace datat, and a text
index on it
with internal objects in tablespace indext, controlled by the
storage
preference:
create table transport(text varchar2(80))
tablespace datat;
begin
ctx_ddl.create_preference('mystore','basic_storage');
ctx_ddl.set_attribute('mystore','i_table_clause','tablespace indext');
ctx_ddl.set_attribute('mystore','k_table_clause','tablespace indext');
ctx_ddl.set_attribute('mystore','n_table_clause','tablespace indext');
ctx_ddl.set_attribute('mystore','r_table_clause','tablespace indext');
ctx_ddl.set_attribute('mystore','i_index_clause','tablespace indext');
end;
/
create index textidx on transport(text)
indextype is ctxsys.context
parameters ('storage mystore');
Imagine that the table has a lot of data, and that the index takes a long time
to create and populate the internal tables. Now make the two tablespaces read
only:
alter tablespace datat read only;
alter tablespace indext read only;
then export the two tablespaces:
exp "'/ as sysdba'" transport_tablespace=y tablespaces=datat,indext
file=transport.dmp
In order to do this, the text table and all secondary objects for domain
indexes have to be in the tablespace set. Since our text table is in datat
and all secondary objects for the text index are in indext, we meet this
requirement.
Now we use a OS copy or FTP to get the datafiles for datat and indext to the
target system, and import:
imp "'/ as sysdba'" transport_tablespace=y
datafiles=/oracle/home/dbs/datat.dbf,/oracle/home/dbs/indext.dbf
tablespace=datat,indext file=t.dmp
The text table and text index are created on the target database very quickly.
The index does not have to be rebuilt -- all the data in the datafiles is
useable, since the table data rowids are not changing. All that needs to be
done is a little bit of meta-data cleanup.
User datastore procedures and preferences are not preserved by this, so
if you intend to do sync or indexing on the target system, be sure to manually
bring those over. Queries should be fine without them.
If you have rows in dr$pending for the text index when you do
the
transportable tablespace export, those ROWIDs will materialize in
dr$pending
during the import, so you won't miss anything. At worst you'll have to do a
sync on the target system.
If you are refreshing a target system, then the tablespaces already exist.
They must be dropped before importing. Normally you can do:
drop tablespace datat including contents;
However, this will not work if the tablespace contains tables with domain
indexes, or domain index secondary objects -- either datat or
indext from our
examples -- you'll get an ORA-29857 error message. You must first drop the
domain indexes manually:
drop index textidx;
then drop the tablespaces.
Rowid Logging
When something goes wrong during indexing -- a core dump or a hang, for
instance -- you can usually work around it by removing or
IGNORE-ing the faulty
documents. However, in previous versions it was very difficult to find out
which documents were faulty.
To make this easier, 9i has a new rowid printing event. When set, the rowid
of each row will be printed to the logfile. For instance:
create table logtest(a varchar2(80));
insert into logtest values ('the quick brown fox jumps over the lazy dog');
insert into logtest values ('Mr. Jock, TV quiz PhD, bags few lynx');
exec ctx_output.start_log('mylog');
exec ctx_output.add_event(CTX_OUTPUT.EVENT_INDEX_PRINT_ROWID);
create index logtestx on logtest(a)
indextype is ctxsys.context;
the log shows:
16:29:00 05/14/01 populate index: CTXSYS.LOGTESTX
16:29:00 05/14/01 Begin document indexing
16:29:00 05/14/01 INDEXING ROWID AAABklAACAAACynAAA
16:29:00 05/14/01 INDEXING ROWID AAABklAACAAACynAAB
...
The ROWID message is printed just after the datastore has selected the row from
the base table, but just before it is read or processed. This means that if a
problem occurs, the last ROWID printed in the log is the problem document, and
should be removed.
There is currently only one event. If you have a suggestion as to other
events that could be useful, let us know.
VER_CODE Column in CTX_VERSION
The view CTX_VERSION has been changed so that it now has two
columns:
VER_DICT, which shows the version of the CTXSYS data
dictionary, and VER_CODE,
which is actually a trusted callout that queries the linked-in code for the
text version. This should solve the problem of determining which Text patchset
is installed, although this has already been mostly solved with unified RDBMS
and Text patchsets in 8.1.7.
|