How interMedia Processes Text DML
This document describes how DML (updates, inserts and deletes) work for columns
which are indexed by interMedia Text. It is likely to be useful to anyone doing
performance tuning on an interMedia Text system, or looking for bottlenecks
during the DML / reindexing process.
Give
us feedback
Table of Contents
The Extensibility Framework
To understand interMedia Text indexing, it is necessary to know a little about Oracle's
Extensibility Framework, around which interMedia Text architecture is built.
The Extensibility Framework allows developers to build their own index types. The
kernel knows the names of the various user-provided routines to handle these indexes, but
nothing about the underlying structure of the indexes.
The developer must provide a specified set of routines. For example, ODCIIndexInsert,
ODCIIndexUpdate, and ODCIIndexDelete are the function definitions for creating, modifying
and removing an index entry respectively (ODCI stands for Oracle Data Cartridge
Interface). The routines are generally passed the ROWID for the row that is changing.
These user-provided functions are called by the kernel as and when necessary to do the
index processing. In normal usage, extensibility functions run in a separate address space
from the kernel, running under the EXTPROC program (communicating via interprocess
communication, or IPC). However, certain modules written internally within Oracle are
described as "trusted" callouts, and run within the kernel address space (such
trusted callouts are not available to normal developers using the extensibility
framework).
In interMedia Text 8.1.5 and 8.1.6, the query functions and the DML calls are trusted,
but the actual indexing code (which runs at SYNC time) execute in
the EXTPROC space. In 8.1.7, all calls are trusted, and no EXTPROC process is required.
Index Tables
The interMedia Text index consists of four tables, referred to as the $I, $K,
$N and $R tables respectively. The tables exist within the schema of the
index owner, and have names concatenated from "DR$", the name of the
index, and the suffix (e.g. "$I").
$I The "Token" table
This table consists of all the tokens (words) that have been indexed, together with a
binary representation of the documents they occur in, and their positions within those
documents. Each document is represented by an internal DOCID value.
$K The DOCID mapping table
This is an index-organized table (IOT) which maps internal DOCID values to external
ROWID values. Each row in the table consists of a single DOCID/ROWID pair. The IOT allows
for rapid retrieval of DOCID given the corresponding ROWID value.
$R The ROWID mapping table
This is designed for the opposite lookup from the $K table - fetching a
ROWID when you know the DOCID value. Given that ROWIDs are a fixed length (18 bytes), and
DOCIDs are allocated sequentially, it is possible to write all rowids into a binary
structure and any specific docid by reading the 18 bytes starting at position ( 1 +
(DOCID*18) ).
In practice, this binary structure is split over several rows in the $R table to
prevent any single row getting two large, but this makes no difference to the principle.
$N Negative row table
This contains a list of deleted DOCID values, which is used (and cleaned up) by the
index optimization process.
Inserts
When a new record is inserted into a table with an interMedia index, the appropriate
index creation routine is called by the kernel. The index creation routine creates a row
in the DR$PENDING table (owned by CTXSYS), containing the rowid of the new row. No
other processing is done at this time, so the indexes have not yet been updated to reflect
the new information (this is done at SYNC time - see later).
Note that the kernel does not check for NULL values - an entry will be made in DR$PENDING
even if the data value to be indexed by interMedia Text is null.
The row inserted in DR$PENDING is in the same commit unit as the new data
inserted into the base table, so they will both be either committed or rolled back
together.
Under certain circumstances, extra SQL statements will be executed to load the index
information cache - see note 1.
An extra step is needed when there is already a similar row in DR$PENDING. In
this case, there will be unique index violation, and instead a row is inserted into CTXSYS.DR$WAITING.
The reasoning behind this is that the row in DR$PENDING may already be being
processed by an index sync. If this is true, then we must be sure that the data is
reindexed again at a later date.
Deletes
When an indexed row is deleted, the corresponding row in the $K table is
immediately deleted. At the same time, a row (containing the index id and docid) is
inserted in the DR$DELETE table owned by CTXSYS, and a row (containing just ROWID)
is inserted into the $N table.
These three events are committed when the user commits his delete.
Removing the row from $K means that functional lookups in the index will
not the deleted row (see Note 2). Adding a row into DR$DELETE
means that normal index lookups will not find out (see queries, later), and enables the
commit callback to delete the row from the $R table. The row in the $N table
will be used during index optimization to remove unwanted DOCIDs from the $I table.
A final stage is to register a "commit callback" for this index. This is an
instruction to the kernel to call a specific interMedia Text routine at commit time. There
only needs to be one such callback per index, so if one is already registered for this
index within this transaction, there is no need to do this.
Note: if the data item is inserted and deleted in the same commit unit, then there will
be no row in the $K at the start of this process. In that case, there is no
need to go through the rest of the process.
Updates
Updates are basically treated as a delete followed by an insert. The record is deleted
as in the section above, then the rowid for the record is inserted into DR$PENDING
(or maybe DR$WAITING) as described in the INSERTS section
Commit Time
At commit time, our "commit callback" will be invoked, getting passed the
internal index id for the index to be updated.
The callback will fetch all the docids from DR$DELETE for the index id in
question.
For each docid, the callback will perform a LOB piecewise update of the $R
table, setting the rowid string to nulls.
It will then delete all the rows from DR$DELETE for this index, and deregister
the callback.
Query Time
There are two sorts of index lookup used in interMedia Text - normal and functional
lookups. The normal lookup effectively says "give me all the rowids that satisfy my
text criteria", whereas the functional lookup says "does row satisfy my text
criteria?" The first of these (normal lookup) fetches a set of docids from the $I
table, then uses the $R table to convert them to rowid values.
If our current session has deleted a record, but not committed the delete, then the $R
table will not yet have been modified. Therefore, during a normal lookup, the index lookup
code must check DR$DELETE, and remove any unwanted DOCID values that it finds in
this table before converting these values to rowids using the $R table.
This ONLY applies to records modified in our own session - if other sessions have made
modifications but not committed, those modifications are invisible to us anyway. And once
they have committed, the $R table will have had the old DOCIDs nulled out.
In the case of a functional lookup, there is no need for any special processing.
Functional lookup uses the $K table, and this table is updated immediately the
record is changed.
Index Synchronization
Index synchronization (sync, for short) occurs when a user executes the SQL statement
ALTER INDEX indexname REBUILD ONLINE PARAMETERS ('sync') or in 8.1.6 or later calls a
PL/SQL sync routine.
Sync looks in DR$PENDING and DR$WAITING for rowids of records to be
updated. Rowids from these two tables are combined.
For each rowid, a new DOCID value is assigned. The data is indexed via the indexing
pipeline (which will not be covered in detail here) and the resulting token, DOCID, and
word position information will be inserted into the $I table. A new row is inserted
in the $K table containing the DOCID/ROWID pair, and the $R data is extended
via a LOB piecewise write to the correct 18 character string.
Implications
When a record is deleted, the index change is immediate. That is, your own session will
no longer find anything in that record from the moment you make the change, and other
users will not find it as soon as you have committed.
Inserts - and by implication updates - are different. The new information will not be
visible to text searches until an index sync has occurred.
The most important affect of this is on updates. If you make a minor alteration to a
document, then that document effectively becomes invisible to searches until an index sync
occurs. Application developers should bare this in mind.
Notes
Note 1 - loading the index detail cache
The first time that the Oracle kernel deals with an interMedia Text index, it will load
an internal cache with various information about the index - such as the filter used, the
section groups, stopwords, etc. For this reason, in a trace you will sometimes see a bunch
of extra SQL statements dealing with tables such as DR$INDEX, DR$INDEX_OBJECT,
etc. You can see this by creating a new index, then performing an insert to the table with
SQL_TRACE switched on. Note that this only happens the first time the index is used after
creation, or after restarting the database. Subsequent, separate sessions do not need to
reload this information - it is available in the SGA to all sessions.
It may seem a little odd that information like stopwords is loaded during an insert,
when it is only needed during indexing or querying, but the logic seems to be that since
we have to find some information about the index, we may as well fetch it all.
Note 2 - removing rows from $K on delete
Actually, when a record has really been deleted, there is no way we will ever get to do
a functional lookup anyway. For the kernel to do a functional lookup, it has to find the
row via some other criteria, and if it has been deleted then that's not going to happen.
However, when we do an update, this is described as a "delete followed by
insert", and in this case the physical row has not been deleted, so the
argument does apply.
Roger Ford
Last Modified 27-March-2000
|