How interMedia Processes Text DML

This document describes how DML (updates, inserts and deletes) work for columns which are indexed by interMedia Text. It is likely to be useful to anyone doing performance tuning on an interMedia Text system, or looking for bottlenecks during the DML / reindexing process.

Give us feedback

Table of Contents

The Extensibility Framework

To understand interMedia Text indexing, it is necessary to know a little about Oracle's Extensibility Framework, around which interMedia Text architecture is built.

The Extensibility Framework allows developers to build their own index types. The kernel knows the names of the various user-provided routines to handle these indexes, but nothing about the underlying structure of the indexes.

The developer must provide a specified set of routines. For example, ODCIIndexInsert, ODCIIndexUpdate, and ODCIIndexDelete are the function definitions for creating, modifying and removing an index entry respectively (ODCI stands for Oracle Data Cartridge Interface). The routines are generally passed the ROWID for the row that is changing.

These user-provided functions are called by the kernel as and when necessary to do the index processing. In normal usage, extensibility functions run in a separate address space from the kernel, running under the EXTPROC program (communicating via interprocess communication, or IPC). However, certain modules written internally within Oracle are described as "trusted" callouts, and run within the kernel address space (such trusted callouts are not available to normal developers using the extensibility framework).

In interMedia Text 8.1.5 and 8.1.6, the query functions and the DML calls are trusted, but the actual indexing code (which runs at SYNC time) execute in the EXTPROC space. In 8.1.7, all calls are trusted, and no EXTPROC process is required.

Index Tables

The interMedia Text index consists of four tables, referred to as the $I, $K, $N and $R tables respectively. The tables exist within the schema of the index owner, and have names concatenated from " DR$", the name of the index, and the suffix (e.g. " $I").

$I The "Token" table

This table consists of all the tokens (words) that have been indexed, together with a binary representation of the documents they occur in, and their positions within those documents. Each document is represented by an internal DOCID value.

$K The DOCID mapping table

This is an index-organized table (IOT) which maps internal DOCID values to external ROWID values. Each row in the table consists of a single DOCID/ROWID pair. The IOT allows for rapid retrieval of DOCID given the corresponding ROWID value.

$R The ROWID mapping table

This is designed for the opposite lookup from the $K table - fetching a ROWID when you know the DOCID value. Given that ROWIDs are a fixed length (18 bytes), and DOCIDs are allocated sequentially, it is possible to write all rowids into a binary structure and any specific docid by reading the 18 bytes starting at position ( 1 + (DOCID*18) ).

In practice, this binary structure is split over several rows in the $R table to prevent any single row getting two large, but this makes no difference to the principle.

$N Negative row table

This contains a list of deleted DOCID values, which is used (and cleaned up) by the index optimization process.

Inserts

When a new record is inserted into a table with an interMedia index, the appropriate index creation routine is called by the kernel. The index creation routine creates a row in the DR$PENDING table (owned by CTXSYS), containing the rowid of the new row. No other processing is done at this time, so the indexes have not yet been updated to reflect the new information (this is done at SYNC time - see later).

Note that the kernel does not check for NULL values - an entry will be made in DR$PENDING even if the data value to be indexed by interMedia Text is null.

The row inserted in DR$PENDING is in the same commit unit as the new data inserted into the base table, so they will both be either committed or rolled back together.

Under certain circumstances, extra SQL statements will be executed to load the index information cache - see note 1.

An extra step is needed when there is already a similar row in DR$PENDING. In this case, there will be unique index violation, and instead a row is inserted into CTXSYS.DR$WAITING. The reasoning behind this is that the row in DR$PENDING may already be being processed by an index sync. If this is true, then we must be sure that the data is reindexed again at a later date.

Deletes

When an indexed row is deleted, the corresponding row in the $K table is immediately deleted. At the same time, a row (containing the index id and docid) is inserted in the DR$DELETE table owned by CTXSYS, and a row (containing just ROWID) is inserted into the $N table.

These three events are committed when the user commits his delete.

Removing the row from $K means that functional lookups in the index will not the deleted row (see Note 2). Adding a row into DR$DELETE means that normal index lookups will not find out (see queries, later), and enables the commit callback to delete the row from the $R table. The row in the $N table will be used during index optimization to remove unwanted DOCIDs from the $I table.

A final stage is to register a "commit callback" for this index. This is an instruction to the kernel to call a specific interMedia Text routine at commit time. There only needs to be one such callback per index, so if one is already registered for this index within this transaction, there is no need to do this.

Note: if the data item is inserted and deleted in the same commit unit, then there will be no row in the $K at the start of this process. In that case, there is no need to go through the rest of the process.

Updates

Updates are basically treated as a delete followed by an insert. The record is deleted as in the section above, then the rowid for the record is inserted into DR$PENDING (or maybe DR$WAITING) as described in the INSERTS section

Commit Time

At commit time, our "commit callback" will be invoked, getting passed the internal index id for the index to be updated.

The callback will fetch all the docids from DR$DELETE for the index id in question.

For each docid, the callback will perform a LOB piecewise update of the $R table, setting the rowid string to nulls.

It will then delete all the rows from DR$DELETE for this index, and deregister the callback.

Query Time

There are two sorts of index lookup used in interMedia Text - normal and functional lookups. The normal lookup effectively says "give me all the rowids that satisfy my text criteria", whereas the functional lookup says "does row satisfy my text criteria?" The first of these (normal lookup) fetches a set of docids from the $I table, then uses the $R table to convert them to rowid values.

If our current session has deleted a record, but not committed the delete, then the $R table will not yet have been modified. Therefore, during a normal lookup, the index lookup code must check DR$DELETE, and remove any unwanted DOCID values that it finds in this table before converting these values to rowids using the $R table.

This ONLY applies to records modified in our own session - if other sessions have made modifications but not committed, those modifications are invisible to us anyway. And once they have committed, the $R table will have had the old DOCIDs nulled out.

In the case of a functional lookup, there is no need for any special processing. Functional lookup uses the $K table, and this table is updated immediately the record is changed.

Index Synchronization

Index synchronization (sync, for short) occurs when a user executes the SQL statement ALTER INDEX indexname REBUILD ONLINE PARAMETERS ('sync') or in 8.1.6 or later calls a PL/SQL sync routine.

Sync looks in DR$PENDING and DR$WAITING for rowids of records to be updated. Rowids from these two tables are combined.

For each rowid, a new DOCID value is assigned. The data is indexed via the indexing pipeline (which will not be covered in detail here) and the resulting token, DOCID, and word position information will be inserted into the $I table. A new row is inserted in the $K table containing the DOCID/ROWID pair, and the $R data is extended via a LOB piecewise write to the correct 18 character string.

Implications

When a record is deleted, the index change is immediate. That is, your own session will no longer find anything in that record from the moment you make the change, and other users will not find it as soon as you have committed.

Inserts - and by implication updates - are different. The new information will not be visible to text searches until an index sync has occurred.

The most important affect of this is on updates. If you make a minor alteration to a document, then that document effectively becomes invisible to searches until an index sync occurs. Application developers should bare this in mind.


Notes

Note 1 - loading the index detail cache

The first time that the Oracle kernel deals with an interMedia Text index, it will load an internal cache with various information about the index - such as the filter used, the section groups, stopwords, etc. For this reason, in a trace you will sometimes see a bunch of extra SQL statements dealing with tables such as DR$INDEX, DR$INDEX_OBJECT, etc. You can see this by creating a new index, then performing an insert to the table with SQL_TRACE switched on. Note that this only happens the first time the index is used after creation, or after restarting the database. Subsequent, separate sessions do not need to reload this information - it is available in the SGA to all sessions.

It may seem a little odd that information like stopwords is loaded during an insert, when it is only needed during indexing or querying, but the logic seems to be that since we have to find some information about the index, we may as well fetch it all.

Note 2 - removing rows from $K on delete

Actually, when a record has really been deleted, there is no way we will ever get to do a functional lookup anyway. For the kernel to do a functional lookup, it has to find the row via some other criteria, and if it has been deleted then that's not going to happen. However, when we do an update, this is described as a "delete followed by insert", and in this case the physical row has not been deleted, so the argument does apply.

 


Roger Ford
Last Modified 27-March-2000


Left Curve
Popular Downloads
Right Curve
Untitled Document
Left Curve
More Database Downloads
Right Curve