Setting up an Index Maintenance Regime for Oracle Text

Introduction

This paper looks at the process of maintaining an Oracle Text index, and explores the options available to DBAs to maintain these indexes.

What is Index Maintenance?

Text Indexes are different from standard ("B-tree") Oracle indexes, in that they are not normally transactional. When you insert a record into a table, this merely schedules an update to the index - it doesn't actually change the index. The indexes are only modified when you run an index synchronization ("sync") procedure.

The frequency with which such syncs are run has an effect on the efficiency of the index. Thus it is necessary to have an index maintenance regime, in order to keep the index working efficiently.

Oracle Text ships with defaults which are suitable for setting up test tables and indexes. However, for a production system, you will need to spend some time gathering requirements, and setting up a suitable regime for maintaining your indexes.

Index Latency

The first decision you need to make is the degree of latency - that is, how up-to-date your indexes must be.

When a record is inserted or updated, the index is not updated immediately. Instead, reindexing of that record is scheduled for the next time a SYNC is run on that index.

Note, however, that in the case of a delete - or an update, which is implemented by deletion followed by the a new insert - the information is effectively deleted from the index immediately the change is comitted. So even if you only change a small part of a document, that document will not be searchable until after the next sync.

It is therefore tempting to conclude that indexes should be updated in "near-realtime". However, it is important to understand the consequences of this "easy" choice, and then perhaps to make a better decision based on this understanding.

Index Optimization

As we shall see later, indexes can be in a non-optimal state, and will get gradually less optimal as syncs are run. It is therefore necessary to run periodic index optimizations to improve the efficiency of the indexes. Our second decision, therefore, is how often to run optimization, and what types of optimization to use.

In order to understand these choices, we must look a bit deeper at what non-optimality of a text index means, and how it comes about.

To get the most out of the next part of this paper, you really need to understand how Oracle Text processes DML. The rest of this document will assume you are familiar with the concepts covered by that paper.

Text Index Non-Optimality

There are two ways a Text index can be non-optimal:
  1. It can contain garbage
  2. It can be fragmented
Garbage in indexes is due to the 'lazy delete' method used during DML (see link above). Left-over index information is not deleted from the indexes at commit or sync time - it is left around until it is removed by an optimization run.

Index fragmentation is where too many rows are used for any single word in the $I index table. Ideally, we would like to see a single row per word. In practice, this is not possible, since for efficiency reasons the index information - the TOKEN_INFO BLOB column - is restricted to 4000 bytes (so it can be stored "in-line").

So we may need more than one row per word, but a minimally fragmented index will have the following characteristics:

  1. There will only be a single row in the $I table for any word which has less than (approximately) 4000 bytes of index information (the TOKEN_INFO column)
  2. If there is more than 4000 bytes of TOKEN_INFO, it will be held in the minimum number of 4000 byte chunks.
Indexes can become more fragmented - many TOKEN_INFO lengths of less than 4000 bytes - for a number of reasons:
  1. During indexing, the cached $I entries are flushed to disk each time the internal index memory is exhausted. Each flush starts a new $I row for each word.
  2. Parallel indexing threads will write their own rows to $I
  3. Processing of DML writes new rows to the $I rather than updating the existing rows.

Index Creation - review

The first way to avoid fragmentation in an index is to avoid building it in when you first create the index.

Index Memory

As mentioned above, cached $I entries are flushed to disk each time the indexing memory is exhausted. The default index memory at installation is a mere 12MB, which is very low. Users can specify up to 50MB at index creation time, but this is still pretty low.

This would be done by a CREATE INDEX statement something like:

                                       
CREATE INDEX myindex ON mytable(mycol)
 INDEXTYPE IS ctxsys.context
 PARAMETERS ('index memory 50M');

                                    
Allow index memory settings above 50MB, the CTXSYS user must first increase the value of the MAX_INDEX_MEMORY parameter, like this:
                                       
 begin
    ctx_adm.set_parameter('max_index_memory', '500M');
  end;

                                    
The setting for index memory should never be so high as to cause paging, as this will have a serious effect on indexing speed. On smaller dedicated systems, it is sometimes advantageous to temporarily decrease the amount of memory consumed by the Oracle SGA (for example by decreasing DB_CACHE_SIZE and/or SHARED_POOL_SIZE) during the index creation process. Once the index has been created, the SGA size can be increased again to improve query performance.

Parallel Indexing

Parallel Indexing generally leads to shorter indexing times. However, it can cause fragmentation of the indexes for two reasons:
  1. Your "index memory" setting applies to each parallel indexing process. Therefore, if you have 1GB of free memory and four indexing processes, you must set index memory to 256MB - leading to the fragmentation discussed above.
  2. Each index process does its own writes to the $I index tables. Let's imagine a word which appears in only four documents, but each document is indexed by a different indexing process. In this case, we'll get four rows in the $I table, even though there is very little actual information stored in them.

Index Synchronization

Index Synchronization (or just "sync") is the process by which the indexes are brought up-to-date with the data in the table.

There is more than one way of running a sync on an index. In previous versions, there was a utility called CTXSRV. This has been deprecated for some time, and is no longer provided in version 10g.

Your options are now

  1. Alter Index ... Rebuild Online Parameters ('sync')
  2. The PL/SQL procedure CTX_DDL.Sync_Index
  3. Using the " Sync (On Commit)" or " Sync (Every ...)" clauses in the Parameters string when creating the index (available only in releases from 10g).

The Alter Index ... Rebuild command is not recommended (and is deprecated), since certain non-fatal indexing errors can cause the index to be marked as UNUSABLE - requiring a complete rebuild.

So the choice is simply whether to use Sync (On Commit), or Sync (Every ...) or when to run the Sync_Index PL/SQL utility.

Sync (On Commit) new in Oracle 10g

Creating an index with this option might look like this:
                                       
 create index my_index on my_table (text_col)
  indextype is ctxsys.context
  parameters ('lexer my_lexer sync (on commit)');

                                    
This means that a sync is launched immediately after the commit occurs. The way this works is slightly unusual, in that the SQL call to commit will not return until the sync is complete, and yet the sync is done in a separate transaction from the main commit.

So from the point of view of the user calling the commit, the index will appear to be updated as soon as the commit completes. However, another user might see a period when the rows have been committed to the table, but the index has not yet been updated. The period during which this occurs will depend on the amount of data to be indexed - that is, how long the sync takes. Using Sync (On Commit) can be bad for index performance if small numbers of rows are committed at a time. On the other hand, if the commit is only called every few thousand rows, this might be a reasonable option.

Sync (Every ...) new in Oracle 10g

This option allows you to automatically schedule syncs for your index. The SQL for this might look like:
                                       
 create index my_index on my_table (text_col)
  indextype is ctxsys.context
  parameters ('lexer my_lexer sync (every "sysdate+(1/24)")');

                                    
The "interval" string is slightly obscure, but is taken from Oracle's job scheduling syntax. In this case, it means every hour - "sysdate" is the current time, and we want the next job to run after one twenty-fourth of a day. Every minute would be "sysdate+(1/24/60)".

Transactional new in Oracle 10g

The transactional keyword does not directly relate to index syncs, but more to how queries are run.

When this keyword is used, every query will check the "pending" queue for the text index. If there are rows which have been committed but not yet indexed, the query will check each of these rows to see whether they satisfy the query criteria, before looking in the actual index. This gives the appearance of a fully transactional index.

Using transactional means you can put off the actual indexing of the data, which can be beneficial in terms of index fragmentation. However, every query must search all of the pending rows, which can itself have a very deleterious effect on index performance. The searching is done by effectively indexing each document "on the fly". This will be particularly expensive if the documents are formatted, and must be passed through a filter during the indexing process.

Availability Versus Index Fragmentation

There are two reasons you might want to sync your indexes frequently. The first is that new records are only searchable once the index has been sync'd. The less obvious reason is that updated records cannot be found at all by a text search until the index has been sync'd.

So even if you only change one word in a document, that document cannot be found by any of the unchanged words until a sync has occurred. This is because an update is implemented as internally as "delete followed by insert", and deletions from the index are instant, whereas inserts are deferred until sync time.

It is therefore tempting to use SYNC (ON COMMIT), or set SYNC (EVERY ...) or a manual sync to run extremely regularly. But it's important to realise the effect this will on index fragmentation.

Fragmentation Revisited

As discussed above under "Text Index Non-Optimality" index fragmentation occurs when index records are written in smaller-than-ideal chunks.

If you sync your index after adding (or changing) just one record, then every unique word in your document will have its own new row in the $I table.

This may not matter too much if you're only adding or changing a few hundred documents a day. However if you're loading or updating many thousands of documents per hour, your indexes will rapidly become badly fragmented which will have a big impact on query performance. You can improve this by scheduling regular optimizations (see next section), but you may get to a situation where you just can't run optimizations often enough (or for long enough) to prevent query performance deterioration.

Therefore in many situations it is better to delay syncs. If you call the sync procedure for your index only once an hour, the sync will be able to process all of the documents added or modified in the last hour in one go. This will mean that much larger chunks are written at a time to the $I index table, and hence there is far less fragmentation.

Note that frequency of sync will not have any effect on garbage build-up in the $I table due to updated or deleted rows - so we would want to run an optimize eventually just to clear this out.

Sync Index Memory

Sync uses index memory in the same way as CREATE INDEX - batching up updates until it runs out of memory and flushing them to the $I table. Since it's usually dealing with relatively small numbers of rows, the default index memory setting is usually sufficient. However if you are dealing with lots of changes at a time, you should consider increasing the memory by modifying the parameter DEFAULT_INDEX_MEMORY (this must be done by CTXSYS by means of the CTX_ADM procedure) or by specifying a larger value for MEMORY in the SYNC_INDEX call (versions 9.2 or later).

Methods for Calling CTX_DDL.SYNC_INDEX

So we've decided that we are going to sync our indexes once per hour. How do actually do that? There are a number of methods:
  1. Manually.

    We could log into SQL*Plus and run the command EXEC CTX_DDL.SYNC_INDEX('myindexname') once per hour. Naturally this is not very satisfactory for anything other than short-term testing!

  2. SYNC (EVERY ...)

    This clause on an index creation

  3. DBMS_JOB

    The DBMS_JOB package allows us to schedule a regular call to a There are two reasons you may want to do a complete rebuild of a text PL/SQL package. This is the 'recommended' solution, although non-PL/SQL experts may find the package a little challenging to use.

  4. CTX_SCHEDULE

    This package is not part of the Oracle Text product, but can be downloaded from OTN. It is basically just a wrapper around DBMS_JOB, designed to simplify the setting up of an index There are two reasons you may want to do complete rebuild of a text maintenance regime for non-experts.

  5. Concurrent Manager

    The Concurrent Manager is a tool provided with the Oracle e-Business Suite. If you are running this suite, then the Concurrent Manager may be your best way to schedule syncs.

Index Optimization

As discussed above, index non-optimality can take two forms - garbage in the $I table, and fragmentation of the $I table. There are also two forms of optimization - "optimize fast" and "optimize full". Fast optimization tackles only the fragmentation, whereas full optimization tackles both fragmentation and garbage cleanup.

The other difference is that FAST optimization must always be run to completion, whereas full optimization can be time limited - for example you might decide to run it once a day, at your system's quietest time - for three hours at a time.

Time limited optimization will always pick up where the last run finished, so there is no danger that you will be always optimizing one part of the database, and never touching the rest.

Does optimization stop me searching/updating? Will it slow my system?

No. Index optimization does not prevent searching, updates, or syncs.

It will, however, have an effect on your system as it uses resources. Optimization basically fetches rows from the $I table, a token at a time, reassembles them into the minimum possible number of rows (after eliminating garbage if running in "full" mode), and writes them back. The process does not use much memory, but it does use a lot of I/O bandwidth, and a moderate amount of CPU to decode the compressed index data and recompress it again. It also tends to cause a large amount of redo log to be generated by the database, since there is a lot of modification of table data.

How can I tell if my index is fragmented?

You can use the CTX_REPORT utility to estimate the degree of fragmentation and the amount of garbage.

So how often should I optimize?

It is very hard to give an absolute answer to this, even knowing the variables such as number of new or updated records per day.

The most effective regime can only be found by experimentation in your particular environment. However, we can give some guidelines based on customer experiences:

If you have a large number of updates (thousands per hour) and are running "SYNC (ON COMMIT)" or scheduling syncs every minute or so, then you will need to run optimize pretty much continuously. You can do this by, for example, scheduling a time-limited full optimize to run for 50 minutes each hour, or by run running fast optimize in a loop. If you are doing the latter, and your application does updates or deletes as well as inserts, you must make sure you schedule a full optimize occasionally (perhaps once a week) in order to clean up garbage.

For large numbers of updates (thousands per hour), but syncs running every 20 minutes or less, you should run a full optimize once a day, preferably at the quietest time of day. This can be a time-limited full optimize, or you can run fast optimize, let it run to completion, then run a full optimize weekly for garbage collection.

For less updates, or less frequent syncs, you can probably get away with running an optimize once a week. However, you should take care to monitor your query performance to see if it is dropping off towards the end of the week.

Rebuilding the $X index

As part of your periodic optimizations, it can sometimes be beneficial to rebuild the $X index on the $I index table. The $X index is a standard Oracle B-tree index, and can be rebuilt with the following command:
                                       
 ALTER INDEX DR$index_name$X REBUILD;

                                    
- where index_name is the name of the Oracle Text index.

This will de-fragment the $X index for fastest access. If you want to be able to continue to run queries while this is happening, you should append the ONLINE keyword.

It is not necessary to do this often - once a month is probably sufficient for even the most intensively updated index.

Rebuilding Indexes

There are two reasons you may want to do a complete rebuild of a text index:
  1. You wish to change some index options, and these require a complete rebuild. Some options can be changed without a full rebuild - for example you can add stopwords - but most changes cannot be implemented without a full rebuild.
  2. You want to periodically rebuild the index for house-keeping purposes - for example if the storage for the $I table is growing over time.
The simplest way to rebuild the index is simply to drop it and recreate it. However, this means that the index is not available while it is being recreated.

It is not possible to fully rebuild a Text index while it is online. Therefore, if you need to do this you must create a new and separate index. But then you hit the problem of only being able to create a single index (of any particular INDEXTYPE) on a column.

A solution to this can often be found by using the USER_DATASTORE or MULTI_COLUMN datastore. These both allow you to create an index on one column, but to fetch the actual data to be indexed from another column.

So we might create a table as follows:

                                       
CREATE TABLE mytable (
     text CLOB,
     dummy1 CHAR,
     dummy2 CHAR);

                                    
We can then create the index on dummy1, but fetch the data from text. Queries would need to take the form
                                       
... WHERE CONTAINS (dummy1, 'search terms') >0

                                    
Then when we need to rebuild the index, we can create a whole new index on dummy2, once again fetching the actual data to index from text.

When the index is complete, the application can start issuing new queries:

                                       
... WHERE CONTAINS (dummy2, 'search terms') > 0

                                    

Conclusion

It is often tempting for a system designer to take customers' word that they need "immediate updates" without further query. However doing this and then not planning for the consequences can lead to a seriously underperforming system.

Instead, it is essential to find out how often the indexes REALLY need to be sync'd, and then plan for this and a realistic optimization schedule, bearing in mind the amount of changes, the system uptime requirements, and any expected quiet periods. Needless to say, such a regime cannot be tested on a small amount of data, then immediately moved into a production system. You should aim to test the impact of realistic update loads well before the system goes production.


Last Modified 3 October 2003 by Roger Ford


Left Curve
Popular Downloads
Right Curve
Untitled Document
Left Curve
More Database Downloads
Right Curve