Setting up an Index Maintenance Regime for Oracle Text
IntroductionThis paper looks at the process of maintaining an Oracle Text index, and explores the options available to DBAs to maintain these indexes.
What is Index Maintenance?Text Indexes are different from standard ("B-tree") Oracle indexes, in that they are not normally transactional. When you insert a record into a table, this merely schedules an update to the index - it doesn't actually change the index. The indexes are only modified when you run an index synchronization ("sync") procedure.
The frequency with which such syncs are run has an effect on the efficiency of the index. Thus it is necessary to have an index maintenance regime, in order to keep the index working efficiently.
Oracle Text ships with defaults which are suitable for setting up test tables and indexes. However, for a production system, you will need to spend some time gathering requirements, and setting up a suitable regime for maintaining your indexes.
Index LatencyThe first decision you need to make is the degree of latency - that is, how up-to-date your indexes must be.
When a record is inserted or updated, the index is not updated immediately. Instead, reindexing of that record is scheduled for the next time a SYNC is run on that index.
Note, however, that in the case of a delete - or an update, which is implemented by deletion followed by the a new insert - the information is effectively deleted from the index immediately the change is comitted. So even if you only change a small part of a document, that document will not be searchable until after the next sync.
It is therefore tempting to conclude that indexes should be updated in "near-realtime". However, it is important to understand the consequences of this "easy" choice, and then perhaps to make a better decision based on this understanding.
Index OptimizationAs we shall see later, indexes can be in a non-optimal state, and will get gradually less optimal as syncs are run. It is therefore necessary to run periodic index optimizations to improve the efficiency of the indexes. Our second decision, therefore, is how often to run optimization, and what types of optimization to use.
In order to understand these choices, we must look a bit deeper at what non-optimality of a text index means, and how it comes about.
To get the most out of the next part of this paper, you really need to understand how Oracle Text processes DML. The rest of this document will assume you are familiar with the concepts covered by that paper.
Text Index Non-OptimalityThere are two ways a Text index can be non-optimal:
Index fragmentation is where too many rows are used for any single word in the $I index table. Ideally, we would like to see a single row per word. In practice, this is not possible, since for efficiency reasons the index information - the TOKEN_INFO BLOB column - is restricted to 4000 bytes (so it can be stored "in-line").
So we may need more than one row per word, but a minimally fragmented index will have the following characteristics:
Index Creation - reviewThe first way to avoid fragmentation in an index is to avoid building it in when you first create the index.
Index MemoryAs mentioned above, cached $I entries are flushed to disk each time the indexing memory is exhausted. The default index memory at installation is a mere 12MB, which is very low. Users can specify up to 50MB at index creation time, but this is still pretty low.
This would be done by a CREATE INDEX statement something like:
CREATE INDEX myindex ON mytable(mycol) INDEXTYPE IS ctxsys.context PARAMETERS ('index memory 50M');Allow index memory settings above 50MB, the CTXSYS user must first increase the value of the MAX_INDEX_MEMORY parameter, like this:
begin ctx_adm.set_parameter('max_index_memory', '500M'); end;The setting for index memory should never be so high as to cause paging, as this will have a serious effect on indexing speed. On smaller dedicated systems, it is sometimes advantageous to temporarily decrease the amount of memory consumed by the Oracle SGA (for example by decreasing DB_CACHE_SIZE and/or SHARED_POOL_SIZE) during the index creation process. Once the index has been created, the SGA size can be increased again to improve query performance.
Parallel IndexingParallel Indexing generally leads to shorter indexing times. However, it can cause fragmentation of the indexes for two reasons:
Index SynchronizationIndex Synchronization (or just "sync") is the process by which the indexes are brought up-to-date with the data in the table.
There is more than one way of running a sync on an index. In previous versions, there was a utility called CTXSRV. This has been deprecated for some time, and is no longer provided in version 10g.
Your options are now
So the choice is simply whether to use
Sync (On Commit) new in Oracle 10gCreating an index with this option might look like this:
create index my_index on my_table (text_col) indextype is ctxsys.context parameters ('lexer my_lexer sync (on commit)');This means that a sync is launched immediately after the commit occurs. The way this works is slightly unusual, in that the SQL call to commit will not return until the sync is complete, and yet the sync is done in a separate transaction from the main commit.
So from the point of view of the user calling the commit, the index will appear to be updated as soon as the commit completes. However, another user might see a period when the rows have been committed to the table, but the index has not yet been updated. The period during which this occurs will depend on the amount of data to be indexed - that is, how long the sync takes. Using
Sync (Every ...) new in Oracle 10gThis option allows you to automatically schedule syncs for your index. The SQL for this might look like:
create index my_index on my_table (text_col) indextype is ctxsys.context parameters ('lexer my_lexer sync (every "sysdate+(1/24)")');The "interval" string is slightly obscure, but is taken from Oracle's job scheduling syntax. In this case, it means every hour - "sysdate" is the current time, and we want the next job to run after one twenty-fourth of a day. Every minute would be
Transactional new in Oracle 10gThe
When this keyword is used, every query will check the "pending" queue for the text index. If there are rows which have been committed but not yet indexed, the query will check each of these rows to see whether they satisfy the query criteria, before looking in the actual index. This gives the appearance of a fully transactional index.
Availability Versus Index FragmentationThere are two reasons you might want to sync your indexes frequently. The first is that new records are only searchable once the index has been sync'd. The less obvious reason is that updated records cannot be found at all by a text search until the index has been sync'd.
So even if you only change one word in a document, that document cannot be found by any of the unchanged words until a sync has occurred. This is because an update is implemented as internally as "delete followed by insert", and deletions from the index are instant, whereas inserts are deferred until sync time.
It is therefore tempting to use SYNC (ON COMMIT), or set SYNC (EVERY ...) or a manual sync to run extremely regularly. But it's important to realise the effect this will on index fragmentation.
Fragmentation RevisitedAs discussed above under "Text Index Non-Optimality" index fragmentation occurs when index records are written in smaller-than-ideal chunks.
If you sync your index after adding (or changing) just one record, then every unique word in your document will have its own new row in the $I table.
This may not matter too much if you're only adding or changing a few hundred documents a day. However if you're loading or updating many thousands of documents per hour, your indexes will rapidly become badly fragmented which will have a big impact on query performance. You can improve this by scheduling regular optimizations (see next section), but you may get to a situation where you just can't run optimizations often enough (or for long enough) to prevent query performance deterioration.
Therefore in many situations it is better to delay syncs. If you call the sync procedure for your index only once an hour, the sync will be able to process all of the documents added or modified in the last hour in one go. This will mean that much larger chunks are written at a time to the $I index table, and hence there is far less fragmentation.
Note that frequency of sync will not have any effect on garbage build-up in the $I table due to updated or deleted rows - so we would want to run an optimize eventually just to clear this out.
Sync Index MemorySync uses index memory in the same way as CREATE INDEX - batching up updates until it runs out of memory and flushing them to the $I table. Since it's usually dealing with relatively small numbers of rows, the default index memory setting is usually sufficient. However if you are dealing with lots of changes at a time, you should consider increasing the memory by modifying the parameter DEFAULT_INDEX_MEMORY (this must be done by CTXSYS by means of the CTX_ADM procedure) or by specifying a larger value for MEMORY in the SYNC_INDEX call (versions 9.2 or later).
Methods for Calling CTX_DDL.SYNC_INDEXSo we've decided that we are going to sync our indexes once per hour. How do actually do that? There are a number of methods:
Index OptimizationAs discussed above, index non-optimality can take two forms - garbage in the $I table, and fragmentation of the $I table. There are also two forms of optimization - "optimize fast" and "optimize full". Fast optimization tackles only the fragmentation, whereas full optimization tackles both fragmentation and garbage cleanup.
The other difference is that FAST optimization must always be run to completion, whereas full optimization can be time limited - for example you might decide to run it once a day, at your system's quietest time - for three hours at a time.
Time limited optimization will always pick up where the last run finished, so there is no danger that you will be always optimizing one part of the database, and never touching the rest.
Does optimization stop me searching/updating? Will it slow my system?No. Index optimization does not prevent searching, updates, or syncs.
It will, however, have an effect on your system as it uses resources. Optimization basically fetches rows from the $I table, a token at a time, reassembles them into the minimum possible number of rows (after eliminating garbage if running in "full" mode), and writes them back. The process does not use much memory, but it does use a lot of I/O bandwidth, and a moderate amount of CPU to decode the compressed index data and recompress it again. It also tends to cause a large amount of redo log to be generated by the database, since there is a lot of modification of table data.
How can I tell if my index is fragmented?You can use the CTX_REPORT utility to estimate the degree of fragmentation and the amount of garbage.
So how often should I optimize?It is very hard to give an absolute answer to this, even knowing the variables such as number of new or updated records per day.
The most effective regime can only be found by experimentation in your particular environment. However, we can give some guidelines based on customer experiences:
If you have a large number of updates (thousands per hour) and are running "SYNC (ON COMMIT)" or scheduling syncs every minute or so, then you will need to run optimize pretty much continuously. You can do this by, for example, scheduling a time-limited full optimize to run for 50 minutes each hour, or by run running fast optimize in a loop. If you are doing the latter, and your application does updates or deletes as well as inserts, you must make sure you schedule a full optimize occasionally (perhaps once a week) in order to clean up garbage.
For large numbers of updates (thousands per hour), but syncs running every 20 minutes or less, you should run a full optimize once a day, preferably at the quietest time of day. This can be a time-limited full optimize, or you can run fast optimize, let it run to completion, then run a full optimize weekly for garbage collection.
For less updates, or less frequent syncs, you can probably get away with running an optimize once a week. However, you should take care to monitor your query performance to see if it is dropping off towards the end of the week.
Rebuilding the $X indexAs part of your periodic optimizations, it can sometimes be beneficial to rebuild the $X index on the $I index table. The $X index is a standard Oracle B-tree index, and can be rebuilt with the following command:
ALTER INDEX DR$index_name$X REBUILD;- where index_name is the name of the Oracle Text index.
This will de-fragment the $X index for fastest access. If you want to be able to continue to run queries while this is happening, you should append the ONLINE keyword.
It is not necessary to do this often - once a month is probably sufficient for even the most intensively updated index.
Rebuilding IndexesThere are two reasons you may want to do a complete rebuild of a text index:
It is not possible to fully rebuild a Text index while it is online. Therefore, if you need to do this you must create a new and separate index. But then you hit the problem of only being able to create a single index (of any particular INDEXTYPE) on a column.
A solution to this can often be found by using the USER_DATASTORE or MULTI_COLUMN datastore. These both allow you to create an index on one column, but to fetch the actual data to be indexed from another column.
So we might create a table as follows:
CREATE TABLE mytable ( text CLOB, dummy1 CHAR, dummy2 CHAR);We can then create the index on dummy1, but fetch the data from text. Queries would need to take the form
... WHERE CONTAINS (dummy1, 'search terms') >0Then when we need to rebuild the index, we can create a whole new index on dummy2, once again fetching the actual data to index from text.
When the index is complete, the application can start issuing new queries:
... WHERE CONTAINS (dummy2, 'search terms') > 0
ConclusionIt is often tempting for a system designer to take customers' word that they need "immediate updates" without further query. However doing this and then not planning for the consequences can lead to a seriously underperforming system.
Instead, it is essential to find out how often the indexes REALLY need to be sync'd, and then plan for this and a realistic optimization schedule, bearing in mind the amount of changes, the system uptime requirements, and any expected quiet periods. Needless to say, such a regime cannot be tested on a small amount of data, then immediately moved into a production system. You should aim to test the impact of realistic update loads well before the system goes production.
Last Modified 3 October 2003 by Roger Ford