interMedia Text Performance FAQ

Contents

Section 1: Queries

Section 2: Indexing

Section 3: Updates

Queries

What do we mean by "query performance" anyway?

There are generally two measures of query performance - response time (the time to get an answer to an individual query), and throughput (the number of queries that can be run in any time period, eg queries per second). These two are related, but are not the same. In a heavily loaded system, we would normally want maximum throughput, whereas in a relatively lightly loaded system, we probably want minimum response time. Also, some applications will require a query to deliver all its hits to the user, whereas others may just require the first 20 hits from an ordered set. It is important to distinguish between these two scenarios.

[Back to Top]

What is the fastest type of text query?

The fastest type of query will meet the following conditions:

  • Single CONTAINS clause
  • No other conditions in the WHERE clause
  • EITHER: No "ORDER BY" clause at all, OR just "ORDER BY SCORE(n) DESCEND" together with the "FIRST_ROWS" hint
  • Only the first page of results is returned (eg. the first 10 or 20 hits).

[Back to Top]

How does the size of my data affect queries?

The speed at which the text index can deliver ROWIDs is not affected by the actual size of the data, but by the size of the Token Table which holds the list of words, and information about the rows in which they appear. Text query speed will be related to the number of rows that must be fetched from this Token Table, and the length of each row.

Hence it should be nearly as fast to find a rare word in a large document set as it is to find a common word - or many uncommon words - in a smaller document set.

[Back to Top]

How does the source type of my data affect queries?

The format of the documents (eg. plain ascii text, HTML or Microsoft Word) should make no difference to query speed. The documents are filtered to plain text at indexing time, not query time.

The "cleanliness" of the data will make a difference. Spell-checked and sub-edited text for publication will tend to have a much smaller total vocabulary (and therefore size of token table) than informal text such as emails, which will contain many spelling errors and abbreviations which bloat the token table.

[Back to Top]

Does sorting the results slow a text-only query?

It most certainly does - except when sorting by relevence, or when the table is partitioned by the sort column (of which more later). If you sort by any other criteria - such as date, or price - then Oracle must resolve the entire query before it can pass back any results to the user application. If there is no sorting, then Oracle can return results as it finds them, which is very much quicker in the common case where the application needs to only display a page of results at a time.

There is a special case - sorting by relevence (SCORE(n)) can be extremely quick if the FIRST_ROWS hint is used. In this case, the kernel allows the text engine to perform a high speed internal sort when fetching from the text index tables.

An example of such a query:

                                     
 SELECT /*+ FIRST_ROWS */ ID, SCORE(1), TEXT FROM MYTABLE
    WHERE CONTAINS (TEXT, 'searchterm', 1) > 0
    ORDER BY SCORE(1) DESC;

                                  

Note that for this to work efficiently, there must be no other criteria in the WHERE clause other than a single CONTAINS.

Where the query contains non-text conditions, it may be that the query is "driven" from a B-tree index on a non-text column. In this case, an ORDER BY on this column will not be an overhead, since the B-tree index can deliver ROWIDs in sorted order. Such a query will use a functional lookup in the text index, which can be expensive if many rows are fetched from the B-tree index.

[Back to Top]

What is a "functional" versus an "indexed" lookup?

There are two ways the kernel can query the text index. In the first - and most common - case, the kernel asks the text index for all the rowids that satisfy a particular text search. These rowids are returned in blocks. In the second, the kernel passes individual rowids to the text index, and asks whether that particular rowid satisfies a certain text criterion. The second is known as a functional lookup, and is most commonly done where there is a very selective structured clause, so that only a few rowids must be checked against the text index.

An example of a search where a functional lookup may be used:

                                     
   SELECT ID, SCORE(1), TEXT FROM MYTABLE
    WHERE START_DATE = '21 Oct 1992'         <- highly selective
    AND CONTAINS (TEXT, 'commonword') > 0    <- unselective

                                  
[Back to Top]

What tables are involved in queries?

All queries will need to look at the "$I" or token table. Its name has the form DR$indexname$I. This contains the list of tokens (column TOKEN_TEXT) and the information about the row and word positions where the token occurs (column TOKEN_INFO).

The row information is stored as internal DOCID values. These must be translated into external ROWID values. The table used for this depends on the type of lookup: For functional lookups, the "$K" table - DR$indexname$K - is used. This is a simple Index Organized Table (IOT) which contains a row for each DOCID/ROWID pair.

For indexed lookups, the "$R" table - DR$indexname$R - is used. This holds the complete list of ROWIDs in a BLOB column (spread over up to 20 rows). Since ROWIDs are fixed length (18 characters), we can easily find the ROWID corresponding to DOCID 3, for example, by looking at byte position 55 (18 x 3 + 1).

Hence we can easily find out whether a functional or indexed lookup is being used by examining a SQL trace, and looking for the $K or $R tables.

[Back to Top]

Why should I use section searching?

As we've seen above, the fastest type of query is one where there is only a single CONTAINS clause, and no other conditions in the WHERE clause. Rather than having a query like:

                                     
 SELECT title, isbn FROM booklist
    WHERE CONTAINS (title, 'horse') > 0
      AND CONTAINS (abstract, 'racing') > 0

                                  

We would prefer to run a query such as

                                     
 SELECT title, isbn FROM booklist
    WHERE CONTAINS (alltext, 
      'horse WITHIN title AND racing WITHIN abstract')>0

                                  

This will be a much faster query. In order to use a query like this, we must copy all the data into a single text column for indexing, with section tags around each column's data. This can be done via PL/SQL procedures before indexing, or by making use of the USER DATASTORE.

[Back to Top]

Is it OK to have lots of expansions in a query?

Each distinct word used in a query will require at least one row to be fetched from the $I token table. It is therefore best to keep the number of expansions down as much as possible. You should not use expansions such as wild cards, stemming and fuzzy matching unless they are necessary to the task. In general, a few expansions (say up to 20) is OK, but you should try to avoid more than 100 or so expansions in a query. The query feedback mechanism may be used to establish the number of expansions for any particular query expression.

[Back to Top]

How can local partition indexes help?

Oracle Text version 9i introduced the idea of "local partition indexes". This means that on a partitioned table, each partition has its own set of index tables. Effectively, there are multiple indexes, but the results from each are combined as necessary to produce the final result set.

The index is created using the LOCAL keyword:

                                     
   CREATE INDEX index_name ON table_name (column_name) 
    INDEXTYPE IS ctxsys.context
    PARAMETERS ('...')
    LOCAL

                                  
This method is particularly useful if
  • Queries frequently restrict the search by a particular column (eg a date range)
  • Queries are normally sorted by a particular column (eg price), and only the first page(s) of hits must be returned
In these examples, the table could be partitioned by date or price respectively. If the date range in the first is quite restrictive, it is very likely that the query can be satisfied by only looking in a single partition. This should be very much faster than searching through the entire table, then throwing away most of the hits because they don't match the structured criteria.

In the second example, with the table partitioned by price, the query may only need to get hits from the first partition in order to satisfy the first page of results.

There are, however a couple of points to watch out for:

  • Queries which are not restricted or sorted by the partition column will take slightly longer, as the results from searching each partition's index must be combined.
  • Due to a kernel restriction in version 9.0.1, descending sorts do not take advantage of local partition indexes, so this will only work if the sort is ascending. This restriction will probably be removed in a future version (see bug 1873616).
[Back to Top]

When should I use a CTXCAT index?

In Oracle 8i Release 3 (Version 8.1.7), Oracle introduced a new type of index - the "catalog" or CTXCAT index type. The name derives from the main purpose of the index type - online catlogs.

CTXCAT indexes work best when text is in "small chunks" - maybe a couple of lines maximum - and searches need to restrict and/or sort the result set according to certain structured criteria - usually numbers or dates.

For example, lets consider an on-line auction site. Each item for sale has a short description, a current bid price and dates for the start and end of the auction. A user might want to see all the records with "antique cabinet" in the description, with a current bid price less than $500. Since he's particularly interested in new items, he wants the results sorted by auction start time.

Such a search would be fairly inefficient using a normal CONTEXT index. The kernel would have to find all the records that matched the text search, then restrict the set to those with the correct price (which requires the use of a different index), and then sort the results using a third index.

By including structured information such as price and date within the CTXCAT index, we are able to make this search very much more efficient.

[Back to Top]

When is a CTXCAT index NOT suitable?

The query language with CTXCAT is considerably simpler than for CONTEXT indexes. Basically, you can search for phrases and words (with wild cards if required), using AND and OR operators, but that's about it. If your application needs more complex text retrieval featurs - such as stemming, thesaurus, fuzzy matching and so on, you should be using a CONTEXT index.

There are also differences in the time and space needed to create the index. CTXCAT indexes take quite a bit longer to create - and use considerably more disk space - than CONTEXT indexes. If you are tight on disk space, you should consider carefully whether CTXCAT indexes are appropriate for you.

[Back to Top]

Indexing

What optimizer hints are available, and what do they do?

The optimizer hint INDEX( table column) may be used in the usual way to force the use of a text or b-tree index (or NO_INDEX( table column) to disable a specific index). Additionally, the FIRST_ROWS hint has a special meaning for text queries, as documented above. Use of the FIRST_ROWS hint in conjunction with "ORDER BY SCORE(n) DESC" will tell the kernel to accept a sorted set from the text index, and not to do a further sort.

Finally, if your query has an ORDER BY clause on a B-tree indexed column, it may be worth considering use of the INDEX_DESC hint. This plan is never chosen by the cost-based optimizer, but may be optimal in some circumstances.

[Back to Top]

How long should indexing take?

Indexing text is a resource-intensive process. Obviously, the speed of indexing will depend on the power of the hardware involved, but you should expect somewhere between 50MB / hour on workstation-class NT machine (approx 400MHz CPU, 128MB memory) to more than 1GB per hour on a large multi-CPU, multi-gigabyte server machine. The latter figure assume you are using parallel indexing on a partitioned table - a new option for 8.1.6 - see the documentation for more details.

For most real-life systems, the time to index a complete table of documents will be measured in hours, and in some cases days.

[Back to Top]

How can I tell how far my indexing has got?

You can use the ctx_output.start_log ( filename) command to log output from the indexing process. filename will normally be written to $ORACLE_HOME/ctx, but you can change the directory using the log_directory parameter in ctx_adm.set_parameter.

Otheriwse, for a course-grained answer, you can count the number of rows in the DR$xxx$K table. There will be one row in here for each row that has been indexed, BUT these rows are only committed when the indexing process runs out of indexing memory and does a "flush" to the database. It is even possible that this will never happen until indexing is complete.

[Back to Top]

How much disk overhead will the indexing require?

The overhead - the amount of space needed for the DR$ index tables - varies between about 25% of the original text volume, and 100%. Generally, the larger the total amount of text, the smaller the overhead, but many small records will use more overhead than fewer large records. Also, "clean" data (such as published text) will require less overhead than "dirty" data such as emails or discussion notes, since the "dirty" data is likely to include many unique words from mis-spellings, abbreviations, etc.

Theme indexes are generally much smaller than text indexes. Creating ONLY a theme index will generally require very little storage, but creating only a text index will not save you much space over a combined index (though it is likely to be significantly faster).

[Back to Top]

How does the format of my data affect indexing?

Looking at indexing overhead, you can expect much lower overheads for formatted documents (eg. Microsoft Word files) since such documents tend to be very large compared to the actual text held in them. So 1GB of Word documents might only require 50MB of index space, whereas 1GB of plain text might require 500MB, since there is ten times as much "plain text" in the latter set.

Indexing time is less clear-cut. Although the reduction in the amount of text to be indexed will have an obvious effect, we must balance this out against the cost of filtering the documents. In general, these will roughly cancel out, so the time to index 1GB of formatted docs will be about the same as to index 1GB of plain text, although it may be a little longer.

[Back to Top]

Can I index in parallel?

In 8.1.6 and later, you may use parallel indexing to index partitioned tables, with one indexing process per partition.

How you do this depends on whether or not you are creating a local partition index. If you are not, you simply use the PARALLEL n keyword when creating the index:

                                     
   CREATE INDEX index_name ON table_name (column_name) 
    INDEXTYPE IS ctxsys.context
    PARAMETERS ('...')
    PARALLEL 3

                                  
This will create the index with three separate indexing processes. If you are using a local partition index, the process is a little more complex. You must first create the index using the NOPOPULATE keyword in the parameters string, then rebuild the index for each partition in a separate session. This can be done using the SQL command:
                                     
   ALTER INDEX index_name REBUILD PARTITION partition_name

                                  
[Back to Top]

What INDEX MEMORY settings should I use?

You should aim to set the INDEX MEMORY value as high as possible, without causing paging. Experience has shown that using a large INDEX MEMORY setting - even into hundreds of megabytes - will improve the speed of indexing and reduce the fragmentation of the final indexes. However, if set too high, then the memory paging that occurs will cripple indexing speed. If using parallel indexing, don't forget that each stream will require its own index memory.

When dealing with very large tables, you may wish to tune your database SGA differently for indexing and retrieval. For retrieval, you are hoping to get as much information buffered in the SGA (block buffer cache) as possible. So you should be allocating a large amount of memory to the block buffer cache. But this will not make any difference to indexing, so you would be better off reducing the size of the SGA to make more room for a large INDEX MEMORY setting.

[Back to Top]

Updates

How often should I index new/updated records?

How often do you need to? The less often you run reindexing (via the command ALTER INDEX indexname REBUILD ONLINE PARAMETERS('SYNC')) then the less fragmented your indexes will be, and the less you will need to optimize them. However, this means that your data will become progressively more out of date, which may be unacceptable for your users.

Many systems are OK with overnight indexing. This means data that is less than a day old is not searchable. Other systems use hourly, ten minute, or five minute updates.

Note that running a CTXSRV process may cause records to be indexed as soon as they are committed. If records are batch loaded in large numbers before being committed, this is probably OK. However, if they are added one at a time and committed after each add, then a CTXSRV process is likely to cause drastic fragmentation of the indexes.

[Back to Top]

How can I tell when my indexes are getting fragmented?

The best way is to time some queries, run index optimization, then time the same queries (restarting the database to clear the SGA each time, of course). If the queries speed up significantly, then optimization was worthwhile. If they don't ... well, you can wait longer next time.

A more scientific method involves counting the number of rows for each term in the DR$xxx$I table:

                                     
 SELECT AVG(COUNT(*)) FROM DR$index_name$I
    GROUP BY TOKEN_TEXT HAVING COUNT(*) > 1;

                                  

(note we ignore all words with only a single row in the index table).

A value greater than 10 from this query may indicate the need to optimize the index, but experimentation should yield the best value in any particular circumstances. Very large tables will inevitably have a lot of rows where the TOKEN_INFO data overflows the 4K internal limit, so you would expect the average to be greater on large systems.

[Back to Top]

Does INDEX MEMORY affect ALTER INDEX ... 'SYNC'?

Yes - just the same way as for normal indexing. But of course, there are often far fewer records to be indexed during a SYNC, so it is not usually necessary to provide hundreds of megabytes of indexing memory.

[Back to Top]


Last modified August 28 2001 by Roger Ford


Left Curve
Popular Downloads
Right Curve
Untitled Document
Left Curve
More Database Downloads
Right Curve