Contents
Section 1: Queries
Section 2: Indexing
Section 3: Updates
Queries
What do we mean by "query performance" anyway?
There are generally two measures of query performance - response time
(the time to get an answer to an individual query), and throughput
(the number of queries that can be run in any time period, eg queries
per second). These two are related, but are not the same. In a heavily
loaded system, we would normally want maximum throughput, whereas in a
relatively lightly loaded system, we probably want minimum response
time.
Also, some applications will require a query to deliver all its hits
to the user, whereas others may just require the first 20 hits
from an ordered set. It is important to distinguish between these
two scenarios.
What is the fastest type of text query?
The fastest type of query will meet the following conditions:
- Single CONTAINS clause
- No other conditions in the WHERE clause
- EITHER: No "ORDER BY" clause at all, OR just "ORDER BY SCORE(n) DESCEND" together with the "FIRST_ROWS" hint
- Only the first page of results is returned (eg. the first 10 or 20 hits).
How does the size of my data affect queries?
The speed at which the text index can deliver ROWIDs is not affected
by the actual size of the data, but by the size of the Token Table
which holds the list of words, and information about the rows in which
they appear. Text query speed will be related to the number of rows that
must be fetched from this Token Table, and the length of each row.
Hence it should be nearly as fast to find a rare word in a large
document set as it is to find a common word - or many uncommon
words - in a smaller document set.
How does the source type of my data affect queries?
The format of the documents (eg. plain ascii text, HTML or
Microsoft Word) should make no difference to query speed.
The documents are filtered to plain text at indexing time,
not query time.
The "cleanliness" of the data will make a difference. Spell-checked
and sub-edited text for publication will tend to have a much smaller
total vocabulary (and therefore size of token table) than informal
text such as emails, which will contain many spelling errors and
abbreviations which bloat the token table.
Does sorting the results slow a text-only query?
It most certainly does - except when sorting by relevence, or when
the table is partitioned by the sort column (of which more later). If
you sort by any other criteria - such as date, or price - then Oracle
must resolve the entire query before it can pass back any results to
the user application. If there is no sorting, then Oracle can return
results as it finds them, which is very much quicker in the common
case where the application needs to only display a page of results at
a time.
There is a special case - sorting by relevence (SCORE(n)) can be
extremely quick if the FIRST_ROWS hint is used. In this case, the
kernel allows the text engine to perform a high speed internal sort
when fetching from the text index tables.
An example of such a query:
SELECT /*+ FIRST_ROWS */ ID, SCORE(1), TEXT FROM MYTABLE
WHERE CONTAINS (TEXT, 'searchterm', 1) > 0
ORDER BY SCORE(1) DESC;
Note that for this to work efficiently, there must be no other
criteria in the WHERE clause other than a single CONTAINS.
Where the query contains non-text conditions, it may be that the
query is "driven" from a B-tree index on a non-text column. In this
case, an ORDER BY on this column will not be an overhead, since the
B-tree index can deliver ROWIDs in sorted order. Such a query will
use a functional lookup in the text index, which
can be expensive if many rows are fetched from the B-tree index.
What is a "functional" versus an "indexed" lookup?
There are two ways the kernel can query the text index.
In the first - and most common - case, the kernel asks the
text index for all the rowids that satisfy a particular
text search. These rowids are returned in blocks. In the
second, the kernel passes individual rowids to the text index,
and asks whether that particular rowid satisfies a certain
text criterion. The second is known as a functional lookup,
and is most commonly done where there is a very selective
structured clause, so that only a few rowids must be
checked against the text index.
An example of a search where a functional lookup may be used:
SELECT ID, SCORE(1), TEXT FROM MYTABLE
WHERE START_DATE = '21 Oct 1992' <- highly selective
AND CONTAINS (TEXT, 'commonword') > 0 <- unselective
What tables are involved in queries?
All queries will need to look at the "$I" or token table. Its name
has the form DR$indexname$I. This contains the list of tokens
(column TOKEN_TEXT) and the information about the row and word
positions where the token occurs (column TOKEN_INFO).
The row information is stored as internal DOCID values. These must be
translated into external ROWID values. The table used for this depends
on the type of lookup: For functional lookups, the "$K" table -
DR$indexname$K - is used. This is a simple Index Organized Table (IOT)
which contains a row for each DOCID/ROWID pair.
For indexed lookups, the "$R" table - DR$indexname$R - is used. This
holds the complete list of ROWIDs in a BLOB column (spread over up to
20 rows). Since ROWIDs are fixed length (18 characters), we can
easily find the ROWID corresponding to DOCID 3, for example, by
looking at byte position 55 (18 x 3 + 1).
Hence we can easily find out whether a functional or indexed lookup is
being used by examining a SQL trace, and looking for the $K or $R
tables.
Why should I use section searching?
As we've seen above, the fastest type of query is one where there
is only a single CONTAINS clause, and no other conditions in the
WHERE clause. Rather than having a query like:
SELECT title, isbn FROM booklist
WHERE CONTAINS (title, 'horse') > 0
AND CONTAINS (abstract, 'racing') > 0
We would prefer to run a query such as
SELECT title, isbn FROM booklist
WHERE CONTAINS (alltext,
'horse WITHIN title AND racing WITHIN abstract')>0
This will be a much faster query. In order to use a query like this,
we must copy all the data into a single text column for indexing, with
section tags around each column's data. This can be done via PL/SQL
procedures before indexing, or by making use of the USER DATASTORE.
Is it OK to have lots of expansions in a query?
Each distinct word used in a query will require at least one row
to be fetched from the $I token table. It is therefore best to
keep the number of expansions down as much as possible. You should
not use expansions such as wild cards, stemming and fuzzy matching
unless they are necessary to the task. In general, a few expansions
(say up to 20) is OK, but you should try to avoid more than 100 or so
expansions in a query. The query feedback mechanism may be used to
establish the number of expansions for any particular query expression.
How can local partition indexes help?
Oracle Text version 9i introduced the idea of "local partition indexes".
This means that on a partitioned table, each partition has its own set of
index tables. Effectively, there are multiple indexes, but the results from
each are combined as necessary to produce the final result set.
The index is created using the LOCAL keyword:
CREATE INDEX index_name ON table_name (column_name)
INDEXTYPE IS ctxsys.context
PARAMETERS ('...')
LOCAL
This method is particularly useful if
- Queries frequently restrict the search by a particular column (eg a date range)
- Queries are normally sorted by a particular column (eg price), and only the
first page(s) of hits must be returned
In these examples, the table could be partitioned by date or price
respectively. If the date range in the first is quite restrictive, it
is very likely that the query can be satisfied by only looking in a
single partition. This should be very much faster than searching through
the entire table, then throwing away most of the hits because they
don't match the structured criteria.
In the second example, with the table partitioned by price, the
query may only need to get hits from the first partition in order to
satisfy the first page of results.
There are, however a couple of points to watch out for:
-
Queries which are not restricted or sorted by the partition
column will take slightly longer, as the results from searching each
partition's index must be combined.
-
Due to a kernel restriction in version 9.0.1, descending sorts do
not take advantage of local partition indexes, so this will
only work if the sort is ascending. This restriction will probably be
removed in a future version (see bug 1873616).
When should I use a CTXCAT index?
In Oracle 8i Release 3 (Version 8.1.7), Oracle introduced a new type
of index - the "catalog" or CTXCAT index type. The name derives from
the main purpose of the index type - online catlogs.
CTXCAT indexes work best when text is in "small chunks" - maybe a
couple of lines maximum - and searches need to restrict and/or sort
the result set according to certain structured criteria - usually
numbers or dates.
For example, lets consider an on-line auction site. Each item for sale
has a short description, a current bid price and dates for the start
and end of the auction. A user might want to see all the records with
"antique cabinet" in the description, with a current bid price less
than $500. Since he's particularly interested in new items, he wants
the results sorted by auction start time.
Such a search would be fairly inefficient using a normal CONTEXT
index. The kernel would have to find all the records that matched the
text search, then restrict the set to those with the correct price
(which requires the use of a different index), and then sort the
results using a third index.
By including structured information such as price and date within the
CTXCAT index, we are able to make this search very much more
efficient.
When is a CTXCAT index NOT suitable?
The query language with CTXCAT is considerably simpler than for
CONTEXT indexes. Basically, you can search for phrases and words (with
wild cards if required), using AND and OR operators, but that's about
it. If your application needs more complex text retrieval featurs -
such as stemming, thesaurus, fuzzy matching and so on, you should be
using a CONTEXT index.
There are also differences in the time and space needed to create the
index. CTXCAT indexes take quite a bit longer to create - and use
considerably more disk space - than CONTEXT indexes. If you are tight
on disk space, you should consider carefully whether CTXCAT indexes
are appropriate for you.
Indexing
What optimizer hints are available, and what do they do?
The optimizer hint INDEX(table column) may
be used in the usual way to force the use of a text or b-tree index
(or NO_INDEX(table column) to disable a
specific index). Additionally, the FIRST_ROWS hint has a special
meaning for text queries, as documented above. Use
of the FIRST_ROWS hint in conjunction with "ORDER BY SCORE(n) DESC"
will tell the kernel to accept a sorted set from the text index, and
not to do a further sort.
Finally, if your query has an ORDER BY clause on a B-tree indexed column,
it may be worth considering use of the INDEX_DESC hint. This
plan is never chosen by the cost-based optimizer, but may be optimal
in some circumstances.
How long should indexing take?
Indexing text is a resource-intensive process. Obviously, the
speed of indexing will depend on the power of the hardware
involved, but you should expect somewhere between 50MB / hour
on workstation-class NT machine (approx 400MHz CPU, 128MB memory)
to more than 1GB per hour on a large multi-CPU, multi-gigabyte
server machine. The latter figure assume you are using parallel
indexing on a partitioned table - a new option for 8.1.6 -
see the documentation for more details.
For most real-life systems, the time to index a complete table of
documents will be measured in hours, and in some cases days.
How can I tell how far my indexing has got?
You can use the ctx_output.start_log (filename)
command to log output from the indexing process. filename
will normally be written to $ORACLE_HOME/ctx, but you can
change the directory using the log_directory parameter
in ctx_adm.set_parameter.
Otheriwse, for a course-grained answer, you can count the number of
rows in the DR$xxx$K table. There will be one row in here for each
row that has been indexed, BUT these rows are only committed when the
indexing process runs out of indexing memory and does a "flush" to the
database. It is even possible that this will never happen until
indexing is complete.
How much disk overhead will the indexing require?
The overhead - the amount of space needed for the DR$
index tables - varies between about 25% of the original text
volume, and 100%. Generally, the larger the total amount of
text, the smaller the overhead, but many small records will
use more overhead than fewer large records. Also, "clean"
data (such as published text) will require less overhead
than "dirty" data such as emails or discussion notes, since
the "dirty" data is likely to include many unique words from
mis-spellings, abbreviations, etc.
Theme indexes are generally much smaller than text indexes. Creating
ONLY a theme index will generally require very little storage, but
creating only a text index will not save you much space over a combined
index (though it is likely to be significantly faster).
How does the format of my data affect indexing?
Looking at indexing overhead, you can expect much lower
overheads for formatted documents (eg. Microsoft Word files)
since such documents tend to be very large compared to the
actual text held in them. So 1GB of Word documents might
only require 50MB of index space, whereas 1GB of plain text
might require 500MB, since there is ten times as much "plain
text" in the latter set.
Indexing time is less clear-cut. Although the reduction in the amount
of text to be indexed will have an obvious effect, we must balance
this out against the cost of filtering the documents. In general,
these will roughly cancel out, so the time to index 1GB of formatted
docs will be about the same as to index 1GB of plain text, although it
may be a little longer.
Can I index in parallel?
In 8.1.6 and later, you may use parallel indexing to index
partitioned tables, with one indexing process per partition.
How you do this depends on whether or not you are creating a
local partition index. If you are not, you simply
use the PARALLEL n keyword when creating the index:
CREATE INDEX index_name ON table_name (column_name)
INDEXTYPE IS ctxsys.context
PARAMETERS ('...')
PARALLEL 3
This will create the index with three separate indexing processes. If you
are using a local partition index, the process is a little more complex.
You must first create the index using the NOPOPULATE keyword in the
parameters string, then rebuild the index for each partition in a separate session.
This can be done using the SQL command:
ALTER INDEX index_name REBUILD PARTITION partition_name
What INDEX MEMORY settings should I use?
You should aim to set the INDEX MEMORY value as high as
possible, without causing paging. Experience has shown that
using a large INDEX MEMORY setting - even into hundreds of
megabytes - will improve the speed of indexing and reduce
the fragmentation of the final indexes. However, if set too
high, then the memory paging that occurs will cripple indexing
speed. If using parallel indexing, don't forget that each stream
will require its own index memory.
When dealing with very large tables, you may wish to tune your
database SGA differently for indexing and retrieval. For retrieval,
you are hoping to get as much information buffered in the SGA (block
buffer cache) as possible. So you should be allocating a large amount
of memory to the block buffer cache. But this will not make any
difference to indexing, so you would be better off reducing the size
of the SGA to make more room for a large INDEX MEMORY setting.
Updates
How often should I index new/updated records?
How often do you need to? The less often you run reindexing
(via the command ALTER INDEX indexname REBUILD ONLINE PARAMETERS('SYNC'))
then the less fragmented your indexes will be, and the less you
will need to optimize them. However, this means that your data
will become progressively more out of date, which may be
unacceptable for your users.
Many systems are OK with overnight indexing. This means data that is
less than a day old is not searchable. Other systems use hourly, ten
minute, or five minute updates.
Note that running a CTXSRV process may cause records to be indexed as
soon as they are committed. If records are batch loaded in large
numbers before being committed, this is probably OK. However, if they
are added one at a time and committed after each add, then a CTXSRV
process is likely to cause drastic fragmentation of the indexes.
How can I tell when my indexes are getting fragmented?
The best way is to time some queries, run index optimization,
then time the same queries (restarting the database to clear
the SGA each time, of course). If the queries speed up significantly,
then optimization was worthwhile. If they don't ... well, you can
wait longer next time.
A more scientific method involves counting the number of rows for each
term in the DR$xxx$I table:
SELECT AVG(COUNT(*)) FROM DR$index_name$I
GROUP BY TOKEN_TEXT HAVING COUNT(*) > 1;
(note we ignore all words with only a single row in the index table).
A value greater than 10 from this query may indicate the need to
optimize the index, but experimentation should yield the best value in
any particular circumstances. Very large tables will inevitably have
a lot of rows where the TOKEN_INFO data overflows the 4K internal limit,
so you would expect the average to be greater on large systems.
Does INDEX MEMORY affect ALTER INDEX ... 'SYNC'?
Yes - just the same way as for normal indexing. But of course, there
are often far fewer records to be indexed during a SYNC, so it
is not usually necessary to provide hundreds of megabytes of
indexing memory.
|