Query Improvements
Fuzzy Score Passthrough
The FUZZY operator (represented by the question mark (?)) expands the input
term to words which are spelled similarly. It can increase query recall when
query terms or words in the indexed documents are frequently misspelled.
However, it could also expand to hundreds of words, leading to low precision
and long-running queries.
The size of the fuzzy expansion can be controlled at index time
through wordlist attributes introduced in version 8.1.5.
FUZZY_NUMRESULTS limits the expansion to no more than the specified
number of terms. FUZZY_SCORE limits the expansion size by fuzzy
score. Fuzzy score is a measure of the degree of similarity of the
input term to each result term. This ranges from 50 to 80, with 80
meaning identical. For instance, ?grisam might, internally, expand
to:
Word Fuzzy Score
------- -----------
Grisham 73
Grissom 66
Gresham 65
Graham 56
Setting FUZZY_SCORE limits the expansion terms to those with a fuzzy score
equal to or greater than the specified value. If in our example we set
FUZZY_SCORE to 70, ?Grisham would expand only to Grisham.
Query-time control over these limits was introduced in version
8.1.6, although this new feature was not documented in the official
documentation nor in the interMedia Text 8.1.6 Technical Overview.
The FUZZY function was expanded with two parameters for specification
of fuzzy score threshhold or numresults limit:
FUZZY(term [, fuzzy_score [, fuzzy_numresults]])
For instance,
FUZZY(term)
uses the index-time wordlist attribute limits. However,
FUZZY(term, 65)
overrides the index-time setting of FUZZY_SCORE and uses 65 as the fuzzy score
threshhold. Similarly,
FUZZY(term,,15)
overrides the index-time setting of FUZZY_NUMRESULTS and instead limits the
size of the expansion to 15 terms.
Although you could limit the terms by fuzzy score, the contains
score of the FUZZY function still reflected only the term frequency of
the expanded word -- there was no way to find documents containing
words similar to X, and order them by similarity to the input term.
So, in 8.1.7, a new parameter has been added to the FUZZY function:
FUZZY(term [, fuzzy_score [, fuzzy_numresults [, weight]]])
When the weight parameter is set to "weight" or "w", the FUZZY function weights
each expanded term by ((3 * fuzzyscore) - 140)/10) -- a factor proportional to
its fuzzy score. For instance, our example above, FUZZY(grisam), would
normally be re-written as:
grisham = grissom = gresham = graham
with the weight parameter, this becomes
(grisham * 7.9) = (grissom * 5.8) = (gresham * 5.5) = (graham * 2.8)
And the more similar terms, having higher weights, would tends towards higher
scores. Let's see this in practice. We'll start with a simple table, indexed
with FUZZY_NUMRESULTS set to 500 and FUZZY_SCORE set to 50. The query:
select score(1), text from mytable
where contains(text, 'FUZZY(Grisam)',1) > 0
order by score(1) desc
results in:
SCORE TEXT
----- -----------------------
4 Heather Graham
4 Gus Grissom
4 Douglas Gresham
4 John Grisham
The record most similar to our input term -- John Grisham -- cannot be
distinguished from all the other records, because their scores are the same.
If instead we do:
select score(1), text from mytable
where contains(text, 'FUZZY(Grisam,,,w)',1) > 0
order by score(1) desc
The results come back in decreasing similarity order:
SCORE TEXT
----- -----------------------
31 John Grisham
23 Gus Grissom
21 Douglas Gresham
11 Heather Graham
which allows us to easily pick out the best matches.
It is important to remember that fuzzy weighting is not simply a
normalized fuzzy score -- it is a weighting of the regular,
inverse-document frequency score. Because of this, it is possible for
less similar terms to outscore more similar terms, if the less similar
terms occur frequently in the document. For instance, if we add the
document "Graham Graham Graham" to the corpus, the weighted query
returns:
SCORE TEXT
----- -----------------------
37 Graham Graham Graham
35 John Grisham
26 Gus Grissom
24 Douglas Gresham
12 Heather Graham
Consequently, weighted fuzzy is most effective when applied to indexes on
small, short text fields (like names), where term repetition in a single
document is rarer than in longer documents.
$I Query in Cost Estimate
interMedia Text is integrated with the cost-based optimizer. Contains
queries are passed to interMedia Text for estimation of selectivity and cost,
if the text index has been analyzed. In 8.1.5 and 8.1.6, analyze index scans
all rows of the $I table and calculates the document counts for the most
unselective tokens. An average measure is used for all other tokens.
One problem with this approach is that the statistics are gathered
at analyze time, and not updated, so over time they become
increasingly out-of- date. Furthermore, because of limited statistic
storage space, accurate counts are stored for only a few hundred
tokens - everything else uses the average measure, which is
inaccurate. Finally, because it does so much work, analyze itself is
a time-consuming operation.
In 8.1.7, analyze has been changed to do very little - it now simply
records the fact that it has been run. At query time, $I is directly
queried to get the token document count. The statistics are more
accurate, always up-to-date, and analyze index takes seconds instead
of hours. Although the cost and selectivity estimation takes longer,
the cost estimation queries are only $X index scans, and are the same
that the query itself will run, so essentially they cache the blocks
that the query will read. Overall query time does not degrade.
Improved Performance on Large Datasets
We have made some internal changes to the code to optimize query-time disk
access. This should result in improved text query performance for large
datasets. The exact improvements implemented are too detailed for the scope
of this document.
Knowledge Base Improvements
User-Defined Knowledge Bases
Version 8.1.5 introduced knowledge base extension: you load a custom
thesaurus then use the knowledge base compiler (ctxkbtc) to add these new
terms to the provided knowledge base. However, you could only extend
the existing English knowledge base. Version 8.1.6 introduced a French
knowledge base, and a THEME_LANGUAGE attribute to basic lexer to select
the knowledge base to use for theme indexing, but user customization was
still limited to extension of the provided knowledge bases.
Version 8.1.7 allows you to load your own knowledge base, and create
a completely new knowledge base from scratch. This allows a company
with an extensive Spanish thesaurus, for instance, to generate a
Spanish knowledge base, and use that when generating theme indexes, in
ABOUT query, etc.
The steps are:
- Load a custom thesaurus using ctxload
- Set NLS_LANG so that the language portion is the target language The charset portion must be a single-byte character set.
- Compile the loaded thesaurus using ctxkbtc
ctxkbtc -user ctxsys/ctxsys -name
A new knowledge base will be compiled from the loaded thesaurus. To now use
in an index, specify the NLS_LANG language as the THEME_LANGUAGE attribute
value for the lexer.
Limitations:
- Oracle supplies knowledge bases in French and English only. You must provide your own thesaurus for any other language,
- This process only works for single-byte character sets. You cannot create a knowledge base for languages which can be expressed only in multi-byte character sets. If the database character set is multi-byte, such as UTF-8, NLS_LANG must be set to a compatible single-byte character set when compiling the thesaurus.
- It may not be possible to create effective knowledge bases for languages with extensive compounding, such as Dutch or German, or with many word forms, such as Hebrew or Arabic.
- You can have at most one knowledge base per NLS language.
During knowledge base compilation, you can also now specify a
stoplist:
ctxkbtc -user ctxsys/ctxsys -name -stoplist
The words of this stoplist are marked as low-information words that are
prevented from becoming or contributing to a theme.
Document Services Improvements
Tokens Document Service
TOKENS is a new document service which returns the words of a document as
seen by the indexing engine. The document is fetched, filtered, sectioned,
and lexed, and the tokens placed in a result table or PL/SQL table. This is
useful for applications which may want to implement classification, clustering,
or routing.
The result table form of the call is:
ctx_doc.tokens(, , , );
The result table should have the form
QUERY_ID NUMBER -- The query id
TOKEN VARCHAR2(64) -- the lexed token
OFFSET NUMBER -- the character offset of the token
LENGTH NUMBER -- the original length of the token before lexing
The PL/SQL in-memory form is:
ctx_doc.tokens(, , );
where is of the type ctx_doc.token_tab:
type token_rec is record (
token varchar2(64),
offset number,
length number
);
type token_tab is table of token_rec index by binary_integer;
The output will contain only text tokens - theme tokens, section information,
etc. are not output. Stopwords are not output, either. If composite is on,
compounds will be produced, so there may be several tokens at the same offset.
NUM_THEMES Limit
CTX_DOC.THEMES and CTX_DOC.GIST have both been extended with a new numeric
argument, NUM_THEMES. This argument limits the number of themes or POV
gists generated. For instance:
ctx_doc.themes('myidx','1','ttab',num_themes=>12)
will populate the "ttab" table with the 12 strongest themes, instead of the
up-to-50 themes normally generated.
Thesaurus Improvements
In-Memory Thesaurus Build
ctxload thesaurus loading has been re-written to construct the thesaurus
in-memory then array insert the phrase information. Loading a large thesaurus
is now much faster than in previous versions.
Create/Drop/Alter Translation
The ctx_thes API has been expanded with the following calls for maintaining
translation information in an existing thesaurus:
ctx_thes.create_translation(, , , )
ctx_thes.drop_translation(, , , )
ctx_thes.update_translation(, , ,
, )
Has_Relation
A relation testing procedure has been added for applications which may need
to test that a relation exists without actually doing the expansion:
ctx_thes.has_relation(,,)
rel can be a single thesaural relation, or a comma-separated list of relations,
except PT. The function returns TRUE if the phrase has any of the relations
in the list. For instance:
ctx_thes.has_relation('cat','BT,BTG')
returns TRUE if the phrase "cat" in the DEFAULT thesaurus has any broader
terms or broader generic terms.