Oracle interMedia Text 8.1.6 - Technical Overview
Oracle8i interMedia Text 8.1.6 - Technical Overview
Garrett Kaminaga, Product Development, Database and Application Server Technologies, Oracle HQ
This is a technical overview of the improvements found in interMedia Text,
version 8.1.6. This is intended for an audience familiar with version 8.1.5. If you are
new to interMedia Text, please start with Oracle8i interMedia Text 8.1.5 - Technical Overview. Please refer to the
documentation for detailed information.
Table of Contents
- Overview of 8.1.6 Improvements
- Upgrading 8.1.5 to 8.1.6
- Datastore Improvements
- Filter Improvements
- Section Group Improvements
- Lexer Improvements
- Wordlist Improvements
- Storage Improvements
- Index Creation and Maintenance
- Query Language Improvements
- Document Services Improvements
- Thesaurus Improvements
Overview of 8.1.6 Improvements
While 8.1.6 has features which improve index performance and ease-of-use, most of the
new features are targeted in two key areas: structured documents and multi-lingual
databases.
The improvements in structured document support include indexing and searching
attribute text, nested within for sophisticated queries, doctype-limited tag detection,
dynamic add section for existing indexes, and a new AUTO sectioner which requires no
section pre-definition.
Multi-lingual databases can now store documents of different languages in a single
column, thanks to the new MULTI lexer. Furthermore, improvements in INSO and charset
filters allow documents of different formats and character sets in a single column.
UNICODE support has been added to the Japanese and Korean lexers.
Other interesting features include limited parallel indexing, more flexible stem and
fuzzy operators, and in-memory document services.
Upgrading 8.1.5 to 8.1.6
Automated Upgrade
For an automated upgrade from 8.1.5 to 8.1.6, use Oracle's Database Migration Assistant
tool (ODMA) -- this will run all needed scripts for interMedia.
There's no easy way to verify that the upgrade completed successfully. However, you
can, after the upgrade, issue:
select * from ctx_version
as ctxsys. If this reports 8.1.6, the upgrade script has been run. It's probably wise
to check for any invalid ctxsys objects anyway.
Manual Upgrade
In the event that the automatic upgrade fails, the manual process is below. This
process assumes that you are starting with a valid 8.1.5 ctxsys data dictionary, and that
the system is not being used for index, DML, or query.
1. run ?/ctx/admin/upgrade/s0801060.sql as SYS
this grants new, needed database privileges to ctxsys
2. run ?/ctx/admin/upgrade/u0801060.sql as CTXSYS
this brings your ctxsys data dictionary to 8.1.6
3. run ?/ctx/admin/dr0pkh.sql as CTXSYS
this re-creates all public and private package headers
4. run ?/ctx/admin/dr0plb.sql as CTXSYS
this re-creates all public and private package bodies
5. run ?/ctx/admin/dr0type.plb as CTXSYS
this re-creates the index type body
6. check for any invalid ctxsys objects
Upgrading Existing Indexes
Nothing needs to be done to upgrade existing 8.1.5 context indexes. These will still
work without any problems in 8.1.6.
Datastore Improvements
USER_DATASTORE
The user datastore in 8.1.5 is limited to CLOB output for the generated documents. A
new attribute, OUTPUT_TYPE, allows you to change the output type to BLOB or VARCHAR2. For
instance, a preference like this:
begin
ctx_ddl.create_preference('mystore','user_datastore');
ctx_ddl.set_attribute('mystore','procedure','foo');
end;
mandates a foo procedure with the following signature:
procedure foo (in rowid, in out clob)
If instead we add:
ctx_ddl.set_attribute('mystore','output_type','varchar2');
then interMedia Text will expect the foo procedure to have the signature:
procedure foo (in rowid, in out varchar2)
If the synthesized document is expected to be less than 32000 bytes, then a varchar2
output function is probably easier to write, easier to test, and faster to execute than
the CLOB version.
You can also set the output_type to BLOB, for formatted documents or documents in a
different character set. Note, however, than it is still not possible to use or call an
external procedure in the user datastore.
If output_type is not specified, CLOB is assumed, for backward compatibility.
NESTED_DATASTORE
The new NESTED_DATASTORE object allows indexing the rows of a nested table as parts of
a single document -- it is very much like the detail datastore, but here the detail table
is a nested table.
For example, consider a table mytab, where the document text is stored in a nested
table, one line per row of the nested table:
create type nt_rec as object (
lno number, -- line number
ltxt varchar2(80) -- text of line
);
create type nt_tab as table of nt_rec;
create table mytab (
id number primary key,
dummy char(1),
doc nt_tab
)
nested table doc store as myntab;
If you are running the above in SQL*Plus, you'll need forward slashes after each
statemeny to execute it.
insert into mytab values (1, null, nt_tab());
insert into table(select doc from mytab where id = 1)
values (1, 'this is the first line');
insert into table(select doc from mytab where id = 1)
values (2, 'this is the second line');
For each row of mytab, interMedia Text needs to concatenate the rows of the
nested doc table into a single document. In 8.1.5, this has to be done using a user
datastore. 8.1.6 provides this as built-in functionality.
We start by creating a nested table datastore:
ctx_ddl.create_preference('mynt','nested_datastore');
We need to provide the name of the nested table column in the base table:
ctx_ddl.set_attribute('mynt','nested_column','doc');
We also need to provide the type of the nested table. This should include the type
owner schema. This cannot be determined automatically because the table name is not known
until create index time:
ctx_ddl.set_attribute('mynt','nested_type','scott.nt_tab');
Finally, we need to specify the names of the line number and text columns within the
nested table record type:
ctx_ddl.set_attribute('mynt','nested_lineno','lno');
ctx_ddl.set_attribute('mynt','nested_text', 'ltxt');
The extensible indexing framework does not allow creating an index on a nested table
type, so we need to create the index on a dummy column, just like the detail datastore:
create index myidx on mytab(dummy)
indextype is ctxsys.context
parameters ('datastore mynt');
For each record of the base table, interMedia Text will select the rows of the
nested table using sql somewhat like this:
select <nested_text>
from table(select <nested_column>
from <base_table>
where rowid = <current_row_rowid>)
order by nested_table_id, <nested_lineno>
and concatenate the rows into a single document. Thus, a search like:
select id from mytab where contains(dummy, 'first and second')>0
will return the one base table row, even though the hits are in different rows of the
nested table.
Like the detail datastore, the nested table datastore has a BINARY attribute to control
the concatenation of the nested table lines. If BINARY is TRUE, interMedia Text
will simply concatenate the text together. If BINARY is FALSE, interMedia Text will
insert a newline in between each nested table line.
Given the nested table sql, a unique index on the nested table, on the columns
(nested_table_id, <lineno>) should improve indexing performance.
Because the index is created on a dummy column, any DML on the nested table must be
manually propagated to the the dummy column -- interMedia Text cannot automatically
detect such changes and mark rows for reindex.
Filter Improvements
INSO_FILTER Character Set Support
In 8.1.5, the INSO filter outputs ISO8859-1 text, which is then converted to the
database character set. However, not all documents can be expressed in ISO8859-1, so the
INSO filter cannot be used in all situations.
8.1.6 uses a newer revision of the INSO libraries which features UNICODE output. Since
UNICODE is designed to be able to express any text, the INSO filter can now be used on any
database, as long as the document text can be expressed in the database character set.
For instance, in 8.1.5, a Japanese document in MS Word on a JA16EUC database cannot be
sent to the INSO filter, because the plain text output is forced to ISO8859-1, which is
incapable of expressing the document. In 8.1.6, the plain text is UNICODE, and everything
works fine. Sending the same document through on a US7ASCII database, however, will not
work since the Japanese text cannot be converted to the database character set.
Format Column
Another improvement in the INSO filter is the addition of optional bypasses for
heterogenous docsets. In 8.1.5, if you employ the INSO filter in your index, every
document in the base table is sent through the INSO filter. This works fine when all the
documents are formatted documents, but the docset may also have HTML and plain text
documents. Filtering these wastes time and sometimes produces unexpected results.
8.1.6 adds a mechanism to bypass INSO filtering for selected rows. This requires a
column in the base table -- called the format column -- which specifies that the data is
BINARY or TEXT. For instance, let's consider the table:
create table hdocs (
id number primary key,
fmt varchar2(10),
text varchar2(80)
);
We can insert a word doc, with the value 'BINARY' in the format column:
insert into hdocs values (1, 'binary', '/docs/myword.doc');
and an HTML document, with the value 'TEXT' in the format column:
insert into hdocs values (2, 'text', '/docs/index.html');
When we create the index, we need to specify the name of the format column, using the
new 'format column' clause of the parameters string:
create index hdocsx on hdocs(text)
indextype is ctxsys.context
parameters ('datastore file_datastore filter inso_filter
format column fmt');
Since we have specified a format column, for each row, the INSO filter will examine the
value in 'fmt'. If this value is 'BINARY', as it is for row 1, then it will send it
through the INSO filter. If the value is 'TEXT', it by-passes the INSO filter and sends
the text straight through.
The format column must be in the same table as the text column, and it must be CHAR,
VARCHAR, or VARCHAR2 type. The value is case-insensitive, so 'BINARY' and 'binary' are the
same. If the column value is null or not 'BINARY' nor 'TEXT', then the document is assumed
to be binary, and it is filtered by the INSO filters.
The format column does not have to be specified when using the INSO filter. This bypass
mechanism is optional. The format column can be specified when not using the INSO filter,
but it has no effect.
Charset Column
If the document set is even more heterogenous, the TEXT documents may not all be in the
same character set, and may not be in the database character set. The INSO filter can do
per-row character set translation for such documents, if the character set is specified in
another column, called the charset column. For instance, let's expand our example:
create table hdocs (
id number primary key,
fmt varchar2(10),
cset varchar2(20),
text varchar2(80)
);
We can insert a plain text Japanese document in EUC:
insert into hdocs values (1, 'text', 'JA16EUC', '/docs/tekusuto.euc');
and one in Shift-JIS:
insert into hdocs values (2, 'text', 'JA16SJIS', '/docs/tekusuto.sjs');
When we create the index, we need to specify the names of the format and charset
columns:
create index hdocsx on hdocs(text)
indextype is ctxsys.context
parameters ('datastore file_datastore filter inso_filter
format column fmt
charset column cset');
The INSO filter first looks at the format column. If the format column value is TEXT,
then it looks at the charset column for the character set of the document. The document
text is converted from this character set to the database character set.
The charset column must be in the same table as the text column, and it must be CHAR,
VARCHAR, or VARCHAR2 type. The value is case-insensitive, but should be an NLS character
set string. If the column value is null or is not recognized, then the document is passed
through without any conversion. Charset column does not have to be specified when using
the INSO filter.
If the charset column is specified and the format column is not specified, then all
documents are assumed to be text documents, and the INSO filter works like the CHARSET
filter, described below.
CHARSET_FILTER
The CHARSET filter can also take advantage of the charset column, which allows you to
index a table of text documents in different character sets. When using the charset
column, the CHARSET attribute becomes the default value -- when the charset column is null
or the value is not recognized, then the document is assumed to be in the character set
named in the attribute, and is converted from that to the database character set.
Like the INSO filter, you do not need to specify the charset column -- it works just
like the 8.1.5 version, and assumes all documents are in the CHARSET attribute character
set.
Section Group Improvements
BASIC_SECTION_GROUP
The tagged section groups (basic, html, xml) have all been re-written to share a common
parser. As a result, the BASIC section group can now parse tags with attributes, although
currently it is not able to index those attributes.
HTML_SECTION_GROUP
The HTML section group can now index META tags. META tags in HTML have two forms. The
first is used for embedding http header data, and is not interesting to information
retrieval. The second form, however, is used for encoding document meta-data in name/value
pairs:
<META NAME="author" CONTENT="Mr. Roboto">
8.1.5 is unable to make use of this because the information is in attribute values,
which were ignored. Furthermore, even if the product had attribute value searching, the
name part of the name/value pair is actually an attribute value rather than an attribute
name.
Instead, the HTML section group has been enhanced with special detection of the META
tag. Like any other section, the META pairs which should be indexed are declared using
add_section. However, the tag portion has a special syntax: META@<name>:
ctx_ddl.create_section_group('mysg','html_section_group');
ctx_ddl.add_field_section('mysg','author','meta@author');
With this tag syntax, the HTML section group will interpret
<META NAME="author" CONTENT="Mr. Roboto">
as:
<META@AUTHOR>Mr.Roboto</META@AUTHOR>
and index "Mr. Roboto" as the contents of the field section
"author". The query looks like any other section query:
Roboto within author
to find the document.
META name/value pairs which have not been added to the section group are ignored and
not indexed. META tags which do not have NAME or CONTENT attributes are also ignored.
XML_SECTION_GROUP Doctype Limiter
One problem with the XML section group in 8.1.5 is the inability to distinguish between
tags in different DTD's. For instance, perhaps you have a DTD for storing contact
information:
<!DOCTYPE contact>
<contact>
<address>506 Blue Pool Road</address>
<email>dudeman@oracle.com</email>
</contact>
Appropriate sections might look like:
ctx_ddl.add_field_section('mysg','email', 'email');
ctx_ddl.add_field_section('mysg','address','address');
Which works out fine... until you start getting a different kind of document in the
same table:
<!DOCTYPE mail>
<mail>
<address>dudeman@oracle.com</address>
</mail>
Now your address section, originally intended for street addresses, starts picking up
email addresses, because of tag collision.
8.1.6 allows you to specify doctype limiters to distinguish between these tags across
doctypes. You simply specify the doctype in parentheses before the tag:
ctx_ddl.add_field_section('mysg','email','email');
ctx_ddl.add_field_section('mysg','address','(contact)address');
ctx_ddl.add_field_section('mysg','email','(mail)address');
Now when the XML section group sees an address tag, it will index it as the address
section when the document type is contact, or as the email section when the document type
is mail.
If you have both doctype-limited and unlimited tags in a section group:
ctx_ddl.add_field_section('mysg','sec1','(type1)tag1');
ctx_ddl.add_field_section('mysg','sec2','tag1');
Then the limited tag applies when in the doctype, and the unlimited tag applies in all
other doctypes.
Querying is unaffected by this -- the query is done on the section name, not the tag,
so querying for an email address would be done like:
oracle within email
which, since we have mapped two different kinds of tags to the same section name, finds
documents independent of which tags are used to express the email address.
XML_SECTION_GROUP Attribute Sections
Another commonly-requested feature in the 8.1.6 XML section group is the ability to
index and search within attribute values. For instance, take a document with the
following:
<comment author="jeeves">
I really like interMedia Text
</comment>
in 8.1.5, the author information, like all attributes, is ignored and cannot be indexed
or searched. In 8.1.6, the XML section group offers a new type of section called an
attribute section, which nominates interesting attribute values to index:
ctx_ddl.add_attr_section('mysg','author','comment@author');
The syntax is very similar to the other add_section calls -- the first argument is the
name of the section group, the second is the name of the section, and the third is the
tag, in the form <tag_name>@<attribute_name>.
This tells interMedia Text to index the contents of the author attribute of the
comment tag as the section "author". Query syntax is just like any other
section:
jeeves within author
finds the document. Attribute text is considered to be invisible, though, so the
following:
jeeves
does NOT find the document, somewhat like field sections. Unlike field sections,
however, attribute section within searches can distinguish between occurrences. Given the
document:
<comment author="jeeves">
I really like interMedia Text
</comment>
<comment author="bertram">
Me too
</comment>
the query:
(cryil and bertram) within author
will not find the document, because "jeeves" and "bertram" do not
occur within the SAME attribute text.
Attribute section names cannot overlap with zone or field section names, although you
can map more than one tag@attr to a single section name.
Attribute sections do not support default values. Given the document:
<!DOCTYPE foo [
<!ELEMENT foo (bar)>
<!ELEMENT bar (#PCDATA)>
<!ATTLIST bar
rev CDATA "8i">
]>
<foo>
<bar>whatever</bar>
</foo>
and attribute section:
ctx_ddl.add_attr_section('mysg','barrev','bar@rev');
the query:
8i within barrev
does not hit the document, although in XML semantics, the "bar" element has a
default value for its "rev" attribute.
Attribute Value Sensitive Section Search
Attribute sections allow you to search the contents of attributes. They do not allow
you to use attribute values to specify sections to search. For instance, given the
document:
<comment author="jeeves">
I really like interMedia Text
</comment>
You can find this document by asking:
jeeves within comment@author
which is equivalent to "find me all documents which have a comment element whose
author attribute's value includes the word jeeves".
However, there is no way to ask for something like:
interMedia within comment where (@author = "jeeves")
in other words, "find me all documents where interMedia appears in a comment
element whose author is jeeves". This feature -- attribute value sensitive section
searching -- is planned for future versions of the product.
Dynamic Add Section
Because the section group is defined before creating the index, 8.1.5 is limited in its
ability to cope with changing structured document sets; if your documents start coming
with new tags, or you start getting new doctypes, you have to re-create the index to start
making use of those tags.
8.1.6 allows you to add new sections to an existing index without rebuilding the index,
using alter index and the new add section parameters string syntax:
add zone section <section_name> tag <tag>
add field section <section_name> tag <tag> [ visible | invisible ]
For instance, to add a new zone section named tsec using the tag title:
alter index <indexname> rebuild
parameters ('add zone section tsec tag title')
To add a new field section named asec using the tag author:
alter index <indexname> rebuild
parameters ('add field section asec tag author')
This field section would be invisible by default, just like when using
add_field_section. To add it as visible field section:
alter index <indexname> rebuild
parameters ('add field section asec tag author visible')
Dynamic add section only modifies the index meta-data, and does not rebuild the index
in any way. This means that these sections take effect for any document indexed after the
operation, and do not affect any existing documents -- if the index already has documents
with these sections, they must be manually marked for re-indexing (usually with an update
of the indexed column to itself).
This operation does not support addition of special sections. Those would require all
documents to be re-indexed, anyway. This operation cannot be done using rebuild online,
but it should be a fairly quick operation.
AUTO_SECTION_GROUP
The ultimate in coping with changing structured document sets is the new AUTO section
group. The AUTO section group is like the XML section group, but requires no
pre-definition of sections. Instead, any non-empty tag is automatically indexed as a zone
section with the section name the same as the tag. For instance, given a document like
this:
<book>
<author>Neal Stephenson</author>
<title>The Diamond Age</title>
<description>
Decades into our future ...
</description>
</book>
The auto sectioner will create zone sections named book, author, title, and
description. Queries like this:
diamond within description
will find the document. Because sections are created dynamically from the tags, the
auto sectioner is ideal when the structure of the documents is not known to the index
creator, or when they are expected to change so that pre-definition with dynamic add is
impractical.
The only quirk to watch out for is that section names are case-insensitive, but tags in
XML are case-sensitive. For instance, given the above document, the query
diamond within description
finds the document, as does
diamond within DESCRIPTION
Both queries would also find a document:
<DESCRIPTION>Diamond Earring</DESCRIPTION>
even though "description" and "DESCRIPTION" are different tags in
XML.
You cannot add zone, field, or attribute sections to an AUTO_SECTION_GROUP in this
version. Furthermore, because the AUTO_SECTION_GROUP may encounter documents using the
tags PARAGRAPH or SENTENCE, it also does not support special sections.
AUTO_SECTION_GROUP Stop Sections
Even when using the AUTO section group, there may be some tags which are so common that
indexing them would be a waste of time, since they are not useful for searching. For such
tags, the AUTO section group has stop sections, which indicate tags to ignore -- much like
stop words indicate words to ignore.
Stop sections use the new add_stop_section call:
ctx_ddl.add_stop_section(<group name>, <tag>)
which indicates to the named AUTO section group that <tag> should be ignored and
not indexed as a section. add_stop_section can be used only on AUTO section groups. The
number of stop sections per AUTO group is unlimited.
The tag is case-sensitive, so:
ctx_ddl.add_stop_section('mysg','description')
is different from
ctx_ddl.add_stop_section('mysg','DESCRIPTION')
and these stop different things. If you have only:
ctx_ddl.add_stop_section('mysg','description')
there may still be a section named "DESCRIPTION", if another document uses
"Description" or "DESCRIPTION" as the tag. For this reason, a query
that does "within <stop section>" does not error.
The tag also supports doctype limiters. So, for instance, if you wanted to stop tag
"foo", but only in document type "bar":
ctx_ddl.add_stop_section('mysg','(bar)foo')
Stop sections can be added to an existing index dynamically, using the parameters
string syntax:
add stop section <tag>
as in:
alter index <indexname> rebuild
parameters ('add stop section author')
Unlike stop words, adding a stop section dynamically does not apply to documents
already indexed. This is partially to be consistent with the other dynamically add section
calls, but mainly because there could be other tags using that section name due to
case-insensitivity of section names.
AUTO_SECTION_GROUP Attribute Sections
The AUTO section group also automatically indexes attribute values as attribute
sections named <tag>@<attr>. For instance, given:
<book author="Edmund Wells">
<title>Rarnaby Budge</title>
</book>
You can find this document with the query:
wells within book@author
since the author attribute is automatically indexed as an attribute section.
Nested Within
8.1.6 also adds nested within queries -- also known as hierarchical section searching.
For instance, given a document with:
<book>
<title>Star Wars</title>
</book>
<video>
<title>Star Wars<title>
<video>
You can distinguish between the two types of titles using a nested within:
(Star Wars within title) within video
for instance, which is not allowed in 8.1.5. 8.1.6 provides this feature with two
important caveats. First, this only works with zone sections. Attribute or field sections
-- even visible field sections -- cannot be used in a nested within.
Second, because of the way interMedia Text indexes zone sections, equal sections
cannot be distinguished. For instance, the document:
<A><B>foo</B></A>
can be found with either:
(foo within A) within B
or
(foo within B) within A
This applies only when the sections bound exactly the same range. Any intervening text
allows interMedia Text to distinguish between the sections -- for instance:
<A><B>foo</B>bar</A>
Another aspect of this is that nested within does not imply an exact parent-child
relationship. In XQL, a query string with:
A/B
means that the B section is a first-level child of the A section. This not equivalent
to nested within:
(x within A) within B
does not mean that the A section is a child of the B section -- it could be a
grandchild or great-grandchild. Nested within is more like the XQL:
A/*/B
Lexer Improvements
BASIC_LEXER Generic Base Lettering
The BASE_LETTER function of the BASIC lexer is designed to translate any accented
characters to their unaccented representation for indexing and query. In 8.1.5, this
feature uses an NLS function which is sensitive to the language part of the NLS_LANG
setting. This means, for instance, that BASE_LETTER does not work for an AMERICAN
database, because the NLS function does not work in this language. For 8.1.6, based on our
input, NLS has provided a generic base lettering function which works in all languages.
This means that the BASE_LETTER function now works no matter what language your database
uses, which is ideal for multi-lingual and foreign-language databases.
This new function requires the NLS file lx30044.nlb to be installed, so if base
lettering is not working in 8.1.6, check for this file.
BASIC_LEXER THEME_LANGUAGE
The French knowledge base and lexicon is now public in 8.1.6. This allows indexing and
generation of themes for French documents. Normally, the BASIC lexer will choose which
knowledge base to use (French or English) based on the NLS_LANG setting for the session.
If, however, your NLS_LANG setting does not match your documents -- your documents are in
French and your env is in English -- then this will choose the wrong lexicon, or give an
error (if your environment is neither English nor French).
For such situations, the BASIC_LEXER has a new THEME_LANGUAGE setting to override
environment detection. This attribute can be set to 'ENGLISH' or 'FRENCH' to choose a
lexicon explicitly:
ctx_ddl.set_attribute('mylexer','theme_language','french');
This lexer now produces French themes independent of the environment. This attribute
can also be set to 'AUTO', or left unset. In such cases, the lexicon is chosen based on
the environment.
BASIC_LEXER Theme Admissibility
8.1.5 theme indexing produces a maximum of 50 themes per document. This means that long
documents may be difficult to find with a theme query, since lesser themes tend not to
make it into the top 50.
8.1.6 solves this by removing the 50-theme limit for theme indexing -- all document
themes are produced and indexed. Theme generation, via document services, however, is not
changed, and is still subject to the 50-theme maximum.
JAPANESE/KOREAN_LEXER UTF-8 Support
In 8.1.5, the JAPANESE_VGRAM_LEXER works only if your database character set is JA16EUC
or JA16SJIS. The KOREAN_LEXER works only in KSC5601. These lexers are enhanced in 8.1.6 to
also work in UTF-8. This allows multi-lingual databases to finally use the Japanese and
Korean lexers.
MULTI_LEXER
The biggest 8.1.6 addition in the lexer area is the new MULTI lexer, which allows
documents of different languages to be stored in a single table. The MULTI lexer chooses a
language-specific lexer for each row, based on a language column.
The first step with the MULTI lexer is the creation of language-specific lexer
preferences for each language in the multi-lingual table. As an example, let's create
English, French, and German lexers:
ctx_ddl.create_preference('english_lexer', 'basic_lexer');
ctx_ddl.set_attribute('english_lexer','index_themes','yes');
ctx_ddl.create_preference('french_lexer', 'basic_lexer');
ctx_ddl.set_attribute('french_lexer','base_letter','yes');
ctx_ddl.create_preference('german_lexer', 'basic_lexer');
ctx_ddl.set_attribute('german_lexer','composite','german');
ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');
ctx_ddl.set_attribute('german_lexer','mixed_case','yes');
Once the language-specific lexer preferences are created, they need to be gathered
together under a single MULTI lexer preference. First, we need to create the MULTI lexer
preference, using the MULTI_LEXER object:
ctx_ddl.create_preference('global_lexer','multi_lexer');
Now we must add the language-specific lexers to the multi lexer preference using the
new add_sub_lexer call:
ctx_ddl.add_sub_lexer(<multi_lexer>, <language>, <lang-specific lexer>)
For our example:
ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer');
ctx_ddl.add_sub_lexer('global_lexer', 'french', 'french_lexer');
ctx_ddl.add_sub_lexer('global_lexer', 'default','english_lexer');
This nominates the german_lexer preference to handle german documents, the french_lexer
preference to handle french documents, and the english_lexer preference to handle
everything else, using 'DEFAULT' as the language.
The MULTI lexer decides which lexer to use for each row based on a language column.
This is a character column in the table which stores the language of the document in the
text column. For instance, our table looks like:
create table globaldoc (
doc_id number primary key,
lang varchar2(10),
text clob
);
insert into globaldoc values (1, 'german', 'Ich bin Berliner');
insert into globaldoc values (2, 'french', 'Je suis Parisian');
insert into globaldoc values (3, 'english', 'I''m from New Yawk');
To index this table, we need to use the MULTI lexer preference, but we also need to
specify the name of the language column:
create index globalx on globaldoc(text)
indextype is ctxsys.context
parameters ('lexer global_lexer
language column lang');
The MULTI lexer then reads the value of the language column for each row, and uses the
associated language-specific lexer for that language.
The language column is case-independent, and can contain either the NLS name or
abbreviation of the language. If the table uses some other value for the language, you
need to specify that alternate value as the fourth argument when adding the sub lexers:
ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer', 'ger');
for instance, to use the ISO 639-2 string for german instead. Note that these alternate
values are in addition to the NLS names and abbreviations, so "D" and
"GERMAN" in the language column still select the german lexer. This means that
alternate values cannot be NLS language names or abbreviations.
The DEFAULT lexer is used when the language of the document is unknown or is not mapped
to a language-specific lexer. The DEFAULT lexer is also used to lex stopwords. A DEFAULT
lexer MUST be specified, or you will get an error when using the MULTI lexer preference in
create index. You cannot specify an alternate value for the DEFAULT sub lexer -- this does
not make sense.
At query time, the MULTI lexer chooses a language-specific lexer to lex the query
tokens. This is based on the NLS_LANG setting for the query session. Thus, a query session
in the FRENCH language will use the lexer for FRENCH. A more explicit method of specifying
the query language may appear in future versions of interMedia Text.
Wordlist Improvements
AUTO Setting for STEMMER and FUZZY_MATCH
The 8.1.5 stemmer and fuzzy_match attributes are set for a single, specific language at
create index time, making it unable to handle multi-lingual tables. 8.1.6 provides the new
AUTO setting for STEMMER and FUZZY_MATCH:
ctx_ddl.create_preference('mywl','basic_wordlist');
ctx_ddl.set_attribute('mywl', 'STEMMER', 'AUTO');
ctx_ddl.set_attribute('mywl', 'FUZZY_MATCH', 'AUTO');
During queries, when set to AUTO, the stemmer or fuzzy match expander will examine the
language setting of the session and choose the most appropriate setting for that language.
A user with NLS_LANG French, for instance, will get French stemming and fuzzy matching.
This mapping is not configurable in this release -- you cannot specify a particular
stemmer or fuzzy match cluster to use for each language.
This setting is designed for use with the MULTI lexer, which switches lexers based on
the query session's language setting. Together, they allow multi- lingual tables to be
indexed and queried.
WILDCARD_MAXTERMS
All versions of interMedia Text and ConText offer wildcard searching -- a
contains for any word matching a pattern. This is done by expanding the pattern against
the main index table using LIKE, and searching on all the resulting words.
ConText limited the size of the expansion to around 2500 words. interMedia Text
8.1.5 has no limit, but when the pattern is unselective, it uses up memory quickly.
8.1.6 re-establishes the expansion limit, but with a configurable maximum expansion --
the new WILDCARD_MAXTERMS attribute.:
ctx_ddl.create_preference('mywl','basic_wordlist');
ctx_ddl.set_attribute('mywl', 'WILDCARD_MAXTERMS', '2500');
Wildcard patterns matching more words than this will receive a "query too
complex" error. The default value, if unset, is 5000 words.
SUBSTRING_INDEX
Patterns which start with a wildcard pose another problem for wildcard searching.
Because wildcard searches are processed by expanding the pattern using a LIKE, they force
a full-table scan of the main index table. This type of query, sometimes called a
left-truncated search, takes a long time.
For those applications which require faster left-truncated searching, 8.1.6 offers the
new SUBSTRING_INDEX option. This is set like any other boolean attribute:
ctx_ddl.create_preference('mywl','basic_wordlist');
ctx_ddl.set_attribute('mywl', 'SUBSTRING_INDEX', 'TRUE');
When set, the index will include a new pattern table ($P) which stores rotations of all
unique words in the main index table. Using the pattern table instead of the main index
table allows interMedia Text to process left- truncated searches much much faster.
The query-time savings come at the cost of increased index time and space. Indexing
time can take as much as four times as long, and use up to 25% more space, so this option
should be used only when left-truncated wildcard searches are frequent and need to be
fast.
Storage Improvements
P_TABLE_CLAUSE for BASIC_STORAGE
Since we have a new index table when SUBSTRING_INDEX is set, the BASIC storage object
gets a new attribute -- P_TABLE_CLAUSE, which is tacked on the end of the create table
statement for the P table:
create table DR$<indexname>$P (
pat_part1 varchar2(61),
pat_part2 varchar2(64),
primary key (pat_part1, pat_part2)
) organization index
<p_table_clause>
We estimate that the P table should have approximately the same number of rows as the
$I table.
Index Creation and Maintenance
Parallel Indexing
ConText V2.x supported parallel index creation, using multiple ctxsrv processes. This
was removed in 8.1.5 because tighter integration with the kernel made it unfeasible. While
we continue to work for full support in future versions, 8.1.6 adds a temporary workaround
which allows parallel index creation with two caveats.
First, in order to use 8.1.6 parallel indexing, the base table must be a partitioned
table -- each parallel indexer will work on its own partition, so the parallel degree
should not exceed the number of partitions.
Second, because of the limitations of the kernel-interMedia Text interface, we
must use dbms_job to spawn the parallel indexers. This means that the database must be
running with job_queue_processes > 0, and that the parallel degree cannot exceed this
number.
Otherwise, the syntax is just as you would expect:
create index textidx on doctab(text)
indextype is ctxsys.context
parallel 3
This will create the index with parallel degreee 3. Note that because we use dbms_job
to spawn parallel indexers, so there is some lag before all indexers are running,
proportional to job_queue_interval.
If logging is active, multiple log files will be created by this -- one for each
parallel indexer. Those log files will be named <logfile>_<process#>. If, for
instance, you did:
ctx_output.start_log('mylog')
you could expect files "mylog", "mylog_1", "mylog_2",
etc.
PL/SQL Sync and Optimize
PL/SQL functions for sync and optimize have been added. The preferred method for sync
and optimize is alter index rebuild online. Ideally, these could be called in a dbms_job.
However, separate sync and optimize jobs can conflict because two alter index operations
cannot be run at the same time for a single index.
The new PL/SQL functions resolve both problems by proving an alternate PL/SQL API for
sync and optimize:
ctx_ddl.sync_index(<idx_name>)
idx_name is the name of the index to sync. The effective user must be the owner of the
index or ctxsys. If effective user is ctxsys, idx_name can be NULL, which will cause this
call to examine all context indexes for pending DML. This uses the same code as ctxsrv. We
now recommend processing pending DML through a dbms_job -- an individual job for each
index, a ctxsys job for all indexes, or some combination of the two. This eliminates the
need for ctxsrv, and thus eliminates the problem of starting ctxsrv each time the database
starts. The ctxsrv program itself is now deprecated, and may be removed in some future
release.
ctx_ddl.optimize_index(<idx_name>, <optlevel>, <maxtime>)
idx_name is the name of the index to optimize. optlevel is 'FAST' or 'FULL' for the
optimization level. There are also symbols CTX_DDL.OPTLEVEL_FAST and CTX_DDL.OPTLEVEL_FULL
for these values. maxtime is the maximum time, in minutes, to spend on optimization. This
applies only to FULL optimization. By default, maxtime is unlimited. You can also pass in
NULL or the symbol CTX_DDL.MAXTIME_UNLIMITED.
Query Language Improvements
New Accumulate Scoring
8.1.6 has a new algorithm for processing accumulate queries. The changes mostly resolve
scoring inconsistencies the 8.1.5 algorithm has in certain cases. The new algorithm also
handles weighted children better than its predecessor.
Nested Within
8.1.6 can process nested withins -- also known as hierarchical section searches --
which allows queries like this:
(dog within title) within book
Please refer above for details and caveats.
BROWSE_WORDS
Many IR systems include a feature called vocabulary browse. This takes an input word
and displays words in the text index which are alphabetically close. This feature is
useful for users to correct misspelled words in the query. By including a rough document
count for each word, it also allows the user to eliminate unselective words.
Administrators may also use it to correct misspellings in document text.
8.1.6 provides a new function for this -- browse_words:
ctx_query.browse_words(<index_name>, <seed>, <restab>,
<browse_id>, <numwords>, <direction>)
index_name is the name of the index to browse.
seed is the start word. This must be a single word, and not a phrase.
restab is the name of a result table which should have the schema:
browse_id number
word varchar2(64)
doc_count number
The results of the browse are placed in this table. Existing contents are not deleted
before browse.
browse_id is a numeric identifier for the operation. Output rows in the result table
will have this value in the browse_id column. If not specified this defaults to 0.
numwords is the length of the produced list, in words. This must be between 1 and 1000.
If not specified, this defaults to 10 words.
direction specified the direction of the browse. 'BEFORE' means the seed word and words
alphabetically before the seed, 'AFTER' means the seed and words alphabetically after the
seed, and 'AROUND' means the seed and words both before and after. There are also symbols
CTX_QUERY.BROWSE_BEFORE, CTX_QUERY.BROWSE_AROUND, and CTX_QUERY.BROWSE_AFTER for these
values. This defaults to 'AROUND'
An example might look like:
exec ctx_query.browse_words('myidx','dog','myres', 5);
select word, doc_count from myres order by word;
WORD DOC_COUNT
-------- ---------
CZAR 15
DARLING 5
DOG 73
DUNK 100
EAR 3
Note that the DOC_COUNT is an upper bound which does not take into account old
information from deleted or updated documents. FULL optimization removes this old
information from the index, and makes these counts exact.
Browse also has a variant which delivers the result in a PL/SQL table instead of a
result table:
ctx_query.browse_words(<index_name>, <seed>, <resarr>,
<numwords>, <direction>)
Here, resarr is a PL/SQL table of type ctx_query.browse_tab:
type browse_rec is record (
word varchar2(64),
doc_count number
);
type browse_tab is table of browse_rec index by binary_integer;
The results are placed in the table, ordered alphabetically, with the alphabetically
first word at index 1. Existing elements of the table are deleted upon entry into the
function.
Document Services Improvements
Rowid Input
All document services work on single documents. To specify the document, these
interfaces take in a primary key specification. This may not be convenient when the table
uses a composite primary key or you have only the rowid. 8.1.6 allows you to specify a
rowid instead.
The document services interfaces have not been changed for this feature. The textkey
argument can now take either a rowid or a primary key spec. The key type of the document
services determines how the value is interpreted.
You set the key type using the new set_key_type call:
ctx_doc.set_key_type('ROWID');
After which, all document services will expect rowid values. This setting has session
duration and scope. To set it back:
ctx_doc.set_key_type('PRIMARY_KEY');
there are also symbols (CTX_DOC.TYPE_ROWID and CTX_DOC.TYPE_PRIMARY_KEY) defined for
these values.
The default is primary key mode, but you can change the default using the new system
parameter CTX_DOC_KEY_TYPE. For instance, if you want your system to default to ROWID
input:
ctx_adm.set_parameter('CTX_DOC_KEY_TYPE', CTX_DOC.TYPE_ROWID);
This has system scope and permanent duration. Note that this only sets the default --
sessions can always override this via set_key_type.
In-Memory Document Services
In 8.1.5, the document services use result tables to store their results. If you don't
need to store the result of the operation, result table I/O is wasteful, and application
code must be written to manage these temporary rows. When you do need to store the
results, the schema of the result tables may not be sufficient for your needs -- you may
need to store a varchar2 primary key, for instance.
8.1.6 introduces in-memory variants of all document services which avoid result tables
and instead output to PL/SQL variables. This avoids unneeded I/O and allows the
application to more easily process the transient results. We'll look at these new
functions individually, starting with:
ctx_doc.filter(<index_name>, <textkey>, <result>, <plaintext>)
The result argument is a CLOB locator. The document will be filtered and the filtered
text placed into this CLOB. If the locator passed in is NULL, a temporary CLOB is
allocated for you. It is up to the app to deallocate it.
ctx_doc.markup(<index_name>, <textkey>, <text_query>, <result>,
<plaintext>, <tagset>, <starttag>, <endtag>,
<prevtag>, <nexttag>)
The result argument is a CLOB locator, just like in the filter call.
ctx_doc.highlight(<index_name>, <textkey>, <text_query>,
<restab>, <plaintext>)
The highlight call can return multiple highlights, so a single PL/SQL variable isn't
sufficient. Instead, the restab argument is a PL/SQL table of type ctx_doc.highlight_tab:
type highlight_rec is record (
offset number;
length number;
);
type highlight_tab is table of highlight_rec index by binary_integer;
The first highlight will be placed at index 1. Existing contents of the table when
passed in are deleted.
ctx_doc.themes(<index_name>, <textkey>, <restab>, <full_themes>)
Like highlight, ctx_doc.themes can return multiple themes. restab is a PL/SQL table of
type ctx_doc.theme_tab:
type theme_rec is record (
theme varchar2(2000);
weight number;
);
type theme_tab is table of theme_rec index by binary_integer;
The first theme is placed at index 1. Existing contents of the table when passed in are
deleted. The themes are not guaranteed to be placed in the theme_tab in any particular
order.
ctx_doc.gist(<index_name>, <textkey>, <result>, <glevel>, <pov>,
<numparagraphs>, <maxpercent>)
The in-memory version of gist is the only call which deviates from its result-table
version. Unlike the result-table version, which returns multiple gists -- one per
point-of-view -- the in-memory version returns a single gist, selected by the pov
argument. If no pov is specified, the GENERIC gist is generated. result is a CLOB locator
as in ctx_doc.filter or ctx_doc.markup.
Thesaurus Improvements
Thesaurus Maintenance
The thesaurus functionality in 8.1.5 supports file import and export, but once a
thesaurus is in the database, it's difficult to change -- the API allows only the addition
of new phrases and relations.
8.1.6 rounds out the API with new maintenance functions which allow changing, renaming,
and removal of phrases and relations. There are also calls for adding new phrases and
relations which are hopefully less confusing than the existing functions.
Let's look at the new additions call-by-call.
ALTER_THESAURUS
ctx_thes.alter_thesaurus(<tname>, <op>, <operand>)
This call has two variations. To rename a thesaurus, op is set to 'RENAME' and operand
is set to the new name:
ctx_thes.alter_thesaurus('oldthesname','RENAME','newthesname');
To truncate a thesaurus -- that is, remove all phrases and relations -- op is set to
'TRUNCATE' and the operand is not needed:
ctx_thes.alter_thesaurus('mythes','TRUNCATE');
'RENAME' and 'TRUNCATE' also have symbols defined -- CTX_THES.OP_RENAME and
CTX_THES.OP_TRUNCATE.
ALTER_PHRASE
ctx_thes.alter_phrase(<tname>, <phrase>, <op>, <operand>)
This call has three variations. To change a phrase, set op to 'RENAME':
ctx_thes.alter_phrase('mythes','old phrase','RENAME','new phrase');
This leaves all the relations intact -- it just changes the phrase. You can also use
this to add, change, or remove qualifiers:
ctx_thes.alter_phrase('mythes','crane','RENAME','crane(bird)');
ctx_thes.alter_phrase('mythes','crane(bird)','RENAME','crane(stork)');
ctx_thes.alter_phrase('mythes','crane(bird)','RENAME','crane');
The second variation makes a phrase the preferred term in its synonym ring:
ctx_thes.alter_phrase('mythes','laser','PT');
If the phrase's synonym ring already has a preferred term, this operation overrides it.
The old preferred term becomes a non-preferred synonym and the named phrase becomes the
preferred term.
The third variant changes the scope note on a phrase:
ctx_thes.alter_phrase('mythes','ships','SN', 'sea-going vessels');
The operand overwrites whatever scope note was on the phrase, so this can be used to
add, change, or remove a scope note.
The operations all have symbols: CTX_THES.OP_RENAME, CTX_THES.OP_PT, and
CTX_THES.OP_SN.
DROP_PHRASE
ctx_thes.drop_phrase(<tname>, <phrase>)
This call removes a phrase from a thesaurus. BT*/NT* relations are patched around the
dropped phrase. A relation is established between each narrower term and each broader
term. For instance, a simple example:
A
BT B
B
BT C
and you do:
ctx_thes.drop_phrase('mythes','B');
then the resulting thesaurus looks like:
A
BT C
For multiple broader and narrower terms:
A B
\ /
C
/ \
D E
when C is removed, the thesaurus looks like this:
A B
|\/|
|/\|
D E
This patching is done for all hierarchical relations -- BT, BTI, BTP, BTG, etc.
Associative (RT) relations are not patched. If A RT B and B RT C, and B is removed, then
no A RT C relation is created. Synonymity is also not affected. If the removed phrase is
the preferred term, no phrase in the ring is elevated to preferred term.
CREATE_RELATION
ctx_thes.create_relation(<tname>, <phrase>, <rel>, <relphrase>);
This call creates a relation between phrase and relphrase. You can do this in 8.1.5
using create_phrase, but the usage is confusing. For instance, if you want to create the
thesaurus:
dog
BT animal
It looks like:
ctx_thes.create_phrase('mythes', 'dog');
ctx_thes.create_phrase('mythes', 'animal', 'BT', 'dog');
Two problems here. The first problem is that you need two calls to create a thesaurus
with one relation. The second problem is that in the second call, DOG and ANIMAL are in
opposite order of how they look in the thesaurus file. This is confusing and unintuitive.
create_relation solves both by automatically creating unknown phrases or relphrases,
and adhering closely to thesaurus file order. The above thesaurus is created using
create_relation like this:
ctx_thes.create_relation('mythes', 'dog', 'BT', 'animal');
create_relation also allows you to add translations to phrases, using <lang>: as
the relation:
ctx_thes.create_relation('mythes', 'dog', 'french:', 'chien');
Again, very similar to how it would look in the thesaurus file.
create_relation is the preferred syntax now, and the rel and relphrase arguments of
create_phrase are now deprecated and should not be used. Users are discouraged from using
these arguments.
DROP_RELATION
ctx_thes.drop_relation(<tname>, <phrase>, <rel>, <relphrase>);
This is the opposite of create_relation, and removes a relation between two phrases,
leaving the phrases and their other relations untouched. For instance, the sequence:
ctx_thes.create_relation('mythes', 'A', 'BT', 'B');
ctx_thes.drop_relation('mythes', 'A', 'BT', 'B');
results in mythes having the phrases A and B, but no relations between them.
You can remove BT*/NT* and RT relations. You can also use SYN to remove the phrase from
its synonym ring:
ctx_thes.create_relation('mythes', 'A', 'SYN', 'B');
ctx_thes.create_relation('mythes', 'A', 'SYN', 'C');
--> A, B, and C are synonyms
ctx_thes.drop_relation('mythes', 'A', 'SYN');
--> B and C are synonyms, but A is not in the ring
or PT to remove preferred term status, but leave the phrase in the ring:
ctx_thes.create_relation('mythes', 'B', 'SYN', 'C');
ctx_thes.create_relation('mythes', 'B', 'USE', 'A');
--> A, B, and C are synonyms, and A is the preferred term
ctx_thes.drop_relation('mythes', 'A', 'PT');
--> A, B, and C are synonyms, but the ring no longer has any preferred term
Finally, you can use <lang>: to remove translations:
ctx_thes.drop_relation('mythes', 'dog', 'french:', 'chien');
relphrase is actually optional. If omitted, all relations of that type are removed from
the phrase. For instance:
ctx_thes.drop_relation('mythes','A', 'BT');
drops all broader term relations of A.
ctx_thes.drop_relation('mythes','A', 'french:');
drops all french translations of A.
Maintenance Security
One last security note for all the above functions -- only the owner of the thesaurus
and ctxsys is allowed to modify the thesaurus, although all users can use a thesaurus
during query.
PL/SQL Table Expansion
The thesaurus is really meant to be used during a contains query, with the thesaurus
operators. However, 8.1.5 does include a simple PL/SQL expansion API. The functions work
just like their query counterparts, and return the expansion as a single string. For
instance, given the thesaurus:
animal
NT dog
NT cat
NT bird
and the PL/SQL:
declare
exp varchar2(80);
begin
exp := ctx_thes.nt('animal',1,'mythes');
dbms_output.put_line(exp);
end;
would print out something like:
{animal}|{dog}|{cat}|{bird}
which allows you to see the expansion. However, the single string expansion must be
parsed if you want the individual terms, for use in a web interface or your own thesaurus
dumping mechanism.
8.1.6 adds PL/SQL table expansion for such applications. Expansions are placed into a
PL/SQL table like this:
type exp_rec is record (
xrel varchar2(12),
xlevel number,
xphrase varchar2(256)
);
type exp_tab is table of exp_rec index by binary_integer;
xrel holds the relation of the input phrase to the output phrase (BT, NT, SYN, etc).
The input phrase is also in the table, with xrel of PHRASE. For SYN and TRSYN expansions,
the preferred term has xrel of PT.
xlevel tells you the distance from the input phrase to the output phrase. This is used
mainly for hierarchical (BT*/NT*) expansions. For instance, in a thesaurus like:
A
NT B
B
NT C
C
NT D
and an NT expansion on A for 3 levels, the resulting table looks like:
xrel xlevel xphrase
------ ------ -------
PHRASE 0 A
NT 1 B
NT 2 C
NT 3 D
For non-hierarchical expansions, xlevel is 1 for all expanded terms, except the phrase
itself, which is always 0, and translations of synonyms in TRSYN, which always have xlevel
of 2.
xphrase is the output phrase itself. This will include the qualifier in parentheses, if
one exists for the phrase.
The table API is like the existing string API, but instead of returning the expansion,
it takes a table as the first argument. Our example above, using the new API, looks like:
declare
exp ctx_thes.exp_tab;
begin
ctx_thes.nt(exp, 'animal', 1, 'mythes');
for i in 1..exp.count loop
dbms_output.put_line(lpad(' ',2*exp(i).xlevel)||
exp(i).xrel||' '||exp(i).xphrase);
end loop;
end;
and prints out:
PHRASE animal
NT dog
NT cat
NT bird
To complete the API, 8.1.6 adds two new expansion functions:
sn(<phrase>, [<tname>]);
returns the scope note for the given phrase, and
thes_tt(<restab>, [<tname>])
returns thesaurus top terms into a PL/SQL table. This is different from TT. TT returns
the top term for a given phrase. THES_TT searches the entire thesaurus and returns all top
terms. A top term for THES_TT is defined as a term which has narrower terms but no broader
terms. This is designed to be used in browsing applications, which may wish to start with
the thesaurus top terms. Note that because THES_TT searches the entire thesaurus, it may
take some time to complete for large thesauri. If the application needs the thesaurus top
terms frequently, we strongly suggest caching the results.
|