Oracle interMedia Text 8.1.6 - Technical Overview

Oracle8i interMedia Text 8.1.6 - Technical Overview

Garrett Kaminaga, Product Development, Database and Application Server Technologies, Oracle HQ

This is a technical overview of the improvements found in interMedia Text, version 8.1.6. This is intended for an audience familiar with version 8.1.5. If you are new to interMedia Text, please start with Oracle8i interMedia Text 8.1.5 - Technical Overview.  Please refer to the documentation for detailed information.

 

Table of Contents

  1. Overview of 8.1.6 Improvements
  2. Upgrading 8.1.5 to 8.1.6
  3. Datastore Improvements
  4. Filter Improvements
  5. Section Group Improvements
  6. Lexer Improvements
  7. Wordlist Improvements
  8. Storage Improvements
  9. Index Creation and Maintenance
  10. Query Language Improvements
  11. Document Services Improvements
  12. Thesaurus Improvements



Overview of 8.1.6 Improvements

While 8.1.6 has features which improve index performance and ease-of-use, most of the new features are targeted in two key areas: structured documents and multi-lingual databases.

The improvements in structured document support include indexing and searching attribute text, nested within for sophisticated queries, doctype-limited tag detection, dynamic add section for existing indexes, and a new AUTO sectioner which requires no section pre-definition.

Multi-lingual databases can now store documents of different languages in a single column, thanks to the new MULTI lexer. Furthermore, improvements in INSO and charset filters allow documents of different formats and character sets in a single column. UNICODE support has been added to the Japanese and Korean lexers.

Other interesting features include limited parallel indexing, more flexible stem and fuzzy operators, and in-memory document services.



Upgrading 8.1.5 to 8.1.6

Automated Upgrade

For an automated upgrade from 8.1.5 to 8.1.6, use Oracle's Database Migration Assistant tool (ODMA) -- this will run all needed scripts for interMedia.

There's no easy way to verify that the upgrade completed successfully. However, you can, after the upgrade, issue:

  select * from ctx_version

as ctxsys. If this reports 8.1.6, the upgrade script has been run. It's probably wise to check for any invalid ctxsys objects anyway.


Manual Upgrade

In the event that the automatic upgrade fails, the manual process is below. This process assumes that you are starting with a valid 8.1.5 ctxsys data dictionary, and that the system is not being used for index, DML, or query.

1. run ?/ctx/admin/upgrade/s0801060.sql as SYS
this grants new, needed database privileges to ctxsys

2. run ?/ctx/admin/upgrade/u0801060.sql as CTXSYS
this brings your ctxsys data dictionary to 8.1.6

3. run ?/ctx/admin/dr0pkh.sql as CTXSYS
this re-creates all public and private package headers

4. run ?/ctx/admin/dr0plb.sql as CTXSYS
this re-creates all public and private package bodies

5. run ?/ctx/admin/dr0type.plb as CTXSYS
this re-creates the index type body

6. check for any invalid ctxsys objects


Upgrading Existing Indexes

Nothing needs to be done to upgrade existing 8.1.5 context indexes. These will still work without any problems in 8.1.6.



Datastore Improvements

USER_DATASTORE

The user datastore in 8.1.5 is limited to CLOB output for the generated documents. A new attribute, OUTPUT_TYPE, allows you to change the output type to BLOB or VARCHAR2. For instance, a preference like this:

  begin
     ctx_ddl.create_preference('mystore','user_datastore');
     ctx_ddl.set_attribute('mystore','procedure','foo');
  end;

mandates a foo procedure with the following signature:

  procedure foo (in rowid, in out clob)

If instead we add:

  ctx_ddl.set_attribute('mystore','output_type','varchar2');

then interMedia Text will expect the foo procedure to have the signature:

  procedure foo (in rowid, in out varchar2)

If the synthesized document is expected to be less than 32000 bytes, then a varchar2 output function is probably easier to write, easier to test, and faster to execute than the CLOB version.

You can also set the output_type to BLOB, for formatted documents or documents in a different character set. Note, however, than it is still not possible to use or call an external procedure in the user datastore.

If output_type is not specified, CLOB is assumed, for backward compatibility.


NESTED_DATASTORE

The new NESTED_DATASTORE object allows indexing the rows of a nested table as parts of a single document -- it is very much like the detail datastore, but here the detail table is a nested table.

For example, consider a table mytab, where the document text is stored in a nested table, one line per row of the nested table:

  create type nt_rec as object (
    lno  number,        -- line number
    ltxt varchar2(80)   -- text of line
  );

  create type nt_tab as table of nt_rec;

  create table mytab (
    id    number   primary key,
    dummy char(1),
    doc   nt_tab
  )
  nested table doc store as myntab;

If you are running the above in SQL*Plus, you'll need forward slashes after each statemeny to execute it.

 insert into mytab values (1, null, nt_tab());
  insert into table(select doc from mytab where id = 1) 
    values (1, 'this is the first line');
  insert into table(select doc from mytab where id = 1) 
    values (2, 'this is the second line');

For each row of mytab, interMedia Text needs to concatenate the rows of the nested doc table into a single document. In 8.1.5, this has to be done using a user datastore. 8.1.6 provides this as built-in functionality.

We start by creating a nested table datastore:

  ctx_ddl.create_preference('mynt','nested_datastore');

We need to provide the name of the nested table column in the base table:

  ctx_ddl.set_attribute('mynt','nested_column','doc');

We also need to provide the type of the nested table. This should include the type owner schema. This cannot be determined automatically because the table name is not known until create index time:

  ctx_ddl.set_attribute('mynt','nested_type','scott.nt_tab');

Finally, we need to specify the names of the line number and text columns within the nested table record type:

  ctx_ddl.set_attribute('mynt','nested_lineno','lno');
  ctx_ddl.set_attribute('mynt','nested_text',  'ltxt');

The extensible indexing framework does not allow creating an index on a nested table type, so we need to create the index on a dummy column, just like the detail datastore:

  create index myidx on mytab(dummy)
   indextype is ctxsys.context
   parameters ('datastore mynt');

For each record of the base table, interMedia Text will select the rows of the nested table using sql somewhat like this:

  select <nested_text>
    from table(select <nested_column>
                 from <base_table>
                where rowid = <current_row_rowid>)
   order by nested_table_id, <nested_lineno>

and concatenate the rows into a single document. Thus, a search like:

  select id from mytab where contains(dummy, 'first and second')>0

will return the one base table row, even though the hits are in different rows of the nested table.

Like the detail datastore, the nested table datastore has a BINARY attribute to control the concatenation of the nested table lines. If BINARY is TRUE, interMedia Text will simply concatenate the text together. If BINARY is FALSE, interMedia Text will insert a newline in between each nested table line.

Given the nested table sql, a unique index on the nested table, on the columns (nested_table_id, <lineno>) should improve indexing performance.

Because the index is created on a dummy column, any DML on the nested table must be manually propagated to the the dummy column -- interMedia Text cannot automatically detect such changes and mark rows for reindex.



Filter Improvements

INSO_FILTER Character Set Support

In 8.1.5, the INSO filter outputs ISO8859-1 text, which is then converted to the database character set. However, not all documents can be expressed in ISO8859-1, so the INSO filter cannot be used in all situations.

8.1.6 uses a newer revision of the INSO libraries which features UNICODE output. Since UNICODE is designed to be able to express any text, the INSO filter can now be used on any database, as long as the document text can be expressed in the database character set.

For instance, in 8.1.5, a Japanese document in MS Word on a JA16EUC database cannot be sent to the INSO filter, because the plain text output is forced to ISO8859-1, which is incapable of expressing the document. In 8.1.6, the plain text is UNICODE, and everything works fine. Sending the same document through on a US7ASCII database, however, will not work since the Japanese text cannot be converted to the database character set.


Format Column

Another improvement in the INSO filter is the addition of optional bypasses for heterogenous docsets. In 8.1.5, if you employ the INSO filter in your index, every document in the base table is sent through the INSO filter. This works fine when all the documents are formatted documents, but the docset may also have HTML and plain text documents. Filtering these wastes time and sometimes produces unexpected results.

8.1.6 adds a mechanism to bypass INSO filtering for selected rows. This requires a column in the base table -- called the format column -- which specifies that the data is BINARY or TEXT. For instance, let's consider the table:

  create table hdocs (
    id   number        primary key,
    fmt  varchar2(10), 
    text varchar2(80)
  );

We can insert a word doc, with the value 'BINARY' in the format column:

  insert into hdocs values (1, 'binary', '/docs/myword.doc');

and an HTML document, with the value 'TEXT' in the format column:

  insert into hdocs values (2, 'text',   '/docs/index.html');

When we create the index, we need to specify the name of the format column, using the new 'format column' clause of the parameters string:

  create index hdocsx on hdocs(text)
   indextype is ctxsys.context
   parameters ('datastore file_datastore filter inso_filter 
                format column fmt');

Since we have specified a format column, for each row, the INSO filter will examine the value in 'fmt'. If this value is 'BINARY', as it is for row 1, then it will send it through the INSO filter. If the value is 'TEXT', it by-passes the INSO filter and sends the text straight through.

The format column must be in the same table as the text column, and it must be CHAR, VARCHAR, or VARCHAR2 type. The value is case-insensitive, so 'BINARY' and 'binary' are the same. If the column value is null or not 'BINARY' nor 'TEXT', then the document is assumed to be binary, and it is filtered by the INSO filters.

The format column does not have to be specified when using the INSO filter. This bypass mechanism is optional. The format column can be specified when not using the INSO filter, but it has no effect.


Charset Column

If the document set is even more heterogenous, the TEXT documents may not all be in the same character set, and may not be in the database character set. The INSO filter can do per-row character set translation for such documents, if the character set is specified in another column, called the charset column. For instance, let's expand our example:

  create table hdocs (
    id   number        primary key,
    fmt  varchar2(10),
    cset varchar2(20), 
    text varchar2(80)
  );

We can insert a plain text Japanese document in EUC:

  insert into hdocs values (1, 'text', 'JA16EUC', '/docs/tekusuto.euc');

and one in Shift-JIS:

  insert into hdocs values (2, 'text', 'JA16SJIS', '/docs/tekusuto.sjs');

When we create the index, we need to specify the names of the format and charset columns:

  create index hdocsx on hdocs(text)
   indextype is ctxsys.context
   parameters ('datastore file_datastore filter inso_filter 
                format  column fmt
                charset column cset');

The INSO filter first looks at the format column. If the format column value is TEXT, then it looks at the charset column for the character set of the document. The document text is converted from this character set to the database character set.

The charset column must be in the same table as the text column, and it must be CHAR, VARCHAR, or VARCHAR2 type. The value is case-insensitive, but should be an NLS character set string. If the column value is null or is not recognized, then the document is passed through without any conversion. Charset column does not have to be specified when using the INSO filter.

If the charset column is specified and the format column is not specified, then all documents are assumed to be text documents, and the INSO filter works like the CHARSET filter, described below.


CHARSET_FILTER

The CHARSET filter can also take advantage of the charset column, which allows you to index a table of text documents in different character sets. When using the charset column, the CHARSET attribute becomes the default value -- when the charset column is null or the value is not recognized, then the document is assumed to be in the character set named in the attribute, and is converted from that to the database character set.

Like the INSO filter, you do not need to specify the charset column -- it works just like the 8.1.5 version, and assumes all documents are in the CHARSET attribute character set.



Section Group Improvements

BASIC_SECTION_GROUP

The tagged section groups (basic, html, xml) have all been re-written to share a common parser. As a result, the BASIC section group can now parse tags with attributes, although currently it is not able to index those attributes.


HTML_SECTION_GROUP

The HTML section group can now index META tags. META tags in HTML have two forms. The first is used for embedding http header data, and is not interesting to information retrieval. The second form, however, is used for encoding document meta-data in name/value pairs:

  <META NAME="author" CONTENT="Mr. Roboto">

8.1.5 is unable to make use of this because the information is in attribute values, which were ignored. Furthermore, even if the product had attribute value searching, the name part of the name/value pair is actually an attribute value rather than an attribute name.

Instead, the HTML section group has been enhanced with special detection of the META tag. Like any other section, the META pairs which should be indexed are declared using add_section. However, the tag portion has a special syntax: META@<name>:

  ctx_ddl.create_section_group('mysg','html_section_group');
  ctx_ddl.add_field_section('mysg','author','meta@author');

With this tag syntax, the HTML section group will interpret

  <META NAME="author" CONTENT="Mr. Roboto">

as:

  <META@AUTHOR>Mr.Roboto</META@AUTHOR>

and index "Mr. Roboto" as the contents of the field section "author". The query looks like any other section query:

  Roboto within author

to find the document.

META name/value pairs which have not been added to the section group are ignored and not indexed. META tags which do not have NAME or CONTENT attributes are also ignored.


XML_SECTION_GROUP Doctype Limiter

One problem with the XML section group in 8.1.5 is the inability to distinguish between tags in different DTD's. For instance, perhaps you have a DTD for storing contact information:

  <!DOCTYPE contact>
  <contact>
    <address>506 Blue Pool Road</address>
    <email>dudeman@oracle.com</email>
  </contact>

Appropriate sections might look like:

  ctx_ddl.add_field_section('mysg','email',  'email');
  ctx_ddl.add_field_section('mysg','address','address');

Which works out fine... until you start getting a different kind of document in the same table:

  <!DOCTYPE mail>
  <mail>
    <address>dudeman@oracle.com</address>
  </mail>

Now your address section, originally intended for street addresses, starts picking up email addresses, because of tag collision.

8.1.6 allows you to specify doctype limiters to distinguish between these tags across doctypes. You simply specify the doctype in parentheses before the tag:

  ctx_ddl.add_field_section('mysg','email','email');
  ctx_ddl.add_field_section('mysg','address','(contact)address');
  ctx_ddl.add_field_section('mysg','email','(mail)address');

Now when the XML section group sees an address tag, it will index it as the address section when the document type is contact, or as the email section when the document type is mail.

If you have both doctype-limited and unlimited tags in a section group:

  ctx_ddl.add_field_section('mysg','sec1','(type1)tag1');
  ctx_ddl.add_field_section('mysg','sec2','tag1');

Then the limited tag applies when in the doctype, and the unlimited tag applies in all other doctypes.

Querying is unaffected by this -- the query is done on the section name, not the tag, so querying for an email address would be done like:

  oracle within email

which, since we have mapped two different kinds of tags to the same section name, finds documents independent of which tags are used to express the email address.


XML_SECTION_GROUP Attribute Sections

Another commonly-requested feature in the 8.1.6 XML section group is the ability to index and search within attribute values. For instance, take a document with the following:

  <comment author="jeeves">
    I really like interMedia Text
  </comment>

in 8.1.5, the author information, like all attributes, is ignored and cannot be indexed or searched. In 8.1.6, the XML section group offers a new type of section called an attribute section, which nominates interesting attribute values to index:

  ctx_ddl.add_attr_section('mysg','author','comment@author');

The syntax is very similar to the other add_section calls -- the first argument is the name of the section group, the second is the name of the section, and the third is the tag, in the form <tag_name>@<attribute_name>.

This tells interMedia Text to index the contents of the author attribute of the comment tag as the section "author". Query syntax is just like any other section:

  jeeves within author

finds the document. Attribute text is considered to be invisible, though, so the following:

  jeeves

does NOT find the document, somewhat like field sections. Unlike field sections, however, attribute section within searches can distinguish between occurrences. Given the document:

  <comment author="jeeves">
    I really like interMedia Text
  </comment>
  <comment author="bertram">
    Me too
  </comment>

the query:

  (cryil and bertram) within author

will not find the document, because "jeeves" and "bertram" do not occur within the SAME attribute text.

Attribute section names cannot overlap with zone or field section names, although you can map more than one tag@attr to a single section name.

Attribute sections do not support default values. Given the document:

  <!DOCTYPE foo [
    <!ELEMENT foo (bar)>
    <!ELEMENT bar (#PCDATA)>
      <!ATTLIST bar
        rev CDATA "8i">
  ]>
  <foo>
    <bar>whatever</bar>
  </foo>

and attribute section:

  ctx_ddl.add_attr_section('mysg','barrev','bar@rev');

the query:

  8i within barrev

does not hit the document, although in XML semantics, the "bar" element has a default value for its "rev" attribute.


Attribute Value Sensitive Section Search

Attribute sections allow you to search the contents of attributes. They do not allow you to use attribute values to specify sections to search. For instance, given the document:

  <comment author="jeeves">
    I really like interMedia Text
  </comment>

You can find this document by asking:

  jeeves within comment@author

which is equivalent to "find me all documents which have a comment element whose author attribute's value includes the word jeeves".

However, there is no way to ask for something like:

  interMedia within comment where (@author = "jeeves")

in other words, "find me all documents where interMedia appears in a comment element whose author is jeeves". This feature -- attribute value sensitive section searching -- is planned for future versions of the product.


Dynamic Add Section

Because the section group is defined before creating the index, 8.1.5 is limited in its ability to cope with changing structured document sets; if your documents start coming with new tags, or you start getting new doctypes, you have to re-create the index to start making use of those tags.

8.1.6 allows you to add new sections to an existing index without rebuilding the index, using alter index and the new add section parameters string syntax:

  add zone  section <section_name> tag <tag>
  add field section <section_name> tag <tag> [ visible | invisible ]

For instance, to add a new zone section named tsec using the tag title:

  alter index <indexname> rebuild
  parameters ('add zone section tsec tag title')

To add a new field section named asec using the tag author:

  alter index <indexname> rebuild
  parameters ('add field section asec tag author')

This field section would be invisible by default, just like when using add_field_section. To add it as visible field section:

  alter index <indexname> rebuild
  parameters ('add field section asec tag author visible')

Dynamic add section only modifies the index meta-data, and does not rebuild the index in any way. This means that these sections take effect for any document indexed after the operation, and do not affect any existing documents -- if the index already has documents with these sections, they must be manually marked for re-indexing (usually with an update of the indexed column to itself).

This operation does not support addition of special sections. Those would require all documents to be re-indexed, anyway. This operation cannot be done using rebuild online, but it should be a fairly quick operation.


AUTO_SECTION_GROUP

The ultimate in coping with changing structured document sets is the new AUTO section group. The AUTO section group is like the XML section group, but requires no pre-definition of sections. Instead, any non-empty tag is automatically indexed as a zone section with the section name the same as the tag. For instance, given a document like this:

  <book>
    <author>Neal Stephenson</author>
    <title>The Diamond Age</title>
    <description>
      Decades into our future ...
    </description>
  </book>

The auto sectioner will create zone sections named book, author, title, and description. Queries like this:

  diamond within description

will find the document. Because sections are created dynamically from the tags, the auto sectioner is ideal when the structure of the documents is not known to the index creator, or when they are expected to change so that pre-definition with dynamic add is impractical.

The only quirk to watch out for is that section names are case-insensitive, but tags in XML are case-sensitive. For instance, given the above document, the query

  diamond within description

finds the document, as does

  diamond within DESCRIPTION

Both queries would also find a document:

  <DESCRIPTION>Diamond Earring</DESCRIPTION>

even though "description" and "DESCRIPTION" are different tags in XML.

You cannot add zone, field, or attribute sections to an AUTO_SECTION_GROUP in this version. Furthermore, because the AUTO_SECTION_GROUP may encounter documents using the tags PARAGRAPH or SENTENCE, it also does not support special sections.


AUTO_SECTION_GROUP Stop Sections

Even when using the AUTO section group, there may be some tags which are so common that indexing them would be a waste of time, since they are not useful for searching. For such tags, the AUTO section group has stop sections, which indicate tags to ignore -- much like stop words indicate words to ignore.

Stop sections use the new add_stop_section call:

  ctx_ddl.add_stop_section(<group name>, <tag>)

which indicates to the named AUTO section group that <tag> should be ignored and not indexed as a section. add_stop_section can be used only on AUTO section groups. The number of stop sections per AUTO group is unlimited.

The tag is case-sensitive, so:

  ctx_ddl.add_stop_section('mysg','description')

is different from

  ctx_ddl.add_stop_section('mysg','DESCRIPTION')

and these stop different things. If you have only:

  ctx_ddl.add_stop_section('mysg','description')

there may still be a section named "DESCRIPTION", if another document uses "Description" or "DESCRIPTION" as the tag. For this reason, a query that does "within <stop section>" does not error.

The tag also supports doctype limiters. So, for instance, if you wanted to stop tag "foo", but only in document type "bar":

  ctx_ddl.add_stop_section('mysg','(bar)foo')

Stop sections can be added to an existing index dynamically, using the parameters string syntax:

  add stop section <tag>

as in:

  alter index <indexname> rebuild
  parameters ('add stop section author')

Unlike stop words, adding a stop section dynamically does not apply to documents already indexed. This is partially to be consistent with the other dynamically add section calls, but mainly because there could be other tags using that section name due to case-insensitivity of section names.


AUTO_SECTION_GROUP Attribute Sections

The AUTO section group also automatically indexes attribute values as attribute sections named <tag>@<attr>. For instance, given:

  <book author="Edmund Wells">
    <title>Rarnaby Budge</title>
  </book>

You can find this document with the query:

  wells within book@author

since the author attribute is automatically indexed as an attribute section.


Nested Within

8.1.6 also adds nested within queries -- also known as hierarchical section searching. For instance, given a document with:

  <book>
    <title>Star Wars</title>
  </book>
  <video>
    <title>Star Wars<title>
  <video>

You can distinguish between the two types of titles using a nested within:

  (Star Wars within title) within video

for instance, which is not allowed in 8.1.5. 8.1.6 provides this feature with two important caveats. First, this only works with zone sections. Attribute or field sections -- even visible field sections -- cannot be used in a nested within.

Second, because of the way interMedia Text indexes zone sections, equal sections cannot be distinguished. For instance, the document:

  <A><B>foo</B></A>

can be found with either:

  (foo within A) within B

or

  (foo within B) within A

This applies only when the sections bound exactly the same range. Any intervening text allows interMedia Text to distinguish between the sections -- for instance:

  <A><B>foo</B>bar</A>

Another aspect of this is that nested within does not imply an exact parent-child relationship. In XQL, a query string with:

  A/B

means that the B section is a first-level child of the A section. This not equivalent to nested within:

  (x within A) within B

does not mean that the A section is a child of the B section -- it could be a grandchild or great-grandchild. Nested within is more like the XQL:

  A/*/B



Lexer Improvements

BASIC_LEXER Generic Base Lettering

The BASE_LETTER function of the BASIC lexer is designed to translate any accented characters to their unaccented representation for indexing and query. In 8.1.5, this feature uses an NLS function which is sensitive to the language part of the NLS_LANG setting. This means, for instance, that BASE_LETTER does not work for an AMERICAN database, because the NLS function does not work in this language. For 8.1.6, based on our input, NLS has provided a generic base lettering function which works in all languages. This means that the BASE_LETTER function now works no matter what language your database uses, which is ideal for multi-lingual and foreign-language databases.

This new function requires the NLS file lx30044.nlb to be installed, so if base lettering is not working in 8.1.6, check for this file.


BASIC_LEXER THEME_LANGUAGE

The French knowledge base and lexicon is now public in 8.1.6. This allows indexing and generation of themes for French documents. Normally, the BASIC lexer will choose which knowledge base to use (French or English) based on the NLS_LANG setting for the session. If, however, your NLS_LANG setting does not match your documents -- your documents are in French and your env is in English -- then this will choose the wrong lexicon, or give an error (if your environment is neither English nor French).

For such situations, the BASIC_LEXER has a new THEME_LANGUAGE setting to override environment detection. This attribute can be set to 'ENGLISH' or 'FRENCH' to choose a lexicon explicitly:

  ctx_ddl.set_attribute('mylexer','theme_language','french');

This lexer now produces French themes independent of the environment. This attribute can also be set to 'AUTO', or left unset. In such cases, the lexicon is chosen based on the environment.


BASIC_LEXER Theme Admissibility

8.1.5 theme indexing produces a maximum of 50 themes per document. This means that long documents may be difficult to find with a theme query, since lesser themes tend not to make it into the top 50.

8.1.6 solves this by removing the 50-theme limit for theme indexing -- all document themes are produced and indexed. Theme generation, via document services, however, is not changed, and is still subject to the 50-theme maximum.


JAPANESE/KOREAN_LEXER UTF-8 Support

In 8.1.5, the JAPANESE_VGRAM_LEXER works only if your database character set is JA16EUC or JA16SJIS. The KOREAN_LEXER works only in KSC5601. These lexers are enhanced in 8.1.6 to also work in UTF-8. This allows multi-lingual databases to finally use the Japanese and Korean lexers.


MULTI_LEXER

The biggest 8.1.6 addition in the lexer area is the new MULTI lexer, which allows documents of different languages to be stored in a single table. The MULTI lexer chooses a language-specific lexer for each row, based on a language column.

The first step with the MULTI lexer is the creation of language-specific lexer preferences for each language in the multi-lingual table. As an example, let's create English, French, and German lexers:

  ctx_ddl.create_preference('english_lexer', 'basic_lexer');
  ctx_ddl.set_attribute('english_lexer','index_themes','yes');

  ctx_ddl.create_preference('french_lexer', 'basic_lexer');
  ctx_ddl.set_attribute('french_lexer','base_letter','yes');

  ctx_ddl.create_preference('german_lexer', 'basic_lexer');
  ctx_ddl.set_attribute('german_lexer','composite','german');
  ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');
  ctx_ddl.set_attribute('german_lexer','mixed_case','yes');

Once the language-specific lexer preferences are created, they need to be gathered together under a single MULTI lexer preference. First, we need to create the MULTI lexer preference, using the MULTI_LEXER object:

  ctx_ddl.create_preference('global_lexer','multi_lexer');

Now we must add the language-specific lexers to the multi lexer preference using the new add_sub_lexer call:

  ctx_ddl.add_sub_lexer(<multi_lexer>, <language>, <lang-specific lexer>)

For our example:

 ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer');
  ctx_ddl.add_sub_lexer('global_lexer', 'french', 'french_lexer');
  ctx_ddl.add_sub_lexer('global_lexer', 'default','english_lexer');

This nominates the german_lexer preference to handle german documents, the french_lexer preference to handle french documents, and the english_lexer preference to handle everything else, using 'DEFAULT' as the language.

The MULTI lexer decides which lexer to use for each row based on a language column. This is a character column in the table which stores the language of the document in the text column. For instance, our table looks like:

 create table globaldoc (
    doc_id   number       primary key,
    lang     varchar2(10),
    text     clob
  );

  insert into globaldoc values (1, 'german',  'Ich bin Berliner');
  insert into globaldoc values (2, 'french',  'Je suis Parisian');
  insert into globaldoc values (3, 'english', 'I''m from New Yawk');

To index this table, we need to use the MULTI lexer preference, but we also need to specify the name of the language column:

  create index globalx on globaldoc(text)
  indextype is ctxsys.context
  parameters ('lexer global_lexer
               language column lang');

The MULTI lexer then reads the value of the language column for each row, and uses the associated language-specific lexer for that language.

The language column is case-independent, and can contain either the NLS name or abbreviation of the language. If the table uses some other value for the language, you need to specify that alternate value as the fourth argument when adding the sub lexers:

  ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer', 'ger');

for instance, to use the ISO 639-2 string for german instead. Note that these alternate values are in addition to the NLS names and abbreviations, so "D" and "GERMAN" in the language column still select the german lexer. This means that alternate values cannot be NLS language names or abbreviations.

The DEFAULT lexer is used when the language of the document is unknown or is not mapped to a language-specific lexer. The DEFAULT lexer is also used to lex stopwords. A DEFAULT lexer MUST be specified, or you will get an error when using the MULTI lexer preference in create index. You cannot specify an alternate value for the DEFAULT sub lexer -- this does not make sense.

At query time, the MULTI lexer chooses a language-specific lexer to lex the query tokens. This is based on the NLS_LANG setting for the query session. Thus, a query session in the FRENCH language will use the lexer for FRENCH. A more explicit method of specifying the query language may appear in future versions of interMedia Text.



Wordlist Improvements

AUTO Setting for STEMMER and FUZZY_MATCH

The 8.1.5 stemmer and fuzzy_match attributes are set for a single, specific language at create index time, making it unable to handle multi-lingual tables. 8.1.6 provides the new AUTO setting for STEMMER and FUZZY_MATCH:

  ctx_ddl.create_preference('mywl','basic_wordlist');
  ctx_ddl.set_attribute('mywl', 'STEMMER', 'AUTO');
  ctx_ddl.set_attribute('mywl', 'FUZZY_MATCH', 'AUTO');

During queries, when set to AUTO, the stemmer or fuzzy match expander will examine the language setting of the session and choose the most appropriate setting for that language. A user with NLS_LANG French, for instance, will get French stemming and fuzzy matching. This mapping is not configurable in this release -- you cannot specify a particular stemmer or fuzzy match cluster to use for each language.

This setting is designed for use with the MULTI lexer, which switches lexers based on the query session's language setting. Together, they allow multi- lingual tables to be indexed and queried.


WILDCARD_MAXTERMS

All versions of interMedia Text and ConText offer wildcard searching -- a contains for any word matching a pattern. This is done by expanding the pattern against the main index table using LIKE, and searching on all the resulting words.

ConText limited the size of the expansion to around 2500 words. interMedia Text 8.1.5 has no limit, but when the pattern is unselective, it uses up memory quickly.

8.1.6 re-establishes the expansion limit, but with a configurable maximum expansion -- the new WILDCARD_MAXTERMS attribute.:

  ctx_ddl.create_preference('mywl','basic_wordlist');
  ctx_ddl.set_attribute('mywl', 'WILDCARD_MAXTERMS', '2500');

Wildcard patterns matching more words than this will receive a "query too complex" error. The default value, if unset, is 5000 words.


SUBSTRING_INDEX

Patterns which start with a wildcard pose another problem for wildcard searching. Because wildcard searches are processed by expanding the pattern using a LIKE, they force a full-table scan of the main index table. This type of query, sometimes called a left-truncated search, takes a long time.

For those applications which require faster left-truncated searching, 8.1.6 offers the new SUBSTRING_INDEX option. This is set like any other boolean attribute:

  ctx_ddl.create_preference('mywl','basic_wordlist');
  ctx_ddl.set_attribute('mywl', 'SUBSTRING_INDEX', 'TRUE');

When set, the index will include a new pattern table ($P) which stores rotations of all unique words in the main index table. Using the pattern table instead of the main index table allows interMedia Text to process left- truncated searches much much faster.

The query-time savings come at the cost of increased index time and space. Indexing time can take as much as four times as long, and use up to 25% more space, so this option should be used only when left-truncated wildcard searches are frequent and need to be fast.



Storage Improvements

P_TABLE_CLAUSE for BASIC_STORAGE

Since we have a new index table when SUBSTRING_INDEX is set, the BASIC storage object gets a new attribute -- P_TABLE_CLAUSE, which is tacked on the end of the create table statement for the P table:

  create table DR$<indexname>$P (
    pat_part1   varchar2(61),
    pat_part2   varchar2(64),
    primary key (pat_part1, pat_part2) 
  ) organization index
  <p_table_clause>

We estimate that the P table should have approximately the same number of rows as the $I table.



Index Creation and Maintenance

Parallel Indexing

ConText V2.x supported parallel index creation, using multiple ctxsrv processes. This was removed in 8.1.5 because tighter integration with the kernel made it unfeasible. While we continue to work for full support in future versions, 8.1.6 adds a temporary workaround which allows parallel index creation with two caveats.

First, in order to use 8.1.6 parallel indexing, the base table must be a partitioned table -- each parallel indexer will work on its own partition, so the parallel degree should not exceed the number of partitions.

Second, because of the limitations of the kernel-interMedia Text interface, we must use dbms_job to spawn the parallel indexers. This means that the database must be running with job_queue_processes > 0, and that the parallel degree cannot exceed this number.

Otherwise, the syntax is just as you would expect:

  create index textidx on doctab(text)
  indextype is ctxsys.context
  parallel 3

This will create the index with parallel degreee 3. Note that because we use dbms_job to spawn parallel indexers, so there is some lag before all indexers are running, proportional to job_queue_interval.

If logging is active, multiple log files will be created by this -- one for each parallel indexer. Those log files will be named <logfile>_<process#>. If, for instance, you did:

  ctx_output.start_log('mylog')

you could expect files "mylog", "mylog_1", "mylog_2", etc.


PL/SQL Sync and Optimize

PL/SQL functions for sync and optimize have been added. The preferred method for sync and optimize is alter index rebuild online. Ideally, these could be called in a dbms_job. However, separate sync and optimize jobs can conflict because two alter index operations cannot be run at the same time for a single index.

The new PL/SQL functions resolve both problems by proving an alternate PL/SQL API for sync and optimize:

  ctx_ddl.sync_index(<idx_name>)

idx_name is the name of the index to sync. The effective user must be the owner of the index or ctxsys. If effective user is ctxsys, idx_name can be NULL, which will cause this call to examine all context indexes for pending DML. This uses the same code as ctxsrv. We now recommend processing pending DML through a dbms_job -- an individual job for each index, a ctxsys job for all indexes, or some combination of the two. This eliminates the need for ctxsrv, and thus eliminates the problem of starting ctxsrv each time the database starts. The ctxsrv program itself is now deprecated, and may be removed in some future release.

  ctx_ddl.optimize_index(<idx_name>, <optlevel>, <maxtime>)

idx_name is the name of the index to optimize. optlevel is 'FAST' or 'FULL' for the optimization level. There are also symbols CTX_DDL.OPTLEVEL_FAST and CTX_DDL.OPTLEVEL_FULL for these values. maxtime is the maximum time, in minutes, to spend on optimization. This applies only to FULL optimization. By default, maxtime is unlimited. You can also pass in NULL or the symbol CTX_DDL.MAXTIME_UNLIMITED.



Query Language Improvements

New Accumulate Scoring

8.1.6 has a new algorithm for processing accumulate queries. The changes mostly resolve scoring inconsistencies the 8.1.5 algorithm has in certain cases. The new algorithm also handles weighted children better than its predecessor.


Nested Within

8.1.6 can process nested withins -- also known as hierarchical section searches -- which allows queries like this:

  (dog within title) within book

Please refer above for details and caveats.


BROWSE_WORDS

Many IR systems include a feature called vocabulary browse. This takes an input word and displays words in the text index which are alphabetically close. This feature is useful for users to correct misspelled words in the query. By including a rough document count for each word, it also allows the user to eliminate unselective words. Administrators may also use it to correct misspellings in document text.

8.1.6 provides a new function for this -- browse_words:

  ctx_query.browse_words(<index_name>, <seed>, <restab>, 
                         <browse_id>, <numwords>, <direction>)

index_name is the name of the index to browse.

seed is the start word. This must be a single word, and not a phrase.

restab is the name of a result table which should have the schema:

  browse_id  number
  word       varchar2(64)
  doc_count  number

The results of the browse are placed in this table. Existing contents are not deleted before browse.

browse_id is a numeric identifier for the operation. Output rows in the result table will have this value in the browse_id column. If not specified this defaults to 0.

numwords is the length of the produced list, in words. This must be between 1 and 1000. If not specified, this defaults to 10 words.

direction specified the direction of the browse. 'BEFORE' means the seed word and words alphabetically before the seed, 'AFTER' means the seed and words alphabetically after the seed, and 'AROUND' means the seed and words both before and after. There are also symbols CTX_QUERY.BROWSE_BEFORE, CTX_QUERY.BROWSE_AROUND, and CTX_QUERY.BROWSE_AFTER for these values. This defaults to 'AROUND'

An example might look like:

     exec ctx_query.browse_words('myidx','dog','myres', 5);
     select word, doc_count from myres order by word;

  WORD     DOC_COUNT
  -------- ---------
  CZAR     15
  DARLING  5
  DOG      73
  DUNK     100
  EAR      3

Note that the DOC_COUNT is an upper bound which does not take into account old information from deleted or updated documents. FULL optimization removes this old information from the index, and makes these counts exact.

Browse also has a variant which delivers the result in a PL/SQL table instead of a result table:

  ctx_query.browse_words(<index_name>, <seed>, <resarr>, 
                         <numwords>, <direction>)

Here, resarr is a PL/SQL table of type ctx_query.browse_tab:

  type browse_rec is record (
    word      varchar2(64),
    doc_count number
  );
  type browse_tab is table of browse_rec index by binary_integer;

The results are placed in the table, ordered alphabetically, with the alphabetically first word at index 1. Existing elements of the table are deleted upon entry into the function.



Document Services Improvements

Rowid Input

All document services work on single documents. To specify the document, these interfaces take in a primary key specification. This may not be convenient when the table uses a composite primary key or you have only the rowid. 8.1.6 allows you to specify a rowid instead.

The document services interfaces have not been changed for this feature. The textkey argument can now take either a rowid or a primary key spec. The key type of the document services determines how the value is interpreted.

You set the key type using the new set_key_type call:

  ctx_doc.set_key_type('ROWID');

After which, all document services will expect rowid values. This setting has session duration and scope. To set it back:

  ctx_doc.set_key_type('PRIMARY_KEY');

there are also symbols (CTX_DOC.TYPE_ROWID and CTX_DOC.TYPE_PRIMARY_KEY) defined for these values.

The default is primary key mode, but you can change the default using the new system parameter CTX_DOC_KEY_TYPE. For instance, if you want your system to default to ROWID input:

  ctx_adm.set_parameter('CTX_DOC_KEY_TYPE', CTX_DOC.TYPE_ROWID);

This has system scope and permanent duration. Note that this only sets the default -- sessions can always override this via set_key_type.


In-Memory Document Services

In 8.1.5, the document services use result tables to store their results. If you don't need to store the result of the operation, result table I/O is wasteful, and application code must be written to manage these temporary rows. When you do need to store the results, the schema of the result tables may not be sufficient for your needs -- you may need to store a varchar2 primary key, for instance.

8.1.6 introduces in-memory variants of all document services which avoid result tables and instead output to PL/SQL variables. This avoids unneeded I/O and allows the application to more easily process the transient results. We'll look at these new functions individually, starting with:

  ctx_doc.filter(<index_name>, <textkey>, <result>, <plaintext>)

The result argument is a CLOB locator. The document will be filtered and the filtered text placed into this CLOB. If the locator passed in is NULL, a temporary CLOB is allocated for you. It is up to the app to deallocate it.

  ctx_doc.markup(<index_name>, <textkey>, <text_query>, <result>,
                 <plaintext>, <tagset>, <starttag>, <endtag>,
                 <prevtag>, <nexttag>)

The result argument is a CLOB locator, just like in the filter call.

  ctx_doc.highlight(<index_name>, <textkey>, <text_query>, 
                    <restab>, <plaintext>)

The highlight call can return multiple highlights, so a single PL/SQL variable isn't sufficient. Instead, the restab argument is a PL/SQL table of type ctx_doc.highlight_tab:

  type highlight_rec is record (
    offset number;
    length number;
  );
  type highlight_tab is table of highlight_rec index by binary_integer; 

The first highlight will be placed at index 1. Existing contents of the table when passed in are deleted.

  ctx_doc.themes(<index_name>, <textkey>, <restab>, <full_themes>)

Like highlight, ctx_doc.themes can return multiple themes. restab is a PL/SQL table of type ctx_doc.theme_tab:

  type theme_rec is record (
    theme  varchar2(2000);
    weight number;
  );
  type theme_tab is table of theme_rec index by binary_integer;

The first theme is placed at index 1. Existing contents of the table when passed in are deleted. The themes are not guaranteed to be placed in the theme_tab in any particular order.

  ctx_doc.gist(<index_name>, <textkey>, <result>, <glevel>, <pov>,
               <numparagraphs>, <maxpercent>)

The in-memory version of gist is the only call which deviates from its result-table version. Unlike the result-table version, which returns multiple gists -- one per point-of-view -- the in-memory version returns a single gist, selected by the pov argument. If no pov is specified, the GENERIC gist is generated. result is a CLOB locator as in ctx_doc.filter or ctx_doc.markup.



Thesaurus Improvements

Thesaurus Maintenance

The thesaurus functionality in 8.1.5 supports file import and export, but once a thesaurus is in the database, it's difficult to change -- the API allows only the addition of new phrases and relations.

8.1.6 rounds out the API with new maintenance functions which allow changing, renaming, and removal of phrases and relations. There are also calls for adding new phrases and relations which are hopefully less confusing than the existing functions.

Let's look at the new additions call-by-call.


ALTER_THESAURUS

  ctx_thes.alter_thesaurus(<tname>, <op>, <operand>)

This call has two variations. To rename a thesaurus, op is set to 'RENAME' and operand is set to the new name:

  ctx_thes.alter_thesaurus('oldthesname','RENAME','newthesname');

To truncate a thesaurus -- that is, remove all phrases and relations -- op is set to 'TRUNCATE' and the operand is not needed:

  ctx_thes.alter_thesaurus('mythes','TRUNCATE');

'RENAME' and 'TRUNCATE' also have symbols defined -- CTX_THES.OP_RENAME and CTX_THES.OP_TRUNCATE.


ALTER_PHRASE

  ctx_thes.alter_phrase(<tname>, <phrase>, <op>, <operand>)

This call has three variations. To change a phrase, set op to 'RENAME':

  ctx_thes.alter_phrase('mythes','old phrase','RENAME','new phrase');

This leaves all the relations intact -- it just changes the phrase. You can also use this to add, change, or remove qualifiers:

  ctx_thes.alter_phrase('mythes','crane','RENAME','crane(bird)');
  ctx_thes.alter_phrase('mythes','crane(bird)','RENAME','crane(stork)');
  ctx_thes.alter_phrase('mythes','crane(bird)','RENAME','crane');

The second variation makes a phrase the preferred term in its synonym ring:

  ctx_thes.alter_phrase('mythes','laser','PT');

If the phrase's synonym ring already has a preferred term, this operation overrides it. The old preferred term becomes a non-preferred synonym and the named phrase becomes the preferred term.

The third variant changes the scope note on a phrase:

  ctx_thes.alter_phrase('mythes','ships','SN', 'sea-going vessels');

The operand overwrites whatever scope note was on the phrase, so this can be used to add, change, or remove a scope note.

The operations all have symbols: CTX_THES.OP_RENAME, CTX_THES.OP_PT, and CTX_THES.OP_SN.


DROP_PHRASE

  ctx_thes.drop_phrase(<tname>, <phrase>)

This call removes a phrase from a thesaurus. BT*/NT* relations are patched around the dropped phrase. A relation is established between each narrower term and each broader term. For instance, a simple example:

  A
    BT B
  B
    BT C

and you do:

  ctx_thes.drop_phrase('mythes','B');

then the resulting thesaurus looks like:

  A
    BT C

For multiple broader and narrower terms:

  A   B
   \ /
    C
   / \
  D   E

when C is removed, the thesaurus looks like this:

  A  B
  |\/|
  |/\|
  D  E

This patching is done for all hierarchical relations -- BT, BTI, BTP, BTG, etc. Associative (RT) relations are not patched. If A RT B and B RT C, and B is removed, then no A RT C relation is created. Synonymity is also not affected. If the removed phrase is the preferred term, no phrase in the ring is elevated to preferred term.


CREATE_RELATION

  ctx_thes.create_relation(<tname>, <phrase>, <rel>, <relphrase>);

This call creates a relation between phrase and relphrase. You can do this in 8.1.5 using create_phrase, but the usage is confusing. For instance, if you want to create the thesaurus:

  dog
    BT  animal

It looks like:

  ctx_thes.create_phrase('mythes', 'dog');
  ctx_thes.create_phrase('mythes', 'animal', 'BT',  'dog');

Two problems here. The first problem is that you need two calls to create a thesaurus with one relation. The second problem is that in the second call, DOG and ANIMAL are in opposite order of how they look in the thesaurus file. This is confusing and unintuitive.

create_relation solves both by automatically creating unknown phrases or relphrases, and adhering closely to thesaurus file order. The above thesaurus is created using create_relation like this:

  ctx_thes.create_relation('mythes', 'dog', 'BT', 'animal');

create_relation also allows you to add translations to phrases, using <lang>: as the relation:

  ctx_thes.create_relation('mythes', 'dog', 'french:', 'chien');

Again, very similar to how it would look in the thesaurus file.

create_relation is the preferred syntax now, and the rel and relphrase arguments of create_phrase are now deprecated and should not be used. Users are discouraged from using these arguments.


DROP_RELATION

  ctx_thes.drop_relation(<tname>, <phrase>, <rel>, <relphrase>);

This is the opposite of create_relation, and removes a relation between two phrases, leaving the phrases and their other relations untouched. For instance, the sequence:

  ctx_thes.create_relation('mythes', 'A', 'BT', 'B');
  ctx_thes.drop_relation('mythes', 'A', 'BT', 'B');

results in mythes having the phrases A and B, but no relations between them.

You can remove BT*/NT* and RT relations. You can also use SYN to remove the phrase from its synonym ring:

  ctx_thes.create_relation('mythes', 'A', 'SYN', 'B');
  ctx_thes.create_relation('mythes', 'A', 'SYN', 'C');

--> A, B, and C are synonyms

  ctx_thes.drop_relation('mythes', 'A', 'SYN');

--> B and C are synonyms, but A is not in the ring

or PT to remove preferred term status, but leave the phrase in the ring:

  ctx_thes.create_relation('mythes', 'B', 'SYN', 'C');
  ctx_thes.create_relation('mythes', 'B', 'USE', 'A');

--> A, B, and C are synonyms, and A is the preferred term

  ctx_thes.drop_relation('mythes', 'A', 'PT');

--> A, B, and C are synonyms, but the ring no longer has any preferred term

Finally, you can use <lang>: to remove translations:

  ctx_thes.drop_relation('mythes', 'dog', 'french:', 'chien');

relphrase is actually optional. If omitted, all relations of that type are removed from the phrase. For instance:

  ctx_thes.drop_relation('mythes','A', 'BT');

drops all broader term relations of A.

  ctx_thes.drop_relation('mythes','A', 'french:');

drops all french translations of A.


Maintenance Security

One last security note for all the above functions -- only the owner of the thesaurus and ctxsys is allowed to modify the thesaurus, although all users can use a thesaurus during query.


PL/SQL Table Expansion

The thesaurus is really meant to be used during a contains query, with the thesaurus operators. However, 8.1.5 does include a simple PL/SQL expansion API. The functions work just like their query counterparts, and return the expansion as a single string. For instance, given the thesaurus:

  animal
    NT dog
    NT cat
    NT bird

and the PL/SQL:

  declare
    exp varchar2(80);
  begin
    exp := ctx_thes.nt('animal',1,'mythes');
    dbms_output.put_line(exp);
  end;

would print out something like:

  {animal}|{dog}|{cat}|{bird}

which allows you to see the expansion. However, the single string expansion must be parsed if you want the individual terms, for use in a web interface or your own thesaurus dumping mechanism.

8.1.6 adds PL/SQL table expansion for such applications. Expansions are placed into a PL/SQL table like this:

  type exp_rec is record (
    xrel    varchar2(12),
    xlevel  number,
    xphrase varchar2(256)
  );
  type exp_tab is table of exp_rec index by binary_integer;

xrel holds the relation of the input phrase to the output phrase (BT, NT, SYN, etc). The input phrase is also in the table, with xrel of PHRASE. For SYN and TRSYN expansions, the preferred term has xrel of PT.

xlevel tells you the distance from the input phrase to the output phrase. This is used mainly for hierarchical (BT*/NT*) expansions. For instance, in a thesaurus like:

  A
   NT B
  B
   NT C
  C
   NT D

and an NT expansion on A for 3 levels, the resulting table looks like:

  xrel   xlevel xphrase
  ------ ------ -------
  PHRASE   0       A
  NT       1       B
  NT       2       C
  NT       3       D

For non-hierarchical expansions, xlevel is 1 for all expanded terms, except the phrase itself, which is always 0, and translations of synonyms in TRSYN, which always have xlevel of 2.

xphrase is the output phrase itself. This will include the qualifier in parentheses, if one exists for the phrase.

The table API is like the existing string API, but instead of returning the expansion, it takes a table as the first argument. Our example above, using the new API, looks like:

  declare
    exp ctx_thes.exp_tab;
  begin 
  ctx_thes.nt(exp, 'animal', 1, 'mythes');
    for i in 1..exp.count loop
      dbms_output.put_line(lpad(' ',2*exp(i).xlevel)||
                           exp(i).xrel||' '||exp(i).xphrase);
    end loop;
  end;

and prints out:

  PHRASE animal
    NT dog
    NT cat
    NT bird

To complete the API, 8.1.6 adds two new expansion functions:

  sn(<phrase>, [<tname>]); 

returns the scope note for the given phrase, and

 thes_tt(<restab>, [<tname>])

returns thesaurus top terms into a PL/SQL table. This is different from TT. TT returns the top term for a given phrase. THES_TT searches the entire thesaurus and returns all top terms. A top term for THES_TT is defined as a term which has narrower terms but no broader terms. This is designed to be used in browsing applications, which may wish to start with the thesaurus top terms. Note that because THES_TT searches the entire thesaurus, it may take some time to complete for large thesauri. If the application needs the thesaurus top terms frequently, we strongly suggest caching the results.

E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy