Oracle Text 10g Technical Overview

Oracle Text 10g
Technical Overview

This is a technical overview of the improvements found in Oracle Text version 10.1.0.2. This is intended for an audience fairly familiar with previous versions of the product.

If you are new to the product, you should familiarize yourself with the product before reading this document. The official documentation, especially the Application Developer's Guide, is a good place to start, as are the interMedia Text technical overviews for 8.1.5, 8.1.6, 8.1.7, 9.0.1, and 9.2.0 in order. (I really need to get around to creating an updated overall technical overview which merges these together)

The goal of this overview is to supplement the documentation in three ways:

  1. Collect all the new features -- and only the new features -- in one document so that expert users can catch up just by reading this instead of reading the whole doc and mentally diff'ing.

  2. Sketch each new feature without covering every option, bell, and whistle. Give the reader enough information to decide if the new feature merits further investigation in the main doc.

  3. Include information, observations, and notes on each feature which may not be suitable for documentation. For instance, opinions on the pros and cons of feature X, the internal mechanisms of feature Y, speculation of how feature Z could be useful, etc.

If, while reading this document, you think to yourself, "I wish it could do X," or "there's a better way to do this," or (hopefully not) "that's really dumb," please tell us -- we're always eager to hear from the people who actually use our product!  Oracle personnel can email us or our product help mailing list.  External customers -- contact support and have them log an enhancement or drop us an email.




Table of Contents



Upgrading to 10g

Oracle Text upgrade should happen automatically as part of the normal database upgrade.  This section details the manual upgrade process, in case something goes wrong or for the incorrigibly curious.

Oracle Text is now consistent with the database upgrade process in naming convention, run procedure, and registry integration:

  • The naming convention of the upgrade scripts in previous versions was the version to which you are upgrading.  That is, to upgrade to 9.2.0, you ran u0902000.sql.  In 10g, the naming convention is the version from which you are upgrading.  To upgrade to 10g from 9.2, u0902000.sql gets run.

  • In 9.2, there were two scripts to run -- an s script run as SYS, and a u script run as ctxsys.  Now, there is a single script -- ctxu920.sql, for instance, that should be run as sys.  Internally, the script does a ALTER SESSION SET CURRENT_SCHEMA when needed.  In 9.2.0, if jumping multiple versions -- upgrading from 8.1.7 to 9.2.0, for instance, you had to run a pair of scripts for each version step -- s/u0901000 to go from 8.1.7 to 9.0.1, then s/u0902000 to go to 9.2.0.  In 10g, the earlier scripts call the later scripts, so you run only one script, even for a multiple-version jump.

  • As in 9.2.0, Oracle Text supports the DBMS_REGISTRY facility, so that you can query the registry views to see if Oracle Text is installed, and its version.  In 10g, we've added a simple validation procedure which checks that all expected database objects owned by CTXSYS are present and VALID.  This is not an exhaustive check of data dictionary validity, so a VALID status in the registry does not absolutely guarantee successful install, but it is a reasonable first check.

So, the manual upgrade procedure really boils down to a single step:

  1. run ctxu<ver>.sql as SYS
where <ver> is the version FROM which you are upgrading.  For instance, to upgrade from 9.2.0 to 10g, run ctxu920.sql.  If you are coming from 8.1.7, you would run ctxu817.sql instead.
  

Manual Post-upgrade Steps

Due to the security changes documented below, manual post-upgrade steps may be required to keep an existing system operational.  The biggest potential problem is a user datastore procedure owned by ctxsys which depends upon database objects in other users' schemas.  ctxsys's new lowered database privileges will prevent this procedure from working, which in turn will prevent the index from sync'ing new data.  The quickest solution is to manually grant any needed permissions to ctxsys.  However, the recommended solution -- which applies also to those existing user datastores which continue to work -- is to recreate the user datastore as a non-ctxsys-owned procedure and modify the index metadata.  Step-by-step instructions are below.



Security Changes

CTXSYS Database Privileges

CTXSYS no longer has ALL privileges.  It is a normal user with only CONNECT and RESOURCE

In previous versions ctxsys was an über-user with ALL database permissions.  To enhance security in 10g, we've reworked the product infrastructure so that ctxsys needs far fewer permissions.  By default, ctxsys now has only CONNECT, RESOURCE, and some specific SELECT privs on database data dictionary views.  Most systems can even revoke CONNECT and RESOURCE from ctxsys and continue to operate, although this is not actually supported.

This change should not be visible to end users, and even expert OT users shouldn't need to care, except for one wrinkle: the effect on user datastores in existing systems after upgrade.  Consider a simple user datastore created in 9.2 like this:

conn text/iscool
create table foo (a varchar2(10), b varchar2(10));
conn ctxsys/ctxsys
create or replace procedure foo_udp(r in rowid, c in out nocopy clob) is
b varchar2(200);
begin
for c1 in (select a, b from text.foo where rowid = r) loop
b := '<A>'||c1.a||'</A><B>'||c1.b||'</B>';
dbms_lob.writeappend(c, lengthb(b), b);
end loop;
end;
/
grant execute on foo_udp to text;
conn text/iscool
exec ctx_ddl.create_preference('myud','user_datastore');
exec ctx_ddl.set_attribute('myud','procedure','foo_udp');

The user datastore procedure selects from text's table "foo".  The procedure is definer's rights, so executes as ctxsys.  In 9.2, ctxsys has ALL privileges, so the select is allowed.   However, in 10g, ctxsys is a normal user, and no longer has implicit permission to select other users' tables.  The datastore procedure will fail when executed.  There are four simple solutions to this problem:

  1. as text, grant select on foo to ctxsys with grant option

  2. make ctxsys.foo_udp invoker's rights (this works because the effective user during indexing is the index owner -- see below)

  3. recreate the user datastore procedure under the text schema, grant execute to ctxsys, and redefine ctxsys.foo_udp to call text.foo_udp

  4. RECOMMENDED recreate the user datastore procedure under the text schema, and update the index metadata to use this new procedure.  This depends on two new 10g features: user datastore procedures owned by users other than ctxsys, and rebuild index metadata.  Step-by-step instructions, with an example, are given below.

Invoker's Rights CTX Packages

The public CTX packages  are now invoker's rights packages.

In previous versions of the product, all public CTX packages were definer's rights.  The package code emulated invoker's rights by walking up the PL/SQL call stack.  This usually works, but does not support invoker's rights callers.  For instance, consider a wrapper which creates preferences:

conn usera/usera
create or replace procedure cr_ctx_pref(pre_name in varchar2, obj_type in varchar2)
authid current_user
is begin
ctx_ddl.create_preference(pre_name, obj_type);
end;
/
grant execute on cr_ctx_pref to public;
conn userb/userb
exec usera.cr_ctx_pref('myfilter','inso_filter');

userb is calling usera.cr_ctx_pref.  This is an invoker's rights procedure, so the effective user is userb.  This in turn calls ctx_ddl.create_preference.  The effective user emulation code looks at the call stack and sees USERA.CR_CTX_PREF is the caller.  It does not know that this is an invoker's rights package, so assumes that the effective user is USERA; this creates the preference USERA.MYFILTER, rather than USERB.MYFILTER.

In 10g, ctx_ddl has invoker's rights, and the package code determines current effective user from the session environment.  With this method, in the example above, the preference is created correctly as USERB.MYFILTER.

However, one hole still exists in name resolution: we don't honor ALTER SESSION SET CURRENT_SCHEMA.  If the example above does:

conn userb/userb
alter session set current_schema = usera
exec usera.cr_ctx_pref('myfilter','inso_filter');

One might reasonably expect the preference to be created as USERA.MYFILTER, but it is instead created as USERB.MYFILTER.  Current schema support may be added in a future release, but for now this is expected behavior.  To work around this limitation, the call should use owner.object syntax:

conn userb/userb
exec usera.cr_ctx_pref('usera.myfilter','inso_filter');

This will work as long as userb has the ability to create preferences owned by another user, as described below.

One last note.  Invoker determination has effects beyond name resolution, and this change adds support for invoker's rights for those effects, as well.  For instance, during a ctx_query.count_hits operation, to ensure security, the effective user must be able to do a select count(*) where contains -- in other words, we check that the effective user has select privs on the base table of the index.  Since ctx_query is now invoker's rights, you can have globally useful application code owned by a schema without requiring that schema to have permissions on invoker's objects.  For instance, consider an invoker's rights procedure that aggregates count_hits across multiple partitions:

conn usera/usera
create or replace function ach(
qry in varchar2,
idx in varchar2,
parts in sys.odcivarchar2list
) return number
authid current_user is
total number := 0;
begin
for i in 1..parts.count loop
total := total + ctx_query.count_hits(idx, qry, part_name=>parts(i));
end loop;
return total;
end;
/

any user can make use of this function without granting usera select on the tables actually being queried.

Owner's Rights Operations

The effective user during indexing and query is now the index owner.

The effective user during indexing and query in previous versions was ctxsys.  In 10g, the effective user is the index owner.  This is true for all index operations: SQL operations like create index and select .. contains, as well as PL/SQL operations such as ctx_ddl.sync_index, ctx_query.count_hits and ctx_doc.filter -- anywhere where either the index tables or the base table are being touched.

Keep in mind that owner's rights is slightly different from the more familiar invoker's rights or definer's rights models.  Consider a query executed by usera like this:

select * from table where userb.function(a)>0;

The effective user within "function" is either userb, if function is definer's rights, or usera, if the function is invoker's rights.  However, replace userb.function with ctxsys.contains:

select * from table where ctxsys.contains(a,'foo')>0;

and the effective user inside "contains" is neither usera, the invoker, nor the definer ctxsys.  Instead, it is the owner of the index, which may be some third user.

While this is subtle and interesting in a geeky way, the change to owner's rights is of little interest to the Text user except for user datastore procedures (see below) and temp tablespace.  In 9.2, since the effective user was ctxsys when creating the index, it used ctxsys's temp tablespace.  In 10g, it will use the index owner's temp tablespace.

User Datastore Restrictions

The restrictions on user datastore procedures have been relaxed.  Now, any user can own the user datastore procedure.

In 9.2, the effective user during indexing was ctxsys.  ctxsys was a super-user with ALL permissions, so malicious user datastore code could wreak havoc.  To encourage DBA review of user datastore code, only ctxsys could own user datastore procedures.  This made installation of text applications more complex because the ctxsys password had to be known to create the user datastore procedure.

In 10g, thanks to owner's rights during indexing, actions in the user datastore during text indexing are limited by the index owner's permissions -- it is no longer possible for a malicious user to hijack text indexing to give himself more permissions or to do anything that cannot already be done through direct SQL calls.  This eliminates the need for the security restriction on user datastore ownership, so it has been relaxed.  Now, any user can own the user datastore procedure.  In fact, Text development recommends that all user datastore procedures should be owned by non-ctxsys users -- ctxsys should not own application code or objects.  Information on migrating existing user datastores can be found below.

This relaxation is not limited to user datastores.  Any procedure used in a preference attribute -- procedure filter, user lexer, etc. -- can now be owned by any user.

The elimination of the restriction does present a small change in behavior when defining a preference.  In previous versions, since the user datastore procedure had to be owned by ctxsys, something like:

exec ctx_ddl.set_attribute('myuds','procedure','proc');

meant ctxsys.proc.  Now, however, ctx_ddl will follow ctx name resolution and store a fully qualified name in preference metadata.  To avoid potential problems, Text development recommends always using OWNER.PROCEDURENAME style when creating user datastore preferences.

One last security note: owner's rights does not abrogate your responsibility to think carefully about the security ramifications of user datastore procedures.  For instance, a malicious user can still define an invoker's rights user datastore procedure and trick a DBA user into creating an index using it.  In this procedure the malicious user effectively gains DBA permissions.  As with any arbitrary submitted code, user datastores should always be reviewed for security.

Migrating Existing User Datastores

With the new security regime, Text development recommends changing existing indexes with user datastores to non-ctxsys-owned procedures.  Here is a step-by-step procedure to modify an existing text index.  This will not trigger a complete index rebuild, so should have minimal burden for production systems.

  1. create a new user datastore procedure

    The new procedure should be owned by the index owner, or by whatever application code schema is appropriate for your application.  It should duplicate the behavior of the existing user datastore procedure.  If you no longer have the user datastore procedure creation script, you can simply duplicate the existing user datastore procedure -- use ctx_report.describe_index to see which stored procedure is currently being used, then select from user_source as ctxsys to get the code.
  2. create a replacement user datastore preference

    If you still have the user datastore preference used when you created the index, you can simply update that preference by using ctx_ddl.set_attribute. Otherwise, you should create a new user datastore preference using the new procedure name as the PROCEDURE attribute value. As a shortcut, you can use ctx_report.create_index_script on the existing index, and adapt the section of the generated code which creates the user datastore preference.
  3. modify the index metadata

    Now rebuild the index to replace the user datastore:
    alter index myindex rebuild
    parameters('replace metadata datastore newuserdatastorepref');
    Don't worry -- this will NOT reindex the documents. The new replace metadata syntax only updates the index metadata. It should take seconds to run rather than hours.

Cross-User CTX Operations

DBA users can now create/modify/drop CTX objects owned by other users, and sync/optimize other users' indexes.

Since 8.1.5, Text restricted operations on synthetic objects -- preferences, section groups, stoplists, etc. -- to the object owner; usera could create new preferences for himself, set attributes and drop those preferences, but could not touch userb's objects. This is too restrictive for some applications, which do all administration -- including text administration -- from one account which may not be the same as the application code owner schema.  In 10g, DBA's with the CREATE ANY TABLE privilege are allowed to create/modify/drop preferences owned by any user.  This applies to all synthetic objects -- preferences, stoplists, section groups, index sets, and policies.  Yes, it's a little strange to need CREATE ANY TABLE, but there are no database permissions for CTX objects (or for synthetic objects in general) so we picked the closest database priv.

sync and optimize, which in previous versions were limited to the index owner, have also been expanded.  Mimicking the old, deprecated ALTER INDEX method of sync and optimize, the PL/SQL interface now allows any user with ALTER ANY INDEX to sync or optimize any index.

CTX_OUTPUT Restrictions

CTX_OUTPUT.START_LOG  filenames can no longer contain slashes and are limited to 256 bytes

To improve security, log file names can no longer contain slashes, and are limited to 256 bytes in length.  Additionally, an existing log file will be read and format-verified before being overwritten by a new logfile.



Indexing Objects

MULTI_COLUMN_DATASTORE Attribute FILTER

The MULTI_COLUMN_DATASTORE can handle a mix of text and binary columns, automatically sending the binary columns through the INSO filter

The multi-column datastore in 9i worked by simply concatenating the columns together; it was therefore limited to columns which could be converted implicitly to text.  The new FILTER attribute of the multi-column datastore expands support to binary data -- you specify which columns have binary data, and those columns are sent through the INSO filter before concatenation.

FILTER is a comma-separated list of filter indicators which parallel the columns in the column list.  If a column needs filtering, use a Y in its position.  For instance:

ctx_ddl.create_preference('myfil','multi_column_datastore');
ctx_ddl.set_attribute('myfil','columns','cola, colb, colc');
ctx_ddl.set_attribute('myfil','filter', 'Y,N,N');
cola will be filtered, colb and colc will not.  Currently the INSO filtering in the datastore is hard-coded to a fixed timeout of 120, HTML output.  Additionally, either the column gets filtered for every row, or it does not get filtered for every row -- there is no capability for per-row filter indication, like the format column in the general create index.

MULTI_COLUMN_DATASTORE Attribute DELIMITER

The tags that the MULTI_COLUMN_DATASTORE places between columns are now optional

The multi-column datastore places, by default, HTML-like tags around each column.  The output seen by the filter looks like, for instance:

<columna>
these are the contents of columna
</columna>
<columnb>
these are the contents of columnb
</columnb>

In 10g, the new DELIMITER attribute can be set to COLUMN_NAME_TAG, which preserves this behavior, or NEWLINE, which uses a newline instead of these tags.  NEWLINE can be used in situations where the tags are not needed or may be confusing -- columns with XML fragments which are already tagged, or a single value split into multiple columns (like master-detail, but horizontally).

INSO_FILTER Attribute OUTPUT_FORMATTING

The INSO filter is now capable of producing unformatted (non-HTML) output, which is faster to produce than HTML output.

In previous versions, the INSO filter transformed binary documents into (supposedly) WYSIWYG HTML.  In 10g, the INSO filter has a new OUTPUT_FORMATTING boolean which, when set to FALSE, produces unformatted plain text instead.  This is helpful mostly because our tests show that the INSO filter can produce the plain text faster than the HTML -- up to 30 times faster (except for PDF -- there is no performance improvement for PDF because PDF is already using plain text output, as of 9i).  However, it is also helpful in cases where the HTML tags were undesired.  For instance, you can now use the INSO filter with the NULL_SECTION_GROUP.   Or use it with AUTO_SECTION_GROUP and not worry about adding all the HTML tags as stop sections.  Or use it in a user datastore (using policy-based document services) that supposed to output XML without worrying about validity.

And did I mention that it is up to 30 times faster?

MAIL_FILTER

The MAIL_FILTER converts multipart mail messages into a single concatenated text document.

The new MAIL_FILTER object takes email messages and converts them to text documents.  It knows the conventions of RFC-822 and 2045-2049, which is a fancy way of saying that it can deal with MIME.  The short explanation is that the mail filter splits the mail into parts, converts each part to text, then concatenates them together.  In more detail, the basic operation of the mail filter is:

  • separate the headers from the body
  • parse the headers (that is, glue together multi-line headers, handle non-ascii character set encoding, remember details of Content-Type, Content-Transfer-Encoding, etc.)
  • if there are headers which the user wants to preserve, transform them into tagged format (<SUBJECT>blah blah blah</SUBJECT>)
  • split the body into parts
  • for each body part:
    • if the body part is itself multipart, recursively parse it
    • otherwise, if the body part is encoded, decode it
    • if the body part needs filtering, filter it
    • convert the textual body part to the database character set
    • append the body part text to the output
Preserving Header Fields for Section Search

By default the mail filter throws away the mail headers.  You can nominate certain header fields to preserve in the output by using the INDEX_FIELDS attribute, which takes a colon-separated list (why colon-separated instead of comma-separated?  Colon is guaranteed not to be part of a mail header field name, while comma is not).  For instance:

ctx_ddl.create_preference('mymf','mail_filter');
ctx_ddl.set_attribute('mymf','index_fields','subject:from');
...
Input document Output Document
-------------- -------------------------------
Subject: Oracle Text is too fast! <SUBJECT>Oracle Text is too fast!</SUBJECT>
To: otlist@oracle.com <FROM>test.user@someplace.com</FROM>
From: test.user@someplace.com
Content-Type: text/plain; charset=US-ASCII
X-Return-Path: blah blah blah blah blah
The Mail Filter Configuration File

The mail filter does most of the heavy lifting automatically -- splitting up multi-level multi-part messages, converting from the message character set(s) to the database character set, handling RFC-2047 encoded strings, decoding from base64 and quoted printable, etc.  However, its behavior is partially user-adjustable, through the mail filter configuration file.

The mail filter configuration file resides in ?/ctx/config.  By default, it is named drmailfl.txt, but you can create your own file and specify its name in the system parameter MAIL_FILTER_CONFIG_FILE (use ctx_adm.set_parameter).  The configuration applies to all mail filters in the system -- we do not support index-specific configuration in this version.  The file should be in the database character set -- if you have more than one database in more than one character set, ensure that the mail filter configuration file is legal in both charsets, or use two files -- one for each database.

So what's in the mail filter configuration file?  Two parts -- content-type handling specification, and character set name mapping.

Configuring Content-Type Handling

In the content-type handling section, you specify what the mail filter should do with various content-types.  You have three choices:

  • INCLUDE - concatenate the part to the output
  • IGNORE - skip over the part
  • INSOFILTER - INSO filter the part, then concatenate to the output

All you do is list the content-type (type or subtype), some whitespace, then the handling for that content type on the right-hand side. For instance, let's take a look at the default config file:

[behavior]
text INCLUDE
application IGNORE
application/msword INSOFILTER
image IGNORE
audio IGNORE
video IGNORE
model IGNORE

All subtypes of text -- for instance text/plain -- are simply concatenated to the output (charset filtering to the database character set happens automatically).  Most subtypes of application -- application/binkydrawing, for instance -- are discarded.  The specific subtype application/msword, however, gets INSO filtered, then concatenated to the output.  Images, audio, video and 3D models are simply discarded -- they are not useful for text indexing. 

Tuning INSO Filtering

The INSO filter in the mail filter can be tuned.  The INSO_TIMEOUT and INSO_OUTPUT_FORMATTING attributes of the mail filter work just like the TIMEOUT and OUTPUT_FORMATTING attributes of the INSO_FILTER, and their values control the INSO invocations within the mail filter.  Currently timeout type is hardcoded to FIXED.

Adding character-set mappings

The second part of the mail filter configuration file is an area for mapping external character set names to oracle character set names.  By default the mail filter can map IANA character set names (e.g. US-ASCII) to oracle character set names (e.g. US7ASCII).  In this section you can add additional mappings -- external name on the left, oracle name on the right.  For instance, in the default file:

[charsets]
UTF-16 AL16UTF16
UTF-16BE AL16UTF16
UTF-16LE AL16UTF16LE
Changing the Mail Filter Configuration File

When changing the mail filter configuration file, its best to do an "alter system flush shared_pool" to force reloading of the index metadata (and thereby re-reading of the new mail filter config file)

MDATA Sections

The MDATA section is used for handling non-tokenized metadata, with isolated transactional updates.

Metadata in Oracle Text 9i is generally handled through the field section.  For instance, the author of a document may be coded like this:

<author>William Shakespeare</author>

Usually you would see a field section defined for author, and queries like:

(get thee to a nunnery) & (william shakespeare within author)

This works pretty well, but has some limitations:

  • the metadata gets tokenized
    • This prevents exact matching.  A search for "william shakespeare within author" will also hit a document authored by william shakespeare, jr.
    • Browsing does not handle multi-word values.  Using ctx_query.browse_words, you can find out that you have authors with "william" in their name, but not that "william shakespeare" is an author.
    • Special characters serve as token breakers.  For instance, say you have part numbers in a section.  A part number XZ-456 would get tokenized into XZ and 456.  This is now indistinguishable from, say, XZ+456, or XZ-999 (if stop numbers is enabled).  To prevent this, you'll have to add printjoins to the lexer, which may compromise the main body of the documents.
    • Phrase queries are slightly less efficient than single-word queries.  A query for the phrase "William Shakespeare" involves fetching two tokens, merging their document occurrences, then doublechecking the word occurrences for each of the documents in the intersect.  A theoretical query for WilliamShakespeare, on the other hand, involves fetching one token, and the word occurrence information is irrelevant.  Additionally, the posting lists for william and shakespeare will tend to be longer than the posting list for WilliamShakespeare.  In short, fewer tokens make queries a bit faster because there's less work to be done.
  • the metadata can't be changed in isolation
    • let's face it -- usually field section data is coming from another column, the document being indexed is synthetic, coming from pregeneration, user datastore, or multi-column datastore.  When the underlying relational data changes, the only way to propagate the change is to invalidate the existing document, then reindex the entire document.  This makes the document unfindable between the time it is invalidated and the time it is synched, and the unneeded reindexing of the document main body causes excess fragmentation.

10g offers the MDATA section, which is not tokenized and can be changed, transactionally, without synching the entire document. 

Defining MDATA Sections

To add an MDATA section to a section group, call:

ctx_ddl.add_mdata_section(<sectiongroup>,<sectionname>,<tag>)

The semantics and restrictions are the same as zone or field sections -- you can map multiple tags to one section name, section names can't conflict between section types, you can add them to basic, html, xml, and news section groups only, etc.  The one difference is that you can have up to 100 mdata sections.

How MDATA is Indexed

Unlike normal tokens, MDATA values are plucked out of the text by the sectioner and passed directly to the inversion engine, bypassing the lexer.  Let's look at an example, which will contrast normal text, field sections, and mdata sections:

doG <MDATA>  T.S. Elliot  </MDATA> HoUse <FIELD>  T.S. Elliot </FIELD>

This text comes out of the filter and gets processed by the section group.  The normal text (the words which occur outside of tags) gets sent to the lexer.  Both field and mdata section tags get consumed by the section group.  But field section text gets sent to the lexer, while mdata text is consumed by the section group.  So the lexer sees:

doG  HoUse   T.S. Elliot

Let's assume that we're using the default, basic lexer.  This breaks up the text into tokens by looking for non-alphanumerics, then uppercases each resulting token.  Normal text gets a type of 0 and the field section text gets a type of (let's say) 16.  So the engine will produce the following from the lexer (assuming default stoplist, where "S" is a stopword):

DOG      0  DOC1 POS1
HOUSE 0 DOC1 POS2
T 16 DOC1 POS1
ELLIOT 16 DOC1 POS3

Note that because the lexer never sees the MDATA text, you can hit phrases around it -- "dog house" as a query will hit this document, even though "dog" and "house" in the text are separated by an mdata section; MDATA is totally invisible to text search.  Normally, you'd see metadata all at the top of a document or all at the bottom, so most users won't even notice, but it's an interesting aside for detail-oriented types.

Back in the sectioner, the MDATA text gets stripped of leading and trailing whitespace and truncated to 64 bytes.  The result is then sent directly to the engine as a single value.  For our example, the resulting TOKEN_TEXT value will be:

T.S. Elliot

MDATA values get token types between 400 and 499, so we can expect our TOKEN_TYPE to be 400.  As for TOKEN_INFO, MDATA values don't have word position information -- they either exist or don't exist.  So our entry in $I for this is:

T.S. Elliot  400  DOC1

This is very different from how the field section was indexed.

MDATA sections can repeat -- each unique value gets indexed.  So a document like:

<MDATA>value 1</MDATA>
<MDATA>value 2</MDATA>
<MDATA>value 1</MDATA>
<MDATA>value 3</MDATA>
produces
value 1  400 DOC 1
value 2 400 DOC 1
value 3 400 DOC 1
Querying MDATA Sections

MDATA is queried using the new MDATA operator.  The syntax is MDATA(section, value), as in:

where contains(text, 'mdata(author, William Shakespeare)')>0

Pretty easy, but mdata searches have some differences from other elements in the contains language.

First, unlike WITHIN searches, MDATA searches are equality searches -- you'll hit only the exact queried value, no more, no less.  Take our example document and the comparable queries:

Elliot WITHIN field
MDATA(mdata, Elliot)

The first query hits the document, because the token "Elliot" is in between the field start tag and end tag.  The second query does not hit the document, because the mdata value is "T.S. Elliot", not "Elliot".  On the other hand, the field within is unable to search for "Elliot" and NOT hit "Elliot Ness", while the mdata search does this implicitly.  It all comes down to how you want to search your data.  Field sections should be used for cases where word search is needed (titles, for instance) and MDATA sections should be used where the section contents need to be treated as single value.

Second, the MDATA value does not go through the lexer at index time, and neither does the query value.  This means that the query must match the indexed term exactly, without token normalization.  For instance:

t.s. elliot WITHIN field
MDATA(mdata, t.s. elliot)

The first query hits the document.  The input tokens (t, s, elliot) go through the lexer, which uppercases them.  The document text also got uppercased by the lexer at index time, so the tokens hit the document.  The second query, on the other hand, does not hit the document.  "t.s. elliot" does not match "T.S. Elliot".  This may seem limiting at first, but it makes possible features like section-specific case insensitivity.  And, you could always perform symmetric normalization yourself -- just uppercase the value in the document and in your query.  This is fairly easy to do if, like most apps, the queries and documents are generated (user/multi-column datastore, and a user query parser, e.g.).

Third, MDATA searches are not affected by stopwords.  MDATA(artist, The The) is perfectly legal.  In our example, the word "S" was a stopword, so:

T.S. Elliot WITHIN field

is actually looking for T <anyword> ELLIOT in between field start and end tags.  This matches the document

<FIELD>T.R. Elliot</FIELD>

while the comparable MDATA search -- MDATA(mdata, T.S. Elliot) -- would not.

Finally, MDATA searches cannot use expansion operators in this version.  The value in the mdata operator is used directly as the query value -- any characters like %, $, ?, etc. are considered part of the value.  So the query

MDATA(mdata, T.S. Ell%)

does not find our example document.  It would, however, match the document

<MDATA>T.S. Ell%</MDATA>
Isolated MDATA Updates

Using new ctx_ddl functions, you can add and remove mdata values from an already-indexed document.  For example, let's say our documents have mdata to record the clubs to which a student belongs.  An example document:

<NAME>Bruce Aznable</NAME>
<GROUP>Soccer</GROUP>
<GROUP>Cooking</GROUP>

Bruce belongs to the soccer and cooking clubs.  Now he joins the drama club.  With field sections, we would have to reindex the document by updating it, and sync-ing.  With mdata sections, we simply do:

ctx_ddl.add_mdata('myidx','group','Drama','ABCD1234')

The parameters are: index name, mdata section name, new mdata section value, then rowid of the row which we're modifying.  This updates just the mdata for just this document, without syncing.  After this call, a query like

mdata(group, Drama)

Will pull up Bruce.  Similarly, if Bruce decides to stop cooking:

ctx_ddl.remove_mdata('myidx','group','Cooking','ABCD1234')

There is no update_mdata.  If the soccer club changes their name to the football club, you'll have to:

ctx_ddl.remove_mdata('myidx','group','Soccer','ABCD1234');
ctx_ddl.add_mdata('myidx','group','Football','ABCD1234');

Which means that to do an update you'll have to know the existing (old) values.  That shouldn't be too much of a problem, especially if these are called from trigger code.

Under the covers, the add and remove calls insert "delta rows" into $I.  For instance, the add Drama call above results in:

TOKEN  TYPE  FIRST  LAST COUNT  INFO
Drama -400 1234 1234 1 null

Note the negative token type, which makes it a delta row.  At query time, each mdata operator opens a cursor to read its delta rows, which are then woven into the token info for the normal rows.  The end effect is that the delta rows change the hitlist without having to modify existing $I rows.  Note that the add and remove calls do not commit, which allow them to make transactional changes:

ctx_ddl.add_mdata('myidx','group','Football','ABCD1234');
contains(text, 'mdata(group,Football)')>0 -- pulls up the row ABCD1234
rollback;
contains(text, 'mdata(group,Football)')>0 -- does not pull up the row ABCD1234

The delta rows are designed to be transient -- optimize (full or rebuild) integrates them permanently by rewriting the normal rows.  Because each mdata operator opens a cursor, you probably don't want to have hundreds of mdata operators in your query.

One last note on the mdata updates -- if you have a lot of values or a lot of rows, there is a multiple-value/multiple-rowid version:

ctx_ddl.add_mdata('myidx','group',sys.odcivarchar2list('Football','Rugby'),sys.odciridlist('ABCD1234','EFGH5678'));

In each list you can pass several values.  The call adds each value in the list to each rowid in the list.  This should be faster than several add calls.

MDATA Tricks

A companion document for advanced users is available on OTN ( http://otn.oracle.com/products/text/htdocs/mdata_tricks.htm) It's a more technical, in-depth look at some tricks you can do with MDATA -- getting around mdata limitations, handling dynamic token normalization, mixed structured range criteria, and efficient sort by structured column.

BASIC_LEXER Attribute NEW_GERMAN_SPELLING

The basic lexer can automatically handle  some of the differences due to new German spelling

Over the last decade or so, German spelling reform has changed the way German words are spelled.  From an IR perspective, this is as irritating as the Brits misspelling words like "harbour centre" -- it's harder to find occurrences of a word because the computer can't easily tell that the pre-reform spelling of a word (Photo) is the same as the post-reform spelling (Foto).  The NEW_GERMAN_SPELLING attribute of the lexer, when set to TRUE, will detect old spellings and transform them to the new spellings.  For example:

Document: Känguruh daß Potential Schifffahrt
...
exec ctx_ddl.create_preference('ngslx','basic_lexer');
exec ctx_ddl.set_attribute('ngslx','new_german_spelling','true');
...
TOKEN_TEXT in $I:
DAß
DASS
KÄNGURU
KÄNGURUH
POTENTIAL
POTENZIAL
SCHIFFFAHRT

The old spellings are indexed twice -- once in the old form, and once in the new form.  Words already correctly spelled -- Schifffahrt in our example -- are indexed only in new form.  During query, any entered old spellings are transformed to the new spelling:

contains(text, 'Schiffahrt')>0

hits the document.  The old spelling (with 2 f's) is not in our $I.  However, at query time, it's automatically converted to the new spelling (3 f's), which does exist.  This means that you can query in old or new spelling and hit documents which use either the old or new spelling.  Querying for just the new spelling or just the old spelling is not currently possible.

So if old spellings are converted to new spellings at query time, why are the old spellings indexed?  For wildcard.  The basic lexer cannot convert partial old spellings to new spellings, so a query like Potent% must expand on old spellings.  On the other hand, a query for Schiffa% will not hit this document, even though it matches the old spelling of "Schifffahrt", which is in our document.  So it really only works as long as your documents are completely old spelling, or completely new spelling.

This feature does not implement all Rechtschreibreform changes.  Single word to multi-word changes, in particular, are not handled.  For instance, Radfahren should now be written as two words -- Rad fahren, and sequences like 8jährig get hyphenated to 8-jährig, which normally would be lexed as two words.  However, the NGS basic lexer does not currently support this.  It also does not support capitalization changes -- Swimming-pool (old) to Swimming-Pool (new) for instance.  And, it's a little too aggressive on ß changes -- it thinks that spaß should be spelled "spass" -- but these would normally be handled through alternate spelling, which is usually on for German indexing.

There is also possibly inconsistent behavior when combining new German spelling with base letter.  The base letter transform is applied only to the new spelling.
Consider a lexer with new German spelling and base letter processing the word "Känguruh".  We know that this gets indexed as "Känguruh" and "Känguru" because of new German spelling.  But after this only the new spelling gets base lettered, so we index "Känguruh" and "Kanguru". This produces a small hole in the query results.  Compare this to a lexer with only base letter on:

QUERY:       Känguruh  Kanguruh  Känguru  Kanguru
NGS + BL hit no hit hit hit
BL ONLY hit hit no hit no hit

With only base letter, we can hit on Känguruh or Kanguruh, but not the new spellings.  With new German spelling on, we can hit on the new spellings, but we lose the ability to hit on "Kanguruh".  At first glance this may not seem so bad, since the native German speaker would not query for "Kanguruh" -- they'd never just drop the umlaut from the a.  However, consider "Varieté".  Under the new German spelling rules, this becomes "Varietee":

QUERY:       Varieté   Variete   Varietee
NGS + BL hit no hit hit
BL ONLY hit hit no hit

Each lexer has a problem: NGS+BL won't hit on "Variete", which the average German may type, since é is not in the German alphabet (in the same way that an American might type "manana" instead of "mañana"), and the BL ONLY won't hit on the new spelling, Varietee.

Alternate spelling coexists a bit better.  A lexer with alternate spelling German and new German spelling indexes "Kaenguruh" as "Kaenguruh" and "Känguru", and can find this document by any of the four queries "Kaenguruh", "Kaenguru", "Känguruh" or "Känguru". 

When alternate spelling German AND base letter AND new German spelling are on, you still suffer from the foreign base letter query problem discussed above.  Other than that, the query semantics are not surprising. 

BASIC_LEXER Attribute OVERRIDE_BASE_LETTER

OVERRIDE_BASE_LETTER allows you to switch the direction of alternate spelling vowel transforms

9.2 allowed users to set BASE_LETTER and ALTERNATE_SPELLING at the same time.  Unfortunately, this produces linguistically undesirable behavior in some cases, due to our normalization method.  Consider three documents:

DOC1: Brüder
DOC2: Bruder
DOC3: Brueder

In Text, each word is sent through the alternate speller first.  The alternate spelling transformer looks for two-letter alternate spellings and transforms them to single-letters.  The original form is also saved for wildcard matching, in case there are changes.  So our table would get lexed:

       ORIGINAL      ALT. SPELL
DOC1: Brüder -> Brüder
DOC2: Bruder -> Bruder
DOC3: Brueder -> Brueder Brüder

Now the token is sent through base lettering.  Note that if the alternate speller saved an original form, that form does NOT go through the base lettering.  Only the transformed version goes through.  So our table is:

       ORIGINAL      ALT. SPELL          BASE LETTER
DOC1: Brüder -> Brüder -> Bruder
DOC2: Bruder -> Bruder -> Bruder
DOC3: Brueder -> Brueder Brüder -> Brueder Bruder

When someone searches for "Brüder", it goes through the same process -- becoming a search for "Bruder".  This matches all the documents, because the form "Bruder" was indexed for all.  The problem is that this is incorrect.  "Bruder" and "Brüder" are different words.  Searching for "Brüder" should match "Brüder" and "Brueder", but not "Bruder".  Searching for "Bruder" should match only "Bruder".

To get such semantics, 10g has the OVERRIDE_BASE_LETTER attribute.  This reverses the alternate spelling transform so that instead of changing two letters to one, it changes one letter to two (vowels only -- ß is always transformed to ss).  Let's go through our table again:

       ORIGINAL      ALT. SPELL
DOC1: Brüder -> Brüder Brueder
DOC2: Bruder -> Bruder
DOC3: Brueder -> Brueder

Note that the alternate spelling now hits doc 1 instead of doc 3.  Now the base letter:

       ORIGINAL      ALT. SPELL          BASE LETTER
DOC1: Brüder -> Brüder Brueder -> Brüder Brueder
DOC2: Bruder -> Bruder -> Bruder
DOC3: Brueder -> Brueder -> Brueder

The ü in doc 1 is not changed to u because the saved original form is not sent through the base letterer -- only the transformed version.  Now we get the semantics we want.  At query time, "Bruder" is unchanged by alternate spelling or base lettering, and matches only doc 2.  "Brüder" and "Brueder", on the other hand, get transformed to "Brueder", which is the normalized form of docs 1 and 3. 

So it's a little confusing under the covers, but the short story is that if you're indexing with BASE_LETTER and ALTERNATE_SPELLING on, you're probably better off with OVERRIDE_BASE_LETTER also on.

JAPANESE_LEXER Attribute DELIMITER

The DELIMITER attribute signals the Japanese lexer to honor common Japanese printjoins

The Japanese lexer attempts to split Japanese text into words.  Normally, it breaks on any non-Japanese characters.  ABCD&EFG, for instance (assuming ABCD and EFG are Japanese characters) gets split automatically at the &, even if DE is a word (you can add D&E to the lexicon to prevent this).  Middle dot and forward slash are often used in Japanese between related words -- similar to the function of hyphen in English compound words.  Setting the DELIMITER attribute to ALL effectively declares these two characters as printjoins.  So ABC/DEF gets lexed as ABC/DEF instead of ABC and DEF.  It's a little confusing, because by turning DELIMITER on, you're actually saying don't delimit, so just think of it as a join setting.

User-extensible Japanese & Chinese Lexicons

The lexicons used by the Japanese and Chinese lexers are extensible.

The Japanese and Chinese lexers are driven by lexicons -- think of them as big word lists.  In 10g you can extend and override the lexicons with your own words -- neologisms, company-specific terminology, etc.  Here's how to do it:

  1. convert the existing lexicon to text.  You do this by using ctxlc:
       ctxlc -ja -ocs ja16euc > jalex
    This command takes the Japanese lexicon in ?/ctx/data/jalx and converts it to text in the ja16euc character set. The result gets directed to standard out, so we've used UNIX piping to direct it to a file called jalex.  The output character set (ocs) is whatever NLS character set name you want to work in.  For Chinese, use the -zht switch instead of -ja.  Don't worry about the switch being called zht -- the lexicon is not tied to traditional or simplified -- that's all controlled by -ocs.
  2. Edit the lexicon file to add your own terms, delete existing terms, etc.

  3. Compile your wordlist to binary:

        ctxlc -ja -ics ja16euc -n -i jalex
    -ics is the character set of the input file, and -i is the text file to compile to binary. (-n is needed because we're creating new files, rather than modifying old ones). This will generate new lexicon files in the current directory:
        drold.dat droli.dat drolk.dat
  4. Now back up the files in ?/ctx/data/jalx (or ?/ctx/data/zhlx) if you wish, then replace them with the newly-generated files. You will have to append JA or ZHT to the name of the file to conform to the naming convention.  For instance::

        mv drold.dat $ORACLE_HOME/ctx/data/jalx/droldJA.dat
  5. Flush the shared pool to force reload of the new lexicon. This is not technically needed right now, but may be in the future.

Chinese Lexers New Supported Character Sets

The Chinese lexers now support new character sets.

The Chinese lexers (VGRAM and lexical) now support any database character set for Chinese, including the newer ZHS32GB18030 character set.

WORLD_LEXER

The WORLD_LEXER automatically handles multi-language corpora and documents by Unicode codepoint detection.

The new WORLD_LEXER indexes multi-lingual documents and corpora on unicode databases.  It processes a document by looking at the codepoints of the characters, and following an appropriate segmentation strategy:

  • Words made of Chinese, Japanese, and Thai characters get bigram segmented -- they are split into overlapping sequences of 2 characters.  Japanese kana is vgram segmented.
  • Words made of Arabic characters get whitespace segmented, then post-processed with Arabic-specific simple normalization and decompounding:
    • Removal of vowel marks
    • Separation of definite articles, pronoun suffixes, prepositions, conjunctions, and particles from the base word.  For instance, the Arabic translation for "his teacher" is written as a single sequence.  The World lexer separates this into "his" and "teacher"
    • case-folding, verb stemming and gender normalization are not currently handled
  • Everything else (Latin characters, Cyrillic, Greek, etc) gets whitespace segmented

Since the segmentation strategy is determined by codepoint, you do not need to identify the language of the document, like you do with the multi-lexer.  Furthermore, you can index multi-lingual documents -- the world lexer will shift segmentation strategy as needed to deal with the multiple languages.  In keeping with its automatic nature, the world lexer has no attributes to worry about.  To use it, just create a preference:

ctx_ddl.create_preference('mylex','world_lexer');
...
create index .. parameters ('lexer mylex');

On the query side, you don't have to worry about query-time language settings. There's only one lexer, unlike the multi-lexer's collection of multiple lexers.

On the other hand, the world lexer's lack of attributes means that it has no language-specific features.  You cannot do fancy things like German decompounding, Japanese segmentation, or Swedish alternate spelling.  You can't define printjoins, use themes, or do stem-form indexing.  So while the multi-lexer is more work to set up, it is a lot more flexible.  Here's a short comparison between the world lexer and the multi-lexer:


Feature
MULTI_LEXER
WORLD_LEXER
Definition and Setup
use ctx_ddl API to set up individual lexers and collect them into a multi-lexer
no programming needed
Multi-lingual corpora (mono-lingual documents)
Supported
Supported
Multi-lingual documents
Not Supported
Supported
Database charset
Any
Unicode only (AL32UTF8 and UTF8)
Set attributes of each language
Supported
Not Supported
Document language ID
User-identified, per document
Not needed
Query language identification
User-identified, per query Not needed
Arabic
Keyword search only
Arabic-specific features
Chinese/Japanese
VGRAM or Segmentation
VGRAM only

Japanese Stemming

Japanese Stemming is now supported when using the JAPANESE_LEXER

The JAPANESE setting for stemming in the basic wordlist implements Japanese inflection expansion.  It pretty much works like stemming in other languages -- enter in a form of a word and it finds all inflection forms of that word in the documents.  However, there are some differences:

  • Japanese stemming is supported only for the segmentation lexer
  • Stemming is supported only for verbs and adjectives.  Since this includes Japanese-special cases like noun-verb and adjective-verb (unten suru, shizuka datta) and Japanese nouns don't generally have plural form, this should be practically sufficient.
  • The input to stemming must start with kanji and end with a regular kana ending.  For instance, TANOshikatta, where TANO is a kanji and the rest is written in kana

The Japanese stemmer is homegrown technology (unlike our other stemmers which we license).  If you have any feedback on its behavior, let us know.

Japanese Fuzzy

Japanese Fuzzy is now supported for kana fuzzy matching.

The JAPANESE setting for fuzzy match in the basic wordlist implements Japanese fuzzy.  The input at query time should be a kana sequence.  The fuzzy matcher will expand this to similar kana strings in the document corpora using character similarity clusters.  This should work with both the segmentation and vgram Japanese lexers.  It's probably of most use when finding inconsistent Japanization of imported words.  For instance, BAIOLIN or VAIOLIN for violin.



Index Creation and Maintenance

Direct Path Inserts During Index Creation

Index creation should be noticeably faster due to the use of direct path inserts.

During an index creation, Text inverts the documents and collects the token lists in memory.  When memory is full, it flushes the memory to the index tables.  In 9i, we did this using normal SQL inserts.  In 10g, we've switched to direct path inserts.  This is much faster, and you can even avoid undo/redo log use if you've created your index tables with nologging.  Currently this happens only during an index creation -- sync still uses SQL for insertion.  This preserves your ability to sync and query at the same time -- if sync were to use direct path loading, the $X index would go invalid for small windows of time, and concurrent queries would fail.

Parallel Local Domain Text Index Creation

Local domain text indexes can now be built in parallel using the PARALLEL keyword in the create index statement

Local domain text indexes in 9i, like b-tree indexes, can be built in parallel using the DBMS_PCLXUTIL helper package.  In 10g you can also use the parallel keyword in the create index statement.  The index gets built with inter-partition parallelism -- the system will work on multiple partitions at one time, with no more than one slave per partition.  You can still use DBMS_PCLXUTIL for intra- and inter-/intra- partition parallelism.

Metadata-only Index Rebuild

The METADATA keyword in alter index rebuild changes the index metadata without reindexing.

Normally a change to the index metadata necessitates a rebuild.  For instance, changing the lexer to turn base_letter on makes the existing index data invalid -- the documents have to be relexed or queries will get inconsistent results.  However, some changes don't require a rebuild.  Changing the stemmer setting, switching from single-lexer to multi-lexer, changing the user datastore procedure name (while keeping the same code), etc.  For these situations, 10g has the METADATA keyword, which changes a rebuild operation from a complete index rebuild to a metadata-only index rebuild.

Let's say you want to change from a single-language lexer to the world lexer.  First you create your replacement preference:

exec ctx_ddl.create_preference('rlex','world_lexer');

Then rebuild your index using the REPLACE operation, and include the METADATA keyword:

alter index myidx rebuild parameters ('replace metadata lexer rlex');

This removes the existing lexer and plugs in the replacement lexer, but leaves the index tables alone.

The main caveat is that this operation gives you enough rope to shoot yourself in the foot.  If go and do something inconsistent with existing index data -- like changing the stopword list or turning on prefix indexing -- it won't stop you.  So you should either be certain that your changes are backwardly compatible with existing data, or accept possible query inconsistencies.

Another caveat, possibly obvious, is that while you can change the storage preference using the METADATA keyword, it has no effect because the index tables are not touched by this operation.  It would affect a future "alter index .. rebuild;" command, though, since at that time it will read the updated metadata and rebuild the index tables.

The ODCI interface does not allow partitioned indexes to be rebuilt. You can use replace metadata on individual partitions to, for instance, change the sync interval. But you can't alter index rebuild for the overall index -- you'll get the error ORA-14086: a partitioned index may not be rebuilt as a whole. However, we have a PL/SQL procedure which works around ODCI.  Just call it with the index name and the parameters string, like this:

exec ctx_ddl.replace_index_metadata('myidx','replace metadata lexer rlex');

Automatic Sync/Sync on Commit

The SYNC keyword of create index parameters can set up a dbms job or force sync after every commit

To keep their context indexes up-to-date, customers usually have a dbms_job which periodically wakes up and syncs the index.  The new SYNC keyword can do this step for you at create index time:

create index myidx on mytable(mycolumn)
indextype is ctxsys.context
parameters ('sync (every "SYSDATE+5/1440")');

The create index code will automatically create a new dbms_scheduler job that syncs this index every 5 minutes.  The double-quoted interval that follows EVERY is passed directly through to dbms_scheduler -- please refer to the PL/SQL Packages and Types Reference for information on the syntax and examples of use.  You can also set memory and parallel details of the sync call:

create index myidx on mytable(mycolumn)
indextype is ctxsys.context
parameters ('sync (every "SYSDATE+5/1440" MEMORY 15M PARALLEL 2)');

Of course this is mainly a convenience feature -- you could do all this yourself.

But what if every 5 minutes isn't good enough for you?  What if you really really need super up-to-the-minute updates?  Then either use transactional queries or the second option for the sync: COMMIT:

create index myidx on mytable(mycolumn)
indextype is ctxsys.context
parameters ('sync (on commit)');

That's all there is -- with COMMIT you cannot set memory or parallel.  In this mode, the syncs happen automatically when inserts or updates to the base table are committed.   Three caveats:

  1. since usually a transaction has only one or two document changes, you can expect this to increase fragmentation and thus decrease query performance
  2. since only one session can sync at a time this will serialize concurrent commits
  3. the sync is actually done in a two step process. First the transaction commits, then the sync occurs. This means that, in the window of time between the transaction commit and the sync completion, the documents are visible to other users but not searchable.

Sync on commit is not mutually exclusive with transactional query. You can have both of them on if you want. But if you were to choose one, which to choose? The next section, on transaction query, draws a more detailed comparison, but the short answer is that most customers who need up-to-date query should use EVERY sync and transactional query, instead of ON COMMIT sync.

The last mode of SYNC is MANUAL, which maintains current behavior -- that is, don't sync on commit, don't create a scheduler job.

Transactional Queries

Transactional queries use per-row evaluation to deliver statement-level consistent contains queries.

Customers sometimes tell us that they need immediate results -- when they insert or update a row, it needs to be searchable right away.  The context index, like most information retrieval products, does not do this by default.  Deletes are immediate, but inserts and updates are asynchronous; the document is unavailable until the next sync.  Practical experience shows that this is usually acceptable.  The requirement for "right away" search in most systems is more theoretical than necessity -- if the application can live without this requirement, or find some alternative way of fulfilling it, the app generally runs better.  But for those situations where it is unavoidable, context indexes now have the option of being fully transactional.

To use this new feature, specify the transactional keyword during create index:

create index myidx on mytable(mycolumn)
indextype is ctxsys.context
parameters ('transactional');

You can also alter index rebuild .. replace metadata transactional to make an existing index transactional. 

Updates and inserts on a transactional index get logged in dr$pending, like normal indexes, but the rowids also get logged in a second table, dr$unindexed.  During query, the $I produces its hits, then each rowid in dr$unindexed is evaluated, individually, to see if it is a match -- that is, we fetch the document text, filter it, section it, invert it, then run the contains query on that document's data -- somewhat like a highlight operation.  The set of rowids from dr$unindexed that match is combined with the results from the $I scan and the merged set is returned from contains.

Like dr$pending, dr$unindexed is fully transactional -- you can savepoint, rollback, etc. and the effects are immediate.  So you can do stuff like:

select count(*) from table where contains(text, 'someword')>0;    -- 0 hits
insert into table values ('someword');
select count(*) from table where contains(text, 'someword')>0; -- 1 hit (the one we just entered)
rollback;
select count(*) from table where contains(text, 'someword')>0; -- 0 hits

So why do we still need sync?  The cost for evaluating a single unindexed rowid can be fairly high -- depends on your index.  Perhaps a user datastore procedure needs to be executed.  Maybe the INSO filter will have to be run.  Even the presence of one unindexed rowid can slow down query response time.  As dr$unindexed gets more and more rows, the queries will get slower and slower.   A sync indexes the documents, adds the data to the $I index, then removes them from dr$unindexed.  Since searching $I is more scalable than evaluating documents one at a time, your queries should be faster after the sync.  So best practice (if you absolutely MUST have instant searching) is to have transactional query and a reasonable periodic sync -- as infrequently as possible to control fragmentation, but frequent enough that dr$unindexed doesn't get hairy.

If some of your queries need instant searching, but others don't (like count_hits estimates, etc) you can turn off the dr$unindexed scan at the session level:

exec ctx_query.disable_transactional_query := TRUE;

Future contains queries will not scan dr$unindexed.  To turn transactionality back on, set the variable to FALSE.  A bit confusing to set to false to turn something on, but think of it as turning off the override.

How does transactional query compare to sync on commit?


Transactional
Sync on Commit
can query documents immediately after insert or update
YES
NO
can query documents immediately after commit
YES
YES (small window where documents are visible but not queryable through contains)
impacts query speed
YES -- slower because dr$unindexed must be scanned
NO -- not directly, but increased fragmentation can lead to slower queries
increases index fragmentation
NO
YES

Transactional query is slower because of dr$unindexed, true, but the sync on commit, unless dealing with large transactions, will increase fragmentation, which may slow down queries anyway.  And transactional gives you more immediate gratification than sync on commit -- you can query right after the update or insert, instead of having to wait for the commit.  So I think in general transactional is better than sync on commit.  It would be much better than sync for those systems where documents change very frequently when they are young -- for instance, an editorial system where writers are saving drafts and making edits.  Transactional would keep up with the frequent changes, then the companion periodic sync would capture the reasonably "final" version, while sync on commit would have to index every saved version, leading to dramatically increased fragmentation.

However, transactional query and sync on commit are not mutually exclusive. If you want to turn both of them on, you can. Personally, I think sync EVERY with transactional is better, because you get the benefits of transactional consistency without the fragmentation of sync commit, but your application mileage may vary.

Token Type Optimization

The TOKEN_TYPE mode of optimize optimizes all tokens of a specific type

The new TOKEN_TYPE mode does a FULL optimize, but limited to all tokens of a specific token type.  You can think of it as an expansion of the TOKEN mode, which targets a specific (token, token_type) .  You invoke it like a TOKEN optimize, but specify TOKEN_TYPE for the mode:

exec ctx_ddl.optimize_index('myidx','TOKEN_TYPE',token_type=>1);

This optimizes all theme tokens (type 1).  For mode you can use the symbol CTX_DDL.OPTLEVEL_TOKEN_TYPE instead of a literal 'TOKEN_TYPE', if you want.  For the token_type parameter, you can use the numeric token type code, if you know it, or ctx_report.token_type if you don't (see example below). Like TOKEN optimize, TOKEN_TYPE runs fully to completion, so don't specify maxtime.  It optimizes all tokens of the given type, so don't specify a token parameter, either.  You can run TOKEN_TYPE optimization in parallel, so you may specify a parallel_degree parameter if you wish.

Behind the scenes, the token_type optimize goes through almost the same code as a FULL optimize -- read the $N table, scan the $I table, rewrite the rows by removing garbage and defragging small rows.  The main difference is that the TOKEN_TYPE mode will simply ignore any row that is not of the specified type.

This mode is probably most useful for keeping field sections and mdata sections up-to-date.  For instance, suppose you have an mdata section named "symbol" that holds stock symbols, on which you want to maintain speedy performance.  Periodically, run something like:

exec ctx_ddl.optimize_index('myidx','TOKEN_TYPE',
token_type=>ctx_report.token_type('myidx','MDATA symbol'));

This will keep all values of the MDATA section symbol defragged and optimal.

Rebuild Optimization

The REBUILD mode optimizes an index by building a fresh shadow copy of the index, then swapping it with the existing index.

FULL optimize updates the $I table in place.  This can be slow, since we must use SQL inserts and deletes, which are slower than direct path writing.  On top of that, the $X index slows down any DML on $I.  The new REBUILD mode of optimize is designed to get around these problems.  You invoke it like a fast optimize, but specify REBUI