Oracle Text 10g Technical Overview
Oracle Text 10g
Technical Overview
This is a technical overview of the improvements found in Oracle Text
version 10.1.0.2. This is intended for an audience fairly familiar with
previous versions
of the product.
If you are new to the product, you should familiarize yourself with
the
product before reading this document. The official documentation,
especially
the Application Developer's Guide, is a good place to start, as are the
interMedia Text technical overviews for 8.1.5, 8.1.6, 8.1.7, 9.0.1, and
9.2.0
in order. (I really need to get around to creating an updated overall
technical
overview which merges these together)
The goal of this overview is to supplement the documentation in
three ways:
-
Collect all the new features -- and only the new features -- in
one document so that expert users can catch up just by reading this
instead of reading the whole doc and mentally diff'ing.
-
Sketch each new feature without covering every option, bell, and
whistle. Give
the reader enough information to decide if the new feature merits
further investigation in the main doc.
-
Include information, observations, and notes on each feature
which may
not be suitable for documentation. For instance, opinions on the pros
and cons
of feature X, the internal mechanisms of feature Y, speculation of how
feature
Z could be useful, etc.
If, while reading this document, you think to yourself, "I wish it
could do X," or "there's a better way to do this," or (hopefully not)
"that's really dumb," please tell us -- we're always eager to hear from
the people who actually use our product! Oracle personnel can
email us or our product help mailing list. External customers --
contact support and have them log an enhancement or drop us an email.
Table of Contents
Upgrading to 10g
Oracle Text upgrade should happen automatically as part of the normal
database upgrade. This section details the manual upgrade
process, in case something goes wrong or for the incorrigibly
curious.
Oracle Text is now consistent with the database upgrade process in
naming convention, run procedure, and registry integration:
-
The naming convention of the upgrade scripts in previous
versions
was the version to which you are upgrading. That is, to upgrade
to 9.2.0, you ran u0902000.sql. In 10g, the naming convention is
the version from which you are upgrading. To upgrade to 10g from
9.2, u0902000.sql gets run.
-
In 9.2, there were two scripts to run -- an s script run as SYS,
and a u script run as ctxsys. Now, there is a single script --
ctxu920.sql, for instance, that should be run as sys. Internally,
the script does a ALTER SESSION SET CURRENT_SCHEMA when needed.
In 9.2.0, if jumping multiple versions -- upgrading from 8.1.7 to
9.2.0, for instance, you had to run a pair of scripts for each version
step -- s/u0901000 to go from 8.1.7 to 9.0.1, then s/u0902000 to go to
9.2.0. In 10g, the earlier scripts call the later scripts, so you
run only one script, even for a multiple-version jump.
-
As in 9.2.0, Oracle Text supports the DBMS_REGISTRY facility, so
that you can query the registry views to see if Oracle Text is
installed, and its version. In 10g, we've added a simple
validation procedure which checks that all expected database objects
owned by CTXSYS are present and VALID. This is not an exhaustive
check of data dictionary validity, so a VALID status in the registry
does not absolutely guarantee successful install, but it is a
reasonable first check.
So, the manual upgrade procedure really boils down to a single step:
- run ctxu<ver>.sql as SYS
where <ver> is the version FROM which you are upgrading.
For instance, to upgrade from 9.2.0 to 10g, run ctxu920.sql. If
you are coming from 8.1.7, you would run ctxu817.sql instead.
Manual Post-upgrade Steps
Due to the security changes documented below,
manual post-upgrade steps may be required to keep an existing system
operational. The biggest potential problem is a user datastore
procedure owned by ctxsys which depends upon database objects in other
users'
schemas. ctxsys's new lowered database privileges will prevent
this procedure from working, which in turn will prevent the index from
sync'ing new data. The quickest solution is to manually grant any
needed permissions to ctxsys. However, the recommended solution
-- which applies also to those existing user datastores which continue
to work -- is to recreate the user datastore as a non-ctxsys-owned
procedure and modify the index metadata. Step-by-step
instructions are below.
Security Changes
CTXSYS Database Privileges
CTXSYS
no longer has ALL privileges. It is a normal user with only
CONNECT and RESOURCE
In previous versions ctxsys was an über-user with ALL database
permissions. To enhance security in 10g, we've reworked the
product infrastructure so that ctxsys needs far fewer
permissions. By default, ctxsys now has only CONNECT, RESOURCE,
and some specific SELECT privs on database data dictionary views.
Most systems can even revoke CONNECT and RESOURCE from ctxsys and
continue to operate, although this is not actually supported.
This change should not be visible to end users, and even expert OT
users shouldn't need to care, except for one wrinkle: the effect on
user datastores in existing systems after upgrade. Consider a
simple user datastore created in 9.2 like this:
conn text/iscool create table foo (a varchar2(10), b varchar2(10)); conn ctxsys/ctxsys create or replace procedure foo_udp(r in rowid, c in out nocopy clob) is b varchar2(200); begin for c1 in (select a, b from text.foo where rowid = r) loop b := '<A>'||c1.a||'</A><B>'||c1.b||'</B>'; dbms_lob.writeappend(c, lengthb(b), b); end loop; end; / grant execute on foo_udp to text; conn text/iscool exec ctx_ddl.create_preference('myud','user_datastore'); exec ctx_ddl.set_attribute('myud','procedure','foo_udp');
The user datastore procedure selects from text's table "foo".
The
procedure is definer's rights, so executes as ctxsys. In 9.2,
ctxsys has ALL privileges, so the select is allowed.
However, in 10g, ctxsys is a normal user, and no longer has implicit
permission to select other users' tables. The datastore procedure
will fail when executed. There are four simple solutions to
this problem:
-
as text, grant select on foo to ctxsys with grant option
-
make ctxsys.foo_udp invoker's rights (this works because the
effective user during indexing is the index owner -- see below)
-
recreate the user datastore procedure under the text schema,
grant execute to ctxsys, and redefine ctxsys.foo_udp to call
text.foo_udp
-
RECOMMENDED recreate the user datastore procedure under the text
schema, and update the index metadata to use this new procedure.
This depends on two new 10g features: user
datastore procedures owned by users other than ctxsys, and rebuild index metadata.
Step-by-step instructions, with an example, are given below.
Invoker's Rights CTX Packages
The
public CTX packages are now invoker's rights packages.
In previous versions of the product, all public CTX packages were
definer's rights. The package code emulated invoker's rights by
walking up the PL/SQL call stack. This usually works, but does
not support invoker's rights callers. For instance, consider a
wrapper which creates preferences:
conn usera/usera create or replace procedure cr_ctx_pref(pre_name in varchar2, obj_type in varchar2) authid current_user is begin ctx_ddl.create_preference(pre_name, obj_type); end; / grant execute on cr_ctx_pref to public; conn userb/userb exec usera.cr_ctx_pref('myfilter','inso_filter');
userb is calling usera.cr_ctx_pref. This is an invoker's
rights procedure, so the effective user is userb. This in turn
calls ctx_ddl.create_preference. The effective user emulation
code looks at the call stack and sees USERA.CR_CTX_PREF is the
caller. It does not know that this is an invoker's rights
package, so assumes that the effective user is USERA; this creates the
preference USERA.MYFILTER, rather than USERB.MYFILTER.
In 10g, ctx_ddl has invoker's rights, and the package code
determines current effective user from the session environment.
With this method, in the example above, the preference is created
correctly as USERB.MYFILTER.
However, one hole still exists in name resolution: we don't honor
ALTER SESSION SET CURRENT_SCHEMA. If the example above does:
conn userb/userb alter session set current_schema = usera exec usera.cr_ctx_pref('myfilter','inso_filter');
One might reasonably expect the preference to be created as
USERA.MYFILTER, but it is instead created as USERB.MYFILTER.
Current schema support may be added in a future release, but for now
this is expected behavior. To work around this limitation, the
call should use owner.object syntax:
conn userb/userb exec usera.cr_ctx_pref('usera.myfilter','inso_filter');
This will work as long as userb has the ability to create
preferences owned by another user, as described below.
One last note. Invoker determination has effects beyond name
resolution, and this change adds support for invoker's rights for those
effects, as well. For instance, during a ctx_query.count_hits
operation, to ensure security, the effective user must be able to do a
select count(*) where contains -- in other words, we check that the
effective user has select privs on the base table of the index.
Since ctx_query is now invoker's rights, you can have globally useful
application code owned by a schema without requiring that schema to
have permissions on invoker's objects. For instance, consider an
invoker's rights procedure that aggregates count_hits across multiple
partitions:
conn usera/usera create or replace function ach( qry in varchar2, idx in varchar2, parts in sys.odcivarchar2list ) return number authid current_user is total number := 0; begin for i in 1..parts.count loop total := total + ctx_query.count_hits(idx, qry, part_name=>parts(i)); end loop; return total; end; /
any user can make use of this function without granting usera select
on the tables actually being queried.
Owner's Rights Operations
The
effective user during indexing and query is now the index owner.
The effective user during indexing and query in previous versions
was ctxsys. In 10g, the effective user is the index owner.
This is true for all index operations: SQL operations like create index
and select .. contains, as well as PL/SQL operations such as
ctx_ddl.sync_index, ctx_query.count_hits and ctx_doc.filter -- anywhere
where either the index tables or the base table are being touched.
Keep in mind that owner's rights is slightly different from the more
familiar invoker's rights or definer's rights models. Consider a
query executed by usera like this:
select * from table where userb.function(a)>0;
The effective user within "function" is either userb, if function is
definer's rights, or usera, if the function is invoker's rights.
However, replace userb.function with ctxsys.contains:
select * from table where ctxsys.contains(a,'foo')>0;
and the effective user inside "contains" is neither usera, the
invoker, nor the definer ctxsys. Instead, it is the owner of the
index, which may be some third user.
While this is subtle and interesting in a geeky way, the change to
owner's rights is of little interest to the Text user except for user
datastore procedures (see below) and temp
tablespace. In 9.2, since the effective user was ctxsys when
creating the index, it used ctxsys's temp tablespace. In 10g, it
will use the index owner's temp tablespace.
User Datastore Restrictions
The restrictions on user datastore procedures have been relaxed.
Now, any user can own the user datastore procedure.
In 9.2, the effective user during indexing was ctxsys. ctxsys
was a super-user with ALL permissions, so malicious user datastore code
could wreak havoc. To encourage DBA review of user datastore
code, only ctxsys could own user datastore procedures. This made
installation of text applications more complex because the ctxsys
password had to be known to create the user datastore procedure.
In 10g, thanks to owner's rights during
indexing, actions in the user datastore during text indexing are
limited by the index owner's permissions -- it is no longer possible
for a malicious user to hijack text indexing to give himself more
permissions or to do anything that cannot already be done through
direct SQL calls. This eliminates the need for the security
restriction on user datastore ownership, so it has been relaxed.
Now, any user can own the user datastore procedure. In fact, Text
development recommends that all user datastore procedures should be
owned by non-ctxsys users -- ctxsys should not own application code or
objects. Information on migrating existing user datastores can be
found below.
This relaxation is not limited to user datastores. Any
procedure
used in a preference attribute -- procedure filter, user lexer, etc. --
can now be owned by any user.
The elimination of the restriction does present a small change in
behavior when defining a preference. In previous versions, since
the user datastore procedure had to be owned by ctxsys, something like:
exec ctx_ddl.set_attribute('myuds','procedure','proc');
meant ctxsys.proc. Now, however, ctx_ddl will follow ctx name
resolution and store a fully qualified name in preference
metadata. To avoid potential problems, Text development
recommends always using OWNER.PROCEDURENAME style when creating user
datastore preferences.
One last security note: owner's rights does not abrogate your
responsibility to think carefully about the security ramifications of
user datastore procedures. For instance, a malicious user can
still define an invoker's rights user datastore procedure and trick a
DBA user into creating an index using it. In this procedure the
malicious user effectively gains DBA permissions. As with any
arbitrary submitted code, user datastores should always be reviewed for
security.
Migrating Existing User Datastores
With the new security regime, Text development recommends changing
existing indexes with user datastores to non-ctxsys-owned
procedures. Here is a step-by-step procedure to modify an
existing text index. This will not trigger a complete index
rebuild, so should have minimal burden for production systems.
-
create a new user datastore procedure
The new procedure should be owned by the index owner, or by whatever
application code schema is appropriate for your application. It
should duplicate the behavior of the existing user datastore
procedure. If you no longer have the user datastore procedure
creation script, you can simply duplicate the existing user datastore
procedure -- use ctx_report.describe_index to see which
stored procedure is currently being used, then select from user_source
as ctxsys to get the code.
-
create a replacement user datastore preference
If you still have the user datastore preference used when you created
the index, you can simply update that preference by using ctx_ddl.set_attribute.
Otherwise, you should create a new user datastore preference using the
new
procedure name as the PROCEDURE attribute value. As a shortcut, you can
use ctx_report.create_index_script on the existing index,
and adapt
the section of the generated code which creates the user datastore
preference.
-
modify the index metadata
Now rebuild the index to replace the user datastore:
alter index myindex rebuild parameters('replace metadata datastore newuserdatastorepref');
Don't worry -- this will NOT reindex the documents. The new replace
metadata syntax only updates the index metadata. It should take seconds
to run rather than hours.
Cross-User CTX Operations
DBA
users can now create/modify/drop CTX objects owned by other users, and
sync/optimize other users' indexes.
Since 8.1.5, Text restricted operations on synthetic
objects -- preferences, section groups, stoplists, etc. -- to the
object owner; usera could create new preferences for himself, set
attributes and drop those preferences, but could not touch userb's
objects. This is too restrictive for some applications, which do all
administration -- including text administration -- from one account
which may not be the same as the application code owner schema.
In 10g, DBA's with the CREATE ANY TABLE privilege are allowed to
create/modify/drop preferences owned by any user. This applies to
all synthetic objects -- preferences, stoplists, section groups, index
sets, and policies. Yes, it's a little strange to need CREATE ANY
TABLE, but there are no database permissions for CTX objects (or for
synthetic objects in general) so we picked the closest database priv.
sync and optimize, which in previous versions were limited to the
index owner, have also been expanded. Mimicking the old,
deprecated ALTER INDEX method of sync and optimize, the PL/SQL
interface now allows any user with ALTER ANY INDEX to sync or optimize
any index.
CTX_OUTPUT Restrictions
CTX_OUTPUT.START_LOG
filenames can no longer contain slashes and are limited to 256 bytes
To improve security, log file names can no longer
contain slashes, and are limited to 256 bytes in length.
Additionally, an existing log file will be read and format-verified
before being overwritten by a new logfile.
Indexing Objects
MULTI_COLUMN_DATASTORE Attribute FILTER
The
MULTI_COLUMN_DATASTORE can handle a mix of text and binary columns,
automatically sending the binary columns through the INSO filter
The multi-column datastore in 9i worked by simply
concatenating the columns together; it was therefore limited to columns
which could be converted implicitly to text. The new FILTER
attribute of the multi-column datastore expands support to binary data
-- you specify which columns have binary data, and those columns are
sent through the INSO filter before concatenation.
FILTER is a comma-separated list of filter indicators which parallel
the columns in the column list. If a column needs filtering, use
a Y in its position. For instance:
ctx_ddl.create_preference('myfil','multi_column_datastore'); ctx_ddl.set_attribute('myfil','columns','cola, colb, colc'); ctx_ddl.set_attribute('myfil','filter', 'Y,N,N');
cola will be filtered, colb and colc will not. Currently the INSO
filtering in the datastore is hard-coded to a fixed timeout of 120,
HTML output. Additionally, either the column gets filtered for
every row, or it does not get filtered for every row -- there is no
capability for per-row filter indication, like the format column in the
general create index.
MULTI_COLUMN_DATASTORE Attribute DELIMITER
The
tags that the MULTI_COLUMN_DATASTORE places between columns are now
optional
The multi-column datastore places, by default, HTML-like
tags around each column. The output seen by the filter looks
like, for instance:
<columna> these are the contents of columna </columna> <columnb> these are the contents of columnb </columnb>
In 10g, the new DELIMITER attribute can be set to COLUMN_NAME_TAG,
which preserves this behavior, or NEWLINE, which uses a newline instead
of these tags. NEWLINE can be used in situations where the tags
are not needed or may be confusing -- columns with XML fragments which
are already tagged, or a single value split into multiple columns (like
master-detail, but horizontally).
INSO_FILTER Attribute OUTPUT_FORMATTING
The INSO
filter is now capable of producing unformatted (non-HTML) output, which
is faster to produce than HTML output.
In previous versions, the INSO filter transformed binary
documents into (supposedly) WYSIWYG HTML. In 10g, the INSO filter
has a new OUTPUT_FORMATTING boolean which, when set to FALSE, produces
unformatted plain text instead. This is helpful mostly because
our tests show that the INSO filter can produce the plain text faster
than the HTML -- up to 30 times faster (except for PDF -- there is no
performance improvement for PDF because PDF is already using plain text
output, as of 9i). However, it is also helpful in cases where the
HTML tags were undesired. For instance, you can now use the INSO
filter with the NULL_SECTION_GROUP. Or use it with
AUTO_SECTION_GROUP and not worry about adding all the HTML tags as stop
sections. Or use it in a user datastore (using policy-based document services) that supposed
to output XML without worrying about validity.
And did I mention that it is up to 30 times faster?
MAIL_FILTER
The
MAIL_FILTER converts multipart mail messages into a single concatenated
text document.
The new MAIL_FILTER object takes email messages and
converts them to text documents. It knows the conventions of
RFC-822 and 2045-2049, which is a fancy way of saying that it can deal
with MIME. The short explanation is that the mail filter splits
the mail into parts, converts each part to text, then concatenates them
together. In more detail, the basic operation of the mail filter
is:
- separate the headers from the body
- parse the headers (that is, glue together multi-line headers,
handle non-ascii character set encoding, remember details of
Content-Type, Content-Transfer-Encoding, etc.)
- if there are headers which the user wants to preserve, transform
them into tagged format (<SUBJECT>blah blah blah</SUBJECT>)
- split the body into parts
- for each body part:
- if the body part is itself multipart, recursively parse it
- otherwise, if the body part is encoded, decode it
- if the body part needs filtering, filter it
- convert the textual body part to the database character set
- append the body part text to the output
Preserving
Header Fields for Section Search
By default the mail filter throws away the mail headers. You
can
nominate certain header fields to preserve in the output by using the
INDEX_FIELDS attribute, which takes a colon-separated list (why
colon-separated instead of comma-separated? Colon is guaranteed
not to be part of a mail header field name, while comma is not).
For instance:
ctx_ddl.create_preference('mymf','mail_filter'); ctx_ddl.set_attribute('mymf','index_fields','subject:from'); ... Input document Output Document -------------- ------------------------------- Subject: Oracle Text is too fast! <SUBJECT>Oracle Text is too fast!</SUBJECT> To: otlist@oracle.com <FROM>test.user@someplace.com</FROM> From: test.user@someplace.com Content-Type: text/plain; charset=US-ASCII X-Return-Path: blah blah blah blah blah
The Mail
Filter Configuration File
The mail filter does most of the heavy lifting automatically --
splitting up multi-level multi-part messages, converting from the
message character set(s) to the database character set, handling
RFC-2047 encoded strings, decoding from base64 and quoted printable,
etc. However, its behavior is partially user-adjustable, through
the mail filter configuration file.
The mail filter configuration file resides in ?/ctx/config. By
default, it is named drmailfl.txt, but you can create your own file and
specify its name in the system parameter MAIL_FILTER_CONFIG_FILE (use
ctx_adm.set_parameter). The configuration applies to all mail
filters in the system -- we do not support index-specific configuration
in this version. The file should be in the database character set
-- if you have more than one database in more than one character set,
ensure that the mail filter configuration file is legal in both
charsets, or use two files -- one for each database.
So what's in the mail filter configuration file? Two parts --
content-type handling specification, and character set name mapping.
Configuring Content-Type Handling
In the content-type handling section, you specify what the mail
filter
should do with various content-types. You have three choices:
- INCLUDE - concatenate the part to the output
- IGNORE - skip over the part
- INSOFILTER - INSO filter the part, then concatenate to the output
All you do is list the content-type (type or subtype), some
whitespace,
then the handling for that content type on the right-hand side. For
instance, let's take a look at the default config file:
[behavior] text INCLUDE application IGNORE application/msword INSOFILTER image IGNORE audio IGNORE video IGNORE model IGNORE
All subtypes of text -- for instance text/plain -- are simply
concatenated to the output (charset filtering to the database character
set happens automatically). Most subtypes of application --
application/binkydrawing, for instance -- are discarded. The
specific subtype application/msword, however, gets INSO filtered, then
concatenated to the output. Images, audio, video and 3D models
are simply discarded -- they are not useful for text indexing.
Tuning
INSO Filtering
The INSO filter in the mail filter can be tuned. The
INSO_TIMEOUT
and INSO_OUTPUT_FORMATTING attributes of the mail filter work just like
the TIMEOUT and OUTPUT_FORMATTING attributes of the INSO_FILTER, and
their values control the INSO invocations within the mail filter.
Currently timeout type is hardcoded to FIXED.
Adding
character-set mappings
The second part of the mail filter configuration file is an area for
mapping external character set names to oracle character set
names. By default the mail filter can map IANA character set
names (e.g. US-ASCII) to oracle character set names (e.g.
US7ASCII). In this section you can add additional mappings --
external name on the left, oracle name on the right. For
instance, in the default file:
[charsets] UTF-16 AL16UTF16 UTF-16BE AL16UTF16 UTF-16LE AL16UTF16LE
Changing
the Mail Filter Configuration File
When changing the mail filter configuration file, its best to do an
"alter system flush shared_pool" to force reloading of the index
metadata (and thereby re-reading of the new mail filter config file)
MDATA Sections
The
MDATA section is used for handling non-tokenized metadata, with
isolated transactional updates.
Metadata in Oracle Text 9i is generally handled through the field
section. For instance, the author of a document may be coded like
this:
<author>William Shakespeare</author>
Usually you would see a field section defined for author, and
queries
like:
(get thee to a nunnery) & (william shakespeare within author)
This works pretty well, but has some limitations:
- the metadata gets tokenized
- This prevents exact matching. A search for "william
shakespeare within author" will also hit a document authored by william
shakespeare, jr.
- Browsing does not handle multi-word values. Using
ctx_query.browse_words, you can find out that you have authors with
"william" in their name, but not that "william shakespeare" is an
author.
- Special characters serve as token breakers. For instance,
say you have part numbers in a section. A part number XZ-456
would get tokenized into XZ and 456. This is now
indistinguishable from, say, XZ+456, or XZ-999 (if stop numbers is
enabled). To prevent this, you'll have to add printjoins to the
lexer, which may compromise the main body of the documents.
- Phrase queries are slightly less efficient than single-word
queries. A query for the phrase "William Shakespeare" involves
fetching two tokens, merging their document occurrences, then
doublechecking the word occurrences for each of the documents in the
intersect. A theoretical query for WilliamShakespeare, on the
other hand, involves fetching one token, and the word occurrence
information is irrelevant. Additionally, the posting lists for
william and shakespeare will tend to be longer than the posting list
for WilliamShakespeare. In short, fewer tokens make queries a bit
faster because there's less work to be done.
- the metadata can't be changed in isolation
- let's face it -- usually field section data is coming from
another column, the document being indexed is synthetic, coming from
pregeneration, user datastore, or multi-column datastore. When
the underlying relational data changes, the only way to propagate the
change is to invalidate the existing document, then reindex the entire
document. This makes the document unfindable between the time it
is invalidated and the time it is synched, and the unneeded reindexing
of the document main body causes excess fragmentation.
10g offers the MDATA section, which is not tokenized and can be
changed, transactionally, without synching the entire document.
Defining
MDATA Sections
To add an MDATA section to a section group, call:
ctx_ddl.add_mdata_section(<sectiongroup>,<sectionname>,<tag>)
The semantics and restrictions are the same as zone or field
sections
-- you can map multiple tags to one section name, section names can't
conflict between section types, you can add them to basic, html, xml,
and news section groups only, etc. The one difference is that you
can have up to 100 mdata sections.
How MDATA
is Indexed
Unlike normal tokens, MDATA values are plucked out of the text by
the
sectioner and passed directly to the inversion engine, bypassing the
lexer. Let's look at an example, which will contrast normal text,
field sections, and mdata sections:
doG <MDATA> T.S. Elliot </MDATA> HoUse <FIELD> T.S. Elliot </FIELD>
This text comes out of the filter and gets processed by the section
group. The normal text (the words which occur outside of tags)
gets sent to the lexer. Both field and mdata section tags get
consumed by the section group. But field section text gets sent
to the lexer, while mdata text is consumed by the section group.
So the lexer sees:
doG HoUse T.S. Elliot
Let's assume that we're using the default, basic lexer. This
breaks up the text into tokens by looking for non-alphanumerics, then
uppercases each resulting token. Normal text gets a type of 0 and
the field section text gets a type of (let's say) 16. So the
engine will produce the following from the lexer (assuming default
stoplist, where "S" is a stopword):
DOG 0 DOC1 POS1 HOUSE 0 DOC1 POS2 T 16 DOC1 POS1 ELLIOT 16 DOC1 POS3
Note that because the lexer never sees the MDATA text, you can hit
phrases around it -- "dog house" as a query will hit this document,
even though "dog" and "house" in the text are separated by an mdata
section; MDATA is totally invisible to text search. Normally,
you'd see metadata all at the top of a document or all at the bottom,
so most users won't even notice, but it's an interesting aside for
detail-oriented types.
Back in the sectioner, the MDATA text gets stripped of leading and
trailing whitespace and truncated to 64 bytes. The result is then
sent directly to the engine as a single value. For our example,
the resulting TOKEN_TEXT value will be:
T.S. Elliot
MDATA values get token types between 400 and 499, so we can expect
our
TOKEN_TYPE to be 400. As for TOKEN_INFO, MDATA values don't have
word position information -- they either exist or don't exist. So
our entry in $I for this is:
T.S. Elliot 400 DOC1
This is very different from how the field section was indexed.
MDATA sections can repeat -- each unique value gets indexed. So a
document like:
<MDATA>value 1</MDATA> <MDATA>value 2</MDATA> <MDATA>value 1</MDATA> <MDATA>value 3</MDATA>
produces
value 1 400 DOC 1 value 2 400 DOC 1 value 3 400 DOC 1
Querying
MDATA Sections
MDATA is queried using the new MDATA operator. The syntax is
MDATA(section, value), as in:
where contains(text, 'mdata(author, William Shakespeare)')>0
Pretty easy, but mdata searches have some differences from other
elements in the contains language.
First, unlike WITHIN searches, MDATA searches are equality searches --
you'll hit only the exact queried value, no more, no less. Take
our example document and the comparable queries:
Elliot WITHIN field MDATA(mdata, Elliot)
The first query hits the document, because the token "Elliot" is in
between the field start tag and end tag. The second query does
not hit the document, because the mdata value is "T.S. Elliot", not
"Elliot". On the other hand, the field within is unable to search
for "Elliot" and NOT hit "Elliot Ness", while the mdata search does
this implicitly. It all comes down to how you want to search your
data. Field sections should be used for cases
where word search is needed
(titles, for instance) and MDATA sections should be used where the
section
contents need to be treated as single value.
Second, the MDATA value does not go through the lexer at index time,
and neither does the query value. This means that the query must
match the indexed term exactly, without token normalization. For
instance:
t.s. elliot WITHIN field MDATA(mdata, t.s. elliot)
The first query hits the document. The input tokens (t, s,
elliot) go through the lexer, which uppercases them. The document
text also got uppercased by the lexer at index time, so the tokens hit
the document. The second query, on the other hand, does not hit
the document. "t.s. elliot" does not match "T.S. Elliot".
This may seem limiting at first, but it makes possible features like
section-specific case insensitivity. And, you could always
perform symmetric normalization yourself -- just uppercase the value in
the document and in your query. This is fairly easy to do if,
like most apps, the queries and documents are generated
(user/multi-column datastore, and a user query parser, e.g.).
Third, MDATA searches are not affected by stopwords.
MDATA(artist, The The) is perfectly legal. In our example, the
word "S" was a stopword, so:
T.S. Elliot WITHIN field
is actually looking for T <anyword> ELLIOT in between field
start
and end tags. This matches the document
<FIELD>T.R. Elliot</FIELD>
while the comparable MDATA search -- MDATA(mdata, T.S. Elliot) --
would
not.
Finally, MDATA searches cannot use expansion operators in this
version. The value in the mdata operator is used directly as the
query value -- any characters like %, $, ?, etc. are considered part of
the value. So the query
MDATA(mdata, T.S. Ell%)
does not find our example document. It would, however, match
the
document
<MDATA>T.S. Ell%</MDATA>
Isolated
MDATA Updates
Using new ctx_ddl functions, you can add and remove mdata values
from
an already-indexed document. For example, let's say our documents
have mdata to record the clubs to which a student belongs. An
example document:
<NAME>Bruce Aznable</NAME> <GROUP>Soccer</GROUP> <GROUP>Cooking</GROUP>
Bruce belongs to the soccer and cooking clubs. Now he joins
the drama club. With field sections, we would have to reindex the
document by updating it, and sync-ing. With mdata sections, we
simply do:
ctx_ddl.add_mdata('myidx','group','Drama','ABCD1234')
The parameters are: index name, mdata section name, new mdata
section
value, then rowid of the row which we're modifying. This updates
just the mdata for just this document, without syncing. After
this call, a query like
mdata(group, Drama)
Will pull up Bruce. Similarly, if Bruce decides to stop
cooking:
ctx_ddl.remove_mdata('myidx','group','Cooking','ABCD1234')
There is no update_mdata. If the soccer club changes their
name to the
football club, you'll have to:
ctx_ddl.remove_mdata('myidx','group','Soccer','ABCD1234'); ctx_ddl.add_mdata('myidx','group','Football','ABCD1234');
Which means that to do an update you'll have to know the existing
(old)
values. That shouldn't be too much of a problem, especially if
these are called from trigger code.
Under the covers, the add and remove calls insert "delta rows" into
$I. For instance, the add Drama call above results in:
TOKEN TYPE FIRST LAST COUNT INFO Drama -400 1234 1234 1 null
Note the negative token type, which makes it a delta row. At
query
time, each mdata operator opens a cursor to read its delta rows, which
are then woven into the token info for the normal rows. The end
effect is that the delta rows change the hitlist without having to
modify existing $I rows. Note that the add and remove calls do
not commit, which allow them to make transactional changes:
ctx_ddl.add_mdata('myidx','group','Football','ABCD1234'); contains(text, 'mdata(group,Football)')>0 -- pulls up the row ABCD1234 rollback; contains(text, 'mdata(group,Football)')>0 -- does not pull up the row ABCD1234
The delta rows are designed to be transient -- optimize (full or
rebuild) integrates them permanently by rewriting the normal
rows. Because each mdata operator opens a cursor, you probably
don't want to have hundreds of mdata operators in your query.
One last note on the mdata updates -- if you have a lot of values or a
lot of rows, there is a multiple-value/multiple-rowid version:
ctx_ddl.add_mdata('myidx','group',sys.odcivarchar2list('Football','Rugby'),sys.odciridlist('ABCD1234','EFGH5678'));
In each list you can pass several values. The call adds each
value in the list to each rowid in the list. This should be
faster than several add calls.
MDATA
Tricks
A companion document for advanced users is available on OTN ( http://otn.oracle.com/products/text/htdocs/mdata_tricks.htm)
It's a more technical, in-depth look at some tricks you can do with
MDATA -- getting around mdata limitations,
handling dynamic token normalization, mixed structured range criteria,
and efficient sort by structured column.
BASIC_LEXER Attribute NEW_GERMAN_SPELLING
The
basic lexer can automatically handle some of the differences due
to new German spelling
Over the last decade or so, German spelling reform has
changed the way German words are spelled. From an IR perspective,
this is as irritating as the Brits misspelling words like "harbour
centre" -- it's harder to find occurrences of a word because the
computer can't easily tell that the pre-reform spelling of a word
(Photo) is the same as the post-reform spelling (Foto). The
NEW_GERMAN_SPELLING attribute of the lexer, when set to TRUE, will
detect old spellings and transform them to the new spellings. For
example:
Document: Känguruh daß Potential Schifffahrt ... exec ctx_ddl.create_preference('ngslx','basic_lexer'); exec ctx_ddl.set_attribute('ngslx','new_german_spelling','true'); ... TOKEN_TEXT in $I: DAß DASS KÄNGURU KÄNGURUH POTENTIAL POTENZIAL SCHIFFFAHRT
The old spellings are indexed twice -- once in the old form, and
once
in the new form. Words already correctly spelled -- Schifffahrt
in our example -- are indexed only in new form. During query, any
entered old spellings are transformed to the new spelling:
contains(text, 'Schiffahrt')>0
hits the document. The old spelling (with 2 f's) is not in our
$I. However, at query time, it's automatically converted to the
new spelling (3 f's), which does exist. This means that you can
query in old or new spelling and hit documents which use either the old
or new spelling. Querying for just the new spelling or just the
old spelling is not currently possible.
So if old spellings are converted to new spellings at query time, why
are the old spellings indexed? For wildcard. The basic
lexer cannot convert partial old spellings to new spellings, so a query
like Potent% must expand on old spellings. On the other hand, a
query for Schiffa% will not hit this document, even though it matches
the old spelling of "Schifffahrt", which is in our document. So
it really only works as long as your documents are completely old
spelling, or completely new spelling.
This feature does not implement all Rechtschreibreform changes.
Single word to multi-word changes, in particular, are not
handled. For instance, Radfahren should now be written as
two words -- Rad fahren, and sequences like 8jährig get hyphenated
to 8-jährig, which normally would be lexed as two words.
However, the NGS basic lexer does not currently support this. It
also does not support capitalization changes -- Swimming-pool (old) to
Swimming-Pool (new) for instance. And, it's a little too
aggressive on ß changes -- it thinks that spaß should be
spelled "spass" -- but these would normally be handled through
alternate spelling, which is usually on for German indexing.
There is also possibly inconsistent behavior when combining new German
spelling with base letter. The base letter transform is applied
only to the new spelling.
Consider a lexer with new German spelling and base letter processing
the word "Känguruh". We know that this gets indexed as
"Känguruh" and "Känguru" because of new German
spelling. But after this only the new spelling gets base
lettered, so we index "Känguruh" and "Kanguru". This produces a
small hole in the query results. Compare this to a lexer with
only base letter on:
QUERY: Känguruh Kanguruh Känguru Kanguru NGS + BL hit no hit hit hit BL ONLY hit hit no hit no hit
With only base letter, we can hit on Känguruh or Kanguruh, but
not
the new spellings. With new German spelling on, we can hit on the
new spellings, but we lose the
ability to hit on "Kanguruh". At first glance this may
not seem so bad, since the native German speaker would not query for
"Kanguruh" -- they'd never just drop the umlaut from the a.
However, consider "Varieté". Under the new German spelling
rules, this becomes "Varietee":
QUERY: Varieté Variete Varietee NGS + BL hit no hit hit BL ONLY hit hit no hit
Each lexer has a problem: NGS+BL won't hit on "Variete", which the
average German may type, since é is not in the German alphabet
(in the same way that an American might type "manana" instead of
"mañana"), and the BL ONLY won't hit on the new spelling,
Varietee.
Alternate spelling coexists a bit better. A lexer with alternate
spelling German and new German spelling indexes "Kaenguruh" as
"Kaenguruh" and "Känguru", and can find this document by any of
the four queries "Kaenguruh", "Kaenguru", "Känguruh" or
"Känguru".
When alternate spelling German AND base letter AND new German spelling
are on, you still suffer from the foreign base letter query problem
discussed above. Other than that, the query semantics are not
surprising.
BASIC_LEXER Attribute OVERRIDE_BASE_LETTER
OVERRIDE_BASE_LETTER
allows you to switch the direction of alternate spelling vowel
transforms
9.2 allowed users to set BASE_LETTER and
ALTERNATE_SPELLING at the same time. Unfortunately, this produces
linguistically undesirable behavior in some cases, due to our
normalization method. Consider three documents:
DOC1: Brüder DOC2: Bruder DOC3: Brueder
In Text, each word is sent through the alternate speller
first.
The alternate spelling transformer looks for two-letter alternate
spellings and transforms them to single-letters. The original
form is also saved for wildcard matching, in case there are
changes. So our table would get lexed:
ORIGINAL ALT. SPELL DOC1: Brüder -> Brüder DOC2: Bruder -> Bruder DOC3: Brueder -> Brueder Brüder
Now the token is sent through base lettering. Note that if the
alternate speller saved an original form, that form does NOT go through
the base lettering. Only the transformed version goes
through. So our table is:
ORIGINAL ALT. SPELL BASE LETTER DOC1: Brüder -> Brüder -> Bruder DOC2: Bruder -> Bruder -> Bruder DOC3: Brueder -> Brueder Brüder -> Brueder Bruder
When someone searches for "Brüder", it goes through the same
process -- becoming a search for "Bruder". This matches all the
documents, because the form "Bruder" was indexed for all. The
problem is that this is incorrect. "Bruder" and "Brüder" are
different words. Searching for "Brüder" should match
"Brüder" and "Brueder", but not "Bruder". Searching for
"Bruder" should match only "Bruder".
To get such semantics, 10g has the OVERRIDE_BASE_LETTER
attribute. This reverses the alternate spelling transform so that
instead of changing two letters to one, it changes one letter to
two (vowels only -- ß is always transformed to ss). Let's
go through our table again:
ORIGINAL ALT. SPELL DOC1: Brüder -> Brüder Brueder DOC2: Bruder -> Bruder DOC3: Brueder -> Brueder
Note that the alternate spelling now hits doc 1 instead of doc
3.
Now the base letter:
ORIGINAL ALT. SPELL BASE LETTER DOC1: Brüder -> Brüder Brueder -> Brüder Brueder DOC2: Bruder -> Bruder -> Bruder DOC3: Brueder -> Brueder -> Brueder
The ü in doc 1 is not changed to u because the saved original
form
is not sent through the base letterer -- only the transformed
version. Now we get the semantics we want. At query time,
"Bruder" is unchanged by alternate spelling or base lettering, and
matches only doc 2. "Brüder" and "Brueder", on the other
hand, get transformed to "Brueder", which is the normalized form of
docs 1 and 3.
So it's a little confusing under the covers, but the short story is
that if you're indexing with BASE_LETTER and ALTERNATE_SPELLING on,
you're probably better off with OVERRIDE_BASE_LETTER also on.
JAPANESE_LEXER Attribute DELIMITER
The
DELIMITER attribute signals the Japanese lexer to honor common Japanese
printjoins
The Japanese lexer attempts to split Japanese text into
words. Normally, it breaks on any non-Japanese characters.
ABCD&EFG, for instance (assuming ABCD and EFG are Japanese
characters) gets split automatically at the &, even if DE is a word
(you can add D&E to the lexicon to prevent this). Middle dot
and forward slash are often used in Japanese between related words --
similar to the function of hyphen in English compound words.
Setting the DELIMITER attribute to ALL effectively declares these two
characters as printjoins. So ABC/DEF gets lexed as ABC/DEF
instead of ABC and DEF. It's a little confusing, because by
turning DELIMITER on, you're actually saying don't delimit, so just think of it
as a join setting.
User-extensible Japanese & Chinese Lexicons
The
lexicons used by the Japanese and Chinese lexers are extensible.
The Japanese and Chinese lexers are driven by lexicons
-- think of them as big word lists. In 10g you can extend and
override the lexicons with your own words -- neologisms,
company-specific terminology, etc. Here's how to do it:
- convert the existing lexicon to text. You do this by using
ctxlc:
ctxlc -ja -ocs ja16euc > jalex
This command takes the Japanese lexicon in ?/ctx/data/jalx and converts
it to text in the ja16euc character set. The result gets directed to
standard out, so we've used UNIX piping to direct it to a file
called jalex. The output character set (ocs) is whatever NLS
character set name you want to work in. For Chinese, use the -zht
switch instead of -ja. Don't worry about the switch being called
zht -- the lexicon is not tied to traditional or simplified -- that's
all controlled by -ocs.
-
Edit the lexicon file to add your own terms, delete existing
terms, etc.
-
Compile your wordlist to binary:
ctxlc -ja -ics ja16euc -n -i jalex
-ics is the character set of the input file, and -i is the text file to
compile to binary. (-n is needed because we're creating new files,
rather than modifying old ones).
This will generate new lexicon files in the current directory:
drold.dat droli.dat drolk.dat
-
Now back up the files in ?/ctx/data/jalx (or ?/ctx/data/zhlx) if
you wish, then replace them with
the newly-generated files. You will have to append JA or ZHT to the
name of the file to conform to the naming convention. For
instance::
mv drold.dat $ORACLE_HOME/ctx/data/jalx/droldJA.dat
- Flush the shared pool to force reload of the new lexicon.
This is not technically needed right now, but may be in the future.
Chinese Lexers New Supported Character Sets
The
Chinese lexers now support new character sets.
The Chinese lexers (VGRAM and lexical) now support any
database character set for Chinese, including the newer ZHS32GB18030
character set.
WORLD_LEXER
The
WORLD_LEXER automatically handles multi-language corpora and documents
by Unicode codepoint detection.
The new WORLD_LEXER indexes multi-lingual
documents and corpora on unicode databases. It processes a
document by looking at the codepoints of the characters, and following
an appropriate segmentation strategy:
- Words made of Chinese, Japanese, and Thai characters get bigram
segmented -- they are split into overlapping sequences of 2
characters. Japanese kana is vgram segmented.
- Words made of Arabic characters get whitespace segmented, then
post-processed with Arabic-specific simple normalization and
decompounding:
- Removal of vowel marks
- Separation of definite articles, pronoun suffixes,
prepositions, conjunctions, and particles from the base word. For
instance, the Arabic translation for "his teacher" is written as a
single sequence. The World lexer separates this into "his" and
"teacher"
- case-folding, verb stemming and gender normalization are not
currently handled
- Everything else (Latin characters, Cyrillic, Greek, etc) gets
whitespace segmented
Since the segmentation strategy is determined by codepoint, you do
not
need to identify the language of the document, like you do with the
multi-lexer. Furthermore, you can index multi-lingual documents
-- the world lexer will shift segmentation strategy as needed to deal
with the multiple languages. In keeping with its automatic
nature, the world lexer has no attributes to worry about. To use
it, just create a preference:
ctx_ddl.create_preference('mylex','world_lexer'); ... create index .. parameters ('lexer mylex');
On the query side, you don't have to worry about query-time language
settings. There's only one lexer, unlike the multi-lexer's collection
of multiple lexers.
On the other hand, the world lexer's lack of attributes means that it
has no language-specific features. You cannot do fancy things
like German decompounding, Japanese segmentation, or Swedish alternate
spelling. You can't define printjoins, use themes, or do
stem-form indexing. So while the multi-lexer is more work to set
up, it is a lot more flexible. Here's a short comparison between
the world lexer and the multi-lexer:
Feature
|
MULTI_LEXER
|
WORLD_LEXER
|
Definition and Setup
|
use ctx_ddl API to
set up individual lexers and collect them into a multi-lexer
|
no programming needed
|
Multi-lingual
corpora (mono-lingual documents)
|
Supported
|
Supported
|
Multi-lingual
documents
|
Not Supported
|
Supported
|
Database charset
|
Any
|
Unicode only
(AL32UTF8 and UTF8)
|
Set attributes of
each language
|
Supported
|
Not Supported
|
Document language ID
|
User-identified, per
document
|
Not needed
|
Query language
identification
|
User-identified, per
query |
Not needed
|
Arabic
|
Keyword search only
|
Arabic-specific
features
|
Chinese/Japanese
|
VGRAM or Segmentation
|
VGRAM only
|
Japanese Stemming
Japanese
Stemming is now supported when using the JAPANESE_LEXER
The JAPANESE setting for stemming in the basic wordlist
implements Japanese inflection expansion. It pretty much works
like stemming in other languages -- enter in a form of a word and it
finds all inflection forms of that word in the documents.
However, there are some differences:
- Japanese stemming is supported only for the segmentation lexer
- Stemming is supported only for verbs and adjectives. Since
this includes Japanese-special cases like noun-verb and adjective-verb
(unten suru, shizuka datta) and Japanese nouns don't generally have
plural form, this should be practically sufficient.
- The input to stemming must start with kanji and end with a
regular kana ending. For instance, TANOshikatta, where TANO is a
kanji and the rest is written in kana
The Japanese stemmer is homegrown technology (unlike our other
stemmers
which we license). If you have any feedback on its behavior, let
us know.
Japanese Fuzzy
Japanese
Fuzzy is now supported for kana fuzzy matching.
The JAPANESE setting for fuzzy match in the basic
wordlist implements Japanese fuzzy. The input at query time
should be a kana sequence. The fuzzy matcher will expand this to
similar kana strings in the document corpora using character similarity
clusters. This should work with both the segmentation and vgram
Japanese lexers. It's probably of most use when finding
inconsistent Japanization of imported words. For instance,
BAIOLIN or VAIOLIN for violin.
Index Creation and Maintenance
Direct Path Inserts During Index Creation
Index
creation should be noticeably faster due to the use of direct path
inserts.
During an index creation, Text inverts the documents and
collects the token lists in memory. When memory is full, it
flushes the memory to the index tables. In 9i, we did this using
normal SQL inserts. In 10g, we've switched to direct path
inserts. This is much faster, and you can even avoid undo/redo
log use if you've created your index tables with nologging.
Currently this happens only during an index creation -- sync still
uses SQL for insertion. This preserves your ability to sync and
query at the same time -- if sync were to use direct path loading, the
$X index would go invalid for small windows of time, and concurrent
queries would fail.
Parallel Local Domain Text Index Creation
Local
domain text indexes can now be built in parallel using the PARALLEL
keyword in the create index statement
Local domain text indexes in 9i, like b-tree indexes,
can be built in parallel using the DBMS_PCLXUTIL helper package.
In 10g you can also use the parallel keyword in the create index
statement. The index gets built with inter-partition parallelism
-- the system will work on multiple partitions at one time, with no
more than one slave per partition. You can still use
DBMS_PCLXUTIL for intra- and inter-/intra- partition parallelism.
Metadata-only Index Rebuild
The
METADATA keyword in alter index rebuild changes the index metadata
without reindexing.
Normally a change to the index metadata necessitates a
rebuild. For instance, changing the lexer to turn base_letter on
makes the existing index data invalid -- the documents have to be
relexed or queries will get inconsistent results. However, some
changes don't require a rebuild. Changing the stemmer setting,
switching from single-lexer to multi-lexer, changing the user datastore
procedure name (while keeping the same code), etc. For these
situations, 10g has the METADATA keyword, which changes a rebuild
operation from a complete index rebuild to a metadata-only index
rebuild.
Let's say you want to change from a single-language lexer to the
world lexer. First you create your replacement preference:
exec ctx_ddl.create_preference('rlex','world_lexer');
Then rebuild your index using the REPLACE operation, and include the
METADATA keyword:
alter index myidx rebuild parameters ('replace metadata lexer rlex');
This removes the existing lexer and plugs in the replacement lexer,
but
leaves the index tables alone.
The main caveat is that this operation gives you enough rope to shoot
yourself in the foot. If go and do something inconsistent with
existing index data -- like changing the stopword list or turning on
prefix indexing -- it won't stop you. So you should either be
certain that your changes are backwardly compatible with existing data,
or accept possible query inconsistencies.
Another caveat, possibly obvious, is that while you can change the
storage preference using the METADATA keyword, it has no effect because
the index tables are not touched by this operation. It would
affect a future "alter index .. rebuild;" command, though, since at
that time it will read the updated metadata and rebuild the index
tables.
The ODCI interface does not allow partitioned indexes to be rebuilt.
You can use replace metadata on individual partitions to, for instance,
change the sync interval. But you can't alter index rebuild for the
overall index -- you'll get the error ORA-14086: a partitioned
index may not be rebuilt as a whole. However, we have a PL/SQL
procedure which works around ODCI. Just call it with the index
name and the parameters string, like this:
exec ctx_ddl.replace_index_metadata('myidx','replace metadata lexer rlex');
Automatic Sync/Sync on Commit
The SYNC
keyword of create index parameters can set up a dbms job or force sync
after every commit
To keep their context indexes up-to-date, customers
usually have a dbms_job which periodically wakes up and syncs the
index. The new SYNC keyword can do this step for you at create
index time:
create index myidx on mytable(mycolumn) indextype is ctxsys.context parameters ('sync (every "SYSDATE+5/1440")');
The create index code will automatically create a new dbms_scheduler
job that syncs this index every 5 minutes. The double-quoted
interval that follows EVERY is passed directly through to
dbms_scheduler -- please refer to the PL/SQL Packages and Types
Reference for information on the syntax and examples of use. You
can also set memory and parallel details of the sync call:
create index myidx on mytable(mycolumn) indextype is ctxsys.context parameters ('sync (every "SYSDATE+5/1440" MEMORY 15M PARALLEL 2)');
Of course this is mainly a convenience feature -- you could do all
this yourself.
But what if every 5 minutes isn't good enough for you? What if
you really really need super up-to-the-minute updates? Then
either use transactional queries
or the second option for the sync: COMMIT:
create index myidx on mytable(mycolumn) indextype is ctxsys.context parameters ('sync (on commit)');
That's all there is -- with COMMIT you cannot set memory or
parallel. In this mode, the syncs happen automatically when
inserts or updates to the base table are committed. Three
caveats:
- since usually a transaction has only one or two document changes,
you can expect this to increase fragmentation and thus decrease query
performance
- since only one
session can sync at a time this will serialize concurrent
commits
- the sync is actually done in a two step process. First the
transaction commits, then the sync occurs. This means that, in the
window of time between the transaction commit and the sync completion,
the documents are visible to other users but not searchable.
Sync on commit is not mutually exclusive with transactional query.
You can have both of them on if you want. But if you were to choose
one, which to choose? The next section, on transaction query, draws a
more detailed comparison, but the short answer is that
most customers who need up-to-date query should use EVERY sync and
transactional query, instead of ON COMMIT sync.
The last mode of SYNC is MANUAL, which maintains current behavior --
that is, don't sync on commit, don't create a scheduler job.
Transactional Queries
Transactional
queries use per-row evaluation to deliver statement-level consistent
contains queries.
Customers sometimes tell us that they need immediate results -- when
they insert or update a row, it needs to be searchable right
away. The context index, like most information retrieval
products, does not do this by default. Deletes are immediate, but
inserts and updates are asynchronous; the document is unavailable until
the next sync. Practical experience shows that this is usually
acceptable. The requirement for "right away" search in most
systems is more theoretical than necessity -- if the application can
live without this requirement, or find some alternative way of
fulfilling it, the app generally runs better. But for those
situations where it is unavoidable, context indexes now have the option
of being fully transactional.
To use this new feature, specify the transactional keyword during
create index:
create index myidx on mytable(mycolumn) indextype is ctxsys.context parameters ('transactional');
You can also alter index rebuild .. replace metadata transactional
to make an existing index transactional.
Updates and inserts on a transactional index get logged in
dr$pending, like normal indexes, but the rowids also get logged in a
second table, dr$unindexed. During query, the $I produces its
hits, then each rowid in dr$unindexed is evaluated, individually, to
see if it is a match -- that is, we fetch the document text, filter it,
section it, invert it, then run the contains query on that document's
data -- somewhat like a highlight operation. The set of rowids
from dr$unindexed that match is combined with the results from the $I
scan and the merged set is returned from contains.
Like dr$pending, dr$unindexed is fully transactional -- you can
savepoint, rollback, etc. and the effects are immediate. So you
can do stuff like:
select count(*) from table where contains(text, 'someword')>0; -- 0 hits insert into table values ('someword'); select count(*) from table where contains(text, 'someword')>0; -- 1 hit (the one we just entered) rollback; select count(*) from table where contains(text, 'someword')>0; -- 0 hits
So why do we still need sync? The cost for evaluating a single
unindexed rowid can be fairly high -- depends on your index.
Perhaps a user datastore procedure needs to be executed. Maybe
the INSO filter will have to be run. Even the presence of one
unindexed rowid can slow down query response time. As
dr$unindexed gets more and more rows, the queries will get slower and
slower. A sync indexes the documents, adds the data to the
$I index, then removes them from dr$unindexed. Since searching $I
is more scalable than evaluating documents one at a time, your queries
should be faster after the sync. So best practice (if you
absolutely MUST have instant searching) is to have transactional query
and a reasonable periodic sync -- as infrequently as possible to
control fragmentation, but frequent enough that dr$unindexed doesn't
get hairy.
If some of your queries need instant searching, but others don't
(like count_hits estimates, etc) you can turn off the dr$unindexed scan
at the session level:
exec ctx_query.disable_transactional_query := TRUE;
Future contains queries will not scan dr$unindexed. To turn
transactionality back on, set the variable to FALSE. A bit
confusing to set to false to turn something on, but think of it as
turning off the override.
How does transactional query compare to sync on commit?
|
Transactional
|
Sync on Commit
|
can query documents
immediately after insert or update
|
YES
|
NO
|
can query documents
immediately after commit
|
YES
|
YES (small window
where documents are visible but not queryable through contains)
|
impacts query speed
|
YES -- slower
because dr$unindexed must be scanned
|
NO -- not directly,
but increased fragmentation can lead to slower queries
|
increases index
fragmentation
|
NO
|
YES
|
Transactional query is slower because of dr$unindexed, true, but the
sync on commit, unless dealing with large transactions, will increase
fragmentation, which may slow down queries anyway. And
transactional gives you more immediate gratification than sync on
commit -- you can query right after the update or insert, instead of
having to wait for the commit. So I think in general
transactional is better than sync on commit. It would be much
better than sync for those systems where documents change very
frequently when they are young -- for instance, an editorial system
where writers are saving drafts and making edits. Transactional
would keep up with the frequent changes, then the companion periodic
sync would capture the reasonably "final" version, while sync on commit
would have to index every saved version, leading to dramatically
increased fragmentation.
However, transactional query and sync on commit are not mutually
exclusive. If you want to turn both of them on, you can. Personally, I
think sync EVERY with transactional is better, because you get the
benefits of transactional consistency without the fragmentation of sync
commit, but your application mileage may vary.
Token Type Optimization
The
TOKEN_TYPE mode of optimize optimizes all tokens of a specific type
The new TOKEN_TYPE mode does a FULL optimize, but
limited to all tokens of a specific token type. You can think of
it as an expansion of the TOKEN mode, which targets a specific (token,
token_type) . You invoke it like a TOKEN optimize, but specify
TOKEN_TYPE for the mode:
exec ctx_ddl.optimize_index('myidx','TOKEN_TYPE',token_type=>1);
This optimizes all theme tokens (type 1). For mode you can use
the symbol CTX_DDL.OPTLEVEL_TOKEN_TYPE instead of a literal
'TOKEN_TYPE', if you want. For the token_type parameter, you can
use the numeric token type code, if you know it, or
ctx_report.token_type if you don't (see example below). Like TOKEN
optimize, TOKEN_TYPE runs
fully to completion, so don't specify maxtime. It optimizes all
tokens of the given type, so don't specify a token parameter,
either. You can run TOKEN_TYPE optimization in parallel, so you
may specify a parallel_degree parameter if you wish.
Behind the scenes, the token_type optimize goes through almost the
same code as a FULL optimize -- read the $N table, scan the $I table,
rewrite the rows by removing garbage and defragging small rows.
The main difference is that the TOKEN_TYPE mode will simply ignore any
row that is not of the specified type.
This mode is probably most useful for keeping field sections and
mdata sections up-to-date. For instance, suppose you have an
mdata section named "symbol" that holds stock symbols, on which you
want to maintain speedy performance. Periodically, run something
like:
exec ctx_ddl.optimize_index('myidx','TOKEN_TYPE', token_type=>ctx_report.token_type('myidx','MDATA symbol'));
This will keep all values of the MDATA section symbol defragged and
optimal.
Rebuild Optimization
The
REBUILD mode optimizes an index by building a fresh shadow copy of the
index, then swapping it with the existing index.
FULL optimize updates the $I table in place. This
can be slow, since we must use SQL inserts and deletes, which are
slower than direct path writing. On top of that, the $X index
slows down any DML on $I. The new REBUILD mode of optimize is
designed to get around these problems. You invoke it like a fast
optimize, but specify REBUILD for the mode:
exec ctx_ddl.optimize_index('myidx','REBUILD');
There is currently no symbol CTX_DDL.OPTLEVEL_REBUILD, so you'll
just
have to write out the literal string REBUILD. Rebuild optimize
optimizes the entire index in one shot, so there is no need to specify
maxtime, token, and token_type parameters -- you'll just get a syntax
error. Rebuild optimize, like FULL, can be run in parallel, so
you may specify a parallel_degree parameter if you wish.
So what does it do? Broadly, it creates an empty $I, copies the
contents of current $I to the new $I, optimizing each row as it copies,
then swaps the new $I and the old $I. This can be faster than
FULL because FULL has to insert rows using SQL, then delete old rows,
all the time updating $X. REBUILD can insert using direct path
load, does not have to delete old rows, and does not have to update a
b-tree index.
Here's what REBUILD optimize does, specifically:
-
Create two shadow $I tables. Shadow 1 is named
DR$<idx>MI, while shadow 2 is named DR$<idx>NI. Both
will use the storage clause for $I, so if the initial extent on that
storage clause is really high, you could run out of space here.
The MI table will be partitioned, but have only one partition. It
needs to be partitioned so that later we can do a swap with the
existing $I. We use special internal calls so that this works
even on standard edition, which does not have partitioning.
Obviously this precludes your use of storage preferences which would
partition the $I table.
-
Create a log table DR$<idx>$L -- this table is used to
monitor changes to the existing $I table during the optimization.
It gets no storage clause, so uses the index owner's default
tablespace's default storage.
-
Create a copy of the current $N table, named
DR$<idx>MN. This serves as a snapshot of deleted docids.
-
Create a monitor trigger, DR$<idx>$T, which logs the rowid
of all inserts to the existing $I in the $L table
-
Copy the existing $I table to the shadow I tables. This is
done in one complex SQL statement -- roughly,
insert into <shadow I> select * from TABLE(optimize_index(CURSOR(select * from $I))
Let's break this down. The innermost part is the CURSOR. This
selects all the rows from $I and passes them through the optimize_index
function. The optimize function reads the snapshot $N table, then
takes the $I rows coming from the cursor and optimizes them -- removing
garbage rows, defragging small rows, etc. It uses essentially the
same code as the FULL optimize. The table function returns
optimized $I rows which are then shunted by the insert into statement
into the shadow tables. (Mostly to shadow 1. Shadow 2 gets
those rows with really long token_info, for technical reasons).
The insert into uses the efficient direct path insertion, so the I/O
costs of populating the shadow tables is minimal. (You do not
need to have NOLOGGING in your I_TABLE_CLAUSE in your storage
preference, but it might help performance if you do, or if it is part
of the tablespace default). At the end of
this statement, the shadow tables hold an optimized version of the
existing $I table data.
-
Copy all rows from shadow 2 to shadow 1
-
Create a shadow $X index on shadow 1
-
Copy any new rows in $I to shadow 1. While we were doing
the optimize, the $I might have changed. Perhaps there was a
concurrent sync, or ctx_ddl.add_mdata. Our shadow 1 does not have
these new rows. Luckily, their rowids are logged in $L (remember
the monitor trigger created in step 4?). So it's a simple matter
to copy the new rows from $I to shadow 1 -- insert into shadow1 as
select from $I, $L where $L.logged_rowid = $I.rowid.
-
Swap shadow 1 and $I, using alter table exchange partition
-
We've optimized all the docids in the snapshot $N table, but
there could have been deletes or updates while we were doing the
optimize, so we can't just truncate the $N table. We need to
delete all the docids in our snapshot table from the $N.
Logically, that's what we do here. Only practically, it is
easier to create a shadow $N table (DR$<idx>NN) using create
table as select current $N table MINUS snapshot $N table.
-
Swap the second shadow N table and the $N table
-
Drop all shadow objects, which have old data in them
The implication of steps 8 and 10 is that after a rebuild optimize,
your index may not be totally optimal -- DML that happens during the
optimize cannot be optimized. But this kind of thing is
unavoidable, and no interesting index stays pristine for very long,
anyway.
So, when do you use FULL vs REBUILD vs TOKEN? Use FULL and its
time-limited operation for continuing maintenance, like a cron job that
kicks off every night for 4 hours. This can be supplemented by a
cron that runs some TOKEN or TOKEN_TYPE optimizes for continuing
targeted maintenance, keeping especially dynamic field sections, mdata
sections, or text tokens optimal. Use REBUILD for periodic
maintenance -- like running it once a month or something, if FULL if
unable to keep up with daily activity.
Improved Parallel Optimize Mechanism
The
internal mechanisms used for parallelization of optimize_index have
been improved.
The method of parallelization used for optimize_index in
9i was susceptible to incomplete coverage if insufficient parallel
slaves were available. In 10g we've rewritten the underlying
mechanisms so that it adapts to the number of parallel slaves actually
spawns, and, in the extreme case, can fall back to serial optimization
seamlessly.
Query
Internal Optimizations
Query
I/O performance has been improved
Text queries now use low-level database calls to do
index table reads. This makes the I/O faster, which makes your
text queries faster. You don't need to do anything -- it's
automatic.
MDATA Operator
The
MDATA operator searches MDATA sections.
This is a cross-reference. MDATA is a new operator
in our query language that does equality search on tagged document
metadata. Read all about it above.
Transactional Queries
Text
queries can now have statement-level consistency
This is a cross-reference. Queries on indexes
created with the TRANSACTIONAL keyword have statement-level consistency
-- inserts and updates are immediately searchable. Read all about
it above.
Reverse Docid Sorting
The
default ordering of text queries is, once again, forward docid.
Reverse docid is available as an option.
In 9.2.0, the natural order of rows returned from Text index,
in the absence of an order by clause, was in reverse docid order.
The
intent was to provide a rough first-in-last-out sorting, so that
generally the most recent documents were returned first. For
various technical reasons, we've reverted this in 10g to forward docid
order. To provide backward compatibility for existing
applications
relying on reverse docid, we have an internal switch. The switch
modifies the index metadata so that text queries return results in
reverse docid order:
exec drvxmd.set_reverse_docid_switch('indexowner','indexname','ON');
this turns reverse docid ordering ON for the index
indexowner.indexname. This will simulate 9i reverse docid
ordering, except in the cases of progressive
relaxation or transactional query
-- this switch is intended as a backward-compatibility coping mechanism
only, and these are new query models.
If your application relies on reverse docid, you really should think
about moving to a different model. The current recommended
solution for date-sorted text query is a local partitioned text index
on a date-partitioned base table. Experimental and customer
data show that this is generally the most-performant solution with the
current version of the product. Alternatively, MDATA can be used
for conjunctive partitioning -- see the companion document 10g
MDATA
Tricks on OTN.
COUNT_HITS Result Caching
Certain
contains queries remember the total number of documents hit by the text
query so that subsequent count_hits can use the cached result
When you do a text search on the web these days, you
usually get about a page of "top" results and an estimate of the number
of hits. Personally I think the number of hits is useless and
cliched, but people seem to like it. Since 8.1.5 we've had
ctx_query.count_hits to deliver the estimated hit count. However,
even in estimate mode, count_hits re-runs the query, which means you
usually end up running the query twice -- once to get the "top" hits
and once to get the total hitcount.
10g introduces a hit count cache. When you do the contains to
get the top hits, behind the scenes Oracle Text remembers the total
number of hits. When you then do the count_hits, instead of
re-running the query, we just get the cached hitcount.
This technique does not always guarantee correct results, so there
are a few restrictions.
- The order should be contains, then count_hits -- count_hits won't
cache results for a later contains
-
The contains must be one that produces all text hits. For
instance, a query like:
select * from (select /*+FIRST_ROWS(50)*/ * from table where contains(text,'dog',1)>0 order by score(1) desc) where rownum <= 50
actually produces all text hits internally, even though only 50 are
produced. A query like:
select * from table where contains(text, 'dog')>0 where rownum <= 10
does not produce all hits. Nor does a mixed query that uses
function invocation, for example.
-
The query strings must match. If you do a contains for
"dog and ENG within LANGUAGE" and then a count_hits for "dog", it won't
use the cached results.
-
The count_hits must be an exact => false
-
There cannot be any sync completions, index rebuilds, partition
operations, etc. between issuing the contains and issuing the
count_hits.
You don't have to worry about correctness.
If you fail to meet a restriction, the count_hits
just switches automatically to the old behavior of re-running the
query -- you'll always get correct results, not random answers or error
messages.
Progressive Relaxation
Progressive
relaxation mechanically rewrites and re-executes queries to increase
recall as needed
Progressive relaxation is a technique that uses query
variation to maximize precision while guaranteeing recall. Think
of the user-entered query as simply a sequence of words. You can
put these words together in a variety of ways to form text queries,
some more precise than others. You run the variants sequentially,
from most precise to least precise, until you get the number of hits
you need. Let's look at a concrete example.
Let's say your clothing website has a small text field
labeled "search". An end user enters "leather bomber
jacket". In the best case, there is a product with this exact
phrase. So first, you try running it as a phrase: "leather
bomber jacket". If it hits, you stop. Otherwise, you
relax the query. Maybe the three words appear, but not in that
order: "leather and bomber and jacket". If you
still don't have enough hits, you could try for an accum: "leather,bomber,jacket"
then perhaps a stem query: $leather and $bomber and $jacket,
etc. The idea is that you keep relaxing the precision until you
get the recall you need.
Obviously you can do this yourself with PL/SQL code. But why do
all that work when there's the new progressive relaxation
feature? All you need to do is specify the series of queries you
want to run. Oracle Text handles the mechanics of running each
one in order until you get the number of hits you need. Our
example above would be written:
select /*+FIRST_ROWS(10)*/ item_id from stuff_for_sale where contains(description, '<query><textquery><progression> <seq>leather bomber jacket</seq> <seq>leather and bomber and jacket</seq> <seq>leather, bomber, jacket</seq> <seq>$leather and $bomber and $jacket</seq> </progression></textquery></query>')>0 where rownum <= 10;
There may be overlap between the variants. For instance, you
could have one document that has the phrase "leather bomber
jacket". That's not enough to satisfy our 10 rows buffer,
so we move to the second one -- "leather and bomber and jacket".
Well, this also matches the document that we already have. Will
we get the document twice? No. The progressive relaxation code
keeps track of produced results and eliminates duplicates.
Notice the FIRST_ROWS and the rownum limit. This allows Oracle
Text to stop after producing the number of rows you need. If you
get 10 hits on the phrase variant, there's no point in running the
relaxed versions, right?
If you need to know which variant hit, you can select the
score. Like accum, progressive relaxation partitions the score
into ranges. Our example has four queries, so each query gets
assigned a range of 25 points (100/4). So if a particular document has
score 78, this is in the top range (76-100), so we know that it was hit
by query 1 (the phrase).
To make things even easier, the variants can be mechanically
generated for you. We can rewrite our example above as:
select /*+FIRST_ROWS(10)*/ item_id from stuff_for_sale where contains(description, '<query><textquery> leather bomber jacket <progression> <seq><rewrite>transform((TOKENS, "", "", " "))</rewrite></seq> <seq><rewrite>transform((TOKENS, "", "", " and "))</rewrite></seq> <seq><rewrite>transform((TOKENS, "", "", " , "))</rewrite></seq> <seq><rewrite>transform((TOKENS, "$", "", " and "))</rewrite></seq> </progression></textquery></query>')>0 where rownum <= 10;
Let's break down the first rule:
<rewrite>transform((TOKENS, "", "", " "))</rewrite>
The transform operator takes in four arguments. For the first
argument, you should pass in TOKENS, meaning you want to split the
input into constituent tokens. The next three arguments are
concatenation inputs: what you want prepended to each token, what you
want appended to each token, and what you want in between. So we
want nothing in front, nothing in back, and a space in between: the
phrase "leather bomber jacket". The second rule puts
nothing in front, nothing in back, and AND in between: leather
and bomber and jacket. And so on.
With rewrite, your front-end query application does not need any
fancy code -- just prepend the "<query><textquery>"
and append the "<progression><seq>...", both
of which are static strings, and the Oracle Text query engine does the
rewriting.
The syntax is pretty fussy. You need the double parens for
transform, and you actually need the whitespace between the
arguments. If you wrote it like this:
<rewrite>transform((TOKENS,"",""," "))</rewrite>
You'd get a parse error or possibly worse. So don't do it.
The major limitation is that you can have only the transform inside
the rewrite tag. You can't do progressive section expansion like
<rewrite>transform((TOKENS, "", "", " and ")) within title</rewrite> <rewrite>transform((TOKENS, "", "", " and ")) within head</rewrite> <rewrite>transform((TOKENS, "", "", " and ")) within body</rewrite>
or add extraneous limiters like:
<rewrite>transform((TOKENS, "", "", " and ")) and is_a_book</rewrite>
Future versions may have this capability, but for now you'll have to
do it manually (non-rewrite progression/seq). Also, highlighting
does not support query template, so you'll have to determine which rule
hit, transform from rule to text query, if needed, then send the text
query to highlighting. Future versions of the product should be
able to support passing the query template directly to ctx_doc.markup.
COUNT Score Algorithm
The
COUNT score algorithm is a simple occurrence-based relevance ranking.
The default score algorithm is kind of complex.
Say you search for "dog". The score you get for a document
containing the word depends on the number of times dog appears in the
document, but also the number of documents in the collection that
contain dog, and the number of indexed documents. So it's
difficult to predict the expected score for a particular document given
a particular query. This can get problematic if you're trying to
do something fancy.
The COUNT score algorithm makes most scores predictable. It
counts the number of occurrences and returns that as the score.
So if you query for "dog" and "dog" appears 14 times in the document,
your score is 14. You specify the count score algorithm using the
algorithm attribute of the score tag in the query template:
select score(1), id from table where contains(text, '<query><textquery>dog</textquery> <score algorithm="SCORE"/></query>',1)>0;
Count score takes into account position-based limiters. For
instance, if you query for "cat dog" and "cat" appears 18 times, "dog"
appears 5 times, and "cat dog" appears 3 times, then your score is
3. If you query for "cat within BOLD" (assuming BOLD is a zone
section) and "cat" appears 20 times, but only 5 times in bold, then
your score is 5.
Most operators already have defined numeric behavior -- for
instance, AND returns the minimum score of its children -- so the
expected score for even a complex query on a given document is
predictable. The only things to watch out for are those operators
which redefine the score -- like FUZZY, which can return a score based
on spelling nearness, NEAR, whose score depends on the term distance,
and ACCUM, which remaps scores to ranges based on the number of
distinct children hit.
Is COUNT score better than the default algorithm? There's no
blanket answer. It really depends on your application and the
data. Since it's pretty painless, why not try it out with some
sample queries and compare the precision/recall? Additionally,
there are things which COUNT can do that the default algorithm
cannot. For instance, stay stable over product versions and
document corpus changes. Evaluate queries like "document must
have at least 5 occurrences of dog": dog > 4. And COUNT scores
are fairly comparable across partitions. Even across indexes in
some cases.
Query Language Specification
You can
specify the sub-lexer to use for query lexing without changing
the
NLS_LANG
setting.
8.1.7 introduced the multi-lexer, which routes documents
and queries to one of a collection of language-specific lexers.
On the index side each document was identified with a language
tag. On the query side, the nls_language setting determined the
sub-lexer to use. For single sessions that have requests in
multiple languages -- like you find in web applications, where the
session is shared by several front-end sessions -- the nls_language has
to be reset before each request. In 10g there's an easier way:
the LANG attribute in the query template.
To demonstrate, let's set up a simple table with one "english" row and
one "german" row:
create table ltest(l varchar2(10), t varchar2(80)); insert into ltest values ('en','foo bar apples'); insert into ltest values ('d', 'foobar äpfel');
Now our multi-lexer. The only difference between our "english"
and "german" lexer is going to be that the german lexer has : defined
as a skipjoin:
ctx_ddl.create_preference('elx','basic_lexer'); ctx_ddl.create_preference('dlx','basic_lexer'); ctx_ddl.set_attribute('dlx','skipjoins',':'); ctx_ddl.create_preference('mlx','multi_lexer'); ctx_ddl.add_sub_lexer('mlx','default','elx'); ctx_ddl.add_sub_lexer('mlx','german','dlx','xxz');
We'll create the index using "l" as the language column, "mlx" as the
lexer, and a wordlist
preference that uses the AUTO stemmer setting. When I do:
select * from ltest where contains(t, 'foo:bar')>0;
I get the english row. My session language is english, which does not
have a specific
sub-lexer, so the default lexer gets used. That breaks "foo:bar" at the
colon, so this
becomes a query for the phrase "FOO BAR", which matches only the
english row. I could
set the session language to german, but let's use the new-fangled way:
select * from ltest where contains(t, '<query><textquery lang="german">foo:bar</textquery></query>')>0;
the LANG attribute of textquery overrides my session language, and
tells the multi-lexer
to use the german lexer. This lexer has colon as a skipjoin, so
"foo:bar" becomes "FOOBAR".
This matches the german row.
The lang attribute has two caveats: first, you must specify a
language name or abbreviation.
Notice in our multi-lexer setup we defined xxz as an alternate value
for german. If you tried to use lang="xxz" in the query above, you
would get the english row -- the alternate value is
not supported at query time.
Second, the language specification does not affect the stemmer or
fuzzy matcher. Remember
that we indexed using the AUTO stemmer. If we did:
select * from ltest where contains(t, '<query><textquery lang="german">$Apfel</textquery></query>')>0;
we would get no results -- the stemmer picks up only the session
language, which is
english, and uses the english stemmer. Since "Apfel" is not an English
word, this probably expands to something like Apfel=Apfels, which hits
nothing. However, if I change the session language to german and issue
a non-template query $Apfel, the german stemmer gets used, and expands
Apfel to, among others, its plural of Äpfel. This hits the german
row.
Classification and Clustering
RULE_CLASSIFIER Attribute PRUNE_LEVEL
The
PRUNE_LEVEL attribute of the RULE_CLASSIFIER sets the aggressiveness of
decision tree pruning.
The RULE_CLASSIFIER examines groups of documents and
creates ctxrule rules. The process is actually:
- Create a full decision tree. Conceptually, the tree routes
documents using a
series of token tests. If the token is in the document, then we follow
the left branch. If the token is not in the document, then we follow
the right branch. For instance, let's use the tree:
dog / \ / \ cat frog / \ / \ G1 G2 G3 G4
and the document:
the dog sits on the wall
We start at the root, which is DOG. This document has that token, so we
follow the left branch. Next is a test for CAT. This document does not
have this token, so we follow the right branch, leading to the result
G4,
which means that this document belongs in group 4. In reality, the
mechanics
are a bit different, but this is a valid conceptual framework.
The decision trees created in this first step are
perfect. That is, every document that belongs in the group gets routed
to the group by the tree. Every document that does not belong to the
group is excluded by the tree.
- The tree is pruned. Perfect trees are problematic for two
reasons. First, they result in large rules which would decrease
performance of routing. Second, they are too sensitive. They may
exclude new documents which really should be part of the group. To
avoid these problems, some leaf nodes of the tree are removed. This
results in a more relaxed tree which empirically produces better
routing.
The pruning step 2 can now be influenced with the PRUNE_LEVEL attribute
of the RULE_CLASSIFIER. You set the PRUNE_LEVEL to an abstract
measurement of aggressiveness, from 0 to 100. The higher the
PRUNE_LEVEL, the more aggressive the pruning. The resulting rules will
be less strict, meaning that the group gets fewer false negatives and
more false positives than rules produced by a less aggressive prune
setting.
Like most things in the classification/clustering space, you probably
need to play with it to really get a feel for the setting that's best
for your situation; there's no hard-and-fast, best-for-everyone rule of
thumb.
SVM_CLASSIFIER
The
SVM_CLASSIFIER implements support vector machine classification.
The SVM_CLASSIFIER is a new object for classifier
preferences which implements support vector machine classification. The
existing classifier, RULE_CLASSIFIER, builds decision trees and
populates a rule table with inspectable queries. SVM_CLASSIFIER instead
implements a complex statistical model with rules that are not
human-readable. However, the method for using SVM_CLASSIFIER is nearly
the same as the RULE_CLASSIFIER (see the 9.2.0 Technical Overview for
an example of using the RULE_CLASSIFIER).
To use the SVM_CLASSIFIER, first create a classifier preference:
exec ctx_ddl.create_preference('mysvm','svm_classifier');
You can also set attributes like any other preference -- see the
documentation for the full list of attributes and their meanings. Once
you have the SVM classifier preference, simply call the new generic
train API:
PROCEDURE train ( index_name in varchar2, -- the name of the index holding the documents docid in varchar2, -- the name of the docid column cattab in varchar2, -- the name of the category table catdocid in varchar2, -- the name of the docid column in the category table catid in varchar2, -- the name of the category id column in the category table restab in varchar2, -- the name of the result table preference in varchar2 -- the classifier preference );
exec ctx_cls.train('myindex','did','cat_tab','did','cid','svmtab','mysvm');
Like RULE classifier, SVM classifier populates a table of
rules. However, the result table structure for the SVM classifier
is slightly different than for the RULE classifier:
create table svmtab ( cat_id number, type number, rule blob );
Although the rule is a non-inspectable BLOB, you can still create a
CTXRULE index on the output for routing documents. However, you
must provide the classifier preference during ctxrule index creation:
create index svmx on svmtab(rule) indextype is ctxsys.ctxrule parameters ('classifier mysvm');
The theory behind SVM is not in the scope of this document. See
technet for further information if you are curious. Broadly, it
provides higher-accuracy classification, although the rules are
non-modifiable and non-inspectable, so cannot be easily modified if
they are inaccurate. Like most things in classification and clustering,
you really need to try it out to see what works best for your situation.
Clustering
Clustering
organizes a set of documents into a hierarchy of groups of similar
documents..
Clustering can be seen as the "next step up" in a
progression of automatic organization feature sets. 9i introduced the
ctxrule index type. You wrote a set of queries, and incoming documents
were routed to the queries which matched. If you didn't know how to
write the best queries for ctxrule, version 9iR2 introduced
classification. Given groups of documents, classification determines
which words or themes best characterize each group, and creates queries
that can be fed into ctxrule for routing of future documents. Now, if
you don't even know how to organize your corpus into groups to feed
into classification, version 10g introduces clustering to do it for
you. Given a set of documents, clustering organizes them into a
hierarchy of groups of similar documents. This can then be fed into
classification, and that fed to ctxrule, so you can go from unorganized
corpus to routing system with minimal human work.
The clustering process is the unsupervised classification of
patterns into groups
(clusters). A cluster is simply defined such that any document is "more similar" to other documents
in its cluster than it is to documents not in its cluster.
"More similar" basically means "sharing more attributes". As with
training, attributes may consist of simple words (or tokens), word
stems, and themes where supported. We can distinguish two main
clustering techniques: hierarchical and partitional. The first one
produces a nested series of partitions and the second produces only
one. For hierarchical clustering Oracle uses a top-down partition
approach and k-means for partitional clustering. Next we'll take a look
at how to use k-means clustering
KMEAN Clustering
How to
use the KMEAN clustering method
k-Mean is an example of a partitional clustering algorithm that
produces a single partition of the data set (sometimes called flat
output). k-Mean is a trade-off between quality and scalability.
The k-mean algorithm works in "feature space", where each dimension
is represented by one of the features (such as a word or theme) from
the document set. It assigns k centers, or "protypes" -- one for each
cluster. It then calculates the distance from each prototype in feature
space for each document, and assigns the document to the cluster to
which it is closest. It then calculates a new prototype for each
cluster based on the mean of each document assigned to it, and repeats
the process.
while not done { for each document d { find cluster c whose centroid is most similar to d assign d to cluster c } for each cluster c { compute the centroid of c using its documents } }
Now let's take a look at how to use the clustering features in
Oracle Text:
Start with a table of documents. This is just like setting up for
a context index -- the documents don't have to be stored in the
document table -- you can use file datastore, user datastore, etc. if
you want. The only restriction is that the document table should have a
numeric unique identifier column. Our example is fairly simple:
create table news (pk number primary key, title varchar2(2000)); insert into news values(1,'NASA Announces major Mars rover finding'); insert into news values(2,'Mars rover maneuvers for rim shot'); insert into news values(3,'NASA Mars Odyssey THEMIS image: typical crater'); insert into news values(4,'Mars rovers ready to cruise');
Next, create a context index on the document table, using
appropriate preferences. The preferences affect the tokenization of the
documents, and thus the potential shared features upon which the
clusters are based. For instance, turning themes ON in the lexer may
result in clusters that are based on a theme. Having field or mdata
sections in the section group can lead to clusters based on tagged
metadata. In our example, we'll just go with the defaults:
create index news_idx on news(title) indextype is ctxsys.context;
The index does not
have to be populated; if you are not interested in searching the
corpus, create the index NOPOPULATE to save time
and disk space.
Next, we create the result tables. Clustering has two result tables
-- the document assignment table, which tells you which documents
belong to which clusters, and the cluster description table, which
holds the cluster hierarchy. Both can be named anything you want, but
they must have the following structure:
create table clu_restab ( docid number, -- the document id clusterid number, -- the cluster id score number -- the score of the document in the cluster ); create table clu_clusters ( clusterid number, -- the cluster id descript varchar2(4000), -- a description of the cluster label varchar2(200), -- a suggested cluster label sze number, -- number of documents in the cluster quality_score number, -- score of the cluster, higher is better parent number -- the id of the parent cluster. Negative means no parent. );
Now we're ready to cluster. The package for clustering is called
ctx_cls.clustering:
PROCEDURE clustering ( index_name in varchar2 -- the name of the text index docid in varchar2 -- the name of document id column in the collection table doctab_name in varchar2 -- the name of document assignment table for clusters clstab_name in varchar2 -- the name of the cluster description table pref_name in varchar2 default NULL -- the name of the cluster preference, if any );
The pref_name argument can take a clustering preference. The
clustering preference has all sorts of knobs which you can use to
attempt to influence the clustering output:
Attribute
|
Datatype
|
Default
|
Min
|
Max
|
Description
|
MAX_DOCTERMS
|
integer
|
50
|
10
|
8192
|
the maximum number of distinct
terms representing one document |
MAX_FEATURES
|
integer |
3000
|
1
|
500000
|
the maximum number of overall
distinct features
|
THEME_ON
|
boolean
|
FALSE
|
|
|
Consider themes as features?
(INDEX_THEMES must be on in lexer)
|
TOKEN_ON
|
boolean
|
TRUE
|
|
|
Consider tokens as features?
(INDEX_TOKENS must be on in lexer)
|
STEM_ON
|
boolean
|
FALSE
|
|
|
Consider stem tokens as
features? (INDEX_STEM must be on in lexer)
|
MEMORY_SIZE
|
integer |
500
|
10
|
4000
|
Approximate memory size for
clustering, in MB
|
SECTION_WEIGHT
|
integer |
2
|
0
|
100
|
the occurrence multiplier for
adding a term in a field section as a normal term. For example, by
default, the term cat in "<A>cat</A>" is a field section
term and is treated as a normal term with occurrence equal to 2, but
you can specify that it be treated as a normal term with a weight up to
100. SECTION_WEIGHT is only meaningful when the index policy specifies
a field section. |
CLUSTER_NUM
|
integer |
200
|
2
|
20000
|
the maximum number of clusters
to be generated |
We'll set a maximum of 2 clusters in our example to illustrate:
exec ctx_ddl.create_preference('ycluster','kmean_clustering'); exec ctx_ddl.set_attribute('ycluster','token_on','yes'); exec ctx_ddl.set_attribute('ycluster','cluster_num',2);
Now we call the procedure:
exec ctx_cls.clustering('news_idx','pk','clu_restab','clu_clusters','ycluster');
In 10gR1, the cluster package may have some internal nodes, which
are not useful and have zero size. For clarity, we can simply delete
them:
delete from clu_clusters where sze=0; commit;
Now let's take a look at the output, starting with the clusters:
select clusterid cid, label, descript from clu_clusters;
CID LABEL DESCRIPT --- ------------ ------------------------------------------------------------ 2 CRUISE CRUISE,ROVERS,MARS,READY 3 ANNOUNCES ANNOUNCES,MANEUVERS,THEMIS,RIM,NASA,ROVER,MARS,SHOT,MAJOR,FI NDING,IMAGE,TYPICAL,ODYSSEY,CRATER
Clustering labeling is a difficult problem. Oracle Text picks the
most important word from the description column as the label for a
cluster. The description column provides more information about the
content of the cluster.
Next the documents assigned to each cluster (a document can be in
more than one cluster):
select clusterid, score, title from clu_restab, news where clu_restab.docid = news.pk order by clusterid, score desc
The following script shows both cluster and document information:
set serverout on size 1000000 begin for csr1 in (select clusterid, label from clu_clusters) loop dbms_output.put_line('Cluster ID : '||csr1.clusterid||', Label: '||csr1.label); for csr2 in (select n.title, c.score from news n, clu_restab c where n.pk = c.docid and c.clusterid = csr1.clusterid and c.score > 0.01 order by score desc) loop dbms_output.put_line('Clust:'||csr1.clusterid||' Score:'||csr2.score||': '||csr2.title); end loop; end loop; end; /
TEXTK Clustering
TEXTK
clustering is an experimental clustering algorithm.
TEXTK is an UNDOCUMENTED, second clustering method
available for
experimental use. If you're curious, you can try both KMEAN_CLUSTERING
and TEXTK_CLUSTERING objects and determine which is better for your
needs. The interface for using TEXTK_CLUSTERING is the same as the
above section, but, during clustering preference creation, specify the
TEXTK_CLUSTERING object instead:
exec ctx_ddl.create_preference('ycluster','textk_clustering');
The textk clustering has the same attributes as kmean clustering,
but has two additional attributes::
Attribute
|
Datatype
|
Default
|
Min
|
Max
|
Description
|
MIN_SIMILARITY
|
integer
|
1
|
1
|
20
|
the minimum similarity score for
each leaf cluster
|
HIERARCHY_DEPTH
|
integer |
1
|
1
|
20
|
the maximum depth of the cluster
hierarchy
|
Miscellaneous Other Features
Policy-Based CTX_DOC Services
The
policy-based services allow text processing of ad hoc data, without a
document table or text index.
The doc services (ctx_doc.filter, ctx_doc.tokens,
ctx_doc.themes, etc.) take in an index name and a document
identified (primary key or rowid). That limits their operation to
indexed data, by which we mean that there has to be a record in a
table, and that table has to have a text index on it. This
restriction is too limiting for some customers, who just want to do
text processing on ad hoc data. 9.0.1 introduced ctx_doc.ifilter
command, which filters ad hoc binary data to text. While this
served the immediate need, it was limited to just filtering, and only
INSO filtering. 10g introduces policy-based document services,
which is a generalization of the ifilter model to all document services.
The policy services are similar to the in-memory version of the
regular services. Other than the name of the function, the only
difference is that the second argument is the data to be processed,
rather than a document identifier. Let's compare two filter calls
which do the exact same thing:
declare output clob; begin ctx_doc.filter('myindex','1',output); end;
declare output clob; input blob; begin select data into input from mytable where id = 1; ctx_doc.policy_filter('myindex',input,output); end;
In both cases we are filtering "data" from row 1, using whatever
filter is defined for index "myindex", and placing the result in local
CLOB variable "output". The only difference is that in the first
block of code we pass in the document id, and the document content is
read from the table, while in the second block we pass in the document
content directly. This particular example is not too interesting,
but the point is that you could get the data from anywhere -- it could
come from another table, be passed in from a webpage or uploaded
document, or be generated by some other procedure. You no longer
need to actually store data in a table in order to do text processing.
Some miscellaneous notes on these:
- Policy variants are available for all doc services: themes,
tokens, gist, filter, highlight and markup.
-
As the example shows, even though these are called policy-based
services, the first argument can be either an index name or a policy
name. When using an index, the index's datastore is ignored -- you are
passing in the data.
-
The data passed in can be a varchar2, a clob, blob, or bfile.
-
There are no result-table versions of the policy services --
only
in-memory results.
-
In addition to the document data itself, you can also pass in
values for "language", "format", and "charset", in case your policy or
index uses multi-lexer, charset filter, etc. Just specify it
using the appropriate parameter name, as in:
ctx_doc.policy_filter('myindex',input,output,language=>'French');
Since policy_filter replicates ctx_doc.ifilter, ifilter is now
deprecated and should not be used in new code.
INPATH and HASPATH Highlighting
INPATH
and HASPATH are now supported in highlighting
Nuff said.
CTX_REPORT XML Output
CTX_REPORT
calls can produce XML-formatted output.
The reports produced by CTX_REPORT are formatted text by
default. We've now added a new, optional last parameter
(report_format) to each call that can take in a format selector.
Currently you can ask for 'TEXT' or 'XML'. When setting the
output format to XML, you get the report formatted in XML style --
suitable for passing through XSLT for presentation, or
XMLType.extractNode if you just want a particular piece of data.
Let's check out an example:
select ctx_report.index_size('myidx',null,'XML') from dual;
<CTXREPORT> <INDEX_SIZE> <SIZE_INDEX_NAME> USERA.MYIDX </SIZE_INDEX_NAME> <SIZE_OBJECT> <SIZE_OBJECT_NAME> USERA.DR$MYIDX$I </SIZE_OBJECT_NAME> <SIZE_OBJECT_TYPE> TABLE </SIZE_OBJECT_TYPE> ...
I've added indentation -- you don't get any in the real
output. It's all the same to a machine, which is probably the
consumer of these things. Humans can continue to use the
formatted text versions.
Unfortunately we don't have formal DTD's or schemas for the output
-- the best thing to do if you intend to use the XML output is run it,
check out the output, then extrapolate your XPaths (or whatever it is
you are doing) based on that.
CTX_OUTPUT Events Parallel Propagation
Active
CTX_OUTPUT events are now propagated to parallel slaves.
9iR2 introduced logfile events -- you set events and the
ctx_output logfile gets more information, like the token currently
being optimized, the rowid of the row currently being indexed,
etc. This is mostly useful when debugging problems.
Unfortunately, these events were not propagated during parallel
operations -- parallel sync, parallel optimize, etc. So you would
only get basic information in the logfile of each parallel slave.
In 10g the active events are propagated to the parallel slaves, so if
you have the print rowid event on when you execute a parallel sync,
you'll see the rowids printed in each parallel slave.
CTX_OUTPUT.EVENT_INDEX_PRINT_TOKEN
The
CTX_OUTPUT.EVENT_INDEX_PRINT_TOKEN event prints out tokens during the
dump phase of indexing
The new EVENT_INDEX_PRINT_TOKEN ctx_output event will
cause create index and sync to print out each token as it is being
written to $I. For instance:
exec ctx_output.start_log('idx'); exec ctx_output.add_event(ctx_output.event_index_print_token); exec ctx_ddl.sync_index('myidx');
Then in the log you will see something like:
17:02:59 01/15/04 Writing index data ($I) to database. 17:02:59 01/15/04 WRITING TOKEN AACHEN, 0 17:02:59 01/15/04 WRITING TOKEN AARDVARK, 0 17:02:59 01/15/04 WRITING TOKEN ABACUS, 0
This is very useful for targeted optimization. With careful
logging and a bit of logfile parsing code, you can determine which
tokens are sub-optimal because of the sync, and do token-mode optimize
on just those tokens. Or, you could monitor vocabulary in new
documents and detect new, previously-unknown tokens -- if you have
mdata or field sections, you could use this technique to send alerts
when new authors or stock symbols pop up, for example.
Tracing
The
tracing facility allows low-level high-resolution timing of events for
advanced diagnosis.
The new 10g tracing facility profiles internal Oracle
Text operations. We believe that this is mostly going to be used
under support direction when diagnosing performance problems. The
idea is that you add a trace in your session, which activates timing or
counting code that we've added around some of our internal
operations. After the overall operation (create index, select) is
complete, you can look at these results to see, for instance, where
time is being spent during a slow query.
You add a trace using:
exec ctx_output.add_trace(CTX_OUTPUT.TRACE_QRY_I_ROWS); exec ctx_output.add_trace(CTX_OUTPUT.TRACE_QRY_R_TIME);
Here we are telling the engine to count the number of $I rows that
are read, and time the reads from $R. Note that logging does not
have to be on to add traces. Now we do a query:
select * from mytable where contains(text, 'apple')>0;
During query processing, the traces are counting and timing.
After the query, we can see the values of the traces by looking at the
view ctx_trace_values:
select * from ctx_trace_values;
TRC_ID TRC_VALUE ------ --------- 8 1 10 63
This tells us that 1 $I row was read (TRACE_QRY_I_ROWS =
8) and that docid to rowid resolution took 63 microseconds. In
addition to getting the trace values from the view, we can get them
programatically:
exec :n := ctx_output.get_trace_value(CTX_OUTPUT.TRACE_QRY_R_TIME);
or we can dump the values to the logfile:
exec ctx_output.log_traces;
When we run the next query, the traces will simply accumulate.
For instance, if we ran the same query again, and queried
ctx_trace_values, we would see 2 for trace 8 and something like 128 for
trace 10. This way, you can get an aggregate time over a set of
operations. Alternatively, you can zero out the traces as desired:
exec ctx_output.reset_trace(8);
Here are the traces available in this version:
Trace
|
ID
|
Meaning
|
TRACE_IDX_USER_DATASTORE
|
1
|
time to execute user datastore
procedure
|
TRACE_IDX_INSO_FILTER
|
2
|
time to execute inso filter
|
TRACE_QRY_XX_TIME
|
3
|
time to execute select on $X
|
TRACE_QRY_XF_TIME
|
4
|
time to fetch from $X
|
TRACE_QRY_X_ROWS
|
5
|
number of rows fetched from $X
|
TRACE_QRY_IF_TIME
|
6
|
time to fetch from $I
|
TRACE_QRY_IR_TIME
|
7
|
time to read oversized $I lobs
|
TRACE_QRY_I_ROWS
|
8
|
number of $I rows read
|
TRACE_QRY_I_SIZE
|
9
|
total size of $I rows read
|
TRACE_QRY_R_TIME
|
10
|
time to read $R table
|
TRACE_QRY_CON_TIME
|
11
|
overall contains time
|
TRACE_QRY_K_TIME
|
12
|
time to read from $K (function
invocation)
|
One caveat: like logging, tracing is not currently supported on MTS.
Query Log Analysis
The
query log analysis package allow you to see what your users are looking
for.
Many search sites track the queries that users enter --
you can see what people commonly search for, tune your optimization,
etc. 10g introduces a product-level feature for tracking queries
and analyzing the logfiles. It all starts with logging the
queries. In sessions that run text queries, start a query log:
exec ctx_output.start_query_log('qlog');
This works like the existing ctx_output.start_log, so we're creating a
logfile named "qlog" in the log directory. The difference is that
only queries will be logged, in a special output format.
Query logging and regular logging are exclusive -- you can either do
normal logging or query logging, but not both at the same time.
Like normal logging, query logging is not currently supported on
MTS. Once the log is started, subsequent select .. where
contains statements (only contains -- not count_hits)
will add entries to the logfile. If you take a look, they look
like little XML fragments:
<QuerySet><TimeStamp>14:27:25 01/15/04 </TimeStamp><IndexName>MYINDEX</IndexName> <Query>apple</Query><ReturnHit>Yes</ReturnHit><QueryTag></QueryTag></QuerySet>
So let's say you set this up, and now your logfile has several
hundred queries in it. Now you want to analyze the logfile.
ctx_report.query_log_summary reads a logfile, compiles the
top n queries, and returns the results in a PL/SQL table -- each item
in the table tells you the query string (query) and the number of times
it was run (times). By default, it tells you the 50 most
frequently-run queries that hit documents:
declare otab ctx_report.query_table; begin ctx_report.query_log_summary('qlog','myindex',otab); for i in 1..qlog.count loop dbms_output.put_line(otab(i).query || ' run '||otab(i).times||' times'); end loop; end;
This must be run as CTXSYS, or you'll get an ORA-942 error.
This might produce output like:
apple run 10 times aardvark run 5 times ...
You can change the size of the output buffer using argument
row_num. For instance, to get the top 100 results:
ctx_report.query_log_summary('qlog','myindex',otab,row_num=>100);
There are also two boolean parameters which affect the output.
most_freq selects which queries you get in the report. If TRUE,
the default, you get the most frequently executed <row_num>
queries. If FALSE, you get the least frequently executed
<row_num> queries instead. has_hit selects whether you get
queries that had hits, or queries that had no hits. Using has_hit
=> FALSE lets you see which queries had 0 results. If your
users are frequently querying for "raincoats" and finding nothing,
either your store doesn't carry raincoats (and maybe you should start)
or you need to massage the query to get more recall -- stemming or
thesaurus, perhaps.
COPY_POLICY
CTX_DDL.COPY_POLICY
creates a clone of an existing policy.
The new copy policy call clones a policy -- a new policy
is created with the same attribute settings of an existing
policy. Syntax:
ctx_ddl.copy_policy('existing_policy','new_policy');
This was implemented for internal customers, but it could be useful
when prototyping -- say you want to determine the effect of an index
metadata change. Create a copy of the existing index (you can
enter in an index or policy name as "existing_policy"), make your
change to the clone, and test, perhaps using the new policy-based document services.
The invoking user must be the owner of both policies, or have CREATE
ANY INDEX database privileges.
Stretch Viewer
Visualization
The
Stretch Viewer applet creates a visual representation of a hierarchy.
The StretchViewer visualization applet takes a tree
structure described in a particular
format and displays it as an applet. The StretchViewer jar file is
located in $ORACLE_HOME/ctx/sample/stretch. You
need
to create a directory "classes_g" in your Web server document directory
(for example "htdocs" in Apache) and then copy the jar file to it.
Data format
- Every node data has to have a absolute path name. In the example,
Categories is the root node. The other nodes are followed
by '/'
- textcolor RGB representation such as #336699 can be used instead
of blue,
red, magenta, orange.
- url The associate URL to the leaf node.
Let's define the test.txt example that defines the tree structure:
Categories textcolor=red Categories/External websites textcolor=blue Categories/External websites/Oracle url=http://www.oracle.com textcolor=black Categories/External websites/OTN url=http://otn.oracle.com textcolor=black Categories/Internal websites textcolor=blue Categories/Internal websites/Benri url=http://benri.us.oracle.com textcolor=black Categories/Internal websites/Aria url=http://aria.us.oracle.com textcolor=black
You invoke the applet using the following HTML snippet. Make sure
you
have
the correct path file and machine names correctly:
<object classid="clsid:8AD9C840-044E-11D1-B3E9-00805F499D93" width="100%" height="500" align="baseline" codebase="http://java.sun.com/products/plugin/1.3/jinstall-13-win32.cab#Version=1,3,0,0"> <param name="code" value="oracle.apps.iam.am.stretchviewer.StretchViewer.class"/> <param name="ImageDirectory" value="images/"/> <param name="codebase" value="http://yourmachinename/classes_g"/> <param name="DATA" value="http://yourmachinename/test.txt"/> <param name="DATALENGTH" value="66308"/> <param name="MODE" value="horizontal"/> <param name="BACKGROUND" value="white"/> <param name="STICKCOLOR" value="lightGray"/> <param name="AUTOEXPAND" value="true"/> <param name="CONTRACTABLE" value="true"/> <param name="TARGET" value="_top"/> </object>
The object classid and codebase depends on your current Java Plug
in.
We
recommend using Sun's plug in that works in mostly all browsers
including
IE.
The parameter codebase specifies the location of the
StretchViewer
jar
file in your web server. The rest of the parameters allow you to change
the
look and feel of the applet.
|