Advanced MDATA - tips and tricks
Advanced MDATA - tips and tricks
About this Document
This document is aimed at developers and other technically-minded people wishing to take full
advantage of the new mdata feature in Oracle Text, at the Oracle 10g release.
It assumes a working knowledge of Oracle Text features. The document
presents a variety of techniques, along with code samples. It is
important to read the code samples as you work through the document -
the main text does not cover all the details that can be understood by
reading the code and comments.
The examples aim to be self-contained - in that they contain the complete code
necessary to demonstrate the feature in question. However, they may not be complete -
for example in a number of places a trigger would be necessary to keep the
indexes up to date, but this has been omitted for brevity. In all cases, you are
likely to see errors at the beginning of the file the first time you run an
example, as the various tables and objects are dropped before being created.
Introduction
MDATA in 10g is a new feature for handling document metadata. It introduces a new section type, MDATA_SECTION.
Mdata is most likely to be used to avoid or mitigate "mixed query" situations, where a query has to combine
textual and structured clauses in a single query. In most cases, it is more efficient to do the search
as a text query only - with the mdata part of the text query taking on the role of the scructured clause.
In some respects, mdata sections are similar to ordinary zone or field sections, but with some important differences.
MDATA sections are not tokenized. So "Black Dog" is indexed as a
single string, rather than split into "black" and "dog" as would be
the case for a field or zone section. MDATA can be updated (in a
transactional manner) without affecting the rest of the index. So if a
library application keeps track of the number of copies of a book it
has in stock, then modifying that information does NOT require you to
reindex the whole record.
As a simple example, see the file simple.sql
This creates a simple "library stock" application. It has a table with
a CLOB column containing XML fragments.
There are two normally-text-indexed fields, title and author. There are also two fields - stocklevel and status - which are indexed using the new MDATA section group.
We index the table and then search for a book:
select book_info from library_stock
where contains (book_info,
'irving within author and mdata(status, In Stock)') > 0;
Here we can see the basic mdata search syntax. The mdata function takes the name of the section, and then the string to search for. Leading and trailing whitespace is ignored, but otherwise the string MUST be an exact match. "in stock" or "Stock" would not be a match here. Note that no quotes are required or allowed in the string, unless present in the original.
We can look at the some of the data held in the index. Tokens for MDATA sections have token_types in the range 400 to 499 (or -400 to -499 for changes, as we'll see shortly)
select token_text, token_type, token_first, token_last from
dr$lib_index$i where token_type >= 400 or token_type <= -400;
TOKEN_TEXT TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT
------------------------------ ---------- ----------- ---------- -----------
0 401 3 3 1
1 401 1 1 1
12 401 2 2 1
In Stock 400 1 2 2
Out of Stock 400 3 3 1
Next, we run a PL/SQL block which updates the metadata, using the procedures
ctx_ddl.remove_mdata and ctx_ddl.add_mdata. Note that this does not affect the data in the table at all - it ONLY affects the information in the index, and therefore queries that run against that index.
Running our select on the index again, we see that there are new rows for the modified sections. These have TOKEN_TYPE values in the range -400 to -499. These are DELTA, or changed records, and avoid the need to perform a (possibly expensive) update to existing columns.
TOKEN_TEXT TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT
------------------------------ ---------- ----------- ---------- -----------
0 -401 1 1 1
0 401 3 3 1
1 -401 1 1 0
1 401 1 1 1
12 401 2 2 1
In Stock -400 1 1 0
In Stock 400 1 2 2
Out of Stock -400 1 1 1
Out of Stock 400 3 3 1
The example then goes on to show that the data is unaffected by a SYNC of the index, but the negative token types go away once the index is optimised.
Working with XMLType Columns
The file simplexml.sql shows a variation on the previous example. The data is held in a XMLType column, rather than a CLOB. Other than that, and the fact that we can now extract individual elements (such as the title) from the XML text, the example is the same as the one before.
Working with Individual Columns
One problem with the previous two examples is that the metadata in the index gets out of step with the data in the table. Initially we are indexing the data in the XML text. However, as soon as we start changing the metadata, we are changing ONLY tghe data in the index. If we reduce the stocklevel from 1 to 0, the XML text in the table will still read 1. If, instead, we change the XML text in the table, then our table and index will be in step - but we'd have to reinndex the entire XML document, which is something we're trying to avoid.
So simple_uds stores the information in structured columns, and
assembles the XML-like text "on the fly" during indexing. This is
done using a User Datastore procedure (if you're not familiar with
this, see the Oracle Text Developers' Guide for more information).
The procedure my_datastore_doc takes the contents of the title, author and stocklevel columns, and combines them into an XML fragment, together with a generated tag for status - either "In Stock" or "Out of Stock" (this is an improvement on the previous versions, which hard-coded the status flag even though it should really be derived from stocklevel).
A User Datastore index is created on a single column - here we've
chosen to use the TITLE column. We therefore need triggers to deal
with updates to the other columns. An update to AUTHOR (which is
conventionally text-indexed) will trigger a change to the TITLE column
(actually just a copy to itself) to force a re-index.
A second trigger deals with updates to the STOCKLEVEL column. This
alters the metadata in the index, without requiring the reindexing of
the row. If STOCKLEVEL changes to or from zero, the value of STATUS is
set to "In Stock" or "Out of Stock" as appropriate.
Why do we bother having separate STOCKLEVEL and STATUS mdata sections? Well, it's not strictly necessary. We cannot do range searches on the STOCKLEVEL value. MDATA only allows equality searches, so we can't do mdata(stocklevel, > 0) for example. However we CAN do something like:
select id, title, stocklevel from library_stock
where contains (title, '(irving within author) not (mdata(stocklevel, 0))') > 0
However, it is useful to demonstrate the use of derived MDATA, even if it's not completely necessary for this example.
Multiple Values
There is no need to restrict MDATA to a single value per document.
The example simple_mult.sql shows a situation where roles are indexed in a many-to-many relationship with users - a user may have multiple roles, and a role may be assigned to multiple users.
This is simply a matter of repeating the data - logically at least we could have
<rolename>dba</rolename>
<rolename>admin</rollname>
In fact this example shows a new technique in that the rolenames are never stored in the data at all - they are added only via the ctx_ddl.add_mdata call. This has the advantage that there is no possibility of the index mdata getting out of sync with the table data, and there is no need for triggers. The main disadvantage is that you must load all the mdata AFTER completing the index build.
Getting Around MDATA limitations - (1) Length
The maximum length of an MDATA value is 64 characters. Anything beyond this point will be quietly truncated, both at indexing time and during the query. The effect of this is that when searching for long values you might sometimes get unexpected results.
The next example, lengthlimits.sql, shows a way around this problem.
In the first part of the example, we create a simple table containing 71 character "code" values. The two codes vary only in the last character. We then create a user datastore to generate an XML fragment, similar to the last example (note: triggers are omitted for brevity). The user datastore is used to create an index.
Now we're going to find that if we search for one of the values, it
finds both, since they've been truncated at 64 characters in the
index:
select code from test_length
where contains (text,
'mdata(code,
a123456789b1234567889c123456789d123456789e123456789f123456789g123456789x)'
) > 0;
CODE
--------------------------------------------------------------------------------
a123456789b1234567889c123456789d123456789e123456789f123456789g123456789x
a123456789b1234567889c123456789d123456789e123456789f123456789g123456789y
So we need a workaround. The workaround consists of a lookup table,
which contains short values to be indexed, and the long values which
are actually in the table. We must do a lookup in this table at
indexing time - in our user datastore, and at query time via a simple
PL/SQL lookup function. That way, the values that are indexed are
well within the 64 character limit. You can see this in a lot more
detail by following the full example.
The lookup table approach can be used for other types of
expansion. For example, although mdata only allows exact matching, you
can provide your own stem operator by storing stem expansions in a
lookup table. See the example file stems.sql.
Getting Around MDATA limitations - (2) Maximum Sections
There is a limit of 100 mdata sections. If you try to exceed this, you will get a message like:
ERROR at line 1:
ORA-20000: Oracle Text error:
DRG-11110: maximum number of mdata sections exceeded
ORA-06512: at "CTXSYS.DRUE", line 160
ORA-06512: at "CTXSYS.CTX_DDL", line 224
ORA-06512: at line 1
But what if you need more than 100 sections? Given the exact-match nature of mdata sections,
instead of using
<SEC1>dog</SEC1>
<SEC2>cat</SEC2>
<SEC3>rabbit</SEC3>
you can use
<SEC>1 dog</SEC>
<SEC>2 cat</SEC>
<SEC>3 rabbit</SEC>
And the search just has to prepend the section number to the search string.
This can be extended indefinitely.
(There is no example code for this section)
Logical Partitioning
It is often useful to be able to divide a table into logical "chunks" or partitions.
For example, I might want to be able to search on information in the last week,
the last month, or the last year. You can do this using a real partitioned table,
which is very efficient when you only want to search in that partition - but there
are some disadvantages:
Data must be regularly moved between partitions, as it becomes too old
for the current partition Searches across more than one partition are
less efficient than if all the data was held in a single table.
So a common technique in previous versions was to use a field section:
<DATERANGE>THISWEEK</DATERANGE>
and you could then search for " ... contains (col, ( (terms) and THISWEEK within daterange ) > 0"
This approach works well, except that you need to reindex each document every time
you modify the DATERANGE element. Using mdata avoids this problem. The query would
now become " ... contains ( (terms) and MDATA(daterange, THISWEEK) ) > 0"
Mixed Queries with Range Predicates
"Mixed" queries, containing text operators and a structured range search can
be expensive to process. It is often necessary for the kernel to a large number
of table or index lookups to satisfy the query. This is why we strive to have
the whole query satisfied from the text index. Sometimes, however, this is not
possible. Consider the query
contains (text, 'oracle') > 0
and date between '12-JAN-2004' and '03-FEB-2004'
Now we can't completely solve that date clause from the text index, at least
without using complicated algorithms described elsewhere) but we can use the
text index as a "primary filter", by storing (say) the month and year in an
mdata section and using a query like:
contains (text, 'oracle
and ( mdata(month, JAN2004) OR mdata(month FEB2004) )') > 0
and date between '12-JAN-2004' and '03-FEB-2004'
This way, the mdata restrictions mean the text search part of the query will
only fetch rows from January and February 2004, and the structured part of
the search will have very much less work to do to prune these rows to the
desired set of dates.
Performance note: Each mdata operator uses a database cursor, so you
should not use too many of them. Searching over several years by
specifying all the months is probably not a good idea. Instead, you
can store different levels of mdata:
<DECADE>2000-2009</DECADE>
<YEAR>2004</YEAR>
<QUARTER>1-2004</QUARTER>
<MONTH>FEB2004</MONTH>
Or use a single repeating field and code the values so they can be
distinguished:
<DATERANGE>2000-2009</DATERANGE>
<DATERANGE>YEAR2004</DATERANGE>
<DATERANGE>Q1-2004</DATERANGE>
<DATERANGE>FEB2004</DATERANGE>
Sorting
Given the logical partitioning in the previous section, we can see how this might be extended to help sorting
- mimicing the way real partitions make sorting more efficient.
Let's assume we want to search by date, and we need to fetch the first
50 hits. The technique would be to run a query on
DOG and MDATA(daterange, FEB2004) (current month)
then if you don't get enough hits from this to fill our hitlist, we can move on to
DOG and MDATA(012004,dp) (last month)
and so on. Each invocation needs an ORDER BY to do the fine-grain sorting, but the number of rows being sorted is minimal.
In testing, this technique performs relatively well. it is optimal when you only need one partition, so if you know the average date spread of rows returned, they can size the partitions correctly.
The following strategy is even better:
- Issue query without any sorting criteria, and fetch the first 50 rows
- If you don't get 50 rows, sort what you have and return. Otherwise,
- Iterative partition scanning as above
You can use a similar technique with multi-resolution partitions (as discussed in the logical partition section)
to improve partition scanning:
- Get first 50 rows of first, largest grain partition (i.e. YEAR), no sort
- If we've got 50 rows, try next smallest partition (i.e. HALF), no sort ... etc. This minimizes the number of invocations of the same query.
Another variation is to use progressive relaxation. This technique -
described in more detail here -
allows you to execute a series of query "sequences" intended to return hits in the order the sequences are executed.
The benefits of progressive relaxation are only really seen when a set
number of rows are fetched, and the fetches stop before all rows have
been fetched.
The file progrelax.sql demonstrates this, using the query:
select score(1), book_info from test_table
where contains (month_info, '
<query>
<textquery>
<progression>
<seq> mdata(month, 122003) </seq>
<seq> mdata(month, 112003) </seq>
<seq> mdata(month, 102003) </seq>
<seq> mdata(month, 092003) </seq>
<seq> mdata(month, 082003) </seq>
<seq> mdata(month, 072003) </seq>
<seq> mdata(month, 062003) </seq>
<seq> mdata(month, 052003) </seq>
<seq> mdata(month, 042003) </seq>
<seq> mdata(month, 032003) </seq>
<seq> mdata(month, 022003) </seq>
<seq> mdata(month, 012003) </seq>
</progression>
</textquery>
</query>
',1) > 0 and rownum <= 5;
Note that (roughly speaking, anyway) only enough mdata cursors will be opened to provide the number
of rows fetched - so the presence of twelve mdata queries here does not necessarily require twelve
seperate cursors.
MDATA Scoring - bug 3407570
In Oracle 10g Release 1, there is a bug with the score returned from mdata.
It SHOULD return 0 if no match, or 100 if there is a match. However it actually
returns a lower value than 100 - normally 1.
This causes problems with AND queries, since AND always scores the
LOWER of the two sides. So a search for
'(lord of the rings) and mdata(mediatype, DVD)'
will always return a score of 1, since the mdata score overrides the score
from the phrase search.
To workaround this, you can force the score of the mdata clause to the
maximum (100) like so:
'(lord of the rings) and mdata(mediatype, DVD)*10*10'
Note the maximum score multiplier is 10 - so you have to do it twice
as above.
Last modified: 16-April-2004 by Roger Ford
|