Advanced MDATA - tips and tricks

Advanced MDATA - tips and tricks

About this Document

This document is aimed at developers and other technically-minded people wishing to take full advantage of the new mdata feature in Oracle Text, at the Oracle 10g release.

It assumes a working knowledge of Oracle Text features. The document presents a variety of techniques, along with code samples. It is important to read the code samples as you work through the document - the main text does not cover all the details that can be understood by reading the code and comments.

The examples aim to be self-contained - in that they contain the complete code necessary to demonstrate the feature in question. However, they may not be complete - for example in a number of places a trigger would be necessary to keep the indexes up to date, but this has been omitted for brevity. In all cases, you are likely to see errors at the beginning of the file the first time you run an example, as the various tables and objects are dropped before being created.

Introduction

MDATA in 10g is a new feature for handling document metadata. It introduces a new section type, MDATA_SECTION. Mdata is most likely to be used to avoid or mitigate "mixed query" situations, where a query has to combine textual and structured clauses in a single query. In most cases, it is more efficient to do the search as a text query only - with the mdata part of the text query taking on the role of the scructured clause.

In some respects, mdata sections are similar to ordinary zone or field sections, but with some important differences.

MDATA sections are not tokenized. So "Black Dog" is indexed as a single string, rather than split into "black" and "dog" as would be the case for a field or zone section. MDATA can be updated (in a transactional manner) without affecting the rest of the index. So if a library application keeps track of the number of copies of a book it has in stock, then modifying that information does NOT require you to reindex the whole record.

As a simple example, see the file simple.sql This creates a simple "library stock" application. It has a table with a CLOB column containing XML fragments.

There are two normally-text-indexed fields, title and author. There are also two fields - stocklevel and status - which are indexed using the new MDATA section group. We index the table and then search for a book:

select book_info from library_stock 
where contains (book_info, 
'irving within author and mdata(status, In Stock)') > 0;
Here we can see the basic mdata search syntax. The mdata function takes the name of the section, and then the string to search for. Leading and trailing whitespace is ignored, but otherwise the string MUST be an exact match. "in stock" or "Stock" would not be a match here. Note that no quotes are required or allowed in the string, unless present in the original.

We can look at the some of the data held in the index. Tokens for MDATA sections have token_types in the range 400 to 499 (or -400 to -499 for changes, as we'll see shortly)

select token_text, token_type, token_first, token_last from
dr$lib_index$i where token_type >= 400 or token_type <= -400;

TOKEN_TEXT		       TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT
------------------------------ ---------- ----------- ---------- -----------
0				      401	    3	       3	   1
1				      401	    1	       1	   1
12				      401	    2	       2	   1
In Stock			      400	    1	       2	   2
Out of Stock			      400	    3	       3	   1
Next, we run a PL/SQL block which updates the metadata, using the procedures ctx_ddl.remove_mdata and ctx_ddl.add_mdata. Note that this does not affect the data in the table at all - it ONLY affects the information in the index, and therefore queries that run against that index. Running our select on the index again, we see that there are new rows for the modified sections. These have TOKEN_TYPE values in the range -400 to -499. These are DELTA, or changed records, and avoid the need to perform a (possibly expensive) update to existing columns.
TOKEN_TEXT		       TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT
------------------------------ ---------- ----------- ---------- -----------
0				     -401	    1	       1	   1
0				      401	    3	       3	   1
1				     -401	    1	       1	   0
1				      401	    1	       1	   1
12				      401	    2	       2	   1
In Stock			     -400	    1	       1	   0
In Stock			      400	    1	       2	   2
Out of Stock			     -400	    1	       1	   1
Out of Stock			      400	    3	       3	   1
The example then goes on to show that the data is unaffected by a SYNC of the index, but the negative token types go away once the index is optimised.

Working with XMLType Columns

The file simplexml.sql shows a variation on the previous example. The data is held in a XMLType column, rather than a CLOB. Other than that, and the fact that we can now extract individual elements (such as the title) from the XML text, the example is the same as the one before.

Working with Individual Columns

One problem with the previous two examples is that the metadata in the index gets out of step with the data in the table. Initially we are indexing the data in the XML text. However, as soon as we start changing the metadata, we are changing ONLY tghe data in the index. If we reduce the stocklevel from 1 to 0, the XML text in the table will still read 1. If, instead, we change the XML text in the table, then our table and index will be in step - but we'd have to reinndex the entire XML document, which is something we're trying to avoid.

So simple_uds stores the information in structured columns, and assembles the XML-like text "on the fly" during indexing. This is done using a User Datastore procedure (if you're not familiar with this, see the Oracle Text Developers' Guide for more information).

The procedure my_datastore_doc takes the contents of the title, author and stocklevel columns, and combines them into an XML fragment, together with a generated tag for status - either "In Stock" or "Out of Stock" (this is an improvement on the previous versions, which hard-coded the status flag even though it should really be derived from stocklevel).

A User Datastore index is created on a single column - here we've chosen to use the TITLE column. We therefore need triggers to deal with updates to the other columns. An update to AUTHOR (which is conventionally text-indexed) will trigger a change to the TITLE column (actually just a copy to itself) to force a re-index.

A second trigger deals with updates to the STOCKLEVEL column. This alters the metadata in the index, without requiring the reindexing of the row. If STOCKLEVEL changes to or from zero, the value of STATUS is set to "In Stock" or "Out of Stock" as appropriate.

Why do we bother having separate STOCKLEVEL and STATUS mdata sections? Well, it's not strictly necessary. We cannot do range searches on the STOCKLEVEL value. MDATA only allows equality searches, so we can't do mdata(stocklevel, > 0) for example. However we CAN do something like:

select id, title, stocklevel from library_stock
  where contains (title, '(irving within author) not (mdata(stocklevel, 0))') > 0
However, it is useful to demonstrate the use of derived MDATA, even if it's not completely necessary for this example.

Multiple Values

There is no need to restrict MDATA to a single value per document. The example simple_mult.sql shows a situation where roles are indexed in a many-to-many relationship with users - a user may have multiple roles, and a role may be assigned to multiple users. This is simply a matter of repeating the data - logically at least we could have
<rolename>dba</rolename>
<rolename>admin</rollname>
In fact this example shows a new technique in that the rolenames are never stored in the data at all - they are added only via the ctx_ddl.add_mdata call. This has the advantage that there is no possibility of the index mdata getting out of sync with the table data, and there is no need for triggers. The main disadvantage is that you must load all the mdata AFTER completing the index build.

Getting Around MDATA limitations - (1) Length

The maximum length of an MDATA value is 64 characters. Anything beyond this point will be quietly truncated, both at indexing time and during the query. The effect of this is that when searching for long values you might sometimes get unexpected results.

The next example, lengthlimits.sql, shows a way around this problem. In the first part of the example, we create a simple table containing 71 character "code" values. The two codes vary only in the last character. We then create a user datastore to generate an XML fragment, similar to the last example (note: triggers are omitted for brevity). The user datastore is used to create an index.

Now we're going to find that if we search for one of the values, it finds both, since they've been truncated at 64 characters in the index:

select code from test_length
  where contains (text,
  'mdata(code,
  a123456789b1234567889c123456789d123456789e123456789f123456789g123456789x)'
  ) > 0;

CODE
--------------------------------------------------------------------------------
a123456789b1234567889c123456789d123456789e123456789f123456789g123456789x
a123456789b1234567889c123456789d123456789e123456789f123456789g123456789y

So we need a workaround. The workaround consists of a lookup table, which contains short values to be indexed, and the long values which are actually in the table. We must do a lookup in this table at indexing time - in our user datastore, and at query time via a simple PL/SQL lookup function. That way, the values that are indexed are well within the 64 character limit. You can see this in a lot more detail by following the full example.

The lookup table approach can be used for other types of expansion. For example, although mdata only allows exact matching, you can provide your own stem operator by storing stem expansions in a lookup table. See the example file stems.sql.

Getting Around MDATA limitations - (2) Maximum Sections

There is a limit of 100 mdata sections. If you try to exceed this, you will get a message like:
ERROR at line 1:
ORA-20000: Oracle Text error:
DRG-11110: maximum number of mdata sections exceeded
ORA-06512: at "CTXSYS.DRUE", line 160
ORA-06512: at "CTXSYS.CTX_DDL", line 224
ORA-06512: at line 1
But what if you need more than 100 sections? Given the exact-match nature of mdata sections, instead of using
<SEC1>dog</SEC1>
<SEC2>cat</SEC2>
<SEC3>rabbit</SEC3>
you can use
<SEC>1 dog</SEC>
<SEC>2 cat</SEC>
<SEC>3 rabbit</SEC>
And the search just has to prepend the section number to the search string. This can be extended indefinitely.

(There is no example code for this section)

Logical Partitioning

It is often useful to be able to divide a table into logical "chunks" or partitions. For example, I might want to be able to search on information in the last week, the last month, or the last year. You can do this using a real partitioned table, which is very efficient when you only want to search in that partition - but there are some disadvantages:

Data must be regularly moved between partitions, as it becomes too old for the current partition Searches across more than one partition are less efficient than if all the data was held in a single table.

So a common technique in previous versions was to use a field section:

<DATERANGE>THISWEEK</DATERANGE>
and you could then search for " ... contains (col, ( (terms) and THISWEEK within daterange ) > 0" This approach works well, except that you need to reindex each document every time you modify the DATERANGE element. Using mdata avoids this problem. The query would now become " ... contains ( (terms) and MDATA(daterange, THISWEEK) ) > 0"

Mixed Queries with Range Predicates

"Mixed" queries, containing text operators and a structured range search can be expensive to process. It is often necessary for the kernel to a large number of table or index lookups to satisfy the query. This is why we strive to have the whole query satisfied from the text index. Sometimes, however, this is not possible. Consider the query
 contains (text, 'oracle') > 0
 and date between '12-JAN-2004' and '03-FEB-2004'
Now we can't completely solve that date clause from the text index, at least without using complicated algorithms described elsewhere) but we can use the text index as a "primary filter", by storing (say) the month and year in an mdata section and using a query like:
 contains (text, 'oracle
   and ( mdata(month, JAN2004) OR mdata(month FEB2004) )') > 0
 and date between '12-JAN-2004' and '03-FEB-2004'
This way, the mdata restrictions mean the text search part of the query will only fetch rows from January and February 2004, and the structured part of the search will have very much less work to do to prune these rows to the desired set of dates.

Performance note: Each mdata operator uses a database cursor, so you should not use too many of them. Searching over several years by specifying all the months is probably not a good idea. Instead, you can store different levels of mdata:

<DECADE>2000-2009</DECADE>
<YEAR>2004</YEAR>
<QUARTER>1-2004</QUARTER>
<MONTH>FEB2004</MONTH>
Or use a single repeating field and code the values so they can be distinguished:
<DATERANGE>2000-2009</DATERANGE>
<DATERANGE>YEAR2004</DATERANGE>
<DATERANGE>Q1-2004</DATERANGE>
<DATERANGE>FEB2004</DATERANGE>

Sorting

Given the logical partitioning in the previous section, we can see how this might be extended to help sorting - mimicing the way real partitions make sorting more efficient.

Let's assume we want to search by date, and we need to fetch the first 50 hits. The technique would be to run a query on

DOG and MDATA(daterange, FEB2004)  (current month)
then if you don't get enough hits from this to fill our hitlist, we can move on to
DOG and MDATA(012004,dp) (last month)
and so on. Each invocation needs an ORDER BY to do the fine-grain sorting, but the number of rows being sorted is minimal. In testing, this technique performs relatively well. it is optimal when you only need one partition, so if you know the average date spread of rows returned, they can size the partitions correctly.

The following strategy is even better:

  1. Issue query without any sorting criteria, and fetch the first 50 rows
  2. If you don't get 50 rows, sort what you have and return. Otherwise,
  3. Iterative partition scanning as above
You can use a similar technique with multi-resolution partitions (as discussed in the logical partition section) to improve partition scanning:
  1. Get first 50 rows of first, largest grain partition (i.e. YEAR), no sort
  2. If we've got 50 rows, try next smallest partition (i.e. HALF), no sort ... etc. This minimizes the number of invocations of the same query.
Another variation is to use progressive relaxation. This technique - described in more detail here - allows you to execute a series of query "sequences" intended to return hits in the order the sequences are executed.

The benefits of progressive relaxation are only really seen when a set number of rows are fetched, and the fetches stop before all rows have been fetched.

The file progrelax.sql demonstrates this, using the query:

select score(1), book_info from test_table 
where contains (month_info, '
<query>
  <textquery>
    <progression>
      <seq> mdata(month, 122003) </seq>
      <seq> mdata(month, 112003) </seq>
      <seq> mdata(month, 102003) </seq>
      <seq> mdata(month, 092003) </seq>
      <seq> mdata(month, 082003) </seq>
      <seq> mdata(month, 072003) </seq>
      <seq> mdata(month, 062003) </seq>
      <seq> mdata(month, 052003) </seq>
      <seq> mdata(month, 042003) </seq>
      <seq> mdata(month, 032003) </seq>
      <seq> mdata(month, 022003) </seq>
      <seq> mdata(month, 012003) </seq>
   </progression>
 </textquery>
</query>
',1) > 0 and rownum <= 5;
Note that (roughly speaking, anyway) only enough mdata cursors will be opened to provide the number of rows fetched - so the presence of twelve mdata queries here does not necessarily require twelve seperate cursors.

MDATA Scoring - bug 3407570

In Oracle 10g Release 1, there is a bug with the score returned from mdata. It SHOULD return 0 if no match, or 100 if there is a match. However it actually returns a lower value than 100 - normally 1.

This causes problems with AND queries, since AND always scores the LOWER of the two sides. So a search for

  '(lord of the rings) and mdata(mediatype, DVD)'
will always return a score of 1, since the mdata score overrides the score from the phrase search.

To workaround this, you can force the score of the mdata clause to the maximum (100) like so:

  '(lord of the rings) and mdata(mediatype, DVD)*10*10'

Note the maximum score multiplier is 10 - so you have to do it twice as above.


Last modified: 16-April-2004 by Roger Ford


E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy