Advanced MDATA - tips and tricks

About this Document

This document assumes a working knowledge of Oracle Text features. It presents a variety of techniques, along with code samples. It is important to read the code samples as you work through the document - the main text does not cover all the details that can be understood by reading the code and its comments.

The examples aim to be self-contained, in that each includes the complete code necessary to demonstrate the feature in question. They are not, however, complete applications - for example, in a number of places a trigger would be needed to keep the indexes up to date, but this has been omitted for brevity. In all cases, you are likely to see errors at the beginning of the output the first time you run an example, as the various tables and objects are dropped before being created.

Introduction

In some respects, mdata sections are similar to ordinary zone or field sections, but with some important differences.

MDATA sections are not tokenized. So "Black Dog" is indexed as a single string, rather than split into "black" and "dog" as would be the case for a field or zone section. MDATA can be updated (in a transactional manner) without affecting the rest of the index. So if a library application keeps track of the number of copies of a book it has in stock, then modifying that information does NOT require you to reindex the whole record.
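As a sketch of how an MDATA section is declared (the group and section names below are illustrative, not taken from the example files), CTX_DDL.ADD_MDATA_SECTION is used alongside the familiar field-section calls:

```sql
-- Illustrative sketch: declare STATUS as an MDATA section, so its value
-- is indexed as a single non-tokenized string, alongside an ordinary
-- field section for AUTHOR.
begin
  ctx_ddl.create_section_group('book_group', 'BASIC_SECTION_GROUP');
  ctx_ddl.add_field_section ('book_group', 'author', 'author', true);
  ctx_ddl.add_mdata_section ('book_group', 'status', 'status');
end;
/
```

The section group is then named in the PARAMETERS clause of CREATE INDEX in the usual way.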

As a simple example, see the file simple.sql. This creates a simple "library stock" application, with a table whose CLOB column contains XML fragments.

There are two conventionally text-indexed fields, title and author. There are also two fields - stocklevel and status - which are indexed as MDATA sections. We index the table and then search for a book:

                               
select book_info from library_stock 
where contains (book_info, 
'irving within author and mdata(status, In Stock)') > 0;

                            

We can look at some of the data held in the index. Tokens for MDATA sections have token types in the range 400 to 499 (or -400 to -499 for changes, as we'll see shortly):

                               
select token_text, token_type, token_first, token_last, token_count
from dr$lib_index$i
where token_type between 400 and 499
   or token_type between -499 and -400;

TOKEN_TEXT                     TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT
------------------------------ ---------- ----------- ---------- -----------
0                                     401           3          3           1
1                                     401           1          1           1
12                                    401           2          2           1
In Stock                              400           1          2           2
Out of Stock                          400           3          3           1

                            
Running the same query again after one row's MDATA has been updated also shows the change rows, recorded with negated token types:
TOKEN_TEXT                     TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT
------------------------------ ---------- ----------- ---------- -----------
0                                    -401           1          1           1
0                                     401           3          3           1
1                                    -401           1          1           0
1                                     401           1          1           1
12                                    401           2          2           1
In Stock                             -400           1          1           0
In Stock                              400           1          2           2
Out of Stock                         -400           1          1           1
Out of Stock                          400           3          3           1

                            

Working with XMLType Columns

simplexml.sql
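The core idea, sketched here under assumed names (see simplexml.sql for the real code), is that the same MDATA machinery applies when the documents are stored in an XMLType column rather than a CLOB:

```sql
-- Sketch with assumed names: an XML section group containing an MDATA
-- section, used to index an XMLType column.
create table library_xml (id number, book_info xmltype);

begin
  ctx_ddl.create_section_group('xml_group', 'XML_SECTION_GROUP');
  ctx_ddl.add_mdata_section('xml_group', 'status', 'status');
end;
/

create index lib_xml_index on library_xml (book_info)
  indextype is ctxsys.context
  parameters ('section group xml_group');
```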

Working with Individual Columns

The next example, simple_uds.sql, stores the information in structured columns, and assembles the XML-like text "on the fly" during indexing. This is done using a User Datastore procedure (if you're not familiar with this, see the Oracle Text Developers' Guide for more information).

The procedure my_datastore_doc takes the contents of the title, author and stocklevel columns, and combines them into an XML fragment, together with a generated tag for status - either "In Stock" or "Out of Stock" (this is an improvement on the previous versions, which hard-coded the status flag even though it should really be derived from stocklevel).
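The shape of such a procedure, heavily abbreviated and with assumed names (the real version is in the example file), looks like this:

```sql
-- Abbreviated sketch of a user datastore procedure.  Oracle Text calls
-- it once per row at indexing time, passing the rowid and an output CLOB.
create or replace procedure my_datastore_doc
  (rid in rowid, tlob in out nocopy clob) is
begin
  for r in (select title, author, stocklevel
            from library_stock where rowid = rid) loop
    tlob := '<title>'      || r.title      || '</title>'
         || '<author>'     || r.author     || '</author>'
         || '<stocklevel>' || r.stocklevel || '</stocklevel>'
         || '<status>'
         || case when r.stocklevel > 0 then 'In Stock'
                 else 'Out of Stock' end
         || '</status>';
  end loop;
end;
/

-- The procedure is attached to the index via a USER_DATASTORE preference:
begin
  ctx_ddl.create_preference('my_ds', 'USER_DATASTORE');
  ctx_ddl.set_attribute('my_ds', 'procedure', 'my_datastore_doc');
end;
/
```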

A User Datastore index is created on a single column - here we've chosen to use the TITLE column. We therefore need triggers to deal with updates to the other columns. An update to AUTHOR (which is conventionally text-indexed) will trigger a change to the TITLE column (actually just a copy to itself) to force a re-index.

A second trigger deals with updates to the STOCKLEVEL column. This alters the metadata in the index, without requiring the reindexing of the row. If STOCKLEVEL changes to or from zero, the value of STATUS is set to "In Stock" or "Out of Stock" as appropriate.
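The in-place metadata change is done with CTX_DDL.REMOVE_MDATA and CTX_DDL.ADD_MDATA. A standalone sketch (index and table names follow the earlier example; the id value is made up):

```sql
-- Swap the STATUS metadata for one row, leaving the rest of the index
-- untouched - no reindex of the row is needed.
declare
  rid rowid;
begin
  select rowid into rid from library_stock where id = 1;
  ctx_ddl.remove_mdata('lib_index', 'status', 'In Stock', rid);
  ctx_ddl.add_mdata   ('lib_index', 'status', 'Out of Stock', rid);
  commit;
end;
/
```

In the example file, equivalent calls are made from the trigger body so the change happens automatically whenever STOCKLEVEL crosses zero.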

Why do we bother having separate STOCKLEVEL and STATUS mdata sections? Strictly speaking, it's not necessary. MDATA only allows equality searches, so we cannot do range searches on the STOCKLEVEL value - mdata(stocklevel, > 0), for example, is not valid. However, we CAN do something like:

                               
select id, title, stocklevel from library_stock
  where contains (title, '(irving within author) not (mdata(stocklevel, 0))') > 0;

                            

Multiple Values

simple_mult.sql demonstrates that, unlike a field section, an MDATA section can occur several times in one document, with each occurrence contributing a separate value:
                               
<rolename>dba</rolename>
<rolename>admin</rolename>

                            
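A query on any one of the values then matches the document. For example (table and column names are assumed for this fragment):

```sql
-- Matches documents whose ROLENAME mdata section contains the value
-- 'dba', regardless of what other role values are also present.
select id from user_docs
  where contains (doc, 'mdata(rolename, dba)') > 0;
```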

Getting Around MDATA limitations - (1) Length

MDATA values are limited to 64 characters; anything longer is truncated at indexing time. The next example, lengthlimits.sql, shows a way around this limit. In the first part of the example, we create a simple table containing two long "code" values which vary only in the last character. We then create a user datastore to generate an XML fragment, similar to the last example (note: triggers are again omitted for brevity), and use it to build an index.

Now if we search for one of the values, we find both, since they have been truncated to 64 characters in the index:

                               
select code from test_length
  where contains (text,
  'mdata(code,
  a123456789b1234567889c123456789d123456789e123456789f123456789g123456789x)'
  ) > 0;

CODE
--------------------------------------------------------------------------------
a123456789b1234567889c123456789d123456789e123456789f123456789g123456789x
a123456789b1234567889c123456789d123456789e123456789f123456789g123456789y

                            

So we need a workaround. The workaround consists of a lookup table mapping short surrogate values, which are what actually get indexed, to the long values really held in the table. We do a lookup in this table at indexing time - in our user datastore - and at query time, via a simple PL/SQL lookup function. That way, the values that are indexed are well within the 64-character limit. You can see this in a lot more detail by following the full example.
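A minimal sketch of the lookup approach (names here are illustrative; lengthlimits.sql has the full version):

```sql
-- Short surrogate values are indexed in place of the over-long codes.
create table code_lookup (
  short_code varchar2(10),   -- what actually gets indexed (well under 64)
  long_code  varchar2(200)   -- the real value held in the data table
);

-- Query-time translation from the long code to its indexed surrogate.
create or replace function short_code (p_long in varchar2)
  return varchar2 is
  v varchar2(10);
begin
  select short_code into v from code_lookup where long_code = p_long;
  return v;
end;
/

-- At query time the long value is translated before it reaches MDATA:
-- select code from test_length where contains (text,
--   'mdata(code, ' || short_code(:long_value) || ')') > 0;
```

The user datastore performs the same translation when it builds the XML fragment, so index and query always agree on the surrogate.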

The lookup table approach can be used for other types of expansion. For example, although mdata only allows exact matching, you can provide your own stem operator by storing stem expansions in a lookup table. See the example file stems.sql.

Getting Around MDATA limitations - (2) Maximum Sections

Only a limited number of MDATA sections are allowed in a single section group. Adding one too many fails with:
ERROR at line 1:
ORA-20000: Oracle Text error:
DRG-11110: maximum number of mdata sections exceeded
ORA-06512: at "CTXSYS.DRUE", line 160
ORA-06512: at "CTXSYS.CTX_DDL", line 224
ORA-06512: at line 1

                            
The workaround is to fold the section identity into the indexed value, so that many logical sections share one physical MDATA section. Instead of:
<SEC1>dog</SEC1>
<SEC2>cat</SEC2>
<SEC3>rabbit</SEC3>

                            
we index:
<SEC>1 dog</SEC>
<SEC>2 cat</SEC>
<SEC>3 rabbit</SEC>

                            

A query against logical section 2 is then written mdata(sec, 2 cat). (There is no example code for this section.)

Logical Partitioning

Data must be regularly moved between partitions as it becomes too old for the current partition, and searches across more than one partition are less efficient than if all the data was held in a single table.

So a common technique in previous versions was to use a field section:

                               
<DATERANGE>THISWEEK</DATERANGE>

                            
" ... contains (col, '(terms) and THISWEEK within daterange') > 0"

With an MDATA section, this becomes:

" ... contains (col, '(terms) and MDATA(daterange, THISWEEK)') > 0"

Mixed Queries with Range Predicates

A common requirement is to mix a text query with a structured date-range predicate:
contains (text, 'oracle') > 0
 and date between '12-JAN-2004' and '03-FEB-2004'

                            
By also storing month values as MDATA, the date restriction can be partly pushed inside the CONTAINS clause, letting the text index do more of the work:
contains (text, 'oracle
   and ( mdata(month, JAN2004) OR mdata(month, FEB2004) )') > 0
 and date between '12-JAN-2004' and '03-FEB-2004'

                            

Performance note: Each mdata operator uses a database cursor, so you should not use too many of them. Searching over several years by specifying all the months is probably not a good idea. Instead, you can store different levels of mdata:

                               
<DECADE>2000-2009</DECADE>
<YEAR>2004</YEAR>
<QUARTER>1-2004</QUARTER>
<MONTH>FEB2004</MONTH>

                            
or, equivalently, a single section with self-describing values:
<DATERANGE>2000-2009</DATERANGE>
<DATERANGE>YEAR2004</DATERANGE>
<DATERANGE>Q1-2004</DATERANGE>
<DATERANGE>FEB2004</DATERANGE>

                            
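With values stored at several grains, a long range needs only a few MDATA operators. As a sketch (table, column and value names are assumed), a query covering January 2003 through February 2004 needs just three:

```sql
-- Three MDATA operators cover the whole fourteen-month range, instead
-- of one operator per month.
select id from docs where contains (text,
  'oracle and ( mdata(daterange, YEAR2003)
             or mdata(daterange, JAN2004)
             or mdata(daterange, FEB2004) )') > 0
  and doc_date between '01-JAN-2003' and '29-FEB-2004';
```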

Sorting

Let's assume we want the results sorted by date, and we need to fetch only the first 50 hits. The technique is to run a query on the current month first:

                               
DOG and MDATA(daterange, FEB2004)  (current month)

followed, if too few hits are returned, by:
                               
DOG and MDATA(daterange, JAN2004)  (previous month)

                            

The following strategy is even better:

  1. Issue the query without any sorting criteria, and fetch the first 50 rows.
  2. If you don't get 50 rows, sort what you have and return. Otherwise:
  3. Scan the date partitions iteratively, as above:
     1. Get the first 50 rows of the first, largest-grain partition (i.e. YEAR), with no sort.
     2. If we still haven't got 50 rows, try the next-smaller grain (i.e. HALF), again with no sort ... etc.

  This minimizes the number of invocations of the same query.

The benefits of progressive relaxation are only really seen when a set number of rows are fetched, and the fetches stop before all rows have been fetched.

The file progrelax.sql demonstrates this, using the query:

                               
select score(1), book_info from test_table 
where contains (month_info, '
<query>
  <textquery>
    <progression>
      <seq> mdata(month, 122003) </seq>
      <seq> mdata(month, 112003) </seq>
      <seq> mdata(month, 102003) </seq>
      <seq> mdata(month, 092003) </seq>
      <seq> mdata(month, 082003) </seq>
      <seq> mdata(month, 072003) </seq>
      <seq> mdata(month, 062003) </seq>
      <seq> mdata(month, 052003) </seq>
      <seq> mdata(month, 042003) </seq>
      <seq> mdata(month, 032003) </seq>
      <seq> mdata(month, 022003) </seq>
      <seq> mdata(month, 012003) </seq>
   </progression>
 </textquery>
</query>
',1) > 0 and rownum <= 5;

                            

MDATA Scoring - bug 3407570

The MDATA operator returns a very low score (bug 3407570). This causes problems with AND queries, since AND always scores the LOWER of the two sides. So a search for

                               
 '(lord of the rings) and mdata(mediatype, DVD)'

                            

will be scored on the mdata clause rather than the text query. To work around this, you can force the score of the mdata clause up to the maximum (100) like so:

                               
 '(lord of the rings) and mdata(mediatype, DVD)*10*10'

                            

Note that the maximum score multiplier is 10, so you have to apply it twice, as above.


Last modified: 16-April-2004 by Roger Ford

