It assumes a working knowledge of Oracle Text features. The document presents a variety of techniques, along with code samples. It is important to read the code samples as you work through the document - the main text does not cover all the details that can be understood by reading the code and comments.
The examples aim to be self-contained - in that they contain the complete code necessary to demonstrate the feature in question. However, they may not be complete - for example in a number of places a trigger would be necessary to keep the indexes up to date, but this has been omitted for brevity. In all cases, you are likely to see errors at the beginning of the file the first time you run an example, as the various tables and objects are dropped before being created.
In some respects, mdata sections are similar to ordinary zone or field sections, but with some important differences.
MDATA sections are not tokenized. So "Black Dog" is indexed as a single string, rather than split into "black" and "dog" as would be the case for a field or zone section. MDATA can be updated (in a transactional manner) without affecting the rest of the index. So if a library application keeps track of the number of copies of a book it has in stock, then modifying that information does NOT require you to reindex the whole record.
As a simple example, see the file simple.sql This creates a simple "library stock" application. It has a table with a CLOB column containing XML fragments.
There are two normally-text-indexed fields, title and author. There are also two fields - stocklevel and status - which are indexed using the new MDATA section group. We index the table and then search for a book:
select book_info from library_stock where contains (book_info, 'irving within author and mdata(status, In Stock)') > 0;
We can look at the some of the data held in the index. Tokens for MDATA sections have token_types in the range 400 to 499 (or -400 to -499 for changes, as we'll see shortly)
select token_text, token_type, token_first, token_last from dr$lib_index$i where token_type >= 400 or token_type <= -400; TOKEN_TEXT TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT ------------------------------ ---------- ----------- ---------- ----------- 0 401 3 3 1 1 401 1 1 1 12 401 2 2 1 In Stock 400 1 2 2 Out of Stock 400 3 3 1
TOKEN_TEXT TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT ------------------------------ ---------- ----------- ---------- ----------- 0 -401 1 1 1 0 401 3 3 1 1 -401 1 1 0 1 401 1 1 1 12 401 2 2 1 In Stock -400 1 1 0 In Stock 400 1 2 2 Out of Stock -400 1 1 1 Out of Stock 400 3 3 1
So simple_uds stores the information in structured columns, and assembles the XML-like text "on the fly" during indexing. This is done using a User Datastore procedure (if you're not familiar with this, see the Oracle Text Developers' Guide for more information).
The procedure my_datastore_doc takes the contents of the title, author and stocklevel columns, and combines them into an XML fragment, together with a generated tag for status - either "In Stock" or "Out of Stock" (this is an improvement on the previous versions, which hard-coded the status flag even though it should really be derived from stocklevel).
A User Datastore index is created on a single column - here we've chosen to use the TITLE column. We therefore need triggers to deal with updates to the other columns. An update to AUTHOR (which is conventionally text-indexed) will trigger a change to the TITLE column (actually just a copy to itself) to force a re-index.
A second trigger deals with updates to the STOCKLEVEL column. This alters the metadata in the index, without requiring the reindexing of the row. If STOCKLEVEL changes to or from zero, the value of STATUS is set to "In Stock" or "Out of Stock" as appropriate.
Why do we bother having separate STOCKLEVEL and STATUS mdata sections? Well, it's not strictly necessary. We cannot do range searches on the STOCKLEVEL value. MDATA only allows equality searches, so we can't do mdata(stocklevel, > 0) for example. However we CAN do something like:
select id, title, stocklevel from library_stock where contains (title, '(irving within author) not (mdata(stocklevel, 0))') > 0
The next example, lengthlimits.sql, shows a way around this problem. In the first part of the example, we create a simple table containing 71 character "code" values. The two codes vary only in the last character. We then create a user datastore to generate an XML fragment, similar to the last example (note: triggers are omitted for brevity). The user datastore is used to create an index.
Now we're going to find that if we search for one of the values, it finds both, since they've been truncated at 64 characters in the index:
select code from test_length where contains (text, 'mdata(code, a123456789b1234567889c123456789d123456789e123456789f123456789g123456789x)' ) > 0; CODE -------------------------------------------------------------------------------- a123456789b1234567889c123456789d123456789e123456789f123456789g123456789x a123456789b1234567889c123456789d123456789e123456789f123456789g123456789y
So we need a workaround. The workaround consists of a lookup table, which contains short values to be indexed, and the long values which are actually in the table. We must do a lookup in this table at indexing time - in our user datastore, and at query time via a simple PL/SQL lookup function. That way, the values that are indexed are well within the 64 character limit. You can see this in a lot more detail by following the full example.
The lookup table approach can be used for other types of expansion. For example, although mdata only allows exact matching, you can provide your own stem operator by storing stem expansions in a lookup table. See the example file stems.sql.
ERROR at line 1: ORA-20000: Oracle Text error: DRG-11110: maximum number of mdata sections exceeded ORA-06512: at "CTXSYS.DRUE", line 160 ORA-06512: at "CTXSYS.CTX_DDL", line 224 ORA-06512: at line 1
<SEC1>dog</SEC1> <SEC2>cat</SEC2> <SEC3>rabbit</SEC3>
<SEC>1 dog</SEC> <SEC>2 cat</SEC> <SEC>3 rabbit</SEC>
(There is no example code for this section)
Data must be regularly moved between partitions, as it becomes too old for the current partition Searches across more than one partition are less efficient than if all the data was held in a single table.
So a common technique in previous versions was to use a field section:
" ... contains (col, ( (terms) and THISWEEK within daterange ) > 0"
" ... contains ( (terms) and MDATA(daterange, THISWEEK) ) > 0"
contains (text, 'oracle') > 0 and date between '12-JAN-2004' and '03-FEB-2004'
contains (text, 'oracle and ( mdata(month, JAN2004) OR mdata(month FEB2004) )') > 0 and date between '12-JAN-2004' and '03-FEB-2004'
Performance note: Each mdata operator uses a database cursor, so you should not use too many of them. Searching over several years by specifying all the months is probably not a good idea. Instead, you can store different levels of mdata:
<DECADE>2000-2009</DECADE> <YEAR>2004</YEAR> <QUARTER>1-2004</QUARTER> <MONTH>FEB2004</MONTH>
<DATERANGE>2000-2009</DATERANGE> <DATERANGE>YEAR2004</DATERANGE> <DATERANGE>Q1-2004</DATERANGE> <DATERANGE>FEB2004</DATERANGE>
Let's assume we want to search by date, and we need to fetch the first 50 hits. The technique would be to run a query on
DOG and MDATA(daterange, FEB2004) (current month)
DOG and MDATA(012004,dp) (last month)
The following strategy is even better:
The benefits of progressive relaxation are only really seen when a set number of rows are fetched, and the fetches stop before all rows have been fetched.
The file progrelax.sql demonstrates this, using the query:
select score(1), book_info from test_table where contains (month_info, ' <query> <textquery> <progression> <seq> mdata(month, 122003) </seq> <seq> mdata(month, 112003) </seq> <seq> mdata(month, 102003) </seq> <seq> mdata(month, 092003) </seq> <seq> mdata(month, 082003) </seq> <seq> mdata(month, 072003) </seq> <seq> mdata(month, 062003) </seq> <seq> mdata(month, 052003) </seq> <seq> mdata(month, 042003) </seq> <seq> mdata(month, 032003) </seq> <seq> mdata(month, 022003) </seq> <seq> mdata(month, 012003) </seq> </progression> </textquery> </query> ',1) > 0 and rownum <= 5;
This causes problems with AND queries, since AND always scores the LOWER of the two sides. So a search for
'(lord of the rings) and mdata(mediatype, DVD)'
To workaround this, you can force the score of the mdata clause to the maximum (100) like so:
'(lord of the rings) and mdata(mediatype, DVD)*10*10'
Note the maximum score multiplier is 10 - so you have to do it twice as above.
Last modified: 16-April-2004 by Roger Ford