As Published In

Oracle Magazine
September/October 2004
TECHNOLOGY: Oracle Text

Optimize Your Text Retrieval
By Ron Hardman

Fine-tune Oracle Text applications with CTX_REPORT.

Next to its employees, information is a company's biggest asset. How company data is managed can provide strategic advantage or put a company at a huge disadvantage against competitors. Information retrieval (IR) focuses on accurately and efficiently managing data, turning it into useful information that improves the business. Oracle's IR solution is Oracle Text.

This article provides an overview of Oracle Text, creates and tests an Oracle Text index, and shows you how to optimize your Oracle Text applications by using the reporting package, CTX_REPORT.

Introducing Oracle Text

Oracle Text is included in both the Standard and Enterprise editions of Oracle9i Release 1 and Release 2 and Oracle 10g. Its functionality is a core component of iFS, Ultra Search, and XML DB.

What can you do with Oracle Text? Let's start with searching. Not basic SQL searching; I'm talking about full-text retrieval against virtually any datatype (including all LOB types). The indexing and searching abilities of Oracle Text are not restricted to data stored in the database. It can index and search documents stored on the file system and index more than 150 document types, including Microsoft Word, PDF, and XML.

Oracle Text search functionality includes fuzzy, stemming (search for mice and find mouse), wildcard, proximity, results ranking, and keyword highlighting. You can even add a thesaurus to make a search on widget and find documents that contain the word gadget.

Not Your Average Index

Creating an Oracle Text index is completely different from creating any other database index. To demonstrate, I'll first create a user with the necessary permissions and a sample table with seed data for an online store. To do this, I run the contents of Listing 1 as the SYS or SYSTEM user. In Listing 1, note that the CTXAPP role is granted to TEXT_USER. This role provides the permissions required to create the index and all dependent structures.

Before creating the index, I need to determine which features I want to use and create the necessary structures to support the index. I will create a CONTEXT index, the most frequently used type of Oracle Text index. I do not need multilingual features, a thesaurus, or themes to be generated, so a simple index creation is sufficient. The basic index creation is broken into two steps:

Step 1: Create the preferences. Preferences tell Oracle Text how my index should be stored, how data should be filtered, the language(s) that will be indexed, and how fuzzy and stemming queries should be treated. I use the CTX_DDL built-in package to create the preferences and set the attributes. The CTX_DDL package is owned by the user CTXSYS and provides the ability to maintain preferences, stoplists, and attributes. The TEXT_USER schema (user) employed for the examples has EXECUTE permissions on CTX_DDL, because it was granted the CTXAPP role.

The following creates the preferences I want to use:

BEGIN CTX_DDL.CREATE_PREFERENCE(
'english_lexer','basic_lexer');

CTX_DDL.SET_ATTRIBUTE(
'english_lexer','index_themes',
'no');
END;
/

The CREATE_PREFERENCE procedure establishes the name and type of LEXER to be used. A LEXER decides how text is broken apart for indexing, and it can be set for different languages. Here I use the BASIC_LEXER object type, which supports most Western European, white-space-delimited languages. I use the SET_ATTRIBUTE procedure to instruct Oracle Text not to generate themes.

Step 2: Create the index. Having created my LEXER preference, I can create my index as follows:

CREATE INDEX song_index ON songs(title)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS('LEXER english_lexer 
    STOPLIST ctxsys.default_stoplist');

The index creation includes a clause for setting the INDEXTYPE. I specified CTXSYS.CONTEXT as the type of index to create. CTXSYS is the Oracle Text schema, and this schema owns Oracle Text built-in features. The CTXAPP role granted to my user (TEXT_USER) provides permissions to use this index, but it must still be fully qualified with the schema name. CONTEXT is the type of index I am creating.

I need to include only two parameters in the PARAMETERS() clause for this example. The first parameter is the LEXER, which is required. I created a LEXER preference in Step 1, and I provide the name of that preference as the parameter value here.

The second parameter in this example is the STOPLIST. Stoplists provide Oracle with a list of words not to index. Typical words to exclude are of, the, a, is, and so on. In the example, the default stoplist is provided as the parameter value. This stoplist is shipped with Oracle Database and is owned by the CTXSYS user. The CTXAPP role granted to my user provides permissions to use this stoplist, but it must be fully qualified with the schema name.

Testing the Index

I can test my index by running a query using the CONTAINS() operator, as shown in Listing 2.

The first thing you probably notice in Listing 2 is the use of the SCORE() operator. SCORE() ranks search results by relevance and provides a numeric value in the result set that allows me to determine which values are the best matches to the search criteria. In this query, SCORE() is given a label of 1 (inside the parentheses) that corresponds to the matching CONTAINS() search. The last argument in the CONTAINS() clause has the same label value, 1. If there are multiple CONTAINS() operators, separate labels can be used with SCORE() and CONTAINS() to determine the relevance for each. The value provided as the label is irrelevant to the score.

Note that in Listing 2, I do not use a wildcard (%) character when I search for a word. This is because Oracle Text creates tokens in its indexing process. These tokens, generated based on the LEXER defined in the index creation, are words and phrases found in the text. When a query is issued with the CONTAINS() operator, the Oracle Text index tables—not the document where the text originated—are searched for a match. Because I am searching tokens (words or phrases) and not blocks of text, a wildcard is not necessary if I supply the full term. If a token matches the keyword, the source document is retrieved as a match.

However, there are occasions when a wildcard character would be used. If I have two documents, one with the word pen and the other with the word pencil, I would need to search for pen% in order to get both records.

Another type of query that uses the CONTAINS() operator is called stemming. Some words, although originating from the same root or stem, are not spelled similarly enough for a wildcard to be of any use. For example, a search for mice% will not return mouse. If I know the root form of a word, all variants of that word can be matched, as shown in the query in Listing 3.

Because mice and mouse share the same stem, the query in Listing 3 returns both rows with the same score. I can find either form by adding a $ in front of the keyword mice. I can also include stemming in my application design so it is transparent to the end user. I discuss this more in the next section.

Fine-Tuning My Text-Based Application

Tuning and administering Oracle Text applications have historically been ad hoc processes. Developers have hunted through the data dictionary and developed scripts in an attempt to monitor and maintain Oracle Text indexes. Oracle9i Release 2 introduced a set of reports, generated by the CTX_REPORT package, specifically for Oracle Text indexes. All reports in Oracle9i Release 2 were output in text format (CLOB). Oracle added XML-style output to many of its CTX_REPORT reports in Oracle 10g.

One of the most useful reports for tuning Oracle Text applications is QUERY_LOG_SUMMARY. This report lets you analyze your indexes and queries to maximize effectiveness by checking how queries match data. Let's use the table and index I created previously to see how to use QUERY_LOG_SUMMARY.

In my online store's SONGS table (created in Listing 1), I have a song called "My Mouse Won't Work Blues." This is a Top 40 hit, so I know it should be flying off the shelves, but nobody is buying it. One place to look for reasons is the QUERY_LOG_SUMMARY report. Using the QUERY_LOG_SUMMARY report, I can see if people are performing searches for similar titles but not the exact title, meaning that the song is never returned as a match.

The first thing I need to do is begin logging the queries. I do this with the CTX_OUTPUT package, as follows:

exec ctx_output.start_query_log(
  'text_index.txt');

With the query log turned on, I will run the same incorrect query 100 times as follows:

DECLARE
   v_title VARCHAR2(50);
BEGIN
  FOR y IN 1..100
  LOOP
    BEGIN
      SELECT title
      INTO v_title
      FROM songs
        WHERE ID = 3
        AND CONTAINS (title, 
                    'mice') > 0;

      EXCEPTION
      WHEN NO_DATA_FOUND
      THEN 
         NULL;
    END;
  END LOOP;

END;
/

After the querying has finished, I can turn logging off, as follows:

exec ctx_output.end_query_log;

Using the QUERY_LOG_SUMMARY procedure, I can evaluate the log file contents using the following:

SET SERVEROUTPUT ON
-- Get the report
-- Run this as CTXSYS

DECLARE
  v_report CTX_REPORT.QUERY_TABLE;
BEGIN
   CTX_REPORT.QUERY_LOG_SUMMARY(
        'text_index.txt',
        'song_index',
        v_report,
        5,
        TRUE,
        FALSE);

  FOR y in 1..v_report.count
  LOOP
    DBMS_OUTPUT.PUT_LINE(
                'The query for'
                ||v_report(y).query
                ||' ran and failed '
                ||v_report(y).times
                ||' times.');
  END LOOP;
EXCEPTION 
  WHEN OTHERS
  THEN
    DBMS_OUTPUT.PUT_LINE(sqlerrm);
END;
/

This returns the following result:

The query for mice ran and failed 100 times. 

The reason for the search failure is obvious. ID 3 contains the word mouse, not mice. Having identified the problem with my keyword search, I can correct the query by adding stemming ($), as described in the last section:

SELECT title
  FROM songs
    WHERE ID = 3
    AND CONTAINS (title, '$mice') > 0;

TITLE
-------------------------
My Mouse Won't Work Blues

This search returns the record I want, so I add stemming to the online store's search—I specify that all user searches include the '$' by default. This results in successful searches on mouse and mice, and soon the CDs are selling.

In addition to making stemming searches transparent to the user, I could also include a menu item or an advanced search screen where stemming can be selected. I could also add an administrative option for user-specific default search settings.

Generating XML Output

The INDEX_STATS procedure supports both formatted text and XML output. The following code creates the INDEX_STATS report in text format:

-- Table to store our report
CREATE TABLE index_report (id 
            NUMBER(10), report CLOB);

-- Anonymous block to execute 
-- CTX_REPORT.INDEX_STATS and insert
-- the report into our table.
DECLARE
  v_report CLOB := null;
BEGIN
  CTX_REPORT.INDEX_STATS(
           index_name => 'SONG_INDEX',
           report => v_report,
           part_name => NULL,
           frag_stats => NULL,
           list_size => 20,
           report_format => NULL);

  INSERT INTO index_report (id, report)
    VALUES (1, v_report);

   COMMIT;

   DBMS_LOB.FREETEMPORARY(v_report);
END;
/

I can view the output with the following:

SET PAGES 9999 HEAD OFF LONG 64000
SPOOL index_stats.txt

SELECT report
FROM index_report
WHERE id = 1;

SPOOL OFF

Listing 4 displays the truncated output. Although this is a big improvement over ad hoc scripts, in Oracle Database 10g I am able to provide a more flexible way of displaying information. Now I rerun the INDEX_STATS report, this time specifying XML as the style, as shown in this code:
Next Steps

LEARN more about Oracle Text

DOWNLOAD sample code for this article

DECLARE
  v_report CLOB := null;
BEGIN
  CTX_REPORT.INDEX_STATS(
           index_name => 'SONG_INDEX',
           report => v_report,
           part_name => NULL,
           frag_stats => NULL,
           list_size => 20,
           report_format => 'XML');

  INSERT INTO index_report (report)
    VALUES (v_report);

  COMMIT;
  DBMS_LOB.FREETEMPORARY(v_report);
END;
/

SET PAGES 9999 HEAD OFF LONG 64000
SPOOL index_stats.xml
SELECT * FROM index_report;
SPOOL OFF
 

Check the XML output from your SELECT query to ensure that it contains only XML content. Any non-XML text in the output will result in an error when you open the browser.

The XML report generated by the SELECT query isn't the easiest to read. I can improve the output by using a stylesheet. I use XML Spy to generate a stylesheet; Figure 1 shows the same INDEX_STATS report in HTML format after XML Spy performs an XSL Transformation.

figure 1
Figure 1: INDEX_STATS report in HTML format

The XSL Transformation (XSLT) file, the HTML file, the XML file used as the datasource, and the XML Schema Definition (XSD) file are all available for download.

Conclusion

This article has demonstrated a few of the many features of Oracle Text. It has also shown how you can fine-tune Oracle Text applications by using reports generated by the CTX_REPORT package. These reports, new in Oracle9i Release 2 and enhanced in Oracle 10g, help with everything from creating missing index creation scripts to troubleshooting faulty queries, and they now provide the ability to report results in XML.


Ron Hardman (ron.hardman@oracle.com) is a technical specialist with Oracle and coauthor of Oracle 10g PL/SQL Programming (Oracle Press). He has worked with Oracle information retrieval technologies for eight years.



Please rate this document:

Excellent Good Average Below Average Poor


Send us your comments

E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy