Developer: Search
  DOWNLOAD
Oracle Database
  TAGS
search, contentmanagement, database, All

Building Full-Text Search Applications with Oracle Text


by Marko Asplund

Learn how to leverage the Oracle Database to implement full-text search in your applications.

Published April 2007

Oracle Text is a powerful search technology built into all Oracle Database editions, including the free Express Edition (XE). The development APIs provided by Oracle Text allow software developers to easily implement full-featured content search applications.

Oracle Text can be used to search structured and unstructured documents complementing the SQL wildcard matching. Oracle Text supports combining multiple search terms with basic Boolean operators (AND, OR, NOT, NEAR, and so forth), and it also has more-advanced features such as soundex and fuzzy searches as well as result ranking. The technology supports hundreds of file types, including Microsoft Office and PDF. Oracle Text is suitable for a wide variety of search-related use cases and storage structures. Application areas for Text include e-business, document and records management as well as issue tracking just to name a few. Retrievable text can reside in a structured form inside the database or in unstructured form either in a local file system or on the Web.

Oracle Text provides a complete SQL-based search API that consists of custom query operators, DDL syntax extensions, a set of PL/SQL procedures and database views. Text API gives the application developer full control over indexing, queries, security, presentation, and software configuration that is sometimes required. This is especially true if you're developing off-the-shelf, and not custom, software. With an off-the-shelf software product, you want to keep the software configuration as compact and simple as possible, even if it means doing a bit more up-front work in product development. Reducing the complexity of your application usually pays off later in the product lifecycle, especially in support, maintenance, and future product development phases.

Oracle Text also supports document level authorization, which is typically difficult to generalize while preserving high performance. With Text mixed queries combining both relational and unstructured data are well supported. For authorization this means you can combine full-text search and authorization in a single query. The amount of separate result sets and filtering phases required to get the final results can be minimized which simplifies application development. Oracle Text leaves the application developer's hands untied for doing performance optimizations.

Oracle Text is also programming-language agnostic and works equally well for PHP as well as Java applications.

Some time ago I needed to improve the search capabilities of an enterprise content management (ECM) system. I started the project by evaluating the use of Oracle Text. During the evaluation, Oracle Text proved to be a very viable technology to build the application search on: Oracle Text includes advanced search features, has extensive support for different file types, and is highly customizable and scalable. One drawback with the old search technology was that you would need to run a search on file content outside the database; then a database metadata search; authorize the results; and, finally, merge the separate result sets. With Oracle Text, you could do all of this in the database. The ECM system was already using Oracle Database for storing metadata. Because the technology was already in place, it was a natural choice and did not bring any additional cost to the customers.

A naive approach to implementing free-text search queries in a database could look something like this:

	  
SELECT * FROM issues
WHERE LOWER(author) LIKE '%word1%' AND LOWER(author) LIKE '%word2%' ...	 
 

Using this technique, each keyword needs to be separately matched against each column, where it could appear to match the keywords in any order. Relational databases were not, however, designed to efficiently execute queries like the above, and using this approach would result in a very nonscalable application. You could, of course, invent your own indexing and search solution, but then you might not end up using your resources optimally, especially if you've already paid for search technology as part of the database.

This article discusses the use of Oracle Text in an imaginary issue tracking application. In this application, users can create issues, which are composed of metadata and optional attachment files. The application leverages Oracle Text to implement a full-text feature for both the metadata as well as attachment file content.

The examples presented here have been tested on Oracle Database XE on Linux, but they should work equally well on other Oracle platforms.

Indexing Process and Searching

Oracle Text indexes retrievable data items before users are able to find content with search. This is a common approach used to ensure adequate search performance. The Oracle Text indexing process is modeled after a pipeline, where data items retrieved from a data store pass through a series of transformations before their keywords are added to the index. The indexing process is split into multiple phases, where each phase is handled by a separate entity and configurable by the application developer.

Oracle Text has different index types that are suitable for different purposes. For full-text search with large documents, the CONTEXT index is the appropriate index type. The indexing process includes the following phases:

  1. Data Retrieval: Data is simply fetched from a data store, for example, a Web page, database large object, or local file system, and passed as a stream of data to the next phase.
  2. Filtering: The filters are responsible for converting data in different file formats to plain text. The other components in the indexing pipeline only process plain text data and don't know about file formats such as Microsoft Word or Excel.
  3. Sectioning: The sectioner adds metadata about the structure of the original data item.
  4. Lexing: A stream of characters is split into words based on the language of the item.
  5. Indexing: In this final phase, the keywords are added to the actual index.

Once the index has been built, an application can use plain SQL queries to execute a search entered by an end user.

Setting Up Oracle Text

Oracle Text is installed with an Oracle Database XE installation by default. With other database editions, you need to install the Oracle Text feature yourself. Once the feature is present, you only need to create a normal database user and grant the CTXAPP role to the user. This will allow the user to execute certain index management procedures:

CREATE USER ot1 IDENTIFIED BY ot1;
GRANT connect,resource, ctxapp TO ot1;

File Index

Here you will create a text table required to index attachment file content stored in the issue tracking system. The attachment files are stored in the file system. In addition to the columns required by the application's data model, the text base table includes an absolute file path and a format column.

CREATE TABLE files (
  id NUMBER PRIMARY KEY,
  issue_id NUMBER,
  path VARCHAR(255) UNIQUE,
  ot_format VARCHAR(6)
);

INSERT INTO files VALUES (1, 1, '/tmp/oracletext/found1.txt', NULL);
INSERT INTO files VALUES (2, 2, '/tmp/oracletext/found2.doc', NULL);
INSERT INTO files VALUES (3, 2, '/tmp/oracletext/notfound.txt', 'IGNORE');

Here the values of the ot_format are interpreted by Oracle Text during indexing. A NULL value indicates that a filter will be automatically selected for the file, whereas IGNORE will skip the file entirely.

The text index can be created with the following statement:

CREATE INDEX file_index ON files(path) INDEXTYPE IS ctxsys.context
  PARAMETERS ('datastore ctxsys.file_datastore format column ot_format');

This statement will initiate the indexing process that will retrieve files from the file system using paths stored in the base table, filter, and index their contents. A CONTEXT index with case-insensitive, exact match semantics is created. The indexing process can be customized in several ways, for example, to support prefix and suffix matches.

Although filtering works fine most of the time without having to specify a file format for each file, having this column in your base table allows further control over the indexing process. The format column allows you, for example, to skip certain file types from being indexed. This can be very useful if you don't want to officially support the full spectrum of file formats supported by Oracle Text in your application.

Oracle Text can also be used for full-text search on metadata. In the sample application there's a table called issues for storing issue metadata. The table is defined like this:

CREATE TABLE issues (
    id NUMBER,
    author VARCHAR(80),
    summary VARCHAR(120),
    description CLOB,
    ot_version VARCHAR(10)
); 

The ot_version column is the index column, which can be used to force reindexing for certain documents. The table can be populated with test data:

INSERT INTO issues VALUES (1, 'Jane', 'Text does not make tea', 
    'Oracle Text is unable to make morning tea', 1);
INSERT INTO issues VALUES (2, 'John', 'It comes in the wrong color', 
    'I want to have Text in pink', 1);

User Index

Oracle Text can index data from different data sources. Oracle Text can be used in the issue tracking system to provide full-text search for issue metadata. By default, you can index values in a single column, but if you want to combine data from several tables, you need to create a custom PL/SQL filter procedure. I will demonstrate how to create such a procedure that will act as a storage abstraction. The indexing process will then iterate over all the rows in the text table, calling the filter procedure for each row. The filter procedure will return all the text to be indexed related to an issue.

-- declare indexing procedure
CREATE PACKAGE ot_search AS
  PROCEDURE issue_filter(rid IN ROWID, tlob IN OUT NOCOPY CLOB);
END ot_search;
/

-- define indexing procedure
CREATE PACKAGE BODY ot_search AS
  PROCEDURE issue_filter(rid IN ROWID, tlob IN OUT NOCOPY CLOB) IS 
  BEGIN 
    FOR c1 IN (SELECT author, summary, description FROM issues WHERE rowid = rid) 
      LOOP 
        dbms_lob.writeappend(tlob, LENGTH(c1.summary)+1, c1.summary || ' ');
        dbms_lob.writeappend(tlob, LENGTH(c1.author)+1, c1.author || ' ');
        dbms_lob.writeappend(tlob, LENGTH(c1.description), c1.description);
      END LOOP; 
  END issue_filter;
END ot_search;
/

-- define datastore preference for issues
BEGIN
ctx_ddl.create_preference('issue_store', 'user_datastore'); 
ctx_ddl.set_attribute('issue_store', 'procedure', 'ot_search.issue_filter'); 
ctx_ddl.set_attribute('issue_store', 'output_type', 'CLOB'); 
END;
/

-- index issues
CREATE INDEX issue_index ON issues(ot_version) INDEXTYPE IS ctxsys.context 
  PARAMETERS ('datastore issue_store');

Searching

The CONTAINS operator is used for searching CONTEXT indexes. In these examples, we only use simple Boolean operators to combine keywords, though the CONTAINS operator grammar does support more-advanced features such as soundex matches. For languages that are supported by Oracle Text, fuzzy matching and stemming are enabled by default. To leverage these advanced search features you simply need to use the fuzzy() or $ query operators, respectively, with the CONTAINS operator. Wildcard characters can be used in CONTAINS queries for prefix and suffix matching. Here are some simple example queries:

SELECT id FROM issues WHERE CONTAINS(ot_version, 'color AND pink', 1) > 0;
SELECT id FROM issues WHERE CONTAINS(ot_version, 'jane OR john', 1) > 0;

Index Maintenance

Because base table data is replicated by the index, the data needs to be periodically synchronized to the index. Index maintenance procedures can be found in the CTX_DDL PL/SQL package. Here is an example of how the index can be updated to reflect base table changes:

EXECUTE ctx_ddl.sync_index('issue_index', '2M');

The synchronization procedure takes the index name and amount of memory to use for the operation. It is also possible to have the database automatically execute this task at regular intervals. You can also choose to use the operating system or other scheduling facilities to initiate synchronization. For example on a Unix system the following shell script could be scheduled with the Cron job scheduling system to execute synchronization:

#!/bin/sh
export ORACLE_SID=orcl
export ORAENV_ASK=NO
source /usr/local/bin/oraenv

sqlplus ot1/ot1@XE > synch.log <<EOF
WHENEVER SQLERROR EXIT 5;
EXECUTE ctx_ddl.sync_index('issue_index', '2M');
EOF

The CTX_DDL package includes other useful procedures; for example, for index optimization, to eliminate index fragmentation and obsolete data.

Indexing errors can be tracked from the CTX_USER_INDEX_ERRORS view, when something goes wrong.

The database tracks document changes based on index column changes, so if you want to force Oracle Text to reindex certain documents, you can update the index column of the corresponding rows like this:

UPDATE files SET path=path WHERE id = 4;

This will update index keywords for file with id 4 when the index is synchronized.

That's it!


Marko Asplund is a technology consultant working on Web-based applications with Java. Marko's interests include enterprise software solutions and architectures, information security, and database technologies, among other things.

Send us your comments
E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy