HOW TO SEARCH ENABLE OUR ORACLE-BACKED WEBSITE WITH ORACLE TEXT


Shamim Alpha, Oracle Corporation

INTRODUCTION

ORACLE TEXT: INFORMATION RETRIEVAL SYSTEM INTEGRATED WITH RDBMS

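Oracle Text adds the context indextype and the contains operator to the database. The example below indexes and queries the content column of an articles table.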

        indexing:
           create index text_index on articles(content) indextype is context;

        query:
           select pubdate, author, title
                  from articles
                  where pubdate between '01-APR-1998' and '10-APR-1998'
                        and contains (content, 'Oracle AND ABOUT(internet)')>0;

In the example above, we are looking for all articles that were published between the first and tenth of April 1998 and whose content column contains the term 'oracle' and talks about the Internet. Because Oracle Text is integrated with the RDBMS, a text search can be combined with any other structured search clause, as is the case in the example above.

This example also demonstrates how Oracle Text goes beyond keyword matching. When we try to match exact occurrences of a term in text, we have to play a guessing game to formulate the appropriate query. We must somehow find the query that retrieves exactly the pages we care about and no others. If we are looking for information on strikes by auto workers, should we query on strikes, labor disputes, lockouts, picketing, or UAW?

An ABOUT query takes the guessing out of text retrieval. It uses the Oracle Text knowledge base to find all documents related to the query terms in meaning, in addition to those that match the query terms letter-by-letter. At the same time, ABOUT filters out most irrelevant documents by ranking them below the more relevant ones. For example, if the query term is baseball, the user most probably does not want to see a document that merely says "Person X was wearing a baseball hat". ABOUT lowers the ranking of such documents by determining that they lack additional evidence for meanings related to baseball. It determines the relative importance of different kinds of words and phrases based on information from the knowledge base and adjusts this estimate according to the presence or absence of support from related terms in the text.

By eliminating the need to guess exact query terms and by filtering out irrelevant text, ABOUT improves the end user's productivity in finding relevant texts. For this purpose it uses an extensible knowledge base, which can be thought of as an ontologically augmented thesaurus. We discuss the user-extensible knowledge base in greater detail in a later section.

Oracle Text provides a set of full-text operators such as AND, OR, PROXIMITY, STEM, WILDCARD, and WITHIN, and a set of XML-specific operators such as HASPATH and INPATH that support XPath-like expressions, so that advanced users can formulate highly specialized queries. For typical end users, ABOUT is the preferred operator because of its very forgiving syntax and its ability to use linguistic evidence to provide good relevance ranking.

TEXT SEARCH IS DIFFERENT FROM STRUCTURED SEARCH

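A structured condition such as pubdate between '01-APR-1998' and '10-APR-1998' is either true or false for a given row. A text query instead ranks documents by relevance, which Oracle Text exposes through the SCORE operator.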

As we mentioned before, relevance judgments of documents are highly subjective but generally agreed upon. Visitors also have to perform subjective evaluation of documents to determine if a given document is likely to be relevant before they peruse the complete document. It is not sufficient to display a list of documents considered relevant. We need to provide useful tools to the users so that they can quickly determine whether a document is likely to be relevant. Oracle Text provides highlighting and gisting functionalities for that purpose. Oracle Text provides hierarchical feedback functionality to suggest modifications to the queries in case the user is dissatisfied with the query hits.

Every website has enough peculiarities, however, to render a general relevance ranking scheme inappropriate. Improving the quality of relevance ranking is an ongoing battle. Oracle Text provides a reasonable default solution, but if we aim for the best solution for our website, we need to customize our Oracle Text installation, and for that purpose we should be prepared to investigate our search system periodically. We can detect problems with the search function by examining the query logs.

One may ask: since search has reasonable default behavior, why should we spend our time customizing it? Our experience shows that "working fine" is not enough. There remains plenty of room for improvement if one is willing to devote time and energy to analyzing the query logs and customizing the search functionality.
 

HOW TO MAKE SEARCH FUNCTIONALITY BETTER

HELP THE USERS QUICKLY EVALUATE THE RELEVANCE OF THE DOCUMENTS IN THE HITLIST

Producing a long list of rather cryptic URLs without any other information forces users to open each webpage, read its content, and decide whether the hit is useful. Associating useful information with the individual hits, such as the title of the webpage, a simple summary, a highlighted version of the most relevant part of the webpage, a list of its most important concepts, the size of the document, and the date of the most recent update, can greatly reduce the time and effort users need to decide which hits are likely to be relevant to the query. All of this functionality can be provided using built-in Oracle Text features.

HELP THE USERS REFINE QUERIES USING FEEDBACK TOOLS

Searching involves gradual refinement. The same thought can be expressed in many different ways, with different syntax and word choices. It is rather unlikely that we will always find the most relevant document on the first attempt. The query users initially formulate can be more general or narrower than they intended. Query feedback tools suggest ways to fine-tune queries based on the content of the document collection, thus helping to solve the problem of formulating the query properly. The task of a feedback tool is not necessarily limited to suggesting ways to narrow or generalize queries. Feedback can also help users with other issues such as misspellings and personalization. The Oracle Text procedure ctx_query.hfeedback provides a complete solution for query feedback using the Oracle Text knowledge base.

ANALYZE THE QUERY LOG: FEEL THE PULSE OF USER NEEDS AND GRIEVANCES

The actual use of our text search might be very different from the usage we expect when we build the system. We discussed before that it is very expensive to fail to provide expected search functionality. Another factor that aggravates the situation is that we cannot expect the user to always formulate queries properly. As a user-centric website, we need to proactively compensate for user errors. Common problems we find from the query logs are as follows:

  • Misspellings
  • Search depending on synonyms or hierarchical information
  • Using unexpected syntax variations
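
A first pass over the query log can be as simple as counting which queries most often return nothing. The following is a minimal sketch in Python, not part of Oracle Text. It assumes a hypothetical log in which each entry is a (query text, hit count) pair, and the name top_failed_queries is invented for illustration.

```python
from collections import Counter


def top_failed_queries(log_entries, limit=10):
    """Return the most frequent queries that produced no hits.

    log_entries is an iterable of (query_text, hit_count) pairs. This log
    format is an assumption; adapt it to whatever the application records.
    """
    failed = Counter()
    for query_text, hit_count in log_entries:
        if hit_count == 0:
            # Normalize case and whitespace so trivial variants count together.
            failed[" ".join(query_text.lower().split())] += 1
    return failed.most_common(limit)


# Hypothetical log entries, for illustration only.
sample_log = [
    ("Conecticut", 0),
    ("conecticut ", 0),
    ("Connecticut", 42),
    ("spatial cartridge", 0),
    ("iFS", 3),
]
print(top_failed_queries(sample_log))
```

Queries that fail repeatedly are good candidates for the remedies discussed in this paper: adding a spelling variation or synonym to the knowledge base, or rewriting the query syntax.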
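
FROM ANALYZING THE QUERY LOG TO MODIFYING EXTENSIBLE KNOWLEDGE BASE

Misspellings that recur in the query log can be added to the knowledge base as variations of the correct term, so that ABOUT queries still find the intended documents. The knowledge base is extended from a thesaurus with the ctxkbtc compiler.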

A similar problem surfaces at a different level when users use synonyms. The Oracle Text extensible knowledge base minimizes this problem for the general domain. However, synonyms used in specialized domains are not necessarily known to the Oracle Text knowledge base. Conversely, terms considered synonyms in the general domain may not be synonyms within a specialized domain. For example, in general usage murder, manslaughter, and homicide can be considered synonyms. In the judiciary domain, however, those terms are not synonyms of each other. The extensibility API of the Oracle Text knowledge base helps us solve both problems: it lets us extend the knowledge base by adding synonyms and also modify it by removing synonyms.

Much of the strength of ABOUT comes from its ability to use hierarchical information from the knowledge base. When somebody searches for New England using ABOUT, we find from the knowledge base that New England consists of Connecticut, Massachusetts, Vermont etc., their synonyms such as Constitution State, Conn etc., and spelling mistakes or variations such as Conecticut, Conneticut, etc. A mention of any of these variations, as well as a letter-by-letter match of New England, is considered an occurrence of New England. The hierarchical structure of a specific domain can be reflected in the extensible knowledge base by extending it appropriately. The following is part of a thesaurus used to extend the knowledge base with Oracle-specific terminology:

Dynamic SQL
                BT SQL - Structured Query Language
Spatial Cartridge
                USE Oracle9i Spatial
Oracle Spatial
                USE Oracle9i Spatial
Oracle9i Spatial
                BT Oracle9i
                RT computer multimedia
iFS
                USE Internet File System
IFS
                USE Internet File System
Internet File System
                BT databases
                RT Oracle9i
Embedded SQL
                BT SQL - Structured Query Language
SQLJ
                BT Embedded SQL
                RT Java
 

iFS USE Internet File System means that any occurrence of iFS should be treated as an occurrence of Internet File System and Internet File System is the preferred form.

SQLJ BT Embedded SQL means that Embedded SQL subsumes SQLJ, i.e. SQLJ is a type of Embedded SQL.

SQLJ RT Java means that SQLJ is related to Java.
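
To make the three relations concrete, here is a small sketch in Python that reads a listing in the layout shown above and follows its USE and BT links. It only illustrates what the relations mean; it is not how Oracle Text loads or compiles a thesaurus, and the function names (parse_thesaurus, preferred_term, broader_terms) are invented for this example.

```python
def parse_thesaurus(text):
    """Parse unindented term lines followed by indented 'REL target' lines."""
    entries = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            if current is None:
                continue  # relation line with no term above it; ignore
            # Indented line: a relation of the current term, e.g. "BT Embedded SQL".
            relation, target = line.strip().split(None, 1)
            entries[current].setdefault(relation, []).append(target)
        else:
            current = line.strip()
            entries[current] = {}
    return entries


def preferred_term(entries, term):
    """Follow USE links until the preferred form of a term is reached."""
    seen = set()
    while term in entries and "USE" in entries[term] and term not in seen:
        seen.add(term)
        term = entries[term]["USE"][0]
    return term


def broader_terms(entries, term):
    """Return the chain of broader terms (BT) above a term, nearest first."""
    chain = []
    term = preferred_term(entries, term)
    while term in entries and "BT" in entries[term]:
        term = preferred_term(entries, entries[term]["BT"][0])
        if term in chain:
            break  # guard against cycles in a malformed thesaurus
        chain.append(term)
    return chain


SAMPLE = """\
iFS
    USE Internet File System
Internet File System
    BT databases
    RT Oracle9i
Embedded SQL
    BT SQL - Structured Query Language
SQLJ
    BT Embedded SQL
    RT Java
"""

entries = parse_thesaurus(SAMPLE)
print(preferred_term(entries, "iFS"))
print(broader_terms(entries, "SQLJ"))
print(entries["SQLJ"]["RT"])
```

Here iFS resolves to its preferred form, SQLJ sits below Embedded SQL and, through it, below SQL - Structured Query Language, and Java is recorded only as a related term.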

We can create a similar hierarchy and association for the concepts in our domain and create a thesaurus. Once the knowledge base is modified using that thesaurus, ABOUT will use the information without the need for any syntax change. So there will be no further need to modify the application. All we need to do is to extend the knowledge base and recreate the Oracle Text index.

FROM ANALYZING THE QUERY LOG TO REWRITING THE QUERY

Oracle Text provides a sufficient number of query operators to be used as building blocks for domain specific query syntax. To illustrate the point let us discuss how we can support common web query syntax.

     "x y z" means phrase
     -x means x must be absent
     +x means x must be present
 

The following transformations are necessary:
      Oracle Text already supports the phrase operator as a double-quoted string.
      -x can be translated using NOT.
      +x can be translated as the child of ACCUM with the highest weight.

Example:
     +SQLJ -SQL  "stored procedures" can be transformed as
                  (SQLJ*10, "stored procedures") NOT SQL
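
The transformation above can be sketched as a small application-side function. The Python below is one possible implementation under stated assumptions, not an Oracle Text API: the name web_to_oracle_text and the default weight of 10 (taken from the example) are illustrative, and the sketch does not escape Oracle Text reserved words or special characters that may appear in user input.

```python
import re

# An optional +/- sign followed by either a double-quoted phrase or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]+)"|(\S+))')


def web_to_oracle_text(query, required_weight=10):
    """Translate common web query syntax into an Oracle Text query string."""
    accumulated = []  # children of ACCUM, written with the comma operator
    excluded = []     # terms that must be absent
    for sign, phrase, word in TOKEN.findall(query):
        term = f'"{phrase}"' if phrase else word
        if sign == "-":
            excluded.append(term)
        elif sign == "+":
            # Required terms get the highest weight among the ACCUM children.
            accumulated.append(f"{term}*{required_weight}")
        else:
            accumulated.append(term)
    if not accumulated:
        raise ValueError("query needs at least one term that is not excluded")
    result = "(" + ", ".join(accumulated) + ")"
    for term in excluded:
        result += f" NOT {term}"
    return result


print(web_to_oracle_text('+SQLJ -SQL  "stored procedures"'))
```

The rewritten string would then be passed as the second argument of contains, preferably through a bind variable rather than by concatenating it into the SQL statement.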
 
 
 
 

HOW TO HELP PEOPLE FIND WHAT THEY ARE LOOKING FOR

ORGANIZE OUR WEBSITE CONTENT INTUITIVELY

A website organized around intuitive content categories can complement search functionality very well. Website builders can misuse search functionality to shift the burden of finding the necessary information from themselves to the website's visitors. Just because the website has a search button does not mean that we should stop trying to organize the website intuitively. On a well-organized website, search should mostly be used to access archival or rarely accessed special-interest content. Most people should be able to locate everything they want without using search. Manually classifying content into categories is very time consuming and expensive. The Oracle Text CTXRULE indextype can come in very handy for organizing the website based on content. A set of categorization rules can be expressed using the rich set of text retrieval operators supported by Oracle Text. Using the MATCHES operator provided by the CTXRULE indextype, documents can be automatically associated with categories based on their content.

PUSH THE CONTENT TO THE USERS IN A TIMELY MANNER

Maintaining a continuing relationship with users is the key to the success of a website. Instead of passively waiting for users to come back to our website, we should allow them to establish profiles and rules that push interesting content to them as it appears on the website. Users could state their interests explicitly, or their interests could be deduced automatically from their navigation patterns. In both scenarios, we can establish content-based rules that push content to users either when they visit the website or via electronic mail. The Oracle Text CTXRULE indextype can help us achieve this goal.

OTHER CONSIDERATIONS

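WE DO NOT NEED TO BUILD THE APPLICATION FROM SCRATCH

Oracle Ultra Search is a ready-to-use search application built on Oracle Text, so we do not have to write our own search application from scratch.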

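WE MAY WANT TO CONSIDER SPECIALIZED SEARCH APPLICATION FOR CATALOGS

For catalog-style data, the CTXCAT indextype is worth considering. It is designed for short text fragments, such as item names and descriptions, that are queried together with structured criteria.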

SUMMARY

1.  Ease of use (index creation and query)
2.  Scalability, robustness, 24-7 availability, manageability and security
3.  High quality due to linguistic (knowledge based) capabilities
4.  Simple customization API

Oracle Text helps us search enable our Oracle database backed website without modifications. For the best results, however, we need to find out our customers' needs and customize our Oracle Text application accordingly. Oracle Text provides APIs for this customization. As we have discussed before, we cannot afford to miss business opportunities by neglecting to make a conscious effort to provide the best search we can. What is the point of spending hours creating a document if nobody can ever find it? Every minute we spend analyzing the query log and customizing our search is a great service to our customers and, in turn, to ourselves.
