Shamim Alpha, Oracle Corporation
create index text_index on articles(content) indextype is context;
select pubdate, author, title
from articles
where pubdate between '01-APR-1998' and '10-APR-1998'
and contains (content, 'Oracle AND ABOUT(internet)')>0;
In the example above, we are looking for all the articles that were published between the first and tenth of April 1998, contain the term 'oracle', and talk about the Internet in the column content. Because Oracle Text is integrated with the RDBMS, a text search can be combined with any other structured search clause, as is the case in the example above.
This example also demonstrates how Oracle Text goes beyond keyword matching. When we try to match exact occurrences of a term in text, we have to play a guessing game to formulate the appropriate query: we must somehow find the query that retrieves exactly the pages we care about and no others. If we are looking for information on strikes by auto workers, should we query on strikes, labor disputes, lockouts, picketing, or UAW? The ABOUT query takes the guessing out of text retrieval. It uses the Oracle Text knowledge base to find all the documents related in meaning to the query terms, in addition to those that match the query terms letter by letter.
At the same time, ABOUT filters out most irrelevant documents by pushing them below more relevant documents in the hit list. For example, if the query term is baseball, the user most probably does not want to see a document that merely says "Person X was wearing a baseball hat". ABOUT lowers the ranking of such documents by determining that they lack additional evidence for meanings related to baseball. It estimates the relative importance of different kinds of words and phrases from information in the knowledge base, and adjusts this estimate according to the presence or absence of support from related terms in the text. By eliminating the need to guess exact query terms and by filtering out irrelevant text, ABOUT makes end users more productive in finding relevant texts. For this purpose ABOUT uses an extensible knowledge base, which can be thought of as an ontologically augmented thesaurus. We will discuss the user-extensible knowledge base in greater detail in a later section.
Oracle Text provides a set of full-text operators such as AND, OR, PROXIMITY, STEM, WILDCARD and WITHIN, and a set of XML-specific operators such as HASPATH and INPATH that support XPath-like expressions, so that advanced users can formulate highly specialized queries. For typical end users, ABOUT is the preferred operator because of its very forgiving syntax and its ability to use linguistic evidence to provide good relevance ranking.
As we mentioned before, relevance judgments of documents are highly subjective but generally agreed upon. Visitors also have to evaluate documents subjectively to decide whether a given document is likely to be relevant before they peruse it in full. It is not sufficient to display a list of documents considered relevant; we need to give users tools that let them quickly determine whether a document is likely to be relevant. Oracle Text provides highlighting and gisting functionality for that purpose. It also provides hierarchical feedback functionality that suggests modifications to the query in case the user is dissatisfied with the query hits.
Every website has enough peculiarities, however, to render a general relevance ranking scheme inappropriate. Improving relevance ranking quality is an ongoing battle. Oracle Text provides a reasonable default solution, but if we aim for the best solution for our website, we need to customize our Oracle Text installation, and for that purpose we should be prepared to investigate our search system periodically. We can detect problems with the search function by examining query logs.
One may ask: since search has reasonable default behavior, why should we spend our time customizing it? Our experience shows that working fine is not enough. There remains ample room for improvement if one is willing to devote time and energy to analyzing the query logs and customizing search functionality.
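As a rough illustration of what examining a query log can look like, the following Python sketch counts the most frequent queries that returned no hits. The log format (pairs of query string and hit count), the function name failed_queries and the sample entries are all assumptions made up for this example; they do not describe any log that Oracle Text itself produces.

```python
from collections import Counter

def failed_queries(log_entries, top=3):
    """Return the most frequent normalized queries that returned no hits.

    log_entries is an iterable of (query, hit_count) pairs; this layout is
    hypothetical and would have to be adapted to the real log.
    """
    counts = Counter(
        " ".join(query.lower().split())   # fold case and repeated spaces
        for query, hits in log_entries
        if hits == 0
    )
    return counts.most_common(top)

# Invented sample entries, for illustration only.
sample_log = [
    ("Conecticut hotels", 0),
    ("conecticut  hotels", 0),
    ("SQLJ", 12),
    ("ifs pricing", 0),
]

print(failed_queries(sample_log))
```

Queries that fail repeatedly are natural candidates for the kinds of customization discussed in the rest of this section, such as adding a spelling variation or a domain synonym to the knowledge base.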
The actual use of our text search might be very different from the usage we expect when we build the system. We discussed before that it is very expensive to fail to provide expected search functionality. Another factor that aggravates the situation is that we cannot expect the user to always formulate queries properly. As a user-centric website, we need to proactively compensate for user errors. Common problems we find from the query logs are as follows:
A similar problem surfaces at a different level when users use synonyms. The Oracle Text extensible knowledge base minimizes this problem for the general domain. However, synonyms used in specialized domains are not necessarily known to the Oracle Text knowledge base. Conversely, terms considered synonyms in the general domain may not be synonyms within a specialized domain. For example, in general usage murder, manslaughter and homicide can be considered synonyms, but in the judiciary domain those terms are not synonyms of each other. The extensibility API of the Oracle Text knowledge base helps us solve both problems: it lets us not only extend the knowledge base by adding synonyms but also modify it by removing synonyms.
Much of the strength of ABOUT comes from its ability to use hierarchical information from the knowledge base. When somebody searches for New England using ABOUT, the knowledge base tells us that New England consists of Connecticut, Massachusetts, Vermont, etc. It also supplies their synonyms, such as Constitution State and Conn, and spelling mistakes or variations such as Conecticut and Conneticut. A mention of any of these variations, as well as a letter-by-letter match of New England, is considered an occurrence of New England. The hierarchical structure of a specific domain can be reflected in the extensible knowledge base by extending it appropriately. The following is part of a thesaurus used to extend the knowledge base with Oracle-specific terminology:
BT SQL - Structured Query Language
USE Oracle9i Spatial
USE Oracle9i Spatial
RT computer multimedia
USE Internet File System
USE Internet File System
Internet File System
BT SQL - Structured Query Language
BT Embedded SQL
iFS USE Internet File System means that any occurrence of iFS should be treated as an occurrence of Internet File System and Internet File System is the preferred form.
SQLJ BT Embedded SQL means that Embedded SQL subsumes SQLJ, i.e. SQLJ is a type of Embedded SQL.
SQLJ RT Java means that SQLJ is related to Java.
We can create a similar hierarchy and association for the concepts in our domain and create a thesaurus. Once the knowledge base is modified using that thesaurus, ABOUT will use the information without the need for any syntax change. So there will be no further need to modify the application. All we need to do is to extend the knowledge base and recreate the Oracle Text index.
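To make the three relations concrete, here is a small Python sketch of a toy in-memory thesaurus that records USE, BT and RT entries and computes which surface forms would count as an occurrence of a concept. It is only a model of the idea described above, not how Oracle Text stores or applies its knowledge base; the class name Thesaurus and its methods are hypothetical, and treating RT as symmetric is an assumption.

```python
from collections import defaultdict

class Thesaurus:
    """Toy model of USE / BT / RT relations (not the Oracle Text API)."""

    def __init__(self):
        self.preferred = {}               # variant -> preferred term (USE)
        self.broader = defaultdict(set)   # term -> broader terms (BT)
        self.related = defaultdict(set)   # term -> related terms (RT)

    def add(self, term, relation, target):
        if relation == "USE":
            self.preferred[term] = target
        elif relation == "BT":
            self.broader[term].add(target)
        elif relation == "RT":
            # Assumption: "related to" holds in both directions.
            self.related[term].add(target)
            self.related[target].add(term)
        else:
            raise ValueError(f"unknown relation: {relation}")

    def normalize(self, term):
        """Map a variant to its preferred form, if it has one."""
        return self.preferred.get(term, term)

    def matching_forms(self, term):
        """Return every form that counts as an occurrence of term."""
        concepts = {self.normalize(term)}
        changed = True
        while changed:  # repeatedly pull in terms subsumed via BT
            changed = False
            for child, parents in self.broader.items():
                if child not in concepts and parents & concepts:
                    concepts.add(child)
                    changed = True
        variants = {v for v, p in self.preferred.items() if p in concepts}
        return concepts | variants

kb = Thesaurus()
kb.add("iFS", "USE", "Internet File System")
kb.add("SQLJ", "BT", "Embedded SQL")
kb.add("SQLJ", "RT", "Java")

print(sorted(kb.matching_forms("Embedded SQL")))
```

In this model a search about Embedded SQL also accepts SQLJ, and a search about Internet File System also accepts iFS, which mirrors the New England example given earlier.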
Oracle Text provides enough query operators to serve as building blocks for a domain-specific query syntax. To illustrate the point, let us discuss how we can support common web query syntax.
"x y z" means phrase
-x means x must be absent
+x means x must be present
The following transformation will be necessary:
Oracle Text already supports the phrase operator as a double-quoted string.
-x can be translated using NOT.
+x can be translated as an operand of ACCUM with the highest weight.
+SQLJ -SQL "stored procedures" can be transformed as follows, where the comma is the ACCUM operator and *10 assigns the weight:
(SQLJ*10, "stored procedures") NOT SQL
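The transformation above can be sketched as a small application-side function. The Python below is a minimal sketch, not part of Oracle Text: the name translate and the parameter required_weight are hypothetical, the weight of 10 is simply taken from the example, and the code does not escape words or characters that are reserved in Oracle Text queries. How several chained NOT clauses are grouped is also an assumption that should be checked against the Oracle Text documentation.

```python
import re

# A token is a double-quoted phrase or a run of non-space characters,
# optionally prefixed with + (must be present) or - (must be absent).
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def translate(web_query, required_weight=10):
    """Translate +x / -x / "phrase" web syntax into an Oracle Text query."""
    accumulated, excluded = [], []
    for sign, phrase, word in TOKEN.findall(web_query):
        # Phrases stay double-quoted, as in the example in the text.
        term = f'"{phrase}"' if phrase else word
        if not term:
            continue  # skip an empty "" phrase
        if sign == "-":
            excluded.append(term)
        elif sign == "+":
            accumulated.append(f"{term}*{required_weight}")
        else:
            accumulated.append(term)
    if not accumulated:
        raise ValueError("query needs at least one term that is not excluded")
    # The comma is the ACCUM operator; each exclusion is appended with NOT.
    query = "(" + ", ".join(accumulated) + ")"
    for term in excluded:
        query = f"{query} NOT {term}"
    return query

print(translate('+SQLJ -SQL "stored procedures"'))
```

The resulting string would then be passed as the second argument of contains in the SQL query. A production version would also need to handle reserved words and special characters in user input.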
1. Ease of use (index creation and query)
2. Scalability, robustness, 24-7 availability, manageability and security
3. High quality due to linguistic (knowledge based) capabilities
4. Simple customization API
Oracle Text helps us search-enable our Oracle database-backed website without modifications. For the best results, however, we need to find out our customers' needs and customize our Oracle Text application accordingly, and Oracle Text provides APIs for that customization. As we have discussed before, we cannot afford to miss business opportunities by neglecting to make a conscious effort to provide the best possible search within our ability. What is the point of spending hours creating a document if nobody can ever find it? Every minute we spend analyzing the query log and customizing our search is a great service to our customers and, in turn, to ourselves.