Translating AltaVista® syntax for OracleText
This sample code provides Java and PL/SQL procedures to convert queries from
"web syntax" (as used by AltaVista(R) for example) to OracleText boolean
syntax.
Contents
AltaVista® Syntax
The AltaVista search engine
uses an easy-to-remember search syntax, as follows:
+word - Word must be present
-word - Word must not be present
word - Word is optional
pref* - Word starting with string "pref"
title:word - Word appears in title (various other fields are supported)
Emulating the Syntax in OracleText
It is non-trivial to reproduce this syntax in standard boolean terms,
as used by OracleText. The problem is that boolean syntax works
only with pairs of words or phrases, whereas AltaVista® (AV) syntax works
at the individual word level. As a simple example, consider the AV search:
+cat dog
In OracleText syntax, we must represent this as:
cat OR ( cat AND dog)
Some more complicated examples (AV first, OracleText after)
+cat dog rabbit
cat OR ( cat AND (dog OR rabbit) )
cat -dog
cat NOT dog
+cat +dog rabbit fox -fish
(cat AND dog) OR ( (cat AND dog) AND (rabbit OR fox) ) NOT fish
Additionally, we must do some extra processing to ensure the scores
come out sensibly. I will not go into full detail here, but remember the
following rules:
- OR scores the higher of the two operands
- AND scores the lower of the two operands
- ACCUMULATE scores the sum of the two operands, but otherwise acts like OR
So the last search above is actually represented as:
((cat & dog) | ((cat & dog)*10*10 & (cat , dog , rabbit , fox)) ) NOT fish
Notes:
- "&" represents AND, "|" represents OR and "," represents ACCUMULATE.
- The construct "*10*10" is used to push the score of the preceding ANDed section very high. This should ensure that the score is actually that from the ACCUMULATE, rather than the AND, and means that every word contributes to the score.
The Java translator
Using Java functions in OracleText queries
These are the steps needed to use a Java-based user-written function
inside an OracleText query:
- Write and test your Java class in the development tool of your choice (eg. JDeveloper, Visual J++ or command-line JDK).
- Save your class to the database
- Create a PL/SQL 'wrapper' declaration for the Java
- Invoke your function inside the CONTAINS clause of a query
Here's a fully worked example:
Some notes on the "translate" procedure
- Searches of the form "field:word" will be translated into an OracleText section search: "word within field".
- The search "-dog" is not possible in ConText. Such a string will be converted into an empty search string.
- It is not possible to bracket terms: "+cat +(dog fox)" (meaning "cat", and either "dog" or "fox") is not allowed. This is true of AltaVista as well.
- Asterix ("*")is converted into the OracleText wild card - "%". This means it is not possible to allow weighting of terms. The developer may choose to implement this in some other way.
- All terms in the search are weighted equally. It is arguable that required terms should be weighted higher than optional terms. This would be easy to change in the "accumulate" section of the Java source if required.
PL/SQL Version
There are some significant differences between the Java translator,
and the PL/SQL version - aside from the language they are written in!
The PL/SQL version...
- Allows for user weighting of search terms (eg. 'dog*3')
- Does not support "*" as a wild-card - use "%" instead ('do%')
- Does not support explicit section searching (eg. 'title:dog')
- Optionally weights phrases appearing in the "homepage", "head" or user-defined sections
- Includes a theme search ('about(dog)') - this should only be included when for languages where theme searches are supported
Here's a fully worked example:
Using the optional section searches
The full syntax for the PL/SQL transaltor is:
FUNCTION pavtranslate RETURNS VARCHAR2
Argument Name Type In/Out Default?
------------------------------ ----------------------- ------ --------
QUERY VARCHAR2 IN DEFAULT
SECTION_FLAG BOOLEAN IN DEFAULT
SECTION1 VARCHAR2 IN DEFAULT
SECTION2 VARCHAR2 IN DEFAULT
If the section_flag is false, then search terms/phrases will score
very highly if they are found in SECTION1, and fairly highly if they
are in SECTION2. These sections default to "HOMEPAGE" and "HEAD" respectively.
AltaVista® is a registered trademark of AltaVista Company.
|