|
Documents are structured presentations of
information. We can define a document as an asset that contains
strucuture, content, and presentation. In this short paper
we describe how to search for content and structure at the
same time. All the features described in this note are available
on Oracle9i.
As a case study we present an online conference
proceedings where attendees can perform full text search on
the structure of the papers (e.g. title, author, abstract,
etc.)
XMLType
Oracle9i
supports a new system-defined datatype called XMLType. XMLType
has built-in member functions that offer a powerful mechanism
to create, extract, and index XML data stored in Oracle9i.
As a new datatype, XMLType can be used as the datatype for columns
in tables and views, and also in PL/SQL stored procedures as
parameters, return values, and variables.
With the XMLType and these capabilities,
SQL developers can leverage the power of the relational database
while working in the context of XML, and, simultaneously,
XML developers can leverage the power of XML standards while
working in the context of a relational database.
Oracle
Text
Oracle Text uses standard SQL to index, search, and analyze
text and documents stored in the Oracle database, files and
on the Web. Among other things, with Oracle Text you can index
and search for content in XML documents.
Case Study
The user must be granted with QUERY REWRITE system privileges
to create function-based index. You must also have the following
initialization parameters defined to create a function-based
index:
QUERY_REWRITE_INTEGRITY set to TRUSTED
QUERY_REWRITE_ENABLED set to TRUE
COMPATIBLE set to 8.1.0.0.0 or a greater value
Let's create a table with two columns: the
paper id and the content. We store the content as an XMLType.
create table proceedings (tk number, papers
sys.XMLTYPE);
Let's populate the table with some papers
...
insert into proceedings(tk,papers) values
(1, sys.XMLType.createXML(
'<?xml version="1.0"?>
<paper>
<title>
Accelerating Dynamic Web Sites using Edge Side Includes
</title>
<authors>Soo Yun and Scott Davies
</authors>
<company>
Oracle Corporation
</company>
<abstract>
The main focus of this presentation is on Edge Side Includes
(ESI). ESI is a simple markup language which is used to mark
cacheable and non-cacheable fragments of a web page. An "ESI
aware server", such as Oracle Web Cache and Akamai EdgeSuite,
can take in ESI marked content and cache and assemble pages
closer to the users, at the edge of the network, rather than
at the application server level. This session will discuss
the challenge many dynamic websites face today, discuss what
ESI is, explain how ESI can be used to alleviate these issues.
The session will also describe how to build pages with ESI,
and detail the ESI and JESI (Edge Side Includes for Java)
libraries.
</abstract>
<track>
Fast Track to Oracle9i
</track>
</paper>'));
We create the text index on the XMLType using the usual create
index statement.
create index proc_idx on proceedings(papers)
indextype is ctxsys.context;
Querying
with XPath and contains
Oracle9i
has introduced two new SQL functions existsNode and extract
that operator on XMLType values:
- existsNode(): given an XPath expression,
checks if the XPath applied over the document can return
any valid nodes.
- extract(): given an XPath expression, applies
the XPath to the document and returns the fragment as a
XMLType.
For example let's select the authors only
from the XML document:
select p.papers.extract('/paper/authors/text()').getStringVal()
from proceedings p
With Oracle Text we can use the contains
function to search for content in a text document. For example
let's search for papers that contains Dynamic in the title:
select tk
from proceedings
where contains(papers,'Dynamic INPATH(paper/title)')>0
A full explanation of INPATH is beyond the
scope of this document. For more information please read the
Query Operators in the Text Reference manual.
Using the contains function the database
returns the columns that you selected. In case of an XML document
it returns the entire document. We can combine the extract
function to manipulate the XML for extracting fragments of
XML. For example, let's select the authors of those papers
that contains Dynamic in the title:
select p.papers.extract('/paper/authors/text()').getStringVal()
from proceedings p
where contains(papers,'Dynamic INPATH(paper/title)')>0
We can also apply all the power of the Text
query language for the content search. In this example we
select the authors of those papers that contains Dynamic or
Edge in the title:
select p.papers.extract('/paper/authors/text()').getStringVal()
from proceedings p
where contains(papers,'Dynamic or Edge INPATH(paper/title)')>0
Conclusions
Traditionaly databases have allowed to search their content
or their structure, but not both at the same time. Oracle provides
unique features that enables querying for content and structure
at the same time.
More Info
Oracle9i
Database Daily Features
|