Searching for content and structure in XML documents
   

Documents are structured presentations of information. We can define a document as an asset that contains strucuture, content, and presentation. In this short paper we describe how to search for content and structure at the same time. All the features described in this note are available on Oracle9i.

As a case study we present an online conference proceedings where attendees can perform full text search on the structure of the papers (e.g. title, author, abstract, etc.)

XMLType

Oracle9i supports a new system-defined datatype called XMLType. XMLType has built-in member functions that offer a powerful mechanism to create, extract, and index XML data stored in Oracle9i. As a new datatype, XMLType can be used as the datatype for columns in tables and views, and also in PL/SQL stored procedures as parameters, return values, and variables.

With the XMLType and these capabilities, SQL developers can leverage the power of the relational database while working in the context of XML, and, simultaneously, XML developers can leverage the power of XML standards while working in the context of a relational database.

Oracle Text

Oracle Text uses standard SQL to index, search, and analyze text and documents stored in the Oracle database, files and on the Web. Among other things, with Oracle Text you can index and search for content in XML documents.


Case Study


The user must be granted with QUERY REWRITE system privileges to create function-based index. You must also have the following initialization parameters defined to create a function-based index:

QUERY_REWRITE_INTEGRITY set to TRUSTED
QUERY_REWRITE_ENABLED set to TRUE
COMPATIBLE set to 8.1.0.0.0 or a greater value

Let's create a table with two columns: the paper id and the content. We store the content as an XMLType.

create table proceedings (tk number, papers sys.XMLTYPE);

Let's populate the table with some papers ...

insert into proceedings(tk,papers) values (1, sys.XMLType.createXML(
'<?xml version="1.0"?>
<paper>
<title>
Accelerating Dynamic Web Sites using Edge Side Includes
</title>
<authors>Soo Yun and Scott Davies
</authors>
<company>
Oracle Corporation
</company>
<abstract>

The main focus of this presentation is on Edge Side Includes (ESI). ESI is a simple markup language which is used to mark cacheable and non-cacheable fragments of a web page. An "ESI aware server", such as Oracle Web Cache and Akamai EdgeSuite, can take in ESI marked content and cache and assemble pages closer to the users, at the edge of the network, rather than at the application server level. This session will discuss the challenge many dynamic websites face today, discuss what ESI is, explain how ESI can be used to alleviate these issues. The session will also describe how to build pages with ESI, and detail the ESI and JESI (Edge Side Includes for Java) libraries.

</abstract>
<track>
Fast Track to Oracle9i
</track>
</paper>'));


We create the text index on the XMLType using the usual create index statement.

create index proc_idx on proceedings(papers) indextype is ctxsys.context;

Querying with XPath and contains

Oracle9i has introduced two new SQL functions existsNode and extract that operator on XMLType values:
  1. existsNode(): given an XPath expression, checks if the XPath applied over the document can return any valid nodes.
  2. extract(): given an XPath expression, applies the XPath to the document and returns the fragment as a XMLType.

For example let's select the authors only from the XML document:

select p.papers.extract('/paper/authors/text()').getStringVal()
from proceedings p

With Oracle Text we can use the contains function to search for content in a text document. For example let's search for papers that contains Dynamic in the title:

select tk
from proceedings
where contains(papers,'Dynamic INPATH(paper/title)')>0

A full explanation of INPATH is beyond the scope of this document. For more information please read the Query Operators in the Text Reference manual.

Using the contains function the database returns the columns that you selected. In case of an XML document it returns the entire document. We can combine the extract function to manipulate the XML for extracting fragments of XML. For example, let's select the authors of those papers that contains Dynamic in the title:

select p.papers.extract('/paper/authors/text()').getStringVal()
from proceedings p
where contains(papers,'Dynamic INPATH(paper/title)')>0

We can also apply all the power of the Text query language for the content search. In this example we select the authors of those papers that contains Dynamic or Edge in the title:

select p.papers.extract('/paper/authors/text()').getStringVal()
from proceedings p
where contains(papers,'Dynamic or Edge INPATH(paper/title)')>0

Conclusions

Traditionaly databases have allowed to search their content or their structure, but not both at the same time. Oracle provides unique features that enables querying for content and structure at the same time.


More Info
Oracle Text
Oracle XML page
Oracle Text Discussion Forum
XML Discussion Forum
Oracle9i XML Application Developer's Guide
Oracle9i Text Reference
Oracle9i Text Application Developers' Guide

Oracle9i Database Daily Features
Archives
   
E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy