Oracle Text Document Section Searching

Overview

Purpose

This tutorial covers the following:

  • Enabling Oracle Text Section Searching
  • Oracle Text Section Types
  • HTML Section Searching with Oracle Text
  • XML Section Searching with Oracle Text

Time to Complete

Approximately 1 hour

Introduction

Section searching enables you to narrow text queries down to blocks of text within documents. Section searching is useful when your documents have internal structure, such as HTML and XML documents. You can also search for text at the sentence and paragraph level.

Prerequisites

Before starting this tutorial, you should:

Perform Creating Oracle Text User and building a simple text query OBE.

Enabling Oracle Text Section Searching

The steps for enabling section searching for your document collection are:

  • Create a Section Group
  • Define Your Sections
  • Index Your Documents
  • Section Searching with the WITHIN Operator

Creating a Section Group

Section searching is enabled by defining section groups. You use one of the system-defined section groups to create an instance of a section group. Choose a section group appropriate for your document collection. You use section groups to specify the type of document set you have and implicitly indicate the tag structure. For instance, to index HTML tagged documents, you use the HTML_SECTION_GROUP. Likewise, to index XML tagged documents, you can use the XML_SECTION_GROUP.  You use the CTX_DDL package to create section groups and define sections as part of section groups.

To create a a Section Group, perform the following steps:
  1. Open a SQL Worksheet using myuser_conn and enter the following code to create a table called docs1 having a single column called text which is VARCHAR2(80).
    Click the Run Statement icon  to run the query.

    CREATE TABLE docs1( text VARCHAR2(80));

    alt description here
  2. Enter the following code to insert the data into the docs1 table and click Run Script icon to run the code.

    INSERT INTO docs1 VALUES('<H1>Oracle</H1>');
    INSERT INTO docs1 VALUES('<H1>California is a state in the US.</H1>');
    INSERT INTO docs1 VALUES('<H1>Paris is a city in France.</H1>');

    alt description here
  3. Create a section group with HTML_SECTION_GROUP using the following code:

    begin
    ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');
    end;

    alt description here

Define Your Sections

    You define sections as part of the section group. Add a zone section called heading for all text within the  <H1> tag:

    begin
    ctx_ddl.add_zone_section('htmgroup', 'heading', 'H1');
    end;

    alt description here

Index Your Documents

    To index HTML documents, you need to specify your section group in the parameter clause of CREATE INDEX.

    create index myindex on docs1(text) indextype is ctxsys.context
    parameters('section group htmgroup');

    alt description here

Section Searching with the WITHIN Operator

    When your documents are indexed, you can query within sections using the WITHIN operator. Here, to find all the documents that contain the word California within their headings, enter the following query:

    select * from docs1 where contains (text, 'California WITHIN heading') > 0;

    alt description here

Oracle Text Section Types

Zone Section

A zone section is a body of text delimited by start and end tags in a document. The positions of the start and end tags are recorded in the index so that any words in between the tags are considered to be within the section.
Any instance of a zone section must have a start and an end tag. Zone sections can nest, overlap, and repeat within a document. You define sections as part of the section group. The following example defines a zone section called heading for all text within the HTML tag:

begin
ctx_ddl.add_zone_section('htmgroup', 'heading', 'H1');
end;

Note: Zone section has been already covered in the previous section.

Field Section

A field section is similar to a zone section in that it is a region of text delimited by start and end tags. Field sections are more efficient than zone sections and are different than zone sections in that the region is indexed separately from the rest of the document. You can create an unlimited number of field sections.  Since field sections are indexed differently, you can also get better query performance over zone sections for when you have a large number of documents indexed.

Unlike zone sections, field sections have the following restrictions:
  • Field sections cannot overlap.
  • Field sections cannot repeat.
  • Note: Repeated field sections are allowed, but WITHIN queries treat them as a single section. To have WITHIN queries distinguish repeated sections, define them as zone sections.

  • Field sections cannot nest.
  1. Create a table called authors using the following CREATE TABLE statement.

    create table authors
    (id number primary key, author_name varchar2(80));

    alt description here
  2. Insert data into the authors table using the following INSERT TABLE statement.

    insert into authors values (1,'<author>John Irving</author>');
    insert into authors values (2,'<author>Ken Belly</author>');

    alt description here
  3. Create a section group called basicgroup of the BASIC_SECTION_GROUP type. Then add a field section in basicgroup called Author for the tag <author> and set the visible flag to FALSE to create an invisible section:

    begin
    ctx_ddl.create_section_group('basicgroup', 'BASIC_SECTION_GROUP');
    ctx_ddl.add_field_section('basicgroup', 'Author', 'author', FALSE);
    end;

    alt description here
  4. Create an index on the authors table to index your documents and specify your section group in the parameter clause of CREATE INDEX.

    create index myindx on authors (author_name) indextype is ctxsys.context
    parameters ('section group basicgroup');

    alt description here
  5. Execute the SELECT statement to find text within the Author section as follows:

    select * from authors
    where contains (author_name, 'Irving WITHIN Author') > 0;

    alt description here

    Note: You must use the WITHIN operator in the query when you set your field section as FALSE.

  6. Create the section group of the BASIC_SECTION_GROUP type and set the visible flag to TRUE to create an visible section as follows:

    begin
    ctx_ddl.add_field_section('basicgroup', 'Author', 'author', TRUE);
    end;

    alt description here
  7. Drop and recreate the index on authors table.

    drop index myindx;

    create index myindx on authors (author_name) indextype is ctxsys.context
    parameters ('section group basicgroup');

    alt description here
  8. Execute the SELECT statement to find text within the Author section without specifying the WITHIN operator as follows:

    select * from authors
    where contains (author_name, 'Irving') > 0;

    alt description here

    Note: If you want to query text within field sections without specifying WITHIN operator, you must set the visible flag to TRUE.

MDATA Section

An MDATA section is used to reference user-defined metadata for a document. Using MDATA sections can speed up mixed queries.MDATA sections are not tokenized. So "Black Dog" is indexed as a single string, rather than split into "black" and "dog" as would be the case for a field or zone section. MDATA can be updated (in a transactional manner) without affecting the rest of the index. So if a library application keeps track of the number of copies of a book it has in stock, then modifying that information does NOT require you to reindex the whole record.

Use CTX_DDL.ADD_MDATA_SECTION to add an MDATA section to a section group. MDATA values can be changed with CTX_DDL.ADD_MDATA and removed with CTX_DDL.REMOVE_MDATA. Also, MDATA sections can have multiple values. Only the owner of the index is allowed to call CTX_DDL.ADD_MDATA and CTX_DDL.REMOVE_MDATA.

  1. Create a table called library_stock using the following CREATE TABLE statement.

    create table library_stock (id number primary key,
    book_info clob);

    alt description here
  2. Insert data into the library_stock table using the following INSERT TABLE statement.

    insert into library_stock values (1,
      '<title>A Prayer for Owen Meany</title>
       <author>John Irving</author>
       <status>In Stock</status>
       <stocklevel>1</stocklevel>');

    insert into library_stock values (2,
      '<title>The World According to Garp</title>
       <author>John Irving</author>
       <status>In Stock</status>
       <stocklevel>12</stocklevel>');

    insert into library_stock values (3,
      '<title>The Hotel New Hampshire</title>
       <author>John Irving</author>
       <status>Out of Stock</status>
       <stocklevel>0</stocklevel>');

    alt description here
  3. Create a section group called mysg of the HTML_SECTION_GROUP type. Then add a field section in mysg called title for the tag <title> and set the visible flag to TRUE. And create a MDATA sections called status and stocklevel.

    begin   
    ctx_ddl.create_section_group('mysg', 'HTML_SECTION_GROUP');
    ctx_ddl.add_field_section('mysg', 'title','title', TRUE);
    ctx_ddl.add_mdata_section('mysg', 'status', 'status');   
    ctx_ddl.add_mdata_section('mysg', 'stocklevel','stocklevel');
    end;

    alt description here
  4. Create an index on the library_stock table to index your documents and specify your section group in the parameter clause of CREATE INDEX.

    create index lib_index on library_stock (book_info)
    indextype is ctxsys.context
    parameters ('section group mysg');

    alt description here
  5. Execute the SELECT statement to find text within the mdata and the field section as follows:

    select book_info from library_stock
    where contains (book_info, 'A Prayer for Owen Meany WITHIN title and mdata(status, In Stock)') > 0;

    alt description here

SDATA Section

The value of an SDATA section is extracted from the document text like other sections, but is indexed as structured data, also referred to as SDATA. Using SDATA sections supports operations such as projection, range searches, and ordering. It also enables SDATA indexing of section data such as embedded tags, and detail table or function invocations. This enables you to perform various combinations of text and structured searches in one single SQL statement.

SDATA operators should only be used as descendants of AND operators that also have non-SDATA children. SDATA operators are meant to be used as secondary, checking or non-driving, criteria. For instance, "find documents with DOG that also have price > 5", rather than "find documents with rating > 4". Other uses will operate properly, but may not have optimal performance.

You use CTX_DDL.ADD_SDATA_SECTION to add an SDATA section to a section group. When querying within an SDATA section, you use the CONTAINS operator. The following example creates a table called items, and adds an SDATA section called my_sec_group, and then queries SDATA in the section.

  1. Create a table called items using the following CREATE TABLE statement.

    CREATE TABLE items
    (id NUMBER PRIMARY KEY, doc VARCHAR2(4000));

    alt description here
  2. Insert data into the items table using the following INSERT TABLE statement.

    INSERT INTO items VALUES (1, ' <description> Honda Pilot </description>
    <category>
    Cars and Trucks </category>
    <price>
    27000 </price> ');
    INSERT INTO items VALUES (2, ' <description> Toyota Sequoia </description>
    <category>
    Cars and Trucks </category>
    <price>
    35000 </price> ');
    INSERT INTO items VALUES (3, ' <description> Toyota Land Cruiser </description>
    <category>
    Cars and Trucks </category>
    <price>
    45000 </price> ');
    INSERT INTO items VALUES (4, ' <description> Palm Pilot </description>
    <category>
    Electronics </category>
    <price>
    5 </price> ');
    INSERT INTO items VALUES (5, ' <description> Toyota Land Cruiser Grill </description>
    <category>
    Parts and Accessories </category>
    <price> 100 </price> ');

  3. alt description here
  4. Create a section group called my_sec_group of the BASIC_SECTION_GROUP type. Then add a SDATA section in my_sec_group called category for the tag <category> and section price for the tag <price>.

    BEGIN
    CTX_DDL.CREATE_SECTION_GROUP('my_sec_group', 'BASIC_SECTION_GROUP');
    CTX_DDL.ADD_SDATA_SECTION('my_sec_group', 'category', 'category', 'VARCHAR2');
    CTX_DDL.ADD_SDATA_SECTION('my_sec_group', 'price', 'price', 'NUMBER');
    END;

    alt description here
  5. Create a CONTEXT index on the items table to index your documents and specify your section group in the parameter clause of CREATE INDEX.

    CREATE INDEX items$doc
    ON items(doc) INDEXTYPE IS CTXSYS.CONTEXT
    PARAMETERS('SECTION GROUP my_sec_group');

    alt description here
  6. Execute the SELECT statement to find text within the sdata section as follows:

    SELECT id, doc
    FROM items
    WHERE contains(doc, 'Toyota
                         AND SDATA(category = ''Cars and Trucks'')
                         AND SDATA(price <= 40000 )') > 0;

    alt description here

XML Section Searching with Oracle Text

Like HTML documents, XML documents have tagged text which you can use to define blocks of text for section searching. The contents of a section can be searched on with the WITHIN or INPATH operators.
You can search XML attribute text in one of two ways:

  • Create attribute sections with CTX_DDL.ADD_ATTR_SECTION and then index with XML_SECTION_GROUP. If you use AUTO_SECTION_GROUP when you index, attribute sections are created automatically. You can query attribute sections with the WITHIN operator.
  • Index with the PATH_SECTION_GROUP and query attribute text with the INPATH operator.

In this tutorial, you will create attribute sections with CTX_DDL.ADD_ATTR_SECTION and then index with XML_SECTION_GROUP.

  1. Create a table called books using the following CREATE TABLE statement.

    create table books(id number, text clob);

    alt description here
  2. Insert data into the books table using the following INSERT TABLE statement. Here, the  XML file defines the <BOOK> tag with a TITLE attribute as follows:

    insert into books values(1, '<BOOK TITLE="Tale of Two Cities">
    It was the best of times. </BOOK>
    <Author NAME="Charles Dickens">
    Born in England in the town, Stratford_Upon_Avon </Author>');

    alt description here
  3. Create a section group called myxmlgroup of the XML_SECTION_GROUP type. Then add a attribute section in myxmlgroup called booktitle for the tag <BOOK>.

    begin
    ctx_ddl.create_section_group('myxmlgroup', 'XML_SECTION_GROUP');
    ctx_ddl.add_attr_section('myxmlgroup', 'booktitle', 'BOOK@TITLE');
    end;

    alt description here
  4. Create an index on the books table to index your documents and specify your section group in the parameter clause of CREATE INDEX.

    CREATE INDEX myindex
    ON books(text)
    INDEXTYPE IS ctxsys.context
    PARAMETERS ('datastore ctxsys.default_datastore
                 filter ctxsys.null_filter
                 section group myxmlgroup'
               );

    alt description here
  5. Execute the SELECT statement to find text within the booktitle attribute section as follows:

    select * from books
    where contains (text, 'Tale WITHIN booktitle') > 0;

    alt description here

Summary

In this tutorial, you learned:

  • Enabling Oracle Text Section Searching
  • Oracle Text Section Types
  • HTML Section Searching with Oracle Text
  • XML Section Searching with Oracle Text

Resources

Credits

  • Lead Curriculum Developer: Dimpi Sarmah
  • Other Contributors: Roger Ford

To navigate this Oracle by Example tutorial, note the following:

Topic List:
Click a topic to navigate to that section.
Expand All Topics:
Click the button to show or hide the details for the sections. By default, all topics are collapsed.
Hide All Images:
Click the button to show or hide the screenshots. By default, all images are displayed.
Print:
Click the button to print the content. The content that is currently displayed or hidden is printed.

To navigate to a particular section in this tutorial, select the topic from the list.