Working with Oracle Theme Indexes

Overview

Purpose

This tutorial introduces you to the basics of Oracle Text theme indexes and queries. It can all be run from SQL*Plus or SQL Developer and requires no additional files.

Time to Complete

Approximately 1 hour

Prerequisites

  • Have access to or have Installed Oracle Database 12c.
  • Have access to or have installed the SQL Developer or SQL*Plus.
  • Have access to or have installed the database examples package.

Theme Indexes

This tutorial covers the following topics:

  • Creating a table with a simple text document.
  • Creating a Theme Index on that table.
  • Examining the themes that were found in the document.

Creating a table with a simple text document

Perform the following steps to complete this section:

  1. Double-click on the SQL Developer icon on your desktop.

    alt description here
  2. In the Connections tab, click the New Connection icon. A New / Select Database Connection window will appear.

    alt description here

    The New / Select Database Connection dialog opens. Enter the connection details as follows and click Test.

    Connection Name: hr_conn
    User Name: hr
    Password: Your hr password (Select Save Password)
    Hostname: localhost
    SID: Your database SID. Default is usually "orcl"

    alt description here


  3. Check for the status of the connection on the left-bottom side (above the Help button). It should read Success. To save the connection, click Save. Then click Connect. Close the window.

    alt description here
  4. Open a SQL Worksheet and create a table called docs. It should have a single column called TEXT which is VARCHAR2(2000).

    CREATE TABLE docs (text VARCHAR2(2000));

    alt description here
    alt description here
  5. Load the table with some simple data.

    INSERT INTO docs VALUES ('
    The interest rate for deposit accounts has not increased for several years.
    This indicates a poor rate of return for investors.');

    alt description here

Creating a Theme Index on that table

A theme is "something that a document is about". A document cannot be about "invested" – that doesn't make sense, grammatically. However, the document could be about "investors" or "investing". Themes are derived from the Oracle Text Knowledge Base – a large body of knowledge (technically, an ontology) about the English language. The Knowledge Base is loaded with the database "Examples" pack.

  1. Create an index on the docs table.

    CREATE INDEX docs_index ON docs(text) INDEXTYPE IS CTXSYS.CONTEXT;

    alt description here

    Note: You have used a special "indextype" of CTXSYS.CONTEXT.Indextypes are part of the Extensibility Framework of the Oracle kernel, and CTXSYS.CONTEXT is an extensible (or DOMAIN ) index used by Oracle Text.
    There are other indextypes in Oracle Text, but CONTEXT is the most common type and the only one we shall consider here. When Oracle Text creates an index, it creates several special tables to hold index data and metadata, the most interesting of which is the so-called "Dollar-I" table. Its name is derived from the index name, and here it will be called DR$DOCS_INDEX$I.

  2. Execute the DESCRIBE command to view the DR$DOCS_INDEX$I table.

    alt description here

    Note: Of these columns, we're most interested in TOKEN_TEXT (the indexed words from your documents) and TOKEN_TYPE (the type of word indexed). Let's examine them in the next step.

  3. Execute the SELECT statement.

    SELECT token_type, token_text FROM dr$docs_index$i;

    alt description here

    Note: Here all the TOKEN_TYPE values are 0. Those are "ordinary indexed words" because we created a basic index.

  4. You want to create themes indexed as well. Drop the old index and recreate it, this time creating an index preference telling the lexer (process which selects words from text) to include themes.

    DROP INDEX docs_index;

    alt description here
  5. Use the sys_conn and enter the following code to grant CTXAPP role and execute privileges on CTX PL/SQL Package to the user hr.

    GRANT RESOURCE, CONNECT, CTXAPP TO hr;
    GRANT EXECUTE ON CTXSYS.CTX_DDL TO hr;

    alt description here
  6. Execute the following statements to create the index preference using the CTX_DDL package, and then specify the index preference using the PARAMETERS clause of the CREATE INDEX statement.

    exec ctx_ddl.create_preference( 'my_lexer', 'BASIC_LEXER' );
    exec ctx_ddl.set_attribute ('my_lexer', 'INDEX_THEMES', 'YES';

    alt description here

    CREATE INDEX docs_index ON docs(text
    INDEXTYPE IS CTXSYS.CONTEXT
    PARAMETERS ( 'lexer my_lexer' );

    alt description here
  7. Execute the SELECT statement to select the tokens again:

    SELECT token_type, token_text FROM dr$docs_index$i;

    alt description here

    Note: TOKEN_TYPE = 0 are the ordinary indexed words – words that appear in the text. Note that "stopwords" such as the, for, has are not indexed. TOKEN_TYPE=1 are indexed themes, that is, things that the document is "about". In general, these words and phrases will not have appeared in the original text, but will have been derived from it.

  8. Execute the SELECT statement with the CONTAINS function in the query to search the expression.

    SELECT * FROM docs WHERE contains( text, 'interest' ) > 0;

    alt description here
  9. Execute the SELECT statement with the ABOUT operator within the CONTAINS clause in the query to do a theme search.

    SELECT * FROM docs WHERE contains(text, 'about(general investment)')>0;

    alt description here

Extracting Themes

This tutorial covers the following topics:

  • Fetching themes for a document
  • Getting "full" themes showing the hierarchy within the Knowledge Base
  1. Create a docs table similar to the previous section, but this time it needs a primary key column.

    drop table docs;
    create table docs (id number primary key, text varchar2(2000));

    alt description here
  2. Run the INSERT statement to insert data into the table.

    insert into docs values (1, ' The interest rate for deposit accounts has not increased for several years.
    This indicates a poor rate of return for investors. ');

    alt description here
  3. Run the CREATE INDEX statement to create an index.

    create index docs_index on docs(text) indextype is ctxsys.context parameters ('nopopulate');

    alt description here
  4. Create a themes table called themes_table to store the extracted themes from the docs table.

    create table themes_table ( query_id number, theme varchar2(2000), weight number );

    alt description here
  5. Use the CTX_DOC package to get themes from the docs table into the themes_table table.
    In the query specify the table containing our docs, the index name, and the primary key value for the row (in the textkey parameter).

    alt description here
  6. Select the themes from themes_table using the following SELECT statement:

    select theme, weight from themes_table order by weight desc;

    alt description here

    Note: The themes in the output are similar to the themes you saw indexed before. But you are getting more information because you can now see how important each theme is for the document, by referencing the WEIGHT column.

  7. You can get more information by extracting FULL themes. Set the full_themes parameter to true as shown in the following statement.

    delete from themes_table;

    begin
    ctx_doc.themes (
    index_name => 'docs_index',
    restab => 'themes_table',
    textkey => 1,
    full_themes => TRUE );
    end;

    alt description here

    select theme, weight from themes_table order by weight desc;

    alt description here

    Note: You will notice that some themes are "solitary" and some part of a hierarchy. The "solitary" themes are unproved.
    That means they were found in the text, but there were no other related terms around them to back them up. Proved themes can be located properly in the knowledge base hierarchy due to the presence of related terms nearby.

Extracting Gists

A gist is a summary of a document – usually a paragraph or two that best describes the overall content of the document.
There are two types of gist:

  • The generic gist, which best describes the document as a whole
  • Point of view gists, that describe the document from the point of view of a particular theme.
  1. Drop and recreate the docs table with a primary key column and a varchar2(2000) column.

    drop table docs;

    create table docs (id number primary key, text varchar2(4000));

    alt description here

  2. Run the INSERT statement to load data into the table.

    insert into docs values (1, '
    Baby Cambridge''s first day out to Granny Carole''s: Proud parents Kate and William leave Kensington Palace for Bucklebury with son after visit from Queen and Harry.'||chr(10)||' Baby Cambridge was taken to see his grandparents on his first afternoon out today as it was revealed the Queen and Prince Harry have now met the royal baby for the first time.'||chr(10)||' Kate and William smiled broadly and waved from their car as they were driven away from Kensington Palace by security, where they had spent their first night together as a family.'||chr(10)||' Both the Duke and Duchess looked happy and fresh-faced, with William sat in the front passenger seat and Kate in the back with their child in his baby seat.'||chr(10)||' They left their west London home shortly after the Queen had visited her new great-grandson and potentially discussed names with the parents, as the guessing game over what he will be called continues.'||chr(10)||' Kensington Palace officials would not confirm where the young family were going this afternoon, but their black Land Rover was later seen arriving at grandparents Carole and Michael Middleton''s mansion in Bucklebury, Berkshire.'||chr(10)||' Her Majesty, who will travel to Balmoral for her summer holiday on Friday, spent 30 minutes with the Duke, Duchess and Baby Cambridge from 11am.'||chr(10)||' It has also emerged that Prince Harry, who is said to be thrilled to have become an uncle, may have been there last night after the trio left hospital, having raced back to London from Wattisham airbase in Suffolk where he is on duty with the RAF.'||chr(10)||' It is understood that James Middleton and Pippa Middleton were also at Kensington Palace yesterday evening.'||chr(10)||' ');

    alt description here
  3. Create an index on the docs table. As before, the index does not need to be populated:

    create index docs_index on docs(text) indextype is ctxsys.context parameters ('nopopulate');

    alt description here
  4. Create a gists_table and fetch the gists into it using the following code:

    create table gists_table ( query_id number, pov varchar2(80), gist clob );

    alt description here

    begin
    ctx_doc.gist (
    index_name => 'docs_index',
    textkey => '1',
    restab => 'gists_table' );
    end;

    alt description here
  5. Run the SELECT statement to select the generic gists from the table.

    select gist from gists_table where pov = 'GENERIC';

    alt description here
  6. Run the SELECT statement to select the point-of-view gists from the table.

    select upper(pov), gist from gists_table;

    alt description here

Summary

In this tutorial, you learned to:

  • Creating a table with a simple text document.
  • Creating a Theme Index on that table.
  • Examining the themes that were found in the document.
  • Fetching themes for a document
  • Getting "full" themes showing the hierarchy within the Knowledge Base
  • Getting generic and point-of-view gists for a document

Resources

Credits

  • Lead Curriculum Developer: Dimpi Sarmah
  • Other Contributors: Roger Ford

To navigate this Oracle by Example tutorial, note the following:

Topic List:
Click a topic to navigate to that section.
Expand All Topics:
Click the button to show or hide the details for the sections. By default, all topics are collapsed.
Hide All Images:
Click the button to show or hide the screenshots. By default, all images are displayed.
Print:
Click the button to print the content. The content that is currently displayed or hidden is printed.

To navigate to a particular section in this tutorial, select the topic from the list.