Protein Secondary Structure Sample Application

prot_idxtype.zip

This application demonstrates how you can implement a data cartridge,  load data into an Oracle table using SQL*Loader, create domain indexes on indexed data, and make use of these indexes in queries.

The indexing scheme used in this example is based on the paper entitled " Searching on the Secondary Structure of Protein Sequences" by Hammel and Patel.  Some details of a protein's secondary structure, including Loops, Sheets, and Helices, can aid Life Scientists in determining the function of the protein.  Life Scientists would like to be able to perform queries based on the type and length of secondary structure segments appearing in a protein.  For example, the query condition <E 3 5><H 4 4> would match all proteins having a Sheet of between length 3 and 5, followed immediately by a Helix of length 4.  More details on the query language and the indexing scheme can be found in the Hammel and Patel paper, as well as in the presentation links below.

The sample application makes use of a dataset of proteins with known secondary structures called Stride, which is distributed with the Predator protein secondary structure prediction utility.  If you would like more secondary structure protein data, you can obtain it from PDB (the "Protein Data Bank"), or you can generate it from the primary structure using Predator.  Predator is available by FTP at:
UNIX Version: ftp://ftp.ebi.ac.uk/pub/software/unix/predator
DOS Version:  ftp://ftp.ebi.ac.uk/pub/software/dos/predator

Presentation Information

  • A presentation walking through this demo is available for download. 
Extensible Indexing Information

For further details about the Extensible Indexing framework, and how to build your own indexing scheme in Oracle, please see the Oracle Data Cartridge Developer's Guide.

Prerequisites

Description of the Files in the Sample Application
  • protss.sql - Creates all of the extensible indexing objects necessary to run the sample code, including indextype, operators, metadata table, and statistics type.  Also creates a schema, protss, to serve as the "indextype owner".   
  • createtable.sql - Creates a schema to hold some sample table data.  This schema is known as the "index owner".  A Proteins table is created to hold primary and secondary protein information.
  • load.ctl - SQL*Loader control file, used to load the sample data.
  • createindex.sql - Creates the domain index on the Proteins table.
  • query.sql - Some sample queries.
  • utlxplan.sql - Creates the plan_table to be used by the "Explain Plan" command to show optimizer query execution plans.
  • showplan.sql - Prints the execution plan stored in plan_table.  Assumes that the plan_table has already been defined.
  • stride_sql.dat - The Stride data in SQL*Loader form, converted from the Stride database available with Predator. 
Installation Notes 
  • From the protein_eix directory, connect to the database as SYSTEM or SYS using SQLPlus.  Execute protss.sql to create the Secondary Structure objects.  At the SQL prompt, type:
  •  
    SQL> @protss

  • Connect to the database as SYSTEM or SYS again.  Execute createtable.sql.  This creates the index owner, and the Proteins table.  At the SQL prompt, type:
  •  
    SQL> @createtable
     
  • Populate the Proteins table.  First, exit SQLPlus, and go to a command prompt, pointed to the protein_eix directory.  Execute SQL*Loader to load the sample data. 

  • From the command prompt, execute:
     
    C:\protein_eix> sqlldr protss_user/protss_user load.ctl log=xyz.log direct=true
     
  • Connect to the database again, as SYSTEM or SYS using SQLPlus.  Execute createindex.sql to create the domain index.  At the SQL prompt, type:
  •  
    SQL> @createindex

    Depending on hardware, this command could take up to several minutes to complete.   

  • The demo setup is now complete.  You can run some sample queries by executing query.sql in SQLPlus.  At the SQL prompt, type:
  •  
    SQL> @query
Oracle Open World 2014 Banner

In-Memory Replay Banner