Legal | Privacy
Using Oracle Data Mining to Analyze Sequence Data

Using Oracle Data Mining to Analyze Sequence Data

This module shows you how to use Basic Local Alignment Search Tool (BLAST) to perform bioinformatics tasks.

This module discusses the following topics:

Overview
Prerequisites
Loading the Data Mining Sample Data
Querying Nucleotide Data
Querying Amino Acid Data
Querying Translated Data

Place the cursor on this icon to display all the screenshots. You can also place the cursor on each individual icon in the following steps to see only the screenshot that is associated with that step.

What Is BLAST?

Sequence alignment is without doubt one of the most commonly performed bioinformatics tasks. The most widely used sequence alignment algorithm is BLAST (Basic Local Alignment Search Tool). BLAST is a method for rapid sequence comparison that was introduced in 1990 by Stephen Altschul. It is typically used to compare a query nucleotide or amino acid sequence against a database of sequences. Its success comes from its combination of speed, sensitivity, and statistical assessment of the results.

BLAST is a heuristic method to find the high-scoring locally optimal alignments between a query sequence and a database. The BLAST algorithm and family of programs rely on the statistics of gapped and ungapped sequence alignments. The statistics allow the probability of obtaining an alignment with a particular score to be estimated.

Over the last few years, as more sequence data has become available as a result of large scale sequencing efforts, BLAST has started to be used for an increasing range of sequence-alignment activities, which include functional annotation, gene discovery, across-organism and across-species analysis, and genome assembly and completion.

A version of BLAST, which is very similar to NCBI BLAST 2.0, has been implemented in the database using table functions. This enables users to perform BLAST queries against data that is held directly inside an Oracle database. Because the algorithms are implemented as table functions, parallel computation is intrinsically supported.

The five core variants of NCBI BLAST have been implemented:

BLASTN compares a nucleotide query sequence against a nucleotide database.
BLASTP compares an amino acid sequence against a protein sequence database.
BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

Two BLAST table functions have been implemented:

MATCH

Returns the following:

q_seq_id: identifier of the query sequence

t_seq_id: identifier of the matched (target) sequence (for example: the NCBI accession number)

score: score of the match
value: the expect value

 

ALIGN

Returns information about the alignment including the following:

q_seq_id: identifier of the query sequence

t_seq_id: identifier of the matched (target) sequence (for example: the NCBI accession number)

pct_identity: percentage of the query sequence that identically matches with the database sequence
alignment_length: the length of the alignment
mismatches: the number of base-pair mismatches between the query and database sequences
gap_openings: the number of gaps opened in gapped alignment
gap_list: the list of offsets where a gap is opened
q_start: the position where the gap alignment starts
q_end: the position where the gap alignment ends
s_start: the position where the database sequence alignment starts
s_end: the position where the database sequence alignment ends
expect: the expect value of the alignment
score: the score of the alignment

 

BLAST queries can be invoked directly using the SQL interface or through an application. The results of queries can also be displayed either by using the SQL interface or through an application. When the SQL interface is used, the user has the flexibility to decide how the results will be displayed.

The introduction of similarity search functionality in the database means that users no longer have to export sequence data and transform it into a BLAST data set prior to doing similarity searches.

Complex BLAST Queries

As BLAST is implemented in the database, and can be invoked by SQL, you can pre-process any queries, as well as perform any required post-processing. The ability to preprocess data, so that you are performing the BLAST search only on a subset of the data, means that queries should be highly performant. The post-processing capability means that it is now possible to integrate the output of a sequence similarity search with the data mining, text mining, and other analytical features of the database.

Before starting this module, you should have:

1.

Completed the Configuring Linux for the Installation of Oracle Database 10g lesson

 

2.

Completed the Installing the Oracle Database 10g on Linux lesson

 

3.

Completed the Postinstallation Tasks lesson. Note: The demo data needed for this lesson is installed in the Postinstallation Tasks lesson.

 

4.

Downloaded and unzipped blast.zip into your working directory (for example, /home/oracle/wkdir)

 

Before you can run your queries, you must perform the following tasks to load the sample data into your database:

1.

Create a new data mining tablespace

2.

Create a new user

3.

Set the utl_file_dir initialization parameter

4.

Create the schema and load the sample data

Create a New Data Mining Tablespace

Perform the following to create your tablespace:

1.

Open a browser and enter the following URL:

http://<hostname>:5500/em

Specify the User Name as sys and the Password. Choose SYSDBA from the Connect As drop-down list, and then click Login.

Move your mouse over this icon to see the image

 

2.

Click the Administration tab.

Move your mouse over this icon to see the image

 

3.

Under Storage, select Tablespaces.

Move your mouse over this icon to see the image

 

4.

Click Create.

Move your mouse over this icon to see the image

 

5.

Enter DMTBLSP as the name and under Datafiles, click Add.

Move your mouse over this icon to see the image

 

6.

Enter dmtblsp01.dbf as the file name and 20 MB for the file size. Make sure Automatically extend datafile when full (AUTOEXTEND) is selected and enter 10 KB in the Increment field. Then click Continue.

Move your mouse over this icon to see the image

 

7.

Click OK to create the tablespace.

Move your mouse over this icon to see the image

 

8.

Your tablespace has been created. Click your database locator link to return to the Administration page.

Move your mouse over this icon to see the image

 

Create a New User

Now you will create the ODM user that will use the tablespace you just created. Perform the following steps:

1.

From the Administration page, select Users.

Move your mouse over this icon to see the image

 

2.

Click Create.

Move your mouse over this icon to see the image

 

3.

Enter ODM in the Name, Password, and Confirm Password fields. Then click the search light icon next to the Default Tablespace field.

Move your mouse over this icon to see the image

 

4.

Select DMTBLSP and click Select.

Move your mouse over this icon to see the image

 

5.

Click the search light icon next to the Temporary Tablespace field.

Move your mouse over this icon to see the image

 

6.

Select TEMP and click Select.

Move your mouse over this icon to see the image

 

7.

Select the Roles subtab.

Move your mouse over this icon to see the image

 

8.

Click Modify.

Move your mouse over this icon to see the image

 

9.

Select DMUSER_ROLE from the Available Roles list and click Move. Then click OK.

Move your mouse over this icon to see the image

 

10.

Click the System Privileges subtab.

Move your mouse over this icon to see the image

 

11.

Click Modify.

Move your mouse over this icon to see the image

 

12.

Select the following system privileges and click Move. Then click OK.

CREATE PROCEDURE
CREATE SEQUENCE
CREATE SESSION
CREATE SYNONYM
CREATE TABLE
CREATE TYPE
CREATE VIEW

Move your mouse over this icon to see the image

 

13.

Click the Object Privileges subtab.

Move your mouse over this icon to see the image

 

14.

Select the Package object type and click Add.

Move your mouse over this icon to see the image

 

15.

Select the search light icon.

Move your mouse over this icon to see the image

 

16.

Select the CTXSYS schema and click Go.

Move your mouse over this icon to see the image

 

17.

Click CTX_DDL and click Select.

Move your mouse over this icon to see the image

 

18.

Select EXECUTE from the Available Privileges list and click Move. Then click OK.

Move your mouse over this icon to see the image

 

19.

Click the Quotas subtab.

Move your mouse over this icon to see the image

 

20.

For the DMTBLSP tablespace select Unlimited for the Quota and click OK.

Move your mouse over this icon to see the image

 

21.

The ODM user has been created. Click the database locator link.

Move your mouse over this icon to see the image

 

Set the utl_file_dir Initialization Parameter

You must set the utl_file_dir to your data location. Perform the following steps:

1.

Under Instance, select All Initialization Parameters.

Move your mouse over this icon to see the image

 

2.

Select the SPFile subtab.

Move your mouse over this icon to see the image

 

3.

Enter utl_file_dir in the Filter field and click Go.

Move your mouse over this icon to see the image

 

4.

Enter the path for your database files and click Apply.

Move your mouse over this icon to see the image

 

5.

Your change was made. Click the database locator link.

Move your mouse over this icon to see the image

 

Create the Schema and Load the Sample Data

Finally, you are ready to create the schema and load the sample data into the ODM user you created. Perform the following steps:

1.

Open a terminal window and execute the following:

cd $ORACLE_BASE/product/10.1.0/db_1/dm/admin
sqlplus odm/odm
@dmuserld

Enter the following parameters:

1: odm
2: odm
3: <oracle_home>
4: <database file path> 
         

Move your mouse over this icon to see the image

 

2.

The schema has been created and the data has been loaded.

Move your mouse over this icon to see the image

 

You will perform a BLASTN query against a human DNA database using the MATCH and ALIGN functions. Perform the following:

1.

You will perform a BLAST search of the given query sequence against a human DNA database and return the seq_id, score, and expect value of matches that score > 25. Open a terminal window and execute the following commands:

cd wkdir
sqlplus odm/odm
@nblast01

The query in the nblast01.sql script is as follows:

select *  
from TABLE(BLASTN_MATCH (
(select sequence from ecoli_query), -- query_sequence
CURSOR(SELECT seq_id, seq_data FROM ecoli10), -- seqdb_cursor
1, -- subsequence_from
-1, -- subsequence_to
0, -- FILTER_LOW_COMPLEXITY
0, -- MASK_LOWER_CASE
10, -- EXPECT_VALUE
0, -- OPEN_GAP_COST
0, -- EXTEND_GAP_COST
0, -- MISMATCH_COST
0, -- MATCH_REWARD
11, -- WORD_SIZE
0, -- X_DROPOFF
0)) -- FINAL_X_DROPOFF
t where t.score > 25;

Move your mouse over this icon to see the image

The output displays the sequence ID, expect value and score for all nucleotide sequences in the ecoli10 database that have a similarity score of greate than 25 compared to the input nucleotide query sequence.

 

2.

Now perform a BLAST search of the given query sequence against a human DNA database and return the full alignment information of matches that have an expect value score > 25. From your terminal window, execute the following commands:

@nblast02

The query in the nblast02.sql script is as follows:

select T_SEQ_ID, PCT_IDENTITY, ALIGNMENT_LENGTH, 
POSITIVES, MISMATCHES, Q_SEQ_START, Q_SEQ_END,
Q_FRAME, T_SEQ_START, T_SEQ_END, T_FRAME,
score, EXPECT
from TABLE(BLASTN_ALIGN (
(select sequence from ecoli_query), -- query_sequence
CURSOR(SELECT seq_id, seq_data FROM ecoli10), -- seqdb_cursor
1, -- subsequence_from
-1, -- subsequence_to
0, -- FILTER_LOW_COMPLEXITY
0, -- MASK_LOWER_CASE
10, -- EXPECT_VALUE
0, -- OPEN_GAP_COST
0, -- EXTEND_GAP_COST
0, -- MISMATCH_COST
0, -- MATCH_REWARD
11, -- WORD_SIZE
0, -- X_DROPOFF
0)) -- FINAL_X_DROPOFF
t where t.score > 25;

Move your mouse over this icon to see the image

The results show the sequence ID, expect value, score and full alignment information for all nucleotide sequences in the ecoli10 database that have a sequence similarity score of greater than 25 compared to the input nucleotide query sequence.

 

You will perform a BLASTP query against all human proteins in SwissProt using the MATCH and ALIGN functions. Perform the following:

1.

Perform a BLAST search of the given query sequence against all human proteins in SwissProt and return the seq_id, score, and expect value of matches that score > 25. From your terminal window, execute the following commands:

@pblast01

The query in the pblast01.sql script is as follows:

Select T_SEQ_ID, score, EXPECT as evalue  
from TABLE(BLASTP_MATCH (
(select sequence from query_db), -- query_sequence
CURSOR(SELECT seq_id, seq_data
FROM swissprot
WHERE organism = 'Homo sapiens (Human)'), -- seqdb_cursor
1, -- subsequence_from
-1, -- subsequence_to
0, -- FILTER_LOW_COMPLEXITY
0, -- MASK_LOWER_CASE 'BLOSUM62', -- SUB_MATRIX
10, -- EXPECT_VALUE
0, -- OPEN_GAP_COST
0, -- EXTEND_GAP_COST
0, -- WORD_SIZE
0, -- X_DROPOFF
0)) -- FINAL_X_DROPOFF
t where t.score > 25;
Move your mouse over this icon to see the image

The output displays the sequence ID, expect value, and score for all human amino acid sequences in the SwissProt database that have a similarity score of greater than 25 compared to the input amino acid query sequence.

 

2.

Perform a BLAST search of the given query sequence against all of the human proteins in SwissProt created after 01-Jan-90 and return the full alignment information of matches with an expect value score > 25. From your terminal window, execute the following commands:

@pblast02

The query in the pblast02.sql script is as follows:

Select T_SEQ_ID, ALIGNMENT_LENGTH, 
Q_SEQ_START, Q_SEQ_END, Q_FRAME, T_SEQ_START,
T_SEQ_END, T_FRAME, score, EXPECT as evalue
from TABLE(BLASTP_ALIGN (
(select sequence from query_db), -- query_sequence
CURSOR(SELECT seq_id, seq_data
FROM swissprot
WHERE organism = 'Homo sapiens (Human)' AND
creation_date > '01-Jan-90'), -- seqdb_cursor
1, -- subsequence_from
-1, -- subsequence_to
0, -- FILTER_LOW_COMPLEXITY
0, -- MASK_LOWER_CASE 'BLOSUM62', -- SUB_MATRIX
10, -- EXPECT_VALUE
0, -- OPEN_GAP_COST
0, -- EXTEND_GAP_COST
3, -- WORD_SIZE
0, -- X_DROPOFF
0)) -- FINAL_X_DROPOFF
t where t.score > 25;

Move your mouse over this icon to see the image

The results show the sequence ID, expect value, score and full alignment information for all human amino acid sequences in the SwissProt database that have a sequence similarity score of greater than 25 compared to the input amino acid query sequence.

 

You will perform TBLAST queries using the ALIGN function. Perform the following:

1.

Perform a BLAST search of the given nucleotide query sequence against an amino acid database and return the full alignment information of matches with expect values that score > 25. From your terminal window, execute the following commands:

@tblast01

The query in the tblast01.sql script is as follows:

Select T_SEQ_ID, PCT_IDENTITY, ALIGNMENT_LENGTH, 
POSITIVES, MISMATCHES, Q_SEQ_START, Q_SEQ_END,
Q_FRAME, T_SEQ_START, T_SEQ_END, T_FRAME,
score, EXPECT
from TABLE(TBLAST_ALIGN (
(select sequence from ecoli_query), -- query_sequence
CURSOR(SELECT seq_id, seq_data FROM prot_db), -- seqdb_cursor
1, -- subsequence_from
-1, -- subsequence_to
'blastx', -- TRANSLATION_TYPE
1, -- GENETIC_CODE
0, -- FILTER_LOW_COMPLEXITY
0, -- MASK_LOWER_CASE
'BLOSUM62', -- SUB_MATRIX
10, -- EXPECT_VALUE
0, -- OPEN_GAP_COST
0, -- EXTEND_GAP_COST
3, -- WORD_SIZE
0, -- X_DROPOFF
0)) -- FINAL_X_DROPOFF
t where t.score > 25;
Move your mouse over this icon to see the image

The results show the sequence ID, expect value, score and full alignment information for all amino acid sequences in the protdb database that have a sequence similarity score of greater than 25 compared to the input nucleotide query sequence.

 

2.

Perform a BLAST search of an amino acid query sequence against a translated nucleotide database and return the full alignment information of matches with an expect value > 25. From your terminal window, execute the following commands:

@tblast02

The query in the tblast02.sql script is as follows:

Select * 
from TABLE(TBLAST_ALIGN (
(select sequence from query_db), -- query_sequence
CURSOR(SELECT seq_id, seq_data FROM ecoli10), -- seqdb_cursor
1, -- subsequence_from
53, -- subsequence_to
'blastn', -- TRANSLATION_TYPE
1, -- GENETIC_CODE
0, -- FILTER_LOW_COMPLEXITY
0, -- MASK_LOWER_CASE
'BLOSUM62', -- SUB_MATRIX
10, -- EXPECT_VALUE
0, -- OPEN_GAP_COST
0, -- EXTEND_GAP_COST
7, -- WORD_SIZE
0, -- X_DROPOFF
0)) -- FINAL_X_DROPOFF
t where t.score > 25;

Move your mouse over this icon to see the image

The results show the sequence ID, expect value, score and full alignment information for all nucleotide sequences in the ecoli10 database that have a sequence similarity score of greater than 25 compared to the input amino acid sequence.

 

3.

Perform a BLAST search of the given nucleotide query sequence against a nucleotide database and return the full amino acid alignment information of matches with expect values that score > 25. From your terminal window, execute the following commands:

@tblast03

The query in the tblast03.sql script is as follows:

select T_SEQ_ID, PCT_IDENTITY, ALIGNMENT_LENGTH, 
POSITIVES, MISMATCHES, Q_SEQ_START, Q_SEQ_END,
Q_FRAME, T_SEQ_START, T_SEQ_END, T_FRAME,
score, EXPECT
from TABLE(TBLAST_ALIGN (
(select sequence from ecoli_query), -- query_sequence
CURSOR(SELECT seq_id, seq_data FROM ecoli10), -- seqdb_cursor
1, -- subsequence_from
53, -- subsequence_to
'tblastx', -- TRANSLATION_TYPE
1, -- GENETIC_CODE
0, -- FILTER_LOW_COMPLEXITY
0, -- MASK_LOWER_CASE
'BLOSUM62', -- SUB_MATRIX
10, -- EXPECT_VALUE
0, -- OPEN_GAP_COST
0, -- EXTEND_GAP_COST
3, -- WORD_SIZE
0, -- X_DROPOFF
0)) -- FINAL_X_DROPOFF
t where t.score > 25;

Move your mouse over this icon to see the image

The results show the sequence ID, expect value, score and full alignment information for all translated nucleotide sequences in the ecoli10 database that have a sequence similarity score of greater than 25 compared to the input translated nucleotide query sequence.

 

Place the cursor on this icon to hide all screenshots.

 

E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy