Regular Expression Sample Application - DNA Analysis


Date: 01-Dec-2004


Table of Contents

Introduction
Application Overview
Software Requirements
Terminology
Configuring the Application
Running the Application
Sample Application Files
Troubleshooting
Additional References


Introduction

Prerequisite

To understand this sample application, the user is expected to have knowledge of the following area:

Technical Overview

Regular expressions are a powerful tool for describing and manipulating text data. They are supported by a wide variety of programming and scripting languages and text editors, and now by SQL and PL/SQL in Oracle Database 10g.
Regular expressions are useful because they let programmers work with text in terms of patterns. They are widely used for string searching, manipulation, validation, and formatting in applications that deal with text data. They are also used in bioinformatics to help identify DNA and protein sequences, and by linguists to aid research into natural languages. Native regular expression support in SQL and PL/SQL makes it possible to search for and manipulate text within the Oracle Database itself, adding expressive power to queries, data definitions, and string manipulations.

Application Overview

This application uses regular expressions to extract and analyze DNA data from the SGD database. SGD (Saccharomyces Genome Database) is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, commonly known as baker's or budding yeast. Given a region, you can query SGD to get the corresponding yeast genome sequence. This sample uses regular expressions to parse the raw HTTP output and store the DNA sequence in the database. You can then run regular expression queries to identify specific patterns in the stored data.

A "regular expression" is a sequence of characters that represents one or more strings. To find out whether a certain pattern is present within a given record, such as a DNA or protein sequence, we construct a regular expression that represents that pattern. For example, the pattern "GGATGA" represents the DNA sequence "GGATGA" and no other sequence. The regular expression "GAA[ACGT]{4}TTC" represents GAAACGTTTC, GAAAAAATTC, and so on. Here [ACGT]{4} means exactly four characters, each of which may be A, C, G, or T; any combination is allowed, including the same character four times. These examples show that some regular expression elements match only one character (G represents only guanine), while others can match many different characters. Herein lies the power of regular expression searches: with a relatively small number of symbols, one can specify many different patterns in a single search.
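
The two patterns above can be tried outside the database. The sketch below uses Python's `re` module rather than Oracle's regular expression functions. That choice is an assumption made purely for illustration; the character-class and `{4}` syntax used here behaves the same way in both. The helper name `matches_whole` is hypothetical.

```python
import re

# Pattern from the text: GAA, then any four bases, then TTC.
ASP700I = re.compile(r"GAA[ACGT]{4}TTC")
# A literal pattern with no special characters.
LITERAL = re.compile(r"GGATGA")

def matches_whole(pattern, sequence):
    """Return True if the entire sequence matches the pattern."""
    return pattern.fullmatch(sequence) is not None

# Both sequences named in the text match the pattern.
print(matches_whole(ASP700I, "GAAACGTTTC"))  # True
print(matches_whole(ASP700I, "GAAAAAATTC"))  # True
# Only three bases sit between GAA and TTC here, so this does not match.
print(matches_whole(ASP700I, "GAAACGTTC"))   # False
# The literal pattern matches itself and nothing else.
print(matches_whole(LITERAL, "GGATGA"))      # True
print(matches_whole(LITERAL, "GGATGC"))      # False
```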

The sample application uses the DNASEQ function to connect to the SGD database and retrieve the HTTP stream data. This stream is then parsed using regular expressions to extract only the DNA sequence, eliminating the control characters. The DNA sequence is then checked for each of the enzyme patterns, and the position of the first occurrence of each pattern within the sequence is listed.
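
The parse-and-search step described above can be sketched roughly as follows. This is a minimal Python illustration, not the DNASEQ function itself: the raw input format, the helper names `extract_sequence` and `first_position`, and the sample data are all hypothetical. Positions are reported starting at 1, with 0 meaning "not found", which is the convention Oracle's REGEXP_INSTR function follows.

```python
import re

def extract_sequence(raw_text):
    """Keep only the base letters A, C, G, T and drop everything else.

    Assumes (hypothetically) that the raw text holds sequence lines only,
    with no other prose that happens to contain these letters.
    """
    return re.sub(r"[^ACGT]", "", raw_text.upper())

def first_position(pattern, sequence):
    """Return the 1-based position of the first match, or 0 if none."""
    match = re.search(pattern, sequence)
    return match.start() + 1 if match else 0

# Hypothetical raw input containing spaces, a carriage return, and newlines.
raw = "ccgga attct\r\nag gatcc\n"
sequence = extract_sequence(raw)

print(sequence)                                     # CCGGAATTCTAGGATCC
print(first_position("GAATTC", sequence))           # 4
print(first_position("GGATCC", sequence))           # 12
print(first_position("GAA[ACGT]{4}TTC", sequence))  # 0
```

Stripping everything that is not a base letter is a blunt approach. A real response page would likely need the sequence block located first, which this sketch does not attempt.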

Software Requirements

The following software is required to configure and run this sample application:

Terminology

Term            Definition
<SAMPLE_HOME>   The directory where the sample is extracted


Configuring the Application

  • Unzip the downloaded RegExpDNASample.zip, extracting its contents into the <SAMPLE_HOME> directory.
    This creates the RegExpDNASample folder with all the files and folders.

  • Open a command prompt and change to the <SAMPLE_HOME>/REGEXPDNASample/src folder by executing the following command:
    cd <SAMPLE_HOME>/REGEXPDNASample/src

  • Open a SQL prompt. Connect as SCOTT/TIGER and run the config.sql script from the <SAMPLE_HOME>/REGEXPDNASample/src folder. This creates the database objects (a table and a function) that the application needs.
    Example:
    SQL> @config.sql


Running the Application

The application can be run as follows.

  • From the SQL prompt, run the dna_analysis.sql file by issuing the following command:
    SQL>@dna_analysis.sql

    Enter a value for 'region' (see the sample regions listed below). This PL/SQL block executes the DNASEQ function, which connects to the http://www.yeastgenome.org website and extracts the DNA sequence. The sequence is then stored in the DNA_DB table. The block also searches for certain enzyme patterns and prints the position of the first occurrence of each within the extracted DNA sequence.

    Note:
    You may input any of the following regions for analysis.
    YMR317W
    YMR010W
    YBL016W
    YBR077C
    YAL004W

    The following are a few of the enzymes used in the analysis, with their recognition patterns:

    Enzyme Name   Recognition Pattern   Equivalent Oracle Regular Expression Pattern
    EcoRI         GAATTC                GAATTC
    BamHI         GGATCC                GGATCC
    HindII        GTYRAC                GT[CT]{1}[GA]{1}AC
    Ama87I        CYCGRG                C[CT]{1}CG[GA]{1}G
    Asp700I       GAANNNNTTC            GAA[ACGT]{4}TTC

  • Refer to the Troubleshooting section if you encounter any problems.

  • Alternatively, if you have problems connecting to the website, you can run search_localdb.sql. This script searches for the enzyme sites in the locally stored database.
    Example:
    SQL>@search_localdb.sql
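
The recognition patterns in the enzyme table use single-letter nucleotide ambiguity codes (IUPAC): Y stands for C or T, R for A or G, and N for any base. The sketch below shows one way a regular expression could be derived from such a pattern. It is written in Python for illustration and is not taken from the sample's SQL scripts; the function name `to_regex` and the `AMBIGUITY` mapping are assumptions introduced here.

```python
import re

# Ambiguity codes that appear in the recognition patterns of the enzyme table.
AMBIGUITY = {"Y": "[CT]", "R": "[AG]", "N": "[ACGT]"}

def to_regex(recognition_pattern):
    """Translate a recognition pattern into a regular expression string.

    Plain bases (A, C, G, T) are copied through unchanged; each
    ambiguity code becomes a character class.
    """
    return "".join(AMBIGUITY.get(base, base) for base in recognition_pattern)

# Enzyme names and recognition patterns as listed in the table.
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindII": "GTYRAC",
    "Ama87I": "CYCGRG",
    "Asp700I": "GAANNNNTTC",
}

for name, site in ENZYMES.items():
    print(name, to_regex(site))
```

The generated expressions are written differently from those in the table but match the same strings: `[CT]` is equivalent to `[CT]{1}`, and four consecutive `[ACGT]` classes are equivalent to `[ACGT]{4}`.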

Sample Application Files 

The table below lists the sample application files and describes what each one does in the application.

File                 Description
readme.html          This file.
config.sql           SQL file used to configure the sample. It creates the necessary table and function.
dnaseq.sql           Creates the DNASEQ function.
dna_analysis.sql     PL/SQL code that executes the DNASEQ function and runs the regular expression search on the retrieved sequence.
search_localdb.sql   SQL script that searches for patterns in the locally stored database.

Troubleshooting

You may encounter the "ORA-29273: HTTP request failed" error while running the dna_analysis.sql file if you are behind a firewall.
To solve this problem, open the dnaseq.sql file and search for UTL_HTTP.SET_PROXY. Uncomment the line containing UTL_HTTP.SET_PROXY and replace 'www.yourproxy.com' with the address of your proxy server.


Additional References 

