Regular expressions are a powerful tool for describing
and manipulating text data. They are supported by a wide variety of programming
and scripting languages, text editors, and now by SQL and PL/SQL in Oracle
Database 10g. Regular expressions are useful because they
allow programmers to work with text in terms of patterns, which makes them well
suited to string searching, manipulation, validation, and formatting in
applications that deal with text data. They are also used in bioinformatics to
help identify DNA and protein sequences, and by linguists to aid research into
natural languages. Native regular expression support in SQL and PL/SQL makes it
possible to search for and manipulate text within the Oracle Database directly
in queries, data definitions, and string manipulations.
Application Overview
This application uses regular expressions to extract and
analyze DNA data from the SGD database. SGD (Saccharomyces Genome Database) is
a scientific database of the molecular biology and genetics of the yeast Saccharomyces
cerevisiae, commonly known as baker's or budding yeast. Given a region,
you can query the SGD website to get the corresponding yeast genome sequence.
This sample uses regular expressions to parse the raw HTTP output and stores
the DNA sequence in the database. You can then run regular expression queries
to identify specific patterns in the stored sequences.
A "regular expression" is a sequence of characters that describes one or
more strings. To find out whether a certain pattern is present in a given record,
such as a DNA or protein sequence, we construct a regular expression that
represents that pattern. For example, the pattern "GGATGA" represents the DNA
sequence "GGATGA" and no other sequence. The regular expression "GAA[ACGT]{4}TTC"
represents GAAACGTTTC, GAAAAAATTC, and so on. Here [ACGT]{4} matches exactly four
characters, each of which may be A, C, G, or T; they may be any combination of
these characters, or all four may be the same character. These examples show
that some regular expression elements match only one character (for example,
G matches only guanine), while others can match many different characters.
Herein lies the power of regular expression searches: with a relatively small
number of symbols, one can specify many different patterns in a single search.
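As a quick way to try the pattern described above without a database, here is a minimal sketch in Python. It is not part of the sample application; the candidate strings other than the two named in the text are made up for illustration, and Python's re module stands in for Oracle's regular expression engine.

```python
import re

# Pattern from the text: GAA, then exactly four bases, then TTC.
pattern = re.compile(r"GAA[ACGT]{4}TTC")

# The first two strings come from the text; the last two are
# illustrative counter-examples chosen for this sketch.
candidates = ["GAAACGTTTC", "GAAAAAATTC", "GAATTC", "GAAACGNTTC"]

# fullmatch() requires the whole string to match the pattern.
matches = {s: bool(pattern.fullmatch(s)) for s in candidates}

for sequence, matched in matches.items():
    print(sequence, "matches" if matched else "does not match")
```

GAATTC fails because the four-base middle section is missing, and GAAACGNTTC fails because N is not one of the characters listed in the bracket expression.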
The sample application uses the DNASEQ function to connect to
the SGD website and retrieve the HTTP stream data. This stream is then parsed
using regular expressions to extract only the DNA sequence, eliminating the
control characters. The DNA sequence is then checked for each of the enzyme
patterns, and the position of the first occurrence of each pattern within the
sequence is listed.
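The two processing steps just described (cleaning the retrieved text, then locating the first occurrence of a pattern) can be sketched roughly as follows. This is a Python illustration only, not the DNASEQ implementation: the helper names clean_sequence and first_position are hypothetical, the raw input is invented, and the sketch assumes the retrieved text already contains only the sequence body plus line breaks and other control characters. Positions are reported 1-based, with 0 meaning "not found", which mirrors how Oracle's REGEXP_INSTR reports positions.

```python
import re

def clean_sequence(raw: str) -> str:
    """Hypothetical helper: keep only the base letters A, C, G, T."""
    return re.sub(r"[^ACGT]", "", raw.upper())

def first_position(sequence: str, pattern: str) -> int:
    """Hypothetical helper: 1-based position of the first match, 0 if none."""
    match = re.search(pattern, sequence)
    return match.start() + 1 if match else 0

# Invented raw text standing in for the retrieved stream (assumption).
raw = "ATGC\r\nGGAT\tCCAA\nGAATTC\n"
seq = clean_sequence(raw)

print(seq)
print("BamHI:", first_position(seq, "GGATCC"))
print("EcoRI:", first_position(seq, "GAATTC"))
```

A real response from the website would need more careful parsing than this, since any markup or labels containing the letters A, C, G, or T would otherwise leak into the sequence.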
Software Requirements
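Configuring and running this sample application requires Oracle Database 10g,
which provides the regular expression support in SQL and PL/SQL, and a SQL
prompt from which to run the scripts.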
Unzip the downloaded RegExpDNASample.zip and extract its contents
into the <SAMPLE_HOME> directory.
This creates the RegExpDNASample folder with all
the files and folders.
Open a command prompt and change
to the <SAMPLE_HOME>/RegExpDNASample/src folder
by executing the following command:
cd <SAMPLE_HOME>/RegExpDNASample/src
Open a SQL prompt, connect as SCOTT/TIGER, and run
the config.sql script from the <SAMPLE_HOME>/RegExpDNASample/src
folder. This creates the database objects (table and function) needed
by this application.
Example:
SQL> @config.sql
Running the Application
Run the application as follows.
From the SQL prompt, run the dna_analysis.sql file by issuing
the following command:
SQL> @dna_analysis.sql
Enter a value for 'region' (see the list below for sample regions).
This PL/SQL block executes the DNASEQ function, which connects to the http://www.yeastgenome.org
website and extracts the DNA sequence. The sequence is then stored in the
DNA_DB table. The PL/SQL block also searches for certain enzyme patterns
and prints the position of their first occurrence within the extracted DNA sequence.
Note: You may input any of the following regions for analysis.
YMR317W
YMR010W
YBL016W
YBR077C
YAL004W
The following are a few of the enzyme names used in the analysis
and their recognition patterns.
Enzyme Name    Recognition Pattern    Equivalent Oracle Regular Expression Pattern
EcoRI          GAATTC                 GAATTC
BamHI          GGATCC                 GGATCC
HindII         GTYRAC                 GT[CT]{1}[GA]{1}AC
Ama87I         CYCGRG                 C[CT]{1}CG[GA]{1}G
Asp700I        GAANNNNTTC             GAA[ACGT]{4}TTC
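The table implies a simple substitution from recognition pattern to regular expression: Y becomes [CT], R becomes [GA], and N becomes [ACGT]. A minimal Python sketch of that translation is shown here as an illustration; the function name to_regex and the mapping dictionary are hypothetical and not part of the sample, and only the ambiguity codes that appear in the table are handled. It omits the redundant {1} quantifiers, so the output is equivalent to, rather than identical with, the HindII and Ama87I entries.

```python
import re
from itertools import groupby

# Codes used in the table above; other ambiguity codes are not covered here.
CODE_TO_CLASS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "[CT]",    # C or T
    "R": "[GA]",    # G or A
    "N": "[ACGT]",  # any base
}

def to_regex(site: str) -> str:
    """Hypothetical helper: turn a recognition pattern into a regex string."""
    parts = []
    for code, run in groupby(site):
        count = len(list(run))
        char_class = CODE_TO_CLASS[code]
        if len(char_class) == 1:
            # Plain base: repeat the literal letter.
            parts.append(char_class * count)
        else:
            # Ambiguity code: emit the class, with {n} for runs longer than one.
            parts.append(char_class + (f"{{{count}}}" if count > 1 else ""))
    return "".join(parts)

for name, site in [("HindII", "GTYRAC"), ("Ama87I", "CYCGRG"), ("Asp700I", "GAANNNNTTC")]:
    print(name, site, to_regex(site))
```

For example, the pattern produced for HindII matches GTCGAC and GTTAAC but not GTAGAC, because the third position must be C or T.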
Refer to the TroubleShooting
section if you encounter any problems.
Alternatively, if you have problems connecting to the website, you can run
search_localdb.sql. This script searches for the enzyme
sites in the locally stored database.
Example:
SQL> @search_localdb.sql
Sample Application Files
The following table lists the sample application files, their
directory locations, and a description of what each file does in the
application.
Directory              File                  Description
RegExpDNASample\doc    readme.html           This file.
RegExpDNASample\src    config.sql            Configures the sample by creating the necessary table and function.
RegExpDNASample\src    dnaseq.sql            Creates the DNASEQ function.
RegExpDNASample\src    dna_analysis.sql      PL/SQL code that executes the DNASEQ function and runs the regular expression search on the retrieved sequence.
RegExpDNASample\src    search_localdb.sql    SQL script that searches for patterns in the locally stored database.
TroubleShooting
You may encounter the "ORA-29273: HTTP request failed"
error while running the dna_analysis.sql file if you are behind a firewall.
To solve this problem, open the dnaseq.sql file, search for UTL_HTTP.SET_PROXY,
uncomment the line containing UTL_HTTP.SET_PROXY, and replace
'www.yourproxy.com' with the correct proxy server address.