|
Using Regular Expressions
Using Regular Expressions
In this tutorial, you learn how to use regular expression
support. This new feature is introduced in Oracle Database 10g.
Approximately 30 minutes
This tutorial covers the following topics:
Place the cursor over this icon to load and view all the screenshots for this tutorial. (Caution: This action loads all screenshots simultaneously, so response time may be slow depending on your Internet connection.)
Note: Alternatively, you can place the cursor over an individual icon in the following steps to load and view only the screenshot associated with that step. You can hide an individual screenshot by clicking it.
In Oracle Database 10g, you can use both SQL and PL/SQL
to implement regular expression support. Regular expressions are a method of
describing both simple and complex patterns for searching and manipulating.
String manipulation and searching contribute to a large percentage of the logic
in a Web-based application. Usage ranges from the simple (for example, find
the word San Francisco in a specified text) to the complex (for example, extract
all URLs from the text) to the more complex (for instance, find all words in
which every second character is a vowel).
Oracle Database 10g introduces support for regular
expressions. The implementation complies with the Portable Operating System
for UNIX (POSIX) standard, issued by the Institute of Electrical and Electronics
Engineers (IEEE), for ASCII data-matching semantics and syntax. Oracle's multilingual
capabilities extend the matching capabilities of the operators beyond the POSIX
standard.
When coupled with native SQL, the use of regular expressions makes it possible to perform powerful search and manipulation operations on any data stored in an Oracle database. You can use this feature to easily solve problems that would otherwise be very complex to program.
Back to Topic List
Before you perform this tutorial, you should:
Back to Topic List
Matching Mechanism
If you have a string aabcd and you specify a search for a(b|c)d, the search looks for a followed by either b or c, which is then followed by d.
Regular Expression: 'a(b|c)d' String to Match: 'aabcd'
|
| a |
a |
b |
c |
d |
Description |
Result |
| * |
|
|
|
|
Look for a and succeed
|
Match |
| |
* |
|
|
|
Look for b and fail |
No match |
| |
* |
|
|
|
Look for c and fail, reset, and advance |
No match |
| |
* |
|
|
|
Look for a and succeed
|
Match |
| |
|
* |
|
|
Look for b and succeed; remember c as alternative |
Match |
| |
|
|
* |
|
Look for d and fail |
No match |
| |
|
* |
|
|
Look for c as last remembered alternative and fail, reset, and advance |
No match |
| |
|
* |
|
|
Look for a and fail, reset, and advance |
No match |
| |
|
|
* |
|
Look for a and fail, reset, and advance
|
No match |
| |
|
|
|
* |
Look for a and fail, reset, and advance
|
No match |
Given the string aabcd, a(b|c)d does not match it.
To implement regular expression support in either SQL or PL/SQL, you use a new set of functions. These functions are:
| Function Name |
Description |
| REGEXP_LIKE |
Similar to the LIKE operator, but performs regular expression matching instead of simple pattern matching |
| REGEXP_INSTR |
Searches for a given string for a regular expression pattern and returns the position were the match is found |
| REGEXP_REPLACE |
Searches for a regular expression pattern and replaces it with a replacement string |
| REGEXP_SUBSTR |
Searches for a regular expression pattern within a given string and returns the matched substring |
POSIX Metacharacters in Oracle Database Regular Expressions
Metacharacters are special characters that have a special meaning, such as a wild card character, a repeating character, a nonmatching character, or a range of characters.
You can use several predefined metacharacter symbols in the pattern matching with the functions.
| Symbol |
Description |
*
|
Matches zero or more occurrences |
|
|
Alternation operator for specifying alternative matches |
^/$
|
Matches the start of line and the end of line |
[]
|
Bracket expression for a matching list matching any one of the expressions represented in the list |
[^exp]
|
If the caret is inside the bracket, it negates the expression. |
{m}
|
Matches exactly m times |
{m,n}
|
Matches at least m times but no more than n times |
[: :]
|
Specifies a character class and matches any character in that class |
\
|
Can have four different meanings: (1) stand for itself; (2) quote the next character; (3) introduce an operator; (4) do nothing |
+
|
Matches one or more occurrences |
?
|
Matches zero or one occurrence |
.
|
Matches any character in the supported character set (except NULL) |
()
|
Grouping expression (treated as a single subexpression) |
\n
|
Backreference expression |
[==]
|
Specifies equivalence classes |
[..]
|
Specifies one collation element (such as a multicharacter element) |
Perl Regular Expression Extensions
In addition to the POSIX standard, Oracle supports the common Perl-influenced meta characters. If you are a life sciences developer who relies on Perl to do pattern analysis on bioinformatics data stored in huge databases of DNAs and proteins, you can use SQL Regular Expression support directly on the data rather than from the middle tier. This provides you with a more efficient solution. The metacharacters added for Perl compatability are:
| Operator |
Description |
\d
|
Match a digit character |
\D
|
Match a non-digit character |
\w
|
Match a word character |
\W
|
Match a non-word character |
\s
|
Match a whitespace character |
\S
|
Match a non-whitespace character |
\A
|
Match only at beginning of string |
\Z
|
Match only at end of string, or before newline at the end |
\z
|
Match only at end of string |
*?
|
Match 0 or more times (non-greedy) |
+? |
Match 1 or more times (non-greedy) |
??
|
Match 0 or 1 time ( non-greedy) |
{n}? |
Match exactly n times (non-greedy) |
{n,}? |
Match at least n times (non-greedy) |
{n,m}? |
Match at least n but not more than m times (non-greedy) |
For more information on Perl-Influenced Extensions in
Oracle Regular Expressions, refer to the Oracle Database Application Developer's
Guide - Fundamentals 10g Release 2 (10.2), chapter 4 "Using Regular
Expressions in Oracle Database".
Back to Topic List
You can use the regular expression functions to perform basic searches.
1. |
From a terminal window, change to the wkdir
directory.
cd\wkdir
Set your NLS_LANG
environment setting as follows:
set NLS_LANG=.AL32UTF8
Start SQL*Plus, connect to Oracle with user ID and password oe/oe.
sqlplus oe/oe

|
2. |
Examine the syntax of the REGEXP_LIKE
function:
REGEXP_LIKE(srcstr, pattern [,match_option])
In this function:
srcstr: is the search value
pattern: is the regular expression
match_option: provides the option to change default matching. This can include one or more of the following values:
'c': Use case-sensitive matching (default)
'i': Use case-insensitive matching
'n': Allow match-any-character operator to match the newline character
'm': Treat source string as multiple line
To locate all products with names containing SSP/S, SSP/V, SSS/V, or SSS/S in the PRODUCT_NAME column from the PRODUCT_INFORMATION table, execute the following script:
@relike.sql
SELECT product_name
FROM oe.product_information
WHERE regexp_like (product_name, 'SS[PS]/[VS]');

|
| 3. |
The REGEXP_INSTR function
returns the position of a given pattern within a string. Examine the syntax:
REGEXP_INSTR(srcstr, pattern [, position [, occurrence
[, return_option [, match_option]]]])
In this function:
position: is the search starting position
occurrence: is the occurrence to search for
return_option: Indicates the start or end position of occurrence
match_option: provides the option to change default matching. This can include one or more of the following values:
'c': Use case-sensitive matching (default)
'i': Use case-insensitive matching
'n': Allow match-any-character operator to match the newline character
'm': Treat source string as multiple line
To search the product names to find the location of the first nonalphabetic character (regardless of whether it is uppercase or lowercase), execute the following script:
@reinstr.sql
COLUMN product_name FORMAT a45
COLUMN non_alpha FORMAT 9999999999
SELECT product_name, REGEXP_INSTR(product_name, '[^[:alpha:]]') non_alpha
FROM oe.product_information ;

Note that [^[:<class>:]] implies a character class and matches any character from within that class; [:alpha:] matches any alphabetic character. In this case, you are negating this expression by using ^.
|
| 4. |
The REGEXP_SUBSTR function
returns a given string based on a pattern of occurrence. Examine the syntax:
REGEXP_SUBSTR(srcstr, pattern [, position
[, occurrence [, match_option]]])
In this function:
position: is the search starting position
occurrence: is the occurrence to search for
match_option: provides the option to change default matching. This can include one or more of the following values:
'c': Use case-sensitive matching (default)
'i': Use case-insensitive matching
'n': Allow match-any-character operator to match the newline character
'm': Treat source string as multiple line
You want to extract the e-mail names from the CUSTOMERS
table, but extract only those e-mail names for the customers located in
Switzerland. To do this, you need to return the contents in the CUST_EMAIL
column that precedes the @ symbol for customers with an NLS_TERRITORY
of Switzerland. Execute the following script:
@resubstr.sql
SELECT REGEXP_SUBSTR(cust_email, '[^@]+')
FROM oe.customers
WHERE nls_territory = 'SWITZERLAND' ;

Note that in this example, the result returns the first substring that does not have the @ symbol.
|
| 5. |
The REGEXP_REPLACE function
returns a given string with every occurrence of the pattern replaced with
a replacement string. Examine the syntax:
REGEXP_REPLACE(srcstr, pattern [,replacestr [, position
[, occurrence [, match_option]]]])
In this function:
position: is the search starting position
occurrence: is the occurrence to search for
replacestr: is the character string replacing pattern
match_option: provides the option to change default matching. This can include one or more of the following values:
'c': Use case-sensitive matching (default)
'i': Use case-insensitive matching
'n': Allow match-any-character operator
'm': Treat source string as multiple line
To return information from the CATALOG_URL
column in the PRODUCT_INFORMATION
table, you can perform a full scan on the column. However, this would
result in hundreds of rows being returned, because it lists a specific
HTML page location within a number of catalog domains.
In this example, you want to find only the individual domain names themselves and not the lower-level pages they contain. To find the domain names without all the extraneous information, use the REGEXP_REPLACE function. Execute the following script:
@rereplace.sql
SELECT UNIQUE REGEXP_REPLACE (catalog_url, 'http://([^/]+).*', '\1')
FROM oe.product_information ;

Here is an explanation of how the string was processed:
| http:// |
The expression starts by looking for this string literal; there are no special metacharacters here. |
| ([^/]+) |
Then, the expression looks for a series of characters provided that they are "not" slash (/). |
| .* |
The expression finishes by consuming the rest of the string with this part of the expression. |
| \1 |
The matching expression is replaced with backreference 1, which is whatever was matched between the first set of parentheses. |
|
Back to Topic List
Regular expression functions support multilingual capabilities and can be used in locale-sensitive applications. To combine the use of regular expressions with Oracle's NLS Language feature, perform the following steps:
| 1. |
Execute the following script to find the product description in the Portuguese language:
@multiport.sql
SELECT regexp_substr(translated_name, '^[a-z]+')
FROM oe.product_descriptions
WHERE language_id = 'PT'
AND translated_name like 'G%' ;
Note that the data is not displayed.

The ^ is outside the bracket, which means that you are searching for any strings or substrings that start with any character from a to z. |
| 2. |
Perform the same query, but this time use the case-insensitive 'i' switched on. Execute the following script:
@multiport2.sql
SELECT regexp_substr(translated_name, '^[a-z]+', 1, 1, 'i')
FROM oe.product_descriptions
WHERE language_id = 'PT'
AND translated_name like 'G%' ;

|
| 3. |
The results are still incomplete because the returned strings are trimmed
as soon as a non-English character is encountered. This is because the
range [a-z] is sensitive to NLS_LANGUAGE.
Range expressions are sensitive to NLS_SORT,
which in this example, is derived from NLS_LANGUAGE.
You need to set the NLS_LANGUAGE
parameter appropriately to return the complete results. Execute the
following query:
@multiport3.sql
ALTER SESSION SET NLS_LANGUAGE=PORTUGUESE;
SELECT regexp_substr(translated_name, '^[a-z]+', 1, 1, 'i')
FROM oe.product_descriptions
WHERE language_id = 'PT'
AND translated_name like 'G%' ;

|
| 4. |
The final step is to view the results in both English and Portuguese to ensure that the translation has taken place. Execute the following script:
@multiport4.sql
SELECT REGEXP_SUBSTR(i.product_name, '^[a-z]+', 1, 1, 'i') || ' = '
|| regexp_substr(d.translated_name, '^[a-z]+', 1, 1, 'i')
FROM oe.product_descriptions d, oe.product_information i
WHERE d.language_id = 'PT'
AND d.translated_name like 'G%'
AND i.product_id = d.product_id ;
ALTER SESSION SET NLS_LANGUAGE=AMERICAN;

|
Back to Topic List
You can use regular expressions with check constraints. When you define the check constraint, you can add the regular expression syntax condition to check that the data complies with the constraint.
| 1. |
Adding a check constraint on the CUST_EMAIL
column of the CUSTOMERS
table ensures that only strings containing an @ symbol are accepted.
Execute the following script:
@chkemail.sql
INSERT INTO customers VALUES
(9999,'Christian','Patel',
cust_address_typ ('1003 Canyon Road','87501',
'Santa Fe','NM','US'),
phone_list_typ ('+1 505 243 4144'),'us','AMERICA','100',
'ChrisP+creme.com', 149, null, null, null, null, null) ;
Because there was no validation being performed, an e-mail address not containing an @ symbol was accepted.
After the script runs, issue the ROLLBACK
statement at the SQL> prompt before starting the next step.
ROLLBACK;
|
| 2. |
Implement the constraint by executing the following script:
@chkemail2.sql
ALTER TABLE customers
ADD CONSTRAINT cust_email_addr
CHECK(REGEXP_LIKE(cust_email,'@'))NOVALIDATE ;

|
| 3. |
Test the constraint by executing chkemail.sql
again.
@chkemail.sql

The check constraint is violated because the e-mail address does not
contain the required symbol. The NOVALIDATE
clause ensures that existing data is not checked.
|
| 4. |
Remove the constraint by executing the following script:
@chkemail3.sql
ALTER TABLE customers DROP CONSTRAINT cust_email_addr ;

|
Back to Topic List
In this tutorial, you learned how to:
 |
Perform a Variety of Searches to Access Data
by Using Regular Expressions |
 |
Access Data in Other Languages by Using Regular
Expressions |
Back to Topic List
Place
the cursor over this icon to hide all screenshots.
|