Inside Oracle Database 10g

Writing Better SQL Using Regular Expressions (Continued)

By Alice Rischert

Back to Part 1

Backreferences

A useful feature of regular expressions is the ability to store subexpressions for reuse later; this is also called backreferencing (summarized in Table 10). It allows sophisticated replace capabilities such as swapping patterns in new positions or indicating repeated word or letter occurrences. The matched part of the subexpression is stored in a temporary buffer. The buffer is numbered from left to right and accessed with the \digit notation, where digit is a number between 1 and 9 and matches the digit-th subexpression, as indicated by a set of parentheses.

The next example shows the name Ellen Hildi Smith transformed to Smith, Ellen Hildi, by referring to the individual subexpressions by number.

SELECT REGEXP_REPLACE(
       'Ellen Hildi Smith',
       '(.*) (.*) (.*)', '\3, \1 \2')
  FROM dual
REGEXP_REPLACE('EL
------------------
Smith, Ellen Hildi

The SQL statement shows three individual subexpressions enclosed by parentheses. Each individual subexpression consists of a match any metacharacter (.) followed by the * metacharacter, indicating that any character (except newline) must be matched zero or more times. A space separates each subexpression and must be matched as well. The parentheses create subexpressions that capture the values and can be referenced with \digit. The first subexpression is assigned \1, the second \2, and so on. These backreferences are used in the last parameter of this function (\3, \1 \2), which effectively returns the replacement substrings and places them in the desired format (including comma and spaces). Table 11 details the individual components of this regular expression.

Backreferences are useful for replacing, formatting, and substituting values, and you can apply them to find adjacent occurrences of values. The next example shows use of the REGEP_SUBSTR function to find any duplicate occurrences of alphanumeric values separated by a space. The displayed result shows the substring that identifies the duplicated words is.

SELECT REGEXP_SUBSTR(
       'The final test is is the implementation',
       '([[:alnum:]]+)([[:space:]]+)\1') AS substr
  FROM dual
SUBSTR
------
is is

The Match Parameter Option

You may have noticed that the regular expression operator and functions contain an optional match parameter. This parameter controls case-sensitivity, matching of the newline character, and retaining multiline inputs.

Practical Applications for Regular Expressions

You can use regular expressions not only in queries but also anywhere you can use a SQL operator or function, such as in the PL/SQL language. You can write triggers that take advantage of regular expression functionality in order to validate, generate, or extract values.

The next example illustrates how you can apply the REGEXP_LIKE operator in a column check constraint for data validation. It checks for the correct Social Security number format upon insert or update. Social Security numbers in such formats as 123-45-6789 and 123456789 are acceptable values for this column constraint. Valid data must begin with three digits, followed by a hyphen, two more digits and a hyphen, and lastly another four digits. The alternate expression allows only nine consecutive digits. The vertical bar symbol (|) separates the individual choices.

ALTER TABLE students
  ADD CONSTRAINT stud_ssn_ck CHECK
  (REGEXP_LIKE(ssn,
  '^([[:digit:]]{3}-[[:digit:]]{2}-[[:digit:]]{4}|[[:digit:]]{9})$'))

Leading or trailing characters are not acceptable, as indicated by ^ and $. Make sure your regular expression does not split across multiple lines or contain any extraneous spaces unless you want them to be part of the pattern and matched accordingly. Table 12 explains the individual components of this regular expression example.
Next Steps

Visit the Oracle Database 10g page:
/products/database/oracle10g/index.html

Comparing Regular Expressions to Existing Functionality

Regular expressions have several advantages over the familiar LIKE operator and INSTR, SUBSTR, and REPLACE functions. These traditional SQL functions have no facility for matching patterns. Only the LIKE operator performs matching of characters, through the use of the % and _ wildcards, but LIKE does not support repetitions of expressions, complex alternations, ranges of characters, characters lists, POSIX character classes, and so on. Furthermore, the new regular expression functions allow detection of duplicate word occurrences and swapping of patterns. The examples here offer you a glimpse into the world of regular expressions and on how you can apply them in your applications.

A Solid Addition to Your Toolkit

Because they help solve complex problems, regular expressions are very powerful. Some of the functionality of regular expressions is difficult to duplicate by using traditional SQL functions. When you've learned the basic building blocks of this somewhat cryptic language, regular expressions will become an indispensable part of your toolkit in the context of not only SQL but also other programming languages. Although trial and error are sometimes necessary to get your individual pattern right, the elegance and power of regular expressions is indisputable.

Alice Rischert (ar280@yahoo.com) chairs the Database Application Development and Design track at Columbia University's Computer Technology and Application Program. She is the author of the Oracle SQL Interactive Workbook 2nd edition (Prentice Hall, 2002) and the forthcoming Oracle SQL by Example (Prentice Hall, 2003). Rischert has over 15 years of experience as a database architect, DBA, and project leader for Fortune 100 companies and she has worked with Oracle since version 5.

Table 1: Anchoring Metacharacters
MetacharacterDescription
^Anchor the expression to the start of a line
$Anchor the expression to the end of a line

Table 2: Quantifiers, or Repetition Operators
QuantifierDescription
*Match 0 or more times
?Match 0 or 1 time
+Match 1 or more times
{m}Match exactly m times
{m,}Match at least m times
{m, n}Match at least m times but no more than n times

Table 3: Predefined POSIX Character Classes
Character ClassDescription
[:alpha:]Alphabetic characters
[:lower:]Lowercase alphabetic characters
[:upper:]Uppercase alphabetic characters
[:digit:]Numeric digits
[:alnum:]Alphanumeric characters
[:space:]Space characters (nonprinting), such as carriage return, newline, vertical tab, and form feed
[:punct:]Punctuation characters
[:cntrl:]Control characters (nonprinting)
[:print:]Printable characters

Table 4: Alternate Matching and Grouping of Expressions
MetacharacterDescription
|AlternationSeparates alternates, often used with grouping operator ()
( )GroupGroups subexpression into a unit for alternations, for quantifiers, or for backreferencing (see "Backreferences" section)
[char]Character listIndicates a character list; most metacharacters inside a character list are understood as literals, with the exception of character classes, and the ^ and - metacharacters

Table 5: The REGEXP_LIKE Operator
SyntaxDescription
REGEXP_LIKE(source_string, pattern
[, match_parameter])
source_string supports character datatypes (CHAR, VARCHAR2, CLOB, NCHAR, NVARCHAR2, and NCLOB but not LONG). The pattern parameter is another name for the regular expression. match_parameter allows optional parameters such as handling the newline character, retaining multiline formatting, and providing control over case-sensitivity.

Table 6: The REGEXP_INSTR Function
SyntaxDescription
REGEXP_INSTR(source_string, pattern
[, start_position
[, occurrence
[, return_option
[, match_parameter]]]])
This function looks for a pattern and returns the first position of the pattern. Optionally, you can indicate the start_position you want to begin the search. The occurrence parameter defaults to 1 unless you indicate that you are looking for a subsequent occurrence. The default value of the return_option is 0, which returns the starting position of the pattern; a value of 1 returns the starting position of the next character following the match.

Table 7: Explanation of 5-digit + 4 Zip-Code Expression
SyntaxDescription
 Empty space that must be matched
[:digit:] POSIX numeric digit class
] End of character list
{5} Repeat exactly five occurrences of the character list
( Start of subexpression
- A literal hyphen, because it is not a range metacharacter inside a character list
[ Start of character list
[:digit:] POSIX [:digit:] class
[ Start of character list
] End of character list
{4} Repeat exactly four occurrences of the character list
) Closing parenthesis, to end the subexpression
? The ? quantifier matches the grouped subexpression 0 or 1 time thus making the 4-digit code optional
$ Anchoring metacharacter, to indicate the end of the line

Table 8: The REGEXP_SUBSTR Function
SyntaxDescription
REGEXP_SUBSTR(source_string, pattern
[, position [, occurrence
[, match_parameter]]])
The REGEXP_SUBSTR function returns the substring that matches the pattern.

Table 9: The REGEXP_REPLACE Function
SyntaxDescription
REGEXP_REPLACE(source_string, pattern
[, replace_string [, position
[,occurrence, [match_parameter]]]])
This function replaces the matching pattern with a specified replace_string, allowing complex search-and-replace operations.

Table 10: Backreference Metacharacter
MetacharacterDescription
\digitBackslashFollowed by a digit between 1 and 9, the backslash matches the preceding digit-th parenthesized subexpression.
(Note: The backslash has another meaning in regular expressions; depending on the context it can also mean the Escape character

Table 11: Explanation of Pattern-Swap Regular Expression
Regular-Expression ItemDescription
(Start of first subexpression
. Match any single character except a newline
* Repetition operator, matches previous . metacharacter 0 to n times
) End of first subexpression; result of the match is captured in \1
(In this example, it's Ellen.)
  Empty space that needs to be present
( Start of the second subexpression
. Match any single character except a newline
* Repetition operator matches the previous . metacharacter 0 to n times
) End of second subexpression; result of this match is captured in \2
(In this example, it stores Hildi.)
  Empty space
( Start of third subexpression
. Match any single character except a newline
* Repetition operator matches . metacharacter 0 to n times
) End of third subexpression; result of this match is captured in \3
(In this example, it holds Smith.)

Table 12: Explanation of the Social Security Number Regular Expression
Regular-Expression ItemDescription
^ Start of line character (Regular expression cannot have any leading characters before the match.)
( Start subexpression and list alternates separated by the | metacharacter
[ Start of character list
[:digit:] POSIX numeric digit class
] End of character list
{3} Repeat exactly three occurrences of character list
-A hyphen
[ Start of character list
[:digit:] POSIX numeric digit class
] End of character list
{2} Repeat exactly two occurrences of character list
- Another hyphen
[ Start of character list
[:digit:] POSIX numeric digit class
] End of character list
{4} Repeat exactly four occurrences of character list
| Alternation metacharacter; ends the first choice and starts the next alternate expression
[ Start of character list
[:digit:] POSIX numeric digit class.
] End of character list
{9} Repeat exactly nine occurrences of character list
) Ending parenthesis, to close the subexpression group used for alternation
$ Anchoring metacharacter, to indicate the end of the line; no extra characters can follow the pattern

E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy