DBA: SQL
Develop and Test Regular Expressions with a Unit Testing Package
by Casimir Saternos
Published January 2006
Learn basic techniques for constructing useful regular expressions in SQL.
Downloads for this article:
Oracle Database 10g includes a variety of new features that provide new and better ways of solving old problems. Regular expressions have been in common use among UNIX users for a long time but may not be familiar to programmers and users from other backgrounds.
In this article, I will explain some basic techniques for constructing useful regular expressions and describe a PL/SQL package you can use to assist in the development and testing of regular expressions. The package utilizes the new regular expression functions and operators now available through Oracle SQL and provides a concise way of testing individual strings (and groups of strings) that appear in a column of a given table.
Regular Expressions: What Are They?
Long, complex, and powerful regular expressions can be created by experienced users, but usable expressions can be constructed by anyone who takes the time to learn the significance of their basic syntax and is willing to experiment a bit. They are a powerful tool for searching and processing text data in a concise manner. A regular expression is simply a text string that is used to describe a search pattern. The text string represents a set of possible strings by using character literals and metacharacters. Several languages and utilities and languages—including grep and other UNIX utilities, Perl, .NET, and Java—recognize regular expressions. Numerous third-party libraries allow the use of regular expressions with Java, and as of JDK 1.4, the java.util.regex package is included as part of the base Java language (which allows the use of regular expressions without having to include third party resource.)
Throughout this article, the term source string designates a string that is being tested.
Regular Expressions in Oracle Database 10g
Oracle Database 10g introduced the application of regular expressions within SQL statements. The functionality available through traditional SQL conditions (the
LIKE keyword) and functions (
INSTR,
SUBSTR, and
REPLACE) is enhanced through corresponding operators and conditions that have the prefix
REGEXP_. A query that formerly required several AND and OR conditions can now be expressed in a much more compact and succinct manner with a regular expression.
Regular expressions are powerful, but they can be rather difficult to understand and maintain, especially when applied to problems that are better handled by other mechanisms. When deciding whether to use regular expressions (or any other new process or procedure, for that matter), it is always a good practice to step back and ask if there is another technology that can solve the problem in a more elegant, simple, or concise manner. After all, the SQL language already contains pattern-matching mechanisms. Consideration should be given not only to solving the problem at hand but also to supporting the code in the future. And although programmers often live by the maxim "Make it work; then make it work fast," you need to take into account how the use of regular expressions will affect queries' use of indexes.
When regular expressions are deemed an appropriate solution for a problem, you must face the challenge of developing appropriate expressions. You consider what you are trying to match. You may also be interested in where in a string a match occurs in a source string. And if you are planning to do a search-and-replace operation on the matched pattern, you will want to know what the resulting string looks like.
Testing Regular Expressions
Some general guidelines should be kept in mind for the use and testing of regular expressions. Be aware of the context in which the regular expression is being used. For example, if a regular expression is being used to generate a script that a programmer will closely review before running it, less time can be spent on validation. If it will be used for long-term data validation as a constraint for a table that serves as an inbound interface to a system, however, more time and effort should be spent constructing and testing the regular expression. Ideally, a test suite should be used to test the regular expressions to ensure that inappropriate strings are not matched and that intended matches are not missed.
This is where the Oracle Database 10g implementation of regular expressions within the SQL language really shines. Oracle regular expression functions can be used not only to match patterns but also to determine what was matched in a source string and where the pattern was matched. In addition, you can organize and store data in the database and construct tests that evaluate sets of source strings within a column of a given table. You can use SQL to analyze the data (by using aggregate and analytic functions) to ascertain and improve the accuracy of regular expressions. You can easily identify certain defects in a given regular expression by using SQL functionality, employing
COUNT and
GROUP BY to quickly summarize the way the expression matched a given group of source strings. There is no corresponding mechanism outside of the SQL language that allows comparable analysis.
Process for Creating Regular Expressions
If you are new to regular expressions, you might benefit from a brief overview of the thought process and iterative procedure that goes into constructing a regular expression. You can follow a general outline when constructing regular expressions for use in SQL statements:
1. Identify the general requirements for the expression.
- Should the expression anchor to the beginning or the end of the line? If so, use ^ to anchor the expression to the beginning of the line or $ to anchor it to the end.
- Are there any repeated characters in the pattern you are trying to match? If so, use *, which matches zero or more times; ?, which matches zero or one time; or +, which matches one or more times. If you want to match a specified number of times, enclose the number in brackets ({}) as a single number or as a range represented by two numbers separated by comma. If you put a single number followed by a comma, it will match at least the number of times specified.
- Does your expression include characters of a defined character class? Printable characters are represented by
[:print:]. Similar designations can be used to represent alphanumeric characters ([:alnum:]), alphabetic-only characters ([:alpha:]), or alphabetic characters of only a particular case ([:lower:][:upper:]). The set of numeric characters can be specified by[:digit:]. There are also classes to represent the set of nonprinting space characters,[:space:]; punctuation,[:punct:]; and control characters,[:cntrl:]. Besides your having these options within the regular expression itself, all of the Oracle regular expression functions and operators can optionally be passed amatch_parameter. The parameter can be set to one or more of the following: 'i', 'c', 'n', and 'm'. These allow you to control case sensitivity ('i' for insensitive, 'c' for case-sensitive) and the treatment of newline characters ('n' and 'm'). By default, case sensitivity depends on the setting of the NLS_SORT parameter, a dot (.) does not match a newline character, and the source string is treated as a single line. - Does your expression require a specified class of characters? The [] characters can be used to contain a set of characters (such as the letters between A and M in the alphabet).
- Does your expression require grouping or alternate matches? The | is used to match either of two options. Often, this operator is used with a group of characters within a regular expression. For instance, to match either an
aor abas part of an expression, an expression would contain (a|b). The grouping can also be used with quantifiers or for back-referencing.
2. Construct a regular expression on individual source strings.
When you're first working out a regular expression, all of the Oracle 10g
REGEXP functions will prove useful for different purposes.
REGEXP_INSTR is useful to determine where a match is made. This function returns 0 if there is no match. If there is a match, it returns a number greater than 0 to represent the position of the match.
REGEXP_SUBSTR can be used to display what has been matched. This functionality can be used to identify strings that meet the expressions criteria but are not intended to be included in the matched set.
REGEXP_REPLACE shows what the original string looks like if a string is substituted for the matched expression. By including the empty string, you can see what the original string looks like with the matched expression removed.
SQL> select regexp_instr('This first String is the best', '[f]?[i|b][r|e]st') test from dual;
TEST
----------
6
SQL> select regexp_substr('This first String is the best', '[f]?[i|b][r|e]st') test from dual;
TEST
-----
first
SQL> select regexp_replace('This first String is the best', '[f]?[i|b][r|e]st','one') test from dual;
TEST
--------------------------
This one String is the one
SQL> select xx_regexp.test('This first String is the best', '[f]?[i|b][r|e]st','one') test from dual;
TEST
--------------------------------------------------------------------------------
Where: 6 | What: first | Replaced: This one String is the one
NOT REGEXP_LIKE in the
WHERE clause and do some additional analysis and verification.
3. Test the expression on a large data set (or, if possible, the entire data set) of source strings.
The downloads for this article include a PL/SQL package that can be used for general testing of regular expressions. To install the regular expression test package, perform the following steps:
- Unzip the regexp_plsql.zip file to a directory.
- Navigate to the directory, and open SQL*Plus. For example:
cd C:\regexp_plsql - Log in as the user/schema where you would like to install and run the package (the SCOTT/TIGER account is used in this article).For example: C:\regexp_plsql>sqlplus scott/tiger@orcl
- Run the createAll.sql script. You can ignore the errors related to the dropping of objects that do not exist the first time the script is run.
SQL> @create_all
| Object Name | Object Type | Description |
|---|---|---|
| XX_REGEXP_SUMMARY | Table | Table that holds summary information based on a test of a set of regular expressions. |
| XX_REGEXP_RESULTS | Table | Table that holds detail information based on a test of a set of regular expressions. |
| XX_ADDR | Table | Table that holds sample address data that is used to demonstrate the package in the remainder of the article. |
| S_RESULTSID | Sequence | Used for an ID in the XX_REGEXP_RESULTS table. |
| S_SUMMARYID | Sequence | Used for an ID in the XX_REGEXP_SUMMARY table. |
| XX_REGEXP | Package (specification and body) | Package that contains one function used to test individual source strings and one procedure used to test a set of source strings. Two additional procedures are used internally. |
The
XX_REGEXP package contains a function named
TEST and a procedure named
TEST. The function is used to test individual source strings; the procedure is for testing a group of source strings. In using this package, you explicitly reference all parameters accepted by the Oracle regular expression functions and conditions.
The sample data consists of a list of address strings. The table that holds this data is named XXADDR and contains only one column, ADDRESS_LINE1, of type VARCHAR2(300). Our test scenario involves a requirement to standardize street abbreviations to comply with the U.S. Postal Service standard (described in Appendix E of Postal Addressing Standards). A sample of the standard is shown below:
| Designation | Postal Service Standard | Others Commonly Used |
|---|---|---|
| ALLEY | ALY | ALLEE, ALLY |
| ANNEX | ANX | ANEX, ANNX |
| AVENUE | AVE | AVEN, AVENU, AVN, AVNUE |
| CENTER | CTR | CEN, CENT, CENTR, CENTRE, CNTER |
| CIRCLE | CIR | CIRC, CIRCL, CRCLE |
| HIGHWAY | HWY | HIGHWY, HIWAY, HIWY, HWAY |
| STREET | ST | STRT, STR |
Let's walk through an example of how we might use the package to construct a regular expression to use to match strings representing a street that will be replaced with the postal service standard, ST.
To begin with, notice that the group of patterns you are attempting to match consists of ST, STR, STRT, STREET. All these strings begin with the letters ST. In two of them, the ST is followed by an R. Three of them end with a T. Taking note of these types of features can provide some ideas on how to construct an expression. To simplify the expression, specify a case-insensitive match, using the 'i' match parameter.
As your first attempt, construct the expression
'*ST(R)?(E)*t*'. It will match all the variations listed in the chart. The test function takes a source string, a regular expression, a replacement string, and a match parameter as parameters. It returns a string that identifies where a match occurred in an expression (or 0 for no match), what string was matched, and what the string looks like if the replacement string is substituted for the portion of the string matched by the regular expression.
SQL> select xx_regexp.test('123 Main Street Apt B','*ST(R)?(E)*t*','ST','i')t from dual;
T
--------------------------------------------------------------------------------
Where: 10 | What: Street | Replaced: 123 Main ST Apt B
SQL> select xx_regexp.test('123 Main Strt Apt B','*ST(R)?(E)*t*','ST','i')t from dual;
T
--------------------------------------------------------------------------------
Where: 10 | What: Strt | Replaced: 123 Main ST Apt B
SQL> select xx_regexp.test('123 Main Str Apt B','*ST(R)?(E)*t*','ST','i')t from dual;
T
--------------------------------------------------------------------------------
Where: 10 | What: Str | Replaced: 123 Main ST Apt B
SQL> select xx_regexp.test('123 Main St Apt B','*ST(R)?(E)*t*','ST','i')t from dual;
T
--------------------------------------------------------------------------------
Where: 10 | What: St | Replaced: 123 Main ST Apt B
select xx_regexp.test('123 Main Street Apt B', ' ST(R)?(T)? | STREET ',' ST ‘,'i') test from dual;
exec xx_regexp.test('xx_addr', 'address_line1', '*St(r)?(e)*t*','ST','i');
SQL> set pages 100
SQL> column title format a11
SQL> column descript format a40
SQL> SELECT * FROM XX_REGEXP_SUMMARY;
SEQ DESCRIPT TITLE CNT
---------- ---------------------------------------- ----------- ----------
0 Total 26
1 Matched 9
2 Not Matched 17
3 Position in String/Number of Occurrences: 0 17
4 Position in String/Number of Occurrences: 6 2
5 Position in String/Number of Occurrences: 7 2
6 Position in String/Number of Occurrences: 8 4
7 Position in String/Number of Occurrences: 10 1
8 Match and Number of Occurrences Str 1
9 Match and Number of Occurrences Street 1
10 Match and Number of Occurrences st 7
11 Match and Number of Occurrences 17
12 rows selected.
SQL> set pages 100
SQL> column before_str format a25
SQL> column after_str format a20
SQL> column matched_str format a10
SQL> SELECT * FROM XX_REGEXP_RESULTS;
SEQ BEFORE_STR AFTER_STR MATCHED_ST POS
---------- ------------------------- -------------------- ---------- ----------
0 11 Firestone Anx. 11 FireSTone Anx. st 8
1 6521 Instown Annx 6521 InSTown Annx st 8
2 2114 Opst avenue 2114 OpST avenue st 8
3 2111 Rastally CENTR 2111 RaSTally CENTR st 8
4 121 21st Street 121 21ST Street st 7
5 49 East 31st St 49 EaST 31st St st 6
6 1010 Strawn Ctr 1010 STrawn Ctr Str 6
7 4 Center Street Apt B3 4 Center ST Apt B3 Street 10
8 400 East 21st st. 400 EaST 21st st. st 7
9 rows selected.
The following string returns only the desired rows. See if you can come up with a more comrehensive and eloquent solution.
SQL> exec xx_regexp.test('xx_addr', 'address_line1', ' STREET| ST$| ST( |\.)|STR
(T)?(^A)',' ST','i');
PL/SQL procedure successfully completed.
SQL> SELECT * FROM XX_REGEXP_RESULTS;
SEQ BEFORE_STR AFTER_STR MATCHED_ST POS
---------- ------------------------- -------------------- ---------- ----------
245 121 21st Street 121 21st ST Street 9
246 49 East 31st St 49 East 31st ST St 13
247 4 Center Street Apt B3 4 Center ST Apt B3 Street 9
248 400 East 21st st. 400 East 21st ST st. 14
Summary
Oracle SQL gives you many powerful ways to perform pattern matching. In the past, it was sometimes necessary to use complicated
WHERE clauses, a series of nested
FUNCTIONs, or nested subqueries to perform sophisticated pattern matching. Oracle Database 10g gives you the power to use regular expressions to replace such unwieldy SQL with simple, elegant expressions to perform such tasks. It also provides you with the ability to address pattern matching problems that were impractical to tackle with traditional SQL language features. Keep this feature of Oracle Database 10g in mind when confronted with requirements involving string manipulation and matching.