DBA: SQL

Develop and Test Regular Expressions with a Unit Testing Package

by Casimir Saternos

Published January 2006

Learn basic techniques for constructing useful regular expressions in SQL.

Downloads for this article:

Oracle Database 10g includes a variety of new features that provide new and better ways of solving old problems. Regular expressions have been in common use among UNIX users for a long time but may not be familiar to programmers and users from other backgrounds.

In this article, I will explain some basic techniques for constructing useful regular expressions and describe a PL/SQL package you can use to assist in the development and testing of regular expressions. The package utilizes the new regular expression functions and operators now available through Oracle SQL and provides a concise way of testing individual strings (and groups of strings) that appear in a column of a given table.

Regular Expressions: What Are They?

Long, complex, and powerful regular expressions can be created by experienced users, but usable expressions can be constructed by anyone who takes the time to learn the significance of their basic syntax and is willing to experiment a bit. They are a powerful tool for searching and processing text data in a concise manner. A regular expression is simply a text string that is used to describe a search pattern. The text string represents a set of possible strings by using character literals and metacharacters. Several languages and utilities and languages—including grep and other UNIX utilities, Perl, .NET, and Java—recognize regular expressions. Numerous third-party libraries allow the use of regular expressions with Java, and as of JDK 1.4, the java.util.regex package is included as part of the base Java language (which allows the use of regular expressions without having to include third party resource.)

Throughout this article, the term source string designates a string that is being tested.

Regular Expressions in Oracle Database 10g

Oracle Database 10g introduced the application of regular expressions within SQL statements. The functionality available through traditional SQL conditions (the LIKE keyword) and functions ( INSTR, SUBSTR, and REPLACE) is enhanced through corresponding operators and conditions that have the prefix REGEXP_. A query that formerly required several AND and OR conditions can now be expressed in a much more compact and succinct manner with a regular expression.

Regular expressions are powerful, but they can be rather difficult to understand and maintain, especially when applied to problems that are better handled by other mechanisms. When deciding whether to use regular expressions (or any other new process or procedure, for that matter), it is always a good practice to step back and ask if there is another technology that can solve the problem in a more elegant, simple, or concise manner. After all, the SQL language already contains pattern-matching mechanisms. Consideration should be given not only to solving the problem at hand but also to supporting the code in the future. And although programmers often live by the maxim "Make it work; then make it work fast," you need to take into account how the use of regular expressions will affect queries' use of indexes.

When regular expressions are deemed an appropriate solution for a problem, you must face the challenge of developing appropriate expressions. You consider what you are trying to match. You may also be interested in where in a string a match occurs in a source string. And if you are planning to do a search-and-replace operation on the matched pattern, you will want to know what the resulting string looks like.

Testing Regular Expressions

Some general guidelines should be kept in mind for the use and testing of regular expressions. Be aware of the context in which the regular expression is being used. For example, if a regular expression is being used to generate a script that a programmer will closely review before running it, less time can be spent on validation. If it will be used for long-term data validation as a constraint for a table that serves as an inbound interface to a system, however, more time and effort should be spent constructing and testing the regular expression. Ideally, a test suite should be used to test the regular expressions to ensure that inappropriate strings are not matched and that intended matches are not missed.

This is where the Oracle Database 10g implementation of regular expressions within the SQL language really shines. Oracle regular expression functions can be used not only to match patterns but also to determine what was matched in a source string and where the pattern was matched. In addition, you can organize and store data in the database and construct tests that evaluate sets of source strings within a column of a given table. You can use SQL to analyze the data (by using aggregate and analytic functions) to ascertain and improve the accuracy of regular expressions. You can easily identify certain defects in a given regular expression by using SQL functionality, employing COUNT and GROUP BY to quickly summarize the way the expression matched a given group of source strings. There is no corresponding mechanism outside of the SQL language that allows comparable analysis.

Process for Creating Regular Expressions

If you are new to regular expressions, you might benefit from a brief overview of the thought process and iterative procedure that goes into constructing a regular expression. You can follow a general outline when constructing regular expressions for use in SQL statements:

1. Identify the general requirements for the expression.

Should the expression anchor to the beginning or the end of the line? If so, use ^ to anchor the expression to the beginning of the line or $ to anchor it to the end.
Are there any repeated characters in the pattern you are trying to match? If so, use *, which matches zero or more times; ?, which matches zero or one time; or +, which matches one or more times. If you want to match a specified number of times, enclose the number in brackets ({}) as a single number or as a range represented by two numbers separated by comma. If you put a single number followed by a comma, it will match at least the number of times specified.
Does your expression include characters of a defined character class? Printable characters are represented by [:print:] . Similar designations can be used to represent alphanumeric characters ( [:alnum:] ), alphabetic-only characters ( [:alpha:] ), or alphabetic characters of only a particular case ( [:lower:][:upper:] ). The set of numeric characters can be specified by [:digit:]. There are also classes to represent the set of nonprinting space characters, [:space:] ; punctuation, [:punct:]; and control characters, [:cntrl:]. Besides your having these options within the regular expression itself, all of the Oracle regular expression functions and operators can optionally be passed a match_parameter . The parameter can be set to one or more of the following: 'i', 'c', 'n', and 'm'. These allow you to control case sensitivity ('i' for insensitive, 'c' for case-sensitive) and the treatment of newline characters ('n' and 'm'). By default, case sensitivity depends on the setting of the NLS_SORT parameter, a dot (.) does not match a newline character, and the source string is treated as a single line.
Does your expression require a specified class of characters? The [] characters can be used to contain a set of characters (such as the letters between A and M in the alphabet).
Does your expression require grouping or alternate matches? The | is used to match either of two options. Often, this operator is used with a group of characters within a regular expression. For instance, to match either an a or a b as part of an expression, an expression would contain ( a|b ). The grouping can also be used with quantifiers or for back-referencing.

The Oracle SQL documentation contains a complete reference on characters and their functions within regular expressions.

2. Construct a regular expression on individual source strings.

When you're first working out a regular expression, all of the Oracle 10g REGEXP functions will prove useful for different purposes. REGEXP_INSTR is useful to determine where a match is made. This function returns 0 if there is no match. If there is a match, it returns a number greater than 0 to represent the position of the match. REGEXP_SUBSTR can be used to display what has been matched. This functionality can be used to identify strings that meet the expressions criteria but are not intended to be included in the matched set. REGEXP_REPLACE shows what the original string looks like if a string is substituted for the matched expression. By including the empty string, you can see what the original string looks like with the matched expression removed.

Copy



SQL> select regexp_instr('This first String is the best', '[f]?[i|b][r|e]st') test from dual;

      TEST
----------
         6

SQL> select regexp_substr('This first String is the best', '[f]?[i|b][r|e]st') test from dual;

TEST
-----
first

Note that in this case, only the first matched substring is returned. This becomes evident after our next test:

Copy



SQL> select regexp_replace('This first String is the best', '[f]?[i|b][r|e]st','one') test  from dual;

TEST
--------------------------
This one String is the one

Because information from all of these three functions is useful in testing expressions, they are all used in this function in the package included as a download for this article:

Copy



SQL> select xx_regexp.test('This first String is the best', '[f]?[i|b][r|e]st','one') test from dual;

TEST
--------------------------------------------------------------------------------

Where: 6 | What: first | Replaced: This one String is the one

Keep in mind that these tests cannot indicate strings you intended to match that were not matched. You don't know what you don't know. However, you can use NOT REGEXP_LIKE in the WHERE clause and do some additional analysis and verification.

3. Test the expression on a large data set (or, if possible, the entire data set) of source strings.

The downloads for this article include a PL/SQL package that can be used for general testing of regular expressions. To install the regular expression test package, perform the following steps:

Unzip the regexp_plsql.zip file to a directory.

Navigate to the directory, and open SQL*Plus. For example:

Copy



cd C:\regexp_plsql

Log in as the user/schema where you would like to install and run the package (the SCOTT/TIGER account is used in this article).For example: C:\regexp_plsql>sqlplus scott/tiger@orcl
Run the createAll.sql script. You can ignore the errors related to the dropping of objects that do not exist the first time the script is run.
Copy
```
SQL> @create_all
                                
```

This script creates the tables, sequences, and packages used to test regular expressions. It also creates a table and populates it with sample data. A listing of objects and their purpose follows:

Object Name	Object Type	Description
XX_REGEXP_SUMMARY	Table	Table that holds summary information based on a test of a set of regular expressions.
XX_REGEXP_RESULTS	Table	Table that holds detail information based on a test of a set of regular expressions.
XX_ADDR	Table	Table that holds sample address data that is used to demonstrate the package in the remainder of the article.
S_RESULTSID	Sequence	Used for an ID in the XX_REGEXP_RESULTS table.
S_SUMMARYID	Sequence	Used for an ID in the XX_REGEXP_SUMMARY table.
XX_REGEXP	Package (specification and body)	Package that contains one function used to test individual source strings and one procedure used to test a set of source strings. Two additional procedures are used internally.

The XX_REGEXP package contains a function named TEST and a procedure named TEST. The function is used to test individual source strings; the procedure is for testing a group of source strings. In using this package, you explicitly reference all parameters accepted by the Oracle regular expression functions and conditions.

The sample data consists of a list of address strings. The table that holds this data is named XXADDR and contains only one column, ADDRESS_LINE1, of type VARCHAR2(300). Our test scenario involves a requirement to standardize street abbreviations to comply with the U.S. Postal Service standard (described in Appendix E of Postal Addressing Standards). A sample of the standard is shown below:

Designation	Postal Service Standard	Others Commonly Used
ALLEY	ALY	ALLEE, ALLY
ANNEX	ANX	ANEX, ANNX
AVENUE	AVE	AVEN, AVENU, AVN, AVNUE
CENTER	CTR	CEN, CENT, CENTR, CENTRE, CNTER
CIRCLE	CIR	CIRC, CIRCL, CRCLE
HIGHWAY	HWY	HIGHWY, HIWAY, HIWY, HWAY
STREET	ST	STRT, STR

Let's walk through an example of how we might use the package to construct a regular expression to use to match strings representing a street that will be replaced with the postal service standard, ST.

To begin with, notice that the group of patterns you are attempting to match consists of ST, STR, STRT, STREET. All these strings begin with the letters ST. In two of them, the ST is followed by an R. Three of them end with a T. Taking note of these types of features can provide some ideas on how to construct an expression. To simplify the expression, specify a case-insensitive match, using the 'i' match parameter.

As your first attempt, construct the expression '*ST(R)?(E)*t*'. It will match all the variations listed in the chart. The test function takes a source string, a regular expression, a replacement string, and a match parameter as parameters. It returns a string that identifies where a match occurred in an expression (or 0 for no match), what string was matched, and what the string looks like if the replacement string is substituted for the portion of the string matched by the regular expression.

Copy



SQL> select xx_regexp.test('123 Main Street Apt B','*ST(R)?(E)*t*','ST','i')t from dual;

T
--------------------------------------------------------------------------------

Where: 10 | What: Street | Replaced: 123 Main ST Apt B

SQL> select xx_regexp.test('123 Main Strt Apt B','*ST(R)?(E)*t*','ST','i')t from dual;

T
--------------------------------------------------------------------------------

Where: 10 | What: Strt | Replaced: 123 Main ST Apt B

SQL> select xx_regexp.test('123 Main Str Apt B','*ST(R)?(E)*t*','ST','i')t from dual;

T
--------------------------------------------------------------------------------

Where: 10 | What: Str | Replaced: 123 Main ST Apt B

SQL> select xx_regexp.test('123 Main St Apt B','*ST(R)?(E)*t*','ST','i')t from dual;

T
--------------------------------------------------------------------------------

Where: 10 | What: St | Replaced: 123 Main ST Apt B

Depending on where it is used, this might be sufficient in some cases. However many additional matches are possible. A more specific attempt might be the following:

Copy



select xx_regexp.test('123 Main Street Apt B', ' ST(R)?(T)? | STREET ',' ST ‘,'i') test from dual;

Now you can test the expression on a set of source strings rather than on a single string. The test procedure takes a table name and column name to be tested, a regular expression, a replacement string, and a match parameter as parameters.

Copy



exec xx_regexp.test('xx_addr', 'address_line1', '*St(r)?(e)*t*','ST','i');

Executing this procedure populates the XX_REGEXP_SUMMARY and XX_REGEXP_RESULTS tables. To view a summary of the test, select the results from the XX_REGEXP_SUMMARY table.

Copy



SQL> set pages 100
SQL> column title format a11
SQL> column descript format a40
SQL> SELECT * FROM XX_REGEXP_SUMMARY;

       SEQ DESCRIPT                                  TITLE              CNT
---------- ----------------------------------------  ----------- ----------
         0                                           Total               26
         1                                           Matched              9
         2                                           Not Matched         17
         3 Position in String/Number of Occurrences: 0                   17
         4 Position in String/Number of Occurrences: 6                    2
         5 Position in String/Number of Occurrences: 7                    2
         6 Position in String/Number of Occurrences: 8                    4
         7 Position in String/Number of Occurrences: 10                   1
         8 Match and Number of Occurrences               Str                  1
         9 Match and Number of Occurrences               Street               1
        10 Match and Number of Occurrences               st                   7
        11 Match and Number of Occurrences               17

12 rows selected.

From this summary, you know that matches occurred in 9 of the 26 test strings. The position of the match is not interesting in this case. If you had seen a 1, that would have indicated a match at the beginning of a string, where you would not normally expect a match representing a street. You can see that the source string "st" was matched seven times, whereas "str" and "Street" were each matched once. To view detailed information concerning the matches, query the XX_REGEXP_RESULTS table.

Copy



SQL> set pages 100

SQL> column before_str format a25
SQL> column after_str format a20
SQL> column matched_str format a10
SQL> SELECT * FROM XX_REGEXP_RESULTS;

       SEQ BEFORE_STR                AFTER_STR            MATCHED_ST        POS
---------- ------------------------- -------------------- ---------- ----------
         0 11 Firestone Anx.         11 FireSTone Anx.    st                  8
         1 6521 Instown Annx         6521 InSTown Annx    st                  8
         2 2114 Opst avenue           2114 OpST avenue    st                  8
         3 2111 Rastally CENTR       2111 RaSTally CENTR  st                  8
         4 121 21st Street           121 21ST Street      st                  7
         5 49 East 31st St           49 EaST 31st St      st                  6
         6 1010 Strawn Ctr           1010 STrawn Ctr      Str                 6
         7 4 Center Street Apt B3    4 Center ST Apt B3   Street             10
         8 400 East 21st st.         400 EaST 21st st.    st                  7

9 rows selected.

From these results, you can identify several inappropriate matches. The expression used was not specific enough for your data. Many occurrences of "St" were matched, in the middle of words, as the suffix to street numbers, and as part of the direction "East." The regular expression needs to be refined to avoid matching these undesired strings.

The following string returns only the desired rows. See if you can come up with a more comrehensive and eloquent solution.

Copy



SQL> exec xx_regexp.test('xx_addr', 'address_line1', ' STREET| ST$| ST( |\.)|STR
(T)?(^A)',' ST','i');

PL/SQL procedure successfully completed.

SQL> SELECT * FROM XX_REGEXP_RESULTS;

       SEQ BEFORE_STR                AFTER_STR            MATCHED_ST        POS
---------- ------------------------- -------------------- ---------- ----------
       245 121 21st Street           121 21st ST           Street             9
       246 49 East 31st St           49 East 31st ST       St                13
       247 4 Center Street Apt B3    4 Center ST Apt B3    Street             9
       248 400 East 21st st.         400 East 21st ST      st.               14

Summary

Oracle SQL gives you many powerful ways to perform pattern matching. In the past, it was sometimes necessary to use complicated WHERE clauses, a series of nested FUNCTIONs, or nested subqueries to perform sophisticated pattern matching. Oracle Database 10g gives you the power to use regular expressions to replace such unwieldy SQL with simple, elegant expressions to perform such tasks. It also provides you with the ability to address pattern matching problems that were impractical to tackle with traditional SQL language features. Keep this feature of Oracle Database 10g in mind when confronted with requirements involving string manipulation and matching.

Casimir Saternos is an Oracle Certified DBA, IBM Certified Enterprise Developer, and Sun Certified Java Programmer based in Allentown, Pa.