regular expressions in 10g
This article briefly introduces Oracle’s support for regular expressions in 10g, considered by many developers to be long overdue. The new functions (available in both SQL and PL/SQL) are:
- REGEXP_LIKE
- REGEXP_INSTR
- REGEXP_SUBSTR
- REGEXP_REPLACE
Each of these functions are covered below. This article assumes that the reader has some knowledge of regular expressions and pattern matching. For further details, see the SQL Reference in the online Oracle documentation.
setup
To demonstrate regular expressions we’ll use a small set of arbitrary test data, created as follows.
SQL> CREATE TABLE t (x VARCHAR2(30));
Table created.
SQL> INSERT ALL 2 INTO t VALUES ('XYZ123') 3 INTO t VALUES ('XYZ 123') 4 INTO t VALUES ('xyz 123') 5 INTO t VALUES ('X1Y2Z3') 6 INTO t VALUES ('123123') 7 INTO t VALUES ('?/*.') 8 INTO t VALUES ('\?.') 9 SELECT * FROM dual;
7 rows created.
SQL> SELECT * FROM t;
X ------------ XYZ123 XYZ 123 xyz 123 X1Y2Z3 123123 ?/*. \?. 7 rows selected.
regexp_like
First well look at REGEXP_LIKE and use this to introduce the basic means of specifying regular expressions. This function returns TRUE when it matches patterns using standard (or POSIX) notation. The pattern to match is passed as the second parameter to the function and is represented as a string using a combination of literal or meta characters. These can be quite sophisticated and flexible and it is worth reading the documentation for a full list of meta-characters, POSIX classes and references.
First we’ll look for data with a lower-case letter followed by a space and a number using standard pattern notation.
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '[a-z] [0-9]');
X ------------------------------ xyz 123
Next we look for data with a lower-case letter followed by any single character (represented by the “.”) then a number.
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '[a-z].[0-9]');
X ------------------------------ xyz 123
The question mark character in the following expression represents either zero or one occurrence of the previous pattern (in this example a lower-case letter). Hence this regular expression matches a wider range of records.
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '[a-z]?[0-9]');
X ------------------------------ XYZ123 XYZ 123 xyz 123 X1Y2Z3 123123
The “*” wildcard represents zero or more occurrences of the preceding expression, so it could be argued that the following pattern is simply looking for any data containing at least one number (zero or more lower-case characters followed by a digit).
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '[a-z]*[0-9]');
X ------------------------------ XYZ123 XYZ 123 xyz 123 X1Y2Z3 123123
To search for a specific number of occurrences of an expression, it is followed by the number of occurrences in braces. The following expression is searching for data with three consecutive capital letters.
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '[A-Z]{3}');
X ------------------------------ XYZ123 XYZ 123
This time we search for a capital letter followed by three numbers (note that to search for at least three digits, the number of occurrences would be followed by a comma, i.e. {3,}).
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '([A-Z][0-9]){3}');
X ------------------------------ X1Y2Z3
The following expression is one way of searching for all-numeric data. The caret (^) and dollar ($) denote start and end of line, respectively. The plus (+) after the digit class represents one or more occurrences.
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '^[0-9]+$');
X ------------------------------ 123123
We might sometimes need to match meta-characters, meaning that these must be “escaped”. For example, we’ve seen above that the question mark (?) is a special character meaning zero or one of the preceding pattern in the expression. To search explicitly for data containing this character, we must escape it using a backslash (\).
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '\?');
X ------------------------------ ?/*. \?.
To specify “not” in a regular expression, a caret (^) is used within the brackets to exclude the referenced characters. Note that outside of these brackets the caret denotes start of line. The following example searches for data that is not all digits, which in our dataset is all but one record. Note that this is not the same as searching for data without any digits at all (that would require a NOT REGEXP_LIKE() or a REGEXP_INSTR() = 0 expression).
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, '[^0-9]+');
X ------------ XYZ123 XYZ 123 xyz 123 X1Y2Z3 ?/*. \?. 6 rows selected.
Finally, we can also specify “either-or” using a pipe (|) meta-character. The following searches for all data containing either an ‘X’ or a ‘1’.
SQL> SELECT * FROM t WHERE REGEXP_LIKE(x, 'X|1');
X ------------ XYZ123 XYZ 123 xyz 123 X1Y2Z3 123123 5 rows selected.
regexp_instr
REGEXP_INSTR returns the position of a string that matches a given pattern. As with the INSTR function, we can specify a starting point within the string (default 1) and the nth occurrence to search for. There are additional parameters to REGEXP_INSTR that we shall briefly cover below. First, we’ll search for the position of a literal question mark in our test data.
SQL> SELECT x 2 , REGEXP_INSTR(x, '\?') AS "POSITION_OF_?" 3 FROM t;
X POSITION_OF_? ------------------------------ ------------- XYZ123 0 XYZ 123 0 xyz 123 0 X1Y2Z3 0 123123 0 ?/*. 1 \?. 2 7 rows selected.
As mentioned above, REGEXP_INSTR has additional (optional) parameters over and above those we use for standard INSTR-type operations. These additional parameters allow for offset and matching. Offset enables us to specify whether we want the position of the start of the expression match (0) or the position immediately after the end of the regex match (1). A match parameter enables us to request case sensitivity using flags (e.g. 'i' for ignore or 'c' for case sensitivity. Default is whatever is specified in NLS_SORT). The following demonstrates three variations while searching for three capital letters in our test data.
SQL> SELECT x 2 --<>-- 3 , REGEXP_INSTR( 4 x, 5 '[A-Z]{3}', --expression 6 1, --start at 7 1, --nth occurrence 8 0 --offset position 9 ) AS regexp_offset_0 10 --<>-- 11 , REGEXP_INSTR( 12 x, 13 '[A-Z]{3}', 14 1, 15 1, 16 1 17 ) AS regexp_offset_1 18 --<>-- 19 , REGEXP_INSTR( 20 x, 21 '[A-Z]{3}', 22 1, 23 1, 24 0, 25 'i' --match parameter 26 ) AS regexp_case_insensitive 27 --<>-- 28 FROM t;
X REGEXP_OFFSET_0 REGEXP_OFFSET_1 REGEXP_CASE_INSENSITIVE ------------ --------------- --------------- ----------------------- XYZ123 1 4 1 XYZ 123 1 4 1 xyz 123 0 0 1 X1Y2Z3 0 0 0 123123 0 0 0 ?/*. 0 0 0 \?. 0 0 0 7 rows selected.
regexp_substr
REGEXP_SUBSTR is incredibly useful. It returns the part of the string that matches the pattern only. It follows the same convention as REGEXP_INSTR with respect to starting position, nth occurrence and match-parameter arguments. However, it doesn't work like the standard SUBSTR, for which you specify the length of string to return.
For a very simple demonstration of REGEXP_SUBSTR, we’ll look at the default behaviour when searching for a string of one or more digits. We’ll also try to strip a second occurrence of the same pattern by using the relevant parameter. Incidentally, I’ve deliberately introduced an alternative means of expressing “digits-only” by using the POSIX character class (there are quite a few of these classes described in the online documentation).
SQL> set null {null} SQL> SELECT x 2 --<>-- 3 , REGEXP_SUBSTR( 4 x, 5 '[[:digit:]]+' --note POSIX char-class 6 ) AS first_occurrence 7 --<>-- 8 , REGEXP_SUBSTR( 9 x, 10 '[[:digit:]]+', 11 1, 12 2 13 ) AS second_occurrence 14 FROM t;
X FIRST_OCCURRENCE SECOND_OCCURRENCE ------------ -------------------- -------------------- XYZ123 123 {null} XYZ 123 123 {null} xyz 123 123 {null} X1Y2Z3 1 2 123123 123123 {null} ?/*. {null} {null} \?. {null} {null} 7 rows selected.
regexp_replace
Finally, we’ll take a brief look at REGEXP_REPLACE. As its name suggests, this function searches for a specified number of occurrences (default all) of a pattern and replaces it with a given string or NULL. As with REGEXP_INSTR and REGEXP_SUBSTR, we can supply starting position, nth occurrence and a match-parameter. In the following simple example, we’ll replace all digits with a hyphen (-).
SQL> SELECT x 2 , REGEXP_REPLACE(x, '[[:digit:]]', '-') AS nums_to_hyphens 3 FROM t;
X NUMS_TO_HYPHENS ------------ -------------------- XYZ123 XYZ--- XYZ 123 XYZ --- xyz 123 xyz --- X1Y2Z3 X-Y-Z- 123123 ------ ?/*. ?/*. \?. \?. 7 rows selected.
What if we need to replace a given pattern with the pattern itself plus some additional data? REGEXP_REPLACE enables us to back-reference the regular expression (up to 500 expressions in fact) and include it in our replacement. The following example shows how we can replace all digits with the digits themselves followed by an asterisk. We enclose the regular expression in parentheses and reference it in the replacement string using \n (in our case \1 as we only have one preceding expression).
SQL> SELECT x 2 , REGEXP_REPLACE( 3 x, 4 '([0-9])', --note the parentheses 5 '\1*' --\1 means include 1st pattern 6 ) AS search_pattern_included 7 FROM t;
X SEARCH_PATTERN_INCLUDED ------------ ----------------------------------- XYZ123 XYZ1*2*3* XYZ 123 XYZ 1*2*3* xyz 123 xyz 1*2*3* X1Y2Z3 X1*Y2*Z3* 123123 1*2*3*1*2*3* ?/*. ?/*. \?. \?. 7 rows selected.
No comments:
Post a Comment