Saturday, February 23, 2008

Patterns evaluation order in regular expressions

After a lull in my blogging activity I decided to write a note on Oracle regular expressions, which can be useful for those who wants to use Oracle regexp functions more efficiently.

We will talk about masks with several alternatives.
Lets look at the following example:
SQL> with t as (select '1H1' str from dual)
2 select regexp_replace(str, '1|1H', 'A') mask1,
3 regexp_replace(str, '1H|1', 'A') mask2
4 from t
5 /

----- -----


"|" (pipe) is OR operator in regular expressions. It is used to list several alternatives to be matched.
But the most important thing is that they are passed one by one in order of apperance inside search mask.
As you can see in the example we have two columns (mask1, mask2). The difference is that we changed places of "1" and "1H", and the results are absolutely different.

So how it is working:
MASK1 ('1|1H'):
In the initial string '1H1' we start to search for the first occurence.
'1' matches the first pattern ('1') and hence replaced with 'A'.
Then proceeding with the rest of the line ('H1').
The next symbol 'H' is not matching '1' so we go to the next pattern ('1H'), but it doesn't match it also.
So we leave it as it is and go to the rest '1', which matches the first pattern of a mask ('1') and consequently replaced with 'A'.

So if we combine all the changes done - we finally get 'AHA'.

Now when we changed the order of the patterns inside the mask - the result would be different.
So we start with the first symbol '1' again.
It matches the beginning of our first pattern ('1H').
And inspite that this symbol is enough to cover the second pattern, we add one more letter to watch whether it satisfies the first pattern or not.
So we add the next symbol 'H' and get '1H' which matches the first pattern, and hence replaced with 'A'.
Then we proceed with the rest of line ('1').
So it doesn't match the first pattern '1H', we check with the second pattern and we find a match, so change '1' to 'A'.
In the final result we have 'AA'.

The main point what should be learnt here is that we don't proceed with the next pattern until we know for sure - that the current pattern wouldn't match.

How can it be used in practice.
In one of my previous posts I already used this technique, but here are couple of other examples recently posted on OTN forum.

Example #1.
The column comprises the list of names.
We need to get the following result: in case there is only one name in a column - we need to return this name (not touched) with trimmed preceding and trailing spaces.
In case when there are several words in a name - we need to return only first letters (initials).
SQL> with t as (select 'Mark Thomsan' str from dual union all
2 select 'Allen' from dual union all
3 select 'John Trovolta Robert' from dual union all
4 select ' John' from dual union all
5 select 'Frederick ' from dual union all
6 select ' Erick Cartman ' from dual union all
7 select ' Michael ' from dual)
8 --
9 select str,regexp_replace(str,'^ *([^ ]*) *$|(^| )([^ ])|.','\1\3') str_new from t
10 /

----------------------- ----------------
Mark Thomsan MT
Allen Allen
John Trovolta Robert JTR
John John
Frederick Frederick
Erick Cartman EC
Michael Michael

7 rows selected


We have a mask comprising 3 patterns (splitted with '|'):
1) '^ *([^ ]*) *$'
2) '(^| )([^ ])'
3) '.'

So the first one matches the whole string ('^' as the beginning and '$' as the end), that contains no or only one word ('([^ ]*)'), which is preceded or trailed by any number of spaces (' *').
If our column value is like this - then we return only the word as a result ('\1').
If there is more than one word in a column value - this mask is of no use.
Hence we proceed with the second pattern.
This matches the first letters of each word. We specify that the letters are the first only - by placing '(^| )' before any non-space symbol '([^ ])'.
So all the first letters would be returned in the result '\3' (this will happen only for string which contains > 1 word, otherwise the whole string would be covered by first pattern, and we will never reach the second pattern).
The last pattern '.' is symply any other character - not mentioned in the previous two patterns.
So it is kind of clean up technique to put '|.' in the end of regexp_replace mask.

The result is what we need: Allen, John, Frederick and Michael were returned as they were in the input data. The others (contain more than 1 word) are replaced by initials only.

Example #2.

Task: We need to eliminate all the spaces, which are not between two words.
If there are more than one space between words they should be trimmed to only one.
SQL> with t as (select ' 6213, 2345, Application Developer' str from dual union all
2 select '123, Avenue, app. 324, first door second floor' from dual)
3 --
4 select regexp_replace(str,'([[:alpha:]] ) *([[:alpha:]])| |(.)','\1\2\3') new_str from t
5 /

6213,2345,Application Developer
123,Avenue,app.324,first door second floor


We have a mask comprising 3 patterns again:
1) '([[:alpha:]] ) *([[:alpha:]])'
2) ' '
3) '(.)'

The first pattern searches for two letters with at least one space between and returnes these letters with only one space between them ('\1\2').
The second pattern contains only one space and matches all the other spaces, that were not covered by the first pattern.
As we don't have any backreference for this pattern - all such spaces would be eliminated from the initial value.
The third pattern matches any character '(.)'. We remember that all the spaces were covered by first or second pattern, so no spaces would match this '(.)'.
As we have '\3' for this pattern - all such symbols would be returned in the result.
That's how we left one space between letters only, and erased all the other spaces from the sentence.


jlp4221 said...

You explain very well and I have learned alot from your posts.

Allow me to ask this; if you have a string,
"09900 xx/xxxxx"
"00900 xx/xxxxx"
and you want to convert the numbers to currency; this is easy with substr, etc. But with REGEXPs, how would you do it? I can strip off the leading 0s & format to a point; but I can't figure how to add the decimal in place?

regexp_replace (str,'^[0]{1,2}','$')

Is there a way to do this in one pass with REGEXPs?
Thanks for the time to read this.

Volder said...

Hi, buddy.
I'm not fully understand what you want, e.g. I didn't catch what is xx/xxxxx in your data, and should be used or not. Also it is not very clear where you want to put your decimal.
But still - you can try to use REGEXP greediness:

SQL> with t as (select '09900 xx/xxxxx' str from dual union all
2 select '00900 xx/xxxxx' from dual)
3 select str,regexp_replace(str,'^0{1,}([0-9]+)([0-9]{2})','$\1.\2') new_str from t
4 /

-------------- -----------------------
09900 xx/xxxxx $99.00 xx/xxxxx
00900 xx/xxxxx $9.00 xx/xxxxx


jlp4221 said...

Thanks, Volder! That's very close to what I was attempting.