Wednesday, November 28, 2007

Regexp 2: getting all the substrings with the particular preceding or trailing patterns

... previous

Well, let's continue.

In the previous post we learnt how to retrieve all the substrings of special pattern.
In this post we'll improve our skills.

Imagine that you have such data (again phone numbers):
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select * from t
4 /

three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33


and you want to retrieve phone numbers of the following format: XXX-XX-XX, where X stands for any digit.
But you want only those numbers which are preceded by (XXX) pattern, where X - again any number.
So the number 222-22-22, despite the fact that it has XXX-XX-XX mask - doesn't meet our requirments. Because it has [222] standing before, but not (222).

If we used regexp_susbtr - we would first retrieve phone number in (XXX)XXX-XX-XX format and then cut the (XXX) pattern possibly with another regexp_substr or regexp_replace, e.g.:
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select t.*,
4 regexp_substr(str, '\(\d{3}\)\d{3}-\d{2}-\d{2}') step1,
5 regexp_replace(regexp_substr(str,
6 '\(\d{3}\)\d{3}-\d{2}-\d{2}'),
7 '\(\d{3}\)') step2
8 from t
9 /

------------------------------------------------------------------- --------------------- ---------------
three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33 (111)111-11-11 111-11-11


And again regexp_substr is only for one occurenece at a time.

Using the trick - we could use the following construction:

this will retrieve all the substrings according to the mask 'substring_mask', but with the condition that it is preceded and trailed by particular patterns.

So in our situation we could use:
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select t.*,
4 regexp_replace(str, '\(\d{3}\)(\d{3}-\d{2}-\d{2}( |$))|.','\1') phone_numbers
5 from t
6 /

------------------------------------------------------------------- ------------------------
three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33 111-11-11 333-33-33


That's it.
As you can see - there's no 222-22-22 phone number in the result set.

to be continued...