Wednesday, November 28, 2007

Regexp 1: getting all the needed substrings

I decided to write some notes on regular expressions. And especially about one trick, that sometimes can be very useful.
It's not my invention. First time I found it on the following OTN thread written by cd. It's a pity I don't know his full name. All I know is that he's from Austria and that he's an expert on regular expressions :)

The trick is very simple: if you want to get all the occurences of substring of the particular mask from a string, just put:
regexp_replace(your_string,'(regexp_mask)|.','\1')

Those small pipe and dot (|.) at the end are making a huge deal.

Let's get to the example. Imagine you have a table and one field is of varchar type, where there's information stored about phone numbers. But it is not pure phone numbers, there's some additional text around them, e.g.:

SQL> with t as (select 'That''s phone number 1: (916)809-23-34 and that''s the second one: (918)234-12-09' str from dual union all
2 select 'Call me later on (926)507-15-34 or on (913)432-23-21' from dual union all
3 select 'two numbers: (917)888-34-34 and (903)234-43-11' from dual)
4 --
5 select * from t
6 /

STR
-------------------------------------------------------------------------------
That's phone number 1: (916)809-23-34 and that's the second one: (918)234-12-09
Call me later on (926)507-15-34 or on (913)432-23-21
two numbers: (917)888-34-34 and (903)234-43-11

SQL>

What you want is to drag out only phone numbers that have a mask (XXX)XXX-XX-XX, where X stands for any digit.

What could you do if you didn't know about the trick I mentioned?
you could use regexp_susbtr:
SQL> with t as (select 'That''s phone number 1: (916)809-23-34 and that''s the second one: [234]123-43-23 and the third one: (918)234-12-09' str from dual union all
2 select 'Call me later on (926)507-15-34 or on (913)432-23-21 or on [234]123-43-23' from dual union all
3 select 'two numbers: (917)888-34-34 and (903)234-43-11' from dual)
4 --
5 select t.*,
6 regexp_substr(str, '\(\d{3}\)\d{3}(-\d{2}){2}') phone_num1,
7 regexp_substr(str, '\(\d{3}\)\d{3}(-\d{2}){2}', 1, 2) phone_num2
8 from t
9 /

STR PHONE_NUM1 PHONE_NUM2
-------------------------------------------------------------------------------- ----------------- ------------------
That's phone number 1: (916)809-23-34 and that's the second one: [234]123-43-23 (916)809-23-34 (918)234-12-09
Call me later on (926)507-15-34 or on (913)432-23-21 or on [234]123-43-23 (926)507-15-34 (913)432-23-21
two numbers: (917)888-34-34 and (903)234-43-11 (917)888-34-34 (903)234-43-11

SQL>

But as you can see - you should write a separate regexp_substr column for each occurence of substring, because regexp_substr can pull out only one occurence at a time. And sometimes you don't know beforehand how many occurences of substring are there.

BTW I used \d for digit - if you're on version less than 10.2.x.x you can use [:digit:] or [0-9] instead.

And now let's see what we can do using a tricky regexp_replace:
SQL> with t as (select 'That''s phone number 1: (916)809-23-34 and that''s the second one: (918)234-12-09' str from dual union all
2 select 'Call me later on (926)507-15-34 or on (913)432-23-21' from dual union all
3 select 'two numbers: (917)888-34-34 and (903)234-43-11' from dual)
4 --
5 select t.*,
6 regexp_replace(str,
7 '(\(\d{3}\)\d{3}(-\d{2}){2}( |$))|.',
8 '\1') full_phone_numbers
9 from t
10 /

STR FULL_PHONE_NUMBERS
------------------------------------------------------------------------------- --------------------------------------
That's phone number 1: (916)809-23-34 and that's the second one: (918)234-12-09 (916)809-23-34 (918)234-12-09
Call me later on (926)507-15-34 or on (913)432-23-21 (926)507-15-34 (913)432-23-21
two numbers: (917)888-34-34 and (903)234-43-11 (917)888-34-34 (903)234-43-11

SQL>

You see - everything that doesn't correspond to the mask is removed, and all occurences of the needed phone numbers are placed in one column.

to be continued...

3 comments:

Anonymous said...

I have a problem regarding the regex. I have a string like - '04222/02365/12/1893/12/122/111/11'

I hav'04222/02365/12/1893/122/111/11'e to make the string like. That means I have to remove the repeated substring from the given string. The slash (/) is working as separator. Your help would be greatly appreciated.

Volder said...

Please, go through my first note in this blog.
It should help you a lot.

Tino said...

for easier debugging of regex I use kodos, http://kodos.sourceforge.net/

while primary designed for use with python the regex are based on the common syntax defined originally by perl so if you have the limits of your target implementation in mind it works quite well.