Thursday, November 29, 2007

Regexp 3: number of pattern occurence

... previous

In that post I'll talk about a problem, which is very popular, according to the number of posts in forums with similar questions.

Usually it is formulated as: "How to count the number of occurences of a substring with special pattern inside the string".
Actually, everything we were talking about in two previous posts - can be very useful in solving such a task.

Let our input data be the same as in the previous post:
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select t.* from t
4 /

STR
-------------------------------------------------------------------
three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33

SQL>
We have a string with phone numbers in it.
We want to find how many numbers in the format (XXX)XXX-XX-XX are there.

So what we can achieve at the moment? We can pull out all the substrings with that particular pattern in one column:
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select t.*,
4 regexp_replace(str, '\(\d{3}\)(\d{3}-\d{2}-\d{2}( |$))|.', '\1') phone_numbers
5 from t
6 /

STR PHONE_NUMBERS
------------------------------------------------------------------- -------------------------
three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33 111-11-11 333-33-33

SQL>
What would be the technique to find the number of substrings? The answer would be "very easy".
First instead of finding the substring according to the mask - we'll find only the first symbols of such substrings.
Then using the function LENGTH - we'll find the length of such a string - and it would be equal to the number of occurences - cause every symbol - would mean one occurenece.
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select regexp_replace(str, '\(\d{3}\)(\d{3}-\d{2}-\d{2}( |$))|.', '\1') phone_numbers,
4 regexp_replace(str, '\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.', '\1') first_letters,
5 nvl(length(regexp_replace(str, '\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.', '\1')), 0) occurrencies
6 from t
7 /

PHONE_NUMBERS FIRST_LETTERS OCCURRENCIES
----------------------- ------------- ------------
111-11-11 333-33-33 13 2

SQL>
What we have changed:
instead of \d{3} standing for the first three digits of our phone numbers, we've written \d\d{2}. That is also three digits - but with the first one separated.
Next step would be to place not the whole pattern in parenethes, but only the first digit: (\d)\d{2}
That's why when we get the whole pattern replaced with the '\1' backreference - we get only the first digits: one from each occurenece of the pattern.
After that we put LENGTH function and for the case when there no occurences - we put NVL(...,0).
In our string we have 2 occurences of a pattern like XXX-XX-XX with preceding (XXX).
And that's right.

to be continued...

Wednesday, November 28, 2007

Regexp 2: getting all the substrings with the particular preceding or trailing patterns

... previous

Well, let's continue.

In the previous post we learnt how to retrieve all the substrings of special pattern.
In this post we'll improve our skills.

Imagine that you have such data (again phone numbers):
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select * from t
4 /

STR
-------------------------------------------------------------------
three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33

SQL>

and you want to retrieve phone numbers of the following format: XXX-XX-XX, where X stands for any digit.
But you want only those numbers which are preceded by (XXX) pattern, where X - again any number.
So the number 222-22-22, despite the fact that it has XXX-XX-XX mask - doesn't meet our requirments. Because it has [222] standing before, but not (222).

If we used regexp_susbtr - we would first retrieve phone number in (XXX)XXX-XX-XX format and then cut the (XXX) pattern possibly with another regexp_substr or regexp_replace, e.g.:
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select t.*,
4 regexp_substr(str, '\(\d{3}\)\d{3}-\d{2}-\d{2}') step1,
5 regexp_replace(regexp_substr(str,
6 '\(\d{3}\)\d{3}-\d{2}-\d{2}'),
7 '\(\d{3}\)') step2
8 from t
9 /

STR STEP1 STEP2
------------------------------------------------------------------- --------------------- ---------------
three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33 (111)111-11-11 111-11-11

SQL>

And again regexp_substr is only for one occurenece at a time.

Using the trick - we could use the following construction:
regexp_replace(your_string,'preceding_pattern(substring_mask)trailing_pattern|.','\1')

this will retrieve all the substrings according to the mask 'substring_mask', but with the condition that it is preceded and trailed by particular patterns.

So in our situation we could use:
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select t.*,
4 regexp_replace(str, '\(\d{3}\)(\d{3}-\d{2}-\d{2}( |$))|.','\1') phone_numbers
5 from t
6 /

STR PHONE_NUMBERS
------------------------------------------------------------------- ------------------------
three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33 111-11-11 333-33-33

SQL>

That's it.
As you can see - there's no 222-22-22 phone number in the result set.

to be continued...

Regexp 1: getting all the needed substrings

I decided to write some notes on regular expressions. And especially about one trick, that sometimes can be very useful.
It's not my invention. First time I found it on the following OTN thread written by cd. It's a pity I don't know his full name. All I know is that he's from Austria and that he's an expert on regular expressions :)

The trick is very simple: if you want to get all the occurences of substring of the particular mask from a string, just put:
regexp_replace(your_string,'(regexp_mask)|.','\1')

Those small pipe and dot (|.) at the end are making a huge deal.

Let's get to the example. Imagine you have a table and one field is of varchar type, where there's information stored about phone numbers. But it is not pure phone numbers, there's some additional text around them, e.g.:

SQL> with t as (select 'That''s phone number 1: (916)809-23-34 and that''s the second one: (918)234-12-09' str from dual union all
2 select 'Call me later on (926)507-15-34 or on (913)432-23-21' from dual union all
3 select 'two numbers: (917)888-34-34 and (903)234-43-11' from dual)
4 --
5 select * from t
6 /

STR
-------------------------------------------------------------------------------
That's phone number 1: (916)809-23-34 and that's the second one: (918)234-12-09
Call me later on (926)507-15-34 or on (913)432-23-21
two numbers: (917)888-34-34 and (903)234-43-11

SQL>

What you want is to drag out only phone numbers that have a mask (XXX)XXX-XX-XX, where X stands for any digit.

What could you do if you didn't know about the trick I mentioned?
you could use regexp_susbtr:
SQL> with t as (select 'That''s phone number 1: (916)809-23-34 and that''s the second one: [234]123-43-23 and the third one: (918)234-12-09' str from dual union all
2 select 'Call me later on (926)507-15-34 or on (913)432-23-21 or on [234]123-43-23' from dual union all
3 select 'two numbers: (917)888-34-34 and (903)234-43-11' from dual)
4 --
5 select t.*,
6 regexp_substr(str, '\(\d{3}\)\d{3}(-\d{2}){2}') phone_num1,
7 regexp_substr(str, '\(\d{3}\)\d{3}(-\d{2}){2}', 1, 2) phone_num2
8 from t
9 /

STR PHONE_NUM1 PHONE_NUM2
-------------------------------------------------------------------------------- ----------------- ------------------
That's phone number 1: (916)809-23-34 and that's the second one: [234]123-43-23 (916)809-23-34 (918)234-12-09
Call me later on (926)507-15-34 or on (913)432-23-21 or on [234]123-43-23 (926)507-15-34 (913)432-23-21
two numbers: (917)888-34-34 and (903)234-43-11 (917)888-34-34 (903)234-43-11

SQL>

But as you can see - you should write a separate regexp_substr column for each occurence of substring, because regexp_substr can pull out only one occurence at a time. And sometimes you don't know beforehand how many occurences of substring are there.

BTW I used \d for digit - if you're on version less than 10.2.x.x you can use [:digit:] or [0-9] instead.

And now let's see what we can do using a tricky regexp_replace:
SQL> with t as (select 'That''s phone number 1: (916)809-23-34 and that''s the second one: (918)234-12-09' str from dual union all
2 select 'Call me later on (926)507-15-34 or on (913)432-23-21' from dual union all
3 select 'two numbers: (917)888-34-34 and (903)234-43-11' from dual)
4 --
5 select t.*,
6 regexp_replace(str,
7 '(\(\d{3}\)\d{3}(-\d{2}){2}( |$))|.',
8 '\1') full_phone_numbers
9 from t
10 /

STR FULL_PHONE_NUMBERS
------------------------------------------------------------------------------- --------------------------------------
That's phone number 1: (916)809-23-34 and that's the second one: (918)234-12-09 (916)809-23-34 (918)234-12-09
Call me later on (926)507-15-34 or on (913)432-23-21 (926)507-15-34 (913)432-23-21
two numbers: (917)888-34-34 and (903)234-43-11 (917)888-34-34 (903)234-43-11

SQL>

You see - everything that doesn't correspond to the mask is removed, and all occurences of the needed phone numbers are placed in one column.

to be continued...