Thursday, November 29, 2007

Regexp 3: number of pattern occurence

... previous

In that post I'll talk about a problem, which is very popular, according to the number of posts in forums with similar questions.

Usually it is formulated as: "How to count the number of occurences of a substring with special pattern inside the string".
Actually, everything we were talking about in two previous posts - can be very useful in solving such a task.

Let our input data be the same as in the previous post:
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select t.* from t
4 /

three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33

We have a string with phone numbers in it.
We want to find how many numbers in the format (XXX)XXX-XX-XX are there.

So what we can achieve at the moment? We can pull out all the substrings with that particular pattern in one column:
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select t.*,
4 regexp_replace(str, '\(\d{3}\)(\d{3}-\d{2}-\d{2}( |$))|.', '\1') phone_numbers
5 from t
6 /

------------------------------------------------------------------- -------------------------
three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33 111-11-11 333-33-33

What would be the technique to find the number of substrings? The answer would be "very easy".
First instead of finding the substring according to the mask - we'll find only the first symbols of such substrings.
Then using the function LENGTH - we'll find the length of such a string - and it would be equal to the number of occurences - cause every symbol - would mean one occurenece.
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select regexp_replace(str, '\(\d{3}\)(\d{3}-\d{2}-\d{2}( |$))|.', '\1') phone_numbers,
4 regexp_replace(str, '\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.', '\1') first_letters,
5 nvl(length(regexp_replace(str, '\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.', '\1')), 0) occurrencies
6 from t
7 /

----------------------- ------------- ------------
111-11-11 333-33-33 13 2

What we have changed:
instead of \d{3} standing for the first three digits of our phone numbers, we've written \d\d{2}. That is also three digits - but with the first one separated.
Next step would be to place not the whole pattern in parenethes, but only the first digit: (\d)\d{2}
That's why when we get the whole pattern replaced with the '\1' backreference - we get only the first digits: one from each occurenece of the pattern.
After that we put LENGTH function and for the case when there no occurences - we put NVL(...,0).
In our string we have 2 occurences of a pattern like XXX-XX-XX with preceding (XXX).
And that's right.

to be continued...


mendy said...

Thank you. I was looking for this.
I had something simpler, I just needed to find the number of commas in a string.
I did len(dbo.RegExReplace(@string,'[^,]',''))