Sunday, December 2, 2007

Regexp 4: number of occurence (advanced)

... previous

In this post I'll talk about a technique - that I also learned from cd.
In the previous post - we got familiar with a technique to find the number of occurrencies of a substring in a string.

But there are some drawbacks, e.g.:
SQL> with t as (select '(111)111-11-11 words-non-stop(222)222-22-22' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.',
6 '\1')),
7 0) occurrencies
8 from t
9 /

STR OCCURRENCIES
------------------------------------------- ------------
(111)111-11-11 words-non-stop(222)222-22-22 2

SQL>

As we can see here - we find both examples of phone numbers as fitting to our requirments. Actually, yes - there are two occurrencies of our pattern. But usually when it is combined with another alphanumeric value - we don't want to take it into account. So the result should be only 1 occurrence.

Well, let's assign the task more concrete: our phone numbers should be preceded or trailed by one of the following: space, comma or semicolon. And it also can be the first or the last structure in a string.

So we may think, that if we add such symbols before and after our pattern - it can help us:
SQL> with t as (select '(111)111-11-11 words-non-stop(222)222-22-22' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '(^|[ ,;])\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}([ ,;]|$)|.',
6 '\2')),
7 0) occurrencies
8 from t
9 /

STR OCCURRENCIES
------------------------------------------- ------------
(111)111-11-11 words-non-stop(222)222-22-22 1

SQL>

Yep, in that case it was helpful.

But if we take an example, when two phone numbers are coming one after another - separated by one space - it is not working:
SQL> with t as (select '(111)111-11-11 (222)222-22-22' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '(^|[ ,;])\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}([ ,;]|$)|.',
6 '\2')),
7 0) occurrencies
8 from t
9 /

STR OCCURRENCIES
----------------------------- ------------
(111)111-11-11 (222)222-22-22 1

SQL>

Why? because it works in the following way:
1. Finds the first pattern '(111)111-11-11 '.
As you see - there's a space at the end.
2. Replaces it with backreference - which in this case would be '1'.
3. Then goes to the rest of the string which is '(222)222-22-22' and it is not the beginning of the string.
So such a structure doesn't suit to our pattern - because it requires - space, comma or colon at the beginning. And we got no space - cause it was taken by the previous structure '(111)111-11-11 '.
4. The function replaces this structure out.
5. We get 1 occurrence as a result, instead of 2.

What we can do here:
The technique is very interesting:
Firstly we place all the patterns - which don't fulfil our requirments. We separate them with the pipe (|), which means OR on the language of regular expressions.
Then we place our desired pattern. And at the end we place '|.' as usual :)

So the generalised pattern would look in the following way:

regexp_replace(your_string,'[symbol1](our_pattern)+|(our_pattern)[symbol2]+|(our_pattern)|.','\3')

where symbol1 - symbol or list of symbols, that can't precede our pattern,
symbol2 - symbol or list of symbols, that can't trail our pattern.

And when we determine the backreference - we would place a reference to the third our pattern.
So when the regexp engine will meet the '[symbol1](our_pattern)+' or '(our_pattern)[symbol2]+', the regexp will remove such structures (actually, it will replace it with our backreference, which points to the 3rd structure - so it would be null). And when it will meet the needed pattern (third one) - it would leave it for us.
The |. at the end - will remove everything - that doesn't suits the previous three structures.

If we get back to our example, we get:
SQL> with t as (select '(111)111-11-11 (222)222-22-22' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})+|\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]+|\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.',
6 '\2')),
7 0) occurrencies
8 from t
9 /

STR OCCURRENCIES
----------------------------- ------------
(111)111-11-11 (222)222-22-22 2

SQL>

Ok. Let's investigate why we put additional '+' two times.
The first one was when we put it in the first OR structure.
Let's remove it and look at the following example:
SQL> with t as (select '(111)111-11-11 (222)222-22-22 3(333)333-33-33(444)444-44-44' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})|\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]+|\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.',
6 '\2')),
7 0) occurrencies_INCORRECT,
8 nvl(length(regexp_replace(str,
9 '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})+|\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]+|\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.',
10 '\2')),
11 0) occurrencies_CORRECT
12 from t
13 /

STR OCCURRENCIES_INCORRECT OCCURRENCIES_CORRECT
----------------------------------------------------------- ---------------------- --------------------
(111)111-11-11 (222)222-22-22 3(333)333-33-33(444)444-44-44 3 2

SQL>

Actually, when we removed '+' (occurrencies_incorrect field) we got 3 as a result insted of 2 - why?
Well, the problem as how you guessed in the '3(333)333-33-33(444)444-44-44' part.
If we don't put '+' in the '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})' - it would meet 3(333)333-33-33 and stop. After this structure will be removed - the regexp is going forward.
It finds (444)444-44-44.
Well it doesn't suit to the '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})' - because there's no symbol [^ ,;] before it (we removed 3(333)333-33-33).
Then it checks whether it suits to '\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]+'. Again negative - cause there is no '[^ ,;]+' at the end. Finally it suits to our third structure '\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}' - and that's why it hits the final result set.
If we put '+' after the first pattern - it would mean one or more occurrences of our pattern. So the whole 3(333)333-33-33(444)444-44-44 will fall under the first structure - and gets removed.

Let's continue with the second '+'.
Imagine the following situation:
SQL> with t as (select '(111)111-11-11 (222)222-22-22 (333)333-33-33A(444)444-44-44' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})+|\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]|\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.',
6 '\2')),
7 0) occurrencies_INCORRECT,
8 nvl(length(regexp_replace(str,
9 '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})+|\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]+|\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.',
10 '\2')),
11 0) occurrencies_CORRECT
12 from t
13 /

STR OCCURRENCIES_INCORRECT OCCURRENCIES_CORRECT
----------------------------------------------------------- ---------------------- --------------------
(111)111-11-11 (222)222-22-22 (333)333-33-33A(444)444-44-44 3 2

SQL>

So the problem again in the last two phone numbers. Now we put an extra 'A' between them. What causes the problem in that case:
If we don't place '+' after the '[^ ,;]' in the second structure,
the pattern '\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]' will meet (333)333-33-33A, with extra 'A' at the end - and removes it all.
So when it continues with the rest of the line it recognizes (444)444-44-44 as a valid phone number.
But when we add a '+' after '[^ ,;]' the pattern '\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]+' will cover the whole (333)333-33-33A(444)444-44-44, and remove it.

PS
Our mask in the regexp can be simplified a little.
In general it would look like:
regexp_replace(your_string,'[symbol1](our_pattern)+|(our_pattern)[symbol3]|.','\2')

Here everything means the same as in the previous examples - except symbol3. Now it means any symbol that can trail our pattern.
And in that case we should place the second structure as a backreference (instead of third in the previous examples).
SQL> with t as (select '(111)111-11-11 (222)222-22-22 3(333)333-33-33 (444)444-44-44' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})+|\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}([ ,;]|$)|.',
6 '\2')),
7 0) occurrencies
8 from t
9 /

STR OCCURRENCIES
------------------------------------------------------------ ------------
(111)111-11-11 (222)222-22-22 3(333)333-33-33 (444)444-44-44 3

SQL>


Hope you can find out how it works by yourself now :)

2 comments:

SnippetyJoe said...

Nice series of articles Volder. Regarding

"but if we take an example, when two phone numbers are coming one after another - separated by one space - it is not working:"

here's an alternative solution. It uses an extra function call, but the regular expression is a little simpler.

with t as (select '(111)111-11-11 (222)222-22-22' str from dual)
--
select
str,
nvl
( length
( regexp_replace
( ' ' ||
replace( str, ' ', '  ' ) || -- replaces one space with two
' ',
'[ ,;]\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}[ ,;]|.', '\1')
),
0
) occurrencies
from t
/

STR OCCURRENCIES
----------------------------- ------------
(111)111-11-11 (222)222-22-22 2

(Forgive the ugly formatting, I couldn't figure a way to get Blogger to format the code using monospace.)

Volder said...

Hi, Joe.
Thanks for your comment.

This is an alternative way of doing this. But a small addition - you better not double only whitespaces - but you should double all possible delimiters.
So instead of replace(str,' ',' ') you should use regexp_replace(str,'([ ,;])','\1\1') and BTW you should remember to change the pattern mask, in case if it is possible to meet tha same character inside the mask.