Saturday, February 23, 2008

Patterns evaluation order in regular expressions

After a lull in my blogging activity I decided to write a note on Oracle regular expressions, which can be useful for those who wants to use Oracle regexp functions more efficiently.

We will talk about masks with several alternatives.
Lets look at the following example:
SQL> with t as (select '1H1' str from dual)
2 select regexp_replace(str, '1|1H', 'A') mask1,
3 regexp_replace(str, '1H|1', 'A') mask2
4 from t
5 /

MASK1 MASK2
----- -----
AHA AA

SQL>

"|" (pipe) is OR operator in regular expressions. It is used to list several alternatives to be matched.
But the most important thing is that they are passed one by one in order of apperance inside search mask.
As you can see in the example we have two columns (mask1, mask2). The difference is that we changed places of "1" and "1H", and the results are absolutely different.

So how it is working:
MASK1 ('1|1H'):
In the initial string '1H1' we start to search for the first occurence.
'1' matches the first pattern ('1') and hence replaced with 'A'.
Then proceeding with the rest of the line ('H1').
The next symbol 'H' is not matching '1' so we go to the next pattern ('1H'), but it doesn't match it also.
So we leave it as it is and go to the rest '1', which matches the first pattern of a mask ('1') and consequently replaced with 'A'.

So if we combine all the changes done - we finally get 'AHA'.

MASK2('1H|1'):
Now when we changed the order of the patterns inside the mask - the result would be different.
So we start with the first symbol '1' again.
It matches the beginning of our first pattern ('1H').
And inspite that this symbol is enough to cover the second pattern, we add one more letter to watch whether it satisfies the first pattern or not.
So we add the next symbol 'H' and get '1H' which matches the first pattern, and hence replaced with 'A'.
Then we proceed with the rest of line ('1').
So it doesn't match the first pattern '1H', we check with the second pattern and we find a match, so change '1' to 'A'.
In the final result we have 'AA'.

The main point what should be learnt here is that we don't proceed with the next pattern until we know for sure - that the current pattern wouldn't match.

How can it be used in practice.
In one of my previous posts I already used this technique, but here are couple of other examples recently posted on OTN forum.

Example #1.
Task:
The column comprises the list of names.
We need to get the following result: in case there is only one name in a column - we need to return this name (not touched) with trimmed preceding and trailing spaces.
In case when there are several words in a name - we need to return only first letters (initials).
Solution:
SQL> with t as (select 'Mark Thomsan' str from dual union all
2 select 'Allen' from dual union all
3 select 'John Trovolta Robert' from dual union all
4 select ' John' from dual union all
5 select 'Frederick ' from dual union all
6 select ' Erick Cartman ' from dual union all
7 select ' Michael ' from dual)
8 --
9 select str,regexp_replace(str,'^ *([^ ]*) *$|(^| )([^ ])|.','\1\3') str_new from t
10 /

STR STR_NEW
----------------------- ----------------
Mark Thomsan MT
Allen Allen
John Trovolta Robert JTR
John John
Frederick Frederick
Erick Cartman EC
Michael Michael

7 rows selected

SQL>

Explanation:
We have a mask comprising 3 patterns (splitted with '|'):
1) '^ *([^ ]*) *$'
2) '(^| )([^ ])'
3) '.'

So the first one matches the whole string ('^' as the beginning and '$' as the end), that contains no or only one word ('([^ ]*)'), which is preceded or trailed by any number of spaces (' *').
If our column value is like this - then we return only the word as a result ('\1').
If there is more than one word in a column value - this mask is of no use.
Hence we proceed with the second pattern.
This matches the first letters of each word. We specify that the letters are the first only - by placing '(^| )' before any non-space symbol '([^ ])'.
So all the first letters would be returned in the result '\3' (this will happen only for string which contains > 1 word, otherwise the whole string would be covered by first pattern, and we will never reach the second pattern).
The last pattern '.' is symply any other character - not mentioned in the previous two patterns.
So it is kind of clean up technique to put '|.' in the end of regexp_replace mask.

The result is what we need: Allen, John, Frederick and Michael were returned as they were in the input data. The others (contain more than 1 word) are replaced by initials only.

Example #2.

Task: We need to eliminate all the spaces, which are not between two words.
If there are more than one space between words they should be trimmed to only one.
Solution:
SQL> with t as (select ' 6213, 2345, Application Developer' str from dual union all
2 select '123, Avenue, app. 324, first door second floor' from dual)
3 --
4 select regexp_replace(str,'([[:alpha:]] ) *([[:alpha:]])| |(.)','\1\2\3') new_str from t
5 /

NEW_STR
-------------------------------------------------
6213,2345,Application Developer
123,Avenue,app.324,first door second floor

SQL>

Exaplnation:
We have a mask comprising 3 patterns again:
1) '([[:alpha:]] ) *([[:alpha:]])'
2) ' '
3) '(.)'

The first pattern searches for two letters with at least one space between and returnes these letters with only one space between them ('\1\2').
The second pattern contains only one space and matches all the other spaces, that were not covered by the first pattern.
As we don't have any backreference for this pattern - all such spaces would be eliminated from the initial value.
The third pattern matches any character '(.)'. We remember that all the spaces were covered by first or second pattern, so no spaces would match this '(.)'.
As we have '\3' for this pattern - all such symbols would be returned in the result.
That's how we left one space between letters only, and erased all the other spaces from the sentence.

Sunday, February 10, 2008

Magic MIT whiteboard

I came across a post from Claudia Zeiler, where she mentioned about a really interesting stuff. It is not regarding Oracle, but I liked it, so decided to post in my blog also.
I remember the times at my school, when we used wooden boards and were writing with chalks.
Then came soft-tip pens with paper and white boards or stand with throw-over lists of paper.
And look what can be used now :) BTW the reproduction of physical objects natural behaviour is pretty good.
So no doubts, that our children would be cleverer us, using such a stuff while studying.

Monday, February 4, 2008

Prime factors of a number

Image:Animation Sieve of Eratosth-2.gif
All we know that every natural number (except 1) can be written as a product of prime numbers, and such a decomposition would be unique for each number. This is what the Fundamental theorem of arithmetic is about.

Let's get a query in SQL, which should return (output) a prime factorization of a given natural number (input).

Some time ago I wrote on a OTN forum a query, using model clause, for finding all prime numbers less than a given one. It uses Sieve of Eratosthenes method.

So using this stuff I added one more model part that would iteratively find out all prime factors of a number (one prime factor at an iteration).

And here is my query:
SQL> var n number
SQL> exec :n:=1260

PL/SQL procedure successfully completed
n
---------
1260

SQL>
SQL> with t as (select level l from dual connect by level <= :n),
2 --
3 t1 as (select l prim_num from
4 (select * from t
5 model
6 dimension by (l dim)
7 measures (l,2 temp)
8 rules iterate (1e8) until (power(temp[1],2)>:n)
9 (l[DIM>TEMP[1]]=decode(mod(l[CV()],temp[1]),0,null,l[CV()]),
10 temp[1]=min(l)[dim>temp[1]])
11 )
12 where l is not null)
13 --
14 select '('||prim_num||'^'||pow||')' prim_factors from (
15 select * from t1
16 model
17 dimension by (rownum rn)
18 measures(prim_num, :n val, 0 pow)
19 rules iterate(1000) until (val[1]=1)
20 (pow[any] order by rn = decode(mod(val[1],prim_num[CV()]),0,pow[CV()]+1,pow[CV()]),
21 val[rn>1] order by rn = decode(mod(val[CV()-1],prim_num[CV()]),0,val[CV()-1]/prim_num[CV()],val[CV()-1]),
22 val[1]=min(val)[any]))
23 where rn>1 and pow>0
24 /

PRIM_FACTORS
--------------------------------------------------------------------------------
(2^2)
(3^2)
(5^1)
(7^1)
n
---------
1260

SQL>

It was not very effective, e.g. to find prime factors of numbers near 100000 it took about 2 or more minutes to execute on my laptop.

So I decided to post it as challenge on russian SQL.RU forum (russian version and english mirror) and there were pretty good solutions.

But I liked the one from andreymx:
SQL> var p_value number
SQL> exec :p_value:=123432435

PL/SQL procedure successfully completed

Executed in 0,03 seconds
p_value
---------
123432435

SQL>
SQL> set timing on
SQL> WITH s1 AS (SELECT LEVEL lv FROM dual CONNECT BY ROWNUM <= SQRT(:p_value)),
2 s2 AS (SELECT lv id1, :p_value/lv id2 FROM s1 WHERE MOD(:p_value, lv)=0),
3 s3 AS (SELECT id1 ID FROM s2 UNION SELECT id2 FROM s2),
4 s4 AS (SELECT ID, (SELECT MIN(ID) FROM s3 s3_1 WHERE s3_1.ID / s3.ID = TRUNC(s3_1.ID / s3.ID) AND s3_1.ID > s3.ID) idd FROM s3)
5 SELECT :p_value||'='||MAX(LTRIM(RTRIM(SYS_CONNECT_BY_PATH(idd/ID, '*'), '*'), '*')) FROM s4
6 CONNECT BY ID=PRIOR IDd
7 START WITH ID=1
8 /

:P_VALUE||'='||MAX(LTRIM(RTRIM
--------------------------------------------------------------------------------
123432435=3*3*5*7*71*5519

Executed in 0,22 seconds
p_value
---------
123432435

SQL>

As you can see it executes a query for huge numbers less than a second :))
Of course, it depends, but still...

Pair of words about it:
First, he finds all the numbers less than square root of a number (s1).

Then finds all the factors of that number among them and the opposite factor (dividing the number by each factor) (s2).
Thats how he saves a lot of resources.

Then Andrey builds a full list of factors for the initial number (s3).

Then he finds for each factor the next bigger factor which is divided by the current one without rest (s4). This trick was interesting.

At the last step he builds a hierarchy from the 1 factor up to the end (the initial number). All the prime factors are calculated as "child" factor/"parent" factor (IDD/ID in the query).

nice, isn't it? :))