Thursday, October 25, 2007

Using collections with 10g features

After a small lull I decided to post some thoughts on collections. Especially those features that were introduced starting from 10g Oracle version.

Actually, the legs are growing from that thread on OTN forum. A lot of solutions were given, and for sure I would use one of them on the OP's place, but I just decided to show how It can be done using collections.

Problem description:

Input data (from the forum):
"ok I have 4 columns returned in my query which can either have a value of 'FF' or null".

So we have a table:


SQL> create table t as
2 (select 'FF' as col1, null as col2, null as col3, null as col4 from dual union all
3 select 'FF', 'FF', 'FF', null from dual union all
4 select 'FF', 'FF', null, 'FF' from dual union all
5 select 'FF', null, 'FF', null from dual)
6 /

Table created.

SQL> select * from t;

COL1 COL2 COL3 COL4
---- ---- ---- ----
FF
FF FF FF
FF FF FF
FF FF
SQL>

Goal:
"I want to create a fifth column with a calculation which basically states..." how many fields have 'FF' value in each row. Or it can be formulated as: "how many columns have non-null values" (it's not the same, but when the possible values domain consists of only 'FF' or NULL - it is quite similar).

Expected output:

COL1 COL2 COL3 COL4 COL5
---------- ---------- ---------- ---------- ----------
FF NULL NULL NULL 1
FF FF FF NULL 3
FF FF NULL FF 3
FF NULL FF NULL 2


Solution:
First of all let's create a collection consisting of varchar elements:


SQL> create type test as table of varchar2(100);
2 /

Type created.
SQL>

Then we execute a query which would return us a collection in the fifth field, where all the values from the previous four columns would be stored:


SQL> column col1 format A10;
SQL> column col2 format A10;
SQL> column col3 format A10;
SQL> column col4 format A10;
SQL> column coll format A30;
SQL> set null 'NULL';

SQL> select col1, col2, col3, col4,
2 cast(multiset(
3 select decode(level, 1, col1, 2, col2, 3, col3, 4, col4) from dual
4 connect by level <= 4) as test) coll
5 from t
6 /

COL1 COL2 COL3 COL4 COLL
---------- ---------- ---------- ---------- ------------------------------
FF NULL NULL NULL TEST('FF', NULL, NULL, NULL)
FF FF FF NULL TEST('FF', 'FF', 'FF', NULL)
FF FF NULL FF TEST('FF', 'FF', NULL, 'FF')
FF NULL FF NULL TEST('FF', NULL, 'FF', NULL)
SQL>

Now we look for a new operator MULTISET INTERSECT.
So the main idea of our solution is to get rid of needless elements in the initial collection and then count how many elements are left there.

So if we intersect our collection with a collection storing only 'FF' elements - we'll get rid of redundant NULL elements.

Let's do it:


SQL> select col1,col2,col3,col4,
2 cast(multiset(
3 select decode(level,1,col1,2,col2,3,col3,4,col4) from dual
4 connect by level<=4) as test)
5 multiset intersect all
6 cast(multiset (select 'FF' from dual connect by level <= 4) as test) coll from t
7 /

COL1 COL2 COL3 COL4 COLL
---------- ---------- ---------- ---------- ------------------------------
FF NULL NULL NULL TEST('FF')
FF FF FF NULL TEST('FF', 'FF', 'FF')
FF FF NULL FF TEST('FF', 'FF', 'FF')
FF NULL FF NULL TEST('FF', 'FF')
SQL>

I use ALL key word explicitly (although it is default value) to distinguish that we don't want to loose values wich are duplicated ('FF' in our case).

Another important point here is making the second collection consist of not less elements than the first does (4 elements in our case - I use simple connect by level<=4), because if you leave only one 'FF' element in the second collection - you'll get also only one element in the resulting collection:


SQL> select col1, col2, col3, col4,
2 cast(multiset(select decode(level,1,col1,2,col2,3,col3,4,col4) from dual
3 connect by level<=4) as test)
4 multiset intersect all
5 cast(multiset (select 'FF' from dual) as test) coll from t
6 /

COL1 COL2 COL3 COL4 COLL
---------- ---------- ---------- ---------- ------------------------------
FF NULL NULL NULL TEST('FF')
FF FF FF NULL TEST('FF')
FF FF NULL FF TEST('FF')
FF NULL FF NULL TEST('FF')
SQL>

It is described in the documentation:
For example, if a particular value occurs m times in nested_table1 and n times in nested_table2, then the result would contain the element min(m,n) times.

Ok, now the last our step - is counting how many elements are there in the resulting collection. It can easily be done with a CARDINALITY function.

So the final query would be:


SQL> select col1,col2,col3,col4,
2 cardinality(cast(multiset(
3 select decode(level,1,col1,2,col2,3,col3,4,col4) from dual
4 connect by level<=4) as test)
5 multiset intersect all
6 cast(multiset (select 'FF' from dual connect by level <= 4) as test)) coll from t
7 /

COL1 COL2 COL3 COL4 COLL
---------- ---------- ---------- ---------- ----------
FF NULL NULL NULL 1
FF FF FF NULL 3
FF FF NULL FF 3
FF NULL FF NULL 2
SQL>

Alternative solution would be using as a second collection - a collection with null elements, then intersect them both, and counting the column value as 4 - number_of_null_columns, where 4 is total number of fields in our row.

But we can use here another new multiset operator MULTISET EXCEPT:


SQL> select col1, col2, col3, col4,
2 cardinality(cast(multiset(select decode(level, 1, col1, 2, col2, 3, col3, 4, col4) from dual
3 connect by level <= 4) as test)
4 multiset except all
5 cast(multiset(select null from dual connect by level<=4) as test)) coll from t
6 /

COL1 COL2 COL3 COL4 COLL
---------- ---------- ---------- ---------- ----------
FF NULL NULL NULL 1
FF FF FF NULL 3
FF FF NULL FF 3
FF NULL FF NULL 2
SQL>

It is returning all elements from the first collection which are not met in the second one.

PS
strange for me was the following problem:


SQL> select col1, col2, col3, col4,
2 cardinality(cast(multiset(select col1 from dual union all
3 select col2 from dual union all
4 select col3 from dual union all
5 select col4 from dual) as test)
6 multiset intersect all
7 cast(multiset (select 'FF' from dual connect by level <= 4) as test)) coll from t
8 /

...

ORA-03001: unimplemented feature
SQL>


So we can't pass a query with UNION ALL to the CAST(MULTISET(...) as ...) function. But I didn't find such a kind of limitation in the doc.

Friday, October 12, 2007

Learning XHTML & CSS

For everyone who'd like to get familiar with XHTML & CSS stuff I highly recommend the book
Head First HTML with CSS & XHTML
Head First HTML with CSS & XHTML.

A week ago I didn't even know what CSS is, but today I've changed the appearence of my blog :))
The one thing was irritating that the width of my blog was fixed, and when it was read from the monitors with high resolution it looked like a narrow column, but now the width of the post bodies is flexible depending on the width of the browser.
I like it.

Wednesday, October 3, 2007

Removing duplicate elements from the string

Hi there! :)
Well, this is actually my first post in this newly created blog.
The idea of keeping a blog was born a day ago, when there was a question on the Oracle forum (this thread). I’ve met the same one only a few days ago – but couldn’t find it, cause there were some problems with the search engine I usually face when I need to find something :))
So I decided – why don’t I keep my own blog – where I can post such solutions that can be interesting for others. And here we are!

Ok, let's go back to the problem, mentioned in those links.

Problem description:

Input data:

There is a string containing elements. They can be separated with any symbol, e.g. a comma:

SQL> with t as (select 'elem1, elem2, elem3, elem1, elem3, elem2, elem2' str
from dual)
--
select * from t
/

STR
-----------------------------------------------
elem1, elem2, elem3, elem1, elem3, elem2, elem2

SQL>

...or any other clear way, for example, it was formulated as "3 letter codes". So in this case there are no element delimiters, but we know that each element is of three same consequtive letters, e.g.:

SQL> with t as (select 'AAABBBCCCBBBDDDAAAEEEBBB' str from dual)
--
select * from t
/

STR
------------------------
AAABBBCCCBBBDDDAAAEEEBBB

SQL>

Goal:

The objective is to remove duplicate elements out of the string, to leave only one specimen of each element.

Expected output:

So in the first case the result should be:
elem1, elem2, elem3

In the second:
AAABBBCCCDDDEEE

Solution:

In both cases the first thing what we need - is to pick out elements from the string.
Regular expressions are very friendly for us in that job (espesially, when there are no delimiters).
So elements like AAA, BBB, CCC etc. can be written as '([[:alpha:]])\1{2}' - any letter trailed by two same letters.

Now we want to understand - what would be the mask for such an element followed by another elements and again the same first element, e.g. AAABBBCCCAAA, or it can be followed directly by the same element: AAAAAA.
In language of regular expressions it would look like '(([[:alpha:]])\2{2}).*\1'

That's it!
Now if we put regexp_replace(str, '(([[:alpha:]])\2{2})(.*)\1','\1\3') we would throw out one last element which is met firstly and has duplicate values in the string. It would be last, because we used greedy operator '*'.

So for example:
SQL> with t as (select 'AAABBBCCCBBBEEEDDDAAAEEEBBBEEE' str from dual)
--
select str,
regexp_replace(str, '(([[:alpha:]])\2{2})(.*)\1','\1\3') new_str
from t
/

STR NEW_STR
------------------------------ -------------------------------
AAABBBCCCBBBEEEDDDAAAEEEBBBEEE AAABBBCCCBBBEEEDDDEEEBBB

SQL>
What is the logic of this operation:

  1. We look for the first element which has duplicated values. In our case it is AAA.
  2. Then regular expression operator finds the last AAA met in the string and removes it.
  3. Then, it goes to the rest of the string and again finds the first element, which has duplicated values in the rest of the string, now it is EEE.
  4. And finally removes the last EEE element.
  5. In our example it is the end of the operations, but if the string is longer it would proceed the previously mentioned operations again and again.
It would be more comprehensible if we mark the first met element with green, and the last, which would be removed, with red:
'AAABBBCCCBBBEEEDDDAAAEEEBBBEEE'.
As you can see, the first EEE element is not really the first EEE element in the string. It is the first one in the rest of the string after we removed AAA element.

If we iteratively apply this regular expression to our string - finally we remove all duplicated elements. But about it later. Now let's improve our expression a little bit.
First we'll change greedy operator '*' to non-greedy '*?'. So that we will do our job in fewer iterations.

Let's look at the following example:
we have string 'AAABBBAAAAAACCCAAA'.
with greedy regexp_replace(str, '(([[:alpha:]])\2{2})(.*)\1','\1\3') we'll have iterations:

  1. AAABBBAAAAAACCCAAA
  2. AAABBBAAAAAACCC
  3. AAABBBAAACCC
result: 'AAABBBCCC'. So it took us 3 times to iterate.
And with the non-greedy regexp_replace(str, '(([[:alpha:]])\2{2})(.*?)\1','\1\3') we'll have iterations:

  1. AAABBBAAAAAACCCAAA
  2. AAABBBAAACCC
The same result and achieved in two iterations.

Let's make one more improvement: add '+' after the backreference '\1'. In the language of regular expressions - it means one or more occurencies of the element which is preceding this '+'.
Let's imagine we have a string: 'AAABBBAAAAAAAAAAAA'.
With regexp_replace(str, '(([[:alpha:]])\2{2})(.*?)\1','\1\3') we'll make three iterations:

  1. AAABBBAAAAAAAAAAAA
  2. AAABBBAAAAAA
  3. AAABBBAAA
If we place '+' and use regexp_replace(str, '(([[:alpha:]])\2{2})(.*?)\1+','\1\3') we'll have to make only one iteration:

  1. AAABBBAAAAAAAAAAAA
Finally, it is a time for implementing an iterative mechanism of applying the same function to the string. Starting from Oracle version 10g there was a nice Model clause introduced, which can be used to proceed operations iteratively.

So, the final query would look like:
SQL> WITH t AS (SELECT
'AAABBBCCCBBBDDDAAAEEEBBB' str FROM dual)
--
select str, str_new from t
model
dimension by (0 dim)
measures(str, str str_new)
rules iterate(100) until (str_new[0] = previous(str_new[0]))
(str_new[0]=regexp_replace(str_new[0],'(([[:alpha:]])\2{2})(.*?)\1+','\1\3'));

STR STR_NEW
------------------------ ------------------------
AAABBBCCCBBBDDDAAAEEEBBB AAABBBCCCDDDEEE

SQL>

Two words about Model clause here:

  • With regexp_replace you are already familiar. It is applyed to the string during each iteration;
  • iterate(100) means maximum of iterations could be proceeded is 100 (it can be increased if you want);
  • until (str[0] = previous(str[0])) means stop iterations when the string is not changed during the previous iteration.
Well, that's it.

Now let's take the case when there is a delimited string, e.g. with commas. In that case we need to change our regular expression a little bit. Aketi Jyuuzou made this in one of the mentioned links. I just clarify it for readers. So for a string like 'elem1, elem2, elem3, elem1, elem3, elem2, elem2' we'll need the following:
regexp_replace(str,'(^|,)([^,]+,)(.*?,)?\2+','\1\2\3')
I just added non-greedy '*?' and '+' in the end, but you have already read about the impact of these. So you understand what are they needed for.

Backreference \1 stands for (^|,) - this is the symbol before the first met duplicated element. It is either begining of the string (^), either a comma trailing the previous element (,).
Backreference \2 stands for the ([^,]+,) - this is the element itself, which means one or more non-comma symbols followed by a comma.
Backreference \3 stands for (.*?,)? which is the minimum (non-greedy) number of symbols before one or more duplicated value (\2+).

So the final query would look like:

SQL> WITH t AS (SELECT 'elem1,elem2,elem3,elem1,elem3,elem2,elem2' str FROM dual)
--
select str, rtrim(str_new,',') new_str from t
model
dimension by (0 dim)
measures(str, str||',' str_new)
rules iterate(100) until (str_new[0] = previous(str_new[0]))
(str_new[0]=regexp_replace(str_new[0],'(^|,)([^,]+,)(.*?,)?\2+','\1\2\3'));

STR NEW_STR
----------------------------------------- -------------------------
elem1,elem2,elem3,elem1,elem3,elem2,elem2 elem1,elem2,elem3

SQL>


Hope it was useful!

PS
It is my first post, so I'll be grateful if you post your comments and notices about it. Was it too comlicatedely stated or maybe too detailed. Well,waiting for your replies :)