Wednesday, December 19, 2007

Reports: getting total of all children values in a tree

... previous

Well, actually, this note is a continuation of a model series on reports.
So the structure of an article is the same: the problem description -> non-model solution -> model clause solution.

But also this note is very similar to summation of tree values, I've written a little bit earlier.

Input data:

So let's get started.
Our input for this post would be a table t from here and an additional table t_hrchy with a hierarchical structure of departments:
SQL> drop table t_hrchy;

Table dropped

SQL>
SQL> create table t_hrchy as (select 'DEP2' dep, 'DEP1' dep_par from dual union all
2 select 'DEP3' dep, 'DEP2' dep_par from dual union all
3 select 'DEP4' dep, 'DEP3' dep_par from dual)
4 /

Table created

SQL> select * from t_hrchy;

DEP DEP_PAR
---- -------
DEP2 DEP1
DEP3 DEP2
DEP4 DEP3

SQL>
So this table tells us that DEP1 is a parent department for DEP2, DEP2 is a parent department for DEP3 and DEP3 is a parent department for DEP4.

Problem:

Now our task would be to create a report where in one column would be sales of a particular department output and in another column we need to get the sales of a current department and for all the departments that are under the current one in the organization structure.
The data should be consolidated only inside the same year and inside the same city.

Non-Model solution:

When the matter concerns the hierarchical structure the first thing that pops up in our head is hierarchical queries with CONNECT BY clause.

Elaborating this a little bit we can use the following query proposed by Rob van Wijk here or here:
SQL> select to_char(year,'yyyy') year,
2 city,
3 t.dep,
4 th.dep_par,
5 sales,
6 sum(connect_by_root sales) tot_sales
7 from t, t_hrchy th
8 where t.dep = th.dep(+)
9 connect by t.dep = prior dep_par
10 and prior year = year
11 and prior city = city
12 group by city, year, t.dep, dep_par, sales
13 order by 1,2,3;

YEAR CITY DEP DEP_PAR SALES TOT_SALES
---- ------ ---- ------- ---------- ----------
2006 Moscow DEP1 562 2254
2006 Moscow DEP2 DEP1 457 1692
2006 Moscow DEP3 DEP2 890 1235
2006 Moscow DEP4 DEP3 345 345
2006 Omsk DEP1 23 264
2006 Omsk DEP2 DEP1 154 241
2006 Omsk DEP3 DEP2 87 87
2007 Moscow DEP1 762 2926
2007 Moscow DEP2 DEP1 657 2164
2007 Moscow DEP3 DEP2 1020 1507
2007 Moscow DEP4 DEP3 487 487
2007 Omsk DEP1 34 403
2007 Omsk DEP2 DEP1 213 369
2007 Omsk DEP3 DEP2 156 156

14 rows selected

SQL>
Lets see what this query actually does.
First we use an outer join of two tables - so for each row we get a parent department name.

Then we build a hierarchy by using
connect by t.dep = prior dep_par
Mention that hierarchy is built in the backward direction starting from the children and going up to their parents.

Also we add in the connect by clause the following:
and prior year = year
and prior city = city
So we build hierarchy separartely for different years and cities.

By using the operator
connect_by_root sales
in every row we find a value for sales of the root elements for this or that thread in a hierarchy. As we didn't specify any START WITH condition - the hierarchy is built starting from all the departments present in the data.
For example, let's look what would happen inside the partition for 2007 year Moscow city:
SQL> select to_char(year,'yyyy') year,
2 city,
3 t.dep,
4 th.dep_par,
5 sales,
6 connect_by_root sales root_sales,
7 level,
8 sys_connect_by_path(t.dep,'/') hrchy_path
9 from t, t_hrchy th
10 where t.dep = th.dep(+)
11 and trunc(t.year,'y')=to_date('01.01.2007','dd.mm.yyyy')
12 and t.city='Moscow'
13 connect by t.dep = prior dep_par
14 and prior year = year
15 and prior city = city
16 /

YEAR CITY DEP DEP_PAR SALES ROOT_SALES LEVEL HRCHY_PATH
---- ------ ---- ------- ---------- ---------- ---------- ----------------------------
2007 Moscow DEP1 762 762 1 /DEP1
2007 Moscow DEP2 DEP1 657 657 1 /DEP2
2007 Moscow DEP1 762 657 2 /DEP2/DEP1
2007 Moscow DEP3 DEP2 1020 1020 1 /DEP3
2007 Moscow DEP2 DEP1 657 1020 2 /DEP3/DEP2
2007 Moscow DEP1 762 1020 3 /DEP3/DEP2/DEP1
2007 Moscow DEP4 DEP3 487 487 1 /DEP4
2007 Moscow DEP3 DEP2 1020 487 2 /DEP4/DEP3
2007 Moscow DEP2 DEP1 657 487 3 /DEP4/DEP3/DEP2
2007 Moscow DEP1 762 487 4 /DEP4/DEP3/DEP2/DEP1

10 rows selected

SQL>
As you see DEP1 is met four times:
- Firstly it stands for itself (level=1).
- Then we begin to build hierarchy starting from DEP2 and DEP1 is met as a "child" (remember we build a hierarchy in the opposite direction) for DEP2 (level=2)
- The next time we meet it as a "grandchild" for DEP3 (level=3)
- And finally as a "great-grandchild" for DEP4 (level=4).

In each row we have a value of sales for the root department. So if we group data by department and put a sum() aggregate function on connect_by_root(sales) column we get the sales of DEP1 and all the sales of all departments that are under DEP1 in one row.
The same stuff will happen to all the other departments:
SQL> select to_char(year,'yyyy') year,
2 city,
3 t.dep,
4 th.dep_par,
5 sales,
6 sum(connect_by_root sales) tot_salesh
7 from t, t_hrchy th
8 where t.dep = th.dep(+)
9 and trunc(t.year,'y')=to_date('01.01.2007','dd.mm.yyyy')
10 and t.city='Moscow'
11 connect by t.dep = prior dep_par
12 and prior year = year
13 and prior city = city
14 group by year,city,t.dep,th.dep_par,sales
15 order by 1,2,3
16 /

YEAR CITY DEP DEP_PAR SALES TOT_SALESH
---- ------ ---- ------- ---------- ----------
2007 Moscow DEP1 762 2926
2007 Moscow DEP2 DEP1 657 2164
2007 Moscow DEP3 DEP2 1020 1507
2007 Moscow DEP4 DEP3 487 487

SQL>


Model Solution

Now we are proceeding and will solve the same problem using Model clause without any connect by's.

To get less data to be outputed at every step - let's restrict our data to 2007 year and Moscow city only as we did in Non-Model solution:
SQL> select * from t, t_hrchy th
2 where t.dep = th.dep(+)
3 and trunc(t.year,'y')=to_date('01.01.2007','dd.mm.yyyy')
4 and t.city='Moscow'
5 order by 3,1,2
6 /

CITY DEP YEAR SALES DEP DEP_PAR
------ ---- ----------- ---------- ---- -------
Moscow DEP1 31.12.2007 762
Moscow DEP2 31.12.2007 657 DEP2 DEP1
Moscow DEP3 31.12.2007 1020 DEP3 DEP2
Moscow DEP4 31.12.2007 487 DEP4 DEP3

SQL>
Let's think of what woud be placed in PARTITION BY part of our model. Remeber that partitions should be absolutely independent one of another.
Yes - we would put YEAR and CITY there. As a task instructs us to treat sales of different years and cities separately.
In dimension we would place DEP and DEP_PART columns.
And in measures - we'll put SALES, of course. And create one more measure called TOT_SALES with initial value of zero.
SQL> select * from t, t_hrchy th
2 where t.dep = th.dep(+)
3 and trunc(t.year,'y')=to_date('01.01.2007','dd.mm.yyyy')
4 and t.city='Moscow'
5 model
6 partition by (year,city)
7 dimension by (t.dep dep,dep_par)
8 measures(sales, 0 tot_sales)
9 ()
10 order by 1,2,3
11 /

YEAR CITY DEP DEP_PAR SALES TOT_SALES
----------- ------ ---- ------- ---------- ----------
31.12.2007 Moscow DEP1 762 0
31.12.2007 Moscow DEP2 DEP1 657 0
31.12.2007 Moscow DEP3 DEP2 1020 0
31.12.2007 Moscow DEP4 DEP3 487 0

SQL>

Now we are going to write a rule for calculating the tot_sales.
For any cell in our model the tot_sales should be equal to a sum of sales of the current department and all the tot_sales of all departments under the current.
We can write it as:
tot_sales[any,any] = sales[CV(),CV()]+nvl(sum(tot_sales)[any,CV(dep)],0)

BTW it would be correct in case when the tot_sales of the departments under the current one already included all the tot_sales of the departments under them.

So what is important here is the ORDER in what the rules are executed.
If we don't specify anything about the order of execution to our model it uses the default SEQUENTIAL ORDER and can easily throw out an error for us:
SQL> select * from t, t_hrchy th
2 where t.dep = th.dep(+)
3 and trunc(t.year,'y')=to_date('01.01.2007','dd.mm.yyyy')
4 and t.city='Moscow'
5 model
6 partition by (year,city)
7 dimension by (t.dep dep,dep_par)
8 measures(sales, 0 tot_sales)
9 rules
10 (tot_sales[any,any] = sales[CV(),CV()]+nvl(sum(tot_sales)[any,CV(dep)],0))
11 order by 1,2,3
12 /

...

ORA-32637: Self cyclic rule in sequential order MODEL

SQL>

But putting AUTOMATIC ORDER causes all rules to be evaluated in an order based on their logical dependencies.
Let's check:
SQL> select * from t, t_hrchy th
2 where t.dep = th.dep(+)
3 and trunc(t.year,'y')=to_date('01.01.2007','dd.mm.yyyy')
4 and t.city='Moscow'
5 model
6 partition by (year,city)
7 dimension by (t.dep dep,dep_par)
8 measures(sales, 0 tot_sales)
9 rules automatic order
10 (tot_sales[any,any] = sales[CV(),CV()]+nvl(sum(tot_sales)[any,CV(dep)],0))
11 order by 1,2,3
12 /

YEAR CITY DEP DEP_PAR SALES TOT_SALES
----------- ------ ---- ------- ---------- ----------
31.12.2007 Moscow DEP1 762 2926
31.12.2007 Moscow DEP2 DEP1 657 2164
31.12.2007 Moscow DEP3 DEP2 1020 1507
31.12.2007 Moscow DEP4 DEP3 487 487

SQL>

VoilĂ ! This is exactly what we were looking for.

Let's find a detailed description of what's happening when we don't use AUTOMATIC ORDER and what's changing when we begin to use it.
First of all, the equivalent of our single rule would be actually a list of four rules for every row (unique set of dimension values) in our data:
tot_sales['DEP1', null ]=sales['DEP1', null ]+nvl(sum(tot_sales)[any,'DEP1'],0),
tot_sales['DEP2','DEP1']=sales['DEP2','DEP1']+nvl(sum(tot_sales)[any,'DEP2'],0),
tot_sales['DEP3','DEP2']=sales['DEP3','DEP2']+nvl(sum(tot_sales)[any,'DEP3'],0),
tot_sales['DEP4','DEP3']=sales['DEP4','DEP3']+nvl(sum(tot_sales)[any,'DEP4'],0)

But using SEQUENTIAL ORDER doesn't guarantee us any order of these rules execution. Because as it stated in the documetation: SEQUENTIAL ORDER: This causes rules to be evaluated in the order they are written. This is the default.
But we have only one rule in our model. So the detailed rules can be run in any order:
for example first we find tot_sales['DEP1', null ] and then tot_sales['DEP2','DEP1'], or vice versa. And that is the problem, because this two possibilities will give us different results in the output.

What actually ORA-32637 stays for is:
Cause: A self-cyclic rule was detected in the sequential order MODEL. Sequential order MODELs cannot have self cyclic rules to guarantee that the results do not depend on the order of evaluation of the cells that are updated or upserted.
Action: Use ordered rule evaluation for this rule.


So if use order by DEP we shouldn't have such an error, let's try:
SQL> select * from t, t_hrchy th
2 where t.dep = th.dep(+)
3 and trunc(t.year,'y')=to_date('01.01.2007','dd.mm.yyyy')
4 and t.city='Moscow'
5 model
6 partition by (year,city)
7 dimension by (t.dep dep,dep_par)
8 measures(sales, 0 tot_sales)
9 rules
10 (tot_sales[any,any] order by dep = sales[CV(),CV()]+nvl(sum(tot_sales)[any,CV(dep)],0))
11 order by 1,2,3
12 /

YEAR CITY DEP DEP_PAR SALES TOT_SALES
----------- ------ ---- ------- ---------- ----------
31.12.2007 Moscow DEP1 762 762
31.12.2007 Moscow DEP2 DEP1 657 657
31.12.2007 Moscow DEP3 DEP2 1020 1020
31.12.2007 Moscow DEP4 DEP3 487 487

SQL>
Yes, the query returns no error - because we explicitly defined the order of rules execution. But as you see we have wrong results. Because actually in our case we should order by dep DESC! You can try it yourself.
While we don't know what are the dependences in the hierarchy data beforehand - let's give Oracle to decide - what are the dependencies between cells and simply use AUTOMATIC ORDER:
SQL> select * from t, t_hrchy th
2 where t.dep = th.dep(+)
3 model
4 partition by (to_char(year,'yyyy') year,city)
5 dimension by (t.dep dep,dep_par)
6 measures(sales, 0 tot_sales)
7 rules automatic order
8 (tot_sales[any,any] = sales[CV(),CV()]+nvl(sum(tot_sales)[any,CV(dep)],0))
9 order by 1,2,3
10 /

YEAR CITY DEP DEP_PAR SALES TOT_SALES
---- ------ ---- ------- ---------- ----------
2006 Moscow DEP1 562 2254
2006 Moscow DEP2 DEP1 457 1692
2006 Moscow DEP3 DEP2 890 1235
2006 Moscow DEP4 DEP3 345 345
2006 Omsk DEP1 23 264
2006 Omsk DEP2 DEP1 154 241
2006 Omsk DEP3 DEP2 87 87
2007 Moscow DEP1 762 2926
2007 Moscow DEP2 DEP1 657 2164
2007 Moscow DEP3 DEP2 1020 1507
2007 Moscow DEP4 DEP3 487 487
2007 Omsk DEP1 34 403
2007 Omsk DEP2 DEP1 213 369
2007 Omsk DEP3 DEP2 156 156

14 rows selected

SQL>


to be continued...

Sunday, December 16, 2007

Reports: forecasting

... previous

We carry on with our reports (the needed input data).
Now what we need is to make a little forecast. We are requested to find out what the approximate sales for the next year would be.

The growth rate is determined as a ratio of current year sales (2007) to the previous year sales (2006), but it should be calculated separately for different cities.
So to find the growth rate for each row - we need to find total sales for the particular city for the current year and divide it on the total sales for the same city for the previous year.

Non-Model solution
Without using model - we can find the growth rate in a scalar subquery:
select sum(decode(to_char(year, 'yyyy'), '2007', sales)) /
sum(decode(to_char(year, 'yyyy'), '2006', sales))
from t
where city = t1.city
where t1.city should be a reference to the main outer query.
If we put this scalar subquery directly into the query we already had, we get the following result:
SQL> select city,
2 dep,
3 sum(sales) sales_2007,
4 round(sum(sales) *
5 (select sum(decode(to_char(year, 'yyyy'), '2007', sales)) /
6 sum(decode(to_char(year, 'yyyy'), '2006', sales))
7 from t
8 where city = t1.city),
9 2) sales_2008_forecast
10 from t t1
11 where trunc(year, 'y') = to_date('01.01.2007', 'dd.mm.yyyy')
12 group by rollup(city, dep)
13 /

CITY DEP SALES_2007 SALES_2008_FORECAST
------ ---- ---------- -------------------
Omsk DEP1 34 51,9
Omsk DEP2 213 325,15
Omsk DEP3 156 238,14
Omsk 403 615,19
Moscow DEP1 762 989,18
Moscow DEP2 657 852,88
Moscow DEP3 1020 1324,1
Moscow DEP4 487 632,19
Moscow 2926 3798,35
3329

10 rows selected

SQL>
As you see - we don't get a total for year 2008. This happens because city value in this total row is equal to NULL. And while we calculating the growth rate for particular cities - we get no rate at all for this row. So as a result we get NULL.

The workaround here would be first to get forecasting for detailed sales and then group with ROLLUP the sales for 2007 year and also for the predictable 2008 year.

So it would look like the following:
SQL> select city,
2 dep,
3 sum(sales) sales_2007,
4 sum(sales2) sales_2008_forecast,
5 round(100 * (sum(sales2) / sum(sales) - 1), 2) || '%' growth_rate
6 from (select t0.*,
7 round(sales *
8 (select sum(decode(to_char(year, 'yyyy'), '2007', sales)) /
9 sum(decode(to_char(year, 'yyyy'), '2006', sales))
10 from t
11 where city = t0.city),
12 2) sales2
13 from t t0) t1
14 where trunc(year, 'y') = to_date('01.01.2007', 'dd.mm.yyyy')
15 group by rollup(city, dep)
16 /

CITY DEP SALES_2007 SALES_2008_FORECAST GROWTH_RATE
------ ---- ---------- ------------------- ---------------
Omsk DEP1 34 51,9 52.65%
Omsk DEP2 213 325,15 52.65%
Omsk DEP3 156 238,14 52.65%
Omsk 403 615,19 52.65%
Moscow DEP1 762 989,18 29.81%
Moscow DEP2 657 852,88 29.81%
Moscow DEP3 1020 1324,1 29.81%
Moscow DEP4 487 632,19 29.81%
Moscow 2926 3798,35 29.81%
3329 4413,54 32.58%

10 rows selected

SQL>
Now all the totals are where they should be. And we added an extra column with growth rate. So we can see, that sales in Omsk grew up ~53% and in Moscow increased ~30%. The average growth rate from 2006 until 2007 years was ~33%.

Model solution #1
With the model solution instead of scalar subquery - we can use reference model clause.
We put the following query into the reference model:
SQL> select city, to_char(year,'yyyy') y, sum(sales) sum_sales from t group by city,year;

CITY Y SUM_SALES
------ ---- ----------
Moscow 2006 2254
Omsk 2006 264
Moscow 2007 2926
Omsk 2007 403

SQL>
And then we'll use the values from that resultset.
SQL> select city,dep,sales,sales_fc,round(100*(growth-1),2)||'%' growth_rate
2 from t where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
3 model
4 reference r on (select city, to_char(year,'yyyy') y, sum(sales) sum_sales from t group by city,year)
5 dimension by (y, city)
6 measures(sum_sales)
7 main m
8 dimension by (0 total,city, dep)
9 measures(sales, 0 sales_FC, 0 growth)
10 rules upsert all
11 (sales[1,any,null]=sum(sales)[0,CV(),any],
12 sales[1,null,null]=sum(sales)[0,any,any],
13 growth[any,any,any]=r.sum_sales['2007',CV(city)]/r.sum_sales['2006',CV(city)],
14 sales_FC[any,any,any]=round(sales[CV(),CV(),CV()]*growth[CV(),CV(),CV()],2))
15 order by 2,3;

CITY DEP SALES SALES_FC GROWTH_RATE
------ ---- ---------- ---------- -----------------------------------------
Omsk DEP1 34 51,9 52.65%
Moscow DEP1 762 989,18 29.81%
Omsk DEP2 213 325,15 52.65%
Moscow DEP2 657 852,88 29.81%
Omsk DEP3 156 238,14 52.65%
Moscow DEP3 1020 1324,1 29.81%
Moscow DEP4 487 632,19 29.81%
Omsk 403 615,19 52.65%
Moscow 2926 3798,35 29.81%
3329 %

10 rows selected

SQL>

So we put our query into the reference clause, so that we can use it's cell values in the main part.
We created a new measure growth and in the rules part wrote:
growth[any,any,any]=r.sum_sales['2007',CV(city)]/r.sum_sales['2006',CV(city)]
So for each cell we find a growth rate for the current row (current city) by dividing summary sales of the current year for this particular city to the same of the previous year.
The next our step (rule) is finding the predicted sales for the next year:
sales_FC[any,any,any]=round(sales[CV(),CV(),CV()]*growth[CV(),CV(),CV()],2)
And we are facing the same problem we had in the Non-Model part. We can't define any city in the total row - so we can't find a growth rate for this row.
As you already know - we can easily write a rule to find the total sales for the next year as a sum of other cells:
SQL> select city,dep,sales,sales_fc,round(100*(growth-1),2)||'%' growth_rate
2 from t where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
3 model
4 reference r on (select city, to_char(year,'yyyy') y, sum(sales) sum_sales from t group by city,year)
5 dimension by (y, city)
6 measures(sum_sales)
7 main m
8 dimension by (0 total,city, dep)
9 measures(sales, 0 sales_FC, 0 growth)
10 rules upsert all
11 (sales[1,any,null]=sum(sales)[0,CV(),any],
12 sales[1,null,null]=sum(sales)[0,any,any],
13 sales_FC[any,city is not null,any]=round(sales[CV(),CV(),CV()]*r.sum_sales['2007',CV(city)]/r.sum_sales['2006',CV(city)],2),
14 sales_FC[1,null,null]=sum(sales_FC)[1,any,null],
15 growth[any,any,any]=sales_FC[CV(),CV(),CV()]/sales[CV(),CV(),CV()])
16 order by 1,2;

CITY DEP SALES SALES_FC GROWTH_RATE
------ ---- ---------- ---------- ------------------
Moscow DEP1 762 989,18 29.81%
Moscow DEP2 657 852,88 29.81%
Moscow DEP3 1020 1324,1 29.81%
Moscow DEP4 487 632,19 29.81%
Moscow 2926 3798,35 29.81%
Omsk DEP1 34 51,9 52.65%
Omsk DEP2 213 325,15 52.65%
Omsk DEP3 156 238,14 52.65%
Omsk 403 615,19 52.65%
3329 4413,54 32.58%

10 rows selected

SQL>
You see that we changed the order of the rules and the rules themselves (that is because we want to find the growth rate for the overall total row, and hence we would find growth rates AFTER we get overall total sales value).
First we find the forecasting sales for the next year for all the rows excluding the very total:
sales_FC[any,city is not null,any]=
round(sales[CV(),CV(),CV()]*r.sum_sales['2007',CV(city)]/r.sum_sales['2006',CV(city)],2)
Then we find this total by summing up all the city subtotals:
sales_FC[1,null,null]=sum(sales_FC)[1,any,null]
And finally we get the growth rate by dividing sales_FC (forecast) on the actual sales (current year sales):
growth[any,any,any]=sales_FC[CV(),CV(),CV()]/sales[CV(),CV(),CV()]

After all this is the same result we had in the Non-Model solution.

Model solution #2
Well, in both solutions (Non-Model & Model #1) we accessed the table twice. First to get the main result and second one - to get the growth rate.

Now we are going to access table only once by using extra dimension in Model and refusing from the reference model clause.
SQL> select city,dep,sales,sales_fc,round(100*(growth-1),2)||'%' growth_rate from t
2 model
3 return updated rows
4 dimension by (0 total,city, dep, to_char(year,'yyyy') year)
5 measures(sales, sales sales_FC,0 growth)
6 rules upsert all
7 (sales[1,any,null,'2007']=sum(sales)[0,CV(),any,CV()],
8 sales[1,null,null,'2007']=sum(sales)[0,any,any,CV()],
9 sales_FC[any,any,any,'2007']=round(sales[CV(),CV(),CV(),CV()]*
10 sales[1,CV(),null,CV()]/sum(sales)[0,CV(),any,'2006'],2),
11 sales_FC[1,null,null,'2007']=sum(sales_FC)[1,any,null,'2007'],
12 growth[any,any,any,'2007']=sales_FC[CV(),CV(),CV(),CV()]/sales[CV(),CV(),CV(),CV()])
13 order by 1,2
14 /

CITY DEP SALES SALES_FC GROWTH_RATE
------ ---- ---------- ---------- -----------------------------------------
Moscow DEP1 762 989,18 29.81%
Moscow DEP2 657 852,88 29.81%
Moscow DEP3 1020 1324,1 29.81%
Moscow DEP4 487 632,19 29.81%
Moscow 2926 3798,35 29.81%
Omsk DEP1 34 51,9 52.65%
Omsk DEP2 213 325,15 52.65%
Omsk DEP3 156 238,14 52.65%
Omsk 403 615,19 52.65%
3329 4413,54 32.58%

10 rows selected

SQL>
What you should mention first is that we removed
where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
So now all the data (for all years: 2006 and 2007) is pumped into model clause.

And you probably didn't miss that we put RETURN UPDATED ROWS, so only rows where cells were updated will be returned in the resultset.
That's how we can avoid additional outer WHERE clause for our MODEL: we update only rows that are corresppondent to 2007 year, but not the 2006.

Another thing changed is we added a dimension
to_char(year,'yyyy') year
So now we can reference to sales of a particular year by putting it in the format like [...,'YYYY'].

And the final amendment - would be adding a value for a new dimension in every cell reference ([<total_dim>,<city_dim>,<dep_dim>,'2007']). If we put it on the left side of a formula we can put just CV() on the right side in the place for <year_dim>.

to be continued ...

Tuesday, December 11, 2007

Reports: ratios & percentages

... previous

Let's continue with our spreadsheet report (input data you can find here).

Imagine that we want to add a column that would represent sales percentage of each department in the total of it's city sales. So we need to know how much every department contributes to it's region sales.

Non-model solution:

Developing the group by solution with rollup - we could use RATIO_TO_REPORT function for that purpose.

Our query would look like:
SQL> select city, dep, sum(sales),
2 decode(grouping_id(city, dep),
3 0,
4 round(2 * 100 * ratio_to_report(sum(sales))
5 over(partition by city),
6 2) || '%') perc
7 from t
8 where trunc(year, 'y') = to_date('01.01.2007', 'dd.mm.yyyy')
9 group by rollup(city, dep)
10 /

CITY DEP SUM(SALES) PERC
------ ---- ---------- ---------
Moscow DEP1 762 26.04%
Moscow DEP2 657 22.45%
Moscow DEP3 1020 34.86%
Moscow DEP4 487 16.64%
Moscow 2926
Omsk DEP1 34 8.44%
Omsk DEP2 213 52.85%
Omsk DEP3 156 38.71%
Omsk 403
3329

10 rows selected

SQL>
Let's take a closer look.
First of all we use GROUPING_ID function - to understand when the data is detailed (comparing it to 0). When it is detailed - we return the ratio of the value in the current row to the total sales of the region. Otherwise we return NULL.
But we need to remember that analytic functions are executed after the grouping has been done. So when our ratio_to_report would be applied - we will have doubled sales for each city: on the one hand - the sales of each department, and on the other hand, the sum of all departments sales (as a city subtotal) will also influence our ratio_to_report.
That's why we multiply the function by 2, so we get correct results.
With multiplying by 100 we get percentage values.

Model solution
For getting the same result with Model clause - first of all we need to add a new measure:
cast(null as varchar2(7)) PERC
So it would be our percentage column in the result set:
SQL> select * from t where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
2 model
3 dimension by (0 total,city, dep)
4 measures(sales, cast(null as varchar2(7)) perc)
5 rules upsert all
6 (sales[1,any,null]=sum(sales)[0,CV(),any],
7 sales[1,null,null]=sum(sales)[0,any,any])
8 order by 2,3
9 /

TOTAL CITY DEP SALES PERC
---------- ------ ---- ---------- -------
0 Moscow DEP1 762
0 Moscow DEP2 657
0 Moscow DEP3 1020
0 Moscow DEP4 487
1 Moscow 2926
0 Omsk DEP1 34
0 Omsk DEP2 213
0 Omsk DEP3 156
1 Omsk 403
1 3329

10 rows selected

SQL>
Now we need to create a rule for our column.
So first of all we need to calculate values for all the cells with detail data. On the language of our model it would be perc[0,any,any]. By using a "Total" dimension with value=0 we will get only cells with detailed data. City and Department could be any. That is what we put on the left side of our formula.

On the right side we need to find for each detailed row a ratio of the current row sales - to it's region sales total.
So we would write sales[0,CV(),CV()]/sales[1,CV(),null]
In the denominator we just put a reference to the cell with a region total. While it has been already calculated - we don't need to sum up all the sales of a particular city again.
SQL> select * from t where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
2 model
3 dimension by (0 total,city, dep)
4 measures(sales, cast(null as varchar2(7)) perc)
5 rules upsert all
6 (sales[1,any,null]=sum(sales)[0,CV(),any],
7 sales[1,null,null]=sum(sales)[0,any,any],
8 perc[0,any,any]=round(sales[0,CV(),CV()]/sales[1,CV(),null]*100,2)||'%')
9 order by 2,3
10 /

TOTAL CITY DEP SALES PERC
---------- ------ ---- ---------- -------
0 Moscow DEP1 762 26.04%
0 Moscow DEP2 657 22.45%
0 Moscow DEP3 1020 34.86%
0 Moscow DEP4 487 16.64%
0 Moscow %
1 Moscow 2926
0 Omsk DEP1 34 8.44%
0 Omsk DEP2 213 52.85%
0 Omsk DEP3 156 38.71%
0 Omsk %
1 Omsk 403
1 3329
0 %

13 rows selected

SQL>
Ooops. As you can see there are three redundant rows with only a '%' in our new column.
Why did they appear? Because we placed UPSERT ALL.
If we don't want a particular rule to be under a general semantic, which is mentioned after the RULES word, we can put a needed behaviour name directly before the rule. In our case it is UPDATE (no new rows generation -> only update, for more information look Reports: totals & subtotals):
SQL> select * from t where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
2 model
3 dimension by (0 total,city, dep)
4 measures(sales, cast(null as varchar2(7)) perc)
5 rules upsert all
6 (sales[1,any,null]=sum(sales)[0,CV(),any],
7 sales[1,null,null]=sum(sales)[0,any,any],
8 update perc[0,any,any]=round(sales[0,CV(),CV()]/sales[1,CV(),null]*100,2)||'%')
9 order by 2,3
10 /

TOTAL CITY DEP SALES PERC
---------- ------ ---- ---------- -------
0 Moscow DEP1 762 26.04%
0 Moscow DEP2 657 22.45%
0 Moscow DEP3 1020 34.86%
0 Moscow DEP4 487 16.64%
1 Moscow 2926
0 Omsk DEP1 34 8.44%
0 Omsk DEP2 213 52.85%
0 Omsk DEP3 156 38.71%
1 Omsk 403
1 3329

10 rows selected

SQL>

This is actually what we were looking for!

to be continued ...

Sunday, December 9, 2007

Reports: totals & subtotals

Starting from the current post I decided to write several notes on queries with Model clause. Mostly they would be dedicated to the topics of producing diverse reports.
In each case I'll give an opportunity how it can be done without using the model clause. And then - how the model clause could be used. So let's begin.
Input data:
SQL> drop table t;

Table dropped

SQL>
SQL> create table t as (select 'Omsk' city, 'DEP1' dep, to_date('31.12.2006','dd.mm.yyyy') year, 23 sales from dual union all
2 select 'Omsk' city, 'DEP2' dep, to_date('31.12.2006','dd.mm.yyyy') year, 154 sales from dual union all
3 select 'Omsk' city, 'DEP3' dep, to_date('31.12.2006','dd.mm.yyyy') year, 87 sales from dual union all
4 select 'Moscow' city, 'DEP1' dep, to_date('31.12.2006','dd.mm.yyyy') year, 562 sales from dual union all
5 select 'Moscow' city, 'DEP2' dep, to_date('31.12.2006','dd.mm.yyyy') year, 457 sales from dual union all
6 select 'Moscow' city, 'DEP3' dep, to_date('31.12.2006','dd.mm.yyyy') year, 890 sales from dual union all
7 select 'Moscow' city, 'DEP4' dep, to_date('31.12.2006','dd.mm.yyyy') year, 345 sales from dual union all
8 select 'Omsk' city, 'DEP1' dep, to_date('31.12.2007','dd.mm.yyyy') year, 34 sales from dual union all
9 select 'Omsk' city, 'DEP2' dep, to_date('31.12.2007','dd.mm.yyyy') year, 213 sales from dual union all
10 select 'Omsk' city, 'DEP3' dep, to_date('31.12.2007','dd.mm.yyyy') year, 156 sales from dual union all
11 select 'Moscow' city, 'DEP1' dep, to_date('31.12.2007','dd.mm.yyyy') year, 762 sales from dual union all
12 select 'Moscow' city, 'DEP2' dep, to_date('31.12.2007','dd.mm.yyyy') year, 657 sales from dual union all
13 select 'Moscow' city, 'DEP3' dep, to_date('31.12.2007','dd.mm.yyyy') year, 1020 sales from dual union all
14 select 'Moscow' city, 'DEP4' dep, to_date('31.12.2007','dd.mm.yyyy') year, 487 sales from dual)
15 /

Table created

SQL> commit;

Commit complete

SQL>
SQL>select * from t
2 order by 3,1,2
3 /

CITY DEP YEAR SALES
------ ---- ----------- ----------
Moscow DEP1 31.12.2006 562
Moscow DEP2 31.12.2006 457
Moscow DEP3 31.12.2006 890
Moscow DEP4 31.12.2006 345
Omsk DEP1 31.12.2006 23
Omsk DEP2 31.12.2006 154
Omsk DEP3 31.12.2006 87
Moscow DEP1 31.12.2007 762
Moscow DEP2 31.12.2007 657
Moscow DEP3 31.12.2007 1020
Moscow DEP4 31.12.2007 487
Omsk DEP1 31.12.2007 34
Omsk DEP2 31.12.2007 213
Omsk DEP3 31.12.2007 156

14 rows selected

SQL>
This is a table of sales for a company in Russia, which has it's departments in two cities: Omsk (Dep 1,2,3) and in Moscow (Dep 1,2,3,4). The data about sales is given for two years: 2006 and 2007.

Now we want to get sales for the year 2007. But we need not only detailed sales (as in a table), but also subtotals for each city. And a total for the whole company.

Non-Model Solution:
Usually such reports are created by using extensions to Group By: Cube, Rollup & Grouping Sets expression.

So our query would look like:
SQL> select city, dep, sum(sales) sales
2 from t
3 where trunc(year, 'y') = to_date('01.01.2007', 'dd.mm.yyyy')
4 group by rollup(city, dep)
5 order by 1, 2
6 /

CITY DEP SALES
------ ---- ----------
Moscow DEP1 762
Moscow DEP2 657
Moscow DEP3 1020
Moscow DEP4 487
Moscow 2926
Omsk DEP1 34
Omsk DEP2 213
Omsk DEP3 156
Omsk 403
3329

10 rows selected

SQL>

Model solution:
To create the same report using model clause, we have to execute the following query:
SQL> select city,dep,sales
2 from t
3 where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
4 model
5 dimension by (0 total,city, dep)
6 measures(sales)
7 rules upsert all
8 (sales[1,any,null]=sum(sales)[0,CV(),any],
9 sales[1,null,null]=sum(sales)[0,any,any])
10 order by 1,2
11 /

CITY DEP SALES
------ ---- ----------
Moscow DEP1 762
Moscow DEP2 657
Moscow DEP3 1020
Moscow DEP4 487
Moscow 2926
Omsk DEP1 34
Omsk DEP2 213
Omsk DEP3 156
Omsk 403
3329

10 rows selected

SQL>
Let's elaborate how it works.

Using the condition in the WHERE clause - we get only data for the year 2007 in the result.
Then we create dimensions: City and Dep - are the dimensions from the table. But we add one more dimension called Total and by default assign value 0 to all the cells for that dimension.
As a measure we take the field from our table Sales.
So at this stage we get as a result:
SQL> select *
2 from t
3 where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
4 model
5 dimension by (0 total,city, dep)
6 measures(sales)
7 ()
8 order by 2,3
9 /

TOTAL CITY DEP SALES
---------- ------ ---- ----------
0 Moscow DEP1 762
0 Moscow DEP2 657
0 Moscow DEP3 1020
0 Moscow DEP4 487
0 Omsk DEP1 34
0 Omsk DEP2 213
0 Omsk DEP3 156

7 rows selected

SQL>
Now we need to create extra lines for the subtotals and total.
We use rules for this.
You should remember that there are two ways to refer to a cell in a rule of a model clause: positional or symbolic.

Example of positional: sales['Omsk','DEP3'] or sales['Omsk','DEP4'].
When a cell which is positionally referenced exists - it's value is updated. When it doesn't exist - the model clause will create a cell with such dimensions.

Example of symbolic reference:
sales[city='Omsk',dep='DEP3'], sales[city is any,dep='DEP3'] ,sales[any,dep='DEP3'].
Actually the last one [any] is a positional reference - but I put it in the symbolic group because there's a remark in the documentation:
Note that ANY is treated as a symbolic reference even if it is specified positionally, because it really means that (dimension IS NOT NULL OR dimension IS NULL).
A typical behaviour for symbolically referenced cells: when it exists - the value gets updated. When it doesn't exist - no cells are created.

Don't forget about the cells which are referenced symbolically and positionally at the same time. By default they behave like all the references are symbolic.

Well, actually we could use 3 types of behaviour: UPDATE, UPSERT and UPSERT ALL.
For explicit determining we should place the name of behaviour directly after the RULES keyword. In this case it would be adjusted to all the rules.
If we want to specify a particular rule - we can put the name of behaviour right before the rule (we'll see how it works in the next post).

If we don't put any - the default behaviour UPSERT is used. How it works - I mentioned earlier.

When we use UPDATE - all cells are got updated, and if they don't exist - no cells are created, whichever reference (positional or symbolic) is used.

UPSERT ALL is a scpecial case - it can create cells even for symbolic reference.
But not in all cases, for example - it wouldn't create any cells if we put a rule:
sales[city='Voronezh','DEP5']=100 (because there's no city called 'Voronezh' in our table), but at the same time it would create a cell (which doesn't exist) for the rule sales[city='Omsk','DEP5']=100.
Why?
Because UPSERT ALL works in the following way (from the doc):

Step 1 Find Cells
Find the existing cells that satisfy all the symbolic predicates of the cell reference.

Step 2 Find Distinct Dimension Values
Using just the dimensions that have symbolic references, find the distinct dimension value combinations of these cells.

Step 3 Perform a Cross Product
Perform a cross product of these value combinations with the dimension values specified through positional references.

Step 4 Upsert New Cells

The results of Step 3 are then used to upsert new cells into the array

Just look:
SQL> select *
2 from t
3 where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
4 model
5 return updated rows
6 dimension by (0 total,city, dep)
7 measures(sales)
8 rules upsert all
9 (sales[0,city='Omsk','DEP5']=100,
10 sales[0,city='Voronezh','DEP5']=200)
11 order by 2,3
12 /

TOTAL CITY DEP SALES
---------- ------ ---- ----------
0 Omsk DEP5 100

SQL>
By putting RETURN UPDATED ROWS we get as a result only updated or inserted cells.
So as you can see despite we had a rule with Voronezh - no cells were inserted.

Let's get back to our subtotals and total.
We had two rules:
sales[1,any,null]=sum(sales)[0,CV(),any],
sales[1,null,null]=sum(sales)[0,any,any]

But as you remember [any] is treated as a symbolic reference - and won't create any cells if we don't put UPSERT ALL semantics.
SQL> select city,dep,sales
2 from t
3 where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
4 model return updated rows
5 dimension by (0 total,city, dep)
6 measures(sales)
7 rules
8 (sales[1,any,null]=sum(sales)[0,CV(),any])
9 order by 1,2
10 /

CITY DEP SALES
------ ---- ----------

SQL>
If we put UPSERT ALL semantics, our rules would do the following:

First rule: sales[1,any,null] will look for all dimension values that are referenced symbolically - city dimension: Omsk and Moscow.
Then create a cross product with positionally referenced dimensions: we'll get two cells - sales[1,'Omsk',null] and sales[1,'Moscow',null]
And these cells would be inserted.
BTW the dimension total=1 would mean that these are totals or subtotals.

Next rule sales[1,null,null] - will create a company total cell.
On the right side of the rules we have grouping formulas which will sum up all the sales values needed. By using CV() we reference to the dimension value of the cell on the left side of the rule.

PS
We could get the same result without Total dimension:
SQL> select city,dep,sales
2 from t
3 where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
4 model
5 dimension by (city, dep)
6 measures(sales)
7 rules upsert all
8 (sales[any,null]=sum(sales)[CV(),any],
9 sales[null,null]=sum(sales)[any,null])
10 order by 1,2
11 /

CITY DEP SALES
------ ---- ----------
Moscow DEP1 762
Moscow DEP2 657
Moscow DEP3 1020
Moscow DEP4 487
Moscow 2926
Omsk DEP1 34
Omsk DEP2 213
Omsk DEP3 156
Omsk 403
3329

10 rows selected

SQL>

Here the rules are a little bit different.
The company Total is calculated from the cities subtotals, that has already been calculated, while in the main part (with Total dimension) we calculated all totals and subtotals on detailed sales data.

to be continued ...

Thursday, December 6, 2007

Summation of tree values

Input data:
SQL> drop table t;

Table dropped

SQL>
SQL> create table t as (select 1 id, 20 val from dual union all
2 select 2 id, 30 val from dual union all
3 select 3 id, 15 val from dual union all
4 select 4 id, 100 val from dual)
5 /

Table created

SQL> commit;

Commit complete

SQL> select * from t;

ID VAL
---------- ----------
1 20
2 30
3 15
4 100

SQL>
Problem:
You want to build a hierarchy on that data and get the sum of the values of all preceding ancestors including current value.

E.g. your hierarchy is:
SQL> select t1.*, level, sys_connect_by_path(id,'/') path
2 from t t1
3 connect by prior id = id - 1
4 start with id = 1
5 /

ID VAL LEVEL PATH
---------- ---------- ---------- --------------
1 20 1 /1
2 30 2 /1/2
3 15 3 /1/2/3
4 100 4 /1/2/3/4

SQL>
Solution here could probably be a scalar subquery, which will build tree in a backward direction starting from the current id and will compute sum of all the childs (ancestors in the outer query) values:
SQL> select t1.*,
2 level,
3 sys_connect_by_path(id,'/') path,
4 (select sum(val)
5 from t t2
6 connect by prior id = id + 1
7 start with t2.id = t1.id) summ
8 from t t1
9 connect by prior id = id - 1
10 start with id = 1
11 /

ID VAL LEVEL PATH SUMM
---------- ---------- ---------- ------------- ----------
1 20 1 /1 20
2 30 2 /1/2 50
3 15 3 /1/2/3 65
4 100 4 /1/2/3/4 165

SQL>
So while we have 'prior id = id - 1' rule in the outer query we put 'prior id = id + 1' in a scalar subquery to move backward.

But if the query was built in the following way:
SQL> select t1.*,
2 level,
3 sys_connect_by_path(id, '/') path
4 from t t1
5 connect by prior id < id
6 start with id = 1
7 /

ID VAL LEVEL PATH
---------- ---------- ---------- -----------------
1 20 1 /1
2 30 2 /1/2
3 15 3 /1/2/3
4 100 4 /1/2/3/4
4 100 3 /1/2/4
3 15 2 /1/3
4 100 3 /1/3/4
4 100 2 /1/4

8 rows selected

SQL>
What should we put in the connect by part of a scalar subquery?
If we reverse condition to 'prior id > id' in the subquery - we get wrong result. And it is not needed to be clarified why.
SQL> select t1.*,
2 level,
3 sys_connect_by_path(id,'/') path,
4 (select sum(val)
5 from t t2
6 connect by prior id > id
7 start with t2.id = t1.id) summ
8 from t t1
9 connect by prior id < id
10 start with id = 1
11 /

ID VAL LEVEL PATH SUMM
---------- ---------- ---------- ------------ ----------
1 20 1 /1 20
2 30 2 /1/2 50
3 15 3 /1/2/3 85
4 100 4 /1/2/3/4 255
4 100 3 /1/2/4 255
3 15 2 /1/3 85
4 100 3 /1/3/4 255
4 100 2 /1/4 255

8 rows selected

SQL>
We need to get a backward motion in a particular order. And what can show us the correct order - right, SYS_CONNECT_BY_PATH.

The solution could be:
SQL> select t1.*,
2 (select sum(val)
3 from t t2
4 connect by level <= t1.lvl
5 and id = regexp_substr(t1.path,'[^/]+',1,t1.lvl - level + 1)
6 start with t2.id = t1.id) summ
7 from (select t.*, level lvl, sys_connect_by_path(id, '/') path
8 from t
9 connect by prior id < id
10 start with id = 1) t1
11 /

ID VAL LVL PATH SUMM
---------- ---------- ---------- ------------ ----------
1 20 1 /1 20
2 30 2 /1/2 50
3 15 3 /1/2/3 65
4 100 4 /1/2/3/4 165
4 100 3 /1/2/4 150
3 15 2 /1/3 35
4 100 3 /1/3/4 135
4 100 2 /1/4 120

8 rows selected

SQL>
So at every level - we cut out and determine what the ID should be.

BTW Put these conditions exactly in that order:
...level <= t1.lvl and id = regexp_substr(t1.path,'[^/]+',1,t1.lvl - level + 1)...
if you put it
...id = regexp_substr(t1.path,'[^/]+',1,t1.lvl - level + 1)and level <= t1.lvl...
you'll get:
ORA-01428: argument '0' is out of range


PS Later I wrote another note on the same kinda a problem, which you can discover here: Reports: getting total of all children values in a tree

Sunday, December 2, 2007

Regexp 4: number of occurence (advanced)

... previous

In this post I'll talk about a technique - that I also learned from cd.
In the previous post - we got familiar with a technique to find the number of occurrencies of a substring in a string.

But there are some drawbacks, e.g.:
SQL> with t as (select '(111)111-11-11 words-non-stop(222)222-22-22' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.',
6 '\1')),
7 0) occurrencies
8 from t
9 /

STR OCCURRENCIES
------------------------------------------- ------------
(111)111-11-11 words-non-stop(222)222-22-22 2

SQL>

As we can see here - we find both examples of phone numbers as fitting to our requirments. Actually, yes - there are two occurrencies of our pattern. But usually when it is combined with another alphanumeric value - we don't want to take it into account. So the result should be only 1 occurrence.

Well, let's assign the task more concrete: our phone numbers should be preceded or trailed by one of the following: space, comma or semicolon. And it also can be the first or the last structure in a string.

So we may think, that if we add such symbols before and after our pattern - it can help us:
SQL> with t as (select '(111)111-11-11 words-non-stop(222)222-22-22' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '(^|[ ,;])\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}([ ,;]|$)|.',
6 '\2')),
7 0) occurrencies
8 from t
9 /

STR OCCURRENCIES
------------------------------------------- ------------
(111)111-11-11 words-non-stop(222)222-22-22 1

SQL>

Yep, in that case it was helpful.

But if we take an example, when two phone numbers are coming one after another - separated by one space - it is not working:
SQL> with t as (select '(111)111-11-11 (222)222-22-22' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '(^|[ ,;])\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}([ ,;]|$)|.',
6 '\2')),
7 0) occurrencies
8 from t
9 /

STR OCCURRENCIES
----------------------------- ------------
(111)111-11-11 (222)222-22-22 1

SQL>

Why? because it works in the following way:
1. Finds the first pattern '(111)111-11-11 '.
As you see - there's a space at the end.
2. Replaces it with backreference - which in this case would be '1'.
3. Then goes to the rest of the string which is '(222)222-22-22' and it is not the beginning of the string.
So such a structure doesn't suit to our pattern - because it requires - space, comma or colon at the beginning. And we got no space - cause it was taken by the previous structure '(111)111-11-11 '.
4. The function replaces this structure out.
5. We get 1 occurrence as a result, instead of 2.

What we can do here:
The technique is very interesting:
Firstly we place all the patterns - which don't fulfil our requirments. We separate them with the pipe (|), which means OR on the language of regular expressions.
Then we place our desired pattern. And at the end we place '|.' as usual :)

So the generalised pattern would look in the following way:

regexp_replace(your_string,'[symbol1](our_pattern)+|(our_pattern)[symbol2]+|(our_pattern)|.','\3')

where symbol1 - symbol or list of symbols, that can't precede our pattern,
symbol2 - symbol or list of symbols, that can't trail our pattern.

And when we determine the backreference - we would place a reference to the third our pattern.
So when the regexp engine will meet the '[symbol1](our_pattern)+' or '(our_pattern)[symbol2]+', the regexp will remove such structures (actually, it will replace it with our backreference, which points to the 3rd structure - so it would be null). And when it will meet the needed pattern (third one) - it would leave it for us.
The |. at the end - will remove everything - that doesn't suits the previous three structures.

If we get back to our example, we get:
SQL> with t as (select '(111)111-11-11 (222)222-22-22' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})+|\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]+|\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.',
6 '\2')),
7 0) occurrencies
8 from t
9 /

STR OCCURRENCIES
----------------------------- ------------
(111)111-11-11 (222)222-22-22 2

SQL>

Ok. Let's investigate why we put additional '+' two times.
The first one was when we put it in the first OR structure.
Let's remove it and look at the following example:
SQL> with t as (select '(111)111-11-11 (222)222-22-22 3(333)333-33-33(444)444-44-44' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})|\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]+|\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.',
6 '\2')),
7 0) occurrencies_INCORRECT,
8 nvl(length(regexp_replace(str,
9 '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})+|\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]+|\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.',
10 '\2')),
11 0) occurrencies_CORRECT
12 from t
13 /

STR OCCURRENCIES_INCORRECT OCCURRENCIES_CORRECT
----------------------------------------------------------- ---------------------- --------------------
(111)111-11-11 (222)222-22-22 3(333)333-33-33(444)444-44-44 3 2

SQL>

Actually, when we removed '+' (occurrencies_incorrect field) we got 3 as a result insted of 2 - why?
Well, the problem as how you guessed in the '3(333)333-33-33(444)444-44-44' part.
If we don't put '+' in the '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})' - it would meet 3(333)333-33-33 and stop. After this structure will be removed - the regexp is going forward.
It finds (444)444-44-44.
Well it doesn't suit to the '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})' - because there's no symbol [^ ,;] before it (we removed 3(333)333-33-33).
Then it checks whether it suits to '\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]+'. Again negative - cause there is no '[^ ,;]+' at the end. Finally it suits to our third structure '\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}' - and that's why it hits the final result set.
If we put '+' after the first pattern - it would mean one or more occurrences of our pattern. So the whole 3(333)333-33-33(444)444-44-44 will fall under the first structure - and gets removed.

Let's continue with the second '+'.
Imagine the following situation:
SQL> with t as (select '(111)111-11-11 (222)222-22-22 (333)333-33-33A(444)444-44-44' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})+|\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]|\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.',
6 '\2')),
7 0) occurrencies_INCORRECT,
8 nvl(length(regexp_replace(str,
9 '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})+|\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]+|\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.',
10 '\2')),
11 0) occurrencies_CORRECT
12 from t
13 /

STR OCCURRENCIES_INCORRECT OCCURRENCIES_CORRECT
----------------------------------------------------------- ---------------------- --------------------
(111)111-11-11 (222)222-22-22 (333)333-33-33A(444)444-44-44 3 2

SQL>

So the problem again in the last two phone numbers. Now we put an extra 'A' between them. What causes the problem in that case:
If we don't place '+' after the '[^ ,;]' in the second structure,
the pattern '\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]' will meet (333)333-33-33A, with extra 'A' at the end - and removes it all.
So when it continues with the rest of the line it recognizes (444)444-44-44 as a valid phone number.
But when we add a '+' after '[^ ,;]' the pattern '\(\d{3}\)\d{3}-\d{2}-\d{2}[^ ,;]+' will cover the whole (333)333-33-33A(444)444-44-44, and remove it.

PS
Our mask in the regexp can be simplified a little.
In general it would look like:
regexp_replace(your_string,'[symbol1](our_pattern)+|(our_pattern)[symbol3]|.','\2')

Here everything means the same as in the previous examples - except symbol3. Now it means any symbol that can trail our pattern.
And in that case we should place the second structure as a backreference (instead of third in the previous examples).
SQL> with t as (select '(111)111-11-11 (222)222-22-22 3(333)333-33-33 (444)444-44-44' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '[^ ,;](\(\d{3}\)\d{3}-\d{2}-\d{2})+|\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}([ ,;]|$)|.',
6 '\2')),
7 0) occurrencies
8 from t
9 /

STR OCCURRENCIES
------------------------------------------------------------ ------------
(111)111-11-11 (222)222-22-22 3(333)333-33-33 (444)444-44-44 3

SQL>


Hope you can find out how it works by yourself now :)

Thursday, November 29, 2007

Regexp 3: number of pattern occurence

... previous

In that post I'll talk about a problem, which is very popular, according to the number of posts in forums with similar questions.

Usually it is formulated as: "How to count the number of occurences of a substring with special pattern inside the string".
Actually, everything we were talking about in two previous posts - can be very useful in solving such a task.

Let our input data be the same as in the previous post:
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select t.* from t
4 /

STR
-------------------------------------------------------------------
three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33

SQL>
We have a string with phone numbers in it.
We want to find how many numbers in the format (XXX)XXX-XX-XX are there.

So what we can achieve at the moment? We can pull out all the substrings with that particular pattern in one column:
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select t.*,
4 regexp_replace(str, '\(\d{3}\)(\d{3}-\d{2}-\d{2}( |$))|.', '\1') phone_numbers
5 from t
6 /

STR PHONE_NUMBERS
------------------------------------------------------------------- -------------------------
three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33 111-11-11 333-33-33

SQL>
What would be the technique to find the number of substrings? The answer would be "very easy".
First instead of finding the substring according to the mask - we'll find only the first symbols of such substrings.
Then using the function LENGTH - we'll find the length of such a string - and it would be equal to the number of occurences - cause every symbol - would mean one occurenece.
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select regexp_replace(str, '\(\d{3}\)(\d{3}-\d{2}-\d{2}( |$))|.', '\1') phone_numbers,
4 regexp_replace(str, '\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.', '\1') first_letters,
5 nvl(length(regexp_replace(str, '\(\d{3}\)(\d)\d{2}-\d{2}-\d{2}|.', '\1')), 0) occurrencies
6 from t
7 /

PHONE_NUMBERS FIRST_LETTERS OCCURRENCIES
----------------------- ------------- ------------
111-11-11 333-33-33 13 2

SQL>
What we have changed:
instead of \d{3} standing for the first three digits of our phone numbers, we've written \d\d{2}. That is also three digits - but with the first one separated.
Next step would be to place not the whole pattern in parenethes, but only the first digit: (\d)\d{2}
That's why when we get the whole pattern replaced with the '\1' backreference - we get only the first digits: one from each occurenece of the pattern.
After that we put LENGTH function and for the case when there no occurences - we put NVL(...,0).
In our string we have 2 occurences of a pattern like XXX-XX-XX with preceding (XXX).
And that's right.

to be continued...

Wednesday, November 28, 2007

Regexp 2: getting all the substrings with the particular preceding or trailing patterns

... previous

Well, let's continue.

In the previous post we learnt how to retrieve all the substrings of special pattern.
In this post we'll improve our skills.

Imagine that you have such data (again phone numbers):
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select * from t
4 /

STR
-------------------------------------------------------------------
three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33

SQL>

and you want to retrieve phone numbers of the following format: XXX-XX-XX, where X stands for any digit.
But you want only those numbers which are preceded by (XXX) pattern, where X - again any number.
So the number 222-22-22, despite the fact that it has XXX-XX-XX mask - doesn't meet our requirments. Because it has [222] standing before, but not (222).

If we used regexp_susbtr - we would first retrieve phone number in (XXX)XXX-XX-XX format and then cut the (XXX) pattern possibly with another regexp_substr or regexp_replace, e.g.:
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select t.*,
4 regexp_substr(str, '\(\d{3}\)\d{3}-\d{2}-\d{2}') step1,
5 regexp_replace(regexp_substr(str,
6 '\(\d{3}\)\d{3}-\d{2}-\d{2}'),
7 '\(\d{3}\)') step2
8 from t
9 /

STR STEP1 STEP2
------------------------------------------------------------------- --------------------- ---------------
three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33 (111)111-11-11 111-11-11

SQL>

And again regexp_substr is only for one occurenece at a time.

Using the trick - we could use the following construction:
regexp_replace(your_string,'preceding_pattern(substring_mask)trailing_pattern|.','\1')

this will retrieve all the substrings according to the mask 'substring_mask', but with the condition that it is preceded and trailed by particular patterns.

So in our situation we could use:
SQL> with t as (select 'three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33' str from dual)
2 --
3 select t.*,
4 regexp_replace(str, '\(\d{3}\)(\d{3}-\d{2}-\d{2}( |$))|.','\1') phone_numbers
5 from t
6 /

STR PHONE_NUMBERS
------------------------------------------------------------------- ------------------------
three numbers: (111)111-11-11 and [222]222-22-22 and (333)333-33-33 111-11-11 333-33-33

SQL>

That's it.
As you can see - there's no 222-22-22 phone number in the result set.

to be continued...

Regexp 1: getting all the needed substrings

I decided to write some notes on regular expressions. And especially about one trick, that sometimes can be very useful.
It's not my invention. First time I found it on the following OTN thread written by cd. It's a pity I don't know his full name. All I know is that he's from Austria and that he's an expert on regular expressions :)

The trick is very simple: if you want to get all the occurences of substring of the particular mask from a string, just put:
regexp_replace(your_string,'(regexp_mask)|.','\1')

Those small pipe and dot (|.) at the end are making a huge deal.

Let's get to the example. Imagine you have a table and one field is of varchar type, where there's information stored about phone numbers. But it is not pure phone numbers, there's some additional text around them, e.g.:

SQL> with t as (select 'That''s phone number 1: (916)809-23-34 and that''s the second one: (918)234-12-09' str from dual union all
2 select 'Call me later on (926)507-15-34 or on (913)432-23-21' from dual union all
3 select 'two numbers: (917)888-34-34 and (903)234-43-11' from dual)
4 --
5 select * from t
6 /

STR
-------------------------------------------------------------------------------
That's phone number 1: (916)809-23-34 and that's the second one: (918)234-12-09
Call me later on (926)507-15-34 or on (913)432-23-21
two numbers: (917)888-34-34 and (903)234-43-11

SQL>

What you want is to drag out only phone numbers that have a mask (XXX)XXX-XX-XX, where X stands for any digit.

What could you do if you didn't know about the trick I mentioned?
you could use regexp_susbtr:
SQL> with t as (select 'That''s phone number 1: (916)809-23-34 and that''s the second one: [234]123-43-23 and the third one: (918)234-12-09' str from dual union all
2 select 'Call me later on (926)507-15-34 or on (913)432-23-21 or on [234]123-43-23' from dual union all
3 select 'two numbers: (917)888-34-34 and (903)234-43-11' from dual)
4 --
5 select t.*,
6 regexp_substr(str, '\(\d{3}\)\d{3}(-\d{2}){2}') phone_num1,
7 regexp_substr(str, '\(\d{3}\)\d{3}(-\d{2}){2}', 1, 2) phone_num2
8 from t
9 /

STR PHONE_NUM1 PHONE_NUM2
-------------------------------------------------------------------------------- ----------------- ------------------
That's phone number 1: (916)809-23-34 and that's the second one: [234]123-43-23 (916)809-23-34 (918)234-12-09
Call me later on (926)507-15-34 or on (913)432-23-21 or on [234]123-43-23 (926)507-15-34 (913)432-23-21
two numbers: (917)888-34-34 and (903)234-43-11 (917)888-34-34 (903)234-43-11

SQL>

But as you can see - you should write a separate regexp_substr column for each occurence of substring, because regexp_substr can pull out only one occurence at a time. And sometimes you don't know beforehand how many occurences of substring are there.

BTW I used \d for digit - if you're on version less than 10.2.x.x you can use [:digit:] or [0-9] instead.

And now let's see what we can do using a tricky regexp_replace:
SQL> with t as (select 'That''s phone number 1: (916)809-23-34 and that''s the second one: (918)234-12-09' str from dual union all
2 select 'Call me later on (926)507-15-34 or on (913)432-23-21' from dual union all
3 select 'two numbers: (917)888-34-34 and (903)234-43-11' from dual)
4 --
5 select t.*,
6 regexp_replace(str,
7 '(\(\d{3}\)\d{3}(-\d{2}){2}( |$))|.',
8 '\1') full_phone_numbers
9 from t
10 /

STR FULL_PHONE_NUMBERS
------------------------------------------------------------------------------- --------------------------------------
That's phone number 1: (916)809-23-34 and that's the second one: (918)234-12-09 (916)809-23-34 (918)234-12-09
Call me later on (926)507-15-34 or on (913)432-23-21 (926)507-15-34 (913)432-23-21
two numbers: (917)888-34-34 and (903)234-43-11 (917)888-34-34 (903)234-43-11

SQL>

You see - everything that doesn't correspond to the mask is removed, and all occurences of the needed phone numbers are placed in one column.

to be continued...

Thursday, October 25, 2007

Using collections with 10g features

After a small lull I decided to post some thoughts on collections. Especially those features that were introduced starting from 10g Oracle version.

Actually, the legs are growing from that thread on OTN forum. A lot of solutions were given, and for sure I would use one of them on the OP's place, but I just decided to show how It can be done using collections.

Problem description:

Input data (from the forum):
"ok I have 4 columns returned in my query which can either have a value of 'FF' or null".

So we have a table:


SQL> create table t as
2 (select 'FF' as col1, null as col2, null as col3, null as col4 from dual union all
3 select 'FF', 'FF', 'FF', null from dual union all
4 select 'FF', 'FF', null, 'FF' from dual union all
5 select 'FF', null, 'FF', null from dual)
6 /

Table created.

SQL> select * from t;

COL1 COL2 COL3 COL4
---- ---- ---- ----
FF
FF FF FF
FF FF FF
FF FF
SQL>

Goal:
"I want to create a fifth column with a calculation which basically states..." how many fields have 'FF' value in each row. Or it can be formulated as: "how many columns have non-null values" (it's not the same, but when the possible values domain consists of only 'FF' or NULL - it is quite similar).

Expected output:

COL1 COL2 COL3 COL4 COL5
---------- ---------- ---------- ---------- ----------
FF NULL NULL NULL 1
FF FF FF NULL 3
FF FF NULL FF 3
FF NULL FF NULL 2


Solution:
First of all let's create a collection consisting of varchar elements:


SQL> create type test as table of varchar2(100);
2 /

Type created.
SQL>

Then we execute a query which would return us a collection in the fifth field, where all the values from the previous four columns would be stored:


SQL> column col1 format A10;
SQL> column col2 format A10;
SQL> column col3 format A10;
SQL> column col4 format A10;
SQL> column coll format A30;
SQL> set null 'NULL';

SQL> select col1, col2, col3, col4,
2 cast(multiset(
3 select decode(level, 1, col1, 2, col2, 3, col3, 4, col4) from dual
4 connect by level <= 4) as test) coll
5 from t
6 /

COL1 COL2 COL3 COL4 COLL
---------- ---------- ---------- ---------- ------------------------------
FF NULL NULL NULL TEST('FF', NULL, NULL, NULL)
FF FF FF NULL TEST('FF', 'FF', 'FF', NULL)
FF FF NULL FF TEST('FF', 'FF', NULL, 'FF')
FF NULL FF NULL TEST('FF', NULL, 'FF', NULL)
SQL>

Now we look for a new operator MULTISET INTERSECT.
So the main idea of our solution is to get rid of needless elements in the initial collection and then count how many elements are left there.

So if we intersect our collection with a collection storing only 'FF' elements - we'll get rid of redundant NULL elements.

Let's do it:


SQL> select col1,col2,col3,col4,
2 cast(multiset(
3 select decode(level,1,col1,2,col2,3,col3,4,col4) from dual
4 connect by level<=4) as test)
5 multiset intersect all
6 cast(multiset (select 'FF' from dual connect by level <= 4) as test) coll from t
7 /

COL1 COL2 COL3 COL4 COLL
---------- ---------- ---------- ---------- ------------------------------
FF NULL NULL NULL TEST('FF')
FF FF FF NULL TEST('FF', 'FF', 'FF')
FF FF NULL FF TEST('FF', 'FF', 'FF')
FF NULL FF NULL TEST('FF', 'FF')
SQL>

I use ALL key word explicitly (although it is default value) to distinguish that we don't want to loose values wich are duplicated ('FF' in our case).

Another important point here is making the second collection consist of not less elements than the first does (4 elements in our case - I use simple connect by level<=4), because if you leave only one 'FF' element in the second collection - you'll get also only one element in the resulting collection:


SQL> select col1, col2, col3, col4,
2 cast(multiset(select decode(level,1,col1,2,col2,3,col3,4,col4) from dual
3 connect by level<=4) as test)
4 multiset intersect all
5 cast(multiset (select 'FF' from dual) as test) coll from t
6 /

COL1 COL2 COL3 COL4 COLL
---------- ---------- ---------- ---------- ------------------------------
FF NULL NULL NULL TEST('FF')
FF FF FF NULL TEST('FF')
FF FF NULL FF TEST('FF')
FF NULL FF NULL TEST('FF')
SQL>

It is described in the documentation:
For example, if a particular value occurs m times in nested_table1 and n times in nested_table2, then the result would contain the element min(m,n) times.

Ok, now the last our step - is counting how many elements are there in the resulting collection. It can easily be done with a CARDINALITY function.

So the final query would be:


SQL> select col1,col2,col3,col4,
2 cardinality(cast(multiset(
3 select decode(level,1,col1,2,col2,3,col3,4,col4) from dual
4 connect by level<=4) as test)
5 multiset intersect all
6 cast(multiset (select 'FF' from dual connect by level <= 4) as test)) coll from t
7 /

COL1 COL2 COL3 COL4 COLL
---------- ---------- ---------- ---------- ----------
FF NULL NULL NULL 1
FF FF FF NULL 3
FF FF NULL FF 3
FF NULL FF NULL 2
SQL>

Alternative solution would be using as a second collection - a collection with null elements, then intersect them both, and counting the column value as 4 - number_of_null_columns, where 4 is total number of fields in our row.

But we can use here another new multiset operator MULTISET EXCEPT:


SQL> select col1, col2, col3, col4,
2 cardinality(cast(multiset(select decode(level, 1, col1, 2, col2, 3, col3, 4, col4) from dual
3 connect by level <= 4) as test)
4 multiset except all
5 cast(multiset(select null from dual connect by level<=4) as test)) coll from t
6 /

COL1 COL2 COL3 COL4 COLL
---------- ---------- ---------- ---------- ----------
FF NULL NULL NULL 1
FF FF FF NULL 3
FF FF NULL FF 3
FF NULL FF NULL 2
SQL>

It is returning all elements from the first collection which are not met in the second one.

PS
strange for me was the following problem:


SQL> select col1, col2, col3, col4,
2 cardinality(cast(multiset(select col1 from dual union all
3 select col2 from dual union all
4 select col3 from dual union all
5 select col4 from dual) as test)
6 multiset intersect all
7 cast(multiset (select 'FF' from dual connect by level <= 4) as test)) coll from t
8 /

...

ORA-03001: unimplemented feature
SQL>


So we can't pass a query with UNION ALL to the CAST(MULTISET(...) as ...) function. But I didn't find such a kind of limitation in the doc.

Friday, October 12, 2007

Learning XHTML & CSS

For everyone who'd like to get familiar with XHTML & CSS stuff I highly recommend the book
Head First HTML with CSS & XHTML
Head First HTML with CSS & XHTML.

A week ago I didn't even know what CSS is, but today I've changed the appearence of my blog :))
The one thing was irritating that the width of my blog was fixed, and when it was read from the monitors with high resolution it looked like a narrow column, but now the width of the post bodies is flexible depending on the width of the browser.
I like it.

Wednesday, October 3, 2007

Removing duplicate elements from the string

Hi there! :)
Well, this is actually my first post in this newly created blog.
The idea of keeping a blog was born a day ago, when there was a question on the Oracle forum (this thread). I’ve met the same one only a few days ago – but couldn’t find it, cause there were some problems with the search engine I usually face when I need to find something :))
So I decided – why don’t I keep my own blog – where I can post such solutions that can be interesting for others. And here we are!

Ok, let's go back to the problem, mentioned in those links.

Problem description:

Input data:

There is a string containing elements. They can be separated with any symbol, e.g. a comma:

SQL> with t as (select 'elem1, elem2, elem3, elem1, elem3, elem2, elem2' str
from dual)
--
select * from t
/

STR
-----------------------------------------------
elem1, elem2, elem3, elem1, elem3, elem2, elem2

SQL>

...or any other clear way, for example, it was formulated as "3 letter codes". So in this case there are no element delimiters, but we know that each element is of three same consequtive letters, e.g.:

SQL> with t as (select 'AAABBBCCCBBBDDDAAAEEEBBB' str from dual)
--
select * from t
/

STR
------------------------
AAABBBCCCBBBDDDAAAEEEBBB

SQL>

Goal:

The objective is to remove duplicate elements out of the string, to leave only one specimen of each element.

Expected output:

So in the first case the result should be:
elem1, elem2, elem3

In the second:
AAABBBCCCDDDEEE

Solution:

In both cases the first thing what we need - is to pick out elements from the string.
Regular expressions are very friendly for us in that job (espesially, when there are no delimiters).
So elements like AAA, BBB, CCC etc. can be written as '([[:alpha:]])\1{2}' - any letter trailed by two same letters.

Now we want to understand - what would be the mask for such an element followed by another elements and again the same first element, e.g. AAABBBCCCAAA, or it can be followed directly by the same element: AAAAAA.
In language of regular expressions it would look like '(([[:alpha:]])\2{2}).*\1'

That's it!
Now if we put regexp_replace(str, '(([[:alpha:]])\2{2})(.*)\1','\1\3') we would throw out one last element which is met firstly and has duplicate values in the string. It would be last, because we used greedy operator '*'.

So for example:
SQL> with t as (select 'AAABBBCCCBBBEEEDDDAAAEEEBBBEEE' str from dual)
--
select str,
regexp_replace(str, '(([[:alpha:]])\2{2})(.*)\1','\1\3') new_str
from t
/

STR NEW_STR
------------------------------ -------------------------------
AAABBBCCCBBBEEEDDDAAAEEEBBBEEE AAABBBCCCBBBEEEDDDEEEBBB

SQL>
What is the logic of this operation:

  1. We look for the first element which has duplicated values. In our case it is AAA.
  2. Then regular expression operator finds the last AAA met in the string and removes it.
  3. Then, it goes to the rest of the string and again finds the first element, which has duplicated values in the rest of the string, now it is EEE.
  4. And finally removes the last EEE element.
  5. In our example it is the end of the operations, but if the string is longer it would proceed the previously mentioned operations again and again.
It would be more comprehensible if we mark the first met element with green, and the last, which would be removed, with red:
'AAABBBCCCBBBEEEDDDAAAEEEBBBEEE'.
As you can see, the first EEE element is not really the first EEE element in the string. It is the first one in the rest of the string after we removed AAA element.

If we iteratively apply this regular expression to our string - finally we remove all duplicated elements. But about it later. Now let's improve our expression a little bit.
First we'll change greedy operator '*' to non-greedy '*?'. So that we will do our job in fewer iterations.

Let's look at the following example:
we have string 'AAABBBAAAAAACCCAAA'.
with greedy regexp_replace(str, '(([[:alpha:]])\2{2})(.*)\1','\1\3') we'll have iterations:

  1. AAABBBAAAAAACCCAAA
  2. AAABBBAAAAAACCC
  3. AAABBBAAACCC
result: 'AAABBBCCC'. So it took us 3 times to iterate.
And with the non-greedy regexp_replace(str, '(([[:alpha:]])\2{2})(.*?)\1','\1\3') we'll have iterations:

  1. AAABBBAAAAAACCCAAA
  2. AAABBBAAACCC
The same result and achieved in two iterations.

Let's make one more improvement: add '+' after the backreference '\1'. In the language of regular expressions - it means one or more occurencies of the element which is preceding this '+'.
Let's imagine we have a string: 'AAABBBAAAAAAAAAAAA'.
With regexp_replace(str, '(([[:alpha:]])\2{2})(.*?)\1','\1\3') we'll make three iterations:

  1. AAABBBAAAAAAAAAAAA
  2. AAABBBAAAAAA
  3. AAABBBAAA
If we place '+' and use regexp_replace(str, '(([[:alpha:]])\2{2})(.*?)\1+','\1\3') we'll have to make only one iteration:

  1. AAABBBAAAAAAAAAAAA
Finally, it is a time for implementing an iterative mechanism of applying the same function to the string. Starting from Oracle version 10g there was a nice Model clause introduced, which can be used to proceed operations iteratively.

So, the final query would look like:
SQL> WITH t AS (SELECT
'AAABBBCCCBBBDDDAAAEEEBBB' str FROM dual)
--
select str, str_new from t
model
dimension by (0 dim)
measures(str, str str_new)
rules iterate(100) until (str_new[0] = previous(str_new[0]))
(str_new[0]=regexp_replace(str_new[0],'(([[:alpha:]])\2{2})(.*?)\1+','\1\3'));

STR STR_NEW
------------------------ ------------------------
AAABBBCCCBBBDDDAAAEEEBBB AAABBBCCCDDDEEE

SQL>

Two words about Model clause here:

  • With regexp_replace you are already familiar. It is applyed to the string during each iteration;
  • iterate(100) means maximum of iterations could be proceeded is 100 (it can be increased if you want);
  • until (str[0] = previous(str[0])) means stop iterations when the string is not changed during the previous iteration.
Well, that's it.

Now let's take the case when there is a delimited string, e.g. with commas. In that case we need to change our regular expression a little bit. Aketi Jyuuzou made this in one of the mentioned links. I just clarify it for readers. So for a string like 'elem1, elem2, elem3, elem1, elem3, elem2, elem2' we'll need the following:
regexp_replace(str,'(^|,)([^,]+,)(.*?,)?\2+','\1\2\3')
I just added non-greedy '*?' and '+' in the end, but you have already read about the impact of these. So you understand what are they needed for.

Backreference \1 stands for (^|,) - this is the symbol before the first met duplicated element. It is either begining of the string (^), either a comma trailing the previous element (,).
Backreference \2 stands for the ([^,]+,) - this is the element itself, which means one or more non-comma symbols followed by a comma.
Backreference \3 stands for (.*?,)? which is the minimum (non-greedy) number of symbols before one or more duplicated value (\2+).

So the final query would look like:

SQL> WITH t AS (SELECT 'elem1,elem2,elem3,elem1,elem3,elem2,elem2' str FROM dual)
--
select str, rtrim(str_new,',') new_str from t
model
dimension by (0 dim)
measures(str, str||',' str_new)
rules iterate(100) until (str_new[0] = previous(str_new[0]))
(str_new[0]=regexp_replace(str_new[0],'(^|,)([^,]+,)(.*?,)?\2+','\1\2\3'));

STR NEW_STR
----------------------------------------- -------------------------
elem1,elem2,elem3,elem1,elem3,elem2,elem2 elem1,elem2,elem3

SQL>


Hope it was useful!

PS
It is my first post, so I'll be grateful if you post your comments and notices about it. Was it too comlicatedely stated or maybe too detailed. Well,waiting for your replies :)