Volder's Oracle Notes: December 2007

Wednesday, December 19, 2007

Reports: getting total of all children values in a tree

... previous

Well, actually, this note is a continuation of a model series on reports.
So the structure of an article is the same: the problem description -> non-model solution -> model clause solution.

But also this note is very similar to summation of tree values, I've written a little bit earlier.

Input data:

So let's get started.
Our input for this post would be a table t from here and an additional table t_hrchy with a hierarchical structure of departments:

SQL> drop table t_hrchy;

Table dropped

SQL>
SQL> create table t_hrchy as (select 'DEP2' dep, 'DEP1' dep_par from dual union all
2 select 'DEP3' dep, 'DEP2' dep_par from dual union all
3 select 'DEP4' dep, 'DEP3' dep_par from dual)
4 /

Table created

SQL> select * from t_hrchy;

DEP DEP_PAR
---- -------
DEP2 DEP1
DEP3 DEP2
DEP4 DEP3

SQL>

So this table tells us that DEP1 is a parent department for DEP2, DEP2 is a parent department for DEP3 and DEP3 is a parent department for DEP4.

Problem:

Now our task would be to create a report where in one column would be sales of a particular department output and in another column we need to get the sales of a current department and for all the departments that are under the current one in the organization structure.
The data should be consolidated only inside the same year and inside the same city.

Non-Model solution:

When the matter concerns the hierarchical structure the first thing that pops up in our head is hierarchical queries with CONNECT BY clause.

Elaborating this a little bit we can use the following query proposed by Rob van Wijk here or here:

SQL> select to_char(year,'yyyy') year,
2 city,
3 t.dep,
4 th.dep_par,
5 sales,
6 sum(connect_by_root sales) tot_sales
7 from t, t_hrchy th
8 where t.dep = th.dep(+)
9 connect by t.dep = prior dep_par
10 and prior year = year
11 and prior city = city
12 group by city, year, t.dep, dep_par, sales
13 order by 1,2,3;

YEAR CITY DEP DEP_PAR SALES TOT_SALES
---- ------ ---- ------- ---------- ----------
2006 Moscow DEP1 562 2254
2006 Moscow DEP2 DEP1 457 1692
2006 Moscow DEP3 DEP2 890 1235
2006 Moscow DEP4 DEP3 345 345
2006 Omsk DEP1 23 264
2006 Omsk DEP2 DEP1 154 241
2006 Omsk DEP3 DEP2 87 87
2007 Moscow DEP1 762 2926
2007 Moscow DEP2 DEP1 657 2164
2007 Moscow DEP3 DEP2 1020 1507
2007 Moscow DEP4 DEP3 487 487
2007 Omsk DEP1 34 403
2007 Omsk DEP2 DEP1 213 369
2007 Omsk DEP3 DEP2 156 156

14 rows selected

SQL>

Lets see what this query actually does.
First we use an outer join of two tables - so for each row we get a parent department name.

Then we build a hierarchy by using

connect by t.dep = prior dep_par

Mention that hierarchy is built in the backward direction starting from the children and going up to their parents.

Also we add in the connect by clause the following:

and prior year = year
and prior city = city

So we build hierarchy separartely for different years and cities.

By using the operator

connect_by_root sales

in every row we find a value for sales of the root elements for this or that thread in a hierarchy. As we didn't specify any START WITH condition - the hierarchy is built starting from all the departments present in the data.
For example, let's look what would happen inside the partition for 2007 year Moscow city:

SQL> select to_char(year,'yyyy') year,
2 city,
3 t.dep,
4 th.dep_par,
5 sales,
6 connect_by_root sales root_sales,
7 level,
8 sys_connect_by_path(t.dep,'/') hrchy_path
9 from t, t_hrchy th
10 where t.dep = th.dep(+)
11 and trunc(t.year,'y')=to_date('01.01.2007','dd.mm.yyyy')
12 and t.city='Moscow'
13 connect by t.dep = prior dep_par
14 and prior year = year
15 and prior city = city
16 /

YEAR CITY DEP DEP_PAR SALES ROOT_SALES LEVEL HRCHY_PATH
---- ------ ---- ------- ---------- ---------- ---------- ----------------------------
2007 Moscow DEP1 762 762 1 /DEP1
2007 Moscow DEP2 DEP1 657 657 1 /DEP2
2007 Moscow DEP1 762 657 2 /DEP2/DEP1
2007 Moscow DEP3 DEP2 1020 1020 1 /DEP3
2007 Moscow DEP2 DEP1 657 1020 2 /DEP3/DEP2
2007 Moscow DEP1 762 1020 3 /DEP3/DEP2/DEP1
2007 Moscow DEP4 DEP3 487 487 1 /DEP4
2007 Moscow DEP3 DEP2 1020 487 2 /DEP4/DEP3
2007 Moscow DEP2 DEP1 657 487 3 /DEP4/DEP3/DEP2
2007 Moscow DEP1 762 487 4 /DEP4/DEP3/DEP2/DEP1

10 rows selected

SQL>

As you see DEP1 is met four times:
- Firstly it stands for itself (level=1).
- Then we begin to build hierarchy starting from DEP2 and DEP1 is met as a "child" (remember we build a hierarchy in the opposite direction) for DEP2 (level=2)
- The next time we meet it as a "grandchild" for DEP3 (level=3)
- And finally as a "great-grandchild" for DEP4 (level=4).

In each row we have a value of sales for the root department. So if we group data by department and put a sum() aggregate function on connect_by_root(sales) column we get the sales of DEP1 and all the sales of all departments that are under DEP1 in one row.
The same stuff will happen to all the other departments:

SQL> select to_char(year,'yyyy') year,
2 city,
3 t.dep,
4 th.dep_par,
5 sales,
6 sum(connect_by_root sales) tot_salesh
7 from t, t_hrchy th
8 where t.dep = th.dep(+)
9 and trunc(t.year,'y')=to_date('01.01.2007','dd.mm.yyyy')
10 and t.city='Moscow'
11 connect by t.dep = prior dep_par
12 and prior year = year
13 and prior city = city
14 group by year,city,t.dep,th.dep_par,sales
15 order by 1,2,3
16 /

YEAR CITY DEP DEP_PAR SALES TOT_SALESH
---- ------ ---- ------- ---------- ----------
2007 Moscow DEP1 762 2926
2007 Moscow DEP2 DEP1 657 2164
2007 Moscow DEP3 DEP2 1020 1507
2007 Moscow DEP4 DEP3 487 487

SQL>

Model Solution

Now we are proceeding and will solve the same problem using Model clause without any connect by's.

To get less data to be outputed at every step - let's restrict our data to 2007 year and Moscow city only as we did in Non-Model solution:

SQL> select * from t, t_hrchy th
2 where t.dep = th.dep(+)
3 and trunc(t.year,'y')=to_date('01.01.2007','dd.mm.yyyy')
4 and t.city='Moscow'
5 order by 3,1,2
6 /

CITY DEP YEAR SALES DEP DEP_PAR
------ ---- ----------- ---------- ---- -------
Moscow DEP1 31.12.2007 762
Moscow DEP2 31.12.2007 657 DEP2 DEP1
Moscow DEP3 31.12.2007 1020 DEP3 DEP2
Moscow DEP4 31.12.2007 487 DEP4 DEP3

SQL>

Let's think of what woud be placed in PARTITION BY part of our model. Remeber that partitions should be absolutely independent one of another.
Yes - we would put YEAR and CITY there. As a task instructs us to treat sales of different years and cities separately.
In dimension we would place DEP and DEP_PART columns.
And in measures - we'll put SALES, of course. And create one more measure called TOT_SALES with initial value of zero.

SQL> select * from t, t_hrchy th
2 where t.dep = th.dep(+)
3 and trunc(t.year,'y')=to_date('01.01.2007','dd.mm.yyyy')
4 and t.city='Moscow'
5 model
6 partition by (year,city)
7 dimension by (t.dep dep,dep_par)
8 measures(sales, 0 tot_sales)
9 ()
10 order by 1,2,3
11 /

YEAR CITY DEP DEP_PAR SALES TOT_SALES
----------- ------ ---- ------- ---------- ----------
31.12.2007 Moscow DEP1 762 0
31.12.2007 Moscow DEP2 DEP1 657 0
31.12.2007 Moscow DEP3 DEP2 1020 0
31.12.2007 Moscow DEP4 DEP3 487 0

SQL>

Now we are going to write a rule for calculating the tot_sales.
For any cell in our model the tot_sales should be equal to a sum of sales of the current department and all the tot_sales of all departments under the current.
We can write it as:

tot_sales[any,any] = sales[CV(),CV()]+nvl(sum(tot_sales)[any,CV(dep)],0)

BTW it would be correct in case when the tot_sales of the departments under the current one already included all the tot_sales of the departments under them.

So what is important here is the ORDER in what the rules are executed.
If we don't specify anything about the order of execution to our model it uses the default SEQUENTIAL ORDER and can easily throw out an error for us:

SQL> select * from t, t_hrchy th
2 where t.dep = th.dep(+)
3 and trunc(t.year,'y')=to_date('01.01.2007','dd.mm.yyyy')
4 and t.city='Moscow'
5 model
6 partition by (year,city)
7 dimension by (t.dep dep,dep_par)
8 measures(sales, 0 tot_sales)
9 rules
10 (tot_sales[any,any] = sales[CV(),CV()]+nvl(sum(tot_sales)[any,CV(dep)],0))
11 order by 1,2,3
12 /

...

ORA-32637: Self cyclic rule in sequential order MODEL

SQL>

But putting AUTOMATIC ORDER causes all rules to be evaluated in an order based on their logical dependencies.
Let's check:

SQL> select * from t, t_hrchy th
2 where t.dep = th.dep(+)
3 and trunc(t.year,'y')=to_date('01.01.2007','dd.mm.yyyy')
4 and t.city='Moscow'
5 model
6 partition by (year,city)
7 dimension by (t.dep dep,dep_par)
8 measures(sales, 0 tot_sales)
9 rules automatic order
10 (tot_sales[any,any] = sales[CV(),CV()]+nvl(sum(tot_sales)[any,CV(dep)],0))
11 order by 1,2,3
12 /

YEAR CITY DEP DEP_PAR SALES TOT_SALES
----------- ------ ---- ------- ---------- ----------
31.12.2007 Moscow DEP1 762 2926
31.12.2007 Moscow DEP2 DEP1 657 2164
31.12.2007 Moscow DEP3 DEP2 1020 1507
31.12.2007 Moscow DEP4 DEP3 487 487

SQL>

Voilà! This is exactly what we were looking for.

Let's find a detailed description of what's happening when we don't use AUTOMATIC ORDER and what's changing when we begin to use it.
First of all, the equivalent of our single rule would be actually a list of four rules for every row (unique set of dimension values) in our data:

tot_sales['DEP1', null ]=sales['DEP1', null ]+nvl(sum(tot_sales)[any,'DEP1'],0),
tot_sales['DEP2','DEP1']=sales['DEP2','DEP1']+nvl(sum(tot_sales)[any,'DEP2'],0),
tot_sales['DEP3','DEP2']=sales['DEP3','DEP2']+nvl(sum(tot_sales)[any,'DEP3'],0),
tot_sales['DEP4','DEP3']=sales['DEP4','DEP3']+nvl(sum(tot_sales)[any,'DEP4'],0)

But using SEQUENTIAL ORDER doesn't guarantee us any order of these rules execution. Because as it stated in the documetation: SEQUENTIAL ORDER: This causes rules to be evaluated in the order they are written. This is the default.
But we have only one rule in our model. So the detailed rules can be run in any order:
for example first we find tot_sales['DEP1', null ] and then tot_sales['DEP2','DEP1'], or vice versa. And that is the problem, because this two possibilities will give us different results in the output.

What actually ORA-32637 stays for is:
Cause: A self-cyclic rule was detected in the sequential order MODEL. Sequential order MODELs cannot have self cyclic rules to guarantee that the results do not depend on the order of evaluation of the cells that are updated or upserted.
Action: Use ordered rule evaluation for this rule.

So if use order by DEP we shouldn't have such an error, let's try:

SQL> select * from t, t_hrchy th
2 where t.dep = th.dep(+)
3 and trunc(t.year,'y')=to_date('01.01.2007','dd.mm.yyyy')
4 and t.city='Moscow'
5 model
6 partition by (year,city)
7 dimension by (t.dep dep,dep_par)
8 measures(sales, 0 tot_sales)
9 rules
10 (tot_sales[any,any] order by dep = sales[CV(),CV()]+nvl(sum(tot_sales)[any,CV(dep)],0))
11 order by 1,2,3
12 /

YEAR CITY DEP DEP_PAR SALES TOT_SALES
----------- ------ ---- ------- ---------- ----------
31.12.2007 Moscow DEP1 762 762
31.12.2007 Moscow DEP2 DEP1 657 657
31.12.2007 Moscow DEP3 DEP2 1020 1020
31.12.2007 Moscow DEP4 DEP3 487 487

SQL>

Yes, the query returns no error - because we explicitly defined the order of rules execution. But as you see we have wrong results. Because actually in our case we should order by dep DESC! You can try it yourself.
While we don't know what are the dependences in the hierarchy data beforehand - let's give Oracle to decide - what are the dependencies between cells and simply use AUTOMATIC ORDER:

SQL> select * from t, t_hrchy th
2 where t.dep = th.dep(+)
3 model
4 partition by (to_char(year,'yyyy') year,city)
5 dimension by (t.dep dep,dep_par)
6 measures(sales, 0 tot_sales)
7 rules automatic order
8 (tot_sales[any,any] = sales[CV(),CV()]+nvl(sum(tot_sales)[any,CV(dep)],0))
9 order by 1,2,3
10 /

YEAR CITY DEP DEP_PAR SALES TOT_SALES
---- ------ ---- ------- ---------- ----------
2006 Moscow DEP1 562 2254
2006 Moscow DEP2 DEP1 457 1692
2006 Moscow DEP3 DEP2 890 1235
2006 Moscow DEP4 DEP3 345 345
2006 Omsk DEP1 23 264
2006 Omsk DEP2 DEP1 154 241
2006 Omsk DEP3 DEP2 87 87
2007 Moscow DEP1 762 2926
2007 Moscow DEP2 DEP1 657 2164
2007 Moscow DEP3 DEP2 1020 1507
2007 Moscow DEP4 DEP3 487 487
2007 Omsk DEP1 34 403
2007 Omsk DEP2 DEP1 213 369
2007 Omsk DEP3 DEP2 156 156

14 rows selected

SQL>

to be continued...

Sunday, December 16, 2007

Reports: forecasting

... previous

We carry on with our reports (the needed input data).
Now what we need is to make a little forecast. We are requested to find out what the approximate sales for the next year would be.

The growth rate is determined as a ratio of current year sales (2007) to the previous year sales (2006), but it should be calculated separately for different cities.
So to find the growth rate for each row - we need to find total sales for the particular city for the current year and divide it on the total sales for the same city for the previous year.

Non-Model solution
Without using model - we can find the growth rate in a scalar subquery:

select sum(decode(to_char(year, 'yyyy'), '2007', sales)) /
sum(decode(to_char(year, 'yyyy'), '2006', sales))
from t
where city = t1.city

where t1.city should be a reference to the main outer query.
If we put this scalar subquery directly into the query we already had, we get the following result:

SQL> select city,
2 dep,
3 sum(sales) sales_2007,
4 round(sum(sales) *
5 (select sum(decode(to_char(year, 'yyyy'), '2007', sales)) /
6 sum(decode(to_char(year, 'yyyy'), '2006', sales))
7 from t
8 where city = t1.city),
9 2) sales_2008_forecast
10 from t t1
11 where trunc(year, 'y') = to_date('01.01.2007', 'dd.mm.yyyy')
12 group by rollup(city, dep)
13 /

CITY DEP SALES_2007 SALES_2008_FORECAST
------ ---- ---------- -------------------
Omsk DEP1 34 51,9
Omsk DEP2 213 325,15
Omsk DEP3 156 238,14
Omsk 403 615,19
Moscow DEP1 762 989,18
Moscow DEP2 657 852,88
Moscow DEP3 1020 1324,1
Moscow DEP4 487 632,19
Moscow 2926 3798,35
3329

10 rows selected

SQL>

As you see - we don't get a total for year 2008. This happens because city value in this total row is equal to NULL. And while we calculating the growth rate for particular cities - we get no rate at all for this row. So as a result we get NULL.

The workaround here would be first to get forecasting for detailed sales and then group with ROLLUP the sales for 2007 year and also for the predictable 2008 year.

So it would look like the following:

SQL> select city,
2 dep,
3 sum(sales) sales_2007,
4 sum(sales2) sales_2008_forecast,
5 round(100 * (sum(sales2) / sum(sales) - 1), 2) || '%' growth_rate
6 from (select t0.*,
7 round(sales *
8 (select sum(decode(to_char(year, 'yyyy'), '2007', sales)) /
9 sum(decode(to_char(year, 'yyyy'), '2006', sales))
10 from t
11 where city = t0.city),
12 2) sales2
13 from t t0) t1
14 where trunc(year, 'y') = to_date('01.01.2007', 'dd.mm.yyyy')
15 group by rollup(city, dep)
16 /

CITY DEP SALES_2007 SALES_2008_FORECAST GROWTH_RATE
------ ---- ---------- ------------------- ---------------
Omsk DEP1 34 51,9 52.65%
Omsk DEP2 213 325,15 52.65%
Omsk DEP3 156 238,14 52.65%
Omsk 403 615,19 52.65%
Moscow DEP1 762 989,18 29.81%
Moscow DEP2 657 852,88 29.81%
Moscow DEP3 1020 1324,1 29.81%
Moscow DEP4 487 632,19 29.81%
Moscow 2926 3798,35 29.81%
3329 4413,54 32.58%

10 rows selected

SQL>

Now all the totals are where they should be. And we added an extra column with growth rate. So we can see, that sales in Omsk grew up ~53% and in Moscow increased ~30%. The average growth rate from 2006 until 2007 years was ~33%.

Model solution #1
With the model solution instead of scalar subquery - we can use reference model clause.
We put the following query into the reference model:

SQL> select city, to_char(year,'yyyy') y, sum(sales) sum_sales from t group by city,year;

CITY Y SUM_SALES
------ ---- ----------
Moscow 2006 2254
Omsk 2006 264
Moscow 2007 2926
Omsk 2007 403

SQL>

And then we'll use the values from that resultset.

SQL> select city,dep,sales,sales_fc,round(100*(growth-1),2)||'%' growth_rate
2 from t where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
3 model
4 reference r on (select city, to_char(year,'yyyy') y, sum(sales) sum_sales from t group by city,year)
5 dimension by (y, city)
6 measures(sum_sales)
7 main m
8 dimension by (0 total,city, dep)
9 measures(sales, 0 sales_FC, 0 growth)
10 rules upsert all
11 (sales[1,any,null]=sum(sales)[0,CV(),any],
12 sales[1,null,null]=sum(sales)[0,any,any],
13 growth[any,any,any]=r.sum_sales['2007',CV(city)]/r.sum_sales['2006',CV(city)],
14 sales_FC[any,any,any]=round(sales[CV(),CV(),CV()]*growth[CV(),CV(),CV()],2))
15 order by 2,3;

CITY DEP SALES SALES_FC GROWTH_RATE
------ ---- ---------- ---------- -----------------------------------------
Omsk DEP1 34 51,9 52.65%
Moscow DEP1 762 989,18 29.81%
Omsk DEP2 213 325,15 52.65%
Moscow DEP2 657 852,88 29.81%
Omsk DEP3 156 238,14 52.65%
Moscow DEP3 1020 1324,1 29.81%
Moscow DEP4 487 632,19 29.81%
Omsk 403 615,19 52.65%
Moscow 2926 3798,35 29.81%
3329 %

10 rows selected

SQL>

So we put our query into the reference clause, so that we can use it's cell values in the main part.
We created a new measure growth and in the rules part wrote:

growth[any,any,any]=r.sum_sales['2007',CV(city)]/r.sum_sales['2006',CV(city)]

So for each cell we find a growth rate for the current row (current city) by dividing summary sales of the current year for this particular city to the same of the previous year.
The next our step (rule) is finding the predicted sales for the next year:

sales_FC[any,any,any]=round(sales[CV(),CV(),CV()]*growth[CV(),CV(),CV()],2)

And we are facing the same problem we had in the Non-Model part. We can't define any city in the total row - so we can't find a growth rate for this row.
As you already know - we can easily write a rule to find the total sales for the next year as a sum of other cells:

SQL> select city,dep,sales,sales_fc,round(100*(growth-1),2)||'%' growth_rate
2 from t where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
3 model
4 reference r on (select city, to_char(year,'yyyy') y, sum(sales) sum_sales from t group by city,year)
5 dimension by (y, city)
6 measures(sum_sales)
7 main m
8 dimension by (0 total,city, dep)
9 measures(sales, 0 sales_FC, 0 growth)
10 rules upsert all
11 (sales[1,any,null]=sum(sales)[0,CV(),any],
12 sales[1,null,null]=sum(sales)[0,any,any],
13 sales_FC[any,city is not null,any]=round(sales[CV(),CV(),CV()]*r.sum_sales['2007',CV(city)]/r.sum_sales['2006',CV(city)],2),
14 sales_FC[1,null,null]=sum(sales_FC)[1,any,null],
15 growth[any,any,any]=sales_FC[CV(),CV(),CV()]/sales[CV(),CV(),CV()])
16 order by 1,2;

CITY DEP SALES SALES_FC GROWTH_RATE
------ ---- ---------- ---------- ------------------
Moscow DEP1 762 989,18 29.81%
Moscow DEP2 657 852,88 29.81%
Moscow DEP3 1020 1324,1 29.81%
Moscow DEP4 487 632,19 29.81%
Moscow 2926 3798,35 29.81%
Omsk DEP1 34 51,9 52.65%
Omsk DEP2 213 325,15 52.65%
Omsk DEP3 156 238,14 52.65%
Omsk 403 615,19 52.65%
3329 4413,54 32.58%

10 rows selected

SQL>

You see that we changed the order of the rules and the rules themselves (that is because we want to find the growth rate for the overall total row, and hence we would find growth rates AFTER we get overall total sales value).
First we find the forecasting sales for the next year for all the rows excluding the very total:

sales_FC[any,city is not null,any]=
round(sales[CV(),CV(),CV()]*r.sum_sales['2007',CV(city)]/r.sum_sales['2006',CV(city)],2)

Then we find this total by summing up all the city subtotals:

sales_FC[1,null,null]=sum(sales_FC)[1,any,null]

And finally we get the growth rate by dividing sales_FC (forecast) on the actual sales (current year sales):

growth[any,any,any]=sales_FC[CV(),CV(),CV()]/sales[CV(),CV(),CV()]

After all this is the same result we had in the Non-Model solution.

Model solution #2
Well, in both solutions (Non-Model & Model #1) we accessed the table twice. First to get the main result and second one - to get the growth rate.

Now we are going to access table only once by using extra dimension in Model and refusing from the reference model clause.

SQL> select city,dep,sales,sales_fc,round(100*(growth-1),2)||'%' growth_rate from t
2 model
3 return updated rows
4 dimension by (0 total,city, dep, to_char(year,'yyyy') year)
5 measures(sales, sales sales_FC,0 growth)
6 rules upsert all
7 (sales[1,any,null,'2007']=sum(sales)[0,CV(),any,CV()],
8 sales[1,null,null,'2007']=sum(sales)[0,any,any,CV()],
9 sales_FC[any,any,any,'2007']=round(sales[CV(),CV(),CV(),CV()]*
10 sales[1,CV(),null,CV()]/sum(sales)[0,CV(),any,'2006'],2),
11 sales_FC[1,null,null,'2007']=sum(sales_FC)[1,any,null,'2007'],
12 growth[any,any,any,'2007']=sales_FC[CV(),CV(),CV(),CV()]/sales[CV(),CV(),CV(),CV()])
13 order by 1,2
14 /

CITY DEP SALES SALES_FC GROWTH_RATE
------ ---- ---------- ---------- -----------------------------------------
Moscow DEP1 762 989,18 29.81%
Moscow DEP2 657 852,88 29.81%
Moscow DEP3 1020 1324,1 29.81%
Moscow DEP4 487 632,19 29.81%
Moscow 2926 3798,35 29.81%
Omsk DEP1 34 51,9 52.65%
Omsk DEP2 213 325,15 52.65%
Omsk DEP3 156 238,14 52.65%
Omsk 403 615,19 52.65%
3329 4413,54 32.58%

10 rows selected

SQL>

What you should mention first is that we removed

where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')

So now all the data (for all years: 2006 and 2007) is pumped into model clause.

And you probably didn't miss that we put RETURN UPDATED ROWS, so only rows where cells were updated will be returned in the resultset.
That's how we can avoid additional outer WHERE clause for our MODEL: we update only rows that are corresppondent to 2007 year, but not the 2006.

Another thing changed is we added a dimension

to_char(year,'yyyy') year

So now we can reference to sales of a particular year by putting it in the format like [...,'YYYY'].

And the final amendment - would be adding a value for a new dimension in every cell reference ([<total_dim>,<city_dim>,<dep_dim>,'2007']). If we put it on the left side of a formula we can put just CV() on the right side in the place for <year_dim>.

to be continued ...

Tuesday, December 11, 2007

Reports: ratios & percentages

... previous

Let's continue with our spreadsheet report (input data you can find here).

Imagine that we want to add a column that would represent sales percentage of each department in the total of it's city sales. So we need to know how much every department contributes to it's region sales.

Non-model solution:

Developing the group by solution with rollup - we could use RATIO_TO_REPORT function for that purpose.

Our query would look like:

SQL> select city, dep, sum(sales),
2 decode(grouping_id(city, dep),
3 0,
4 round(2 * 100 * ratio_to_report(sum(sales))
5 over(partition by city),
6 2) || '%') perc
7 from t
8 where trunc(year, 'y') = to_date('01.01.2007', 'dd.mm.yyyy')
9 group by rollup(city, dep)
10 /

CITY DEP SUM(SALES) PERC
------ ---- ---------- ---------
Moscow DEP1 762 26.04%
Moscow DEP2 657 22.45%
Moscow DEP3 1020 34.86%
Moscow DEP4 487 16.64%
Moscow 2926
Omsk DEP1 34 8.44%
Omsk DEP2 213 52.85%
Omsk DEP3 156 38.71%
Omsk 403
3329

10 rows selected

SQL>

Let's take a closer look.
First of all we use GROUPING_ID function - to understand when the data is detailed (comparing it to 0). When it is detailed - we return the ratio of the value in the current row to the total sales of the region. Otherwise we return NULL.
But we need to remember that analytic functions are executed after the grouping has been done. So when our ratio_to_report would be applied - we will have doubled sales for each city: on the one hand - the sales of each department, and on the other hand, the sum of all departments sales (as a city subtotal) will also influence our ratio_to_report.
That's why we multiply the function by 2, so we get correct results.
With multiplying by 100 we get percentage values.

Model solution
For getting the same result with Model clause - first of all we need to add a new measure:

cast(null as varchar2(7)) PERC

So it would be our percentage column in the result set:

SQL> select * from t where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
2 model
3 dimension by (0 total,city, dep)
4 measures(sales, cast(null as varchar2(7)) perc)
5 rules upsert all
6 (sales[1,any,null]=sum(sales)[0,CV(),any],
7 sales[1,null,null]=sum(sales)[0,any,any])
8 order by 2,3
9 /

TOTAL CITY DEP SALES PERC
---------- ------ ---- ---------- -------
0 Moscow DEP1 762
0 Moscow DEP2 657
0 Moscow DEP3 1020
0 Moscow DEP4 487
1 Moscow 2926
0 Omsk DEP1 34
0 Omsk DEP2 213
0 Omsk DEP3 156
1 Omsk 403
1 3329

10 rows selected

SQL>

Now we need to create a rule for our column.
So first of all we need to calculate values for all the cells with detail data. On the language of our model it would be perc[0,any,any]. By using a "Total" dimension with value=0 we will get only cells with detailed data. City and Department could be any. That is what we put on the left side of our formula.

On the right side we need to find for each detailed row a ratio of the current row sales - to it's region sales total.
So we would write sales[0,CV(),CV()]/sales[1,CV(),null]
In the denominator we just put a reference to the cell with a region total. While it has been already calculated - we don't need to sum up all the sales of a particular city again.

SQL> select * from t where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
2 model
3 dimension by (0 total,city, dep)
4 measures(sales, cast(null as varchar2(7)) perc)
5 rules upsert all
6 (sales[1,any,null]=sum(sales)[0,CV(),any],
7 sales[1,null,null]=sum(sales)[0,any,any],
8 perc[0,any,any]=round(sales[0,CV(),CV()]/sales[1,CV(),null]*100,2)||'%')
9 order by 2,3
10 /

TOTAL CITY DEP SALES PERC
---------- ------ ---- ---------- -------
0 Moscow DEP1 762 26.04%
0 Moscow DEP2 657 22.45%
0 Moscow DEP3 1020 34.86%
0 Moscow DEP4 487 16.64%
0 Moscow %
1 Moscow 2926
0 Omsk DEP1 34 8.44%
0 Omsk DEP2 213 52.85%
0 Omsk DEP3 156 38.71%
0 Omsk %
1 Omsk 403
1 3329
0 %

13 rows selected

SQL>

Ooops. As you can see there are three redundant rows with only a '%' in our new column.
Why did they appear? Because we placed UPSERT ALL.
If we don't want a particular rule to be under a general semantic, which is mentioned after the RULES word, we can put a needed behaviour name directly before the rule. In our case it is UPDATE (no new rows generation -> only update, for more information look Reports: totals & subtotals):

SQL> select * from t where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
2 model
3 dimension by (0 total,city, dep)
4 measures(sales, cast(null as varchar2(7)) perc)
5 rules upsert all
6 (sales[1,any,null]=sum(sales)[0,CV(),any],
7 sales[1,null,null]=sum(sales)[0,any,any],
8 update perc[0,any,any]=round(sales[0,CV(),CV()]/sales[1,CV(),null]*100,2)||'%')
9 order by 2,3
10 /

TOTAL CITY DEP SALES PERC
---------- ------ ---- ---------- -------
0 Moscow DEP1 762 26.04%
0 Moscow DEP2 657 22.45%
0 Moscow DEP3 1020 34.86%
0 Moscow DEP4 487 16.64%
1 Moscow 2926
0 Omsk DEP1 34 8.44%
0 Omsk DEP2 213 52.85%
0 Omsk DEP3 156 38.71%
1 Omsk 403
1 3329

10 rows selected

SQL>

This is actually what we were looking for!

to be continued ...

Sunday, December 9, 2007

Reports: totals & subtotals

Starting from the current post I decided to write several notes on queries with Model clause. Mostly they would be dedicated to the topics of producing diverse reports.
In each case I'll give an opportunity how it can be done without using the model clause. And then - how the model clause could be used. So let's begin.
Input data:

SQL> drop table t;

Table dropped

SQL>
SQL> create table t as (select 'Omsk' city, 'DEP1' dep, to_date('31.12.2006','dd.mm.yyyy') year, 23 sales from dual union all
2 select 'Omsk' city, 'DEP2' dep, to_date('31.12.2006','dd.mm.yyyy') year, 154 sales from dual union all
3 select 'Omsk' city, 'DEP3' dep, to_date('31.12.2006','dd.mm.yyyy') year, 87 sales from dual union all
4 select 'Moscow' city, 'DEP1' dep, to_date('31.12.2006','dd.mm.yyyy') year, 562 sales from dual union all
5 select 'Moscow' city, 'DEP2' dep, to_date('31.12.2006','dd.mm.yyyy') year, 457 sales from dual union all
6 select 'Moscow' city, 'DEP3' dep, to_date('31.12.2006','dd.mm.yyyy') year, 890 sales from dual union all
7 select 'Moscow' city, 'DEP4' dep, to_date('31.12.2006','dd.mm.yyyy') year, 345 sales from dual union all
8 select 'Omsk' city, 'DEP1' dep, to_date('31.12.2007','dd.mm.yyyy') year, 34 sales from dual union all
9 select 'Omsk' city, 'DEP2' dep, to_date('31.12.2007','dd.mm.yyyy') year, 213 sales from dual union all
10 select 'Omsk' city, 'DEP3' dep, to_date('31.12.2007','dd.mm.yyyy') year, 156 sales from dual union all
11 select 'Moscow' city, 'DEP1' dep, to_date('31.12.2007','dd.mm.yyyy') year, 762 sales from dual union all
12 select 'Moscow' city, 'DEP2' dep, to_date('31.12.2007','dd.mm.yyyy') year, 657 sales from dual union all
13 select 'Moscow' city, 'DEP3' dep, to_date('31.12.2007','dd.mm.yyyy') year, 1020 sales from dual union all
14 select 'Moscow' city, 'DEP4' dep, to_date('31.12.2007','dd.mm.yyyy') year, 487 sales from dual)
15 /

Table created

SQL> commit;

Commit complete

SQL>

SQL>select * from t
2 order by 3,1,2
3 /

CITY DEP YEAR SALES
------ ---- ----------- ----------
Moscow DEP1 31.12.2006 562
Moscow DEP2 31.12.2006 457
Moscow DEP3 31.12.2006 890
Moscow DEP4 31.12.2006 345
Omsk DEP1 31.12.2006 23
Omsk DEP2 31.12.2006 154
Omsk DEP3 31.12.2006 87
Moscow DEP1 31.12.2007 762
Moscow DEP2 31.12.2007 657
Moscow DEP3 31.12.2007 1020
Moscow DEP4 31.12.2007 487
Omsk DEP1 31.12.2007 34
Omsk DEP2 31.12.2007 213
Omsk DEP3 31.12.2007 156

14 rows selected

SQL>

This is a table of sales for a company in Russia, which has it's departments in two cities: Omsk (Dep 1,2,3) and in Moscow (Dep 1,2,3,4). The data about sales is given for two years: 2006 and 2007.

Now we want to get sales for the year 2007. But we need not only detailed sales (as in a table), but also subtotals for each city. And a total for the whole company.

Non-Model Solution:
Usually such reports are created by using extensions to Group By: Cube, Rollup & Grouping Sets expression.

So our query would look like:

SQL> select city, dep, sum(sales) sales
2 from t
3 where trunc(year, 'y') = to_date('01.01.2007', 'dd.mm.yyyy')
4 group by rollup(city, dep)
5 order by 1, 2
6 /

CITY DEP SALES
------ ---- ----------
Moscow DEP1 762
Moscow DEP2 657
Moscow DEP3 1020
Moscow DEP4 487
Moscow 2926
Omsk DEP1 34
Omsk DEP2 213
Omsk DEP3 156
Omsk 403
3329

10 rows selected

SQL>

Model solution:
To create the same report using model clause, we have to execute the following query:

SQL> select city,dep,sales
2 from t
3 where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
4 model
5 dimension by (0 total,city, dep)
6 measures(sales)
7 rules upsert all
8 (sales[1,any,null]=sum(sales)[0,CV(),any],
9 sales[1,null,null]=sum(sales)[0,any,any])
10 order by 1,2
11 /

CITY DEP SALES
------ ---- ----------
Moscow DEP1 762
Moscow DEP2 657
Moscow DEP3 1020
Moscow DEP4 487
Moscow 2926
Omsk DEP1 34
Omsk DEP2 213
Omsk DEP3 156
Omsk 403
3329

10 rows selected

SQL>

Let's elaborate how it works.

Using the condition in the WHERE clause - we get only data for the year 2007 in the result.
Then we create dimensions: City and Dep - are the dimensions from the table. But we add one more dimension called Total and by default assign value 0 to all the cells for that dimension.
As a measure we take the field from our table Sales.
So at this stage we get as a result:

SQL> select *
2 from t
3 where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
4 model
5 dimension by (0 total,city, dep)
6 measures(sales)
7 ()
8 order by 2,3
9 /

TOTAL CITY DEP SALES
---------- ------ ---- ----------
0 Moscow DEP1 762
0 Moscow DEP2 657
0 Moscow DEP3 1020
0 Moscow DEP4 487
0 Omsk DEP1 34
0 Omsk DEP2 213
0 Omsk DEP3 156

7 rows selected

SQL>

Now we need to create extra lines for the subtotals and total.
We use rules for this.
You should remember that there are two ways to refer to a cell in a rule of a model clause: positional or symbolic.

Example of positional: sales['Omsk','DEP3'] or sales['Omsk','DEP4'].
When a cell which is positionally referenced exists - it's value is updated. When it doesn't exist - the model clause will create a cell with such dimensions.

Example of symbolic reference:
sales[city='Omsk',dep='DEP3'], sales[city is any,dep='DEP3'] ,sales[any,dep='DEP3'].
Actually the last one [any] is a positional reference - but I put it in the symbolic group because there's a remark in the documentation:
Note that ANY is treated as a symbolic reference even if it is specified positionally, because it really means that (dimension IS NOT NULL OR dimension IS NULL).
A typical behaviour for symbolically referenced cells: when it exists - the value gets updated. When it doesn't exist - no cells are created.

Don't forget about the cells which are referenced symbolically and positionally at the same time. By default they behave like all the references are symbolic.

Well, actually we could use 3 types of behaviour: UPDATE, UPSERT and UPSERT ALL.
For explicit determining we should place the name of behaviour directly after the RULES keyword. In this case it would be adjusted to all the rules.
If we want to specify a particular rule - we can put the name of behaviour right before the rule (we'll see how it works in the next post).

If we don't put any - the default behaviour UPSERT is used. How it works - I mentioned earlier.

When we use UPDATE - all cells are got updated, and if they don't exist - no cells are created, whichever reference (positional or symbolic) is used.

UPSERT ALL is a scpecial case - it can create cells even for symbolic reference.
But not in all cases, for example - it wouldn't create any cells if we put a rule:
sales[city='Voronezh','DEP5']=100 (because there's no city called 'Voronezh' in our table), but at the same time it would create a cell (which doesn't exist) for the rule sales[city='Omsk','DEP5']=100.
Why?
Because UPSERT ALL works in the following way (from the doc):

Step 1 Find Cells
Find the existing cells that satisfy all the symbolic predicates of the cell reference.

Step 2 Find Distinct Dimension Values
Using just the dimensions that have symbolic references, find the distinct dimension value combinations of these cells.

Step 3 Perform a Cross Product
Perform a cross product of these value combinations with the dimension values specified through positional references.

Step 4 Upsert New Cells
The results of Step 3 are then used to upsert new cells into the array

Just look:

SQL> select *
2 from t
3 where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
4 model
5 return updated rows
6 dimension by (0 total,city, dep)
7 measures(sales)
8 rules upsert all
9 (sales[0,city='Omsk','DEP5']=100,
10 sales[0,city='Voronezh','DEP5']=200)
11 order by 2,3
12 /

TOTAL CITY DEP SALES
---------- ------ ---- ----------
0 Omsk DEP5 100

SQL>

By putting RETURN UPDATED ROWS we get as a result only updated or inserted cells.
So as you can see despite we had a rule with Voronezh - no cells were inserted.

Let's get back to our subtotals and total.
We had two rules:

sales[1,any,null]=sum(sales)[0,CV(),any],
sales[1,null,null]=sum(sales)[0,any,any]

But as you remember [any] is treated as a symbolic reference - and won't create any cells if we don't put UPSERT ALL semantics.

SQL> select city,dep,sales
2 from t
3 where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
4 model return updated rows
5 dimension by (0 total,city, dep)
6 measures(sales)
7 rules
8 (sales[1,any,null]=sum(sales)[0,CV(),any])
9 order by 1,2
10 /

CITY DEP SALES
------ ---- ----------

SQL>

If we put UPSERT ALL semantics, our rules would do the following:

First rule: sales[1,any,null] will look for all dimension values that are referenced symbolically - city dimension: Omsk and Moscow.
Then create a cross product with positionally referenced dimensions: we'll get two cells - sales[1,'Omsk',null] and sales[1,'Moscow',null]
And these cells would be inserted.
BTW the dimension total=1 would mean that these are totals or subtotals.

Next rule sales[1,null,null] - will create a company total cell.
On the right side of the rules we have grouping formulas which will sum up all the sales values needed. By using CV() we reference to the dimension value of the cell on the left side of the rule.

PS
We could get the same result without Total dimension:

SQL> select city,dep,sales
2 from t
3 where trunc(year,'y') = to_date('01.01.2007','dd.mm.yyyy')
4 model
5 dimension by (city, dep)
6 measures(sales)
7 rules upsert all
8 (sales[any,null]=sum(sales)[CV(),any],
9 sales[null,null]=sum(sales)[any,null])
10 order by 1,2
11 /

CITY DEP SALES
------ ---- ----------
Moscow DEP1 762
Moscow DEP2 657
Moscow DEP3 1020
Moscow DEP4 487
Moscow 2926
Omsk DEP1 34
Omsk DEP2 213
Omsk DEP3 156
Omsk 403
3329

10 rows selected

SQL>

Here the rules are a little bit different.
The company Total is calculated from the cities subtotals, that has already been calculated, while in the main part (with Total dimension) we calculated all totals and subtotals on detailed sales data.

to be continued ...

Thursday, December 6, 2007

Summation of tree values

Input data:

SQL> drop table t;

Table dropped

SQL>
SQL> create table t as (select 1 id, 20 val from dual union all
2 select 2 id, 30 val from dual union all
3 select 3 id, 15 val from dual union all
4 select 4 id, 100 val from dual)
5 /

Table created

SQL> commit;

Commit complete

SQL> select * from t;

ID VAL
---------- ----------
1 20
2 30
3 15
4 100

SQL>

Problem:
You want to build a hierarchy on that data and get the sum of the values of all preceding ancestors including current value.

E.g. your hierarchy is:

SQL> select t1.*, level, sys_connect_by_path(id,'/') path
2 from t t1
3 connect by prior id = id - 1
4 start with id = 1
5 /

ID VAL LEVEL PATH
---------- ---------- ---------- --------------
1 20 1 /1
2 30 2 /1/2
3 15 3 /1/2/3
4 100 4 /1/2/3/4

SQL>

Solution here could probably be a scalar subquery, which will build tree in a backward direction starting from the current id and will compute sum of all the childs (ancestors in the outer query) values:

SQL> select t1.*,
2 level,
3 sys_connect_by_path(id,'/') path,
4 (select sum(val)
5 from t t2
6 connect by prior id = id + 1
7 start with t2.id = t1.id) summ
8 from t t1
9 connect by prior id = id - 1
10 start with id = 1
11 /

ID VAL LEVEL PATH SUMM
---------- ---------- ---------- ------------- ----------
1 20 1 /1 20
2 30 2 /1/2 50
3 15 3 /1/2/3 65
4 100 4 /1/2/3/4 165

SQL>

So while we have 'prior id = id - 1' rule in the outer query we put 'prior id = id + 1' in a scalar subquery to move backward.

But if the query was built in the following way:

SQL> select t1.*,
2 level,
3 sys_connect_by_path(id, '/') path
4 from t t1
5 connect by prior id < id
6 start with id = 1
7 /

ID VAL LEVEL PATH
---------- ---------- ---------- -----------------
1 20 1 /1
2 30 2 /1/2
3 15 3 /1/2/3
4 100 4 /1/2/3/4
4 100 3 /1/2/4
3 15 2 /1/3
4 100 3 /1/3/4
4 100 2 /1/4

8 rows selected

SQL>

What should we put in the connect by part of a scalar subquery?
If we reverse condition to 'prior id > id' in the subquery - we get wrong result. And it is not needed to be clarified why.

SQL> select t1.*,
2 level,
3 sys_connect_by_path(id,'/') path,
4 (select sum(val)
5 from t t2
6 connect by prior id > id
7 start with t2.id = t1.id) summ
8 from t t1
9 connect by prior id < id
10 start with id = 1
11 /

ID VAL LEVEL PATH SUMM
---------- ---------- ---------- ------------ ----------
1 20 1 /1 20
2 30 2 /1/2 50
3 15 3 /1/2/3 85
4 100 4 /1/2/3/4 255
4 100 3 /1/2/4 255
3 15 2 /1/3 85
4 100 3 /1/3/4 255
4 100 2 /1/4 255

8 rows selected

SQL>

We need to get a backward motion in a particular order. And what can show us the correct order - right, SYS_CONNECT_BY_PATH.

The solution could be:

SQL> select t1.*,
2 (select sum(val)
3 from t t2
4 connect by level <= t1.lvl
5 and id = regexp_substr(t1.path,'[^/]+',1,t1.lvl - level + 1)
6 start with t2.id = t1.id) summ
7 from (select t.*, level lvl, sys_connect_by_path(id, '/') path
8 from t
9 connect by prior id < id
10 start with id = 1) t1
11 /

ID VAL LVL PATH SUMM
---------- ---------- ---------- ------------ ----------
1 20 1 /1 20
2 30 2 /1/2 50
3 15 3 /1/2/3 65
4 100 4 /1/2/3/4 165
4 100 3 /1/2/4 150
3 15 2 /1/3 35
4 100 3 /1/3/4 135
4 100 2 /1/4 120

8 rows selected

SQL>

So at every level - we cut out and determine what the ID should be.

BTW Put these conditions exactly in that order:

...level <= t1.lvl and id = regexp_substr(t1.path,'[^/]+',1,t1.lvl - level + 1)...

if you put it

...id = regexp_substr(t1.path,'[^/]+',1,t1.lvl - level + 1)and level <= t1.lvl...

you'll get:

ORA-01428: argument '0' is out of range

PS Later I wrote another note on the same kinda a problem, which you can discover here: Reports: getting total of all children values in a tree

Sunday, December 2, 2007

Regexp 4: number of occurence (advanced)

... previous

In this post I'll talk about a technique - that I also learned from cd.
In the previous post - we got familiar with a technique to find the number of occurrencies of a substring in a string.

But there are some drawbacks, e.g.:

SQL> with t as (select '(111)111-11-11 words-non-stop(222)222-22-22' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '$\d{3}$(\d)\d{2}-\d{2}-\d{2}|.',
6 '\1')),
7 0) occurrencies
8 from t
9 /

STR OCCURRENCIES
------------------------------------------- ------------
(111)111-11-11 words-non-stop(222)222-22-22 2

SQL>

As we can see here - we find both examples of phone numbers as fitting to our requirments. Actually, yes - there are two occurrencies of our pattern. But usually when it is combined with another alphanumeric value - we don't want to take it into account. So the result should be only 1 occurrence.

Well, let's assign the task more concrete: our phone numbers should be preceded or trailed by one of the following: space, comma or semicolon. And it also can be the first or the last structure in a string.

So we may think, that if we add such symbols before and after our pattern - it can help us:

SQL> with t as (select '(111)111-11-11 words-non-stop(222)222-22-22' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '(^|[ ,;])$\d{3}$(\d)\d{2}-\d{2}-\d{2}([ ,;]|$)|.',
6 '\2')),
7 0) occurrencies
8 from t
9 /

STR OCCURRENCIES
------------------------------------------- ------------
(111)111-11-11 words-non-stop(222)222-22-22 1

SQL>

Yep, in that case it was helpful.

But if we take an example, when two phone numbers are coming one after another - separated by one space - it is not working:

SQL> with t as (select '(111)111-11-11 (222)222-22-22' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '(^|[ ,;])$\d{3}$(\d)\d{2}-\d{2}-\d{2}([ ,;]|$)|.',
6 '\2')),
7 0) occurrencies
8 from t
9 /

STR OCCURRENCIES
----------------------------- ------------
(111)111-11-11 (222)222-22-22 1

SQL>

Why? because it works in the following way:
1. Finds the first pattern '(111)111-11-11 '.
As you see - there's a space at the end.
2. Replaces it with backreference - which in this case would be '1'.
3. Then goes to the rest of the string which is '(222)222-22-22' and it is not the beginning of the string.
So such a structure doesn't suit to our pattern - because it requires - space, comma or colon at the beginning. And we got no space - cause it was taken by the previous structure '(111)111-11-11 '.
4. The function replaces this structure out.
5. We get 1 occurrence as a result, instead of 2.

What we can do here:
The technique is very interesting:
Firstly we place all the patterns - which don't fulfil our requirments. We separate them with the pipe (|), which means OR on the language of regular expressions.
Then we place our desired pattern. And at the end we place '|.' as usual :)

So the generalised pattern would look in the following way:

regexp_replace(your_string,'[symbol1](our_pattern)+|(our_pattern)[symbol2]+|(our_pattern)|.','\3')

where symbol1 - symbol or list of symbols, that can't precede our pattern,
symbol2 - symbol or list of symbols, that can't trail our pattern.

And when we determine the backreference - we would place a reference to the third our pattern.
So when the regexp engine will meet the '[symbol1](our_pattern)+' or '(our_pattern)[symbol2]+', the regexp will remove such structures (actually, it will replace it with our backreference, which points to the 3rd structure - so it would be null). And when it will meet the needed pattern (third one) - it would leave it for us.
The |. at the end - will remove everything - that doesn't suits the previous three structures.

If we get back to our example, we get:

SQL> with t as (select '(111)111-11-11 (222)222-22-22' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '[^ ,;]($\d{3}$\d{3}-\d{2}-\d{2})+|$\d{3}$\d{3}-\d{2}-\d{2}[^ ,;]+|$\d{3}$(\d)\d{2}-\d{2}-\d{2}|.',
6 '\2')),
7 0) occurrencies
8 from t
9 /

STR OCCURRENCIES
----------------------------- ------------
(111)111-11-11 (222)222-22-22 2

SQL>

Ok. Let's investigate why we put additional '+' two times.
The first one was when we put it in the first OR structure.
Let's remove it and look at the following example:

SQL> with t as (select '(111)111-11-11 (222)222-22-22 3(333)333-33-33(444)444-44-44' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '[^ ,;]($\d{3}$\d{3}-\d{2}-\d{2})|$\d{3}$\d{3}-\d{2}-\d{2}[^ ,;]+|$\d{3}$(\d)\d{2}-\d{2}-\d{2}|.',
6 '\2')),
7 0) occurrencies_INCORRECT,
8 nvl(length(regexp_replace(str,
9 '[^ ,;]($\d{3}$\d{3}-\d{2}-\d{2})+|$\d{3}$\d{3}-\d{2}-\d{2}[^ ,;]+|$\d{3}$(\d)\d{2}-\d{2}-\d{2}|.',
10 '\2')),
11 0) occurrencies_CORRECT
12 from t
13 /

STR OCCURRENCIES_INCORRECT OCCURRENCIES_CORRECT
----------------------------------------------------------- ---------------------- --------------------
(111)111-11-11 (222)222-22-22 3(333)333-33-33(444)444-44-44 3 2

SQL>

Actually, when we removed '+' (occurrencies_incorrect field) we got 3 as a result insted of 2 - why?
Well, the problem as how you guessed in the '3(333)333-33-33(444)444-44-44' part.
If we don't put '+' in the '[^ ,;]($\d{3}$\d{3}-\d{2}-\d{2})' - it would meet 3(333)333-33-33 and stop. After this structure will be removed - the regexp is going forward.
It finds (444)444-44-44.
Well it doesn't suit to the '[^ ,;]($\d{3}$\d{3}-\d{2}-\d{2})' - because there's no symbol [^ ,;] before it (we removed 3(333)333-33-33).
Then it checks whether it suits to '$\d{3}$\d{3}-\d{2}-\d{2}[^ ,;]+'. Again negative - cause there is no '[^ ,;]+' at the end. Finally it suits to our third structure '$\d{3}$(\d)\d{2}-\d{2}-\d{2}' - and that's why it hits the final result set.
If we put '+' after the first pattern - it would mean one or more occurrences of our pattern. So the whole 3(333)333-33-33(444)444-44-44 will fall under the first structure - and gets removed.

Let's continue with the second '+'.
Imagine the following situation:

SQL> with t as (select '(111)111-11-11 (222)222-22-22 (333)333-33-33A(444)444-44-44' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '[^ ,;]($\d{3}$\d{3}-\d{2}-\d{2})+|$\d{3}$\d{3}-\d{2}-\d{2}[^ ,;]|$\d{3}$(\d)\d{2}-\d{2}-\d{2}|.',
6 '\2')),
7 0) occurrencies_INCORRECT,
8 nvl(length(regexp_replace(str,
9 '[^ ,;]($\d{3}$\d{3}-\d{2}-\d{2})+|$\d{3}$\d{3}-\d{2}-\d{2}[^ ,;]+|$\d{3}$(\d)\d{2}-\d{2}-\d{2}|.',
10 '\2')),
11 0) occurrencies_CORRECT
12 from t
13 /

STR OCCURRENCIES_INCORRECT OCCURRENCIES_CORRECT
----------------------------------------------------------- ---------------------- --------------------
(111)111-11-11 (222)222-22-22 (333)333-33-33A(444)444-44-44 3 2

SQL>

So the problem again in the last two phone numbers. Now we put an extra 'A' between them. What causes the problem in that case:
If we don't place '+' after the '[^ ,;]' in the second structure,
the pattern '$\d{3}$\d{3}-\d{2}-\d{2}[^ ,;]' will meet (333)333-33-33A, with extra 'A' at the end - and removes it all.
So when it continues with the rest of the line it recognizes (444)444-44-44 as a valid phone number.
But when we add a '+' after '[^ ,;]' the pattern '$\d{3}$\d{3}-\d{2}-\d{2}[^ ,;]+' will cover the whole (333)333-33-33A(444)444-44-44, and remove it.

PS
Our mask in the regexp can be simplified a little.
In general it would look like:

regexp_replace(your_string,'[symbol1](our_pattern)+|(our_pattern)[symbol3]|.','\2')

Here everything means the same as in the previous examples - except symbol3. Now it means any symbol that can trail our pattern.
And in that case we should place the second structure as a backreference (instead of third in the previous examples).

SQL> with t as (select '(111)111-11-11 (222)222-22-22 3(333)333-33-33 (444)444-44-44' str from dual)
2 --
3 select str,
4 nvl(length(regexp_replace(str,
5 '[^ ,;]($\d{3}$\d{3}-\d{2}-\d{2})+|$\d{3}$(\d)\d{2}-\d{2}-\d{2}([ ,;]|$)|.',
6 '\2')),
7 0) occurrencies
8 from t
9 /

STR OCCURRENCIES
------------------------------------------------------------ ------------
(111)111-11-11 (222)222-22-22 3(333)333-33-33 (444)444-44-44 3

SQL>

Hope you can find out how it works by yourself now :)

Volder's Oracle Notes

Wednesday, December 19, 2007

Reports: getting total of all children values in a tree

Sunday, December 16, 2007

Reports: forecasting

Tuesday, December 11, 2007

Reports: ratios & percentages

Sunday, December 9, 2007

Reports: totals & subtotals

Thursday, December 6, 2007

Summation of tree values

Sunday, December 2, 2007

Regexp 4: number of occurence (advanced)

About me

Links

Archive

Labels

Visitors Info

Social Networking