Friday, January 11, 2008

Grouping overlapping number intervals, date periods etc...

Today I spent some time on a quite simple query. I'm speaking about this thread on OTN.

It is a simple task - to get overlapping intervals grouped. Sometimes this problem is formulated as to find a continious date period out of several separate periods.
Simple solution here would be to use analytic functions an so called start_of_group type of query:
SQL> with mytable as (select 1 id, 1 begin_data, 10 end_data from dual union all
2 select 1 id, 5 begin_data, 7 end_data from dual union all
3 select 1 id, 4 begin_data, 8 end_data from dual union all
4 select 1 id, 11 begin_data, 15 end_data from dual union all
5 select 1 id, 13 begin_data, 18 end_data from dual union all
6 select 2 id, 1 begin_data, 18 end_data from dual union all
7 select 2 id, 13 begin_data, 23 end_data from dual union all
8 select 2 id, 31 begin_data, 34 end_data from dual)
9 select id, min(begin_data) beg_d, max(end_data) end_d
10 from (select t1.*,
11 sum(start_of_group) over(partition by id order by begin_data, end_data) gr
12 from (select t.*,
13 case
14 when begin_data >
15 nvl(max(end_data)
16 over(partition by id order by begin_data,end_data
17 rows between unbounded preceding and 1 preceding),
18 begin_data-1)
19 then 1
20 else 0
21 end start_of_group
22 from mytable t) t1)
23 group by id, gr
24 order by 1, 2
25 /

ID BEG_D END_D
---------- ---------- ----------
1 1 10
1 11 18
2 1 23
2 31 34

SQL>

BTW if the table contains duplicate rows - I'd better add some unique key (e.g. add the most inner inline view with row_number() over() in there), and then use this row_number value to get rows sorted in the same order at all levels of query.

I'm speaking about the following case:
ID BEGIN_DATA END_DATA
---------- ---------- ----------
1 1 10
1 2 7
1 11 15
1 11 15
1 14 17

Although, I couldn't simulate such a situation, but it is not clear for me, if there are two separate (same) sortings in the query, would oracle sort rows equally, when there are duplicate rows.

I mean can't it happen, that while getting the value of start_of_group the Oracle gives a value of 1 to one row of the duplicate rows, and while summing this start_of_group exchange the places of these rows:
ID BEGIN_DATA END_DATA START_OF_GROUP
---------- ---------- ---------- --------------
1 1 10 1
1 2 7 0
1 11 15 0 <--+
|
1 11 15 1 <--+
1 14 17 0

In that case, the overall result would be wrong.
If you know how Oracle will behave in such a case, please, leave a comment.

PS
A small test case shows that you shouldn't rely on it:
SQL> with test as (select 1 id, 1 val from dual union all
2 select 1 id, 10 val from dual union all
3 select 1 id, 5 val from dual union all
4 select 1 id, 7 val from dual)
5 --
6 select t2.*, row_number() over(order by id) rn3
7 from (select t1.*, row_number() over(order by id, val) rn2
8 from (select test.*, row_number() over(order by id) rn1 from test) t1
9 order by val) t2
10 /

ID VAL RN1 RN2 RN3
---------- ---------- ---------- ---------- ----------
1 1 1 1 1
1 10 4 4 2
1 7 2 3 3
1 5 3 2 4

SQL>

row_number() over(order by id) is absoultely the same in the most inner query and the most outer. But produce different results, because another sortings happened between these two layers. So even if there are no sortings between them happen, I'd shurely won't rely on it, and rewrite the query in the way I mentioned before:
select id, min(begin_data) beg_d, max(end_data) end_d
from (select t1.*,
sum(start_of_group) over(partition by id order by rn) gr
from (select t.*,
case
when begin_data >
nvl(max(end_data)
over(partition by id order by rn
rows between unbounded preceding and 1 preceding),
begin_data - 1)
then 1
else 0
end start_of_group
from (select tt.*,
row_number() over(partition by id order by begin_data, end_data) rn
from mytable tt) t) t1)
group by id, gr
order by 1, 2

0 comments: