Saturday, March 22, 2008

Fragments of string between special patterns

Although my solution wasn't rated highly on the forum, I decided to put it here :)

Task:
There is a string value given, comprising a number of elements that are put into particular patterns on both sides.
We need to get all such patterns and inside value out of the string.
E.g. the beginning pattern is '<<<' and the ending pattern is '>>>'.

Solution:
SQL> with t as (select 'xyz<<<testdata>>>123' col1 from dual union all
2 select 'zzz<<<test><<<data>>>ssf' from dual union all
3 select 'sss<<<test><data>>>sffsd' from dual union all
4 select 'ggg<<<test>>>34345<<<data>>>lks' from dual)
5 --
6 select t.*,
7 reverse(regexp_replace(reverse(regexp_replace(col1, '(<<<.*?>>>)|.', '\1')),
8 '(>>>.*?<<<)|.', '\1')) col2
9 from t;

COL1 COL2
------------------------------- ----------------------------------
xyz<<<testdata>>>123 <<<testdata>>>
zzz<<<test><<<data>>>ssf <<<data>>>
sss<<<test><data>>>sffsd <<<test><data>>>
ggg<<<test>>>34345<<<data>>>lks <<<test>>><<<data>>>

SQL>

Explanation:
The first thing what we are doing - is getting all patterns that start from '<<<' and end with '>>>'. By using non-greedy search we have as a result not only one long value, but separated patterns (it is well seen in the fourth row):
SQL> with t as (select 'xyz<<<testdata>>>123' col1 from dual union all
2 select 'zzz<<<test><<<data>>>ssf' from dual union all
3 select 'sss<<<test><data>>>sffsd' from dual union all
4 select 'ggg<<<test>>>34345<<<data>>>lks' from dual)
5 --
6 select t.*,
7 regexp_replace(col1, '(<<<.*?>>>)|.', '\1') col2
8 from t;

COL1 COL2
------------------------------- ----------------------------------
xyz<<<testdata>>>123 <<<testdata>>>
zzz<<<test><<<data>>>ssf <<<test><<<data>>>
sss<<<test><data>>>sffsd <<<test><data>>>
ggg<<<test>>>34345<<<data>>>lks <<<test>>><<<data>>>

SQL>

This technique of getting only desired pattern from the string is described here.

But the result we got is not 100% what were going to receive.
As you can see in the second line there is a resulting value '<<<test><<<data>>>', which is not correct. Because there is an opening pattern '<<<' is met inside the value. So it looks as only '<<<data>>>' should be returned.
For that purpose we need to read the string starting from the end and moving in a backward direction to the very beginning of the value.
In such cases using REVERSE() function could be a solution, although it is not documented. So first we reverse the string and secondly we pass it in the usual way from beginning to the end.
And by the way as we have it reversed all our patterns should also be reversed: the first one '>>>' would be met and then closing '<<<'. In the end we just put the second REVERSE() to get the initial direction of the string:
SQL> with t as (select 'xyz<<<testdata>>>123' col1 from dual union all
2 select 'zzz<<<test><<<data>>>ssf' from dual union all
3 select 'sss<<<test><data>>>sffsd' from dual union all
4 select 'ggg<<<test>>>34345<<<data>>>lks' from dual)
5 --
6 select t.*,
7 regexp_replace(col1, '(<<<.*?>>>)|.', '\1') col2,
8 reverse(regexp_replace(col1, '(<<<.*?>>>)|.', '\1')) col3,
9 regexp_replace(reverse(regexp_replace(col1, '(<<<.*?>>>)|.', '\1')),'(>>>.*?<<<)|.', '\1') col4,
10 reverse(regexp_replace(reverse(regexp_replace(col1, '(<<<.*?>>>)|.', '\1')),'(>>>.*?<<<)|.', '\1')) col5
11 from t;

COL1 COL2 COL3 COL4 COL5
------------------------------- ------------------------ ------------------------ ---------------------- ---------------------
xyz<<<testdata>>>123 <<<testdata>>> >>>atadtset<<< >>>atadtset<<< <<<testdata>>>
zzz<<<test><<<data>>>ssf <<<test><<<data>>> >>>atad<<<>tset<<< >>>atad<<< <<<data>>>
sss<<<test><data>>>sffsd <<<test><data>>> >>>atad<>tset<<< >>>atad<>tset<<< <<<test><data>>>
ggg<<<test>>>34345<<<data>>>lks <<<test>>><<<data>>> >>>atad<<<>>>tset<<< >>>atad<<<>>>tset<<< <<<test>>><<<data>>>

SQL>

As you can see the redundant part '<<<test>' have dissapeared from the line #2.

PS
The opening pattern and closing pattern could be any, e.g. '<<~' as opening and '><<-' as closing, in this case the query would look like:
SQL> with t as (select 'xyz<<~testdata><<-123' col1 from dual union all
2 select 'zzz<<~test><<-data><<-ssf' from dual union all
3 select 'sss<<~test>>>><<-data>>~sffsd' from dual union all
4 select 'ggg<<~test><<-34345<<~data<<~data2><<-lks' from dual)
5 --
6 select t.*,
7 reverse(regexp_replace(reverse(regexp_replace(col1, '(<<~.*?><<-)|.', '\1')),'(-<<>.*?~<<)|.', '\1')) col2
8 from t;

COL1 COL2
----------------------------------------- -----------------------------------
xyz<<~testdata><<-123 <<~testdata><<-
zzz<<~test><<-data><<-ssf <<~test><<-
sss<<~test>>>><<-data>>~sffsd <<~test>>>><<-
ggg<<~test><<-34345<<~data<<~data2><<-lks <<~test><<-<<~data2><<-

SQL>