Wednesday, October 3, 2007

Removing duplicate elements from the string

Hi there! :)
Well, this is actually my first post in this newly created blog.
The idea of keeping a blog was born a day ago, when there was a question on the Oracle forum (this thread). I’ve met the same one only a few days ago – but couldn’t find it, cause there were some problems with the search engine I usually face when I need to find something :))
So I decided – why don’t I keep my own blog – where I can post such solutions that can be interesting for others. And here we are!

Ok, let's go back to the problem, mentioned in those links.

Problem description:

Input data:

There is a string containing elements. They can be separated with any symbol, e.g. a comma:

SQL> with t as (select 'elem1, elem2, elem3, elem1, elem3, elem2, elem2' str
from dual)
--
select * from t
/

STR
-----------------------------------------------
elem1, elem2, elem3, elem1, elem3, elem2, elem2

SQL>

...or any other clear way, for example, it was formulated as "3 letter codes". So in this case there are no element delimiters, but we know that each element is of three same consequtive letters, e.g.:

SQL> with t as (select 'AAABBBCCCBBBDDDAAAEEEBBB' str from dual)
--
select * from t
/

STR
------------------------
AAABBBCCCBBBDDDAAAEEEBBB

SQL>

Goal:

The objective is to remove duplicate elements out of the string, to leave only one specimen of each element.

Expected output:

So in the first case the result should be:
elem1, elem2, elem3

In the second:
AAABBBCCCDDDEEE

Solution:

In both cases the first thing what we need - is to pick out elements from the string.
Regular expressions are very friendly for us in that job (espesially, when there are no delimiters).
So elements like AAA, BBB, CCC etc. can be written as '([[:alpha:]])\1{2}' - any letter trailed by two same letters.

Now we want to understand - what would be the mask for such an element followed by another elements and again the same first element, e.g. AAABBBCCCAAA, or it can be followed directly by the same element: AAAAAA.
In language of regular expressions it would look like '(([[:alpha:]])\2{2}).*\1'

That's it!
Now if we put regexp_replace(str, '(([[:alpha:]])\2{2})(.*)\1','\1\3') we would throw out one last element which is met firstly and has duplicate values in the string. It would be last, because we used greedy operator '*'.

So for example:
SQL> with t as (select 'AAABBBCCCBBBEEEDDDAAAEEEBBBEEE' str from dual)
--
select str,
regexp_replace(str, '(([[:alpha:]])\2{2})(.*)\1','\1\3') new_str
from t
/

STR NEW_STR
------------------------------ -------------------------------
AAABBBCCCBBBEEEDDDAAAEEEBBBEEE AAABBBCCCBBBEEEDDDEEEBBB

SQL>
What is the logic of this operation:

  1. We look for the first element which has duplicated values. In our case it is AAA.
  2. Then regular expression operator finds the last AAA met in the string and removes it.
  3. Then, it goes to the rest of the string and again finds the first element, which has duplicated values in the rest of the string, now it is EEE.
  4. And finally removes the last EEE element.
  5. In our example it is the end of the operations, but if the string is longer it would proceed the previously mentioned operations again and again.
It would be more comprehensible if we mark the first met element with green, and the last, which would be removed, with red:
'AAABBBCCCBBBEEEDDDAAAEEEBBBEEE'.
As you can see, the first EEE element is not really the first EEE element in the string. It is the first one in the rest of the string after we removed AAA element.

If we iteratively apply this regular expression to our string - finally we remove all duplicated elements. But about it later. Now let's improve our expression a little bit.
First we'll change greedy operator '*' to non-greedy '*?'. So that we will do our job in fewer iterations.

Let's look at the following example:
we have string 'AAABBBAAAAAACCCAAA'.
with greedy regexp_replace(str, '(([[:alpha:]])\2{2})(.*)\1','\1\3') we'll have iterations:

  1. AAABBBAAAAAACCCAAA
  2. AAABBBAAAAAACCC
  3. AAABBBAAACCC
result: 'AAABBBCCC'. So it took us 3 times to iterate.
And with the non-greedy regexp_replace(str, '(([[:alpha:]])\2{2})(.*?)\1','\1\3') we'll have iterations:

  1. AAABBBAAAAAACCCAAA
  2. AAABBBAAACCC
The same result and achieved in two iterations.

Let's make one more improvement: add '+' after the backreference '\1'. In the language of regular expressions - it means one or more occurencies of the element which is preceding this '+'.
Let's imagine we have a string: 'AAABBBAAAAAAAAAAAA'.
With regexp_replace(str, '(([[:alpha:]])\2{2})(.*?)\1','\1\3') we'll make three iterations:

  1. AAABBBAAAAAAAAAAAA
  2. AAABBBAAAAAA
  3. AAABBBAAA
If we place '+' and use regexp_replace(str, '(([[:alpha:]])\2{2})(.*?)\1+','\1\3') we'll have to make only one iteration:

  1. AAABBBAAAAAAAAAAAA
Finally, it is a time for implementing an iterative mechanism of applying the same function to the string. Starting from Oracle version 10g there was a nice Model clause introduced, which can be used to proceed operations iteratively.

So, the final query would look like:
SQL> WITH t AS (SELECT
'AAABBBCCCBBBDDDAAAEEEBBB' str FROM dual)
--
select str, str_new from t
model
dimension by (0 dim)
measures(str, str str_new)
rules iterate(100) until (str_new[0] = previous(str_new[0]))
(str_new[0]=regexp_replace(str_new[0],'(([[:alpha:]])\2{2})(.*?)\1+','\1\3'));

STR STR_NEW
------------------------ ------------------------
AAABBBCCCBBBDDDAAAEEEBBB AAABBBCCCDDDEEE

SQL>

Two words about Model clause here:

  • With regexp_replace you are already familiar. It is applyed to the string during each iteration;
  • iterate(100) means maximum of iterations could be proceeded is 100 (it can be increased if you want);
  • until (str[0] = previous(str[0])) means stop iterations when the string is not changed during the previous iteration.
Well, that's it.

Now let's take the case when there is a delimited string, e.g. with commas. In that case we need to change our regular expression a little bit. Aketi Jyuuzou made this in one of the mentioned links. I just clarify it for readers. So for a string like 'elem1, elem2, elem3, elem1, elem3, elem2, elem2' we'll need the following:
regexp_replace(str,'(^|,)([^,]+,)(.*?,)?\2+','\1\2\3')
I just added non-greedy '*?' and '+' in the end, but you have already read about the impact of these. So you understand what are they needed for.

Backreference \1 stands for (^|,) - this is the symbol before the first met duplicated element. It is either begining of the string (^), either a comma trailing the previous element (,).
Backreference \2 stands for the ([^,]+,) - this is the element itself, which means one or more non-comma symbols followed by a comma.
Backreference \3 stands for (.*?,)? which is the minimum (non-greedy) number of symbols before one or more duplicated value (\2+).

So the final query would look like:

SQL> WITH t AS (SELECT 'elem1,elem2,elem3,elem1,elem3,elem2,elem2' str FROM dual)
--
select str, rtrim(str_new,',') new_str from t
model
dimension by (0 dim)
measures(str, str||',' str_new)
rules iterate(100) until (str_new[0] = previous(str_new[0]))
(str_new[0]=regexp_replace(str_new[0],'(^|,)([^,]+,)(.*?,)?\2+','\1\2\3'));

STR NEW_STR
----------------------------------------- -------------------------
elem1,elem2,elem3,elem1,elem3,elem2,elem2 elem1,elem2,elem3

SQL>


Hope it was useful!

PS
It is my first post, so I'll be grateful if you post your comments and notices about it. Was it too comlicatedely stated or maybe too detailed. Well,waiting for your replies :)

20 comments:

Andrey Ermakov said...

Hi, countryman!:)))
I watching for your posts in oracle.com and sql.ru:)
And I learn from this posts. Thank you, Volder!
You are right that you made this blog, so to hold!;) Please, forgive me for my english.
P.S. xymbo

Andrey Ermakov said...
This comment has been removed by the author.
Anonymous said...

Very good explanation. Really appreciate your answers on OTN.

SnippetyJoe said...

Hi Volder,

Glad to see you have your own blog now! I'm looking forward to following it. I'm sure there are lots of clever and insightful solutions yet to come, just like your OTN posts. :-)

--
Joe Fuda
SQL Snippets

Rob van Wijk said...

Hi Volder,

Good to see you started blogging too. Knowing the quality of your posts on the SQL and PL/SQL forum, I look forward to see more of them here.

And if you are posting frequently, then I'll even put a link from my own blog to here :-)

Regards,
Rob.

Anonymous said...

I was searching for it and you had it.

Good Job in explaining the details.

Thanks for the blog.

Just me said...

This one was really good and very helpful. I have a similar requirement and now I am trying to incultate the same trick in my sql.

Anonymous said...

if i want to use '/' instead of ',' as a separator then what I have to do for that...

Volder said...

you need to make some amendments in regex - instead of comma to use a slash.
I'm sure you will find the way - if you know what regular expressions are.

Anonymous said...

I want to use the last sql statement (WITH t) in the Stored Procedure. But it is giving error. Can you tell me how to do that?

Volder said...

having information you gave - sorry, I cannot.

with such a question you better go to Oracle forum - where you will for sure get the answer.

hello_earth said...

thanks

arul said...

thank you for your great posts. i am trying to remove only consecutive duplicates from a + delimited string like 'ipd_vupp_prem3_8_d_d_es+ipd_vupp_prem3_8_d_d_es+ipd_vupp_prem3_9_d_d_es+ipd_vupp_prem3_8_d_d_es' where i want the output to look like 'ipd_vupp_prem3_8_d_d_es+ipd_vupp_prem3_9_d_d_es+ipd_vupp_prem3_8_d_d_es'

gcshekar@yahoo.com said...

Thanks for this very helpful solution.
Regards
GC Shekar

RAJAT ARORA said...

Hi Volder,

helpful article!

I was using your last Query for removing duplicates in a string with delimiter. It works fine as a SQL
but when i tried using it in a store Procedure , i get compilation error as "(S42) Expecting: ) , CONNECT CROSS FOR FULL GROUP HAVING INNER INTERSECT JOIN LEFT MINUS NATURAL ORDER RIGHT START UNION WHERE" at the following line :

dimension by (0 dim).

Couldn't find much on internet about it. Please see if u can help!

Thanks,
Rajat

Anonymous said...

This was very usefulfulful!!! Thank you very very much much Sir.

aws said...

Thank you for this, the fastest and least resource consuming solution.

Thank you again!

Aru said...

it's really working is pl\sql code to do the same

Anonymous said...

nice explanation..and very helpful..thank you

Caloveechy said...

Hey there, great stuff!! Question, how about removing duplicates when the last letter of the word is the same letter of the next word?

For instance: 4YR,CAS,SUB1

In this instance, the following results: 4YR,CASUB1

This happens because of the 'S' at the end of 'CAS' and at the beginning of 'SUB1'; how can you combat this? I've looked ALL OVER the internet and it seems like it's not doable unless you create a function and I'd like to avoid that if possible.

Thanks!