Wednesday, April 30, 2008

Wrong encoding with HTTPURITYPE()

Several days ago my colleague asked me - do you have any idea why the following query returns Currency Name in the wrong encoding:
SQL> SELECT EXTRACTVALUE(VALUE(P), '/Valute/NumCode') NUMCODE,
2 EXTRACTVALUE(VALUE(P), '/Valute/CharCode') CHARCODE,
3 EXTRACTVALUE(VALUE(P), '/Valute/Nominal') NOMINAL,
4 EXTRACTVALUE(VALUE(P), '/Valute/Name') NAME,
5 EXTRACTVALUE(VALUE(P), '/Valute/Value') VALUE
6 FROM TABLE(XMLSEQUENCE(HTTPURITYPE('http://www.cbr.ru/scripts/XML_daily.asp?date_req=22.03.2008')
7 .GETXML().EXTRACT('//ValCurs/Valute'))) P
8 WHERE EXTRACTVALUE(VALUE(P), '/Valute/CharCode') = 'USD'
9 /

NUMCODE CHARCODE NOMINAL NAME VALUE
-------------------- -------------------- -------------------- -------------------- --------------------
840 USD 1 Aieea? NOA 23,7773

SQL>
It is a link to our central bank site where rates of different currencies are listed in rouble equivalent.

I have not come across such stuff before - so after reading manual on this topic, I found out that the encoding is not sent explicitly from this cbr http url in the content-type:
SQL> set serveroutput on
SQL>
SQL> DECLARE
2 httpuri HTTPURIType;
3 BEGIN
4 httpuri := HTTPURIType('http://www.rambler.ru/news/top/rss.html');
5 DBMS_OUTPUT.put_line('rambler: ' || httpuri.getContentType());
6 httpuri := HTTPURIType('http://www.cbr.ru/scripts/XML_daily.asp?date_req=22.03.2008');
7 DBMS_OUTPUT.put_line('cbr: ' || httpuri.getContentType());
8 END;
9 /

rambler: text/xml; charset=windows-1251
cbr: text/xml

PL/SQL procedure successfully completed

SQL>
as we can find from the documentation:
The HTTPURITYPE is a subtype of the UriType that provides support for the HTTP protocol. This uses the UTL_HTTP package underneath to access the HTTP URLs.

And in the chapter about UTL_HTTP we can find that:
Following the HTTP protocol standard specification, if the media type of a request or a response is text, but the character set information is missing in the Content-Type header, the character set of the request or response body should default to ISO-8859-1. A response created for a request inherits the default body character set of the request instead of the body character set of the current session. The default body character set is ISO-8859-1 in a database user session.

So this is what happens in our case - default encoding is ISO-8859-1 and names of currencies are wrongly displayed.

The easiest way is to specify the default charset with SET_BODY_CHARSET:
SQL> set colwidth 20
SQL> select utl_http.get_body_charset() from dual;

UTL_HTTP.GET_BODY_CH
--------------------
ISO-8859-1

SQL>
SQL> select * from v$nls_valid_values
2 where value like '%1251%'
3 /

PARAMETER VALUE ISDEPRECATED
-------------------- -------------------- ------------
CHARACTERSET CL8MSWIN1251 FALSE

SQL> exec utl_http.set_body_charset('CL8MSWIN1251');

PL/SQL procedure successfully completed

SQL> select utl_http.get_body_charset() from dual;

UTL_HTTP.GET_BODY_CH
--------------------
CL8MSWIN1251

SQL>
SQL> SELECT EXTRACTVALUE(VALUE(P), '/Valute/NumCode') NUMCODE,
2 EXTRACTVALUE(VALUE(P), '/Valute/CharCode') CHARCODE,
3 EXTRACTVALUE(VALUE(P), '/Valute/Nominal') NOMINAL,
4 EXTRACTVALUE(VALUE(P), '/Valute/Name') NAME,
5 EXTRACTVALUE(VALUE(P), '/Valute/Value') VALUE
6 FROM TABLE(XMLSEQUENCE(HTTPURITYPE('http://www.cbr.ru/scripts/XML_daily.asp?date_req=22.03.2008')
7 .GETXML().EXTRACT('//ValCurs/Valute'))) P
8 WHERE EXTRACTVALUE(VALUE(P), '/Valute/CharCode') = 'USD'
9 /

NUMCODE CHARCODE NOMINAL NAME VALUE
-------------------- -------------------- -------------------- -------------------- --------------------
840 USD 1 Доллар США 23,7773

SQL>
So now the currency name is properly displayed. But this method affects only current session. And this default charset will be left for future requests also.

Another way - is a little bit difficult.
It is based on the fact, that in case you pump blob from the http link - the result encoding is not changed:
URIType Methods:
getBLOB()Returns the target of the URI as a BLOB value. No character conversion is performed, and the character encoding is that of the URI target. This method can also be used to fetch binary data.

So the idea is to transform this blob value into XML with the needed encoding.
I found a link from the OTN forum, which can be very useful: http://forums.oracle.com/forums/thread.jspa?messageID=1864895�
We can easily create a function, that will take as an input the blob value and the needed encoding for conversion and would return as a result XMLType value:
SQL> create or replace function BLOB2XML(l_blob blob, l_charset varchar2)
2 return xmltype is
3 l_clob clob;
4 l_src_offset number;
5 l_dest_offset number;
6 l_blob_csid number := nvl(nls_charset_id(l_charset), dbms_lob.default_csid);
7 v_lang_context number := dbms_lob.default_lang_ctx;
8 l_warning number;
9 l_amount number;
10 begin
11 dbms_lob.createtemporary(l_clob, true);
12 l_src_offset := 1;
13 l_dest_offset := 1;
14 l_amount := dbms_lob.getlength(l_blob);
15 dbms_lob.converttoclob(l_clob,
16 l_blob,
17 l_amount,
18 l_dest_offset,
19 l_src_offset,
20 l_blob_csid,
21 v_lang_context,
22 l_warning);
23 return xmltype(l_clob);
24 end;
25 /

Function created

SQL> show err blob2xml;
No errors for FUNCTION SYSTEM.BLOB2XML

SQL> set colwidth 20
SQL> select utl_http.get_body_charset() from dual;

UTL_HTTP.GET_BODY_CH
--------------------
ISO-8859-1

SQL>
SQL> SELECT EXTRACTVALUE(VALUE(P), '/Valute/NumCode') NUMCODE,
2 EXTRACTVALUE(VALUE(P), '/Valute/CharCode') CHARCODE,
3 EXTRACTVALUE(VALUE(P), '/Valute/Nominal') NOMINAL,
4 EXTRACTVALUE(VALUE(P), '/Valute/Name') NAME,
5 EXTRACTVALUE(VALUE(P), '/Valute/Value') VALUE
6 FROM TABLE(XMLSEQUENCE(blob2XML(HTTPURITYPE('http://www.cbr.ru/scripts/XML_daily.asp?date_req=22.03.2008')
7 .GETblob(),'CL8MSWIN1251').EXTRACT('//ValCurs/Valute'))) P
8 WHERE EXTRACTVALUE(VALUE(P), '/Valute/CharCode') = 'USD'
9 /

NUMCODE CHARCODE NOMINAL NAME VALUE
-------------------- -------------------- -------------------- -------------------- --------------------
840 USD 1 Доллар США 23,7773

SQL> select utl_http.get_body_charset() from dual;

UTL_HTTP.GET_BODY_CH
--------------------
ISO-8859-1

SQL>
As we can see the default charset for UTL_HTTP package has not being changed.
Just to be sure that the problem was in encoding - let's try our query with ISO-8859-1 charset:
SQL> select * from v$nls_valid_values
2 where upper(value) like '%ISO%8859%1'
3 /

PARAMETER VALUE ISDEPRECATED
-------------------- -------------------- ------------
CHARACTERSET WE8ISO8859P1 FALSE

SQL>
SQL> SELECT EXTRACTVALUE(VALUE(P), '/Valute/NumCode') NUMCODE,
2 EXTRACTVALUE(VALUE(P), '/Valute/CharCode') CHARCODE,
3 EXTRACTVALUE(VALUE(P), '/Valute/Nominal') NOMINAL,
4 EXTRACTVALUE(VALUE(P), '/Valute/Name') NAME,
5 EXTRACTVALUE(VALUE(P), '/Valute/Value') VALUE
6 FROM TABLE(XMLSEQUENCE(blob2XML(HTTPURITYPE('http://www.cbr.ru/scripts/XML_daily.asp?date_req=22.03.2008')
7 .GETblob(),'WE8ISO8859P1').EXTRACT('//ValCurs/Valute'))) P
8 WHERE EXTRACTVALUE(VALUE(P), '/Valute/CharCode') = 'USD'
9 /

NUMCODE CHARCODE NOMINAL NAME VALUE
-------------------- -------------------- -------------------- -------------------- --------------------
840 USD 1 Aieea? NOA 23,7773

SQL>
Yes, this is exactly what we got at our initial query.

Saturday, March 22, 2008

Fragments of string between special patterns

Although my solution wasn't rated highly on the forum, I decided to put it here :)

Task:
There is a string value given, comprising a number of elements that are put into particular patterns on both sides.
We need to get all such patterns and inside value out of the string.
E.g. the beginning pattern is '<<<' and the ending pattern is '>>>'.

Solution:
SQL> with t as (select 'xyz<<<testdata>>>123' col1 from dual union all
2 select 'zzz<<<test><<<data>>>ssf' from dual union all
3 select 'sss<<<test><data>>>sffsd' from dual union all
4 select 'ggg<<<test>>>34345<<<data>>>lks' from dual)
5 --
6 select t.*,
7 reverse(regexp_replace(reverse(regexp_replace(col1, '(<<<.*?>>>)|.', '\1')),
8 '(>>>.*?<<<)|.', '\1')) col2
9 from t;

COL1 COL2
------------------------------- ----------------------------------
xyz<<<testdata>>>123 <<<testdata>>>
zzz<<<test><<<data>>>ssf <<<data>>>
sss<<<test><data>>>sffsd <<<test><data>>>
ggg<<<test>>>34345<<<data>>>lks <<<test>>><<<data>>>

SQL>

Explanation:
The first thing what we are doing - is getting all patterns that start from '<<<' and end with '>>>'. By using non-greedy search we have as a result not only one long value, but separated patterns (it is well seen in the fourth row):
SQL> with t as (select 'xyz<<<testdata>>>123' col1 from dual union all
2 select 'zzz<<<test><<<data>>>ssf' from dual union all
3 select 'sss<<<test><data>>>sffsd' from dual union all
4 select 'ggg<<<test>>>34345<<<data>>>lks' from dual)
5 --
6 select t.*,
7 regexp_replace(col1, '(<<<.*?>>>)|.', '\1') col2
8 from t;

COL1 COL2
------------------------------- ----------------------------------
xyz<<<testdata>>>123 <<<testdata>>>
zzz<<<test><<<data>>>ssf <<<test><<<data>>>
sss<<<test><data>>>sffsd <<<test><data>>>
ggg<<<test>>>34345<<<data>>>lks <<<test>>><<<data>>>

SQL>

This technique of getting only desired pattern from the string is described here.

But the result we got is not 100% what were going to receive.
As you can see in the second line there is a resulting value '<<<test><<<data>>>', which is not correct. Because there is an opening pattern '<<<' is met inside the value. So it looks as only '<<<data>>>' should be returned.
For that purpose we need to read the string starting from the end and moving in a backward direction to the very beginning of the value.
In such cases using REVERSE() function could be a solution, although it is not documented. So first we reverse the string and secondly we pass it in the usual way from beginning to the end.
And by the way as we have it reversed all our patterns should also be reversed: the first one '>>>' would be met and then closing '<<<'. In the end we just put the second REVERSE() to get the initial direction of the string:
SQL> with t as (select 'xyz<<<testdata>>>123' col1 from dual union all
2 select 'zzz<<<test><<<data>>>ssf' from dual union all
3 select 'sss<<<test><data>>>sffsd' from dual union all
4 select 'ggg<<<test>>>34345<<<data>>>lks' from dual)
5 --
6 select t.*,
7 regexp_replace(col1, '(<<<.*?>>>)|.', '\1') col2,
8 reverse(regexp_replace(col1, '(<<<.*?>>>)|.', '\1')) col3,
9 regexp_replace(reverse(regexp_replace(col1, '(<<<.*?>>>)|.', '\1')),'(>>>.*?<<<)|.', '\1') col4,
10 reverse(regexp_replace(reverse(regexp_replace(col1, '(<<<.*?>>>)|.', '\1')),'(>>>.*?<<<)|.', '\1')) col5
11 from t;

COL1 COL2 COL3 COL4 COL5
------------------------------- ------------------------ ------------------------ ---------------------- ---------------------
xyz<<<testdata>>>123 <<<testdata>>> >>>atadtset<<< >>>atadtset<<< <<<testdata>>>
zzz<<<test><<<data>>>ssf <<<test><<<data>>> >>>atad<<<>tset<<< >>>atad<<< <<<data>>>
sss<<<test><data>>>sffsd <<<test><data>>> >>>atad<>tset<<< >>>atad<>tset<<< <<<test><data>>>
ggg<<<test>>>34345<<<data>>>lks <<<test>>><<<data>>> >>>atad<<<>>>tset<<< >>>atad<<<>>>tset<<< <<<test>>><<<data>>>

SQL>

As you can see the redundant part '<<<test>' have dissapeared from the line #2.

PS
The opening pattern and closing pattern could be any, e.g. '<<~' as opening and '><<-' as closing, in this case the query would look like:
SQL> with t as (select 'xyz<<~testdata><<-123' col1 from dual union all
2 select 'zzz<<~test><<-data><<-ssf' from dual union all
3 select 'sss<<~test>>>><<-data>>~sffsd' from dual union all
4 select 'ggg<<~test><<-34345<<~data<<~data2><<-lks' from dual)
5 --
6 select t.*,
7 reverse(regexp_replace(reverse(regexp_replace(col1, '(<<~.*?><<-)|.', '\1')),'(-<<>.*?~<<)|.', '\1')) col2
8 from t;

COL1 COL2
----------------------------------------- -----------------------------------
xyz<<~testdata><<-123 <<~testdata><<-
zzz<<~test><<-data><<-ssf <<~test><<-
sss<<~test>>>><<-data>>~sffsd <<~test>>>><<-
ggg<<~test><<-34345<<~data<<~data2><<-lks <<~test><<-<<~data2><<-

SQL>