Wednesday, April 30, 2008

Wrong encoding with HTTPURITYPE()

Several days ago my colleague asked me - do you have any idea why the following query returns Currency Name in the wrong encoding:
SQL> SELECT EXTRACTVALUE(VALUE(P), '/Valute/NumCode') NUMCODE,
2 EXTRACTVALUE(VALUE(P), '/Valute/CharCode') CHARCODE,
3 EXTRACTVALUE(VALUE(P), '/Valute/Nominal') NOMINAL,
4 EXTRACTVALUE(VALUE(P), '/Valute/Name') NAME,
5 EXTRACTVALUE(VALUE(P), '/Valute/Value') VALUE
6 FROM TABLE(XMLSEQUENCE(HTTPURITYPE('http://www.cbr.ru/scripts/XML_daily.asp?date_req=22.03.2008')
7 .GETXML().EXTRACT('//ValCurs/Valute'))) P
8 WHERE EXTRACTVALUE(VALUE(P), '/Valute/CharCode') = 'USD'
9 /

NUMCODE CHARCODE NOMINAL NAME VALUE
-------------------- -------------------- -------------------- -------------------- --------------------
840 USD 1 Aieea? NOA 23,7773

SQL>
It is a link to our central bank site where rates of different currencies are listed in rouble equivalent.

I have not come across such stuff before - so after reading manual on this topic, I found out that the encoding is not sent explicitly from this cbr http url in the content-type:
SQL> set serveroutput on
SQL>
SQL> DECLARE
2 httpuri HTTPURIType;
3 BEGIN
4 httpuri := HTTPURIType('http://www.rambler.ru/news/top/rss.html');
5 DBMS_OUTPUT.put_line('rambler: ' || httpuri.getContentType());
6 httpuri := HTTPURIType('http://www.cbr.ru/scripts/XML_daily.asp?date_req=22.03.2008');
7 DBMS_OUTPUT.put_line('cbr: ' || httpuri.getContentType());
8 END;
9 /

rambler: text/xml; charset=windows-1251
cbr: text/xml

PL/SQL procedure successfully completed

SQL>
as we can find from the documentation:
The HTTPURITYPE is a subtype of the UriType that provides support for the HTTP protocol. This uses the UTL_HTTP package underneath to access the HTTP URLs.

And in the chapter about UTL_HTTP we can find that:
Following the HTTP protocol standard specification, if the media type of a request or a response is text, but the character set information is missing in the Content-Type header, the character set of the request or response body should default to ISO-8859-1. A response created for a request inherits the default body character set of the request instead of the body character set of the current session. The default body character set is ISO-8859-1 in a database user session.

So this is what happens in our case - default encoding is ISO-8859-1 and names of currencies are wrongly displayed.

The easiest way is to specify the default charset with SET_BODY_CHARSET:
SQL> set colwidth 20
SQL> select utl_http.get_body_charset() from dual;

UTL_HTTP.GET_BODY_CH
--------------------
ISO-8859-1

SQL>
SQL> select * from v$nls_valid_values
2 where value like '%1251%'
3 /

PARAMETER VALUE ISDEPRECATED
-------------------- -------------------- ------------
CHARACTERSET CL8MSWIN1251 FALSE

SQL> exec utl_http.set_body_charset('CL8MSWIN1251');

PL/SQL procedure successfully completed

SQL> select utl_http.get_body_charset() from dual;

UTL_HTTP.GET_BODY_CH
--------------------
CL8MSWIN1251

SQL>
SQL> SELECT EXTRACTVALUE(VALUE(P), '/Valute/NumCode') NUMCODE,
2 EXTRACTVALUE(VALUE(P), '/Valute/CharCode') CHARCODE,
3 EXTRACTVALUE(VALUE(P), '/Valute/Nominal') NOMINAL,
4 EXTRACTVALUE(VALUE(P), '/Valute/Name') NAME,
5 EXTRACTVALUE(VALUE(P), '/Valute/Value') VALUE
6 FROM TABLE(XMLSEQUENCE(HTTPURITYPE('http://www.cbr.ru/scripts/XML_daily.asp?date_req=22.03.2008')
7 .GETXML().EXTRACT('//ValCurs/Valute'))) P
8 WHERE EXTRACTVALUE(VALUE(P), '/Valute/CharCode') = 'USD'
9 /

NUMCODE CHARCODE NOMINAL NAME VALUE
-------------------- -------------------- -------------------- -------------------- --------------------
840 USD 1 Доллар США 23,7773

SQL>
So now the currency name is properly displayed. But this method affects only current session. And this default charset will be left for future requests also.

Another way - is a little bit difficult.
It is based on the fact, that in case you pump blob from the http link - the result encoding is not changed:
URIType Methods:
getBLOB()Returns the target of the URI as a BLOB value. No character conversion is performed, and the character encoding is that of the URI target. This method can also be used to fetch binary data.

So the idea is to transform this blob value into XML with the needed encoding.
I found a link from the OTN forum, which can be very useful: http://forums.oracle.com/forums/thread.jspa?messageID=1864895�
We can easily create a function, that will take as an input the blob value and the needed encoding for conversion and would return as a result XMLType value:
SQL> create or replace function BLOB2XML(l_blob blob, l_charset varchar2)
2 return xmltype is
3 l_clob clob;
4 l_src_offset number;
5 l_dest_offset number;
6 l_blob_csid number := nvl(nls_charset_id(l_charset), dbms_lob.default_csid);
7 v_lang_context number := dbms_lob.default_lang_ctx;
8 l_warning number;
9 l_amount number;
10 begin
11 dbms_lob.createtemporary(l_clob, true);
12 l_src_offset := 1;
13 l_dest_offset := 1;
14 l_amount := dbms_lob.getlength(l_blob);
15 dbms_lob.converttoclob(l_clob,
16 l_blob,
17 l_amount,
18 l_dest_offset,
19 l_src_offset,
20 l_blob_csid,
21 v_lang_context,
22 l_warning);
23 return xmltype(l_clob);
24 end;
25 /

Function created

SQL> show err blob2xml;
No errors for FUNCTION SYSTEM.BLOB2XML

SQL> set colwidth 20
SQL> select utl_http.get_body_charset() from dual;

UTL_HTTP.GET_BODY_CH
--------------------
ISO-8859-1

SQL>
SQL> SELECT EXTRACTVALUE(VALUE(P), '/Valute/NumCode') NUMCODE,
2 EXTRACTVALUE(VALUE(P), '/Valute/CharCode') CHARCODE,
3 EXTRACTVALUE(VALUE(P), '/Valute/Nominal') NOMINAL,
4 EXTRACTVALUE(VALUE(P), '/Valute/Name') NAME,
5 EXTRACTVALUE(VALUE(P), '/Valute/Value') VALUE
6 FROM TABLE(XMLSEQUENCE(blob2XML(HTTPURITYPE('http://www.cbr.ru/scripts/XML_daily.asp?date_req=22.03.2008')
7 .GETblob(),'CL8MSWIN1251').EXTRACT('//ValCurs/Valute'))) P
8 WHERE EXTRACTVALUE(VALUE(P), '/Valute/CharCode') = 'USD'
9 /

NUMCODE CHARCODE NOMINAL NAME VALUE
-------------------- -------------------- -------------------- -------------------- --------------------
840 USD 1 Доллар США 23,7773

SQL> select utl_http.get_body_charset() from dual;

UTL_HTTP.GET_BODY_CH
--------------------
ISO-8859-1

SQL>
As we can see the default charset for UTL_HTTP package has not being changed.
Just to be sure that the problem was in encoding - let's try our query with ISO-8859-1 charset:
SQL> select * from v$nls_valid_values
2 where upper(value) like '%ISO%8859%1'
3 /

PARAMETER VALUE ISDEPRECATED
-------------------- -------------------- ------------
CHARACTERSET WE8ISO8859P1 FALSE

SQL>
SQL> SELECT EXTRACTVALUE(VALUE(P), '/Valute/NumCode') NUMCODE,
2 EXTRACTVALUE(VALUE(P), '/Valute/CharCode') CHARCODE,
3 EXTRACTVALUE(VALUE(P), '/Valute/Nominal') NOMINAL,
4 EXTRACTVALUE(VALUE(P), '/Valute/Name') NAME,
5 EXTRACTVALUE(VALUE(P), '/Valute/Value') VALUE
6 FROM TABLE(XMLSEQUENCE(blob2XML(HTTPURITYPE('http://www.cbr.ru/scripts/XML_daily.asp?date_req=22.03.2008')
7 .GETblob(),'WE8ISO8859P1').EXTRACT('//ValCurs/Valute'))) P
8 WHERE EXTRACTVALUE(VALUE(P), '/Valute/CharCode') = 'USD'
9 /

NUMCODE CHARCODE NOMINAL NAME VALUE
-------------------- -------------------- -------------------- -------------------- --------------------
840 USD 1 Aieea? NOA 23,7773

SQL>
Yes, this is exactly what we got at our initial query.