How to convert xml entities to unicode characters (with read-only access) in Oracle

character-setoracleoracle-11g-r2unicodexml

I'm connected to an Oracle Database (11g Release 2 – 11.2.0.4), with read-only access.

Into this database, some of the data is uploaded via, or rather as, XML – and quite a few entries contain multiple occurences of XML (special) character entities in the format of &#nnnn;.

So far I have dealt with these &#nnnn; in an Excel VBA script to convert them to Unicode characters, but I'd rather do that already in the SQL script I'm running to export the data (to Excel).

This Q&A covers pretty much the same issue, but I can't yet successfully replicate or implement the answers in my case and therefore need help.

The accepted answer in that Q&A contains SQL commands which I assume (wrongly?) I cannot use (with read-only access), like for instance create table, insert into, declare and loop.

Another answer works for me in that I can reproduce it (not in a online fiddle (how?) but in Oracle SQL Developer), albeit with two handicaps: 1) it doesn't loop and therefore would only work if the field contained only one special &#nnnn; character (one or multiple times) but not different &#nnnn; characters and 2) it fails to work with the   (non-breaking space) for a so far unidentified reason.

Building on the Q&A cited above, how can I convert these XML (special) characters to Unicode with read-only access in Oracle 11g?


Related link(s):


(Failed) Attempts thus far:

SQL 1

select regexp_replace(s, '&#([0-9]+);', u) from
(select s, unistr('\0' || REPLACE(TO_CHAR(TO_NUMBER(c), 'xxxx'), ' ', '')) u from
(select s, regexp_replace(s, '.*&#([0-9]+);.*', '\1') c from
(select 'Hallöle slovenĈina Hallöle slovenđina' s from dual)))

OUTPUT 1

Hallöle slovenđina Hallöle slovenđina

COMMENT 1

Ĉ ( = Ĉ ) is effectively "overwritten" by đ ( = đ ). That is, this script will only work for fields which contain only one and the same special character; it will overwrite all other special characters with the one character (which quite obviously is undesirable).


SQL 2

select regexp_replace(s, '&#([0-9]+);', u) from
(select s, unistr('\0' || REPLACE(TO_CHAR(TO_NUMBER(c), 'xxxx'), ' ', '')) u from
(select s, regexp_replace(s, '.*&#([0-9]+);.*', '\1') c from
(select 'Hallöle sloven ina' s from dual)))

OUTPUT 2 (error message)

ORA-30186: '\' must be followed by four hexdecimal characters or
another '\'
30186. 00000 – "'\' must be followed by four hexdecimal characters or another '\'"
*Cause: In the argument of SQL function UNISTR, a '\' must be followed by
four hexdecimal characters or another '\'
*Action: Fix the string format

COMMENT 2

For some reason, the non-breaking space ( ) seems to behave differently to other special characters here; maybe it's an Oracle exception?


SQL 3

select REGEXP_REPLACE(specialCharData,'&#([0-9]+);',unistr('\' || replace(to_char(to_number(regexp_replace(specialCharData, '.*?&#([0-9]+);.*$', '\1')), 'xxx'), ' ', '0')),1,1) as "bla", ................

OUTPUT 3 (error message)

ORA-01722: invalid number
01722. 00000 -  "invalid number"
*Cause:    The specified number was invalid.
*Action:   Specify a valid number.

COMMENT 3

specialCharData would be the name of the field/column in my database.


SQL 4

select REGEXP_REPLACE(specialCharData,'&#([0-9]+);',unistr('\' || replace(regexp_replace(specialCharData, '.*?&#([0-9]+);.*$', '\1'), ' ', '0')),1,1) as "specialChar", ................

OUTPUT 4 (error message)

ORA-30186: '\' must be followed by four hexdecimal characters or
another '\'
30186. 00000 – "'\' must be followed by four hexdecimal characters or another '\'"
*Cause: In the argument of SQL function UNISTR, a '\' must be followed by
four hexdecimal characters or another '\'
*Action: Fix the string format

COMMENT 4

specialCharData would be the name of the field/column in my database. Here I tried to prune SQL 3 by cutting away the to_char(to_number( section. Not that helpful, probably… random testing idea…

Best Answer

The data looks like it is raw XML data without XML Tags, not the translated text. You should probably get that fixed first.

In the meantime, you can run the data through XMLTable() to do the translation for you.

with data as (
  select 'Hallöle slovenĈina Hallöle slovenđina' str from dual
)
select a.str, b.str2
from data a, xmltable( '/'
  passing xmltype( '<dat>' || a.str || '</dat>' )
  columns
   str2 varchar2(4000) path '/dat'
) b

Produces the desired result: enter image description here