The reason for the truncation is quite simple. Some characters (accented ones, for example) in the WE8ISO8859P1 character set are stored as a single byte, but in AL32UTF8 they end up being stored as multiple bytes. As a result of conversion, a 4000 character string may end up actually requiring more than 4000 bytes.
By way of example, this query shows you that the Euro symbol (0x80 in WE8ISO8859P1) becomes 2 bytes in AL32UTF8:
SQL> select length(convert(chr(128),'AL32UTF8','WE8ISO8859P1')) from dual
2 /
LENGTH(CONVERT(CHR(128),'AL32UTF8','WE8ISO8859P1'))
---------------------------------------------------
2
SQL>
To list all characters that will be affected by the change, you can use the following query:
with n as
(
select level as c from dual
connect by level <= 255
)
select c as "WE8ISO8859P1 value",
'"'||chr(c)||'"' as Character,
length(convert(chr(c),'AL32UTF8','WE8ISO8859P1')) as "New length"
from n
where length(convert(chr(c),'AL32UTF8','WE8ISO8859P1'))>1;
Unfortunately the maximum length a CHAR
or VARCHAR
string can be in Oracle is 4000 bytes. The only option available to you if characterset conversion pushes you over this limit is to convert the columns to use the CLOB
datatype, but we warned - CLOB
s are difficult to deal with and can present challenges.
You need to take the data from UTF-8 and convert it into UCS-2LE using something like iconv. For example, using the character in your example:
echo "010000: dcb3" | xxd -r -s -0x10000 | iconv -f "UTF-8" -t "UCS-2LE" | xxd
0000000: 3307
Now I'm not sure what character UTF-8 \xdcb3 is, but apparently it's correct translation to UCS-2LE is \U0733
. If you have \0xDCB3
in the SQL Server it means it was not translated into UCS-2LE before import. You should not have surrogates in the NVARCHAR fields, UCS-2 is "surrogate agnostic". See UCS-2 vs. UTF-16 (not quite Kramer vs. Kramer).
I'm not an expert in the MySQL tool set so I can't say what step is missing that was supposed to do the iconv.
Update
to locate the records with surrogates you must turn to the binary representation, since any character function will threat the surrogates as 'special':. Luckily the string manipulation functions work on binary too with the expected semantics. Eg. CHARINDEX:
insert into test(a) values (N'a');
insert into test(a) values (NCHAR(0xdc83));
insert into test(a) values (N'b');
go
select * from test where charindex(0x83dc, cast(a as varbinary(8000))) > 0;
Best Answer
I had to avoid the escape character in the pattern matching string (after the LIKE), and append
ESCAPE ''
to avoid using the backslash as an escape character. In this example, I had_
characters in the literals, as well as\
characters.