Sql-server – SQL Server to MySQL migration – how can I remove UCS-2 surrogate pairs in SQL Server

character-setmigrationMySQLsql server

I have been tasked with migrating a SQL Server 2005 database to MySQL 5.6 (these are both database servers runnig locally) and would really appreciate some help.

——-Edit – Answered——-

This issue has now been resolved. I used Remus Rusanu's suggestion for finding the rows with these surrogate pair characters using CHARINDEX and have decided to use SUBSTRING to exclude the troublesome characters like so:

UPDATE test
SET a = SUBSTRING(a,  1,   (CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000)))+1)/2 - 1) -- string before the unwanted character
+ SUBSTRING(a, (CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000)))+1)/2 +1, LEN(a) ) -- string after the unwanted character
WHERE CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000))) % 2 = 1 -- only odd numbered charindexes (to signify match at beginning of byte pair character)

—————————–

SQL Server source database has latin1 collation (so has ISO 8859-1 character set right?) but doesn't have any char/varchar fields (any string field is nvarchar/nchar) so all this data should be using the UCS-2 character set.
MySQL target database wants the character set UTF-8.

I decided to use the database migration toolkit in the latest version of the MySQL workbench. at first it worked fine and migrated everything as expected. But I have been totally tripped up upon encountering UCS-2 surrogate pair characters in the SQL Server database.

The migration toolkit copytable program did not provide a very useful error message:

Error during charset conversion of wstring: No error.

It also did not provide any field/row information on the problem-causing data and would fail within chunks of 100 rows. So after searching through the 100 rows after the last successful insert I found that the issue seemed to be caused by two UCS-2 characters in one of the nvarchar fields. They are listed as surrogate pairs in the UCS-2 character set. They were specifically the characters DBC0 and DC83 (I got this by looking at the binary data for the field and comparing byte pairs (little endian) with data that was being migrated successfully).

When this surrogate pair was removed from the SQL Server database the row was migrated successfully to MySQL.

Here is the problem:

I have tried to search for these characters in a test SQL Server table (this chartest table is just various test strings an nvarchar field) to prepare a replacement script and keep getting strange results… I must be doing something incorrectly.

Searching for

SELECT * FROM chartest WHERE text LIKE NCHAR(0xdc83)

Will return any surrogate pair character (whether or not it uses DC83), but obviously, only if it is the only character (or part of the pair) in that field. This isn't a big deal since I would like to remove any instance of these anyway (I dont like to remove data like this but I think we can afford it).

Searching for

SELECT * FROM chartest WHERE text LIKE '%' + NCHAR(0xdc83) + '%'

Will return every row! Regardless of whether it even has a unicode character present in the field let alone the DC83 character. Is there a better way to find and replace these characters? Or something else I should try?

I have also tried setting the target databse, table, and field character set to UCS-2 but it seems as though it does not make a difference.

I should also mention that this migration is using live data (~50GB database!) while one of the sites that feeds it is taken offline so any solutions to this need to have a quick running time…

I would appreciate any suggestions very much! Please let me know if there is any information I have left out.

Best Answer

You need to take the data from UTF-8 and convert it into UCS-2LE using something like iconv. For example, using the character in your example:

echo "010000: dcb3" | xxd -r -s -0x10000 | iconv -f "UTF-8" -t "UCS-2LE" | xxd
0000000: 3307

Now I'm not sure what character UTF-8 \xdcb3 is, but apparently it's correct translation to UCS-2LE is \U0733. If you have \0xDCB3 in the SQL Server it means it was not translated into UCS-2LE before import. You should not have surrogates in the NVARCHAR fields, UCS-2 is "surrogate agnostic". See UCS-2 vs. UTF-16 (not quite Kramer vs. Kramer).

I'm not an expert in the MySQL tool set so I can't say what step is missing that was supposed to do the iconv.

Update

to locate the records with surrogates you must turn to the binary representation, since any character function will threat the surrogates as 'special':. Luckily the string manipulation functions work on binary too with the expected semantics. Eg. CHARINDEX:

insert into test(a) values  (N'a');
insert into test(a) values  (NCHAR(0xdc83));
insert into test(a) values  (N'b');
go

select * from test where charindex(0x83dc, cast(a as varbinary(8000))) > 0;

Related Solutions

Character Set Database Migration

The reason for the truncation is quite simple. Some characters (accented ones, for example) in the WE8ISO8859P1 character set are stored as a single byte, but in AL32UTF8 they end up being stored as multiple bytes. As a result of conversion, a 4000 character string may end up actually requiring more than 4000 bytes.

By way of example, this query shows you that the Euro symbol (0x80 in WE8ISO8859P1) becomes 2 bytes in AL32UTF8:

SQL>  select length(convert(chr(128),'AL32UTF8','WE8ISO8859P1')) from dual
  2  /
LENGTH(CONVERT(CHR(128),'AL32UTF8','WE8ISO8859P1'))
---------------------------------------------------
                                                  2
SQL>

To list all characters that will be affected by the change, you can use the following query:

with n as 
(
select level as c from dual
connect by level <= 255
)
select c as "WE8ISO8859P1 value",
       '"'||chr(c)||'"' as Character, 
       length(convert(chr(c),'AL32UTF8','WE8ISO8859P1')) as "New length"
from n
where length(convert(chr(c),'AL32UTF8','WE8ISO8859P1'))>1;

Unfortunately the maximum length a CHAR or VARCHAR string can be in Oracle is 4000 bytes. The only option available to you if characterset conversion pushes you over this limit is to convert the columns to use the CLOB datatype, but we warned - CLOBs are difficult to deal with and can present challenges.

Mysql – NVARCHAR issue when migrating from SQL Server to MySQL with MySQL migration wizard

You are trying to execute this statement on the SQL Server side:

SELECT  [id],
        CAST([endretFra] AS NVARCHAR(8000)) AS [endretFra],
        CAST([endretTil] AS NVARCHAR(8000)) AS [endretTil],
        CAST([endringskilde] AS NVARCHAR(255)) AS [endringskilde],
        CAST([sistEndretBruker] AS NVARCHAR(255)) AS [sistEndretBruker],
        CAST([sistEndretFulltNavn] AS NVARCHAR(255)) AS [sistEndretFulltNavn],
        [sistEndretTidspunkt],
        CAST([felt] AS NVARCHAR(255)) AS [felt],
        CAST([handling] AS NVARCHAR(255)) AS [handling],
        [aliste_id],
        [utvalg_id]
FROM    [gruppergister].[dbo].[Endring]

That statement tries to convert two columns to the NVARCHAR data type with a length of 8000. However,the maximum allowed length of NVARCHAR in SQL Server is 4000. So, whatever tool you are using is producing invalid SQL Server syntax.

As a side note, if you are trying to create a development environment for a SQL Server hosted database, you probably should use SQL Server. The syntax for some statements is quite different and using the from DBMS can only cause confusion and grief.

You can install a trial version of SQL Server for free, if your project is less than 180 days, or you can buy the developer edition for about $50 (e.g. here). If your database is less than 4GB and does not use advanced features, you could also go with the free SQL Server Express Edition.

Best Answer

Related Solutions

Character Set Database Migration

Mysql – NVARCHAR issue when migrating from SQL Server to MySQL with MySQL migration wizard

Related Question