I need to convert Unicode column values from UTF16-LE to UTF-8 to then import into MySQL.
Is it safe to assume columns of NVARCHAR
, NTEXT
, NCHAR
, BIT
, INT
, DECIMAL
, FLOAT
, and DATETIME
all must be Unicode and therefore will not have any characters unable to be converted from UTF-16LE to UTF-8 simply by exporting the values to a .txt file and resaving them with the UTF-8 Encoding prior to importing them to MySQL?
Is it safe to assume any unicode columns in SQL Server will not include any characters incapable of being converted from UTF-16LE to UTF-8 after they've been exported to CSV files?
My import fails, regardless if I re-save the CSV with UTF-8 encoding or not. So, I assume either:
- SQL Server allows non-unicode character in unicode columns that cannot convert correctly to UTF-8 (which I doubt, hence my question to check my assumption); or
- It's failing elsewhere in the process – e.g. re-saving the CSV files adds something MySQL doesn't like.
I'm using bcp to export the values to a CSV. Then, I resave it with the UTF-8 encoding because MS removed the ability to export directly as UTF-8. Finally, I use MySQL's LOAD DATA INFILE
to import where it fails.
Best Answer
Only the
XML
andN
-prefixed types (NCHAR
,NVARCHAR
, andNTEXT
[which has been deprecated since SQL Server 2005 was released so please do not use it]) are Unicode. Those other types you mentioned are not strings and are not stored as strings, hence they are not relevant to this question.This is not exactly a valid question. Unicode characters are Unicode characters regardless of their encoding, whether it is UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE. Now, it is possible for the data itself to contain invalid sequences, such as invalid Surrogate Pairs. But then those aren't valid characters in the UTF-8 or UTF-32 encodings either.
Well, you need to be sure to save the initial export file with a Unicode-encoding. So you would use either the
-N
or-w
options with BCP.Also, make sure that you are doing more than just changing the Byte Order Mark (BOM) of the file and are actually converting the Unicode / UT-16LE characters to UTF-8.