Sql-server – Can Unicode columns include non-Unicode values in SQL Server

bcpmigrationMySQLsql serverunicode

I need to convert Unicode column values from UTF16-LE to UTF-8 to then import into MySQL.

Is it safe to assume columns of NVARCHAR, NTEXT, NCHAR, BIT, INT, DECIMAL, FLOAT, and DATETIME all must be Unicode and therefore will not have any characters unable to be converted from UTF-16LE to UTF-8 simply by exporting the values to a .txt file and resaving them with the UTF-8 Encoding prior to importing them to MySQL?

Is it safe to assume any unicode columns in SQL Server will not include any characters incapable of being converted from UTF-16LE to UTF-8 after they've been exported to CSV files?

My import fails, regardless if I re-save the CSV with UTF-8 encoding or not. So, I assume either:

  1. SQL Server allows non-unicode character in unicode columns that cannot convert correctly to UTF-8 (which I doubt, hence my question to check my assumption); or
  2. It's failing elsewhere in the process – e.g. re-saving the CSV files adds something MySQL doesn't like.

I'm using bcp to export the values to a CSV. Then, I resave it with the UTF-8 encoding because MS removed the ability to export directly as UTF-8. Finally, I use MySQL's LOAD DATA INFILE to import where it fails.

Best Answer

Is it safe to assume columns of NVARCHAR, NTEXT, NCHAR, BIT, INT, DECIMAL, FLOAT, and DATETIME all MUST be UNICODE...

Only the XML and N-prefixed types (NCHAR, NVARCHAR, and NTEXT [which has been deprecated since SQL Server 2005 was released so please do not use it]) are Unicode. Those other types you mentioned are not strings and are not stored as strings, hence they are not relevant to this question.

... and therefore WILL NOT have any characters unable to be converted from UTF-16LE to UTF-8...

This is not exactly a valid question. Unicode characters are Unicode characters regardless of their encoding, whether it is UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE. Now, it is possible for the data itself to contain invalid sequences, such as invalid Surrogate Pairs. But then those aren't valid characters in the UTF-8 or UTF-32 encodings either.

... simply by exporting the values to a .txt file and resaving them with the UTF-8 Encoding prior to importing them to MySQL?

Well, you need to be sure to save the initial export file with a Unicode-encoding. So you would use either the -N or -w options with BCP.

Also, make sure that you are doing more than just changing the Byte Order Mark (BOM) of the file and are actually converting the Unicode / UT-16LE characters to UTF-8.