Sql-server – Can Unicode columns include non-Unicode values in SQL Server

Tags: bcp, migration, MySQL, SQL Server, unicode

I need to convert Unicode column values from UTF-16LE to UTF-8 so that I can import them into MySQL.

Is it safe to assume columns of NVARCHAR, NTEXT, NCHAR, BIT, INT, DECIMAL, FLOAT, and DATETIME all must be Unicode and therefore will not have any characters unable to be converted from UTF-16LE to UTF-8 simply by exporting the values to a .txt file and resaving them with the UTF-8 Encoding prior to importing them to MySQL?

Is it safe to assume any unicode columns in SQL Server will not include any characters incapable of being converted from UTF-16LE to UTF-8 after they've been exported to CSV files?

My import fails whether or not I re-save the CSV with UTF-8 encoding. So, I assume either:

SQL Server allows non-Unicode characters in Unicode columns that cannot be converted correctly to UTF-8 (which I doubt, hence my question to check my assumption); or
It's failing elsewhere in the process – e.g. re-saving the CSV files adds something MySQL doesn't like.

I'm using bcp to export the values to a CSV. Then, I resave it with the UTF-8 encoding because MS removed the ability to export directly as UTF-8. Finally, I use MySQL's LOAD DATA INFILE to import where it fails.

Best Answer

Is it safe to assume columns of NVARCHAR, NTEXT, NCHAR, BIT, INT, DECIMAL, FLOAT, and DATETIME all MUST be UNICODE...

Only the XML and N-prefixed types (NCHAR, NVARCHAR, and NTEXT [which has been deprecated since SQL Server 2005 was released so please do not use it]) are Unicode. Those other types you mentioned are not strings and are not stored as strings, hence they are not relevant to this question.

... and therefore WILL NOT have any characters unable to be converted from UTF-16LE to UTF-8...

This is not exactly a valid question. Unicode characters are Unicode characters regardless of their encoding, whether it is UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE. Now, it is possible for the data itself to contain invalid sequences, such as invalid Surrogate Pairs. But then those aren't valid characters in the UTF-8 or UTF-32 encodings either.
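
To make the point about invalid sequences concrete, here is a minimal Python sketch (not part of the original answer) that looks for unpaired surrogates in a buffer of UTF-16LE bytes. The function name find_lone_surrogates is hypothetical, chosen only for illustration.

```python
def find_lone_surrogates(raw: bytes) -> list[int]:
    """Return the indexes (in the decoded text) of unpaired surrogates.

    `raw` is assumed to hold UTF-16LE code units without a BOM.
    """
    # "surrogatepass" lets unpaired surrogates through the decoder instead of
    # raising, so they can be located. Valid pairs still decode to one character.
    text = raw.decode("utf-16-le", errors="surrogatepass")
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


valid = "a\U0001F600b".encode("utf-16-le")  # contains a proper surrogate pair
broken = b"\x61\x00\x3d\xd8\x62\x00"        # 'a', lone high surrogate, 'b'

print(find_lone_surrogates(valid))
print(find_lone_surrogates(broken))
```

A string that still contains such a code point cannot be encoded as UTF-8 by a strict encoder; in Python, "\ud83d".encode("utf-8") raises UnicodeEncodeError. The data is invalid in every Unicode encoding, which is why the question is about the data rather than about UTF-16LE versus UTF-8.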

... simply by exporting the values to a .txt file and resaving them with the UTF-8 Encoding prior to importing them to MySQL?

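Well, you need to be sure to save the initial export file with a Unicode encoding. So you would use either the -N or -w option with bcp. Of the two, -w writes everything as Unicode character data, which is the one that produces a plain text file.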

Also, make sure that you are doing more than just changing the Byte Order Mark (BOM) of the file and are actually converting the Unicode / UTF-16LE characters to UTF-8.
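
As an illustration of an actual conversion (as opposed to only swapping the BOM), here is a small Python sketch. It is an assumption about how one might do this step, not the asker's tooling; the function name and the chunk size are made up for the example.

```python
def transcode_utf16le_to_utf8(src_path: str, dst_path: str,
                              chunk_chars: int = 1 << 20) -> None:
    """Re-encode a UTF-16LE text file as UTF-8 (without a BOM).

    Decoding is strict, so invalid UTF-16 (e.g. an unpaired surrogate)
    raises UnicodeDecodeError instead of being written out silently.
    """
    # newline="" disables newline translation, so \r\n row terminators survive.
    with open(src_path, "r", encoding="utf-16-le", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        first = True
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            if first:
                # Drop a leading BOM if the export tool wrote one.
                if chunk.startswith("\ufeff"):
                    chunk = chunk[1:]
                first = False
            dst.write(chunk)
```

If this raises UnicodeDecodeError, the export itself contains invalid UTF-16. If it succeeds and the import still fails, the problem is more likely on the MySQL side. For example, MySQL's utf8 character set stores at most three bytes per character, so supplementary characters need utf8mb4.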

Related Solutions

Mysql – How to convert control characters in MySQL from latin1 to UTF-8

I'm not certain. I tried to start out by reproducing your problem, but the ALTER worked fine for me.

test > CREATE TABLE `bar` (  `content` text ) ENGINE=MyISAM DEFAULT CHARSET=latin1;  INSERT INTO bar VALUES (0x8081828384858687898A8B8C8D8E8F909192939495969798999A9B9C9D9E9F);
Query OK, 0 rows affected (0.02 sec)

Query OK, 1 row affected (0.00 sec)

test > ALTER TABLE bar CHANGE content content TEXT CHARACTER SET UTF8;
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0

test > select * from bar;
+---------------------------------+
| content                         |
+---------------------------------+
| ����������������������������� |
+---------------------------------+
1 row in set (0.00 sec)

test > set names utf8;
Query OK, 0 rows affected (0.00 sec)

test > select * from bar;
+---------------------------------------------------------------------------------+
| content                                                                         |
+---------------------------------------------------------------------------------+
| €‚ƒ„…†‡‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ |
+---------------------------------------------------------------------------------+
1 row in set (0.00 sec)
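
For comparison, the same byte-to-character mapping can be sketched outside MySQL. This Python snippet is an illustration only, assuming that MySQL's latin1 behaves like Windows cp1252 for these bytes. The function name is hypothetical.

```python
def latin1_bytes_to_utf8(raw: bytes) -> bytes:
    """Interpret bytes as Windows-1252 and re-encode the text as UTF-8."""
    # Note: Python's cp1252 codec treats 0x81, 0x8D, 0x8F, 0x90 and 0x9D as
    # undefined and raises on them, so this sample avoids those bytes.
    return raw.decode("cp1252").encode("utf-8")


sample = bytes([0x80, 0x82, 0x83, 0x85, 0x99])
converted = latin1_bytes_to_utf8(sample)
print(converted.decode("utf-8"))
print(converted.hex())
```

Each single byte in the 0x80-0x9F range becomes a multi-byte UTF-8 sequence (0x80, the euro sign, becomes E2 82 AC), which is why the first SELECT shows replacement characters until the connection character set matches the column.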

Here are my related character set settings

test > show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

Edit

My char settings before running set names utf8

test > show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Version

test > select version();
+-------------------------+
| version()               |
+-------------------------+
| 5.1.41-3ubuntu12.10-log |
+-------------------------+
1 row in set (0.00 sec)

Mysql – Why would SequelPro only import 23k rows out of 130k

This may depend on where you generated the CSV file. If the CSV file was generated on a Windows machine, there could be some character set issues.

See https://code.google.com/p/sequel-pro/issues/detail?id=1629

See the following URLs as SequelPro's character set problems are not new

If the CSV file was generated on another Mac OS X server, you should not be having this issue.
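
One way to check whether a character set issue is in play, before changing any settings, is to look for the first line of the CSV that is not valid UTF-8. The following Python sketch is an assumption about how to do that check, not something Sequel Pro provides. The function name is hypothetical.

```python
from typing import Optional, Tuple


def first_invalid_utf8_line(path: str) -> Optional[Tuple[int, int]]:
    """Return (line_number, byte_offset_in_line) of the first line that
    fails to decode as UTF-8, or None if the whole file is valid."""
    with open(path, "rb") as f:
        # Iterating a binary file yields lines split on b"\n".
        for lineno, line in enumerate(f, start=1):
            try:
                line.decode("utf-8")
            except UnicodeDecodeError as exc:
                return lineno, exc.start
    return None
```

If the reported line number is close to where the import stops, the file's encoding is a likely cause. If the function returns None, the file is valid UTF-8 and the cause is probably elsewhere.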

You may have to resort to setting the default character set to match that CSV file. It sounds weird, but here it goes:

Please run this query and you will see something like this:

mysql> show variables like 'character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

mysql>

You can also see the character set of the database

mysql> show create database mydb\G
*************************** 1. row ***************************
       Database: mydb
Create Database: CREATE DATABASE `mydb` /*!40100 DEFAULT CHARACTER SET latin1 */
1 row in set (0.00 sec)

mysql>

Perhaps you should load another table that has the matching character set:

CREATE TABLE anothertable LIKE mytable;

Change the whole table's character set

ALTER TABLE anothertable CONVERT TO CHARACTER SET charset_name [COLLATE collation_name];

or change a column's character set

ALTER TABLE anothertable MODIFY col1 CHAR(50) CHARACTER SET utf8;

Then, have SequelPro load anothertable.

To be less aggressive, change only the column's character set.
