MySQL – Tell MySQL to start using utf-8 encoding without `CONVERT TO`ing it

character-set, encoding, MySQL, utf-8

In a fairly unique situation, my team has ended up with UTF-8 bytes in a database that thinks the data is encoded as latin1.

At least, I'm 85% certain that this is the situation at hand.

For example, a right single quotation mark was handed to the database by a programming language that had no concept of encodings (Ruby 1.8) and just treated the data as raw bytes (0xE2 0x80 0x99). This data, as far as I can tell (how to verify?), was stored as those actual bytes. So now when the data is read out by a more intelligent programming language (Ruby 1.9), the database helpfully says "Oh! 0xE2 is 'â', 0x80 is '€', 0x99 is '™'", and so instead of "Mike’s", we end up with "Mikeâ€™s". This is also what I get in the mysql prompt when SELECTing that value.
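
One way I can think of to check what's actually stored (a sketch only; the users table and name column here are hypothetical stand-ins for the real schema) is to dump the raw bytes with HEX():

-- Hypothetical table/column; dump the raw bytes to see what's really stored.
-- Raw utf-8 for "Mike’s" would be 4D696B65E2809973 (E28099 being the ’).
SELECT name, HEX(name) FROM users WHERE name LIKE 'Mike%';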

So, essentially, we have a bunch of utf-8 encoded data stored in a database that thinks the data is encoded as latin1.

This makes me want to somehow tell the database, "No, no matter what you think, this stuff is actually utf-8." CONVERT TO doesn't seem like the right tool, because then I'll end up with a permanent "Mikeâ€™s".
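
For completeness, this is the kind of thing I'm trying to avoid (hypothetical table name); converting the table would re-encode the mis-labeled characters and bake the mangled text in for good:

-- DON'T do this: it re-encodes each mis-labeled character (â, €, ™) into
-- utf-8, so the mangled "Mikeâ€™s" becomes permanent.
ALTER TABLE users CONVERT TO CHARACTER SET utf8;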


Failed/moronic attempt #1

I noticed this:

> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+-------------------+
| Variable_name            | Value             |
+--------------------------+-------------------+
| character_set_client     | utf8              |
| character_set_connection | utf8              |
| character_set_database   | utf8              |
| character_set_filesystem | binary            |
| character_set_results    | utf8              |
| character_set_server     | latin1            |
| character_set_system     | utf8              |
| collation_connection     | utf8_general_ci   |
| collation_database       | utf8_unicode_ci   |
| collation_server         | latin1_swedish_ci |
+--------------------------+-------------------+

And thought that maybe changing character_set_results to latin1 would trick it into not doing any conversion of the bytes, resulting in the proper display of data on my utf8 OS.

Sure enough, SET character_set_results=latin1; results in ’ instead of â€™. Cool!
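
Spelled out as a sketch (with a hypothetical users.name column standing in for the real data), that workaround is just:

-- Session-level only; the stored bytes don't change, the server simply stops
-- converting result sets, so the raw E2 80 99 bytes reach a utf-8 terminal
-- untouched and render as ’.
SET SESSION character_set_results = latin1;
SELECT name FROM users WHERE name LIKE 'Mike%';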

So I added this to my ~/.my.cnf (which is the only my.cnf, I checked):

[mysqld]
...
character-set-results=latin1

and when I go back to the MySQL prompt & check the character_set_% variables, it's still utf8.

Yes, it just occurred to me that mysqld is a daemon, which means I probably need to restart the whole mysql process for this to take effect. But whoever installed MySQL on this machine used the dmg instead of the brew (wasn't me!), and the MySQL pref pane is currently telling me that MySQL isn't running even though it clearly is. Anyhow, before I go down that rabbit hole, I want to check with an actual DBA and see how ridiculous this is, or if there's a better, cleaner way to do it.
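
For what it's worth, my current understanding is that these per-connection character sets are renegotiated by each client when it connects, so a [mysqld] entry may never stick anyway. If the goal is only to change what the command-line client asks for, something like this in ~/.my.cnf seems more likely to work, though I haven't verified it on this install:

[mysql]
# Affects only the mysql command-line client's negotiated character sets;
# a guess at the more reliable lever, not verified here.
default-character-set = latin1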

Best Answer

The solution isn't precisely the same, but this question is where I originally found direction for a similar issue, and the concepts there should take you where you want to go. MySQL has a BINARY character set, and from all appearances, converting through it prevents MySQL from realizing what you're actually doing and being "too helpful."

Test case with character_set_client = utf8:

mysql> select CONVERT(CONVERT(CONVERT('Mikeâ€™s' USING latin1) USING binary) USING utf8);
+----------------------------------------------------------------------------+
| CONVERT(CONVERT(CONVERT('Mikeâ€™s' USING latin1) USING binary) USING utf8) |
+----------------------------------------------------------------------------+
| Mike’s                                                                     |
+----------------------------------------------------------------------------+
1 row in set (0.00 sec)

You could use that logic to populate a new column that MySQL believes to be utf8.
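
As a rough sketch, against a hypothetical users table with a latin1 name column: since the stored column is already labeled latin1, only the binary and utf8 steps are needed when reading from it (the USING latin1 step above just simulates how the bad data arrives from a utf8 client).

-- Hypothetical schema; adjust names and types to the real table, and back up first.
ALTER TABLE users ADD COLUMN name_utf8 VARCHAR(255) CHARACTER SET utf8;

-- Going through binary strips the wrong latin1 label; converting the binary
-- string to utf8 just re-labels the same bytes, with no byte-level change.
UPDATE users
   SET name_utf8 = CONVERT(CONVERT(name USING binary) USING utf8);

Once the new column checks out, you can rename it into place and drop the old one.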