MySQL – Tell MySQL to start using utf-8 encoding without `CONVERT TO`ing it

character-set, encoding, MySQL, utf-8

In a fairly unique situation, my team has ended up with UTF-8 bytes in a database that thinks the data is encoded as latin1.

At least, I'm 85% certain that this is the situation at hand.

For example, a right single quotation mark was handed to the database by a programming language that had no concept of encodings (Ruby 1.8) and just treated the data as raw bytes (0xE2 0x80 0x99). This data, as far as I can tell (how to verify?), was stored as those actual bytes. So now when the data is read out by a more intelligent programming language (Ruby 1.9), the database helpfully says "Oh! 0xE2 is 'â', 0x80 is '€', 0x99 is '™'", and so instead of "Mike’s", we end up with "Mikeâ€™s". This is also what I get in the mysql prompt when SELECTing that value.
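
One way I can think of to check what's actually stored (a sketch only; the users table and name column here are hypothetical stand-ins for the real schema) is to dump the raw bytes with HEX():

-- Hypothetical table/column; dump the raw bytes to see what's really stored.
-- Raw utf-8 for "Mike’s" would be 4D696B65E2809973 (E28099 being the ’).
SELECT name, HEX(name) FROM users WHERE name LIKE 'Mike%';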

So, essentially, we have a bunch of utf-8 encoded data stored in a database that thinks the data is encoded as latin1.

This makes me want to somehow tell the database, "No, no matter what you think, this stuff is actually utf-8." CONVERT TO doesn't seem like the right tool, because then I'll end up with a permanent "Mikeâ€™s".
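
For completeness, this is the kind of thing I'm trying to avoid (hypothetical table name); converting the table would re-encode the mis-labeled characters and bake the mangled text in for good:

-- DON'T do this: it re-encodes each mis-labeled character (â, €, ™) into
-- utf-8, so the mangled "Mikeâ€™s" becomes permanent.
ALTER TABLE users CONVERT TO CHARACTER SET utf8;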


Failed/moronic attempt #1

I noticed this:

> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+-------------------+
| Variable_name            | Value             |
+--------------------------+-------------------+
| character_set_client     | utf8              |
| character_set_connection | utf8              |
| character_set_database   | utf8              |
| character_set_filesystem | binary            |
| character_set_results    | utf8              |
| character_set_server     | latin1            |
| character_set_system     | utf8              |
| collation_connection     | utf8_general_ci   |
| collation_database       | utf8_unicode_ci   |
| collation_server         | latin1_swedish_ci |
+--------------------------+-------------------+

And thought that maybe changing character_set_results to latin1 would trick it into not doing any conversion of the bytes, resulting in the proper display of data on my utf8 OS.

Sure enough, SET character_set_results=latin1; results in ’ instead of â€™. Cool!
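
Spelled out as a sketch (with a hypothetical users.name column standing in for the real data), that workaround is just:

-- Session-level only; the stored bytes don't change, the server simply stops
-- converting result sets, so the raw E2 80 99 bytes reach a utf-8 terminal
-- untouched and render as ’.
SET SESSION character_set_results = latin1;
SELECT name FROM users WHERE name LIKE 'Mike%';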

So I added this to my ~/.my.cnf (which is the only my.cnf, I checked):

[mysqld]
...
character-set-results=latin1

and when I go back to the MySQL prompt & check the character_set_% variables, it's still utf8.

Yes, it just occurred to me that mysqld is a daemon, which means I probably need to restart the whole mysql process for this to take effect. But whoever installed MySQL on this machine used the dmg instead of the brew (wasn't me!), and the MySQL pref pane is currently telling me that MySQL isn't running even though it clearly is. Anyhow, before I go down that rabbit hole, I want to check with an actual DBA and see how ridiculous this is, or if there's a better, cleaner way to do it.
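
For what it's worth, my current understanding is that these per-connection character sets are renegotiated by each client when it connects, so a [mysqld] entry may never stick anyway. If the goal is only to change what the command-line client asks for, something like this in ~/.my.cnf seems more likely to work, though I haven't verified it on this install:

[mysql]
# Affects only the mysql command-line client's negotiated character sets;
# a guess at the more reliable lever, not verified here.
default-character-set = latin1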

Best Answer

The solution isn't precisely the same, but this question is where I originally found direction for a similar issue, and the concepts there should take you where you want to go. MySQL has a BINARY character set, and from all appearances, converting through it prevents MySQL from realizing what you're actually doing and being "too helpful."

Test case with character_set_client = utf8:

mysql> select CONVERT(CONVERT(CONVERT('Mikeâ€™s' USING latin1) USING binary) USING utf8);
+----------------------------------------------------------------------------+
| CONVERT(CONVERT(CONVERT('Mikeâ€™s' USING latin1) USING binary) USING utf8) |
+----------------------------------------------------------------------------+
| Mike’s                                                                     |
+----------------------------------------------------------------------------+
1 row in set (0.00 sec)

You could use that logic to populate a new column that MySQL believes to be utf8.
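
As a rough sketch, against a hypothetical users table with a latin1 name column: since the stored column is already labeled latin1, only the binary and utf8 steps are needed when reading from it (the USING latin1 step above just simulates how the bad data arrives from a utf8 client).

-- Hypothetical schema; adjust names and types to the real table, and back up first.
ALTER TABLE users ADD COLUMN name_utf8 VARCHAR(255) CHARACTER SET utf8;

-- Going through binary strips the wrong latin1 label; converting the binary
-- string to utf8 just re-labels the same bytes, with no byte-level change.
UPDATE users
   SET name_utf8 = CONVERT(CONVERT(name USING binary) USING utf8);

Once the new column checks out, you can rename it into place and drop the old one.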