MySQL – converting latin to utf8mb4 causes question marks

character-set, MySQL, type conversion, utf-8

The original encoding of the data is unknown.
The new table uses utf8mb4_general_ci.

If I run CONVERT(BINARY CONVERT(column USING latin1) USING UTF8), as suggested elsewhere, it fixes the text, but it also turns a character such as © in the original column into ? in the new column.

If it helps to determine what original encoding it was in, the original text renders as e.g. KotaÄiÄ‡i and converts to Kotačići.

Is there a way to both preserve special characters and restore correct utf8 text format?

As requested in the comments, here is an example in hex:

HEX(col):

C398C2A3C398C2BAC399E280A0C399C5A0C398C2A920C398C2B3C399E280A620C398C2A7C399E2809EC399E2809EC399E280A1

CONVERT(BINARY CONVERT(col USING latin1) USING UTF8):

أغنية سم الله

Just raw:

Ø£ØºÙ†ÙŠØ© Ø³Ù… Ø§Ù„Ù„Ù‡

The dump-file starts with:

SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";
SET AUTOCOMMIT = 0;
START TRANSACTION;
SET time_zone = "+00:00";

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;

Tables get created with: ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

Best Answer

Ah, I was afraid of that. That hex string is the "double encoding" of أغنية سم الله. Does that look about right?

This is the expression to 'fix' it: CONVERT(BINARY(CONVERT(CONVERT(UNHEX('C398C2A3C398C2BAC399E280A0C399C5A0C398C2A920C398C2B3C399E280A620C398C2A7C399E2809EC399E2809EC399E280A1') USING utf8mb4) USING latin1)) USING utf8mb4)

The following Q&A discusses how it probably came about (look for "double encod"). You need to fix both your code and the data: https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored

(utf8 and utf8mb4 work equally well for Arabic.)

That short conversion you found needs to (sort of) be repeated to fix "double" encoding.
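To see what the expression above does at the byte level, here is a minimal Python sketch. It is a simulation, not MySQL itself: it assumes Python's cp1252 codec is a close enough stand-in for MySQL's latin1, and the helper name undo_one_layer is made up for illustration. It also shows why a correctly encoded © does not survive the same treatment.

```python
# Assumption: Python's cp1252 codec stands in for MySQL's latin1.
# The hex string is HEX(col) from the question.
HEX = (
    "C398C2A3C398C2BAC399E280A0C399C5A0C398C2A920"
    "C398C2B3C399E280A620"
    "C398C2A7C399E2809EC399E2809EC399E280A1"
)

def undo_one_layer(stored: bytes) -> bytes:
    """Read the bytes as UTF-8 text, then map each character back to one cp1252 byte."""
    return stored.decode("utf-8").encode("cp1252")

stored = bytes.fromhex(HEX)
fixed = undo_one_layer(stored)
print(fixed.hex().upper())    # D8A3D8BAD986D98AD8A920D8B3D98520D8A7D984D984D987
print(fixed.decode("utf-8"))  # the Arabic text from the question

# A correctly encoded copyright sign (C2A9) collapses to the lone byte A9,
# which is not valid UTF-8 on its own.
copyright_fixed = undo_one_layer("\u00a9".encode("utf-8"))
print(copyright_fixed.hex().upper())                     # A9
print(copyright_fixed.decode("utf-8", errors="replace")) # U+FFFD replacement character
```

Python substitutes U+FFFD for the invalid byte; the ? the asker saw in MySQL is presumably the same failure reported differently.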

Further research

It is OK to have a mixture of Arabic and other languages in a utf8 or utf8mb4 column. It is not OK to have double encoding, especially if some cells are correctly encoded. Somewhere the Arabic text was encoded an extra time, but the copyright symbol was not. Did they come from different sources? Maybe the problem arose before the data reached the database we are looking at?

That is, dig into the client(s) you are using and dump the hex of the text that is about to be INSERTed. Correctly encoded Arabic is two bytes per character, of the form Dxyy in hex (D8yy through DByy); the copyright sign, like many other common symbols, is Cxyy (C2A9).
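A client-side check along those lines could look like the following Python sketch. The helper names utf8_hex and double_encode are hypothetical, and cp1252 is again assumed as a stand-in for MySQL's latin1. It prints the hex a client should send for one Arabic letter and for ©, then what the same Arabic letter looks like after one extra round of encoding.

```python
def utf8_hex(text: str) -> str:
    """Hex of the UTF-8 bytes a correctly configured client would send."""
    return text.encode("utf-8").hex().upper()

def double_encode(text: str) -> bytes:
    """Simulate the fault: UTF-8 bytes misread as cp1252, then encoded as UTF-8 again."""
    return text.encode("utf-8").decode("cp1252").encode("utf-8")

teh_marbuta = "\u0629"  # ARABIC LETTER TEH MARBUTA
copyright_sign = "\u00a9"

print(utf8_hex(teh_marbuta))                      # D8A9: correct Arabic, Dxyy
print(utf8_hex(copyright_sign))                   # C2A9: correct copyright, Cxyy
print(double_encode(teh_marbuta).hex().upper())   # C398C2A9: double-encoded Arabic
```

Note that the double-encoded Arabic letter ends in C2A9, the same bytes as a correct ©, which is why inspecting whole values in hex is more reliable than looking at the rendered text.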

Fixing

If you find that some rows have double-encoded Arabic (or whatever) and some rows have correctly encoded copyright, and if you can distinguish which rows are which, then applying the single fix vs. the double fix should be 'easy'.

Ditto for columns. Perhaps the copyright is never in the same column as the Arabic text?

Even messier is when a single cell has both. That would strongly imply the client is "at fault".
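One way to distinguish the rows is to try the fix on each value and keep the result only when it comes out as valid UTF-8. The Python sketch below illustrates that heuristic on raw bytes; repair_cell is a hypothetical helper, cp1252 is assumed as a stand-in for MySQL's latin1, and the approach should be tested on a copy of the data before being trusted.

```python
def repair_cell(raw: bytes) -> bytes:
    """Undo one layer of double encoding if, and only if, the result is valid UTF-8."""
    try:
        candidate = raw.decode("utf-8").encode("cp1252")
        candidate.decode("utf-8")  # validity check only; the result is discarded
    except (UnicodeDecodeError, UnicodeEncodeError):
        return raw  # not double-encoded (or a mixed cell): leave untouched
    return candidate

arabic_ok = "\u0623\u063a\u0646\u064a\u0629".encode("utf-8")
arabic_double = arabic_ok.decode("cp1252").encode("utf-8")
copyright_ok = "\u00a9 2013".encode("utf-8")
mixed = copyright_ok + b" " + arabic_double

for cell in (arabic_ok, arabic_double, copyright_ok, mixed):
    print(cell.hex().upper(), "->", repair_cell(cell).hex().upper())
```

Correctly encoded Arabic and © pass through unchanged, and the double-encoded Arabic is repaired. The mixed cell is also left unchanged, because undoing a layer would break its ©; such cells still need manual attention.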

Related Solutions

MySQL – How to Convert Control Characters from Latin1 to UTF-8

I'm not certain. I tried to start out by reproducing your problem, but the ALTER worked fine for me.

test > CREATE TABLE `bar` (  `content` text ) ENGINE=MyISAM DEFAULT CHARSET=latin1;  INSERT INTO bar VALUES (0x8081828384858687898A8B8C8D8E8F909192939495969798999A9B9C9D9E9F);
Query OK, 0 rows affected (0.02 sec)

Query OK, 1 row affected (0.00 sec)

test > ALTER TABLE bar CHANGE content content TEXT CHARACTER SET UTF8;
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0

test > select * from bar;
+---------------------------------+
| content                         |
+---------------------------------+
| ����������������������������� |
+---------------------------------+
1 row in set (0.00 sec)

test > set names utf8;
Query OK, 0 rows affected (0.00 sec)

test > select * from bar;
+---------------------------------------------------------------------------------+
| content                                                                         |
+---------------------------------------------------------------------------------+
| €‚ƒ„…†‡‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ |
+---------------------------------------------------------------------------------+
1 row in set (0.00 sec)
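The conversion in that session can be approximated outside MySQL. MySQL's latin1 is essentially cp1252, except that the five bytes cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are kept as the same-numbered control characters instead of being rejected. The Python sketch below emulates that under those assumptions; decode_mysql_latin1 is a made-up helper name, not a MySQL or Python API.

```python
# Bytes that cp1252 leaves undefined; assumed here to map to U+0081, U+008D, etc.
UNDEFINED_IN_CP1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

def decode_mysql_latin1(raw: bytes) -> str:
    """Decode byte by byte as cp1252, falling back to the same-numbered code point."""
    return "".join(
        chr(b) if b in UNDEFINED_IN_CP1252 else bytes([b]).decode("cp1252")
        for b in raw
    )

# The value inserted in the session above (0x80 through 0x9F, without 0x88).
raw = bytes.fromhex(
    "8081828384858687898A8B8C8D8E8F"
    "909192939495969798999A9B9C9D9E9F"
)
text = decode_mysql_latin1(raw)
visible = "".join(c for c in text if ord(c) > 0xFF)

print(len(text), len(visible))           # 31 characters, 26 of them printable
print(visible)
print(text.encode("utf-8").hex().upper())
```

The five control characters are invisible in most terminals, which is why the second SELECT appears to show fewer characters than were inserted.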

Here are my related character set settings:

test > show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

Edit

My character set settings before running set names utf8:

test > show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Version

test > select version();
+-------------------------+
| version()               |
+-------------------------+
| 5.1.41-3ubuntu12.10-log |
+-------------------------+
1 row in set (0.00 sec)

MySQL database drop insanely slow

I hate the checking permissions issue.

You may have to disable key checks before the DROP DATABASE:

SET unique_checks = 0;
SET foreign_key_checks = 0;
SET GLOBAL innodb_stats_on_metadata = 0;
DROP DATABASE db_madeintouch;
SET GLOBAL innodb_stats_on_metadata = 1;
SET foreign_key_checks = 1;
SET unique_checks = 1;

UPDATE 2013-04-15 18:04 EDT

I just noticed you have innodb_file_per_table OFF. What gives?

You currently have all the InnoDB data and the corresponding index sitting in a single file.
Any CREATE TABLE statement must make data dictionary updates and look for space (small but annoying in this instance)
Internal Fragmentation of ibdata1
Dropping a table means scanning the table and its indexes for availability to lock. With data and index pages possibly fragmented, this takes spindles, seek time, and latency.
See Pictorial Representation of ibdata1 to see everything that goes into ibdata1

Recommendation : Remove all Data and Index Pages from ibdata1

This will give ibdata1 a breather to handle just data dictionary and MVCC management. In addition, ibdata1 will stay rather lean and mean and can be read more quickly.

You will need to perform the InnoDB Infrastructure Cleanup. I wrote out all the steps back on October 29, 2010 in StackOverflow.

UPDATE 2013-04-22 08:10 EDT

Three suggestions

SUGGESTION 1 : I just noticed something else. You are using an ancient version of MySQL (5.0.45). You should think about upgrading to MySQL 5.6.11, as it performs significantly faster than MySQL 5.5 and way faster than MySQL 5.0.

SUGGESTION 2 : You should also go ahead and implement the InnoDB Infrastructure Cleanup.

SUGGESTION 3 : You should also check the disk itself. If the data is sitting on a RAID10 set, one of the disks may have an issue. Check the disk controller's battery as well, because a problem there can slow down disk caching and affect read performance.