Mysql – converting latin to utf8mb4 causes questionmarks

character-setMySQLtype conversionutf-8

  • The original format of the data is unknown
  • The new table is in utf8mb4_general_ci

If I do CONVERT(BINARY CONVERT(column USING latin1) USING UTF8) as mentioned here – it fixes all text, but converts something like: © in the original column to ? in the new column.

If it helps to determine what original encoding it was in, the original text renders as e.g. KotaÄići and converts to Kotačići.

Is there a way to both preserve special characters and restore correct utf8 text format?


As requested in the comments an example via hex:

HEX(col):

C398C2A3C398C2BAC399E280A0C399C5A0C398C2A920C398C2B3C399E280A620C398C2A7C399E2809EC399E2809EC399E280A1

CONVERT(BINARY CONVERT(col USING latin1) USING UTF8):

أغنية سم الله

Just raw:

أغنية سم الله

The dump-file starts with:

SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";
SET AUTOCOMMIT = 0;
START TRANSACTION;
SET time_zone = "+00:00";

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;

Tables get created with: ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

Best Answer

Ah, I was afraid of that. That hex string is the "double encoding" of أغنية سم الله Does that look about right?

This is the expression to 'fix' it: CONVERT(BINARY(CONVERT(CONVERT(UNHEX('C398C2A3C398C2BAC399E280A0C399C5A0C398C2A920C398C2B3C399E280A620C398C2A7C399E2809EC399E2809EC399E280A1') USING utf8mb4) USING latin1)) USING utf8mb4)

This discusses how it probably came about. You need to fix your code and fix the data: https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored (Look for "double encod")

(utf8 and utf8mb4 work equally well for Arabic.)

That short conversion you found needs to (sort of) be repeated to fix "double" encoding.

Further research

It is OK to be a mixture of Arabic and other languages in a utf8 or utf8mb4 column. It is not ok to have double-encoding, especially if some cells are correctly encoded. Somewhere the Arabic text was encoded an extra time, but the copyright symbol was not. Did they come from different sources? Maybe the problem came before this database we are looking at?

That is, dig into the client(s) you are using and dump the hex of text that is about to be INSERTed. Arabic should be two hex bytes: Dxyy; copyright (and many other popular symbols) should be Cxyy.

Fixing

If you find that some rows have double-encoded Arabic (or whatever) an d some rows have correctly encoded copyright, and if you can distinguish which rows are which, then applying the single-fix vs the double-fix should be 'easy'.

Ditto for columns. Perhaps the copyright is never in the same column as the Arabic text?

Even messier is when a single cell has both. That would strongly imply the client is "at fault".