Thesqldump shows pairs of utf8 chars when dumping a utf8 database

encodingMySQLutf-8

I'll describe right away the problem, and put details at the bottom.

I am dumping a database which I am fairly sure contains (almost: a few cells have wrong, latin1 encoded data) only UTF8 data. Also I have done my utmost to ensure every part of the chain defaults/is set to UTF8.

Problem: all UTF8 chars appearing in the output are screwed up as in this example:

è => è

This is what happens if – at some point – a UTF8-encoded character is interpreted bytewise as two latin1 characters which are then UTF8-encoded.

Edit: What is happening is along the lines of:

U(è) => (c3,a8) => L( è ) => (c383,c2a8) => U( è) 

ie: 1 UTF8-encoded char is seen as 2 L1-octets and each one gets encoded in UTF8 again, ends up as 2 UTF-8 encoded chars.

So the first question is "where is this happening"?

But the really funky part is that

  1. The website (html5,utf8 all over) using the database displays those characters OK
  2. I can import the dump without issues with point (1) above still holding
  3. If i "fix" the dump – replacing the pairs with the char that should be there, point (1) above unravels – the browser now shows the familiar question marks on inverted background that is the familiar placeholder for "bad" utf8 sequences.
  4. The issue depends on the client: I can SELECT from mysql and have the double-encoding as in mysqldump; OTOH if i GET the web page (composed by php with charset UTF8 connection) I see regular (not doubled) UTF8 sequences.
  5. Even weirder, I can obtain correct looking results in mysql/mysqldump if I specify –default-character-set=latin1 (!!!). In this case, however what I get does not round trip cleanly as the UTF8 counterpart does.

All of the above confirmed by examining outputs with a binary editor (even though i have LANG=en_US.UTF-8 and UTF8 char display is OK)

What I would like to see, obviously, is a clean UTF8 mysql environment which displays the same independently of the client.

This is really driving me up the wall. Any ideas?

DETAILS:
Running mysql-server-5.1.73-5.el6_6.x86_64 on a Centos6 machine.
my.cnf (relevant) settings:

[mysqld]
  character-set-server=utf8
  ; will inhibit start on 5.5 (a bug)
  default-character-set=utf8
  collation-server = utf8_unicode_ci
  init-connect='SET NAMES utf8'


  [mysql.server]
  ...
  default-character-set=utf8


  [client]
  default-character-set=utf8

in mysql:

mysql> show variables like '%char%';

  +--------------------------+----------------------------+
  | Variable_name            | Value                      |
  +--------------------------+----------------------------+
  | character_set_client     | utf8                       |
  | character_set_connection | utf8                       |
  | character_set_database   | utf8                       |
  | character_set_filesystem | binary                     |
  | character_set_results    | utf8                       |
  | character_set_server     | utf8                       |
  | character_set_system     | utf8                       |
  | character_sets_dir       | /usr/share/mysql/charsets/ |
  +--------------------------+----------------------------+

PHP connects with explicit utf8 charset and

PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8"

EDIT:
Create table statements are like:

CREATE TABLE `fec_news` (
  `ID` bigint(20) NOT NULL DEFAULT '0',
  `NOME_ITA` mediumtext NOT NULL,
  `TESTO_ITA` longtext NOT NULL,
  `NOME_ENG` mediumtext NOT NULL,
  `TESTO_ENG` longtext NOT NULL,
  `NOME_ESP` mediumtext NOT NULL,
  `TESTO_ESP` longtext NOT NULL,
  `NOME_FRA` mediumtext NOT NULL,
  `TESTO_FRA` longtext NOT NULL,
  `DATA` mediumtext NOT NULL,
  `FOTO1` mediumtext NOT NULL,
  `FOTO2` mediumtext NOT NULL,
  `FOTO3` mediumtext NOT NULL,
  `FOTO4` mediumtext NOT NULL,
  `FOTO5` mediumtext NOT NULL,
  `HP` mediumtext NOT NULL,
  `LINK1` mediumtext NOT NULL,
  `POS` bigint(20) NOT NULL DEFAULT '0',
  PRIMARY KEY (`ID`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

Edit: The accepted answer is right on the money. I had since found about mojibake through different channel and expanded my woes in a blog post.

Best Answer

Mojibake or double encoding.

Please provide SELECT col, HEX(col) FROM tbl WHERE ... to see whether the data is stored correctly. If correctly encoded for storage, è will be hex c3a8.

Don't run your app as root; it ignores init-connect, but you need the SET NAMES (or equivalent).