Mysql – Can’t set up Unicode support

MySQLphpmyadminunicodeutf-8

I am creating a toy website locally with XAMPP and I want it to support Unicode.

index.php contains a simple form that I can use to add a new entry (a UTF-8 string) to a table "myTable" in the DB. In addition to that, I am also echoing with PHP every entry from "myTable" on the body of index.php, so that I can see if the UTF-8 strings that I am entering through the form make all the way into the database and back without any issues.


Here are the measures that I took toward Unicode support:

  1. Every file on my website (ex.: index.php) is encoded in UTF-8 without BOM;
  2. <meta charset='utf-8'> everywhere applicable too;
  3. mysqli_set_charset($myConnection, "utf-8"); in my "db.inc.php" file;
  4. At the phpMyAdmin "control panel" (?), under "Database Server" I see "Server charset: UTF-8 Unicode (utf8)";
  5. I created a database and a table for this specific purpose. Here is the result of SHOW CREATE TABLE `myTable`:

    CREATE TABLE `myTable` (
    `row1` varchar(16) COLLATE utf8mb4_bin NOT NULL,
    `row2` varchar(32) COLLATE utf8mb4_bin NOT NULL
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin


The outcome was as follows: when I print the entries of "myTable", everything does get displayed correctly on the browser indeed, however only owing to a stroke of luck, because apparently my strings are being internally represented badly. When I browse my table with phpMyAdmin I see mojibake, and I looked over at myTable.ibd and I see the same mojibake.
The string joão e maría for example, is displayed as joão e maría; for the first ã -> ã, it could be UTF-8 bytes being interpreted as latin-1, though I see ã even when I tell npp to interpret the file as UTF-8.

Running the query SELECT `row1`, HEX(`row1`) FROM `myTable` WHERE 1 I see 6A6FC383C2A36F2065206D6172C383C2AD61 as the bytes of joão e maría, so in fact the mojibake is "correctly" encoded as UTF-8 (C3 83 C2 A3 is UTF-8 for ã).

Ideas?


UPDATE

After more careful inspection I verified that mysqli_set_charset($myConnection, "utf-8"); was returning false. I replaced it with mysqli_set_charset($myConnection, "utf8"); and voilà, strings with fancy characters that I enter through the form now make it all the way into the database unscathed. However, these new strings aren't correctly displayed by the browser.

For example: the string bênção is inserted into the DB through the form. I query the table with SELECT `row`, HEX(`row1`) (...) and find out that it is stored as 62C3AA6EC3A7C3A36F inside the table, which is the correct UTF-8 representation of the string in question. I access index.php and I see b�n��o being displayed. I download the HTML of the webpage and inspect it, and I find that the offending bytes are respectively EA, E7, E3, which, surely enough, correspond to ê, ç and ã in latin-1.

Best Answer

<meta charset="UTF-8"> -- correct for outside MySQL

utf8mb4 is the term you need to use inside MySQL, including

mysqli_set_charset($myConnection, "utf8mb4");

ã is Mojibake for ã. This happens when latin1 is mistakenly somewhere in the data flow.

Hex EA E7 E3 is the the latin1 encoding for êçã.
Hex C3AA C3A7 C3A3 is the UTF-8 (utf8mb4) encoding for them.

Since you have presented different symptoms, let me present to you 'Best Practice' and how to diagnose the symptoms for 'black diamond with question mark', 'Mojibake', etc. Note that there are 4-5 places where you have to consistently use UTF-8.