MySQL silently replaces UTF chars with literal question marks

character-set, encoding, MySQL, utf-8

I'm experiencing a situation similar to this SO question, i.e. I'm working with a legacy database which has UTF8 content in latin1 tables (pretty ugly, I know).

Now I'm getting new data from a new application which is completely utf8 and works with its own database. To support the other legacy systems, the application also writes a copy of its utf8 data into the legacy tables. As far as I know it should be possible to write utf8 content into latin1 tables as long as you read it back and display it as utf8 (a sketch of what I mean follows). There are a lot of tutorials explaining how to fix this situation for good, but I'd prefer not to apply them unless absolutely necessary (the legacy system will be decommissioned soon and I don't want downtime to fix this if I can avoid it).
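
For clarity, this is the round trip I mean, as I understand it: when the connection charset is latin1 the server performs no conversion, so the utf8 bytes go into the latin1 column untouched and come back out untouched, and the legacy application simply renders them as utf8. A sketch against the articles table defined below (I haven't verified this beyond my reading of the docs):

-- Hypothetical pass-through: connection charset = column charset, so MySQL
-- stores the client's raw utf8 bytes without converting them
SET NAMES latin1;
INSERT INTO `articles` (`content`) VALUES ('Hırvatistan');
-- Read back over the same latin1 connection: the raw bytes (48 C4 B1 ...)
-- are returned as-is and the application treats them as utf8
SELECT content, HEX(content) FROM `articles`;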

Here is a minimal SQL script which reproduces my issue:

CREATE TABLE `articles` (
  `content` mediumtext NOT NULL,
  FULLTEXT KEY `content` (`content`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

SET NAMES utf8;
SET CHARACTER SET utf8;
-- Turkish word for Croatia, second char is \xC4\xB1
INSERT INTO `articles` (`content`) VALUES ('Hırvatistan');

On my system I get no errors from MySQL, but after the INSERT statement the second char of the word is silently dropped and replaced by a literal ? ('\x3F').

mysql> SELECT content, HEX(content), HEX('Hırvatistan') FROM articles;
+-------------+------------------------+--------------------------+
| content     | HEX(content)           | HEX('Hırvatistan')       |
+-------------+------------------------+--------------------------+
| H?rvatistan | 483F72766174697374616E | 48C4B172766174697374616E |
+-------------+------------------------+--------------------------+

However, if I paste the same script into http://sqlfiddle.com/ I get an error when I press 'Build Schema', which states:

Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1

Why is the invalid utf8 char simply dropped on my system without any error? Is there any MySQL config value to enable in order to avoid this?
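
For reference, these are the statements I used to inspect the settings I suspect are involved; the values themselves are presumably what differ between my system and SQL Fiddle:

-- Session/global SQL mode and the connection/server character sets
SELECT @@SESSION.sql_mode, @@GLOBAL.sql_mode;
SHOW SESSION VARIABLES LIKE 'character_set%';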

What is the simplest way to allow any kind of char inside my current latin1 (with utf8 content) table? I have a lot of content and I'd prefer to avoid solutions like dumping everything and reimporting it with another charset.

Best Answer

I made some attempts to dig into this issue; here are the results.

When you set a connection charset (e.g. with SET NAMES utf8) MySQL transparently handles the encoding conversion for you. For instance, if I insert an à (\xE0 in latin1, \xC3\xA0 in utf8) into a latin1 table using a utf8 connection, the server reads the utf8 value and stores it in the table as \xE0:

mysql> SELECT HEX('à');
+-----------+
| HEX('à')  |
+-----------+
| C3A0      |
+-----------+

mysql> INSERT INTO articles VALUES(50001, 'à');
Query OK, 1 row affected (0,00 sec)

mysql> SELECT content, HEX(content) FROM articles WHERE id_p = 50001;
+---------+--------------+
| content | HEX(content) |
+---------+--------------+
| à       | E0           |
+---------+--------------+
1 row in set (0,00 sec)

When I insert invalid utf8 chars into a latin1 table it replaces them with question marks, as shown in the original question.

In order to fix my issue I had to run this command on the original table (actually I tried it on a small copy of it first). It takes care of changing the charset and the collation, and it also converts the existing data. To verify the conversion, I picked a record with a char whose encoding differs between latin1 and utf8:

mysql> select HEX(BINARY SUBSTRING(content, 17, 1)), SUBSTRING(content, 17, 1) from articles where id_p = 40\G
*************************** 1. row ***************************
HEX(BINARY SUBSTRING(content, 17, 1)): 93
            SUBSTRING(content, 17, 1): “
1 row in set (0,00 sec)

mysql> ALTER TABLE `articles` CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Query OK, 34905 rows affected (1 min 10,73 sec)
Records: 34905  Duplicates: 0  Warnings: 0

mysql> select HEX(BINARY SUBSTRING(content, 17, 1)), SUBSTRING(content, 17, 1) from articles where id_p = 40\G
*************************** 1. row ***************************
HEX(BINARY SUBSTRING(content, 17, 1)): E2809C
            SUBSTRING(content, 17, 1): “
1 row in set (0,00 sec)

After the conversion the char is stored as its utf8 encoding and all the data is still readable. The conversion also changed the content column type from MEDIUMTEXT to LONGTEXT: latin1 uses 1 byte per char while utf8 uses up to 3 bytes per char, so MySQL picks a larger type to avoid data truncation.
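
If keeping the original MEDIUMTEXT type matters, I assume the column can be switched back after the conversion; I haven't done this on the real table, so check first that no row exceeds the MEDIUMTEXT limit once stored as utf8:

-- Optional: restore the original column type (sketch, verify data sizes first)
ALTER TABLE `articles`
  MODIFY `content` MEDIUMTEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL;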

Next I experimented with inserting invalid utf8 chars into the converted table, and there I get a different result: invalid (or unsupported 4-byte) utf8 chars are simply dropped from the stored value with a warning (shown only if warnings are enabled):

$ mysql --show-warnings

mysql> INSERT INTO articles VALUES(90000, 0xC328);
Query OK, 1 row affected, 1 warning (0,00 sec)

Warning (Code 1366): Incorrect string value: '\xC3(' for column 'content' at row 1
mysql> SELECT 0xf09f8eb6;
+------------+
| 0xf09f8eb6 |
+------------+
| ?          |
+------------+
1 row in set (0,00 sec)

mysql> INSERT INTO articles VALUES(90001, 0xf09f8eb6);
Query OK, 1 row affected, 1 warning (0,00 sec)

Warning (Code 1366): Incorrect string value: '\xF0\x9F\x8E\xB6' for column 'content' at row 1
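
As an aside, if those 4-byte characters (emoji and other supplementary-plane chars) actually need to be stored, my understanding is that the column would have to use utf8mb4 rather than utf8 (available since MySQL 5.5.3). Something along these lines, which I haven't tried on the real table:

-- Sketch only: utf8mb4 accepts the 4-byte sequences that utf8 rejects
ALTER TABLE `articles` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;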

After this I found that a warning is shown for my original example too, if warnings are enabled:

-- With warnings enabled
mysql> INSERT INTO `articles` VALUES (50000, 'Hırvatistan');
Query OK, 1 row affected, 1 warning (0,00 sec)

Warning (Code 1366): Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1

Finally, to trigger an error instead of just a warning (and so avoid silent data loss), just change the SQL mode for the session or globally (at the server level):

mysql> SET SESSION sql_mode = 'TRADITIONAL';
Query OK, 0 rows affected (0,00 sec)

mysql> INSERT INTO `articles` VALUES (50000, 'Hırvatistan');
ERROR 1366 (HY000): Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1
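
To make this permanent rather than per-session, the same mode can be set globally (provided the account has the privilege to do so) and, to survive a restart, under the [mysqld] section of the server configuration file (sql_mode = TRADITIONAL):

mysql> SET GLOBAL sql_mode = 'TRADITIONAL';
Query OK, 0 rows affected (0,00 sec)

Note that the global setting only applies to connections opened after the change.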