https://dev.mysql.com/doc/refman/5.7/en/replication-features-differing-tables.html describes the supported variations in table definitions.
For one, it appears that this is supported with statement-based replication:
When using statement-based replication, a simple rule of thumb to follow is, “If the statement run on the master would also execute successfully on the slave, it should also replicate successfully”.
With row-based replication, only certain type conversions are supported, which are affected by the slave_type_conversions
variable. The key quotes according to my reading are:
[Conversions are supported between] any of the string types CHAR, VARCHAR, and TEXT, including conversions between different widths.
However:
[Row-based] Replication between columns using different character sets is not supported.
I made some attempts to dig into this issue, here are the results.
When you set a connection charset (i.e. SET NAMES utf8
) MySQL transparently handle encoding conversion for you. For instance if I insert a à (\xE0 in latin1 \xC3A0 in utf8)
in a latin1 table using a UTF8 connection it reads the UTF 8 value and store it in table as \xE0
mysql> SELECT HEX('à');
+-----------+
| HEX('à') |
+-----------+
| C3A0 |
+-----------+
mysql> INSERT INTO articles VALUES(50001, 'à');
Query OK, 1 row affected (0,00 sec)
mysql> SELECT content, HEX(content) FROM articles WHERE id_p = 50001;
+---------+--------------+
| content | HEX(content) |
+---------+--------------+
| à | E0 |
+---------+--------------+
1 row in set (0,00 sec)
When I insert invalid utf8 chars into a latin1 it replaces them with question marks as I shown in the original question.
In order to fix my issue I had to run this command on the original table (actually I tried it on a small copy of it). It takes care of changing charset, collation and also convert existing data. I took a record with a char which is different in latin1 and utf8 encoding
mysql> select HEX(BINARY SUBSTRING(content, 17, 1)), SUBSTRING(content, 17, 1) from articles where id_p = 40\G
*************************** 1. row ***************************
HEX(BINARY SUBSTRING(content, 17, 1)): 93
SUBSTRING(content, 17, 1): “
1 row in set (0,00 sec)
mysql> ALTER TABLE `articles` CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Query OK, 34905 rows affected (1 min 10,73 sec)
Records: 34905 Duplicates: 0 Warnings: 0
mysql> select HEX(BINARY SUBSTRING(content, 17, 3)), SUBSTRING(content, 17, 3) from articles where id_p = 40\G
*************************** 1. row ***************************
HEX(BINARY SUBSTRING(content, 17, 1)): E2809C
SUBSTRING(content, 17, 1): “
1 row in set (0,00 sec)
After the conversion the “
char was replaced by its utf8 encoding in the content and all data are still readable. The conversion also changed content
column type from MEDIUMTEXT
to LONGTEXt
because latin1 uses 1 byte per char and utf8 up to 3 bytes per char to avoid data truncation.
Now I'm experimenting with inserting invalid utf8 chars into the converted table and I get a different result for that. It seems that invalid (or not supported 4 bytes) utf chars are simply dropped from the stored value with a warning (shown only if warning enabled)
$ mysql --show-warnings
mysql> INSERT INTO articles VALUES(90000, 0xC328);
Query OK, 1 row affected, 1 warning (0,00 sec)
Warning (Code 1366): Incorrect string value: '\xC3(' for column 'content' at row 1
mysql> SELECT 0xf09f8eb6;
+------------+
| 0xf09f8eb6 |
+------------+
| ? |
+------------+
1 row in set (0,00 sec)
mysql> INSERT INTO articles VALUES(90001, 0xf09f8eb6);
Query OK, 1 row affected, 1 warning (0,00 sec)
Warning (Code 1366): Incorrect string value: '\xF0\x9F\x8E\xB6' for column 'content' at row 1
After this I found that also in my original example a warning is shown if they are enabled:
-- With warnings enabled
mysql> INSERT INTO `articles` VALUES (50000, 'Hırvatistan');
Query OK, 1 row affected, 1 warning (0,00 sec)
Warning (Code 1366): Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1
Finally to trigger an error instead of just a warning (to avoid data loss) just change SQL mode for session or globally (at server level)
mysql> SET SESSION sql_mode = 'TRADITIONAL';
Query OK, 0 rows affected (0,00 sec)
mysql> INSERT INTO `articles` VALUES (50000, 'Hırvatistan');
ERROR 1366 (HY000): Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1
Best Answer
You'll have to specify a character set of UTF8 on the table schema. See http://dev.mysql.com/doc/refman/5.5/en/charset-syntax.html
Depending on your needs you can specify table defaults which then apply to all unspecified text columns (char/varchar/text) or you can specify on a per column level.
You'll also need to have your applications to specify a UTF-8 character encoding. The specifics on how to do this will depend on the language you are using.