psql
detects the client_encoding
from the LC_CTYPE
variable in the environment; this falls back to LC_ALL
and then LANG
if unset.
In the terminal you're launching psql
from, run locale
. e.g.
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Paste the result as an edit to your question. I strongly suspect your terminal locale will be en_CA
, not en_CA.UTF-8
.
I made some attempts to dig into this issue, here are the results.
When you set a connection charset (i.e. SET NAMES utf8
) MySQL transparently handle encoding conversion for you. For instance if I insert a à (\xE0 in latin1 \xC3A0 in utf8)
in a latin1 table using a UTF8 connection it reads the UTF 8 value and store it in table as \xE0
mysql> SELECT HEX('à');
+-----------+
| HEX('à') |
+-----------+
| C3A0 |
+-----------+
mysql> INSERT INTO articles VALUES(50001, 'à');
Query OK, 1 row affected (0,00 sec)
mysql> SELECT content, HEX(content) FROM articles WHERE id_p = 50001;
+---------+--------------+
| content | HEX(content) |
+---------+--------------+
| à | E0 |
+---------+--------------+
1 row in set (0,00 sec)
When I insert invalid utf8 chars into a latin1 it replaces them with question marks as I shown in the original question.
In order to fix my issue I had to run this command on the original table (actually I tried it on a small copy of it). It takes care of changing charset, collation and also convert existing data. I took a record with a char which is different in latin1 and utf8 encoding
mysql> select HEX(BINARY SUBSTRING(content, 17, 1)), SUBSTRING(content, 17, 1) from articles where id_p = 40\G
*************************** 1. row ***************************
HEX(BINARY SUBSTRING(content, 17, 1)): 93
SUBSTRING(content, 17, 1): “
1 row in set (0,00 sec)
mysql> ALTER TABLE `articles` CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Query OK, 34905 rows affected (1 min 10,73 sec)
Records: 34905 Duplicates: 0 Warnings: 0
mysql> select HEX(BINARY SUBSTRING(content, 17, 3)), SUBSTRING(content, 17, 3) from articles where id_p = 40\G
*************************** 1. row ***************************
HEX(BINARY SUBSTRING(content, 17, 1)): E2809C
SUBSTRING(content, 17, 1): “
1 row in set (0,00 sec)
After the conversion the “
char was replaced by its utf8 encoding in the content and all data are still readable. The conversion also changed content
column type from MEDIUMTEXT
to LONGTEXt
because latin1 uses 1 byte per char and utf8 up to 3 bytes per char to avoid data truncation.
Now I'm experimenting with inserting invalid utf8 chars into the converted table and I get a different result for that. It seems that invalid (or not supported 4 bytes) utf chars are simply dropped from the stored value with a warning (shown only if warning enabled)
$ mysql --show-warnings
mysql> INSERT INTO articles VALUES(90000, 0xC328);
Query OK, 1 row affected, 1 warning (0,00 sec)
Warning (Code 1366): Incorrect string value: '\xC3(' for column 'content' at row 1
mysql> SELECT 0xf09f8eb6;
+------------+
| 0xf09f8eb6 |
+------------+
| ? |
+------------+
1 row in set (0,00 sec)
mysql> INSERT INTO articles VALUES(90001, 0xf09f8eb6);
Query OK, 1 row affected, 1 warning (0,00 sec)
Warning (Code 1366): Incorrect string value: '\xF0\x9F\x8E\xB6' for column 'content' at row 1
After this I found that also in my original example a warning is shown if they are enabled:
-- With warnings enabled
mysql> INSERT INTO `articles` VALUES (50000, 'Hırvatistan');
Query OK, 1 row affected, 1 warning (0,00 sec)
Warning (Code 1366): Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1
Finally to trigger an error instead of just a warning (to avoid data loss) just change SQL mode for session or globally (at server level)
mysql> SET SESSION sql_mode = 'TRADITIONAL';
Query OK, 0 rows affected (0,00 sec)
mysql> INSERT INTO `articles` VALUES (50000, 'Hırvatistan');
ERROR 1366 (HY000): Incorrect string value: '\xC4\xB1rvat...' for column 'content' at row 1
Best Answer
Sometimes ago, I faced a similar issue and I solved that issue with the help of my friend. How I did was I changed the collation to uft8_turkish_ci in MySQL. Also from MySQL website, I found the following document, which may be helpful for you,
MySQL Character set
Thanks.