MySQL – Handling Invalid UTF-8 Returns

MySQLutf-8

I am wondering: assuming everything is set up properly, meaning a column with charset utf8mb4 and a client connecting to the database setting the connection encoding to utf8mb4, is there any way in which said client could SELECT data which contains invalid UTF-8 byte sequences? Is there any way to insert data into a utf8mb4 column which is not in fact valid UTF-8?

Put another way: is it reasonable to assume that SELECTing any non-binary column over a client connection set to utf8mb4 should always return valid UTF-8 byte sequences?

(Let's assume plain SELECT * statements without any CAST shenanigans or such.)

Best Answer

I don't think so. But I have come close:

mysql> CREATE TABLE invalid (c VARBINARY(22));
Query OK, 0 rows affected (0.02 sec)

mysql> INSERT INTO invalid (c) VALUES (UNHEX('65c3c3c366'));
Query OK, 1 row affected (0.01 sec)

mysql> SELECT HEX(c) FROM invalid;
+------------+
| HEX(c)     |
+------------+
| 65C3C3C366 |
+------------+
1 row in set (0.00 sec)

mysql> ALTER TABLE invalid MODIFY COLUMN c VARCHAR(22) CHARACTER SET utf8mb4;
Query OK, 1 row affected, 1 warning (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 1

mysql> SHOW WARNINGS;
+---------+------+-----------------------------------------------------------------+
| Level   | Code | Message                                                         |
+---------+------+-----------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xC3\xC3\xC3f' for column 'c' at row 1 |
+---------+------+-----------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT HEX(c) FROM invalid;
+--------+
| HEX(c) |
+--------+
| 65     |
+--------+
1 row in set (0.00 sec)

By starting with BINARY, then converting to CHAR, I can get invalid data into the column. However, the ALTER gives a warning when I try to convert it.

Note in the final SELECT, the data is truncated. The truncation happened in the ALTER, not the SELECT, where your Question wanted it to happen. This can be verified by converting back to BINARY:

mysql> ALTER TABLE invalid MODIFY COLUMN c VARBINARY(22);
Query OK, 1 row affected (0.03 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> SELECT HEX(c) FROM invalid;
+--------+
| HEX(c) |
+--------+
| 65     |
+--------+
1 row in set (0.00 sec)

If you have seen the problem happen, please provide details; perhaps I can dissect the problem.