Find Non-UTF-8 Data in MySQL – Methods and Tools

MySQL

Ok so I have a mysql database containing data that is input from various customers. The customers are told that their data should only be encoded as utf8 however I have become aware that this is currently not being checked and some people are using characters not encoded in utf8. We intend to add some checking but in order to clean the data already there is there a sql or mysql query (or admin command) that will show me the data that is NOT encoded as utf8.

Sorry if this is a dumb question I'm a Linux Sysadmin and have limited exposure to databases.

Thanks

B

Best Answer

You can check for the existence of (non-)UTF-8 data by comparing byte length to character length on a column, e.g.:

SELECT * FROM MyTable
WHERE LENGTH(MyColumn) <> CHAR_LENGTH(MyColumn)

Multibyte characters will have a greater LENGTH (bytes), so you'll need to look for where that condition isn't met.

Note that MySQL's utf8 character set isn't true Unicode UTF-8 as it only supports a maximum of 3 bytes per character. If your MySQL is later than 5.5.3 you can use utf8mb4 to get 4 bytes per character.