Ok so I have a mysql database containing data that is input from various customers. The customers are told that their data should only be encoded as utf8 however I have become aware that this is currently not being checked and some people are using characters not encoded in utf8. We intend to add some checking but in order to clean the data already there is there a sql or mysql query (or admin command) that will show me the data that is NOT encoded as utf8.
Sorry if this is a dumb question I'm a Linux Sysadmin and have limited exposure to databases.
Thanks
B
Best Answer
You can check for the existence of (non-)UTF-8 data by comparing byte length to character length on a column, e.g.:
Multibyte characters will have a greater
LENGTH
(bytes), so you'll need to look for where that condition isn't met.Note that MySQL's
utf8
character set isn't true Unicode UTF-8 as it only supports a maximum of 3 bytes per character. If your MySQL is later than 5.5.3 you can useutf8mb4
to get 4 bytes per character.