Mysql – How to find an invalid utf8 character “somewhere” in the MySQL/trac database

character-setMySQLutf-8

I have an installation of trac, which uses MySQL. It has been upgraded so many times, moved servers etc. and chances are that the MySQL character sets were not always set correctly or consistently over the years. Currently all of them are utf8.

When attempting to upgrade the data using "trac-admin wiki update", I'm getting an error message that a byte ("UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 4274: invalid start byte") is not valid unicode. Unfortunately trac-admin gives me no hint where (table/row/column) to look for that byte sequence, or what I could do to fix it.

My question is not about trac/trac-admin, however, it's about the database. How would you go about finding, "somewhere" in the database, the offending bytes, and replacing them with something that is at least valid utf8. I have attempted to mysqldump the database and to re-import it, but MySQL gives no indication that anything might be wrong. The invalid bytes get re-imported.

Ideas?

Best Answer

You could try turning on the general query log, which logs every query the server receives, and then use that information to see what's being queried, and then issue those queries yourself and examine the data.

To see the current setting for the general log (location and whether it's enabled):

mysql> SHOW VARIABLES LIKE 'general_log%';  # shows the on/off status and file path

To turn it on:

mysql> SET GLOBAL general_log = on; # or off to disable

Remember to turn if off again when you're done, to avoid filling the disk and potentially degrading performance with excessive I/O.

There are also options for logging to a table instead of a file.

mysql> SHOW VARIABLES LIKE 'log_output';    # shows whether logging to FILE or TABLE