Mysql – Solutions to 4-byte utf-8 characters in thesql 5.1

character-setMySQLmysql-5.1utf-8

I'm running into a problem that seems to be an issue w/ mysql 5.1s limitations on UTF8 characters.

From http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-utf8.html

Currently, MySQL support for UTF-8 does not include 4-byte sequences.

Our app is throwing periodic exceptions like

java.sql.SQLException: Incorrect string value: '\xF4\x80\x82\x8C  ...'

The ultimate solution looks to be migrating to 5.5 which supports UTF-8 > 3 bytes

Unfortunately this wouldn't be as simple as bouncing the instances under newer binaries. Being a major version dump we'd need to do a full dump reload which will require some scheduled downtime.

Has anyone else had to deal this situation before? Are there any good work arounds?

The naive approach seems to be have the app search and replace multi-byte sequences with question marks or � . This seems pretty hacky and not a very palatable option to me or the developers.

Best Answer

Opinion

It's probably better to throw an exception during your applications input checking and not pass the buck to the database.

Workaround

There is a "workaround" but your mileage may vary: http://forge.mysql.com/worklog/task.php?id=3780

Brute Force?

You could convert your front end table VARCHAR field to a BLOB and store as binary data to cure the current problem. ...and of course that invites a host of other problems by using BLOBs instead of VARCHAR.

Upgrading

UTF32 may solve the problem and upgrading to 5.5.x isn't has hard as you'd think. Create a replicated slave (farm), sync up and promote it to master.