Mysql – Gotchas converting latin1_swedish_ci to utf8mb4_unicode_ci

character-setMySQLutf-8

I have a MySQL 5.5.31 database which has approx 220 tables – of these 220 tables, around half of them are already using utf8mb4_unicode_ci but the "older" tables are still using latin1_swedish_ci. The fields in the tables are a mix of integer, varchar, longtext, date, datetime and decimal and there are no views or stored procedures.

I now have a requirement to support Russian text so I need to convert the remaining tables to UTF8. Based on my reading, I have opted for utf8mb4_unicode_ci as I believe it offers the broadest support including Chinese.

I have tested as much as I can with individual tables in a test environment without any issues and now need to bite the bullet and do this on my production database – I will, of course, take a full backup just before I make any changes.

I have read of some issues with converting TEXT and CHAR fields as well as some considerations on index sizes for VARCHAR fields but nothing else significant.

I know how to do this but I wanted a final sanity check to make sure there is nothing major I am missing before taking the plunge?

Best Answer

I assume you will do this for each table? ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4;

One potential issue involves the maximum of 767 bytes per column in an INDEX. If you currently have a VARHAR(255) latin1 field in an index, you will need to rethink it.

  • Decrease it to VARCHAR(191) if you are sure that 191 will suffice (into the future)
  • Use a 'prefix' index: INDEX(foo(191)). This is often not worth doing.
  • DROP the index. (But not if you need the index; let's see the SELECT that needs it.)

Edit

For shrinking to VARCHAR(191), you could write a Stored Procedure to look in information_schema to discover all the VARCHARs, and create and execute SELECT MAX(CHAR_LENGTH(col)) ... to see if any are over 191 (or even dangerously close to 191). You might be surprised to find a few with max=255, implying that you may already have truncated some text.

The other two options won't damage data, but could have performance problems.