I'm planning to convert a table from utf8
(_general_ci
) to utf8mb4
(utf8mb4_unicode_ci
). I want to make sure I don't have any malformed data after the conversion though so I was planning to duplicate the data, run a join on the two tables, and see if there are any differences.
Is this a good way to go about it?
1.
Create table Email_utf8mb4 like Email;
2.
insert into Email_utf8mb4 select * from Email;
3.
ALTER TABLE Email_utf8mb4
CONVERT TO CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
4.
select * from Email as old
join Email_utf8mb4 as new
on old.notificationid =new.notificationid
and (old.subject <> new.subject OR old.content <>
new.content)
4.5. Assuming no rows are returned…
5.
drop Email;
6.
RENAME TABLE Email_utf8mb4 to Email;
subject
and content
are the only alpha columns in my table.
Best Answer
That won't fix or recognize any malformed data already in the table.
Yes, that is a good technique.
ALTER..CONVERT TO..
does the bulk of the work. And, since utf8 is a subset of utf8mb4, there should be no differences discovered in step 4.However, there is still a possibility of step 4 showing something. This is because the definition of "equal" (hence
<>
) is different for_general_ci
versus_unicode_ci
.For example, in German,
ss
andß
are unequal in _general_ci (either utf8 or utf8mb4), but equal in virtually all other collations. If, for example, you currently have aUNIQUE
(orPRIMARY KEY
) with a value that differs in justss
vsß
, the conversion will have a problem with "duplicate key" error.Another problematic pair:
ae
vsæ
A different issue - not equality, but ordering -
Ð
<E
for _unicode_ci, but not for some other collations.Meanwhile, as long as you are changing the
COLLATION
, you may as well go to the newer Unicode Algorithm inutf8mb4_unicode_520_ci
.