MYSQL dump and problems character encoding

character-setMySQL

I have a 2 gb old database sql backup file. The problem is that text is encoded in UTF-8. Actually this isnt the real problem. This database contains greek and english characters.

The problem is that for example greek Π letter is encoded as
C3 90 UTF-8 which is unicode U+00D0 , clearly not Π letter. But in ISO-8859-7 D0 is Π!

This occurs almost everywhere.. Is there a fix i can apply ?Ideally i want to convert it to utf-8..

NOTICE
ISO8859-7 was used

Best Answer

Assuming your source data contained only English and Greek characters, and was mis-exported with an ISO-8859-1 to UTF-8 conversion (rather than an ISO-8859-7 to UTF-8 conversion), you can get your data back by first repeating the missed conversion the other way around, and then doing the right one.

You could use iconv for this (available on pretty much all Linux distributions, and on Windows via gnuwin32).

You'd do something like:

$ iconv -f UTF-8 -t ISO-8859-1 < bad_dump_file.sql | \
  iconv -f ISO-8859-7 -t UTF-8 > good_dump_file.sql

As an example:

$ echo -n $'\xC3'$'\x90' | iconv -f UTF-8 -t ISO-8859-1 | \
   iconv -f ISO-8859-7 -t UTF-8 | hexdump -C
00000000  ce a0                                             |Î |
00000002

And 0xCE 0xA0 is the right UTF-8 encoding for Π.

Related Solutions

Mysql – Character Set Encoding in a Table

Here is the list of collations under UTF-8

mysql> select * from information_schema.collations where CHARACTER_SET_NAME = 'utf8';
+--------------------+--------------------+-----+------------+-------------+---------+
| COLLATION_NAME     | CHARACTER_SET_NAME | ID  | IS_DEFAULT | IS_COMPILED | SORTLEN |
+--------------------+--------------------+-----+------------+-------------+---------+
| utf8_general_ci    | utf8               |  33 | Yes        | Yes         |       1 |
| utf8_bin           | utf8               |  83 |            | Yes         |       1 |
| utf8_unicode_ci    | utf8               | 192 |            | Yes         |       8 |
| utf8_icelandic_ci  | utf8               | 193 |            | Yes         |       8 |
| utf8_latvian_ci    | utf8               | 194 |            | Yes         |       8 |
| utf8_romanian_ci   | utf8               | 195 |            | Yes         |       8 |
| utf8_slovenian_ci  | utf8               | 196 |            | Yes         |       8 |
| utf8_polish_ci     | utf8               | 197 |            | Yes         |       8 |
| utf8_estonian_ci   | utf8               | 198 |            | Yes         |       8 |
| utf8_spanish_ci    | utf8               | 199 |            | Yes         |       8 |
| utf8_swedish_ci    | utf8               | 200 |            | Yes         |       8 |
| utf8_turkish_ci    | utf8               | 201 |            | Yes         |       8 |
| utf8_czech_ci      | utf8               | 202 |            | Yes         |       8 |
| utf8_danish_ci     | utf8               | 203 |            | Yes         |       8 |
| utf8_lithuanian_ci | utf8               | 204 |            | Yes         |       8 |
| utf8_slovak_ci     | utf8               | 205 |            | Yes         |       8 |
| utf8_spanish2_ci   | utf8               | 206 |            | Yes         |       8 |
| utf8_roman_ci      | utf8               | 207 |            | Yes         |       8 |
| utf8_persian_ci    | utf8               | 208 |            | Yes         |       8 |
| utf8_esperanto_ci  | utf8               | 209 |            | Yes         |       8 |
| utf8_hungarian_ci  | utf8               | 210 |            | Yes         |       8 |
| utf8_sinhala_ci    | utf8               | 211 |            | Yes         |       8 |
+--------------------+--------------------+-----+------------+-------------+---------+
22 rows in set (0.03 sec)

latin1_swedish_ci belongs to latin1

mysql> select * from information_schema.collations where COLLATION_NAME = 'latin1_swedish_ci';
+-------------------+--------------------+----+------------+-------------+---------+
| COLLATION_NAME    | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
+-------------------+--------------------+----+------------+-------------+---------+
| latin1_swedish_ci | latin1             |  8 | Yes        | Yes         |       1 |
+-------------------+--------------------+----+------------+-------------+---------+
1 row in set (0.03 sec)

The closest utf8 for your collation is ID 200

mysql> select * from information_schema.collations where ID = 200;
+-----------------+--------------------+-----+------------+-------------+---------+
| COLLATION_NAME  | CHARACTER_SET_NAME | ID  | IS_DEFAULT | IS_COMPILED | SORTLEN |
+-----------------+--------------------+-----+------------+-------------+---------+
| utf8_swedish_ci | utf8               | 200 |            | Yes         |       8 |
+-----------------+--------------------+-----+------------+-------------+---------+
1 row in set (0.00 sec)

CAVEAT

You may have to experiment with latin1, change the character set and collation of the individual column in the table, or possibly both. At the very least, use latin1 for display in phpmyadmin.

You could tweek and experiment with making the entire database a specific character set and collation using ALTER DATABASE.

ALTER DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;

Make sure you make a backup of the database and load it into a dev/staging DB and then run ALTER DATABASE against dev/staging.

Mysql – How to convert control characters in MySQL from latin1 to UTF-8

I'm not certain. I tried to start out be reproducing your problem but the alter worked fine for me.

test > CREATE TABLE `bar` (  `content` text ) ENGINE=MyISAM DEFAULT CHARSET=latin1;  INSERT INTO bar VALUES (0x8081828384858687898A8B8C8D8E8F909192939495969798999A9B9C9D9E9F);
Query OK, 0 rows affected (0.02 sec)

Query OK, 1 row affected (0.00 sec)

test > ALTER TABLE bar CHANGE content content TEXT CHARACTER SET UTF8;
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0

test > select * from bar;
+---------------------------------+
| content                         |
+---------------------------------+
| ����������������������������� |
+---------------------------------+
1 row in set (0.00 sec)

test > set names utf8;
Query OK, 0 rows affected (0.00 sec)

test > select * from bar;
+---------------------------------------------------------------------------------+
| content                                                                         |
+---------------------------------------------------------------------------------+
| €‚ƒ„…†‡‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ |
+---------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Here's my related char settings

test > show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

Edit

My char settings before running set names utf8

test > show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Version

test > select version();
+-------------------------+
| version()               |
+-------------------------+
| 5.1.41-3ubuntu12.10-log |
+-------------------------+
1 row in set (0.00 sec)

Best Answer

Related Solutions

Mysql – Character Set Encoding in a Table

Mysql – How to convert control characters in MySQL from latin1 to UTF-8

Related Question