I'm not certain. I tried to start out by reproducing your problem, but the ALTER worked fine for me.
test > CREATE TABLE `bar` ( `content` text ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.02 sec)
test > INSERT INTO bar VALUES (0x8081828384858687898A8B8C8D8E8F909192939495969798999A9B9C9D9E9F);
Query OK, 1 row affected (0.00 sec)
test > ALTER TABLE bar CHANGE content content TEXT CHARACTER SET UTF8;
Query OK, 1 row affected (0.04 sec)
Records: 1 Duplicates: 0 Warnings: 0
test > select * from bar;
+---------------------------------+
| content |
+---------------------------------+
| ����������������������������� |
+---------------------------------+
1 row in set (0.00 sec)
test > set names utf8;
Query OK, 0 rows affected (0.00 sec)
test > select * from bar;
+---------------------------------------------------------------------------------+
| content |
+---------------------------------------------------------------------------------+
| €‚ƒ„…†‡‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ |
+---------------------------------------------------------------------------------+
1 row in set (0.00 sec)
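As an aside, the replacement characters in the first SELECT are a display artifact, not corruption: with character_set_results still latin1, the server hands back the raw 0x80-0x9F bytes, which a UTF-8 terminal cannot decode. A quick Python sketch of that effect:

```python
# With character_set_results = latin1, the server returns the original
# cp1252 bytes unchanged; a UTF-8 terminal cannot decode lone 0x80-0x9F
# bytes, so it renders one replacement character per undecodable byte.
raw = bytes([0x80, 0x81, 0x82])               # first three bytes of the row
print(raw.decode('utf-8', errors='replace'))  # -> ���
```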
Here are my related character-set settings:
test > show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
Edit: my character-set settings before running SET NAMES utf8
test > show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
Version
test > select version();
+-------------------------+
| version() |
+-------------------------+
| 5.1.41-3ubuntu12.10-log |
+-------------------------+
1 row in set (0.00 sec)
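The reason the converted column shows €‚ƒ… is that MySQL's latin1 is effectively Windows cp1252, which maps bytes 0x80-0x9F to printable characters; the ALTER transcoded each stored byte through that mapping. A small Python sketch of the same conversion (an approximation — MySQL's actual latin1 mapping also covers a few code points cp1252 leaves undefined):

```python
# MySQL's "latin1" behaves like Windows cp1252: bytes 0x80-0x9F map to
# printable characters such as the euro sign. ALTER ... CHARACTER SET utf8
# transcodes each stored byte through that mapping.
sample = bytes([0x80, 0x82, 0x99])         # a few of the stored bytes
text = sample.decode('cp1252')             # how MySQL interprets latin1
print(text)                                # -> €‚™
print(text.encode('utf-8').hex().upper())  # bytes stored after the ALTER
# -> E282ACE2809AE284A2
```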
"The C collation is the right choice. Everything is a bit faster without locale. And since no collation is right anyway, create the database without collation, meaning with C."
It may be a pain to have to provide a collation for many operations. There shouldn't be a noticeable difference in speed between the default collation and an ad-hoc collation, though. After all it's just unsorted data, and collation rules are applied when sorting.
Be aware that Postgres builds on the locale settings provided by the underlying OS, so you need to have locales generated for each locale you want to use. More in related answers on SO.
However, as @Craig already mentioned, indexes are the bottleneck in this scenario. In many cases involving character data, the collation of the index has to match the collation of the applied operator. You can use the COLLATE specifier in indexes to produce matching indexes. Partial indexes may be the perfect choice if you are mixing data in the same table.
For example, a table with international strings:
CREATE TABLE string (
  string_id serial PRIMARY KEY
, lang_id   int NOT NULL
, string    text NOT NULL
);
And you are mostly interested in one language at a time:
SELECT *
FROM string
WHERE lang_id = 5 -- 5 being German / Germany here
AND string > 'foo' COLLATE "de_DE"
ORDER BY string COLLATE "de_DE";
Then create partial indexes like:
CREATE INDEX string_string_lang_id_idx ON string (string COLLATE "de_DE")
WHERE lang_id = 5;
One for each language you need.
Actually, inheritance might be a superior approach for a table like this. Then you can have a plain index on each inherited table containing only strings for a single locale. You need to be comfortable with the special rules for inherited tables, of course.
Best Answer
"displayed incorrectly" -- That does not necessarily mean that the text is stored incorrectly. How many steps do you have here?
1. You have data in a WP database? Please provide SHOW CREATE TABLE for one of the tables so we can verify whether it is latin1 or utf8. Also, do SELECT col, HEX(col) to see what you have in the table for €. For utf8, the hex should be E282AC. As you pointed out, it cannot be stored in latin1. If you don't see that hex, stop; we need to debug things before this point. Else continue...
2. Then you are doing a dump? Let mysqldump specify the CHARACTER SET (not the COLLATION).
3. Take a peek at the dump file. Well, this may be useless, because it depends on what tool you use to peek at it.
4. Then you are reloading the data? After reloading, what does the hex say? Again, stop if it is not that same hex.
5. Now you are running WP and trying to display stuff. On a web page? Look at the HTML; there should be a <meta ...> tag very near the start. What does it say? In particular, does it say charset=UTF-8?
6. One more thing to check is the SET NAMES utf8 (or equivalent) -- phpmyadmin establishes the CHARACTER SET in one place; find it. And find it for WP.
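The expected hex values in step 1 can be checked without MySQL. A Python sketch of a correctly stored euro sign versus the classic double-encoding symptom (utf8 bytes misread as cp1252 and re-encoded):

```python
# A correctly stored utf8 euro sign is the three bytes E2 82 AC.
euro = '€'.encode('utf-8')
print(euro.hex().upper())            # -> E282AC
# Double encoding: the utf8 bytes get misread as cp1252 and re-encoded;
# this is the kind of value HEX(col) shows when data was mangled on the
# way into the table.
double = euro.decode('cp1252').encode('utf-8')
print(double.hex().upper())          # -> C3A2E2809AC2AC
```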