I'm not certain. I tried to start out by reproducing your problem, but the ALTER worked fine for me.
test > CREATE TABLE `bar` ( `content` text ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.02 sec)
test > INSERT INTO bar VALUES (0x8081828384858687898A8B8C8D8E8F909192939495969798999A9B9C9D9E9F);
Query OK, 1 row affected (0.00 sec)
test > ALTER TABLE bar CHANGE content content TEXT CHARACTER SET UTF8;
Query OK, 1 row affected (0.04 sec)
Records: 1 Duplicates: 0 Warnings: 0
test > select * from bar;
+---------------------------------+
| content |
+---------------------------------+
| ����������������������������� |
+---------------------------------+
1 row in set (0.00 sec)
test > set names utf8;
Query OK, 0 rows affected (0.00 sec)
test > select * from bar;
+---------------------------------------------------------------------------------+
| content |
+---------------------------------------------------------------------------------+
| €‚ƒ„…†‡‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ |
+---------------------------------------------------------------------------------+
1 row in set (0.00 sec)
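As an aside, the replacement characters in the first SELECT are a display artifact, not corruption: with character_set_results still latin1, the server hands back the raw 0x80-0x9F bytes, which a UTF-8 terminal cannot decode. A quick Python sketch of that effect:

```python
# With character_set_results = latin1, the server returns the original
# cp1252 bytes unchanged; a UTF-8 terminal cannot decode lone 0x80-0x9F
# bytes, so it renders one replacement character per undecodable byte.
raw = bytes([0x80, 0x81, 0x82])               # first three bytes of the row
print(raw.decode('utf-8', errors='replace'))  # -> ���
```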
Here are my related character-set settings:
test > show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
Edit: my character-set settings before running SET NAMES utf8
test > show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
Version
test > select version();
+-------------------------+
| version() |
+-------------------------+
| 5.1.41-3ubuntu12.10-log |
+-------------------------+
1 row in set (0.00 sec)
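The reason the converted column shows €‚ƒ… is that MySQL's latin1 is effectively Windows cp1252, which maps bytes 0x80-0x9F to printable characters; the ALTER transcoded each stored byte through that mapping. A small Python sketch of the same conversion (an approximation — MySQL's actual latin1 mapping also covers a few code points cp1252 leaves undefined):

```python
# MySQL's "latin1" behaves like Windows cp1252: bytes 0x80-0x9F map to
# printable characters such as the euro sign. ALTER ... CHARACTER SET utf8
# transcodes each stored byte through that mapping.
sample = bytes([0x80, 0x82, 0x99])         # a few of the stored bytes
text = sample.decode('cp1252')             # how MySQL interprets latin1
print(text)                                # -> €‚™
print(text.encode('utf-8').hex().upper())  # bytes stored after the ALTER
# -> E282ACE2809AE284A2
```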
"The C collation is the right choice. Everything is a bit faster without locale. And since no collation is right anyway, create the database without collation, meaning with C."
It may be a pain to have to provide a collation for many operations. There shouldn't be a noticeable difference in speed between the default collation and an ad-hoc collation, though. After all it's just unsorted data, and collation rules are applied when sorting.
Be aware that Postgres builds on the locale settings provided by the underlying OS, so you need to have locales generated for each locale you want to use. More in related answers on SO.
However, as @Craig already mentioned, indexes are the bottleneck in this scenario. In many cases involving character data, the collation of the index has to match the collation of the applied operator. You can use the COLLATE specifier in indexes to produce matching indexes. Partial indexes may be the perfect choice if you are mixing data in the same table.
For example, a table with international strings:
CREATE TABLE string (
  string_id serial PRIMARY KEY
, lang_id   int NOT NULL
, string    text NOT NULL
);
And you are mostly interested in one language at a time:
SELECT *
FROM string
WHERE lang_id = 5 -- 5 being German / Germany here
AND string > 'foo' COLLATE "de_DE"
ORDER BY string COLLATE "de_DE";
Then create partial indexes like:
CREATE INDEX string_string_lang_id_idx ON string (string COLLATE "de_DE")
WHERE lang_id = 5;
One for each language you need.
Actually, inheritance might be a superior approach for a table like this. Then you can have a plain index on each inherited table containing only strings for a single locale. You need to be comfortable with the special rules for inherited tables, of course.
Best Answer
"displayed incorrectly" -- That does not necessarily mean that the text is stored incorrectly. How many steps do you have here?
1. You have data in a WP database? Please provide SHOW CREATE TABLE for one of the tables so we can verify whether it is latin1 or utf8. Also, do SELECT col, HEX(col) to see what you have in the table for €. For utf8, the hex should be E282AC. As you pointed out, it cannot be stored in latin1. If you don't see that hex, stop; we need to debug things before this point. Else continue...
2. Then you are doing a dump? Let mysqldump specify the CHARACTER SET (not the COLLATION).
3. Take a peek at the dump file. Well, this may be useless, because it depends on what tool you use to peek at it.
4. Then you are reloading the data? After reloading, what does the hex say? Again, stop if it is not that same hex.
5. Now you are running WP and trying to display stuff. On a web page? Look at the HTML; there should be a <meta ...> tag very near the start. What does it say? In particular, does it say charset=UTF-8?
6. One more thing to check is the SET NAMES utf8 (or equivalent) -- phpmyadmin establishes the CHARACTER SET in one place; find it. And find it for WP.
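The expected hex values in step 1 can be checked without MySQL. A Python sketch of a correctly stored euro sign versus the classic double-encoding symptom (utf8 bytes misread as cp1252 and re-encoded):

```python
# A correctly stored utf8 euro sign is the three bytes E2 82 AC.
euro = '€'.encode('utf-8')
print(euro.hex().upper())            # -> E282AC
# Double encoding: the utf8 bytes get misread as cp1252 and re-encoded;
# this is the kind of value HEX(col) shows when data was mangled on the
# way into the table.
double = euro.decode('cp1252').encode('utf-8')
print(double.hex().upper())          # -> C3A2E2809AC2AC
```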