I was given the task to migrate a PostgreSQL 8.2.x database to another server. To do this I'm using the pgAdmin 1.12.2 (on Ubuntu 11.04 by the way) and using the Backup and Restore using the custom/compress format (.backup) and UTF8 encoding.
The original database is in UTF8, like so:
-- Database: favela
-- DROP DATABASE favela;
CREATE DATABASE favela
WITH OWNER = favela
ENCODING = 'UTF8'
TABLESPACE = favela
CONNECTION LIMIT = -1;
I'm creating this database exactly like this on the destination server. But when I restore the database from the .backup file using the Restore option it gives me some of these errors:
pg_restore: restoring data for table "arena"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 2173; 0 35500 TABLE DATA arena favela
pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence for encoding "UTF8": 0xe3a709
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT: COPY arena, line 62
When I check which record triggered this error in fact some vartext fields have diacritical characters like ç (used in Portuguese, for example, "caça"), and when I manually remove them from the text in the records the error passes to the next record that has them – since when copy has an error it stops inserting data on this table. And I don't want to replace them manually one by one to accomplish this.
But it's kinda of strange because with UTF8 there shouldn't be this kind of problems, right?
I don't know how they got there in the first place. I'm only migrating the database, and I supose that somehow the database was like in LATIN1 and then was improperly changed to UTF8.
Is there any way to check if a table/database has invalid UTF8 sequences? Or any way to enforce/reconvert these characters into UFT8 so I don't run into any problems when I execute the restore?
Thanks, in advance.
Best Answer
Digging around the internet, I've seen that this is a pretty common problem. The common solution is to use the plain text format dump and feed it through iconv to correct the encoding.
Here is more information about that.