Postgresql – How to solve UTF8 invalid byte sequence copy errors on a restore, when the source database is encoded in UTF8

migrationpgadminpostgresql

I was given the task to migrate a PostgreSQL 8.2.x database to another server. To do this I'm using the pgAdmin 1.12.2 (on Ubuntu 11.04 by the way) and using the Backup and Restore using the custom/compress format (.backup) and UTF8 encoding.

The original database is in UTF8, like so:

-- Database: favela

-- DROP DATABASE favela;

CREATE DATABASE favela
  WITH OWNER = favela
       ENCODING = 'UTF8'
       TABLESPACE = favela
       CONNECTION LIMIT = -1;

I'm creating this database exactly like this on the destination server. But when I restore the database from the .backup file using the Restore option it gives me some of these errors:

pg_restore: restoring data for table "arena"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 2173; 0 35500 TABLE DATA arena favela
pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe3a709
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT:  COPY arena, line 62

When I check which record triggered this error in fact some vartext fields have diacritical characters like ç (used in Portuguese, for example, "caça"), and when I manually remove them from the text in the records the error passes to the next record that has them – since when copy has an error it stops inserting data on this table. And I don't want to replace them manually one by one to accomplish this.

But it's kinda of strange because with UTF8 there shouldn't be this kind of problems, right?

I don't know how they got there in the first place. I'm only migrating the database, and I supose that somehow the database was like in LATIN1 and then was improperly changed to UTF8.

Is there any way to check if a table/database has invalid UTF8 sequences? Or any way to enforce/reconvert these characters into UFT8 so I don't run into any problems when I execute the restore?

Thanks, in advance.

Best Answer

Digging around the internet, I've seen that this is a pretty common problem. The common solution is to use the plain text format dump and feed it through iconv to correct the encoding.

Here is more information about that.

Related Solutions

Postgresql – issues with encoding and pg_dump/restore between windows and linux

I suppose the dump file you have is encoded in WIN1252 and the target database on Linux has LATIN1 as the server encoding. That's not going to work, as the error message says.

I suggest you reinitialize the target database using UTF8.

Alternatively, create the dump in UTF8 using the pg_dump -E option.

Postgresql – pg_restore: [archiver (db)] could not execute query: ERROR: schema “public” already exists

The error is harmless but to get rid of it, I think you need to break this restore into two commands, as in:

dropdb -U postgres mydb && \
 pg_restore --create --dbname=postgres --username=postgres pg_backup.dump

The --clean option in pg_restore doesn't look like much but actually raises non-trivial problems.

For versions up to 9.1

The combination of --create and --clean in pg_restore options used to be an error in older PG versions (up to 9.1). There is indeed some contradiction between (quoting the 9.1 manpage):

--clean Clean (drop) database objects before recreating them

and

--create Create the database before restoring into it.

Because what's the point of cleaning inside a brand-new database?

Starting from version 9.2

The combination is now accepted and the doc says this (quoting the 9.3 manpage):

--clean Clean (drop) database objects before recreating them. (This might generate some harmless error messages, if any objects were not present in the destination database.)

--create Create the database before restoring into it. If --clean is also specified, drop and recreate the target database before connecting to it.

Now having both together leads to this kind of sequence during your restore:

DROP DATABASE mydb;
...
CREATE DATABASE mydb WITH TEMPLATE = template0... [other options]
...
CREATE SCHEMA public;
...
CREATE TABLE...

There is no DROP for each individual object, only a DROP DATABASE at the beginning. If not using --create this would be the opposite.

Anyway this sequence raises the error of public schema already existing because creating mydb from template0 has imported it already (which is normal, it's the point of a template database).

I'm not sure why this case is not handled automatically by pg_restore. Maybe this would cause undesirable side-effects when an admin decides to customize template0 and/or change the purpose of public, even if we're not supposed to do that.

Best Answer

Related Solutions

Postgresql – issues with encoding and pg_dump/restore between windows and linux

Postgresql – pg_restore: [archiver (db)] could not execute query: ERROR: schema “public” already exists

Related Question