PostgreSQL – Restoring SQL_ASCII dumps to a UTF8-encoded database

encoding, postgresql

I've got a Postgres 8.4 environment where the encoding on all our databases is set to SQL_ASCII. We're finally migrating to Postgres 9.2, and I would like to move everything over to UTF8 encoding.

Unfortunately, the text data in this DB is not clean: trying to restore the pg_dump output to a UTF8-encoded database throws errors about invalid byte sequences, even if I specify --encoding=UTF8 when I run pg_dump (presumably because Postgres doesn't know what to make of the offending bytes and just dumps them out unchanged?).

We have a LOT of data (upwards of a million rows with text/string elements), and auditing all of it by hand would be very time-consuming (and error-prone), so I'd like to automate this if possible.

Is there an easy way to find the non-UTF8-conforming strings/text fields in the database so we can fix them, or am I stuck with a manual audit to straighten this mess out?

Best Answer

I suspect your database content is actually in ISO 8859-1 or CP1252. If it were pure ASCII, you would not run into problems importing it. You may be able to determine the encoding by opening your dump with Python. The following Python 3 function tries each candidate encoding in turn and returns the first one that can decode the entire file.

def detect_encoding(file_name):
    # Try UTF-8 first: cp1252 assigns a character to almost every
    # byte value, so it would otherwise "succeed" on nearly any file.
    for enc in ('utf8', 'cp1252'):
        print('Trying encoding', enc)
        try:
            with open(file_name, 'r', encoding=enc) as f:
                f.read()
            return enc
        except UnicodeDecodeError as e:
            print(e)
    return None
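
For example (the dump file name here is just a placeholder):

    print(detect_encoding('olddb.sql'))  # prints 'utf8', 'cp1252', or None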

Python can also be used to access the database and audit the data, and to transcode the data to UTF-8 while copying it if necessary.
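
As a minimal sketch of the audit idea, assuming the old database is reachable via psycopg2 and that the connection string, table, and column names (olddb, mytable, id, body) are placeholders for your own schema: it pulls each text value out as raw bytes with convert_to() (which should be a no-op in a SQL_ASCII database) and reports every row that is not valid UTF-8, along with a cp1252 reading to aid manual fixes.

    import psycopg2

    conn = psycopg2.connect('dbname=olddb')  # hypothetical connection string
    cur = conn.cursor()
    # convert_to() returns the column's raw bytes as bytea, bypassing the
    # client-side text decoding that would fail on non-ASCII input.
    cur.execute("SELECT id, convert_to(body, 'SQL_ASCII') FROM mytable")
    for row_id, raw in cur:
        data = bytes(raw)  # psycopg2 hands bytea back as a memoryview
        try:
            data.decode('utf8')
        except UnicodeDecodeError as e:
            print(row_id, e)
            print('  as cp1252:', data.decode('cp1252', errors='replace'))
    conn.close()

If the audit shows the stray bytes are consistently cp1252, transcoding the whole dump file may be simpler than fixing rows in place:

    # Rewrite the dump, assuming all of its non-ASCII bytes are cp1252.
    with open('olddb.sql', 'r', encoding='cp1252') as src, \
         open('olddb.utf8.sql', 'w', encoding='utf8') as dst:
        for line in src:
            dst.write(line)

The rewritten file should then restore cleanly into the UTF8-encoded database.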