PostgreSQL Encoding Problem – How to Resolve UTF-8 Issues

encodingpostgresqlutf-8

I'm striving to convert badly encoded data from my table. For instance, I have a field with Nadège which should be Nadège.

I tried using Postgres's functions convert, convert_from, convert_to without much success.

db=# SHOW client_encoding;
 client_encoding 
-----------------
 UTF8
(1 row)

db=# SHOW server_encoding;
 server_encoding 
-----------------
 UTF8
(1 row)

db=# SELECT "firstName", encode("firstName"::bytea, 'hex') FROM contact;       
 firstName |       encode       
-----------+--------------------
 Nadège    | 4e6164c3a86765
 Nadège   | 4e6164c383c2a86765
(2 rows)

db=# SELECT "firstName", convert_from("firstName"::bytea, 'latin1') FROM contact WHERE "lastName" ILIKE 'crochard';
 firstName |  convert_from  
-----------+----------------
 Nadège    | Nadège
 Nadège   | NadÃ\u0083¨ge
(2 rows)

db=# SELECT "firstName", convert("firstName"::bytea, 'utf8', 'latin1') FROM contact;                                                                                                                                                       
 firstName |     convert      
-----------+------------------
 Nadège    | \x4e6164e86765
 Nadège   | \x4e6164c3a86765
(2 rows)

Using python I'm able to get the correct encoding with:

data.encode('latin1').decode('utf8')

Any hint on how to convert these wrongly encoded data in postgres ?

Best Answer

As you have correctly identified, Nadège is the UTF-8 representation of Nadège incorrectly decoded as ISO-8859-1 ("latin-1"). Then, in your case, re-encoded to UTF-8 for storage in the DB.

To fix it you need to:

  • Take the current representation and decode the UTF-8 to latin-1 as a byte string
  • re-interpret the byte string, decoding it as utf-8

So:

test=> SELECT convert_from(convert_to('Nadège', 'latin-1'), 'utf-8');
 convert_from 
--------------
 Nadège
(1 row)

The Python equivalent would be close to what you wrote, but starts with a unicode representation to illustrate that PostgreSQL stores everything in the database encoding. Something like:

>>> print u"Nadège".encode("latin-1").decode("utf-8")
Nadège

The problem with all your attempted solutions is that the cast from text to bytea uses the database encoding. So you're starting with the bytes for the utf-8 representation of utf-8 mis-decoded as latin-1. With the cast you'd have to write:

test=> SELECT convert_from(convert_to(convert_from((TEXT 'Nadège')::bytea, 'utf-8'), 'latin-1'), 'utf-8');
 convert_from 
--------------
 Nadège
(1 row)

because you have to explicitly decode the utf-8 representation produced by the cast before re-interpreting as latin-1 and decoding again.

You just needed to use convert_to(mycol, 'latin-1') instead of mycol::bytea