PostgreSQL – UTF8 to Latin1 Conversion

byteapostgresql

We have a Postgresql database where the data is in Latin1. Apparently someone loaded some UTF8 encoded data into it since when I select data from some of the columns, I get stuff like:

SELECT symptoms from client
- -
"huvudvÃ¤rke"
- -

where "Ã¤" should be "ä". I could handle it with replace, ie.:

UPDATE client SET symptoms=replace(symptoms, 'Ã¤', 'ä');

I tried a couple of solutions I found, like using convert and convert_to but that leaves the data looking like "huvudv\303\244rke".

Is there an existing solution to my problem?

EDIT:

SELECT symptoms, symptoms::bytea from client
- -
"huvudvÃ¤rke";"huvudv\303\203\302\244rke" 
- -

After the UPDATE above, query with symptoms::bytea outputs:

SELECT symptoms, symptoms::bytea from client
- -
"huvudvärke";"huvudv\303\244rke" 
- -

EDIT2:

select version();
"PostgreSQL 9.2.9, compiled by Visual C++ build 1600, 64-bit"

show client_encoding; --I'm using pgAdmin III 1.18.1
"UNICODE"

show server_encoding;
"UTF8"

SELECT symptoms, convert(convert_to(symptoms, 'utf-8'), 'latin-1', 'utf-8')::text from client;
"huvudvÃ¤rke","huvudv\303\203\302\203\303\202\302\244rke"

Best Answer

It looks like whatever client you are using is confused about the text encoding; it's sending utf-8 bytes as if they were latin-1, probably.

Check:

SHOW client_encoding;
SHOW server_encoding;
locale command in your terminal, if using psql

Your update is substituting the octal bytes \303\244 which are the utf-8 encoding for "ä" (U+00E4). You're not substituting latin-1 encoded data where you think you are.

Observe:

regress=> SELECT convert_from(BYTEA 'huvudv\303\244rke', 'latin-1');
 convert_from 
--------------
 huvudvÃ¤rke
(1 row)

regress=> SELECT convert_from(BYTEA 'huvudv\303\244rke', 'utf-8');
 convert_from 
--------------
 huvudvärke
(1 row)

Not only that, but your replace could only have matched in the first place if the replace target was the utf-8 encoded byte sequence for Ã¤ interpreted as latin-1, i.e. \303\203\302\244.

It's hard to be more specific without details about the Pg version, the client being used, etc, but the root cause is clearly your client doing something totally borked with encodings on I/O.

Your original text is totally mangled, it's not valid in UTF-8 or latin-1. It looks like someone's taken some UTF-8 data, decoded it as latin-1, and then encoded it as utf-8 again.

Yep, sure enough:

regress=> SELECT convert(convert_to('huvudvärke', 'utf-8'), 'latin-1', 'utf-8');
          convert          
---------------------------
 huvudv\303\203\302\244rke
(1 row)

there's your explanation.

I'd say you probably have a bunch of mis-encoded data in the DB already, and you noticed this one because it got mangled twice. You're probably doing something like routinely jamming utf-8 bytes into latin-1 encoded fields, but you usually get away with it because you decode them as utf-8 again - and this time you did something different.

If you want to get from your mangled text to the original, you simply have to reverse the incorrect encoding process. Decode the utf-8 and output latin-1, then re-interpet the latin-1 as utf-8 and decode again, e.g:

regress=> SELECT  convert_from(convert(BYTEA 'huvudv\303\203\302\244rke', 'utf-8', 'latin-1'), 'utf-8');
 convert_from 
--------------
 huvudvärke
(1 row)

Best Answer

Related Solutions

Postgresql – How to solve UTF8 invalid byte sequence copy errors on a restore, when the source database is encoded in UTF8

Postgresql – How to select bytea data as binary data, inserted as hex-encoded from binary

Related Question