PostgreSQL – Character Encoding in Postgres

encoding perl postgresql

I have UTF8 data stored in Postgres. I can query it with a Perl script and display it properly. When I use the Postgres client, it displays like the encoding is off. I'll see characters displayed like \u0087\u0081.

From inside the database:

=> \encoding
UTF8

In the shell (I manually set LANGUAGE as I read elsewhere that might be a fix.):

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

This is in Ubuntu 16.04.

What setting have I missed?

Edited to add code populating psql:

#!/usr/bin/perl

use strict;
use warnings;
use DBI;

### Source DB setup
my $maria_database="srcdb";
my $maria_user = "";
my $maria_password = "";
my $maria_host= "localhost";

my $mariadbh = DBI->connect("dbi:mysql:database=$maria_database;host=$maria_host","$maria_user","$maria_password",{AutoCommit=>1,RaiseError=>1,PrintError=>0,mysql_enable_utf8=>1});

my $mysrcquery = "SELECT fields from sourcetable limit 2;";
my $src = $mariadbh->prepare($mysrcquery);

### Destination DB setup
my $postgresql_database="desdb";
my $postgresql_user="";
my $postgresql_password="";
my $postgresql_host="localhost";

my $dbh = DBI->connect("DBI:Pg:dbname=$postgresql_database; host=$postgresql_host", "$postgresql_user", "$postgresql_password", {AutoCommit => 0,pg_enable_utf8 => 1});

my $pginsertquery = "insert into desttable (fields) VALUES (?)";
my $pginsert = $dbh->prepare($pginsertquery);

$src->execute();
my ($col0);
$src->bind_columns(undef, \$col0 );

### Loop through results and insert in to psql
while($src->fetch())
{
  print $col0;
  $pginsert->execute($col0);
}

$dbh->commit;
$dbh->disconnect or warn "Disconnection failed: $DBI::errstr\n";
$mariadbh->disconnect or warn "Disconnection failed: $DBI::errstr\n";

Test sqldump:

--
-- PostgreSQL database dump
--

-- Dumped from database version 9.6.1
-- Dumped by pg_dump version 9.6.1

SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET row_security = off;

SET search_path = public, pg_catalog;

SET default_tablespace = '';

SET default_with_oids = false;

--
-- Name: test; Type: TABLE; Schema: public; Owner: testuser
--

CREATE TABLE test (
    testdata text
);


ALTER TABLE test OWNER TO testuser;

--
-- Name: test id; Type: DEFAULT; Schema: public; Owner: testuser
--

ALTER TABLE ONLY test ALTER COLUMN id SET DEFAULT nextval('test_id_seq'::regclass);


--
-- Data for Name: test; Type: TABLE DATA; Schema: public; Owner: testuser
--

COPY test (testdata) FROM stdin;
Ð¿Ñ<80>Ð¸Ð²ÐµÑ<82>Ñ<81>Ñ<82>Ð²Ñ<83>Ñ<8e>
\.


--
-- PostgreSQL database dump complete
--

Best Answer

\u0087 (END OF SELECTED AREA) and \u0081 are C1 control characters in the Latin-1 Supplement block. Displaying them as the escapes \u0087\u0081 is in fact the convention for unprintable characters, and it seems to be what your terminal is doing.
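A quick way to check the claim about those two code points, as a minimal Python sketch (Python is used here only because it makes the check short; nothing in it is specific to psql). It also shows why these particular values tend to turn up: 0x81 and 0x87 are ordinary trailing bytes in UTF-8 encoded Cyrillic, so they appear as stray control characters whenever such bytes are misread one byte at a time.

```python
import unicodedata

# The two code points from the question.
suspects = ("\u0081", "\u0087")

# Unicode general category "Cc" means "control character"; such
# characters have no glyph, so clients show them as escapes.
categories = {f"U+{ord(ch):04X}": unicodedata.category(ch) for ch in suspects}
printable = [ch.isprintable() for ch in suspects]
print(categories, printable)

# The same byte values occur as UTF-8 trailing bytes of Cyrillic letters:
# U+0441 (Cyrillic "es") and U+0447 (Cyrillic "che").
es_bytes = "\u0441".encode("utf-8")
che_bytes = "\u0447".encode("utf-8")
print(es_bytes.hex(), che_bytes.hex())
```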

Can you show us the Perl script that inserts the data into the database? My assumption is that Perl is inserting non-UTF-8 data, because working with UTF-8 in Perl is a PITA. PostgreSQL then assumes the data is UTF-8, and the terminal has no idea how to display it, because either

it's not really UTF-8, just marked as UTF-8, or
you have legitimate control characters encoded in your UTF-8.

Be careful if you used utf8::upgrade or the like, which sets the UTF8 flag on the scalar without decoding the bytes from their original encoding.
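To make the "marked as UTF-8 but not really UTF-8" case concrete, here is a hedged Python sketch of one way it can happen: correct UTF-8 bytes get misread as Latin-1 and are then encoded to UTF-8 a second time. The sample word is an assumption on my part, chosen because it reproduces the COPY line in the test dump from the question; whether this is what actually happened in the Perl script is not confirmed.

```python
original = "приветствую"

# Step 1: correct UTF-8 bytes, as the source database would return them.
utf8_bytes = original.encode("utf-8")

# Step 2 (the mistake): every byte is treated as one Latin-1 character.
misread = utf8_bytes.decode("latin-1")

# Step 3: that string is stored in a UTF8 database, i.e. encoded again.
stored = misread.encode("utf-8")

# Repair: undo step 3, then undo step 2.
repaired = stored.decode("utf-8").encode("latin-1").decode("utf-8")

print(len(utf8_bytes), len(stored))  # byte counts before and after
print(repr(misread[:4]))             # control characters show up as escapes
print(repaired == original)
```

If the round trip in the last step restores readable text from your real rows, the stored data was double-encoded and the fix belongs in the loading script, not in the psql or locale settings.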

Advice

Update DBD::Pg, DBD::mysql

First, update both DBD::Pg and DBD::mysql:

cpan DBD::Pg DBD::mysql

Bug in DBD::mysql

Reviewing the change logs, DBD::mysql has long had a Unicode bug. You may even want to install the development version (DBD-mysql-4.041_01), whose change log entry reads:

2016-12-12 Patrick Galbraith, Michiel Beijen, DBI/DBD community (4.041_1)
* Unicode fixes: when using mysql_enable_utf8 or mysql_enable_utf8mb4,
  previous versions of DBD::mysql did not properly encode input statements
  to UTF-8 and retrieved columns were always UTF-8 decoded regardless of the
  column charset.
  Fix by Pali Rohár.
  Reported and feedback on fix by Marc Lehmann
  (https://rt.cpan.org/Public/Bug/Display.html?id=87428)
  Also, the UTF-8 flag was not set for decoded data:
  (https://rt.cpan.org/Public/Bug/Display.html?id=53130)

Who knows what the hell they're doing there. Unicode is complex and it's not surprising that MySQL developers and users don't understand it.

Unsetting assumptions

I would unset pg_enable_utf8 and mysql_enable_utf8. Pretend these options don't exist. You're force-setting a utf8 flag on the internal variable, and you probably don't want to do that. You clearly can't ensure everything is utf8, or it'd be working properly. Time to remove assumptions, not create more of them.

We need a sample case. You claim foreign characters are displaying as Unicode escape sequences, but you only present control characters (which can't be displayed). Perhaps you can simplify this whole thing: dump the table with pg_dump -t <Table>, or copy the problematic rows into a temp table and dump that. Then we can review that data and see what we've got.
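When reviewing dumped rows, a rough check like the following can help tell double-encoded text from legitimate data. This is only a heuristic sketch in Python; the function name looks_double_encoded is made up for illustration, and a short Latin-1 string can occasionally pass the test by coincidence.

```python
def looks_double_encoded(text: str) -> bool:
    """Heuristic: True if the text reads as UTF-8 bytes misdecoded as Latin-1."""
    if text.isascii():
        # Pure ASCII is identical in both encodings; nothing to detect.
        return False
    try:
        # Map each character back to a single byte, then try those
        # bytes as UTF-8. Both steps must succeed.
        text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return True


suspect = "Ð¿Ñ\x80Ð¸Ð²ÐµÑ\x82"  # start of the COPY line in the test dump
print(looks_double_encoded(suspect))
print(looks_double_encoded("приветствую"))
print(looks_double_encoded("café"))
```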

Related Solutions

Postgresql – Confused over encoding/locale in postgresql

The encoding defines the very basic rules for how characters are represented in binary format (as @a_horse explains in his comment). For successful communication, the server has to know which encoding the client uses. If it differs from the server encoding, Postgres can translate between the two; the dedicated setting client_encoding controls this.
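As a small illustration of "represented in binary format", this Python sketch encodes the same character in two encodings and shows what goes wrong when bytes in one encoding are declared as the other. It does not talk to a database; it only mimics the mismatch that a wrong client_encoding would cause.

```python
text = "é"

# One character, two different byte representations.
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")
print(as_utf8.hex(), as_latin1.hex())

# A client that sends Latin-1 bytes while declaring UTF8 is sending
# input that is not valid in the declared encoding.
try:
    as_latin1.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False
print(valid_as_utf8)
```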

The locale is a superset of settings, which can be split up for PostgreSQL into

LC_COLLATE
LC_CTYPE
LC_MESSAGES
LC_MONETARY
LC_NUMERIC
LC_TIME

The settings of particular interest for you are LC_COLLATE (defines how strings are sorted) and LC_CTYPE (defines character classification, e.g. which characters count as letters and what their upper-case equivalents are).
In older versions, these two settings could not be changed after a database had been initialized. Since Postgres 9.1 you can at least override the collation setting when needed.
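To see why the collation setting matters, compare two orderings of the same words in this Python sketch. Python's default string sort compares code points, which is roughly how a byte-wise "C" collation behaves; sorting with str.casefold is only a stand-in for a linguistic collation such as en_US.UTF-8, not the actual algorithm Postgres or the operating system uses.

```python
words = ["apple", "Banana", "cherry", "Zebra"]

# Code point order: all upper-case ASCII letters sort before lower-case ones.
by_code_point = sorted(words)

# Case-insensitive order, closer to what a language-aware collation produces.
case_insensitive = sorted(words, key=str.casefold)

print(by_code_point)
print(case_insensitive)
```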

PostgreSQL – How to Fix Encoding Conversion Error

I know that I should migrate my database to UTF-8 to solve this problem, but for some reason I cannot do that at the moment.

In my case, I'd rather PostgreSQL saves my string removing characters it can not convert or for example replacing them with some symbol like "?" rather than throwing an error...

PostgreSQL does not support this. It's requested periodically, but nobody who requests it does the work to actually implement it in the system and convince the dev team it's an appropriate option to offer.

You will need to do your text-mangling client-side. In PHP, before you send the text to PostgreSQL, you will need to filter out characters that don't match the database encoding. How to do that is entirely PHP-specific (start with iconv support, probably). You have already described one way to do this, using utf8_decode.

Using utf8_decode is actually incorrect, because the function (per the docs) actually assumes the input is ISO-8859-1, i.e. Latin-1. You're using latin-9, i.e. ISO-8859-15. So it'll mangle some of your input characters, in particular the Euro sign. See changes from ISO-8859-1. Instead, use the iconv function. See the surprisingly useful comments on the utf8_decode function documentation.
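The Latin-1 versus Latin-9 difference is easy to demonstrate. The sketch below uses Python codecs rather than PHP, purely to show the byte-level effect: the Euro sign exists in ISO-8859-15 at byte 0xA4, but that same byte means the generic currency sign in ISO-8859-1, and ISO-8859-1 has no Euro sign at all.

```python
# The Euro sign in Latin-9 (ISO-8859-15) is the single byte 0xA4.
euro_latin9 = "€".encode("iso8859-15")

# Reading that byte as Latin-1 (ISO-8859-1) yields a different character.
misread = euro_latin9.decode("latin-1")

# Latin-1 cannot represent the Euro sign at all.
try:
    "€".encode("latin-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False

print(euro_latin9.hex(), misread, euro_in_latin1)
```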

If in the process of filtering the text you convert it to LATIN9 inside PHP, remember that you must set your client_encoding to latin9, since that's the encoding of the text you'll be sending to PostgreSQL. That means the results will be in latin-9 too, so you must convert all results from PostgreSQL from latin-9 back to PHP's native utf-8.

If you use utf8_encode to convert your latin-9 output from PostgreSQL for consumption in PHP, you'll have the same problem with latin-1 vs latin-9 as you do on utf8_decode.

For that reason, if possible, try to use a filter that replaces characters not supported in latin-9 without actually converting the string to latin-9. It'll save you a bunch of hassle if you can keep client_encoding set to utf-8 and just mangle your strings instead of converting them.
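One possible shape for such a filter, sketched in Python rather than PHP (the function name replace_unsupported is hypothetical): round-trip the text through the target encoding with replacement enabled, so the result is still an ordinary Unicode string but contains only characters that Latin-9 can hold. The same idea should carry over to other languages, though the exact calls will differ.

```python
def replace_unsupported(text: str, encoding: str = "iso8859-15") -> str:
    """Replace every character the target encoding cannot represent with '?'.

    The result is still a normal Unicode string, so it can be sent with
    client_encoding left at UTF-8.
    """
    return text.encode(encoding, errors="replace").decode(encoding)


cleaned = replace_unsupported("5 € for 日本語")
print(cleaned)
```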

All this said, I strongly recommend upgrading the database to utf-8 instead. The only reason to keep it in latin-9 would be if you have other client applications that can't cope with chars outside the latin-9 range (i.e. they rely on a latin-9 client_encoding).