Postgresql – Character Encoding in Postgres

encodingperlpostgresql

I have UTF8 data stored in Postgres. I can query it with a Perl script and display it properly. When I use the Postgres client, it displays like the encoding is off. I'll see characters displayed like \u0087\u0081.

From inside the database:

=> \encoding
UTF8

In the shell (I manually set LANGUAGE as I read elsewhere that might be a fix.):

$locale
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

This is in Ubuntu 16.04.

What setting have I missed?

Edited to add code populating psql:

#!/usr/bin/perl

use strict;
use warnings;
use DBI;

### Source DB setup
my $maria_database="srcdb";
my $maria_user = "";
my $maria_password = "";
my $maria_host= "localhost";

my $mariadbh = DBI->connect("dbi:mysql:database=$maria_database;host=$maria_host","$maria_user","$maria_password",{AutoCommit=>1,RaiseError=>1,PrintError=>0,mysql_enable_utf8=>1});

my $mysrcquery = "SELECT fields from sourcetable limit 2;";
my $src = $mariadbh->prepare($mysrcquery);

### Destination DB setup
my $postgresql_database="desdb";
my $postgresql_user="";
my $postgresql_password="";
my $postgresql_host="localhost";

my $dbh = DBI->connect("DBI:Pg:dbname=$postgresql_database; host=$postgresql_host", "$postgresql_user", "$postgresql_password", {AutoCommit => 0,pg_enable_utf8 => 1});

my $pginsertquery = "insert into desttable (fields) VALUES (?)";
my $pginsert = $dbh->prepare($pginsertquery);

$src->execute();
my ($col0);
$src->bind_columns(undef, \$col0 );

### Loop through results and insert in to psql
while($src->fetch())
{
  print $col0;
  $pginsert->execute($col0);
}

$dbh->commit;
$dbh-> disconnect or warn "Disconnection failed: DBI::errstr\n";
$mariadbh-> disconnect or warn "Disconnection failed: DBI::errstr\n";

Test sqldump:

--
-- PostgreSQL database dump
--

-- Dumped from database version 9.6.1
-- Dumped by pg_dump version 9.6.1

SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET row_security = off;

SET search_path = public, pg_catalog;

SET default_tablespace = '';

SET default_with_oids = false;

--
-- Name: test; Type: TABLE; Schema: public; Owner: testuser
--

CREATE TABLE test (
    testdata text
);


ALTER TABLE test OWNER TO testuser;

--
-- Name: test id; Type: DEFAULT; Schema: public; Owner: testuser
--

ALTER TABLE ONLY test ALTER COLUMN id SET DEFAULT nextval('test_id_seq'::regclass);


--
-- Data for Name: test; Type: TABLE DATA; Schema: public; Owner: testuser
--

COPY test (testdata) FROM stdin;
пÑ<80>ивеÑ<82>Ñ<81>Ñ<82>вÑ<83>Ñ<8e>
\.


--
-- PostgreSQL database dump complete
--

Best Answer

\u0087 (END OF SELECTED AREA) and \u0081 are control characters in the LATIN-1 block. Displaying them as \u0087\u0081 is in fact the convention, and it seems to be what your terminal is doing.

Can you show us the Perl script to insert the data into the database? My assumption is that Perl is inserting non-utf8 data because working with utf8 in perl is a PITA. That PostgreSQL assumes it's utf8 and that the terminal has no idea of how to display it, becuase it's

  • not really utf8, just marked up as utf8.
  • or, you have legitimate control characters encoded in your utf8.

Be careful if you used utf8::upgrade or the like which sets the flag without decoding from the original format.

Advice

Update DBD::Pg, DBD::mysql

First update both DBD::Pg, and DBD::mysql

cpan DBD::Pg DBD::mysql 

Bug in DBD::mysql

Reviewing change logs, DBD::mysql has always had a bug, you may even want to install the development version (DBD-mysql-4.041_01)..

2016-12-12 Patrick Galbraith, Michiel Beijen, DBI/DBD community (4.041_1)
* Unicode fixes: when using mysql_enable_utf8 or mysql_enable_utf8mb4,
  previous versions of DBD::mysql did not properly encode input statements
  to UTF-8 and retrieved columns were always UTF-8 decoded regardless of the
  column charset.
  Fix by Pali Rohár.
  Reported and feedback on fix by Marc Lehmann
  (https://rt.cpan.org/Public/Bug/Display.html?id=87428)
  Also, the UTF-8 flag was not set for decoded data:
  (https://rt.cpan.org/Public/Bug/Display.html?id=53130)

Who knows what the hell they're doing there. Unicode is complex and it's not surprising that MySQL developers and users don't understand it.

Unsetting assumptions

I would unset pg_enable_utf8 and mysql_enable_utf8. Pretend these options exist. You're force-setting a utf8 flag on the internal variable. You probably don't want to do that. You clearly can't ensure everything is utf8, or it'd be working properly. Time to remove assumptions. Not create more assumptions.

We need a sample case. You claim foreign characters are displaying as Unicode escape sequences, but you only present control-characters (which can't be displayed). Perhaps you can simply this whole thing, and dump the table pg_dump -t <Table> or copy the problematic rows into a temp table and dump that. Then we can review that data and see what we've got.