I have UTF8 data stored in Postgres. I can query it with a Perl script and display it properly. When I use the Postgres client, it displays like the encoding is off. I'll see characters displayed like \u0087\u0081.
From inside the database:
=> \encoding
UTF8
In the shell (I manually set LANGUAGE as I read elsewhere that might be a fix.):
$locale
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
This is in Ubuntu 16.04.
What setting have I missed?
Edited to add code populating psql:
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
### Source DB setup
my $maria_database="srcdb";
my $maria_user = "";
my $maria_password = "";
my $maria_host= "localhost";
my $mariadbh = DBI->connect("dbi:mysql:database=$maria_database;host=$maria_host","$maria_user","$maria_password",{AutoCommit=>1,RaiseError=>1,PrintError=>0,mysql_enable_utf8=>1});
my $mysrcquery = "SELECT fields from sourcetable limit 2;";
my $src = $mariadbh->prepare($mysrcquery);
### Destination DB setup
my $postgresql_database="desdb";
my $postgresql_user="";
my $postgresql_password="";
my $postgresql_host="localhost";
my $dbh = DBI->connect("DBI:Pg:dbname=$postgresql_database; host=$postgresql_host", "$postgresql_user", "$postgresql_password", {AutoCommit => 0,pg_enable_utf8 => 1});
my $pginsertquery = "insert into desttable (fields) VALUES (?)";
my $pginsert = $dbh->prepare($pginsertquery);
$src->execute();
my ($col0);
$src->bind_columns(undef, \$col0 );
### Loop through results and insert in to psql
while($src->fetch())
{
print $col0;
$pginsert->execute($col0);
}
$dbh->commit;
$dbh-> disconnect or warn "Disconnection failed: DBI::errstr\n";
$mariadbh-> disconnect or warn "Disconnection failed: DBI::errstr\n";
Test sqldump:
--
-- PostgreSQL database dump
--
-- Dumped from database version 9.6.1
-- Dumped by pg_dump version 9.6.1
SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET row_security = off;
SET search_path = public, pg_catalog;
SET default_tablespace = '';
SET default_with_oids = false;
--
-- Name: test; Type: TABLE; Schema: public; Owner: testuser
--
CREATE TABLE test (
testdata text
);
ALTER TABLE test OWNER TO testuser;
--
-- Name: test id; Type: DEFAULT; Schema: public; Owner: testuser
--
ALTER TABLE ONLY test ALTER COLUMN id SET DEFAULT nextval('test_id_seq'::regclass);
--
-- Data for Name: test; Type: TABLE DATA; Schema: public; Owner: testuser
--
COPY test (testdata) FROM stdin;
пÑ<80>ивеÑ<82>Ñ<81>Ñ<82>вÑ<83>Ñ<8e>
\.
--
-- PostgreSQL database dump complete
--
Best Answer
\u0087
(END OF SELECTED AREA) and\u0081
are control characters in the LATIN-1 block. Displaying them as\u0087\u0081
is in fact the convention, and it seems to be what your terminal is doing.Can you show us the Perl script to insert the data into the database? My assumption is that Perl is inserting non-utf8 data because working with utf8 in perl is a PITA. That PostgreSQL assumes it's utf8 and that the terminal has no idea of how to display it, becuase it's
Be careful if you used
utf8::upgrade
or the like which sets the flag without decoding from the original format.Advice
Update DBD::Pg, DBD::mysql
First update both DBD::Pg, and DBD::mysql
Bug in DBD::mysql
Reviewing change logs, DBD::mysql has always had a bug, you may even want to install the development version (DBD-mysql-4.041_01)..
Who knows what the hell they're doing there. Unicode is complex and it's not surprising that MySQL developers and users don't understand it.
Unsetting assumptions
I would unset
pg_enable_utf8
andmysql_enable_utf8
. Pretend these options exist. You're force-setting a utf8 flag on the internal variable. You probably don't want to do that. You clearly can't ensure everything is utf8, or it'd be working properly. Time to remove assumptions. Not create more assumptions.We need a sample case. You claim foreign characters are displaying as Unicode escape sequences, but you only present control-characters (which can't be displayed). Perhaps you can simply this whole thing, and dump the table
pg_dump -t <Table>
or copy the problematic rows into a temp table and dump that. Then we can review that data and see what we've got.