PostgreSQL – CSV (UTF-8) to Database Doesn’t Show äöå Characters

Tags: postgresql, python, utf-8

I have a CSV file that is UTF-8 encoded. I'm using Python (psycopg2) to copy it into a Postgres database. My Postgres database uses UTF-8. However, I have problems getting the Ä, Ö and Å characters to show.

If I run the query SET client_encoding = 'ISO-8859-1';, it fixes the issue.
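(For reference, psycopg2 exposes the session's client encoding directly, so it can be checked and changed from Python too. A minimal sketch; the connection string is just a placeholder:)

import psycopg2

conn = psycopg2.connect("dbname=data user=postgres")  # placeholder credentials

# psycopg2 tracks the session's client_encoding:
print(conn.encoding)                 # e.g. 'UTF8'

# Same effect as running SET client_encoding = ... in psql;
# 'LATIN1' is Postgres' name for ISO-8859-1:
conn.set_client_encoding('LATIN1')
print(conn.encoding)                 # 'LATIN1'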

My database looks like this (psql -l):

   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
-----------+----------+----------+-------------+-------------+---------------------
 data      | postgres | UTF8     | fi_FI.UTF-8 | fi_FI.UTF-8 |

My server locale is fi_FI.utf8.

My CSV file is UTF-8. It's made with this Python code:

import csv

with open('users.csv', "w+", encoding="UTF-8", newline='') as f:
    writer = csv.writer(f)
    writer.writerows(list_table)   # list_table holds the generated rows
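(To rule out the file itself, here is a minimal sketch that reads the raw bytes back and checks that they really decode as UTF-8; same file name as above:)

with open('users.csv', 'rb') as f:   # binary mode: no decoding applied
    raw = f.read()

try:
    raw.decode('utf-8')
    print('users.csv is valid UTF-8')
except UnicodeDecodeError as err:
    print('users.csv is NOT valid UTF-8:', err)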

I'm writing it to the Postgres table with this code:

cur = conn.cursor()

cur.execute("CREATE SCHEMA IF NOT EXISTS data;")

cur.execute("DROP TABLE IF EXISTS data.users")

cur.execute("CREATE TABLE data.users(name VARCHAR(40), lastname VARCHAR(40), phone VARCHAR(13), email VARCHAR(100), iban VARCHAR(18), id VARCHAR(11))")

copy_sql = """
           COPY data.users FROM stdin WITH CSV HEADER
           DELIMITER as ',' ENCODING 'UTF-8'
           """
with open('users.csv', 'r') as f:
    cur.copy_expert(sql=copy_sql, file=f)
    conn.commit()
    cur.close()

First I used this code:

with open('users.csv', 'r') as f:
    next(f)
    cur.copy_from(f, 'data.users', sep=',')
    cur.close()

I switched to copy_expert to get that ENCODING 'UTF-8' into the COPY statement. However, it didn't change anything.
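(One thing worth knowing here: when copy_expert reads from a file opened in text mode without an encoding argument, Python first decodes it with the platform default encoding, and psycopg2 then re-encodes the text to the connection encoding, so the ENCODING 'UTF-8' option never sees the original bytes. Opening the file in binary mode passes the bytes through untouched. A sketch, reusing cur and conn from above:)

copy_sql = """
           COPY data.users FROM stdin WITH CSV HEADER
           DELIMITER as ',' ENCODING 'UTF-8'
           """
with open('users.csv', 'rb') as f:   # binary mode: raw bytes go to COPY
    cur.copy_expert(sql=copy_sql, file=f)
conn.commit()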

Now when I use pgAdmin 4 or the console (Linux Ubuntu 18.04 LTS terminal) I can't get the ÄÖÅ characters to show up. Mökkisuo comes out as "MÃ¶kkisuo" and Lammasjärvi as "LammasjÃ¤rvi". I'm using a basic query like select * from data.users to check the results.

However, if I change my client encoding to ISO-8859-1 (SET client_encoding = 'ISO-8859-1';), everything shows up like it should.

EDIT:

Here are my locales:

~$ locale
LANG=fi_FI.utf8
LANGUAGE=
LC_CTYPE="fi_FI.utf8"
LC_NUMERIC="fi_FI.utf8"
LC_TIME="fi_FI.utf8"
LC_COLLATE="fi_FI.utf8"
LC_MONETARY="fi_FI.utf8"
LC_MESSAGES="fi_FI.utf8"
LC_PAPER="fi_FI.utf8"
LC_NAME="fi_FI.utf8"
LC_ADDRESS="fi_FI.utf8"
LC_TELEPHONE="fi_FI.utf8"
LC_MEASUREMENT="fi_FI.utf8"
LC_IDENTIFICATION="fi_FI.utf8"
LC_ALL=

~$ locale charmap
UTF-8

And here is how I get the bad result:

:~$ sudo su - postgres
postgres@Server:~$ psql
postgres=# \c data
data=# select * from data.users;

The result looks like this:

 Piritta      | MÃ¶kkisuo       | 0    | piritta.mokkisuo@notreal.com

Oh, and the data is just generated; it's not real people's data.

What do I have to change to get this to show correctly in UTF-8? I don't understand where that conversion to ISO-8859-1 is coming from.

I don't understand why a file that is UTF-8, a database that is UTF-8, a locale that is UTF-8 and so on has to be switched to Latin-1 (client encoding) to show up correctly. What is causing this? Everything should already be UTF-8, so why ISO-8859-1? Why doesn't UTF-8 show these characters?
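(One way to test the usual cause of "Latin-1 fixes it" symptoms, namely double encoding: if each character was encoded twice, converting the stored text back to Latin-1 bytes and reinterpreting those bytes as UTF-8 should recover the original. A diagnostic sketch reusing the cursor from above; convert_to and convert_from are built-in Postgres functions:)

cur.execute("""
    SELECT lastname,
           convert_from(convert_to(lastname, 'LATIN1'), 'UTF8')
    FROM data.users
    LIMIT 5;
""")
for stored, repaired in cur.fetchall():
    print(stored, '->', repaired)    # e.g. MÃ¶kkisuo -> Mökkisuo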

Best Answer

I solved my own problem. Everything was right on the server side and the problem was a few little things. My first mistake was that I had made two text files that I used in my code to generate the users. These two text files were ANSI encoded. I had made them with Notepad in Windows and didn't think much of it.

Even though Visual Studio Code showed these ANSI-encoded text files correctly, wrote them into the UTF-8 CSV file correctly, and even Notepad++ showed the characters in that CSV file as properly UTF-8 encoded, when I pushed them to the database they all broke. I'm not sure why they showed up correctly as UTF-8 in Notepad++; is there some weird translation correction going on?

I fixed this by converting those ANSI text files to UTF-8 with Notepad++.

However, my second mistake was that I didn't open those files as UTF-8! So even after I had converted them to UTF-8, my Python code (run from Visual Studio Code) still read them as Latin-1 (ISO-8859-1). This time around they were broken before I even pushed them to the database. It took me a while to notice, because the end result was the same as with the ANSI-encoded files.
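(In other words, this looks like a classic double encoding. A minimal sketch of the round trip, assuming the Windows default codec was cp1252, which is what "ANSI" usually means:)

utf8_bytes = 'ö'.encode('utf-8')         # b'\xc3\xb6' as written to the CSV

mangled = utf8_bytes.decode('cp1252')    # 'Ã¶': the UTF-8 bytes read as ANSI
print('M' + mangled + 'kkisuo')          # MÃ¶kkisuo

# psycopg2 then re-encodes the mangled string as UTF-8, so the database
# stores it double-encoded. Converting back through Latin-1 peels off
# one layer, which is why SET client_encoding = 'ISO-8859-1' appeared
# to fix the display:
print(mangled.encode('latin-1').decode('utf-8'))   # ö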

I corrected this by adding encoding="utf-8" to all my open statements, for example: with open('users.csv', 'r', encoding="utf-8") as f:. I wanted to be sure the files would be opened as UTF-8 and not in some weird platform-default format.
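(For what it's worth, that "weird format" is Python's locale-dependent default: on the Python versions in question, open() without an encoding argument uses locale.getpreferredencoding(False), which is typically cp1252 on Windows and UTF-8 on Linux. A quick check:)

import locale

print(locale.getpreferredencoding(False))   # e.g. 'cp1252' on Windows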

This finally solved my issue! This was a really specific problem, so I'm not sure if it will help anyone else, but I wanted to post the solution here just in case. I hate those posts that just say the problem is now solved and never tell how.