PostgreSQL – CSV (UTF-8) to Database Doesn’t Show äöå Characters

Tags: postgresql, python, utf-8

I have a CSV file that is UTF-8 encoded. I'm using Python (psycopg2) to copy it into a Postgres database. My Postgres database uses UTF-8. However, I have problems getting the Ä, Ö and Å characters to show.

If I run the query SET client_encoding = 'ISO-8859-1';, it fixes the issue.
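(For reference, psycopg2 exposes the session's client encoding directly, so it can be checked and changed from Python too. A minimal sketch; the connection string is just a placeholder:)

import psycopg2

conn = psycopg2.connect("dbname=data user=postgres")  # placeholder credentials

# psycopg2 tracks the session's client_encoding:
print(conn.encoding)                 # e.g. 'UTF8'

# Same effect as running SET client_encoding = ... in psql;
# 'LATIN1' is Postgres' name for ISO-8859-1:
conn.set_client_encoding('LATIN1')
print(conn.encoding)                 # 'LATIN1'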

My database looks like this (psql -l):

   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
-----------+----------+----------+-------------+-------------+---------------------
 data      | postgres | UTF8     | fi_FI.UTF-8 | fi_FI.UTF-8 |

My server locale is fi_FI.utf8.

My CSV file is UTF-8. It's made with this Python code:

import csv

with open('users.csv', "w+", encoding="UTF-8", newline='') as f:
    writer = csv.writer(f)
    writer.writerows(list_table)   # list_table holds the generated rows
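(To rule out the file itself, here is a minimal sketch that reads the raw bytes back and checks that they really decode as UTF-8; same file name as above:)

with open('users.csv', 'rb') as f:   # binary mode: no decoding applied
    raw = f.read()

try:
    raw.decode('utf-8')
    print('users.csv is valid UTF-8')
except UnicodeDecodeError as err:
    print('users.csv is NOT valid UTF-8:', err)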

I'm writing it to the Postgres table with this code:

cur = conn.cursor()

cur.execute("CREATE SCHEMA IF NOT EXISTS data;")

cur.execute("DROP TABLE IF EXISTS data.users")

cur.execute("CREATE TABLE data.users(name VARCHAR(40), lastname VARCHAR(40), phone VARCHAR(13), email VARCHAR(100), iban VARCHAR(18), id VARCHAR(11))")

copy_sql = """
           COPY data.users FROM stdin WITH CSV HEADER
           DELIMITER as ',' ENCODING 'UTF-8'
           """
with open('users.csv', 'r') as f:
    cur.copy_expert(sql=copy_sql, file=f)
    conn.commit()
    cur.close()

First I used this code:

with open('users.csv', 'r') as f:
    next(f)
    cur.copy_from(f, 'data.users', sep=',')
    cur.close()

I switched to copy_expert to get that ENCODING 'UTF-8' into the COPY statement. However, it didn't change anything.
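(One thing worth knowing here: when copy_expert reads from a file opened in text mode without an encoding argument, Python first decodes it with the platform default encoding, and psycopg2 then re-encodes the text to the connection encoding, so the ENCODING 'UTF-8' option never sees the original bytes. Opening the file in binary mode passes the bytes through untouched. A sketch, reusing cur and conn from above:)

copy_sql = """
           COPY data.users FROM stdin WITH CSV HEADER
           DELIMITER as ',' ENCODING 'UTF-8'
           """
with open('users.csv', 'rb') as f:   # binary mode: raw bytes go to COPY
    cur.copy_expert(sql=copy_sql, file=f)
conn.commit()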

Now when I use pgAdmin 4 or the console (Linux Ubuntu 18.04 LTS terminal) I can't get the ÄÖÅ characters to show up. Mökkisuo comes out as "MÃ¶kkisuo" and Lammasjärvi as "LammasjÃ¤rvi". I'm using a basic query like select * from data.users to check the results.

However, if I change my client encoding to ISO-8859-1 (SET client_encoding = 'ISO-8859-1';), everything shows up like it should.

EDIT:

Here are my locales:

~$ locale
LANG=fi_FI.utf8
LANGUAGE=
LC_CTYPE="fi_FI.utf8"
LC_NUMERIC="fi_FI.utf8"
LC_TIME="fi_FI.utf8"
LC_COLLATE="fi_FI.utf8"
LC_MONETARY="fi_FI.utf8"
LC_MESSAGES="fi_FI.utf8"
LC_PAPER="fi_FI.utf8"
LC_NAME="fi_FI.utf8"
LC_ADDRESS="fi_FI.utf8"
LC_TELEPHONE="fi_FI.utf8"
LC_MEASUREMENT="fi_FI.utf8"
LC_IDENTIFICATION="fi_FI.utf8"
LC_ALL=

~$ locale charmap
UTF-8

And here is how I get the bad result:

:~$ sudo su - postgres
postgres@Server:~$ psql
postgres=# \c data
data=# select * from data.users;

The result looks like this:

 Piritta      | MÃ¶kkisuo       | 0    | piritta.mokkisuo@notreal.com

Oh, and the data is just generated; it's not real people's data.

What do I have to change to get this to show correctly in UTF-8? I don't understand where that conversion to ISO-8859-1 is coming from.

I don't understand why a file that is UTF-8, a database that is UTF-8, a locale that is UTF-8 and so on has to be switched to Latin-1 (client encoding) to show up correctly. What is causing this? Everything should already be UTF-8, so why ISO-8859-1? Why doesn't UTF-8 show these characters?
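(One way to test the usual cause of "Latin-1 fixes it" symptoms, namely double encoding: if each character was encoded twice, converting the stored text back to Latin-1 bytes and reinterpreting those bytes as UTF-8 should recover the original. A diagnostic sketch reusing the cursor from above; convert_to and convert_from are built-in Postgres functions:)

cur.execute("""
    SELECT lastname,
           convert_from(convert_to(lastname, 'LATIN1'), 'UTF8')
    FROM data.users
    LIMIT 5;
""")
for stored, repaired in cur.fetchall():
    print(stored, '->', repaired)    # e.g. MÃ¶kkisuo -> Mökkisuo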

Best Answer

I solved my own problem. Everything was right on the server side and the problem was a few little things. My first mistake was that I had made two text files that I used in my code to generate the users. These two text files were ANSI encoded. I had made them with Notepad in Windows and didn't think much of it.

Even though Visual Studio Code showed these ANSI-encoded text files correctly, wrote them into the UTF-8 CSV file correctly, and even Notepad++ showed the characters in that CSV file as properly UTF-8 encoded, when I pushed them to the database they all broke. I'm not sure why they showed up correctly as UTF-8 in Notepad++; is there some weird translation correction going on?

I fixed this by converting those ANSI text files to UTF-8 with Notepad++.

However, my second mistake was that I didn't open those files as UTF-8! So even after I had converted them to UTF-8, my Python code (run from Visual Studio Code) still read them as Latin-1 (ISO-8859-1). This time around they were broken before I even pushed them to the database. It took me a while to notice, because the end result was the same as with the ANSI-encoded files.
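(In other words, this looks like a classic double encoding. A minimal sketch of the round trip, assuming the Windows default codec was cp1252, which is what "ANSI" usually means:)

utf8_bytes = 'ö'.encode('utf-8')         # b'\xc3\xb6' as written to the CSV

mangled = utf8_bytes.decode('cp1252')    # 'Ã¶': the UTF-8 bytes read as ANSI
print('M' + mangled + 'kkisuo')          # MÃ¶kkisuo

# psycopg2 then re-encodes the mangled string as UTF-8, so the database
# stores it double-encoded. Converting back through Latin-1 peels off
# one layer, which is why SET client_encoding = 'ISO-8859-1' appeared
# to fix the display:
print(mangled.encode('latin-1').decode('utf-8'))   # ö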

I corrected this by adding encoding="utf-8" to all my open statements, for example: with open('users.csv', 'r', encoding="utf-8") as f:. I wanted to be sure the files would be opened as UTF-8 and not in some weird platform-default format.
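(For what it's worth, that "weird format" is Python's locale-dependent default: on the Python versions in question, open() without an encoding argument uses locale.getpreferredencoding(False), which is typically cp1252 on Windows and UTF-8 on Linux. A quick check:)

import locale

print(locale.getpreferredencoding(False))   # e.g. 'cp1252' on Windows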

This finally solved my issue! This was a really specific problem, so I'm not sure if it will help anyone else, but I wanted to post the solution here just in case. I hate those posts that just say the problem is now solved and never tell how.