PostgreSQL CSV Import – How to Split CSV Bulk Import into Smaller Blocks Using Psycopg2

Tags: copy, postgresql, python

I have a number of CSV files that I would like to import into my Postgres database. Since each file is quite large (200 million to 2 billion records per file), I believe it would be better to add each file in smaller batches.

For comparison, when using MS SQL Server with bcp, files automatically get split into 1,000-record batches. That is exactly what I (think I) want to do with COPY.

My main script is Python 3.5.2 and the database is PostgreSQL 9.5.10 on Ubuntu 16.04 LTS. I am using psycopg2's copy functionality, simply because it was the first approach that worked. I'm happy to change that, as long as I can call it from Python, since I need to do some other processing per file (organizing files, searching, determining the target staging table, unzipping, cleanup, etc.). I am not sure how exactly the data transfer works with this method. Per psycopg2's docs, the SQL version of COPY is used (https://www.postgresql.org/docs/current/static/sql-copy.html), not the command-line version. I suspect the entire file is streamed through STDIN, but again, I'm not sure.

import psycopg2

con = psycopg2.connect(database=dbname, user=user, password=password, host=host, port=port)
cur = con.cursor()

# Stream the whole CSV file into the staging table via COPY ... FROM STDIN
f = open(pathTempFile)
cur.copy_expert("""copy public.stage_datafile from STDIN DELIMITER ',' CSV HEADER""",
    file=f)
f.close()

con.commit()
con.close()

This process works fine, and I'm not even sure there are any drawbacks. However, it seems there might be a better or more robust way? I will use a trigger to export the data from the staging tables into the normalized main tables; surely that would benefit from smaller data blocks?

I appreciate any suggestions on making this process more robust.


Addendum:

  • I only need to import the data once, not regularly.
  • I could also use sed to generate some temporary files and import those; in fact, I'm doing that while debugging. A rough sketch of the batched variant I have in mind is shown after this list.
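For reference, this is roughly the batched variant I have in mind, reusing the connection parameters and file path from the snippet above. The batch size is an arbitrary placeholder, and the line-based split assumes the CSV has no quoted fields containing embedded newlines:

import io
import itertools

import psycopg2

BATCH_SIZE = 100000  # arbitrary placeholder; tune to whatever "smaller block" means here

con = psycopg2.connect(database=dbname, user=user, password=password,
                       host=host, port=port)
cur = con.cursor()

with open(pathTempFile) as f:
    f.readline()  # consume the header row; the batches below are sent header-less
    while True:
        batch = list(itertools.islice(f, BATCH_SIZE))
        if not batch:
            break
        cur.copy_expert("COPY public.stage_datafile FROM STDIN WITH (FORMAT csv)",
                        file=io.StringIO(''.join(batch)))
        con.commit()  # one transaction per batch

con.close()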

Best Answer

I have a number of CSV files that I would like to import into my Postgres database. Since each file is quite large (200 million to 2 billion records per file), I believe it would be better to add each file in smaller batches.

Probably not. It'll work just fine. You're probably better off running the CREATE TABLE and the COPY in the same transaction, though; with wal_level = minimal, PostgreSQL can skip WAL logging for data copied into a table created in the same transaction, so you generate minimal WAL.
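As a minimal sketch of what I mean (the column list is a placeholder for your real CSV layout, and the WAL saving only applies when wal_level is minimal, i.e. no WAL archiving or streaming replication):

import psycopg2

con = psycopg2.connect(database=dbname, user=user, password=password,
                       host=host, port=port)
cur = con.cursor()

# CREATE TABLE and COPY in the same transaction: with wal_level = minimal,
# PostgreSQL can skip WAL for the copied data and just sync the new table's
# files at commit time instead.
cur.execute("""
    CREATE TABLE public.stage_datafile (
        -- placeholder columns; use the real layout of the CSV here
        id   bigint,
        data text
    )
""")

with open(pathTempFile) as f:
    cur.copy_expert("COPY public.stage_datafile FROM STDIN WITH (FORMAT csv, HEADER)",
                    file=f)

con.commit()  # the CREATE TABLE and the COPY become durable together
con.close()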