PostgreSQL CSV Import – How to Split CSV Bulk Import into Smaller Blocks Using Psycopg2

Tags: copy, postgresql, python

I have a number of CSV files that I would like to import into my Postgres database. Since each file is quite large (200 million to 2 billion records per file), I believe it would be better to add each file in smaller batches.

For comparison, when using MS SQL Server with bcp, files automatically get split into 1,000-record batches. That is exactly what I (think I) want to do with COPY.

My main script is Python 3.5.2 and the database is PostgreSQL 9.5.10 on Ubuntu 16.04 LTS. I am using psycopg2's copy functionality, simply because it was the first approach that worked. I'm happy to change that, as long as I can call it from Python, since I need to do some other processing per file (organizing files, searching, determining the target staging table, unzipping, cleanup, etc.). I am not sure how exactly the data transfer works with this method. Per psycopg2's docs, the SQL version of COPY is used (https://www.postgresql.org/docs/current/static/sql-copy.html), not the command-line version. I suspect the entire file is streamed through STDIN, but again, I'm not sure.

import psycopg2

con = psycopg2.connect(database=dbname, user=user, password=password, host=host, port=port)
cur = con.cursor()

# Stream the whole CSV file into the staging table via COPY ... FROM STDIN
f = open(pathTempFile)
cur.copy_expert("""copy public.stage_datafile from STDIN DELIMITER ',' CSV HEADER""",
    file=f)
f.close()

con.commit()
con.close()

This process works fine, and I'm not even sure there are any drawbacks. However, it seems there might be a better or more robust way? I will use a trigger to export the data from the staging tables into the normalized main tables; surely that would benefit from smaller data blocks?

I appreciate any suggestions on making this process more robust.


Addendum:

  • I only need to import the data once, not regularly.
  • I could also use sed to generate some temporary files and import those; in fact, I'm doing that while debugging. A rough sketch of the batched variant I have in mind is shown after this list.
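For reference, this is roughly the batched variant I have in mind, reusing the connection parameters and file path from the snippet above. The batch size is an arbitrary placeholder, and the line-based split assumes the CSV has no quoted fields containing embedded newlines:

import io
import itertools

import psycopg2

BATCH_SIZE = 100000  # arbitrary placeholder; tune to whatever "smaller block" means here

con = psycopg2.connect(database=dbname, user=user, password=password,
                       host=host, port=port)
cur = con.cursor()

with open(pathTempFile) as f:
    f.readline()  # consume the header row; the batches below are sent header-less
    while True:
        batch = list(itertools.islice(f, BATCH_SIZE))
        if not batch:
            break
        cur.copy_expert("COPY public.stage_datafile FROM STDIN WITH (FORMAT csv)",
                        file=io.StringIO(''.join(batch)))
        con.commit()  # one transaction per batch

con.close()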

Best Answer

I have a number of CSV files that I would like to import into my Postgres database. Since each file is quite large (200 million to 2 billion records per file), I believe it would be better to add each file in smaller batches.

Probably not. It'll work just fine. You're probably better off running the CREATE TABLE and the COPY in the same transaction, though; with wal_level = minimal, PostgreSQL can skip WAL logging for data copied into a table created in the same transaction, so you generate minimal WAL.
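As a minimal sketch of what I mean (the column list is a placeholder for your real CSV layout, and the WAL saving only applies when wal_level is minimal, i.e. no WAL archiving or streaming replication):

import psycopg2

con = psycopg2.connect(database=dbname, user=user, password=password,
                       host=host, port=port)
cur = con.cursor()

# CREATE TABLE and COPY in the same transaction: with wal_level = minimal,
# PostgreSQL can skip WAL for the copied data and just sync the new table's
# files at commit time instead.
cur.execute("""
    CREATE TABLE public.stage_datafile (
        -- placeholder columns; use the real layout of the CSV here
        id   bigint,
        data text
    )
""")

with open(pathTempFile) as f:
    cur.copy_expert("COPY public.stage_datafile FROM STDIN WITH (FORMAT csv, HEADER)",
                    file=f)

con.commit()  # the CREATE TABLE and the COPY become durable together
con.close()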