PostgreSQL Upsert: Efficiently Upserting 50K Records into Large Tables

Tags: c#, postgresql, postgresql-performance, upsert

Using C#, I'm trying to upsert from a CSV file (~55,000 records) into a PostgreSQL table with 350 columns. I'm building the upsert statements from the header and each comma-separated record, like this:

INSERT INTO public.table(field1,field2,field3, ..... ,field350) VALUES(?,?,?,?,?,.......,?) ON CONFLICT(field1) DO UPDATE SET ...;
INSERT INTO public.table(field1,field2,field3, ..... ,field350) VALUES(?,?,?,?,?,.......,?) ON CONFLICT(field1) DO UPDATE SET ...;
....etc

I know about the bulk-load COPY operation, which gave me very good performance for plain inserts, but as mentioned, I need to run upserts, and I haven't found any way to run them via COPY.

Moreover, I can't run these in a single transaction, because we want to check line by line: if a line fails, we don't want to fail the whole process, but rather log the exception and continue with the next line.

Using a stored procedure also didn't work because of the large number of columns we have (350), which exceeds PostgreSQL's limit on function arguments (100 by default).

With my C# script I'm achieving 50K rows in 2 minutes. I know this can be optimized, so is there a more efficient way you can suggest?

Best Answer

Some suggestions:

  • Use prepared statements. That will speed up execution, because the statement doesn't have to be planned every time.
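
    In C# with Npgsql, this might look roughly like the following minimal sketch. The two columns, their types, and the connection handling are placeholders standing in for the real 350-column statement:

    using System.Collections.Generic;
    using Npgsql;
    using NpgsqlTypes;

    // Sketch: one prepared upsert, reused for every CSV row.
    // "field1"/"field2" and their types stand in for the real 350 columns.
    static void UpsertRows(string connString, IEnumerable<(int F1, string F2)> rows)
    {
        using var conn = new NpgsqlConnection(connString);
        conn.Open();

        using var cmd = new NpgsqlCommand(
            "INSERT INTO public.table (field1, field2) VALUES (@f1, @f2) " +
            "ON CONFLICT (field1) DO UPDATE SET field2 = EXCLUDED.field2", conn);
        cmd.Parameters.Add(new NpgsqlParameter("f1", NpgsqlDbType.Integer));
        cmd.Parameters.Add(new NpgsqlParameter("f2", NpgsqlDbType.Text));
        cmd.Prepare();                      // planned once on the server

        foreach (var (f1, f2) in rows)
        {
            cmd.Parameters["f1"].Value = f1;
            cmd.Parameters["f2"].Value = f2;
            cmd.ExecuteNonQuery();          // reuses the prepared plan
        }
    }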

  • Use savepoints. That way, you can have a single big transaction and still recover from errors:

    • Set a savepoint every 100 rows or so.

    • If a row fails, roll back to the savepoint and redo everything up to the failing row, then skip it and start a new savepoint.

    If you have many errors, batches of fewer than 100 rows may be preferable, so that you don't have to redo as much work. But setting a savepoint for every single row won't be good for performance either.
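
    In C#, that pattern might look roughly like this minimal sketch. The batch size, the tuple row type, and the upsertRow delegate (for example, the prepared statement above) are assumptions; NpgsqlTransaction's Save/Rollback/Release correspond to SAVEPOINT, ROLLBACK TO SAVEPOINT, and RELEASE SAVEPOINT:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using Npgsql;

    // Sketch: one big transaction, a savepoint per 100-row batch, and a
    // row-by-row replay with per-row savepoints when a batch fails.
    // upsertRow is the caller's single-row upsert (assumed, not shown here).
    static void UpsertWithSavepoints(NpgsqlConnection conn,
                                     IReadOnlyList<(int F1, string F2)> rows,
                                     Action<(int F1, string F2)> upsertRow)
    {
        using var tx = conn.BeginTransaction();
        const int batchSize = 100;

        for (int start = 0; start < rows.Count; start += batchSize)
        {
            var batch = rows.Skip(start).Take(batchSize).ToList();
            tx.Save("sp");                         // SAVEPOINT sp
            try
            {
                foreach (var row in batch) upsertRow(row);
                tx.Release("sp");                  // RELEASE SAVEPOINT sp
            }
            catch (PostgresException)
            {
                tx.Rollback("sp");                 // ROLLBACK TO SAVEPOINT sp
                foreach (var row in batch)         // redo, isolating each row
                {
                    tx.Save("sp");
                    try { upsertRow(row); tx.Release("sp"); }
                    catch (PostgresException ex)
                    {
                        tx.Rollback("sp");         // skip just this row
                        Console.Error.WriteLine($"row skipped: {ex.MessageText}");
                    }
                }
            }
        }
        tx.Commit();                               // all good rows persist
    }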

  • COPY the data to a temporary table and then

    INSERT INTO finaltable
    SELECT * FROM temptable
    ON CONFLICT ...;
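
    In C# with Npgsql, the COPY step can be done with a binary import. This is a minimal sketch: the temp-table name, the two columns, and their types are placeholders, and the real statement would list all 350 columns in the SET clause:

    using System.Collections.Generic;
    using Npgsql;

    // Sketch: bulk-load rows into a temp table with binary COPY,
    // then run a single set-based upsert into the final table.
    static void CopyThenUpsert(NpgsqlConnection conn, IEnumerable<(int F1, string F2)> rows)
    {
        using (var ddl = new NpgsqlCommand(
            "CREATE TEMP TABLE temptable (LIKE public.table INCLUDING DEFAULTS)", conn))
            ddl.ExecuteNonQuery();

        using (var writer = conn.BeginBinaryImport(
            "COPY temptable (field1, field2) FROM STDIN (FORMAT BINARY)"))
        {
            foreach (var (f1, f2) in rows)
            {
                writer.StartRow();
                writer.Write(f1);     // field1
                writer.Write(f2);     // field2
            }
            writer.Complete();        // commits the COPY stream
        }

        using (var upsert = new NpgsqlCommand(
            "INSERT INTO public.table SELECT * FROM temptable " +
            "ON CONFLICT (field1) DO UPDATE SET field2 = EXCLUDED.field2", conn))
            upsert.ExecuteNonQuery();
    }

    Note that the set-based upsert trades line-by-line error handling for speed; one option is to validate rows with queries against the temp table and weed out bad ones before the final INSERT.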