PostgreSQL – How to bulk load data and still provide row-by-row feedback

csv, import, postgresql

We want to allow our users to import CSVs containing many thousands of records to "pre-load" a particular table in our system. It's not strictly a one-time operation; it may be performed multiple times as they get feedback, tweak the CSV, and re-import.

I understand that this is not a lot of data, relatively speaking. But we want the operation to be atomic (this is a multi-user system; other users may also be adding or modifying data) and we want to provide per-row feedback (like "Row 123 is a duplicate of item 'abc'").

I know that statements like COPY or tools like pg_bulkload can do the work very quickly, but this is user-interactive via our web app and we require row-by-row feedback.

Our naive approach was to run sequential INSERT statements and catch exceptions. Even though we're re-using a single connection, each row is a separate call and runs in its own transaction. This is taking way too long.

for(var i = 0; i < rows.Count; i++)
{
    try
    {
        connection.Execute("INSERT INTO table1 VALUES (...)");
        // format a "successful" response
    }
    catch (PostgresException e)
    {
        // format a "failed" response based on exception details... 
    }
}

We tried creating a single INSERT statement with multiple VALUES rows, but this does not provide row-by-row feedback and it stops on the first error (I found no useful way to use ON CONFLICT here).
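
For reference, a minimal sketch of what that attempt looks like (the column names are assumed from the CSV sample below; the real schema may differ). ON CONFLICT (name) DO NOTHING does keep duplicates from aborting the statement, but RETURNING only reports the rows that actually made it in, and a NULL or empty name still fails the whole statement:

-- Assumed schema: table1(id serial PRIMARY KEY, name text NOT NULL UNIQUE, region text, code int, "set" int)
INSERT INTO table1 (name, region, code, "set")
VALUES ('A1', 'A', 100, 1),
       ('A2', 'A', 101, 1),
       ('A1', 'A', 102, 1)   -- duplicate of the first row
ON CONFLICT (name) DO NOTHING
RETURNING id, name;
-- Only the surviving rows come back; the skipped duplicate is simply absent,
-- so there is nothing to map back to a CSV line number.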

We also tried sharing a transaction and running sequential statements, but the transaction is "aborted" after the first error.

using(var tx = connection.BeginTransaction())
{
    for(var i = 0; i < rows.Count; i++)
    {
        try
        {
            // Dapper takes the transaction as a named argument; passing it
            // positionally would treat it as the parameter object
            connection.Execute("INSERT INTO table1 VALUES (...)", transaction: tx);
            // format a "successful" response
        }
        catch (PostgresException e)
        {
            // format a response based on the type of failure...
            // (the transaction is now aborted; every later statement fails too)
        }
    }

    tx.Commit();
}
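
(If we stay with per-row statements, the usual way to survive an error without losing the whole transaction seems to be a SAVEPOINT per row. A sketch in plain SQL, using the same assumed columns as above; it keeps the batch atomic, but it does nothing for the per-row round-trip cost that is making this slow.)

BEGIN;

SAVEPOINT row_1;
INSERT INTO table1 (name, region, code, "set") VALUES ('A1', 'A', 100, 1);
RELEASE SAVEPOINT row_1;

SAVEPOINT row_2;
INSERT INTO table1 (name, region, code, "set") VALUES (NULL, 'B', 200, 2);  -- fails (NOT NULL)
ROLLBACK TO SAVEPOINT row_2;  -- undoes only this row; the transaction stays usable

-- ...one savepoint per remaining row...
COMMIT;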

I'm beginning to think that we need to run this entire operation within the database in SQL. We have only one constraint right now: a NOT NULL UNIQUE "name" column. So the statuses would be something like…

  • "OK": the row inserted fine (return the new PRIMARY KEY "id" column)
  • "Conflict": the name in this row is a duplicate of an existing row either a row seen earlier in the CSV or already in the table (return the duplicate name)
  • "Bad Request": the name in this row is missing, empty, or all whitespace

We also need to return the index (INSERT order, I guess) in order to approximate the line number in the original CSV file. I'm thinking we need to INSERT into a temporary table and use something like row_number().
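
A minimal sketch of that staging step, using the same assumed columns as above (all names are illustrative): load the CSV rows into a temporary table together with their original position, then do the classification in set-based SQL afterwards.

-- Staging table holds the raw CSV rows plus their original line position.
CREATE TEMPORARY TABLE staging (
    idx     int,     -- 1-based CSV row number, supplied by the application
    name    text,
    region  text,
    code    int,
    "set"   int
);

-- Bulk load from the app, e.g. COPY staging FROM STDIN WITH (FORMAT csv),
-- or a handful of multi-row INSERTs.

Supplying idx explicitly from the application seems safer than deriving it with row_number() after the fact, since row order in a table is not guaranteed.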

Given a CSV like this (name + some example columns, just to make the point)

name,region,code,set
A1,A,100,1
A2,A,101,1
A1,A,102,1
,B,200,2
C1,C,300,3

I would need responses like this

index | id     | name   | status
------|--------|--------|-------
1     | 100    | A1     | OK
2     | 101    | A2     | OK
3     | [null] | A1     | Conflict
4     | [null] | [null] | Bad Request
5     | 102    | C1     | OK

Am I on the right path? Is this a workable solution or is there some other technique that would work better?

Does it make sense to work the queries out by status? That is, rather than try to process one record at a time, can I find all of the bad requests first, then all of the duplicates, then all of the new rows (and INSERT them)? I'm fine with that ordering in the response; we can always reorganize it in the application code.

index | id     | name   | status
------|--------|--------|-------
4     | [null] | [null] | Bad Request
3     | [null] | A1     | Conflict
1     | 100    | A1     | OK
2     | 101    | A2     | OK
5     | 102    | C1     | OK

Best Answer

There is no magic solution here. You can speed it up (bulk load into a temporary table, then execute the actual insert from there), but your requirement of absolute row-by-row feedback basically throws any pure set operation out of the window. And if it has to be atomic, it has to be atomic. Time optimization is all you CAN do.

Now, going through a temporary staging table (which can be ordered), with one complex script that then writes the feedback back out of the staging table, is about the best you can do. What it does is eliminate the per-row latency to the server, and that is the best you can hope for.

Standard practice is to avoid any row-by-row processing, but your requirement just is what it is.
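
To make that concrete, here is one possible shape for the "complex script", assuming the staging table and table1 definitions sketched in the question (illustrative only, not tested against the real schema). It classifies every staged row, inserts only the good ones, and returns one feedback row per CSV line, all in a single statement:

WITH classified AS (
    SELECT s.idx,
           s.name,
           s.region, s.code, s."set",
           CASE
               WHEN s.name IS NULL OR btrim(s.name) = ''  THEN 'Bad Request'
               WHEN t.name IS NOT NULL                    THEN 'Conflict'  -- already in table1
               WHEN row_number() OVER (PARTITION BY s.name ORDER BY s.idx) > 1
                                                          THEN 'Conflict'  -- duplicate earlier in the CSV
               ELSE 'OK'
           END AS status
    FROM staging s
    LEFT JOIN table1 t ON t.name = s.name
),
inserted AS (
    INSERT INTO table1 (name, region, code, "set")
    SELECT name, region, code, "set"
    FROM classified
    WHERE status = 'OK'
    RETURNING id, name
)
SELECT c.idx,
       i.id,
       CASE WHEN c.status = 'Bad Request' THEN NULL ELSE c.name END AS name,
       c.status
FROM classified c
LEFT JOIN inserted i ON c.status = 'OK' AND i.name = c.name
ORDER BY c.idx;   -- or ORDER BY c.status, c.idx for the grouped-by-status layout

Run inside one transaction, this is atomic and a single round trip per import, which is where the time goes. One caveat: a concurrent transaction inserting the same name between the snapshot and the INSERT can still raise a unique violation; adding ON CONFLICT (name) DO NOTHING to the inner INSERT keeps the statement alive, at the cost of a missing id for that row.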