Greenplum: Inserting non-duplicate rows on unique constraint violation

greenplum

I have a stored procedure that does the following:

INSERT INTO schema.my_unique_values
     SELECT DISTINCT id, value
       FROM schema.a_huge_table
      WHERE NOT EXISTS (SELECT 1
                          FROM schema.my_unique_values)

In summary, this query inserts unique rows from a_huge_table into the my_unique_values table.

The problem is that when I run this query concurrently from a multi-threaded script (which I do for a reason), it can insert duplicate rows. I have tried the SERIALIZABLE isolation level to avoid these 'phantom reads', but still no luck. My idea is to let Postgres insert only the non-duplicate rows when a unique constraint violation occurs. Is this possible?
In my experience, a unique constraint violation cancels the whole transaction, so the non-duplicate rows are not inserted either. How can I achieve my goal?

Note: I am using Greenplum 4.3.11, which is based on Postgres 8.2, so there are limitations on the queries I can use.

Thanks..

Best Answer

First of all, I think you need to change your query slightly, because the WHERE NOT EXISTS, as written, is false for every row as soon as my_unique_values contains any row at all. The subquery needs its own WHERE clause that correlates it with the outer query:

INSERT INTO schema.my_unique_values
     SELECT DISTINCT id, value
       FROM schema.a_huge_table a
      WHERE NOT EXISTS 
             (SELECT 1
                FROM schema.my_unique_values m
               WHERE m.id = a.id AND m.value = a.value)
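To see the difference between the two forms on a small scale, here is a minimal sketch. The scratch tables src and dst and their column types are hypothetical and exist only for this illustration:

```sql
-- Hypothetical scratch tables, for illustration only.
CREATE TEMP TABLE src (id integer, value text);
CREATE TEMP TABLE dst (id integer, value text);

INSERT INTO src VALUES (1, 'a');
INSERT INTO src VALUES (2, 'b');
INSERT INTO dst VALUES (1, 'a');

-- Uncorrelated subquery: dst is not empty, so NOT EXISTS is false
-- for every row of src and nothing is selected.
SELECT DISTINCT id, value
  FROM src
 WHERE NOT EXISTS (SELECT 1 FROM dst);

-- Correlated subquery: only the row missing from dst, (2, 'b'),
-- is selected.
SELECT DISTINCT s.id, s.value
  FROM src s
 WHERE NOT EXISTS (SELECT 1
                     FROM dst d
                    WHERE d.id = s.id AND d.value = s.value);
```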

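An INSERT, whether inside a larger transaction or as a single-statement transaction, is atomic: if one row fails, the whole statement fails. In any case, you cannot have phantom reads at the SERIALIZABLE isolation level. Note, however, that in PostgreSQL releases before 9.1 (which includes the 8.2 base of Greenplum 4.3), SERIALIZABLE is implemented as snapshot isolation: each transaction works on its own snapshot and does not see rows that a concurrent transaction has not yet committed, so two concurrent runs can still both decide that the same row is missing. If you're at that isolation level, first make sure your NOT EXISTS (...) really has a properly written WHERE clause.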
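One way to keep concurrent runs from racing on a Postgres 8.2 base is to make the writers take turns with an explicit table lock. The following is a minimal sketch, assuming the default READ COMMITTED isolation level and the correlated NOT EXISTS query shown above; it is an illustration and has not been tested on Greenplum 4.3.11:

```sql
BEGIN;

-- EXCLUSIVE mode still allows plain SELECTs on the table, but it
-- conflicts with writers and with other sessions requesting the same
-- lock, so concurrent runs of this block execute one after another.
LOCK TABLE schema.my_unique_values IN EXCLUSIVE MODE;

-- Runs only after the lock is granted, so under READ COMMITTED it
-- sees the rows committed by the previous lock holder.
INSERT INTO schema.my_unique_values
     SELECT DISTINCT id, value
       FROM schema.a_huge_table a
      WHERE NOT EXISTS
             (SELECT 1
                FROM schema.my_unique_values m
               WHERE m.id = a.id AND m.value = a.value);

COMMIT;
```

The trade-off is that the inserts no longer run in parallel. If the transaction runs at SERIALIZABLE, the LOCK TABLE has to come before any SELECT or data-modifying statement, otherwise the snapshot is already fixed when the lock is acquired. Inside a function, leave out BEGIN and COMMIT; the lock is then held until the calling transaction ends.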

As of PostgreSQL 9.5, there is a new clause that can be part of an INSERT and that looks like exactly what your use case needs. On that version or later (so not on Greenplum 4.3, which is based on 8.2), you could modify your query to use ON CONFLICT DO NOTHING:

INSERT INTO schema.my_unique_values
     SELECT DISTINCT id, value
       FROM schema.a_huge_table a
ON CONFLICT DO NOTHING ;

NOTE: The ON CONFLICT clause applies to each row, not to the INSERT as a whole.
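ON CONFLICT only has something to react to if the table has a unique constraint or unique index. A minimal sketch for PostgreSQL 9.5 or later follows; the table definition, column types, and sample values are assumptions made up for the illustration:

```sql
-- Hypothetical table; the primary key is the unique constraint that
-- ON CONFLICT reacts to.
CREATE TABLE my_unique_values (
    id    integer NOT NULL,
    value text    NOT NULL,
    PRIMARY KEY (id, value)
);

INSERT INTO my_unique_values (id, value)
VALUES (1, 'a'), (2, 'b');

-- (2, 'b') already exists and is skipped; (3, 'c') is inserted.
-- Without the ON CONFLICT clause the whole statement would fail
-- with a unique violation.
INSERT INTO my_unique_values (id, value)
VALUES (2, 'b'), (3, 'c')
ON CONFLICT (id, value) DO NOTHING;
```

Naming the conflict target, here (id, value), is optional with DO NOTHING; without it, a conflict on any unique constraint of the table is skipped.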

Related Solutions

Get total size for all Postgres or Greenplum tables including partitions

Here is what I have. It is a bit roundabout and not tested on PG 8.2 (which is not supported anymore ;) ), but it shows schema.tablename and a pretty-printed size for each table (first SELECT), with partitioned tables grouped under their parent table name (second SELECT). It may not be the best solution (tuning welcome).

-- get all table sizes, grouping partitioned tables under the parent table name
SELECT nspname || '.' || relname AS "relation",
       pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
-- pg_inherits references pg_class.oid, which can differ from relfilenode
LEFT OUTER JOIN pg_inherits I ON (C.oid = I.inhrelid OR C.oid = I.inhparent)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind <> 'i'
    AND nspname !~ '^pg_toast'
    AND I.inhrelid IS NULL
UNION ALL
-- sum() over bigint yields numeric; cast back for pg_size_pretty(bigint)
SELECT nspname || '.' || relname AS "relation",
       pg_size_pretty(sum(pg_relation_size(inhrelid))::bigint) AS "total_size"
FROM pg_inherits I
LEFT JOIN pg_class C ON (C.oid = I.inhparent)
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
GROUP BY nspname, relname
ORDER BY 1, 2

How to delete all rows in a Greenplum row oriented table

Since Greenplum uses the Postgres database engine, the reasoning that applies to a Postgres database seems appropriate here. I don't see any reason not to use TRUNCATE when deleting all rows in a table with no children or foreign keys (which is your case), unless you have ON DELETE triggers that you want to fire.
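For comparison, the two options look like this. The table name schema.my_table is a placeholder, not a name from the question:

```sql
-- Removes all rows at once without scanning them;
-- ON DELETE triggers are not fired.
TRUNCATE TABLE schema.my_table;

-- Row-by-row alternative: slower on large tables,
-- but ON DELETE triggers are fired for the deleted rows.
DELETE FROM schema.my_table;
```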