Postgresql – Why does this PostGIS query take so long to run

performancepostgispostgresqlpostgresql-performance

I am new to PostGIS and SQL in general. I am running a query on polygons which, I hope, only selects distinct geometries and then repair (ST_MakeValid()) those distinct geometries. Thereby removing duplicate geometries in the process of fixing invalid geometries. However, the query takes really long to run (days). This is the SQL statement:

CREATE TABLE schema.table_geomfix AS 
SELECT gid,id, ST_MakeValid(geom) as geom 
FROM schema.table_polygons 
WHERE geom IN (SELECT DISTINCT geom FROM schema.table_polygons);

Running the two select statements separately is quick. About 1 minute for the makevalid and 15 seconds on the distinct statement.

  1. Why does it take so long to run the combined statement?
  2. Any advice for improving the query in order to make it faster?

NOTE: There is an index on the geom column.

Best Answer

Your query is needlessly complicated and can be untangled to this 100 % equivalent one:

SELECT gid, id, ST_MakeValid(geom) AS geom
FROM   schema.table_polygons
WHERE  geom IS NOT NULL;

Neither original nor this one remove any duplicates (except by possibly removing some with geom IS NULL, but I doubt that was your intention).

Possible misunderstanding

There is a common misunderstanding concerning output column names in the WHERE clause - where you can only reference input column names. So, geom in:

WHERE geom IN ...

references the original input column not the output column (the result of ST_MakeValid(geom)).

Related:

If your actual intention was to only keep rows where the result of ST_MakeValid(geom) is equal to any currently existing geom entry:

SELECT gid, id, ST_MakeValid(t.geom) AS geom 
FROM   schema.table_polygons t
WHERE  EXISTS (
   SELECT 1
   FROM   schema.table_polygons
   WHERE  geom = ST_MakeValid(t.geom)
   );

I added a table alias and table qualification to clarify what's what.

The index you mention is only going to help with this variant.

If your actual intention was to merge duplicate rows after fixing geom values:

SELECT DISTINCT ON (ST_MakeValid(t.geom))
       gid, id, ST_MakeValid(geom) AS geom 
FROM   schema.table_polygons t
ORDER  BY ST_MakeValid(t.geom), gid, id;

You did not define how to break ties and which row to keep from a set of dupes. I keep the row with the smallest gid from each set. And the smallest id from remaining dupes. Details:

Aside: Calling a schema "schema" is like naming your son "son" and hoping for the best that no more sons shall arrive. In other words: don't.