I am new to PostGIS and SQL in general. I am running a query on polygons which, I hope, only selects distinct geometries and then repair (ST_MakeValid()
) those distinct geometries. Thereby removing duplicate geometries in the process of fixing invalid geometries. However, the query takes really long to run (days). This is the SQL statement:
CREATE TABLE schema.table_geomfix AS
SELECT gid,id, ST_MakeValid(geom) as geom
FROM schema.table_polygons
WHERE geom IN (SELECT DISTINCT geom FROM schema.table_polygons);
Running the two select statements separately is quick. About 1 minute for the makevalid
and 15 seconds on the distinct
statement.
- Why does it take so long to run the combined statement?
- Any advice for improving the query in order to make it faster?
NOTE: There is an index on the geom
column.
Best Answer
Your query is needlessly complicated and can be untangled to this 100 % equivalent one:
Neither original nor this one remove any duplicates (except by possibly removing some with
geom IS NULL
, but I doubt that was your intention).Possible misunderstanding
There is a common misunderstanding concerning output column names in the
WHERE
clause - where you can only reference input column names. So,geom
in:references the original input column not the output column (the result of
ST_MakeValid(geom)
).Related:
If your actual intention was to only keep rows where the result of
ST_MakeValid(geom)
is equal to any currently existinggeom
entry:I added a table alias and table qualification to clarify what's what.
The index you mention is only going to help with this variant.
If your actual intention was to merge duplicate rows after fixing
geom
values:You did not define how to break ties and which row to keep from a set of dupes. I keep the row with the smallest
gid
from each set. And the smallestid
from remaining dupes. Details:Aside: Calling a schema "schema" is like naming your son "son" and hoping for the best that no more sons shall arrive. In other words: don't.