Postgresql – Optimizing an Intersect query between two huge spatial tables

performancepostgispostgresqlquery-performancespatial

I am having a hard time trying to improve an intersect between two spatial tables and I would like to receive any tips about the table designs, queries or dba configs.

Tables:

Table teste.recorte_grade has 1,655,569 rows right now, but this a sub sample made for this test of a 9 million rows table.

CREATE TABLE teste.recorte_grade
(
  id integer NOT NULL DEFAULT nextval('teste."Recorte_grade_id_seq"'::regclass),
  id_gre character varying(21),
  indice_gre character varying(16),
  the_geom geometry(Polygon),
  CONSTRAINT "Recorte_grade_pkey" PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);
CREATE INDEX sidx_recorte_grade_geom
  ON teste.recorte_grade
  USING gist
  (the_geom);

Table teste2.uso_2012 has 177,888 rows and this is all data that it will ever have.

CREATE TABLE teste2.uso_2012
(
  id integer NOT NULL,
  gridcode smallint NOT NULL,
  geom geometry(MultiPolygon) NOT NULL,
  CONSTRAINT pk_id_uso_2012 PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);    
CREATE INDEX idx_hash_calsse_uso_2012_teste2
  ON teste2.uso_2012
  USING hash
  (gridcode);    
CREATE INDEX sidx_uso_2012_geom_teste2
  ON teste2.uso_2012
  USING gist
  (geom);

Problem:

All I want is the area and the gridcode of each intersection between both tables, basically, the result of this query:

Select grade.id, uso.gridcode, , st_area(st_intersection(grade.the_geom, uso.geom)) 
from teste2.uso_2012 as uso 
    inner join teste.recorte_grade as grade on ST_Intersects(grade.the_geom, uso.geom) = 't' 
    order by grade.id

However this query ran for about 16 hours without any result when I decided to cancel its execution. If it took this long with the sub sample, imagine with the full data set.

Both tables were vacuum analyzed before.

EXPLAIN for slow query: http://explain.depesz.com/s/PEV

I thought it might be a good idea to separate this in multiple queries for one gridcode each time. That's why I created the hash index.

This is the data distribution in the teste2.uso_2012 table:

+----------+---------------+---------------+
| Gridcode | Polygon Count |  Total Area   |
+----------+---------------+---------------+
|        1 |          4100 |   40360812499 |
|        2 |         16992 |  516217687499 |
|        3 |         22745 |  955870062499 |
|        4 |         32243 |  802054562500 |
|        5 |          4286 |   69461437500 |
|        6 |         16081 | 3200491312500 |
|        7 |         40704 |  447186874999 |
|        8 |          1776 |   89474187499 |
|        9 |          1894 |   41834437499 |
|       10 |         15918 | 1765555312500 |
|       11 |          5158 |  306742062499 |
|       12 |         15715 |  274680250000 |
|       14 |           275 |    5606687500 |
+----------+---------------+---------------+

Here are some queries results for individual gridcodes:

Select grade.id, uso.gridcode, st_area(st_intersection(grade.the_geom, uso.geom)) from teste.recorte_2012 as uso inner join teste.recorte_grade as grade on ST_Intersects(grade.the_geom, uso.geom) = 't' where uso.gridcode = 1
    --11 seconds
    --10,069 rows retrieved
    --http://explain.depesz.com/s/tZV1

    Select grade.id, uso.gridcode, st_area(st_intersection(grade.the_geom, uso.geom)) from teste.recorte_2012 as uso inner join teste.recorte_grade as grade on ST_Intersects(grade.the_geom, uso.geom) = 't' where uso.gridcode = 2
    --3275 seconds
    --200,682 rows retrieved

Select grade.id, uso.gridcode, st_area(st_intersection(grade.the_geom, uso.geom)) from teste2.uso_2012 as uso inner join teste.recorte_grade as grade on ST_Intersects(grade.the_geom, uso.geom) = 't' where uso.gridcode = 2
--Total query runtime: 3333 seconds
--200,682 rows retrieved.

    Select grade.id, uso.gridcode, st_area(st_intersection(grade.the_geom, uso.geom)) from teste.recorte_2012 as uso inner join teste.recorte_grade as grade on ST_Intersects(grade.the_geom, uso.geom) = 't' where uso.gridcode = 10
    --5 hours without result

teste.recorte_2012 and teste2.uso_2012 are pretty much the same table where uso_2012 have 1 column less.

As you can see, this doesn't seem very promising. Is there any recommendation to speed this process up?

I'm thinking about creating a stored procedure to loop the 177,888 rows and get directly the intersections and the area of each one of them. Is that a good idea?

Configs:

  • shared_buffers: 1920 MB
  • work_memory: 36 MB
  • effective_cache_size: 5632 MB

Server Info:

  • PostgreSQL 9.2.14
  • CENTOS RELEASE 6.4
  • 8GB SRAM
  • STORAGE V7000
  • INTEL(R) XEON(R) CPU E5-2620 2 GHZ
  • POSTGIS="2.0.2 r10789" GEOS="3.3.6-CAPI-1.7.6" PROJ="Rel. 4.8.0, 6 March 2012" GDAL="GDAL 1.9.2, released 2012/10/08" LIBXML="2.7.6" RASTER

The server is shared among other databases, but no heavy process was running in parallel at the same time I was running the queries.

I have some particular features very complex with almost 100k vertices. About the Postgres version, only the DBAs can update the infrastructures, and I'm not one of them.

Best Answer

Update:

Since this was published, the problem evolve and now we deal with 140kk+ rows. However, Postgis also evolved and it is now possible to "fix" the the feature table. No more need of Dice from ArcMap.

I did it using a ST_VoronoiPolygons approach. I created a working gist with a function that breaks features on the original table in feasible sizes for processing.

Original Answer:

For information, I was able to do this using other tools and splitting the workload:

  • To everything work fine, I had to limit each feature in 800 vertices max and 15 kmĀ² max. I tried some tools and even a recurring procedure on plsql, but without success. The only thing that I tried and split everything correctly was the Dice feature of ArcMap;
  • I divided my 170k+ rows in chunks of 20 rows and ran six instances of the query in parallel to compute the area using a .net console app;
  • With the area stored, I was able to do some processing, also in chunks and parallel, to compute the value of each single 9kk+ cell.

This process now takes "only" 3 hours to finish.


The ST_Intersects is light. The problem was this calculation:

st_area(st_intersection(grade.the_geom, uso.geom)).

Build the intersection of complex features and calculate the area was the demanding task.