Postgresql – Optimizing an Intersect query between two huge spatial tables

performance postgis postgresql query-performance spatial

I am having a hard time improving an intersect query between two spatial tables, and I would appreciate any tips about table design, queries, or DBA configuration.

Tables:

Table teste.recorte_grade has 1,655,569 rows right now, but this is a subsample, made for this test, of a 9-million-row table.

CREATE TABLE teste.recorte_grade
(
  id integer NOT NULL DEFAULT nextval('teste."Recorte_grade_id_seq"'::regclass),
  id_gre character varying(21),
  indice_gre character varying(16),
  the_geom geometry(Polygon),
  CONSTRAINT "Recorte_grade_pkey" PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);
CREATE INDEX sidx_recorte_grade_geom
  ON teste.recorte_grade
  USING gist
  (the_geom);

Table teste2.uso_2012 has 177,888 rows and this is all data that it will ever have.

CREATE TABLE teste2.uso_2012
(
  id integer NOT NULL,
  gridcode smallint NOT NULL,
  geom geometry(MultiPolygon) NOT NULL,
  CONSTRAINT pk_id_uso_2012 PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);    
CREATE INDEX idx_hash_calsse_uso_2012_teste2
  ON teste2.uso_2012
  USING hash
  (gridcode);    
CREATE INDEX sidx_uso_2012_geom_teste2
  ON teste2.uso_2012
  USING gist
  (geom);

Problem:

All I want is the area and the gridcode of each intersection between both tables, basically, the result of this query:

Select grade.id, uso.gridcode, st_area(st_intersection(grade.the_geom, uso.geom))
from teste2.uso_2012 as uso 
    inner join teste.recorte_grade as grade on ST_Intersects(grade.the_geom, uso.geom) = 't' 
    order by grade.id

However, this query ran for about 16 hours without returning any result before I decided to cancel it. If it takes this long with the subsample, imagine the full data set.

Both tables were vacuumed and analyzed beforehand.

EXPLAIN for slow query: http://explain.depesz.com/s/PEV

I thought it might be a good idea to split this into multiple queries, one gridcode at a time. That's why I created the hash index.

This is the data distribution in the teste2.uso_2012 table:

+----------+---------------+---------------+
| Gridcode | Polygon Count |  Total Area   |
+----------+---------------+---------------+
|        1 |          4100 |   40360812499 |
|        2 |         16992 |  516217687499 |
|        3 |         22745 |  955870062499 |
|        4 |         32243 |  802054562500 |
|        5 |          4286 |   69461437500 |
|        6 |         16081 | 3200491312500 |
|        7 |         40704 |  447186874999 |
|        8 |          1776 |   89474187499 |
|        9 |          1894 |   41834437499 |
|       10 |         15918 | 1765555312500 |
|       11 |          5158 |  306742062499 |
|       12 |         15715 |  274680250000 |
|       14 |           275 |    5606687500 |
+----------+---------------+---------------+

Here are some query results for individual gridcodes:

Select grade.id, uso.gridcode, st_area(st_intersection(grade.the_geom, uso.geom)) from teste.recorte_2012 as uso inner join teste.recorte_grade as grade on ST_Intersects(grade.the_geom, uso.geom) = 't' where uso.gridcode = 1
--11 seconds
--10,069 rows retrieved
--http://explain.depesz.com/s/tZV1

Select grade.id, uso.gridcode, st_area(st_intersection(grade.the_geom, uso.geom)) from teste.recorte_2012 as uso inner join teste.recorte_grade as grade on ST_Intersects(grade.the_geom, uso.geom) = 't' where uso.gridcode = 2
--3275 seconds
--200,682 rows retrieved

Select grade.id, uso.gridcode, st_area(st_intersection(grade.the_geom, uso.geom)) from teste2.uso_2012 as uso inner join teste.recorte_grade as grade on ST_Intersects(grade.the_geom, uso.geom) = 't' where uso.gridcode = 2
--Total query runtime: 3333 seconds
--200,682 rows retrieved.

Select grade.id, uso.gridcode, st_area(st_intersection(grade.the_geom, uso.geom)) from teste.recorte_2012 as uso inner join teste.recorte_grade as grade on ST_Intersects(grade.the_geom, uso.geom) = 't' where uso.gridcode = 10
--5 hours without result

teste.recorte_2012 and teste2.uso_2012 are pretty much the same table; uso_2012 just has one column fewer.

As you can see, this doesn't look very promising. Are there any recommendations to speed this process up?

I'm thinking about creating a stored procedure that loops over the 177,888 rows and computes the intersections and the area of each one directly. Is that a good idea?

Configs:

shared_buffers: 1920 MB
work_mem: 36 MB
effective_cache_size: 5632 MB

Server Info:

PostgreSQL 9.2.14
CENTOS RELEASE 6.4
8GB SRAM
STORAGE V7000
INTEL(R) XEON(R) CPU E5-2620 2 GHZ
POSTGIS="2.0.2 r10789" GEOS="3.3.6-CAPI-1.7.6" PROJ="Rel. 4.8.0, 6 March 2012" GDAL="GDAL 1.9.2, released 2012/10/08" LIBXML="2.7.6" RASTER

The server is shared among other databases, but no heavy process was running in parallel at the same time I was running the queries.

Some of my features are very complex, with almost 100k vertices. As for the Postgres version, only the DBAs can update the infrastructure, and I'm not one of them.

Best Answer

Update:

Since this was published, the problem has evolved and we now deal with 140 million+ rows. However, PostGIS has also evolved, and it is now possible to "fix" the feature table inside the database. The Dice tool from ArcMap is no longer needed.

I did it using an ST_VoronoiPolygons approach. I created a working gist with a function that breaks the features of the original table into sizes that are feasible to process.
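As a rough illustration of breaking large features into pieces of a workable size, here is a minimal sketch. It does not reproduce the ST_VoronoiPolygons function from the gist; it swaps in ST_Subdivide, which assumes PostGIS 2.2 or later (newer than the 2.0.2 listed in the question). The table name teste2.uso_2012_split, the index name, and the 800-vertex cap (borrowed from the limit in the original answer) are assumptions made for this example.

```sql
-- Hypothetical helper table: one row per piece of each original feature.
CREATE TABLE teste2.uso_2012_split AS
SELECT uso.id AS uso_id,
       uso.gridcode,
       ST_Subdivide(uso.geom, 800) AS geom  -- at most 800 vertices per piece
FROM teste2.uso_2012 AS uso;

-- Spatial index so the pieces can be joined against the grid efficiently.
CREATE INDEX sidx_uso_2012_split_geom
  ON teste2.uso_2012_split
  USING gist (geom);

ANALYZE teste2.uso_2012_split;
```

Note that this caps only the vertex count, not the area of each piece, and that a grid cell may now intersect several pieces of the same feature, so the areas would have to be summed per cell and gridcode afterwards.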

Original Answer:

For the record, I was able to do this by using other tools and splitting the workload:

For everything to work, I had to limit each feature to at most 800 vertices and 15 km². I tried several tools and even a recurring procedure in plsql, without success. The only thing I tried that split everything correctly was the Dice tool of ArcMap;
I divided my 170k+ rows into chunks of 20 rows and ran six instances of the query in parallel, using a .NET console app, to compute the areas;
With the areas stored, I was able to do further processing, also in chunks and in parallel, to compute the value of each of the 9 million+ cells.

This process now takes "only" 3 hours to finish.
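The chunked step could look roughly like the sketch below. It shows only the kind of statement each parallel worker might run; the .NET console app that drove the workers is not reproduced here. The result table teste.grade_uso_area and the use of an id range to select a chunk are assumptions for illustration, and the sketch reads from teste2.uso_2012 although the actual run used the features already split with Dice.

```sql
-- Hypothetical result table, filled one chunk at a time.
CREATE TABLE teste.grade_uso_area
(
  grade_id integer NOT NULL,
  uso_id integer NOT NULL,
  gridcode smallint NOT NULL,
  area double precision NOT NULL
);

-- One chunk: features with id 1 to 20. Each worker runs the same
-- statement with a different, non-overlapping id range.
INSERT INTO teste.grade_uso_area (grade_id, uso_id, gridcode, area)
SELECT grade.id,
       uso.id,
       uso.gridcode,
       ST_Area(ST_Intersection(grade.the_geom, uso.geom))
FROM teste2.uso_2012 AS uso
JOIN teste.recorte_grade AS grade
  ON ST_Intersects(grade.the_geom, uso.geom)
WHERE uso.id BETWEEN 1 AND 20;
```

Keeping each chunk small means a single pathological feature delays only its own chunk, and a failed chunk can be rerun without repeating the rest.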

ST_Intersects itself is cheap. The problem was this calculation:

st_area(st_intersection(grade.the_geom, uso.geom)).

Building the intersection of complex features and calculating its area was the expensive part.
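One common way to reduce the number of ST_Intersection calls, which is not something the answer above reports having tried, is to skip the intersection whenever a grid cell lies entirely inside the land-use polygon. A minimal sketch, assuming the tables from the question:

```sql
SELECT grade.id,
       uso.gridcode,
       CASE
         -- Cell entirely inside the polygon: the overlap is the cell itself.
         WHEN ST_Within(grade.the_geom, uso.geom)
           THEN ST_Area(grade.the_geom)
         -- Cell crosses the polygon boundary: compute the real intersection.
         ELSE ST_Area(ST_Intersection(grade.the_geom, uso.geom))
       END AS area
FROM teste2.uso_2012 AS uso
JOIN teste.recorte_grade AS grade
  ON ST_Intersects(grade.the_geom, uso.geom);
```

How much this helps depends on the share of cells that fall completely inside a polygon, and ST_Within against a feature with close to 100k vertices is not free either, so it is worth timing on a single gridcode first.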

Related Solutions

Postgresql – How to get the RDS instance running PostgreSQL 9.6 to perform at least as well as the instance running PostgreSQL 9.3

The solution appears to be running VACUUM FULL ANALYZE. While I did see significant performance improvements after reindexing the databases, that was probably an unnecessary step. I should have just run VACUUM FULL ANALYZE.

This is the new Explain from instance 2 after running VACUUM FULL ANALYZE:

Aggregate  (cost=19242.48..19242.49 rows=1 width=8)
  ->  Nested Loop  (cost=4183.20..19242.42 rows=1 width=4878)
        ->  Hash Join  (cost=4182.77..19241.42 rows=1 width=4881)
              Hash Cond: (((((b.state)::text || (b.county)::text) || (b.tract)::text) || (b.blkgrp)::text) = (blkgrps_2015.geoid)::text)
              ->  Bitmap Heap Scan on g_2015_5 b  (cost=4174.21..17562.74 rows=222682 width=27)
                    Recheck Cond: ((sumlevel)::text = '150'::text)
                    ->  Bitmap Index Scan on sumlev_2015_idx  (cost=0.00..4118.54 rows=222682 width=0)
                          Index Cond: ((sumlevel)::text = '150'::text)
              ->  Hash  (cost=8.55..8.55 rows=1 width=4883)
                    ->  Index Scan using blkgrps_2015_gist on blkgrps_2015  (cost=0.28..8.55 rows=1 width=4883)
                          Index Cond: (geom && '0103000020E61000000100000031000000386F3E38F5EA55C04BB9C64D80F444408467CE6AF6EA55C06AB0B14A61F44440239A5B50FAEA55C0EAD24EC742F44440FC31CDD700EB55C083C6434925F444401F8783E409EB55C0881CBC5109F444403EEED54E15EB55C01CC1335BEFF34440B9C5C0E422EB55C0F5C65ED7D7F34440FEBFC06A32EB55C068B7372DC3F34440EAA6D79C43EB55C00AE43CB7B1F344409120B62F56EB55C08F69E4C1A3F34440185E05D269EB55C056A34D8A99F34440DA13CB2D7EEB55C0B2C2353D93F3444038A9E1E992EB55C0A01734F690F344406D377EABA7EB55C0D55F41BF92F34440ADB3BD17BCEB55C0E9298C9098F344401F8232D5CFEB55C08A0A9B50A2F3444069AD6B8DE2EB55C0890EBCD4AFF34440F7196FEEF3EB55C0F483BFE1C0F34440DF4320AC03EC55C01DECFA2CD5F34440DA698D8111EC55C09DAD8F5DECF34440D6751D321DEC55C06EF5EF0D06F444401B7A998A26EC55C0DD269BCD21F4444030370D622DEC55C0F2450A233FF44440B1CF7A9A31EC55C061F6C38D5DF444407A915E2133EC55C002FA8E887CF44440AC8B00F031EC55C07D93B98B9BF44440AC83920B2EEC55C0BAD46B0FBAF44440B0BC198527EC55C0FAAEFA8DD7F44440D0E924791EEC55C078953186F3F4444068864F0F13EC55C0E6AA887D0DF5444017AF947905EC55C0FDC53D0225F54440056C74F3F5EB55C04C1747AD39F544408F26EFC0E4EB55C0EEE116244BF544401FBE5B2DD2EB55C0D185281A59F5444022561D8ABEEB55C03D09505263F54440B5863E2DAAEB55C0C75DC69F69F544408810F86F95EB55C09AC3EDE66BF54440D58F29AD80EB55C0CBEACA1D6AF54440F4E8CA3F6CEB55C0F9BC304C64F54440A5495D8158EB55C07D0E9E8B5AF54440189C62C845EB55C0FACDCD064DF54440D32BE26634EB55C0919EFBF83BF54440C20101A924EB55C07415E0AC27F544409524B4D316EB55C03712697B10F54440267792230BEB55C01FD333CAF6F44440A861CBCB01EB55C04B78CF09DBF4444009CF46F5FAEA55C0019CCFB3BDF444403450F2BDF6EA55C0AD77B7489FF44440386F3E38F5EA55C04BB9C64D80F44440'::geometry)
                          Filter: _st_intersects(geom, '0103000020E61000000100000031000000386F3E38F5EA55C04BB9C64D80F444408467CE6AF6EA55C06AB0B14A61F44440239A5B50FAEA55C0EAD24EC742F44440FC31CDD700EB55C083C6434925F444401F8783E409EB55C0881CBC5109F444403EEED54E15EB55C01CC1335BEFF34440B9C5C0E422EB55C0F5C65ED7D7F34440FEBFC06A32EB55C068B7372DC3F34440EAA6D79C43EB55C00AE43CB7B1F344409120B62F56EB55C08F69E4C1A3F34440185E05D269EB55C056A34D8A99F34440DA13CB2D7EEB55C0B2C2353D93F3444038A9E1E992EB55C0A01734F690F344406D377EABA7EB55C0D55F41BF92F34440ADB3BD17BCEB55C0E9298C9098F344401F8232D5CFEB55C08A0A9B50A2F3444069AD6B8DE2EB55C0890EBCD4AFF34440F7196FEEF3EB55C0F483BFE1C0F34440DF4320AC03EC55C01DECFA2CD5F34440DA698D8111EC55C09DAD8F5DECF34440D6751D321DEC55C06EF5EF0D06F444401B7A998A26EC55C0DD269BCD21F4444030370D622DEC55C0F2450A233FF44440B1CF7A9A31EC55C061F6C38D5DF444407A915E2133EC55C002FA8E887CF44440AC8B00F031EC55C07D93B98B9BF44440AC83920B2EEC55C0BAD46B0FBAF44440B0BC198527EC55C0FAAEFA8DD7F44440D0E924791EEC55C078953186F3F4444068864F0F13EC55C0E6AA887D0DF5444017AF947905EC55C0FDC53D0225F54440056C74F3F5EB55C04C1747AD39F544408F26EFC0E4EB55C0EEE116244BF544401FBE5B2DD2EB55C0D185281A59F5444022561D8ABEEB55C03D09505263F54440B5863E2DAAEB55C0C75DC69F69F544408810F86F95EB55C09AC3EDE66BF54440D58F29AD80EB55C0CBEACA1D6AF54440F4E8CA3F6CEB55C0F9BC304C64F54440A5495D8158EB55C07D0E9E8B5AF54440189C62C845EB55C0FACDCD064DF54440D32BE26634EB55C0919EFBF83BF54440C20101A924EB55C07415E0AC27F544409524B4D316EB55C03712697B10F54440267792230BEB55C01FD333CAF6F44440A861CBCB01EB55C04B78CF09DBF4444009CF46F5FAEA55C0019CCFB3BDF444403450F2BDF6EA55C0AD77B7489FF44440386F3E38F5EA55C04BB9C64D80F44440'::geometry)
        ->  Index Scan using e_2015_5_0003000_pkey on e_2015_5_0003000  (cost=0.42..0.99 rows=1 width=19)
              Index Cond: (((stusab)::text = (b.stusab)::text) AND ((logrecno)::text = (b.logrecno)::text))

So, lesson learned: when launching an RDS instance from a snapshot, run VACUUM FULL ANALYZE before testing.

Postgresql – How to run multiple ST_DWITHINs/ ST_INTERSECTS within a single WHERE clause

Your FROM clause creates an unwanted cross join between the tables. This is not related to PostGIS and geometric functions.

FROM schema2.tablea a, schema3.tableb b

As suggested, use two separate UPDATE queries, one for each table. When you compare the EXPLAIN output, you should see two nested loops for the big query, versus a single nested loop in each of the two smaller queries.
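Since the original UPDATE statement is not shown here, the following is only a sketch of what the two separate queries might look like. The target table schema1.target, its columns geom, near_a and hits_b, the geom columns on tablea and tableb, and the distance of 100 are all hypothetical.

```sql
-- Hypothetical target table schema1.target (columns geom, near_a, hits_b);
-- the geometry column names on tablea and tableb are also assumed.

-- Query 1: only tablea is joined, so there is no cross join with tableb.
UPDATE schema1.target AS t
SET near_a = TRUE
FROM schema2.tablea AS a
WHERE ST_DWithin(t.geom, a.geom, 100);  -- distance in units of the SRS

-- Query 2: only tableb is joined.
UPDATE schema1.target AS t
SET hits_b = TRUE
FROM schema3.tableb AS b
WHERE ST_Intersects(t.geom, b.geom);
```

Each statement pairs the target with a single table, so the planner never has to combine every row of tablea with every row of tableb.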