Postgresql – Optimizing a simple query joining two big tables

index-tuning, performance, postgresql, postgresql-9.6, query-performance

I don't understand why the following query is so slow (with Postgresql 9.6):

SELECT s.pkid AS pkid_site , s.geom AS geom_site , z.pkid AS pkidEmprise, z.geom AS geom_emprise, z.précision_contour
FROM traitements.sites_candidats AS s
JOIN traitements.zones_sites AS z
    ON z.pkid_site = s.pkid
WHERE s.statut = 'selected'
AND z.statut = 'selected';

The structure of the tables is the following:

CREATE TABLE traitements.sites_candidats (
    pkid serial PRIMARY KEY,
    statut varchar(255) NOT NULL, 
    geom geometry(Point, 2154)
); 
CREATE INDEX ON traitements.sites_candidats (statut);

CREATE TABLE traitements.zones_sites (
    pkid serial PRIMARY KEY,
    pkid_site integer NOT NULL,
    geom geometry(MultiPolygon,2154),
    statut varchar(100)
);

CREATE INDEX zones_sites_idx_pkid_site  ON traitements.zones_sites (pkid_site);
CREATE INDEX zones_sites_idx_statut ON traitements.zones_sites (statut) ;
ALTER TABLE  traitements.zones_sites
    ADD CONSTRAINT zones_sites_references_sites_candidats FOREIGN KEY (pkid_site) REFERENCES traitements.sites_candidats(pkid) ON DELETE CASCADE;

(I deleted a couple of columns that are not involved in the query in order to improve readability).

The result of EXPLAIN ANALYZE:

Hash Join  (cost=17709.29..152088.29 rows=147368 width=887) (actual time=137.074..879.884 rows=210708 loops=1)
  Hash Cond: (z.pkid_site = s.pkid)
  ->  Seq Scan on zones_sites z  (cost=0.00..85198.14 rows=210717 width=855) (actual time=0.140..384.964 rows=210708 loops=1)
        Filter: ((statut)::text = 'selected'::text)
        Rows Removed by Filter: 23
  ->  Hash  (cost=13433.16..13433.16 rows=210490 width=36) (actual time=136.772..136.772 rows=210708 loops=1)
        Buckets: 65536  Batches: 8  Memory Usage: 2191kB
        ->  Seq Scan on sites_candidats s  (cost=0.00..13433.16 rows=210490 width=36) (actual time=3.085..87.774 rows=210708 loops=1)
              Filter: ((statut)::text = 'selected'::text)
              Rows Removed by Filter: 90265
Planning time: 0.386 ms
Execution time: 888.436 ms

It's obviously the JOIN .. ON that consumes the time, because it builds a hash.

Table traitements.sites_candidats has about 300k rows (210k of them with statut = 'selected') and traitements.zones_sites about 210k rows (nearly 99% of them with statut = 'selected'). The query takes more than 4 minutes.

Best Answer

Composite indices including statut might help.
Since only rows with statut = 'selected' are of interest, partial indexes might help some more:

-- CREATE INDEX ON traitements.zones_sites (pkid_site)
-- WHERE statut = 'selected';  -- not useful for you!

CREATE INDEX ON traitements.sites_candidats (pkid)
WHERE statut = 'selected';     -- probably also not useful, yet

Smaller indices, less write load and fragmentation.
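To see the size savings for yourself, you can compare index footprints in the catalog. (Postgres auto-generates names for the unnamed indexes created above, so look them up rather than guessing.)

```sql
-- List all indexes in the schema with their on-disk size:
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE schemaname = 'traitements';
```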

However, most rows in zones_sites have statut = 'selected', and you fetch several columns and (almost) all rows from that table, so Postgres will stick to a sequential scan there as the fastest option. The only index that can help is the one on sites_candidats - where also ~ 70 % of all rows qualify. So a bitmap index scan using the above index would most likely still be more expensive than another sequential scan on sites_candidats.
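To test this reasoning instead of guessing, you can temporarily discourage sequential scans in the current session and compare plans. This is a diagnostic technique only, never something to leave on in production:

```sql
SET enable_seqscan = off;  -- session-local planner toggle, for testing only

EXPLAIN ANALYZE
SELECT s.pkid, s.geom, z.pkid, z.geom
FROM traitements.sites_candidats AS s
JOIN traitements.zones_sites AS z ON z.pkid_site = s.pkid
WHERE s.statut = 'selected'
AND z.statut = 'selected';

RESET enable_seqscan;      -- restore the default planner behavior
```

If the forced index plan comes out slower, the planner's preference for a sequential scan was justified.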

The remaining option to improve performance is an index-only scan. Consider appending the only other column of interest from sites_candidats to the index:

CREATE INDEX ON traitements.sites_candidats (pkid, geom)
WHERE statut = 'selected';

Only useful if the preconditions for index-only scans are met; it becomes more attractive if the table is big and the appended column is small.

More on index-only scans in my answer to your last question.
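The main precondition is a mostly up-to-date visibility map for the table. If autovacuum has not caught up yet, a manual run helps; afterwards the plan should show an Index Only Scan:

```sql
VACUUM ANALYZE traitements.sites_candidats;  -- refresh visibility map and statistics

EXPLAIN
SELECT pkid, geom
FROM traitements.sites_candidats
WHERE statut = 'selected';  -- can now be answered from the partial index alone
```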

Don't expect too much: retrieving 210708 rows (according to your EXPLAIN output) will still take some time. If you don't actually need all of them, consider LIMIT (combined with ORDER BY) or additional predicates to fetch fewer rows and make the query much faster.
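For example, if the caller only needs a first page of results, something along these lines fetches far fewer rows (the sort key is just an assumption, pick whatever ordering makes sense for your application):

```sql
SELECT s.pkid AS pkid_site, s.geom AS geom_site,
       z.pkid AS pkidEmprise, z.geom AS geom_emprise, z.précision_contour
FROM traitements.sites_candidats AS s
JOIN traitements.zones_sites AS z ON z.pkid_site = s.pkid
WHERE s.statut = 'selected'
AND z.statut = 'selected'
ORDER BY s.pkid   -- assumed sort key
LIMIT 1000;       -- illustrative page size
```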

Asides:

I would not use accented letters in identifiers (like: précision_contour). While that works, it introduces potential encoding problems.

Why different data types for statut (varchar(255) vs. varchar(100))? Those should probably be the same. If only a handful of states are possible, make it an enum or a tiny FK column pointing to a statut table, to make everything smaller, faster, and less error-prone.
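A minimal sketch of the enum variant, assuming 'selected' and 'rejected' are the only states that occur (adjust the value list to whatever your data actually contains, the cast fails on any other value):

```sql
CREATE TYPE statut_type AS ENUM ('selected', 'rejected');  -- assumed value list

ALTER TABLE traitements.zones_sites
    ALTER COLUMN statut TYPE statut_type
    USING statut::statut_type;

ALTER TABLE traitements.sites_candidats
    ALTER COLUMN statut TYPE statut_type
    USING statut::statut_type;
```

Indexes on statut are rebuilt automatically by ALTER COLUMN ... TYPE, and the enum occupies 4 bytes per row instead of a variable-length string.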

You had Postgres 9.5 in your last question yesterday, but Postgres 9.6 today. Did you just upgrade? If so, did you run ANALYZE after the upgrade? Table statistics are not carried over.
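If that is the case, collecting fresh statistics for the whole database is a one-liner:

```sql
ANALYZE VERBOSE;  -- regenerate planner statistics for every table in the current database
```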