Postgresql – Index Only Scan performs better with same input but provided manually in query

performancepostgresqlpostgresql-performance

(Postgres 9.6)

I had a little trouble titling this as I'm still not entirely sure how best to describe the issue. Here goes nothing.

I'm performing quite a basic lookup between two large tables, using the results from the first table to filter from the second table. Here's an example of the query I'm performing:

SELECT
    table1.id AS id
FROM table1, table2
WHERE
    ST_Intersects(ST_Transform(ST_SetSRID(table2.the_geom, 2768), 4326), ST_SetSRID(ST_MakeBox2D(ST_Point(-122.2119140625,38.013476231041935), ST_Point(-122.18994140625,38.03078569382294)), 4326))
    AND table1.id = table2.id

And here is the analysis of the query:

Hash Join  (cost=9.12..60597.37 rows=1 width=4) (actual time=674.453..1890.932 rows=25 loops=1)
  Hash Cond: ((table1.id)::double precision = table2.id)
  Buffers: shared hit=5393
  ->  Index Only Scan using table1_id_key on table1  (cost=0.43..50807.61 rows=1956212 width=4) (actual time=0.049..948.074 rows=1956212 loops=1)
        Heap Fetches: 0
        Buffers: shared hit=5350
  ->  Hash  (cost=8.69..8.69 rows=1 width=8) (actual time=3.475..3.475 rows=25 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        Buffers: shared hit=43
        ->  Index Scan using table2_expr_idx_geom on table2  (cost=0.41..8.69 rows=1 width=8) (actual time=0.905..3.444 rows=25 loops=1)
              Index Cond: (st_transform(st_setsrid(the_geom, 2768), 4326) && '0103000020E6100000010000000500000000000000908D5EC09ACCD196B901434000000000908D5EC0BD131EC9F003434000000000288C5EC0BD131EC9F003434000000000288C5EC09ACCD196B901434000000000908D5EC09ACCD196B9014340'::geometry)
              Filter: _st_intersects(st_transform(st_setsrid(the_geom, 2768), 4326), '0103000020E6100000010000000500000000000000908D5EC09ACCD196B901434000000000908D5EC0BD131EC9F003434000000000288C5EC0BD131EC9F003434000000000288C5EC09ACCD196B901434000000000908D5EC09ACCD196B9014340'::geometry)
              Rows Removed by Filter: 3
              Buffers: shared hit=43
Planning time: 0.736 ms
Execution time: 1891.037 ms

So a execution time of around 2 seconds and it returns 25 rows. However, if I run the first query separately:

SELECT id
FROM parcel
WHERE
    ST_Intersects(ST_Transform(ST_SetSRID(the_geom, 2768), 4326), ST_SetSRID(ST_MakeBox2D(ST_Point(-122.2119140625,38.013476231041935), ST_Point(-122.18994140625,38.03078569382294)), 4326))

We can see this isn't the bottleneck:

Index Scan using table2_expr_idx_geom on table2  (cost=0.41..8.69 rows=1 width=8) (actual time=0.861..3.259 rows=25 loops=1)
  Index Cond: (st_transform(st_setsrid(the_geom, 2768), 4326) && '0103000020E6100000010000000500000000000000908D5EC09ACCD196B901434000000000908D5EC0BD131EC9F003434000000000288C5EC0BD131EC9F003434000000000288C5EC09ACCD196B901434000000000908D5EC09ACCD196B9014340'::geometry)
  Filter: _st_intersects(st_transform(st_setsrid(the_geom, 2768), 4326), '0103000020E6100000010000000500000000000000908D5EC09ACCD196B901434000000000908D5EC0BD131EC9F003434000000000288C5EC0BD131EC9F003434000000000288C5EC09ACCD196B901434000000000908D5EC09ACCD196B9014340'::geometry)
  Rows Removed by Filter: 3
  Buffers: shared hit=43
Planning time: 0.494 ms
Execution time: 3.326 ms

And if we take the results of that query (25 rows) and manually input them into a similar query to the original one:

SELECT id FROM table1 WHERE id = ANY (ARRAY[658871,668445,661415,659378,658834,662874,668801,668477,663881,651950,665593,663531,659654,659096,648320,648273,647054,656322,649041,656368,647186,644004,647752,658653,664536])

We end up with a much faster query!

Index Only Scan using table1_id_key on table1  (cost=0.43..111.12 rows=25 width=4) (actual time=0.021..0.114 rows=25 loops=1)
  Index Cond: (id = ANY ('{658871,668445,661415,659378,658834,662874,668801,668477,663881,651950,665593,663531,659654,659096,648320,648273,647054,656322,649041,656368,647186,644004,647752,658653,664536}'::integer[]))
  Heap Fetches: 0
  Buffers: shared hit=76
Planning time: 0.153 ms
Execution time: 0.149 ms

I've tried many permutations of the original query but they are all much slower (around 2 seconds) unless I manually define the 25 values returned from the WHERE condition of the first query (as in the second, quicker example).

From what I can tell the first query is slow because it's going through every single row in the table on the Index Only Scan (all 1,956,212 rows) whereas the second query is only going through 25 rows.

How can I optimise the original query to be as fast as if I performed the 2 queries separately? There's theoretically nothing different about the process. As things stand it's actually quicker for me to perform 2 queries and use the results of the first query as input into the second, which seems inefficient. It's as if the first query is unable to detect that only 25 rows were returned from one table for lookup in the other table.

Any help is appreciated!

Best Answer

My guess is that this is the problem

Hash Cond: ((table1.id)::double precision = table2.id)

Your query is casting the table1.id to something with a fractional portion. This likely throws off planner estimates. Is id an int or a double? Try removing your cast.

Also, don't write SQL-89:

SELECT id
FROM table1
INNER JOIN table2 USING (id)
WHERE ST_Intersects(
  ST_Transform(ST_SetSRID(table2.the_geom, 2768), 4326), 
  ST_SetSRID(
    ST_MakeBox2D(
      ST_Point(-122.2119140625,38.013476231041935), 
      ST_Point(-122.18994140625,38.03078569382294)
    ),
    4326
  )
);