Postgresql – Index not being used in PostgreSQL

indexpostgresql

I have three tables, call them table1, table2 and table3. table1 has a foreign key column referencing table2,
and table2 has a column referencing table3. Both foreign key columns have a B-tree index over just that column;
in addition, table1 has an additional composite index over two columns: (table2_fk, another_column). The tables also have
numerous other indexes on their other columns (table1 has five other indexes, each over a single column).

table1 has 67 million entries, table2 has 7400.

The problematic query is

(a) SELECT * 
    FROM table1 
    WHERE table2_fk IN (SELECT id FROM table2 WHERE table3_fk = 3);

This for some reason runs for several minutes. The two component queries,

(b) SELECT id FROM table2 WHERE table3_fk = 3;
(c) SELECT * FROM table1 WHERE table2_fk IN (13, 14);

both complete instantly. The query (b) returns two IDs (13 and 14). The queries (a) and (c) return 16 rows.

Results of query explainer (with table/column/index names replaced with the above-mentioned aliases) look as follows (edit: added EXPLAIN (ANALYZE, BUFFERS))

db=# explain (ANALYZE, BUFFERS) select * FROM table1 WHERE table2_fk IN (13, 14);
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on table1  (cost=1869.86..313991.96 rows=98673 width=193) (actual time=0.033..10.826 rows=18 loops=1)
   Recheck Cond: (table2_fk = ANY ('{13,14}'::integer[]))
   Heap Blocks: exact=6
   Buffers: shared hit=9 read=5
   I/O Timings: read=10.614
   ->  Bitmap Index Scan on table1_table2_fk_idx  (cost=0.00..1845.19 rows=98673 width=0) (actual time=0.016..0.016 rows=18 loops=1)
         Index Cond: (table2_fk = ANY ('{13,14}'::integer[]))
         Buffers: shared hit=8
 Planning time: 49.277 ms
 Execution time: 10.868 ms
(10 rows)


db=# explain (ANALYZE, BUFFERS) select * FROM table1 WHERE table2_fk IN (SELECT id FROM table2 WHERE table3_fk = 3);
                                                                     QUERY PLAN                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=15.16..2417375.77 rows=2153302 width=193) (actual time=39974.333..127352.240 rows=18 loops=1)
   Hash Cond: (table1.table2_fk = table2.id)
   Buffers: shared hit=74 read=1536993
   I/O Timings: read=99114.790
   ->  Seq Scan on table1  (cost=0.00..2215354.00 rows=67829000 width=193) (actual time=0.005..119121.005 rows=67398453 loops=1)
         Buffers: shared hit=71 read=1536993
         I/O Timings: read=99114.790
   ->  Hash  (cost=15.11..15.11 rows=4 width=4) (actual time=0.009..0.009 rows=2 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         Buffers: shared hit=3
         ->  Index Scan using table2_table3_fk_idx on table2  (cost=0.28..15.11 rows=4 width=4) (actual time=0.006..0.007 rows=2 loops=1)
               Index Cond: (table3_fk = 3)
               Buffers: shared hit=3
 Planning time: 0.441 ms
 Execution time: 127352.285 ms
(15 rows)

I.e query (c) uses the index on table2_fk, but query (a) for some reason does not, performing an
inefficient sequential scan instead. What could be causing this?

The same database structure exists on many computers (some of which have even more entries in table1),
but query (a) is slow only on one machine.

I have tried re-indexing and vacuum/analyze with no help. The behavior didn't change when selecting
a single column instead of "select *", or when re-writing (a) using JOIN.

Best Answer

Try adding an "OFFSET 0" in the subquery like this:

SELECT * 
FROM table1 
WHERE table2_fk IN (SELECT id FROM table2 WHERE table3_fk = 3 OFFSET 0);