Postgresql – Postgres stops using a (btree_gin) index when select clause includes a column not part of the index

I am seeing some strange behavior with the btree_gin extension

I have the following table

create table my_table(id INTEGER, a TEXT, b text);

CREATE EXTENSION btree_gin;

CREATE INDEX idx ON my_table USING gin (id, a);

If I run the query :

select t.id, t.a 
from my_table t 
where t.id = :someNumber and t.a like 'someString'

the index gets used.

If I add another column in the select clause that is not part of the index, the index is no longer used, ex:

select t.id, t.a, t.b 
from my_table t 
where t.id = :someNumber and t.a like 'someString'

I'm using PostgreSQL 9.6.3

One thing that is very strange, is that the index does get used if I join the table to anotherone, and add a col from the other table in the select, ex:

select t.id, o.a_col_from_other_table
from my_table t 
join another_table o on t.id = o.the_fk
where t.id = :someNumber and t.a like 'someString'

However, if the join clause uses a column not part of the index, it will will resort to seq scan.

Update 1

For some reason, the query optimizer does not even use the btree_gin anymore. The only way I can get it to use the (btree_gin) index is with

SET enable_seqscan = OFF;. It turns out that even by using this index, it is quite slow, so the optimizer is probably correct in not using it, but the question is then why is it slow ? I've used indexes for like query and they work fine, in this case there is not even a wild card.

the real query is :

 EXPLAIN
    SELECT
      b.transcript_translation_accession
    FROM r_prot_search b
    WHERE
        b.transcript_translation_accession like 'IP_304743.1'
        and b.assembly_id = 12  
        and b.openprot_release_id = 28
    limit 30;

The index:

CREATE INDEX idx_r_prot_search2_transcript_translation_accession3 
ON r_prot_search USING gin (openprot_release_id, transcript_translation_accession);

and the plan :

"QUERY PLAN"
"Limit  (cost=27633.94..593461.27 rows=30 width=13)"
"  ->  Bitmap Heap Scan on r_prot_search b  (cost=27633.94..2366386.90 rows=124 width=13)"
"        Recheck Cond: (openprot_release_id = 28)"
"        Filter: (((transcript_translation_accession)::text ~~ 'IP_304743.1'::text) AND (assembly_id = 12))"
"        ->  Bitmap Index Scan on idx_r_prot_search2_transcript_translation_accession3  (cost=0.00..27633.91 rows=3079721 width=0)"
"              Index Cond: (openprot_release_id = 28)"

I should probably add that the query returns 5 rows, and the table has 10 millions rows.

update 2

Here's an EXPLAIN ANALYZE, with SET enable_seqscan = ON:

"QUERY PLAN"
"Limit  (cost=1000.00..537006.24 rows=30 width=13) (actual time=15164.523..15610.792 rows=6 loops=1)"
"  ->  Gather  (cost=1000.00..2216492.44 rows=124 width=13) (actual time=15164.521..15610.788 rows=6 loops=1)"
"        Workers Planned: 4"
"        Workers Launched: 4"
"        ->  Parallel Seq Scan on r_prot_search b  (cost=0.00..2215480.04 rows=31 width=13) (actual time=15514.285..15603.515 rows=1 loops=5)"
"              Filter: (((transcript_translation_accession)::text ~~ 'IP_304743.1'::text) AND (assembly_id = 12) AND (openprot_release_id = 28))"
"              Rows Removed by Filter: 1948958"
"Planning time: 0.370 ms"
"Execution time: 15628.514 ms"

Now with SET enable_seqscan = OFF;

"QUERY PLAN"
"Limit  (cost=27633.94..593461.27 rows=30 width=13) (actual time=3406.041..8350.549 rows=6 loops=1)"
"  ->  Bitmap Heap Scan on r_prot_search b  (cost=27633.94..2366386.90 rows=124 width=13) (actual time=3406.038..8350.544 rows=6 loops=1)"
"        Recheck Cond: (openprot_release_id = 28)"
"        Filter: (((transcript_translation_accession)::text ~~ 'IP_304743.1'::text) AND (assembly_id = 12))"
"        Rows Removed by Filter: 3228843"
"        Heap Blocks: exact=320353"
"        ->  Bitmap Index Scan on idx_r_prot_search2_transcript_translation_accession3  (cost=0.00..27633.91 rows=3079721 width=0) (actual time=640.817..640.817 rows=3228849 loops=1)"
"              Index Cond: (openprot_release_id = 28)"
"Planning time: 0.394 ms"
"Execution time: 8350.618 ms"

Best Answer

As jjanes points out, I may not be right here if you've only got a GIN index. I'm waiting for the plan before I revise

I don't believe the index "is no longer used" is correct. In the first query, you're selecting ONLY the columns indexed. That means you can potentially use an "index-only scan". In the second query, you're selecting columns not in the index. That means you CAN NOT use an "index-only scan". That means you have to visit the table on disk.

On the basis of that, in the second case, whether or not the index is used is dependent on the selectivity estimates. We would need the EXPLAIN ANALYZE of the two queries to further assist in that. We have two options:

I imagine the index is used, however the rows need to be retrieved from disk and thus it is slower than the first query. This is what is likely happening.
PostgreSQL assess a higher cost to visiting the table AND the index, so it just visits the table and does a seq scan. Usually this only happens on small tables.

Update 1

update 2

Best Answer

Related Solutions

Postgresql – Optimizing ORDER BY in a full text search query

Postgresql – How to index WHERE (start_date >= ‘2013-12-15’)

Related Question