Postgresql – Optimising a query to get rows where x is between pairs of values from another table

optimizationperformancepostgresqlpostgresql-9.4query-performance

PostgreSQL 9.4.5

In my database there are 2 tables, vcfentries and exons.

I want to query for rows in vcfentries where pos is between a list of pairs of positions, exonstart and exonend in the table exons.

Result of the query:
select * from exons where exons.genename = 'NM_001301824'

exons table

Total query runtime: 11 ms.
8 rows retrieved.

This is the equivalent of pulling up the results for a single exon without any joins across tables:

Select pos,chrom,alt,ref from vcfentries 
where chrom = '1' 
and pos > 33546679  
and pos < 33547159

Total query runtime: 11 ms.
89 rows retrieved.

Currently these are the indexes I have on the tables

Table indexes
Table indexes

Querying for a specific exon is efficient:

select vcfentries.pos,vcfentries.chrom,vcfentries.alt,vcfentries.ref,exons.exonnumber from exons , vcfentries
where vcfentries.pos BETWEEN  exons.exonstart
and exons.exonend
and exons.genename = 'NM_001301824'
and exonnumber = 6
and vcfentries.chrom = exons.chrom

Total query runtime: 72 ms.
137 rows retrieved.

"Nested Loop  (cost=438468.31..2324377.64 rows=809587 width=24)"
"  ->  Index Only Scan using exonspkey on exons  (cost=0.42..8.44 rows=1 width=26)"
"        Index Cond: ((genename = 'NM_001301824'::text) AND (exonnumber = 8))"
"  ->  Bitmap Heap Scan on vcfentries  (cost=438467.89..2317823.87 rows=654532 width=16)"
"        Recheck Cond: ((pos >= exons.exonstart) AND (pos <= exons.exonend) AND (chrom = exons.chrom))"
"        ->  Bitmap Index Scan on vcfentries_pos_chrom_idx  (cost=0.00..438304.26 rows=654532 width=0)"
"              Index Cond: ((pos >= exons.exonstart) AND (pos <= exons.exonend) AND (chrom = exons.chrom))"

The performance goes down the toilet when querying for them all. Suddenly it jumps into the minutes range:

select * from exons , vcfentries
where vcfentries.pos BETWEEN  exons.exonstart
and exons.exonend
and exons.genename = 'NM_001037501'
and vcfentries.chrom = exons.chrom

Total query runtime: 325389 ms.
2331 rows retrieved.

"Hash Join  (cost=58.73..11528494.14 rows=11334216 width=24)"
"  Output: vcfentries.pos, vcfentries.chrom, vcfentries.alt, vcfentries.ref, exons.exonnumber"
"  Hash Cond: (vcfentries.chrom = exons.chrom)"
"  Join Filter: ((vcfentries.pos >= exons.exonstart) AND (vcfentries.pos <= exons.exonend))"
"  ->  Seq Scan on coeus.vcfentries  (cost=0.00..7170736.76 rows=141378976 width=16)"
"        Output: vcfentries.pos, vcfentries.chrom, vcfentries.alt, vcfentries.ref, vcfentries.analysisid, vcfentries.filter, vcfentries.info_ac, vcfentries.info_af, vcfentries.info_an, vcfentries.info_baseqranksum, vcfentries.info_clippingranksum, vcfentrie (...)"
"  ->  Hash  (cost=58.56..58.56 rows=14 width=26)"
"        Output: exons.exonnumber, exons.exonstart, exons.exonend, exons.chrom"
"        ->  Bitmap Heap Scan on coeus.exons  (cost=4.53..58.56 rows=14 width=26)"
"              Output: exons.exonnumber, exons.exonstart, exons.exonend, exons.chrom"
"              Recheck Cond: (exons.genename = 'NM_001301824'::text)"
"              ->  Bitmap Index Scan on exons_genename_idx  (cost=0.00..4.53 rows=14 width=0)"
"                    Index Cond: (exons.genename = 'NM_001301824'::text)"

The exons table is modest in size with only about 400K rows.

The vcfentries table is large with a few hundred million rows but is acceptably fast to query with indexes.

I'm having trouble optimizing this query. When I run explain on my attempts to use explicit joins I get the same execution plan.

Any ideas why it's creating such a bad execution plan and any suggested fixes or better queries?

Best Answer

I'd suggest using a CTE in this instance. Something that looks like this:

with exons (genename, exonnumber, crom, exonstart, exonend, padding, direction) as
    (select genename, exonnumber, crom, exonstart, exonend, padding, direction
     from exons
     where enename = 'NM_001037501')
select *
from vcfentries vcf
join exons exo
using (chrom)
where vcf.pos between exo.exonstart and exo.exonend;