Also see https://stackoverflow.com/questions/17741167/hibernate-jpa-improve-performance-of-distinct-query but I realized this is mainly a PostgreSQL issue.
My application uses a 3rd party extension to PostgreSQL to for searching chemical structures. This is in general slow. I can not change the SQL directly as the application uses hibernate and native query is not an option.
I have a many-to-many relationship and the "Link-table" has an additional column. Basically I have a Mixture that consists of elements and an element occurs in the mixture at a certain percentage. favorite
I use Spring-Data JPA with QueryDSL, hibernate and PostgreSQL. I have a query with 2 Joins It's a many too many with a link-table that has additional columns. Bascially I have a Mixture that consists of elements and an element occurs in the mixture at a certain percentage.
I'm now searching all Mixtures that contain an element matching the given criteria. Because a mixture can have multiple elements that match the criteria, the query may return the same entity multiple times. I want to prevent that hence DISTINCT or GROUP BY primary key.
The query is also paged meaning it uses limit and offset. The query runs perfectly fine without either distinct or group by but then I can get duplicate rows. If I add either group by or distinct query is over 1000 times slower.
Query with DISTINCT (note SQL from hibernate):
select distinct
simplecomp0_.chem_compound_id as chem1_0_,
--snipped about 10 more columns all short varchar or date fields
from simple_compound simplecomp0_
inner join compound_composition compositio1_
on simplecomp0_.chem_compound_id=compositio1_.chem_compound_id
inner join chemical_structure chemicalst2_
on compositio1_.chemical_structure_id=chemicalst2_.structure_id
where
chemicalst2_.structure_id @ ('CCNc1ccccc1', '')::bingo.sub
limit 5
offset 5
EXPLAIN ANALYZE with DISTINCT:
"Limit (cost=5984.58..5984.63 rows=5 width=1645) (actual time=6342.541..6342.543 rows=5 loops=1)"
" -> HashAggregate (cost=5984.53..5989.79 rows=526 width=1645) (actual time=6342.538..6342.542 rows=10 loops=1)"
" -> Nested Loop (cost=0.00..5971.38 rows=526 width=1645) (actual time=7.289..6166.512 rows=128527 loops=1)"
" -> Nested Loop (cost=0.00..4445.81 rows=526 width=8) (actual time=7.281..5694.663 rows=128527 loops=1)"
" -> Index Scan using idx_chemical_structure on chemical_structure chemicalst2_ (cost=0.00..20.26 rows=526 width=8) (actual time=7.262..5013.620 rows=128508 loops=1)"
" Index Cond: (chemical_structure @ ROW('CCNc1ccccc1'::text, ''::text)::bingo.sub)"
" -> Index Only Scan using compound_composition_pkey on compound_composition compositio1_ (cost=0.00..8.40 rows=1 width=16) (actual time=0.003..0.004 rows=1 loops=128508)"
" Index Cond: (chemical_structure_id = chemicalst2_.structure_id)"
" Heap Fetches: 128527"
" -> Index Scan using idx_pk on simple_compound simplecomp0_ (cost=0.00..2.89 rows=1 width=1645) (actual time=0.002..0.003 rows=1 loops=128527)"
" Index Cond: (chem_compound_id = compositio1_.chem_compound_id)"
"Total runtime: 6344.584 ms"
also http://explain.depesz.com/s/ocC
The long time is caused by searching the 3rd party index for chemical structure search. For some reason the whole indexed is searched.
If the distinct is removed, limit and offset are correctly applied to the 3rd part index and query is fast:
EXPLAIN ANALYZE:
"Limit (cost=56.76..113.52 rows=5 width=1645) (actual time=11.135..11.472 rows=5 loops=1)"
" -> Nested Loop (cost=0.00..5971.38 rows=526 width=1645) (actual time=10.783..11.469 rows=10 loops=1)"
" -> Nested Loop (cost=0.00..4445.81 rows=526 width=8) (actual time=10.774..11.406 rows=10 loops=1)"
" -> Index Scan using idx_chemical_structure on chemical_structure chemicalst2_ (cost=0.00..20.26 rows=526 width=8) (actual time=10.755..11.313 rows=10 loops=1)"
" Index Cond: (chemical_structure @ ROW('CCNc1ccccc1'::text, ''::text)::bingo.sub)"
" -> Index Only Scan using compound_composition_pkey on compound_composition compositio1_ (cost=0.00..8.40 rows=1 width=16) (actual time=0.006..0.006 rows=1 loops=10)"
" Index Cond: (chemical_structure_id = chemicalst2_.structure_id)"
" Heap Fetches: 10"
" -> Index Scan using simple_compound_pkey on simple_compound simplecomp0_ (cost=0.00..2.89 rows=1 width=1645) (actual time=0.004..0.004 rows=1 loops=10)"
" Index Cond: (chem_compound_id = compositio1_.chem_compound_id)"
"Total runtime: 12.052 ms"
Also http://explain.depesz.com/s/gU2
Is there anyway I can tune PostgreSQL to apply the 3rd party index only according to the limit and offset clause when using distinct?
EDIT:
After further thinking about the issue I came to the conclusion that there is no solution that I can implement. With group by or distinct the whole query obviously must be run regardless of limit clause. And if the whole query is run the 3rd party index must be used an that takes time (without that index such a search would take minutes not seconds).
Now about statistics. Here a quote from supplier:
The cost for structure index for operator @ is underestimated and hard coded, for Postres > to use index almost in all cases. Again, as I mentioned, the cost-estimation function is > not implemented yet (we'll let you know when it's ready).
Best Answer
First off,
LIMIT
/OFFSET
withoutORDER BY
are of limited usefulness, since the order is arbitrary and can change any time (whenVACUUM
runs or when the table is manipulated in at any way). It is only somewhat reliable with read-only tables. That's fine if you don't care which rows you get back, but it may break paging.You may be able to solve your conundrum with the good old
EXISTS
. PostgreSQL can stop searching for more hits as soon as the first is found - as opposed to your query withDISTINCT
, where it tries to collect all matches.It's hard to be more specific without knowing the table structure, cardinalities, index definitions and what's behind your peculiar
WHERE
expression. But this might just do it:This also assumes you are only interested in columns from
simple_compound
in the output.