Postgresql – Group By primary key or DISTINCT increase query time over 1000x with limit

group byperformancepostgresqlpostgresql-9.2

Also see https://stackoverflow.com/questions/17741167/hibernate-jpa-improve-performance-of-distinct-query but I realized this is mainly a PostgreSQL issue.

My application uses a 3rd party extension to PostgreSQL to for searching chemical structures. This is in general slow. I can not change the SQL directly as the application uses hibernate and native query is not an option.

I have a many-to-many relationship and the "Link-table" has an additional column. Basically I have a Mixture that consists of elements and an element occurs in the mixture at a certain percentage. favorite

I use Spring-Data JPA with QueryDSL, hibernate and PostgreSQL. I have a query with 2 Joins It's a many too many with a link-table that has additional columns. Bascially I have a Mixture that consists of elements and an element occurs in the mixture at a certain percentage.

I'm now searching all Mixtures that contain an element matching the given criteria. Because a mixture can have multiple elements that match the criteria, the query may return the same entity multiple times. I want to prevent that hence DISTINCT or GROUP BY primary key.

The query is also paged meaning it uses limit and offset. The query runs perfectly fine without either distinct or group by but then I can get duplicate rows. If I add either group by or distinct query is over 1000 times slower.

Query with DISTINCT (note SQL from hibernate):

select distinct
    simplecomp0_.chem_compound_id as chem1_0_, 
    --snipped about 10 more columns all short varchar or date fields
    from simple_compound simplecomp0_ 
    inner join compound_composition compositio1_ 
              on simplecomp0_.chem_compound_id=compositio1_.chem_compound_id 
    inner join chemical_structure chemicalst2_ 
              on compositio1_.chemical_structure_id=chemicalst2_.structure_id 
where 
    chemicalst2_.structure_id  @ ('CCNc1ccccc1', '')::bingo.sub
limit 5 
offset 5

EXPLAIN ANALYZE with DISTINCT:

"Limit  (cost=5984.58..5984.63 rows=5 width=1645) (actual time=6342.541..6342.543 rows=5 loops=1)"
"  ->  HashAggregate  (cost=5984.53..5989.79 rows=526 width=1645) (actual time=6342.538..6342.542 rows=10 loops=1)"
"        ->  Nested Loop  (cost=0.00..5971.38 rows=526 width=1645) (actual time=7.289..6166.512 rows=128527 loops=1)"
"              ->  Nested Loop  (cost=0.00..4445.81 rows=526 width=8) (actual time=7.281..5694.663 rows=128527 loops=1)"
"                    ->  Index Scan using idx_chemical_structure on chemical_structure chemicalst2_  (cost=0.00..20.26 rows=526 width=8) (actual time=7.262..5013.620 rows=128508 loops=1)"
"                          Index Cond: (chemical_structure @ ROW('CCNc1ccccc1'::text, ''::text)::bingo.sub)"
"                    ->  Index Only Scan using compound_composition_pkey on compound_composition compositio1_  (cost=0.00..8.40 rows=1 width=16) (actual time=0.003..0.004 rows=1 loops=128508)"
"                          Index Cond: (chemical_structure_id = chemicalst2_.structure_id)"
"                          Heap Fetches: 128527"
"              ->  Index Scan using idx_pk on simple_compound simplecomp0_  (cost=0.00..2.89 rows=1 width=1645) (actual time=0.002..0.003 rows=1 loops=128527)"
"                    Index Cond: (chem_compound_id = compositio1_.chem_compound_id)"
"Total runtime: 6344.584 ms"

also http://explain.depesz.com/s/ocC

The long time is caused by searching the 3rd party index for chemical structure search. For some reason the whole indexed is searched.

If the distinct is removed, limit and offset are correctly applied to the 3rd part index and query is fast:

EXPLAIN ANALYZE:

"Limit  (cost=56.76..113.52 rows=5 width=1645) (actual time=11.135..11.472 rows=5 loops=1)"
"  ->  Nested Loop  (cost=0.00..5971.38 rows=526 width=1645) (actual time=10.783..11.469 rows=10 loops=1)"
"        ->  Nested Loop  (cost=0.00..4445.81 rows=526 width=8) (actual time=10.774..11.406 rows=10 loops=1)"
"              ->  Index Scan using idx_chemical_structure on chemical_structure chemicalst2_  (cost=0.00..20.26 rows=526 width=8) (actual time=10.755..11.313 rows=10 loops=1)"
"                    Index Cond: (chemical_structure @ ROW('CCNc1ccccc1'::text, ''::text)::bingo.sub)"
"              ->  Index Only Scan using compound_composition_pkey on compound_composition compositio1_  (cost=0.00..8.40 rows=1 width=16) (actual time=0.006..0.006 rows=1 loops=10)"
"                    Index Cond: (chemical_structure_id = chemicalst2_.structure_id)"
"                    Heap Fetches: 10"
"        ->  Index Scan using simple_compound_pkey on simple_compound simplecomp0_  (cost=0.00..2.89 rows=1 width=1645) (actual time=0.004..0.004 rows=1 loops=10)"
"              Index Cond: (chem_compound_id = compositio1_.chem_compound_id)"
"Total runtime: 12.052 ms"

Also http://explain.depesz.com/s/gU2

Is there anyway I can tune PostgreSQL to apply the 3rd party index only according to the limit and offset clause when using distinct?

EDIT:

After further thinking about the issue I came to the conclusion that there is no solution that I can implement. With group by or distinct the whole query obviously must be run regardless of limit clause. And if the whole query is run the 3rd party index must be used an that takes time (without that index such a search would take minutes not seconds).

Now about statistics. Here a quote from supplier:

The cost for structure index for operator @ is underestimated and hard coded, for Postres > to use index almost in all cases. Again, as I mentioned, the cost-estimation function is > not implemented yet (we'll let you know when it's ready).

Best Answer

First off, LIMIT / OFFSET without ORDER BY are of limited usefulness, since the order is arbitrary and can change any time (when VACUUM runs or when the table is manipulated in at any way). It is only somewhat reliable with read-only tables. That's fine if you don't care which rows you get back, but it may break paging.

You may be able to solve your conundrum with the good old EXISTS. PostgreSQL can stop searching for more hits as soon as the first is found - as opposed to your query with DISTINCT, where it tries to collect all matches.

It's hard to be more specific without knowing the table structure, cardinalities, index definitions and what's behind your peculiar WHERE expression. But this might just do it:

SELECT s.chem_compound_id AS chem1_0_ 
       -- 10 more columns all short varchar or date fields
FROM   simple_compound s
WHERE  EXISTS (
   SELECT 1
   FROM   compound_composition sc
   JOIN   chemical_structure   c  ON sc.chemical_structure_id = c.structure_id 
   WHERE  c.structure_id  @ ('CCNc1ccccc1', '')::bingo.sub
   AND    sc.chem_compound_id = s.chem_compound_id
   )
LIMIT  5 
OFFSET 5

This also assumes you are only interested in columns from simple_compound in the output.