PostgreSQL Performance Optimization – How to Speed Up SELECT DISTINCT

amazon-rdsoptimizationperformancepostgresqlpostgresql-9.3query-performance

I have a simple select distinct on some time series data:

SELECT DISTINCT user_id
FROM events
WHERE project_id = 6
AND time > '2015-01-11 8:00:00'
AND time < '2015-02-10 8:00:00';

And it takes 112 seconds. Here's the query plan:

http://explain.depesz.com/s/NTyA

My application has to preform a lot of distinct operations and counts like this. Is there a faster way to get this kind of data?

Best Answer

You probably don't want to hear this, but the best option to speed up SELECT DISTINCT is to avoid DISTINCT to begin with. In many cases (not all!) it can be avoided with better database-design or better queries.

Sometimes, GROUP BY is faster, because it takes a different code path.

In your particular case, it doesn't seem like you can get rid of DISTINCT (well, see blow). But you can support the query with a special index if you have many queries of that kind:

CREATE INDEX foo ON events (project_id, "time", user_id);

Adding user_id is only useful if you get index-only scans out of this. Follow the link for details. Would remove the expensive ~~Bitmap Heap Scan~~ from your query plan, which consumes 90% of the query time.

Your EXPLAIN shows 2,491 distinct users out of half a million qualifying rows. This won't become super-fast, no matter what you do, but it can be substantially faster. With around 200 rows per user, emulating an index skip scan on above index might pay. The range condition on time complicates matters, and 200 rows per user is still a moderate number. So not sure. See:

Either way, if time intervals in your queries are always the same, a MATERIALIZED VIEW folding user_id per (project_id, <fixed time interval>) would go a long way. No chance there with varying time intervals, though. Maybe you could at least fold users per hour or some other minimum time unit, and that would buy enough performance to warrant the considerable overhead. Can be combined with either query style.

Nitpick:
Most probably, the predicates on "time" should really be:

AND "time" >= '2015-01-11 8:00:00'
AND "time" <  '2015-02-10 8:00:00';

Aside:
Don't use time as identifier. It's a reserved word in standard SQL and a basic type in Postgres.

Related Solutions

Oracle distinct doing sort

Can you add indexes? I would first try adding an index on (col, userid).

10 seconds looks too much for a 30M rows table.

Postgresql – Group By primary key or DISTINCT increase query time over 1000x with limit

First off, LIMIT / OFFSET without ORDER BY are of limited usefulness, since the order is arbitrary and can change any time (when VACUUM runs or when the table is manipulated in at any way). It is only somewhat reliable with read-only tables. That's fine if you don't care which rows you get back, but it may break paging.

You may be able to solve your conundrum with the good old EXISTS. PostgreSQL can stop searching for more hits as soon as the first is found - as opposed to your query with DISTINCT, where it tries to collect all matches.

It's hard to be more specific without knowing the table structure, cardinalities, index definitions and what's behind your peculiar WHERE expression. But this might just do it:

SELECT s.chem_compound_id AS chem1_0_ 
       -- 10 more columns all short varchar or date fields
FROM   simple_compound s
WHERE  EXISTS (
   SELECT 1
   FROM   compound_composition sc
   JOIN   chemical_structure   c  ON sc.chemical_structure_id = c.structure_id 
   WHERE  c.structure_id  @ ('CCNc1ccccc1', '')::bingo.sub
   AND    sc.chem_compound_id = s.chem_compound_id
   )
LIMIT  5 
OFFSET 5

This also assumes you are only interested in columns from simple_compound in the output.

Best Answer

Related Solutions

Oracle distinct doing sort

Postgresql – Group By primary key or DISTINCT increase query time over 1000x with limit

Related Question