PostgreSQL Performance Optimization – How to Speed Up SELECT DISTINCT

amazon-rdsoptimizationperformancepostgresqlpostgresql-9.3query-performance

I have a simple select distinct on some time series data:

SELECT DISTINCT user_id
FROM events
WHERE project_id = 6
AND time > '2015-01-11 8:00:00'
AND time < '2015-02-10 8:00:00';

And it takes 112 seconds. Here's the query plan:

http://explain.depesz.com/s/NTyA

My application has to preform a lot of distinct operations and counts like this. Is there a faster way to get this kind of data?

Best Answer

You probably don't want to hear this, but the best option to speed up SELECT DISTINCT is to avoid DISTINCT to begin with. In many cases (not all!) it can be avoided with better database-design or better queries.

Sometimes, GROUP BY is faster, because it takes a different code path.

In your particular case, it doesn't seem like you can get rid of DISTINCT (well, see blow). But you can support the query with a special index if you have many queries of that kind:

CREATE INDEX foo ON events (project_id, "time", user_id);

Adding user_id is only useful if you get index-only scans out of this. Follow the link for details. Would remove the expensive Bitmap Heap Scan from your query plan, which consumes 90% of the query time.

Your EXPLAIN shows 2,491 distinct users out of half a million qualifying rows. This won't become super-fast, no matter what you do, but it can be substantially faster. With around 200 rows per user, emulating an index skip scan on above index might pay. The range condition on time complicates matters, and 200 rows per user is still a moderate number. So not sure. See:

Either way, if time intervals in your queries are always the same, a MATERIALIZED VIEW folding user_id per (project_id, <fixed time interval>) would go a long way. No chance there with varying time intervals, though. Maybe you could at least fold users per hour or some other minimum time unit, and that would buy enough performance to warrant the considerable overhead. Can be combined with either query style.

Nitpick:
Most probably, the predicates on "time" should really be:

AND "time" >= '2015-01-11 8:00:00'
AND "time" <  '2015-02-10 8:00:00';

Aside:
Don't use time as identifier. It's a reserved word in standard SQL and a basic type in Postgres.