I have a simple select distinct on some time series data:
SELECT DISTINCT user_id
FROM events
WHERE project_id = 6
AND time > '2015-01-11 8:00:00'
AND time < '2015-02-10 8:00:00';
And it takes 112 seconds. Here's the query plan:
http://explain.depesz.com/s/NTyA
My application has to perform a lot of distinct operations and counts like this. Is there a faster way to get this kind of data?
Best Answer
You probably don't want to hear this, but the best option to speed up SELECT DISTINCT is to avoid DISTINCT to begin with. In many cases (not all!) it can be avoided with better database design or better queries.

Sometimes, GROUP BY is faster, because it takes a different code path.
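For this query, the GROUP BY variant returns the same set of users:

SELECT user_id
FROM events
WHERE project_id = 6
AND time > '2015-01-11 8:00:00'
AND time < '2015-02-10 8:00:00'
GROUP BY user_id;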
In your particular case, it doesn't seem like you can get rid of DISTINCT (well, see below). But you can support the query with a special index if you have many queries of that kind; a multicolumn index along these lines (the name is arbitrary) should fit:
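CREATE INDEX events_project_time_user_idx  -- name is illustrative
ON events (project_id, "time", user_id);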
Adding user_id is only useful if you get index-only scans out of this; see the Postgres manual on index-only scans for details. That would remove the expensive Bitmap Heap Scan from your query plan, which consumes 90% of the query time.

Your EXPLAIN shows 2,491 distinct users out of half a million qualifying rows. This won't become super-fast, no matter what you do, but it can be substantially faster. With around 200 rows per user, emulating an index skip scan on the above index might pay. The range condition on time complicates matters, and 200 rows per user is still a moderate number, so it is not a sure win.
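Here is a minimal sketch of that emulation: a recursive CTE that repeatedly jumps to the next higher user_id inside the time range. Each step is served best by an index ordered by user_id (e.g. on (project_id, user_id, "time"), not the index above), so take this as an illustration of the pattern rather than a guaranteed win:

WITH RECURSIVE skip AS (
   (
   SELECT user_id  -- first qualifying user in the time range
   FROM events
   WHERE project_id = 6
   AND "time" > '2015-01-11 8:00:00'
   AND "time" < '2015-02-10 8:00:00'
   ORDER BY user_id
   LIMIT 1
   )
   UNION ALL
   SELECT (
      SELECT e.user_id  -- next user above the one found last
      FROM events e
      WHERE e.project_id = 6
      AND e."time" > '2015-01-11 8:00:00'
      AND e."time" < '2015-02-10 8:00:00'
      AND e.user_id > s.user_id
      ORDER BY e.user_id
      LIMIT 1
      )
   FROM skip s
   WHERE s.user_id IS NOT NULL  -- recursion stops when no higher user is found
)
SELECT user_id
FROM skip
WHERE user_id IS NOT NULL;

Each iteration does one small index probe instead of visiting all ~200 rows per user, which is where the potential savings come from.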
Either way, if time intervals in your queries are always the same, a MATERIALIZED VIEW folding user_id per (project_id, <fixed time interval>) would go a long way. No chance there with varying time intervals, though. Maybe you could at least fold users per hour or some other minimum time unit, and that would buy enough performance to warrant the considerable overhead. Can be combined with either query style.
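As a minimal sketch, assuming folding per hour is acceptable and that query boundaries align with full hours (all object names are illustrative):

CREATE MATERIALIZED VIEW events_users_hourly AS
SELECT project_id, date_trunc('hour', "time") AS hour_slot, user_id
FROM events
GROUP BY 1, 2, 3;

SELECT DISTINCT user_id
FROM events_users_hourly
WHERE project_id = 6
AND hour_slot >= '2015-01-11 8:00:00'
AND hour_slot < '2015-02-10 8:00:00';

The folded view holds at most one row per user and hour, so the DISTINCT scans far fewer rows. It needs REFRESH MATERIALIZED VIEW as new events arrive, which is part of the overhead mentioned above.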
Nitpick: Most probably, the predicates on "time" should really be half-open, along these lines:
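AND "time" >= '2015-01-11 8:00:00'  -- presumably: include the lower bound
AND "time" <  '2015-02-10 8:00:00'; -- and exclude the upper bound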
Aside: Don't use time as an identifier. It's a reserved word in standard SQL and a basic type in Postgres.