I've got a table progresses (currently containing on the order of hundreds of thousands of records):
Column | Type | Modifiers
---------------+-----------------------------+---------------------------------------------------------
id | integer | not null default nextval('progresses_id_seq'::regclass)
lesson_id | integer |
user_id | integer |
created_at | timestamp without time zone |
deleted_at | timestamp without time zone |
Indexes:
"progresses_pkey" PRIMARY KEY, btree (id)
"index_progresses_on_deleted_at" btree (deleted_at)
"index_progresses_on_lesson_id" btree (lesson_id)
"index_progresses_on_user_id" btree (user_id)
and a view v_latest_progresses which queries for the most recent progress by user_id and lesson_id:
SELECT DISTINCT ON (progresses.user_id, progresses.lesson_id)
progresses.id AS progress_id,
progresses.lesson_id,
progresses.user_id,
progresses.created_at,
progresses.deleted_at
FROM progresses
WHERE progresses.deleted_at IS NULL
ORDER BY progresses.user_id, progresses.lesson_id, progresses.created_at DESC;
A user can have many progresses for any given lesson, but we often want to query for a set of the most recently created progresses for a given set of users or lessons (or a combination of the two).
The view v_latest_progresses does this nicely and is even performant when I specify a set of user_ids:
# EXPLAIN SELECT "v_latest_progresses".* FROM "v_latest_progresses" WHERE "v_latest_progresses"."user_id" IN ([the same list of ids given by the subquery in the second example below]);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=526.68..528.66 rows=36 width=57)
-> Sort (cost=526.68..527.34 rows=265 width=57)
Sort Key: progresses.user_id, progresses.lesson_id, progresses.created_at
-> Index Scan using index_progresses_on_user_id on progresses (cost=0.47..516.01 rows=265 width=57)
Index Cond: (user_id = ANY ('{ [the above list of user ids] }'::integer[]))
Filter: (deleted_at IS NULL)
(6 rows)
However, if I try to do the same query, replacing the set of user_ids with a subquery, it becomes very inefficient:
# EXPLAIN SELECT "v_latest_progresses".* FROM "v_latest_progresses" WHERE "v_latest_progresses"."user_id" IN (SELECT "users"."id" FROM "users" WHERE "users"."company_id"=44);
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Merge Semi Join (cost=69879.08..72636.12 rows=19984 width=57)
Merge Cond: (progresses.user_id = users.id)
-> Unique (cost=69843.45..72100.80 rows=39969 width=57)
-> Sort (cost=69843.45..70595.90 rows=300980 width=57)
Sort Key: progresses.user_id, progresses.lesson_id, progresses.created_at
-> Seq Scan on progresses (cost=0.00..31136.31 rows=300980 width=57)
Filter: (deleted_at IS NULL)
-> Sort (cost=35.63..35.66 rows=10 width=4)
Sort Key: users.id
-> Index Scan using index_users_on_company_id on users (cost=0.42..35.46 rows=10 width=4)
Index Cond: (company_id = 44)
(11 rows)
What I'm trying to figure out is why PostgreSQL wants to perform the DISTINCT query on the entire progresses table before it filters by the subquery in the second example.
Would anyone have any advice as to how to improve this query?
Best Answer
Aaron,
In my recent work, I've been looking into some similar questions with PostgreSQL. Its planner is usually quite good at generating the right query plan, but it isn't perfect.
A simple first suggestion is to run ANALYZE on your progresses table to make sure you have up-to-date statistics, but this isn't guaranteed to fix your problem! For reasons that are probably too long-winded for this post, I've found some odd behaviors in the statistics gathering of ANALYZE and the query planner that may need to be resolved in the long term. In the short term, the trick is to rewrite your query to try to coax out the query plan you want.

Without having access to your data for testing, I'll make the following two possible suggestions.
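For reference, refreshing the statistics is a one-liner. The SET STATISTICS step is an optional extra of my own (and 1000 is just an illustrative target, not a recommendation for your workload):

```sql
-- Recompute planner statistics for the table
ANALYZE progresses;

-- Optionally raise the per-column sampling target for a skewed column,
-- then re-analyze (illustrative value; tune for your data)
ALTER TABLE progresses ALTER COLUMN user_id SET STATISTICS 1000;
ANALYZE progresses;
```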
1) Use ARRAY()

PostgreSQL treats arrays and sets of records differently in its query planner. Sometimes you'll end up with an identical query plan; in this case, as in many of my cases, you don't.

In your original query you had:

SELECT "v_latest_progresses".*
FROM "v_latest_progresses"
WHERE "v_latest_progresses"."user_id" IN
      (SELECT "users"."id" FROM "users" WHERE "users"."company_id"=44);

As a first pass at trying to fix it, try:

SELECT "v_latest_progresses".*
FROM "v_latest_progresses"
WHERE "v_latest_progresses"."user_id" =
      ANY(ARRAY(SELECT "users"."id" FROM "users" WHERE "users"."company_id"=44));

Note the change of the subquery from IN to =ANY(ARRAY())
.

2) Use CTEs

Another trick, if my first suggestion doesn't work, is to force separate optimizations. Many people use this trick because queries within a CTE are optimized and materialized separately from the main query:

WITH user_selection AS (
    SELECT "users"."id" FROM "users" WHERE "users"."company_id"=44
)
SELECT "v_latest_progresses".*
FROM "v_latest_progresses"
WHERE "v_latest_progresses"."user_id" =
      ANY(ARRAY(SELECT "id" FROM user_selection));

Essentially, by creating the CTE user_selection using the WITH clause, you are asking PostgreSQL to perform a separate optimization on the subquery and then materialize those results. I then, once again, use the =ANY(ARRAY()) expression to try to manually manipulate the plan.

In these cases, you probably can't trust just the results of EXPLAIN, because it already thought it had found the least costly solution. Make sure to run EXPLAIN (ANALYZE, BUFFERS) ... to find out what the query really costs in terms of time and page reads.
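For example, taking the =ANY(ARRAY()) rewrite from suggestion 1 (the query here is just an illustration; substitute whichever variant you end up running):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT "v_latest_progresses".*
FROM "v_latest_progresses"
WHERE "v_latest_progresses"."user_id" =
      ANY(ARRAY(SELECT "users"."id" FROM "users" WHERE "users"."company_id"=44));
```

ANALYZE actually executes the query and reports real timings rather than estimates, and BUFFERS shows how many pages were served from shared buffers versus read from disk.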