PostgreSQL – Inefficient Plan with SELECT DISTINCT ON Subquery

postgresqlpostgresql-9.4

I've got a table progresses (contains on the order of hundreds of thousands of records currently):

    Column     |            Type             |                        Modifiers                        
---------------+-----------------------------+---------------------------------------------------------
 id            | integer                     | not null default nextval('progresses_id_seq'::regclass)
 lesson_id     | integer                     | 
 user_id       | integer                     | 
 created_at    | timestamp without time zone | 
 deleted_at    | timestamp without time zone | 
Indexes:
    "progresses_pkey" PRIMARY KEY, btree (id)
    "index_progresses_on_deleted_at" btree (deleted_at)
    "index_progresses_on_lesson_id" btree (lesson_id)
    "index_progresses_on_user_id" btree (user_id)

and a view v_latest_progresses which queries for the most recent progress by user_id and lesson_id:

SELECT DISTINCT ON (progresses.user_id, progresses.lesson_id)
  progresses.id AS progress_id,
  progresses.lesson_id,
  progresses.user_id,
  progresses.created_at,
  progresses.deleted_at
 FROM progresses
WHERE progresses.deleted_at IS NULL
ORDER BY progresses.user_id, progresses.lesson_id, progresses.created_at DESC;

A user can have many progresses for any given lesson, but we often want to query for a set of the most recently created progresses for a given set of users or lessons (or a combination of the two).

The view v_latest_progresses does this nicely and is even performant when I specify a set of user_ids:

# EXPLAIN SELECT "v_latest_progresses".* FROM "v_latest_progresses" WHERE "v_latest_progresses"."user_id" IN ([the same list of ids given by the subquery in the second example below]);
                                                                               QUERY PLAN                                                                                                                                         
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=526.68..528.66 rows=36 width=57)
   ->  Sort  (cost=526.68..527.34 rows=265 width=57)
         Sort Key: progresses.user_id, progresses.lesson_id, progresses.created_at
         ->  Index Scan using index_progresses_on_user_id on progresses  (cost=0.47..516.01 rows=265 width=57)
               Index Cond: (user_id = ANY ('{ [the above list of user ids] }'::integer[]))
               Filter: (deleted_at IS NULL)
(6 rows)

However if I try to do the same query replacing the set of user_ids with a subquery, it becomes very inefficient:

# EXPLAIN SELECT "v_latest_progresses".* FROM "v_latest_progresses" WHERE "v_latest_progresses"."user_id" IN (SELECT "users"."id" FROM "users" WHERE "users"."company_id"=44);
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Merge Semi Join  (cost=69879.08..72636.12 rows=19984 width=57)
   Merge Cond: (progresses.user_id = users.id)
   ->  Unique  (cost=69843.45..72100.80 rows=39969 width=57)
         ->  Sort  (cost=69843.45..70595.90 rows=300980 width=57)
               Sort Key: progresses.user_id, progresses.lesson_id, progresses.created_at
               ->  Seq Scan on progresses  (cost=0.00..31136.31 rows=300980 width=57)
                     Filter: (deleted_at IS NULL)
   ->  Sort  (cost=35.63..35.66 rows=10 width=4)
         Sort Key: users.id
         ->  Index Scan using index_users_on_company_id on users  (cost=0.42..35.46 rows=10 width=4)
               Index Cond: (company_id = 44)
(11 rows)

What I'm trying to figure out is why PostgreSQL wants to perform the DISTINCT query on the entire progresses table before it filters by the subquery in the second example.

Would anyone have any advice as to how to improve this query?

Best Answer

Aaron,

In my recent work, I've been looking into some similar questions with PostgreSQL. PostgreSQL is almost always pretty good at generating the right query plan, but it isn't always perfect.

Some simple suggestions would be to make sure to run an ANALYZE on your progresses table to make sure that you have updated statistics, but this isn't guaranteed to fix your problems!

For reasons that are probably too long-winded for this post, I've found some odd behaviors in the statistics gathering of ANALYZE and the query planner that may need to be resolved in the long-term. In the short-term the trick is to rewrite your query to try and hack out the query plan you want.

Without having access to your data for testing, I'll make the following two possible suggestions.

1) Use ARRAY()

PostgreSQL treats arrays and sets of records differently in its query planner. Sometimes you'll end up with an identical query plan. In this case, as in many of my cases, you don't.

In your original query you had:

EXPLAIN SELECT "v_latest_progresses".* FROM "v_latest_progresses" 
WHERE "v_latest_progresses"."user_id" 
IN (SELECT "users"."id" FROM "users" WHERE "users"."company_id"=44);

As a first pass at trying to fix it, try

EXPLAIN SELECT "v_latest_progresses".* FROM "v_latest_progresses" 
WHERE "v_latest_progresses"."user_id" =
ANY(ARRAY(SELECT "users"."id" FROM "users" WHERE "users"."company_id"=44));

Note the change of the subquery from IN to =ANY(ARRAY()).

2) Use CTEs

Another trick is to force separate optimizations, if my first suggestion doesn't work. I know many people use this trick, because queries within a CTE are optimized and materialized separate from the main query.

EXPLAIN 
WITH user_selection AS(
  SELECT "users"."id" FROM "users" WHERE "users"."company_id"=44
)
SELECT "v_latest_progresses".* FROM "v_latest_progresses" 
WHERE "v_latest_progresses"."user_id" =
ANY(ARRAY(SELECT "id" FROM user_selection));

Essentially, by creating the CTE user_selection using the WITH clause, you are asking PostgreSQL to perform a separate optimization on the subquery

SELECT "users"."id" FROM "users" WHERE "users"."company_id"=44

and then materializing those results. I then, once again use the =ANY(ARRAY()) expression to try to manually manipulate the plan.

In these cases, you probably can't trust just the results of EXPLAIN, because it already thought that it found the least costly solution. Make sure to run an EXPLAIN (ANALYZE,BUFFERS)... to find out what it really costs in terms of time and page reads.