Postgresql – Really slow DISTINCT ON query with multiple joins

Tags: join, postgresql

Originally posted:
https://stackoverflow.com/questions/11173717/expensive-query-on-select-distinct-with-multiple-inner-join-in-postgres

The songs table has only about 4k rows; posts and stations have fewer. Running the query without the DISTINCT ON fixes it.

Running Postgres on Mac OS X Lion.

Song Load (7358.2ms)

EXPLAIN (426.2ms)

EXPLAIN for: 
SELECT  DISTINCT ON (songs.rank, songs.shared_id) songs.*, 
        posts.url as post_url, 
        posts.excerpt as post_excerpt, 
        stations.title as station_title, 
        stations.slug as station_slug 
FROM "songs" 
    INNER JOIN "posts" ON "posts"."id" = "songs"."post_id" 
    INNER JOIN "stations" ON "stations"."blog_id" = "songs"."blog_id" 
WHERE "songs"."processed" = 't' 
  AND "songs"."working" = 't' 
ORDER BY songs.rank desc 
LIMIT 24 OFFSET 0

                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Limit  (cost=546147.28..546159.16 rows=24 width=2525)
   ->  Unique  (cost=546147.28..547360.75 rows=2452 width=2525)
         ->  Sort  (cost=546147.28..546551.77 rows=161796 width=2525)
               Sort Key: songs.rank, songs.shared_id
               ->  Hash Join  (cost=466.50..2906.84 rows=161796 width=2525)
                     Hash Cond: (songs.blog_id = stations.blog_id)
                     ->  Hash Join  (cost=249.41..587.52 rows=2452 width=2499)
                           Hash Cond: (songs.post_id = posts.id)
                           ->  Seq Scan on songs  (cost=0.00..304.39 rows=2452 width=2223)
                                 Filter: (processed AND working)
                           ->  Hash  (cost=230.85..230.85 rows=1485 width=280)
                                 ->  Seq Scan on posts  (cost=0.00..230.85 rows=1485 width=280)
                     ->  Hash  (cost=140.93..140.93 rows=6093 width=30)
                           ->  Seq Scan on stations  (cost=0.00..140.93 rows=6093 width=30)

I tried a few things… first an index on (rank, shared_id). Then I removed that and added separate indexes on rank and on shared_id, as well as combinations of the three… no luck.

Are the indexes not being used for some reason? Or do I need to do anything after adding an index to ensure they work?
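For reference, the attempts described above would look something like this (index names are illustrative, not from the original post):

```sql
-- first attempt: one composite index
CREATE INDEX songs_rank_shared_id_idx ON songs (rank, shared_id);

-- second attempt: separate single-column indexes instead
CREATE INDEX songs_rank_idx      ON songs (rank);
CREATE INDEX songs_shared_id_idx ON songs (shared_id);
```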

Best Answer

Because the WHERE condition of the query involves only equality checks:

WHERE "songs"."processed" = 't' 
  AND "songs"."working" = 't'

and then you have:

SELECT  DISTINCT ON (songs.rank, songs.shared_id) ...

which behaves similarly to GROUP BY songs.rank, songs.shared_id.

I would first try adding a compound index with the columns from the WHERE clause first, followed by the columns from the DISTINCT ON:

(processed, working, rank, shared_id)
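Such an index could be created like this (the index name is just a placeholder):

```sql
CREATE INDEX songs_where_distinct_idx
ON songs (processed, working, rank, shared_id);
```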

The ORDER BY rank DESC ordering may be served better if the index is defined as:

(processed, working, rank DESC, shared_id)

I am not sure whether this will improve efficiency, but you can test it.
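A quick way to test is to compare plans before and after creating the index with EXPLAIN ANALYZE, which actually runs the query and reports real row counts and timings (select list abbreviated here for brevity):

```sql
EXPLAIN ANALYZE
SELECT DISTINCT ON (songs.rank, songs.shared_id) songs.*
FROM   songs
JOIN   posts    ON posts.id = songs.post_id
JOIN   stations ON stations.blog_id = songs.blog_id
WHERE  songs.processed
AND    songs.working
ORDER  BY songs.rank DESC
LIMIT  24;
```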


Addition by @Erwin

As per a request in the comments:

In principle, (default) B-tree indexes can be scanned forward and backward at the same speed. But sort order can make a difference in multi-column indexes, where the sort orders of several columns are combined. The query starts with:

SELECT  DISTINCT ON (songs.rank, songs.shared_id)

In combination with ORDER BY rank DESC, this effectively dictates that the result be ordered by rank DESC, shared_id, after the (simplified) WHERE clause WHERE processed AND working has been applied and before LIMIT is applied.
I have my doubts whether the DISTINCT ON clause is actually useful here. But while it is there, the optimal index for the query should be (just as @ypercube suspected):

CREATE INDEX songs_special_idx
ON songs (processed, working, rank DESC, shared_id);

This looks like one of the rare cases where the explicit sort order of index columns benefits the query. There is an excellent explanation in the chapter Indexes and ORDER BY of the manual.
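Making the implied sort explicit in the query keeps it aligned with such an index. A sketch of the rewritten query (untested, otherwise identical to the original):

```sql
SELECT DISTINCT ON (songs.rank, songs.shared_id)
       songs.*,
       posts.url      AS post_url,
       posts.excerpt  AS post_excerpt,
       stations.title AS station_title,
       stations.slug  AS station_slug
FROM   songs
JOIN   posts    ON posts.id = songs.post_id
JOIN   stations ON stations.blog_id = songs.blog_id
WHERE  songs.processed
AND    songs.working
ORDER  BY songs.rank DESC, songs.shared_id
LIMIT  24;
```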

If the WHERE condition is stable (always WHERE processed AND working), a partial multi-column index would be smaller and faster still:

CREATE INDEX songs_special_idx
ON songs (rank DESC, shared_id)
WHERE processed AND working;
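As to the asker's side question whether anything needs to be done after adding an index: CREATE INDEX by itself makes the index usable immediately, but refreshing the planner's statistics can help it decide to use the index (autovacuum also does this periodically):

```sql
ANALYZE songs;   -- refresh planner statistics for the songs table
```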