Originally posted:
https://stackoverflow.com/questions/11173717/expensive-query-on-select-distinct-with-multiple-inner-join-in-postgres
The songs table has only about 4k rows, posts and stations have less. Running the query without the DISTINCT ON
fixes it.
Running Postgres on Mac OS X Lion.
Song Load (7358.2ms)
EXPLAIN (426.2ms)
EXPLAIN for:
SELECT DISTINCT ON (songs.rank, songs.shared_id) songs.*,
posts.url as post_url,
posts.excerpt as post_excerpt,
stations.title as station_title,
stations.slug as station_slug
FROM "songs"
INNER JOIN "posts" ON "posts"."id" = "songs"."post_id"
INNER JOIN stations ON stations.blog_id = songs.blog_id
WHERE "songs"."processed" = 't'
AND "songs"."working" = 't'
ORDER BY songs.rank desc
LIMIT 24 OFFSET 0
QUERY PLAN
------------------------------------------------------------------------------------------------
Limit (cost=546147.28..546159.16 rows=24 width=2525)
-> Unique (cost=546147.28..547360.75 rows=2452 width=2525)
-> Sort (cost=546147.28..546551.77 rows=161796 width=2525)
Sort Key: songs.rank, songs.shared_id
-> Hash Join (cost=466.50..2906.84 rows=161796 width=2525)
Hash Cond: (songs.blog_id = stations.blog_id)
-> Hash Join (cost=249.41..587.52 rows=2452 width=2499)
Hash Cond: (songs.post_id = posts.id)
-> Seq Scan on songs (cost=0.00..304.39 rows=2452 width=2223)
Filter: (processed AND working)
-> Hash (cost=230.85..230.85 rows=1485 width=280)
-> Seq Scan on posts (cost=0.00..230.85 rows=1485 width=280)
-> Hash (cost=140.93..140.93 rows=6093 width=30)
-> Seq Scan on stations (cost=0.00..140.93 rows=6093 width=30)
I tried a few things… first an index on (rank, shared_id). Then I removed that and added indexes on rank and shared_id separately, as well as combinations of the three… no luck.
Are the indexes not being used for some reason? Or do I need to do anything after adding an index to ensure they work?
Best Answer
Because the WHERE condition of the query involves only equality checks:

WHERE songs.processed = 't'
  AND songs.working = 't'

and then you have:

DISTINCT ON (songs.rank, songs.shared_id)

which is similar to GROUP BY songs.rank, songs.shared_id, I would first try adding a compound index on (first the columns in WHERE, then the columns in DISTINCT ON):

(processed, working, rank, shared_id)

The ordering ORDER BY rank DESC may be better optimized if you have the index as:

(processed, working, rank DESC, shared_id)

Not really sure if this would contribute to efficiency, but you can test.
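The two candidate indexes described above can be written out as DDL. This is a sketch; the index names are illustrative, not from the original post:

```sql
-- Equality columns from the WHERE clause first, then the DISTINCT ON columns.
-- (Index names are hypothetical.)
CREATE INDEX songs_filter_distinct_idx
    ON songs (processed, working, rank, shared_id);

-- Variant whose trailing columns match ORDER BY rank DESC:
CREATE INDEX songs_filter_rank_desc_idx
    ON songs (processed, working, rank DESC, shared_id);
```

After creating an index, running ANALYZE songs; refreshes the planner's statistics so it can properly cost the new index.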
Addition by @Erwin
As per request in comment
In principle, (default) b-tree indexes can be scanned forward and backward at the same speed. But sort order can make a difference in multi-column indexes, where you combine the sort order of multiple columns. The query starts with:

SELECT DISTINCT ON (songs.rank, songs.shared_id) ...

In combination with ORDER BY songs.rank DESC, this dictates that the result effectively be ordered by rank DESC, shared_id, after the (simplified) WHERE clause WHERE processed AND working has been applied and before LIMIT can be applied.

I have my doubts whether the DISTINCT ON clause is actually useful here. But while it is there, the optimal index for the query should be (just as @ypercube suspected):

(processed, working, rank DESC, shared_id)

This looks like one of the rare cases where an explicit sort order for index columns benefits the query. There is an excellent explanation in the chapter Indexes and ORDER BY of the manual.
If the WHERE condition is stable (always WHERE processed AND working), a partial multi-column index would be smaller and faster yet:

(rank DESC, shared_id) WHERE processed AND working
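Written out as DDL, the partial index described above might look like this (the index name is hypothetical):

```sql
-- The WHERE predicate excludes rows the query can never return,
-- so the index stays small; the leading (rank DESC) matches the ORDER BY.
CREATE INDEX songs_partial_idx
    ON songs (rank DESC, shared_id)
    WHERE processed AND working;
```

For PostgreSQL to consider a partial index, the query's WHERE clause must be provably implied by the index predicate; conditions like processed = 't' reduce to the plain boolean test processed, so they match.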