PostgreSQL – Improve sort performance in PostgreSQL

Tags: index, performance, postgresql

I've got a simple blog database in postgres-8.4 which has two tables, articles and comments. I have a query (generated by Django) that wants to get the latest article of type 'NEWS' and also find the number of comments for that article. It does that with the following query:

SELECT "articles"."id", "articles"."datestamp", "articles"."title", "articles"."shorttitle", "articles"."description", "articles"."markdown", "articles"."body", "articles"."idxfti", "articles"."published", "articles"."type", COUNT("comments"."id") AS "comment__count"
FROM "articles"
LEFT OUTER JOIN "comments" ON ("articles"."id" = "comments"."article_id")
WHERE ("articles"."type"='NEWS')
GROUP BY "articles"."id", "articles"."datestamp", "articles"."title", "articles"."shorttitle", "articles"."description", "articles"."markdown", "articles"."body", "articles"."idxfti", "articles"."published", "articles"."type"
ORDER BY "articles"."datestamp" DESC
LIMIT 1;

None of these tables is particularly large, and yet that query takes 46ms. The execution plan is:

Limit  (cost=119.54..119.58 rows=1 width=1150) (actual time=46.479..46.481 rows=1 loops=1)
   ->  GroupAggregate  (cost=119.54..138.88 rows=455 width=1150) (actual time=46.475..46.475 rows=1 loops=1)
         ->  Sort  (cost=119.54..120.68 rows=455 width=1150) (actual time=46.426..46.428 rows=2 loops=1)
               Sort Key: articles.datestamp, articles.id, articles.title, articles.shorttitle, articles.description, articles.markdown, articles.body, articles.idxfti, articles.published, articles.type
               Sort Method:  quicksort  Memory: 876kB
               ->  Hash Left Join  (cost=11.34..99.45 rows=455 width=1150) (actual time=0.513..2.527 rows=566 loops=1)
                     Hash Cond: (articles.id = comments.article_id)
                     ->  Seq Scan on articles  (cost=0.00..78.84 rows=455 width=1146) (actual time=0.017..0.881 rows=455 loops=1)
                           Filter: ((type)::text = 'NEWS'::text)
                     ->  Hash  (cost=8.93..8.93 rows=193 width=8) (actual time=0.486..0.486 rows=193 loops=1)
                           ->  Seq Scan on comments  (cost=0.00..8.93 rows=193 width=8) (actual time=0.004..0.252 rows=193 loops=1)
 Total runtime: 46.574 ms

The articles table has the following index defined (amongst others):

"idx_articles_datestamp" btree (datestamp DESC) CLUSTER

Before I clustered it, the query execution was more in line with the estimates, around 119ms.
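
For reference, I clustered the table with commands along these lines (a sketch; the index itself already existed as shown above):

-- Rewrite the table in idx_articles_datestamp order (8.4 syntax).
-- CLUSTER is a one-time physical rewrite; rows inserted afterwards
-- are not kept in index order.
CLUSTER articles USING idx_articles_datestamp;
-- Refresh planner statistics after the rewrite.
ANALYZE articles;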

To my untrained eye, it looks like the sort is what's taking most of the time here. It also seems to be sorting on all of the GROUP BY fields, and three of those fields (body, markdown and idxfti) are relatively large.
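
One way to check how much those wide columns cost is to time just that sort on its own (a diagnostic sketch, reusing the Sort Key from the plan above):

-- Isolate the wide-column sort: same scan, same sort keys,
-- but no join or aggregate, so the Sort node's cost stands out.
EXPLAIN ANALYZE
SELECT datestamp, id, title, shorttitle, description, markdown, body, idxfti, published, type
FROM articles
WHERE type = 'NEWS'
ORDER BY datestamp, id, title, shorttitle, description, markdown, body, idxfti, published, type;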

My question is this: Is this an unreasonable amount of time for this query to take, or is there something obvious I've missed that I can use to speed this query up? All the other queries requested by this blog site take around 1-5ms to execute, so this one stands out as taking a long time. I appreciate there's an OUTER JOIN and a SORT, which don't really help. However, I'm not an expert, so if anyone has any suggestions, that'd be hugely useful.

Best Answer

Why is it slow?

I would advise using the query @ypercube provided in combination with the mentioned indexes (a sketch of such a rewrite appears at the end of this answer). But why was the query you had so slow in comparison?

You did not provide your table definition, but I assume from the column names and what you wrote that you have several (large) character type (text or varchar) columns in the table articles:

title, shorttitle, description, markdown, body, idxfti

I further assume that you are running your database with a locale other than C (for instance, en_US.UTF-8). Sorting large text columns according to a locale is rather expensive. The relevant setting is the collation. Check your current setting for LC_COLLATE:

SHOW LC_COLLATE;

With Postgres 9.1 or later you can pick a collation for evaluating your expressions ad hoc. With PostgreSQL 8.4, however, this is set at cluster creation time and can't be changed later.
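
To illustrate the 9.1+ feature (this does not work on your 8.4):

-- Postgres 9.1 or later only: per-expression collation override.
-- "C" compares bytes directly, bypassing locale-aware rules, which
-- makes sorting large text columns considerably cheaper.
SELECT title
FROM   articles
ORDER  BY title COLLATE "C";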

We recently had a related question on SO where, after much deliberation and testing, we found sorting according to a locale to be the major slowdown.

I expect @ypercube's query to solve that problem radically: no GROUP BY on the long text columns eliminates the expensive sort altogether. Problem solved.
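
Since @ypercube's answer isn't reproduced on this page, a rewrite in that spirit looks something like this (a sketch assuming the schema from the question, not the original answer verbatim):

-- Sketch only: fetch the newest NEWS article first, then count its
-- comments in a correlated subquery. No wide text columns ever reach
-- a sort or a GROUP BY.
SELECT a.id, a.datestamp, a.title, a.shorttitle, a.description,
       a.markdown, a.body, a.idxfti, a.published, a.type,
       (SELECT count(*)
        FROM   comments c
        WHERE  c.article_id = a.id) AS comment__count
FROM   articles a
WHERE  a.type = 'NEWS'
ORDER  BY a.datestamp DESC
LIMIT  1;

Combined with an index on articles (type, datestamp DESC) and one on comments (article_id) (index columns assumed, names up to you), Postgres can pick the top row directly from the index and count its comments without touching the rest of the table.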