Postgresql – how to use index to speed up sorting in postgres

indexperformancepostgresqlpostgresql-9.4postgresql-performancesorting

I am using postgres 9.4.

The messages has the following schema:
messages belongs to feed_id, and has posted_at, also messages can have a parent message (in case of replies).

                    Table "public.messages"
            Column            |            Type             | Modifiers
------------------------------+-----------------------------+-----------
 message_id                   | character varying(255)      | not null
 feed_id                      | integer                     |
 parent_id                    | character varying(255)      |
 posted_at                    | timestamp without time zone |
 share_count                  | integer                     |
Indexes:
    "messages_pkey" PRIMARY KEY, btree (message_id)
    "index_messages_on_feed_id_posted_at" btree (feed_id, posted_at DESC NULLS LAST)

I want to return all messages ordered by share_count, but for each parent_id, I only want to return one message. ie, if multiple messages have the same parent_id, then only the latest one (posted_at) is returned. The parent_id can be null, messages with null parent_id should all return.

The query I used is:

WITH filtered_messages AS (SELECT * 
                           FROM messages
                           WHERE feed_id IN (7) 
                           AND (posted_at >= '2015-01-01 04:00:00.000000') 
                           AND (posted_at < '2015-04-28 04:00:00.000000'))
    SELECT *
    FROM (SELECT DISTINCT ON(COALESCE(parent_id, message_id)) parent_id,
                          message_id, 
                          posted_at, 
                          share_count
          FROM filtered_messages
          ORDER BY COALESCE(parent_id, message_id), posted_at DESC NULLS LAST
         ) messages
    ORDER BY share_count DESC NULLS LAST, posted_at DESC NULLS LAST;

Here's the http://sqlfiddle.com/#!15/588e5/1/0, in the SQL Fiddle, I have defined the schema, the exact query, and the expected result.

But the performance of the query is slow once the messages table gets big. I tried add multiple sorting indexes, but it does not seem to use the index.
Here's the explain: http://explain.depesz.com/s/Sv2

How can I create a correct index?

Best Answer

Query

This query should be substantially faster in any case:

SELECT parent_id, message_id, posted_at, share_count
FROM   messages
WHERE  feed_id = 7
AND    posted_at >= '2015-01-01 4:0:0'
AND    posted_at <  '2015-04-28 4:0:0'
AND    parent_id IS NULL  -- match index condition
UNION ALL
(
SELECT DISTINCT ON(parent_id)
       parent_id, message_id, posted_at, share_count
FROM   messages
WHERE  feed_id = 7
AND    posted_at >= '2015-01-01 4:0:0'
AND    posted_at <  '2015-04-28 4:0:0'
AND    parent_id IS NOT NULL  -- match index condition
ORDER  BY parent_id, posted_at DESC NULLS LAST
)
ORDER  BY share_count DESC NULLS LAST, posted_at DESC NULLS LAST;
  • The CTE does nothing here that a plain subquery could not deliver also. And a CTE introduces an optimization barrier since it is executed separately and its result is materialized.

  • You have one more subquery-level than you actually need.

  • The expression (COALESCE(parent_id, message_id) is not compatible with a plain index, you would need an index on that expression. But that may not be very useful either, depending on data distribution. Follow my links below for detailed information.

  • Splitting the simple case of parent_id IS NULL into a separate SELECT may or may not deliver the optimum. Especially not, if that's a rare case anyway, in which case a combined query with an index on (COALESCE(parent_id, message_id) may perform better. Other considerations apply ...

Indices

Especially when supported with these indices:

CREATE INDEX messages_idx_null ON messages (
  feed_id
, posted_at DESC NULLS LAST
, share_count DESC NULLS LAST
, parent_id, message_id
)
WHERE parent_id IS NULL;

CREATE INDEX messages_idx_notnull ON messages (
  feed_id
, posted_at DESC NULLS LAST
, share_count DESC NULLS LAST
, parent_id, message_id
)
WHERE parent_id IS NOT NULL;

The two partial indices cover the whole table together and are about the same size together as a single total index.

The last two columns parent_id, message_id only make sense if you get index-only scans out of it. Else remove them from both indices.

SQL Fiddle.

Depending on missing details, DISTINCT ON may or may not be the best query technique for the purpose. Reade detailed explanation here:

And possibly faster alternatives here: