PostgreSQL – Optimising query: not sure why LOWER() index isn’t used

Tags: explain, index-tuning, postgresql, postgresql-performance

I've got comments that have 1..N comment_notifications.

Each notification basically contains a username of the user notified.

Each notification can be resolved_at when the user is done with it.
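
For context, roughly what the two tables look like (simplified; column types are assumed here, and the ::text casts in the plan below suggest the username columns are varchar):

CREATE TABLE comments (
  id                 bigserial PRIMARY KEY,   -- types assumed
  commenter_username varchar NOT NULL,
  body               text,
  inserted_at        timestamp NOT NULL,
  updated_at         timestamp NOT NULL
  -- plus github_id, commit_sha, commented_at, path, "position", payload
);

CREATE TABLE comment_notifications (
  id          bigserial PRIMARY KEY,
  comment_id  bigint NOT NULL REFERENCES comments (id),
  username    varchar NOT NULL,
  resolved_at timestamp,                      -- NULL until the user is done with it
  inserted_at timestamp NOT NULL,
  updated_at  timestamp NOT NULL
);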

This query is slower than I'd like: I'm seeing times around 100–150 ms in production (hobby DB plan on Heroku) and would like to get it down to 50 ms or so.

EXPLAIN ANALYZE
  SELECT c0."id", c0."resolved_at", c0."username", c0."comment_id", c0."inserted_at", c0."updated_at",
  c1."id", c1."github_id", c1."commit_sha", c1."body", c1."commented_at", c1."commenter_username", c1."path", c1."position", c1."payload", c1."inserted_at", c1."updated_at"
  FROM "comment_notifications" AS c0
  INNER JOIN "comments" AS c1 ON c1."id" = c0."comment_id"
  WHERE 
    (NOT (c0."resolved_at" IS NULL)) AND
    (LOWER(c1."commenter_username") = 'henrik')
  ORDER BY c0."resolved_at" DESC
  LIMIT 300

These are some of my indexes:

CREATE INDEX comment_notifications_resolved_at_index ON public.comment_notifications USING btree (resolved_at);

CREATE INDEX comment_notifications_lower_username_index ON public.comment_notifications USING btree (lower((username)::text));

This is the result of the EXPLAIN ANALYZE:

                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=57.86..57.86 rows=5 width=298) (actual time=1.931..1.969 rows=300 loops=1)
   ->  Sort  (cost=57.86..57.86 rows=5 width=298) (actual time=1.930..1.949 rows=300 loops=1)
         Sort Key: c0.resolved_at DESC
         Sort Method: quicksort  Memory: 238kB
         ->  Hash Join  (cost=43.64..57.85 rows=5 width=298) (actual time=1.101..1.520 rows=459 loops=1)
               Hash Cond: (c0.comment_id = c1.id)
               ->  Seq Scan on comment_notifications c0  (cost=0.00..13.75 rows=866 width=48) (actual time=0.014..0.177 rows=871 loops=1)
                     Filter: (resolved_at IS NOT NULL)
                     Rows Removed by Filter: 46
               ->  Hash  (cost=43.62..43.62 rows=6 width=250) (actual time=1.051..1.051 rows=473 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 140kB
                     ->  Seq Scan on comments c1  (cost=0.00..43.62 rows=6 width=250) (actual time=0.058..0.805 rows=473 loops=1)
                           Filter: (lower((commenter_username)::text) = 'henrik'::text)
                           Rows Removed by Filter: 387
 Planning Time: 1.221 ms
 Execution Time: 2.056 ms
(16 rows)

My reading is that the username filter is the most expensive part by far. But I have an index on LOWER(username) as seen above. Any thoughts on why it's not being used?

Any suggestions of what I could improve, or further experiments I could make?

Best Answer

You need two different indexes for this query:

As you found out yourself, you had the column comment_notifications.username indexed, which is irrelevant for this query. You need this one instead:

CREATE INDEX c_lower_commenter_username_idx
ON public.comments (lower(commenter_username));

More importantly, the plan also shows a sequential scan on comment_notifications, the far bigger table. An index on (comment_id) seems to be missing. Or, tailored to this query, this partial multicolumn index:

CREATE INDEX cn_comment_id_resolved_at_idx
ON public.comment_notifications(comment_id DESC, resolved_at DESC)
WHERE resolved_at IS NOT NULL;

Run ANALYZE on the tables after creating these (and possibly deleting the ones you had).
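
For example, after creating the new indexes:

DROP INDEX IF EXISTS comment_notifications_lower_username_index;  -- can't help this query
ANALYZE public.comment_notifications;
ANALYZE public.comments;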

Adding WHERE resolved_at IS NOT NULL only makes sense if a sizable proportion of rows with NULL are excluded.
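
A quick way to check that proportion (a sketch using an aggregate FILTER clause):

SELECT count(*)                                     AS total_rows
     , count(*) FILTER (WHERE resolved_at IS NULL)  AS unresolved_rows
FROM   comment_notifications;

In the posted plan, only 46 of roughly 917 notification rows are removed by the resolved_at IS NOT NULL filter, so the partial clause saves little for this particular dataset.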

If you can change the ORDER BY, the second index can be used even more efficiently:

SELECT cn.id, cn.resolved_at, cn.username, cn.comment_id, cn.inserted_at, cn.updated_at
     , c.id, c.github_id, c.commit_sha, c.body, c.commented_at, c.commenter_username, c.path, c.position, c.payload, c.inserted_at, c.updated_at
FROM   comments              c
JOIN   comment_notifications cn ON cn.comment_id = c.id
WHERE  lower(c.commenter_username) = 'henrik'
AND    cn.resolved_at IS NOT NULL
ORDER  BY cn.comment_id DESC, cn.resolved_at DESC
LIMIT  300  -- or 150 or whatever

If you only added AND cn.resolved_at IS NOT NULL to keep the (few) rows with a NULL resolved_at from being listed first (a descending sort puts NULLs first by default), there is a more elegant solution with NULLS LAST.

Query:

SELECT ...
...
WHERE  lower(c.commenter_username) = 'henrik'
ORDER  BY cn.comment_id DESC, cn.resolved_at DESC NULLS LAST
LIMIT  300  -- or 150 or whatever

Matching index:

CREATE INDEX cn_comment_id_resolved_at_idx
ON public.comment_notifications(comment_id DESC, resolved_at DESC NULLS LAST);


Aside 1: Why the cast to text? What's the actual data type of username and commenter_username? (It should be text to begin with.) And why the need for lower()? User names should not be stored with varying upper/lower-case spellings in the first place.
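
If case-insensitive matching really is required, one alternative worth considering (not prescribed above, just a common approach) is the citext extension, which makes comparisons case-insensitive so lower() is no longer needed in queries or indexes:

CREATE EXTENSION IF NOT EXISTS citext;

ALTER TABLE comments
  ALTER COLUMN commenter_username TYPE citext
  USING commenter_username::citext;

-- A plain btree index now supports case-insensitive equality,
-- so the expression index on lower(commenter_username) is no longer needed:
CREATE INDEX c_commenter_username_idx ON comments (commenter_username);

-- WHERE commenter_username = 'henrik' now also matches 'Henrik'.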

Aside 2: The explain plan shows low cardinalities. Indexes pay off for big tables, not so much for small ones, where queries are fast either way. 2 ms doesn't seem too bad. The reported 100–150 ms is likely a misunderstanding: it probably includes network latency and client display time.
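
To see where the 100–150 ms actually goes, compare psql's client-side timing with the server-side Execution Time from EXPLAIN ANALYZE (a sketch; \timing is a psql meta-command):

-- in psql, ideally run from the same region as the database:
\timing on

-- the "Time:" reported here includes the network round trip
SELECT ... ;

-- the "Execution Time" here is measured on the server only
EXPLAIN ANALYZE SELECT ... ;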