PostgreSQL Query Optimization – Optimize Query with LIMIT, Predicate, and ORDER BY

indexoptimizationperformancepostgresqlpostgresql-9.3postgresql-performance

I'm using Postgres 9.3.4 and I have 4 queries that have very similar inputs but have vastly different response times:

Query #1

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600, 17804, 20717, 27598, 27599)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..585.44 rows=100 width=1041) (actual time=326092.852..507360.199 rows=100 loops=1)
   ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14871916.35 rows=2542166 width=1041) (actual time=326092.301..507359.524 rows=100 loops=1)
         Filter: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
         Rows Removed by Filter: 6913925
 Total runtime: 507361.944 ms

Query #2

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (5202, 5203, 661, 659, 662, 627)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;                                            

    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=31239.64..31239.89 rows=100 width=1041) (actual time=2.004..2.038 rows=100 loops=1)
   ->  Sort  (cost=31239.64..31261.26 rows=8648 width=1041) (actual time=2.003..2.017 rows=100 loops=1)
         Sort Key: external_created_at
         Sort Method: top-N heapsort  Memory: 80kB
         ->  Index Scan using index_posts_on_source_id on posts  (cost=0.44..30909.12 rows=8648 width=1041) (actual time=0.024..1.063 rows=944 loops=1)
               Index Cond: (source_id = ANY ('{5202,5203,661,659,662,627}'::integer[]))
               Filter: (deleted_at IS NULL)
               Rows Removed by Filter: 109
 Total runtime: 2.125 ms

Query #3

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..821.25 rows=100 width=1041) (actual time=19.224..55.599 rows=100 loops=1)
   ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14930351.58 rows=1818959 width=1041) (actual time=19.213..55.529 rows=100 loops=1)
         Filter: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
         Rows Removed by Filter: 414
 Total runtime: 55.683 ms

Query #4

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (18766, 18130, 18128, 18129, 19705, 28252, 18264, 18126, 18767, 27603, 28657, 28654, 28655, 19706, 18330)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
                                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..69055.29 rows=100 width=1041) (actual time=26.094..320.626 rows=100 loops=1)
   ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14930351.58 rows=21621 width=1041) (actual time=26.093..320.538 rows=100 loops=1)
         Filter: (source_id = ANY ('{18766,18130,18128,18129,19705,28252,18264,18126,18767,27603,28657,28654,28655,19706,18330}'::integer[]))
         Rows Removed by Filter: 6156
 Total runtime: 320.778 ms

All 4 are the same, apart from looking at posts with different source_ids.

Three of the four end up using the following index:

CREATE INDEX index_posts_on_external_created_at ON posts USING btree (external_created_at DESC)
WHERE (deleted_at IS NULL);

And the #2 uses this index:

CREATE INDEX index_posts_on_source_id ON posts USING btree (source_id);

What's interesting to me, is that of the 3 that use the index_posts_on_external_created_at index, two are quite fast, while the other (#1) is insanely slow.

Query #2 has way less posts than the other 3 do, so that might explain why it uses the index_posts_on_source_id index instead. However, if I get rid of the index_posts_on_external_created_at index, the other 3 queries are extremely slow when forced to use the index_posts_on_source_id index.

Here's my definition of the posts table:

CREATE TABLE posts (
    id integer NOT NULL,
    source_id integer,
    message text,
    image text,
    external_id text,
    created_at timestamp without time zone,
    updated_at timestamp without time zone,
    external text,
    like_count integer DEFAULT 0 NOT NULL,
    comment_count integer DEFAULT 0 NOT NULL,
    external_created_at timestamp without time zone,
    deleted_at timestamp without time zone,
    poster_name character varying(255),
    poster_image text,
    poster_url character varying(255),
    poster_id text,
    position integer,
    location character varying(255),
    description text,
    video text,
    rejected_at timestamp without time zone,
    deleted_by character varying(255),
    height integer,
    width integer
);

I've tried using CLUSTER posts USING index_posts_on_external_created_at

Which is essentially an index that orders by external_created_at and this seems to be the only effective method I've found. However, I am unable to use this on production as it causes a global lock for several hours while it runs. I'm on heroku, so I can't install pg_repack or anything like that.

Why would the #1 query be so slow, and others be really quick? What can I do to mitigate this?

EDIT: Here are my queries without the LIMIT and ORDER

Query #1

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600, 17804, 20717, 27598, 27599)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=7455044.81..7461163.56 rows=2447503 width=1089) (actual time=94903.143..95110.898 rows=238975 loops=1)
   Sort Key: external_created_at
   Sort Method: external merge  Disk: 81440kB
   ->  Bitmap Heap Scan on posts  (cost=60531.78..1339479.50 rows=2447503 width=1089) (actual time=880.150..90988.460 rows=238975 loops=1)
         Recheck Cond: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
         Rows Removed by Index Recheck: 5484857
         Filter: (deleted_at IS NULL)
         Rows Removed by Filter: 3108465
         ->  Bitmap Index Scan on index_posts_on_source_id  (cost=0.00..59919.90 rows=3267549 width=0) (actual time=877.904..877.904 rows=3347440 loops=1)
               Index Cond: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
 Total runtime: 95534.724 ms

Query #2

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (5202, 5203, 661, 659, 662, 627)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=36913.72..36935.85 rows=8852 width=1089) (actual time=212.450..212.549 rows=944 loops=1)
   Sort Key: external_created_at
   Sort Method: quicksort  Memory: 557kB
   ->  Index Scan using index_posts_on_source_id on posts  (cost=0.44..32094.90 rows=8852 width=1089) (actual time=1.732..209.590 rows=944 loops=1)
         Index Cond: (source_id = ANY ('{5202,5203,661,659,662,627}'::integer[]))
         Filter: (deleted_at IS NULL)
         Rows Removed by Filter: 109
 Total runtime: 214.507 ms

Query #3

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=5245032.87..5249894.14 rows=1944508 width=1089) (actual time=131032.952..134342.372 rows=1674072 loops=1)
   Sort Key: external_created_at
   Sort Method: external merge  Disk: 854864kB
   ->  Bitmap Heap Scan on posts  (cost=48110.86..1320005.55 rows=1944508 width=1089) (actual time=605.648..91351.334 rows=1674072 loops=1)
         Recheck Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
         Rows Removed by Index Recheck: 5304550
         Filter: (deleted_at IS NULL)
         Rows Removed by Filter: 879414
         ->  Bitmap Index Scan on index_posts_on_source_id  (cost=0.00..47624.73 rows=2596024 width=0) (actual time=602.744..602.744 rows=2553486 loops=1)
               Index Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
 Total runtime: 136176.868 ms

Query #4

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (18766, 18130, 18128, 18129, 19705, 28252, 18264, 18126, 18767, 27603, 28657, 28654, 28655, 19706, 18330)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=102648.92..102704.24 rows=22129 width=1089) (actual time=15225.250..15256.931 rows=51408 loops=1)
   Sort Key: external_created_at
   Sort Method: external merge  Disk: 35456kB
   ->  Index Scan using index_posts_on_source_id on posts  (cost=0.45..79869.91 rows=22129 width=1089) (actual time=3.975..14803.320 rows=51408 loops=1)
         Index Cond: (source_id = ANY ('{18766,18130,18128,18129,19705,28252,18264,18126,18767,27603,28657,28654,28655,19706,18330}'::integer[]))
         Filter: (deleted_at IS NULL)
         Rows Removed by Filter: 54
 Total runtime: 15397.453 ms

Postgres memory settings:

name, setting, unit
'default_statistics_target','100',''
'effective_cache_size','16384','8kB'
'maintenance_work_mem','16384','kB'
'max_connections','100',''
'random_page_cost','4',NULL
'seq_page_cost','1',NULL
'shared_buffers','16384','8kB'
'work_mem','1024','kB'

Database stats:

Total Posts: 20,997,027
Posts where deleted_at is null: 15,665,487
Distinct source_id's: 22,245
Max number of rows per single source_id: 1,543,950
Min number of rows per single source_id: 1
Most source_ids in a single query: 21
Distinct external_created_at: 11,146,151

Best Answer

General advice

All the general advice for performance optimization applies. Default settings are very conservative and some of these resource settings are way to low for tables with millions of rows (work_mem in particular). You need to configure your RDBMS to make use of available RAM wisely. The Postgres Wiki is a good starting point. This is beyond the scope of a single question here.

However, the query I suggest below only needs very moderate resource settings.

Also increase the statistics target for source_id to have more detailed statistics on the crucial column:

ALTER TABLE posts ALTER COLUMN source_id SET STATISTICS 2000;  -- or similar

Then: ANALYZE posts;

More:

You could optimize storage some more (for minor gains):

Query

The query itself is hard to optimize. Refer to @ypercube's related question for advanced performance optimization:

There is a simple method if ...

  • the number of distinct source_id per query is reasonably small
  • and the LIMIT is also reasonably small.

... which is true for your case according to your added details.

The only index you need for the query below:

CREATE INDEX posts_special_idx ON posts (source_id, external_created_at DESC)
WHERE deleted_at IS NULL;

Example based on your query #1:

SELECT p.*
FROM   unnest('{19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600
              , 17804, 20717, 27598, 27599}'::int[]) s(source_id)
     , LATERAL (
   SELECT *
   FROM   posts
   WHERE  source_id = s.source_id
   AND    deleted_at IS NULL
   ORDER  BY external_created_at DESC
   LIMIT  100
   ) p
ORDER  BY p.external_created_at DESC
LIMIT  100;

This is emulating a loose index scan, similar to what's discussed here in great detail:

If n is the number of source_id's (and luckily never > 21), we make Postgres fetch the top 100 rows (according to external_created_at DESC) for each source_id from the index, which is very fast in itself, but max. (n-1) * 100 rows are surplus. Given your value frequencies:

22,245 source_id with 1 to 1,543,950 rows - and 20,997,027 rows total

(You didn't clarify whether all of these numbers include "deleted" rows, but only ~ 25% are "deleted".)

... I'd expect some of the source_id's to have less than 100 rows to begin with. So we only have to sort 2100 rows in the worst case (typically much fewer) to keep the top 100. That shouldn't perform so badly - once you have configured Postgres with decent resource settings.

If you have a source table holding all distinct source_id, it might make sense to use it and eliminate non-existent source_id early:

SELECT p.*
FROM   source s, LATERAL ( ... ) p
WHERE  s.source_id IN (19082, 19075, 20705, ...)
ORDER  BY ...

A maximum of 21 IN values is ok for this form, but consider this related question:

You could optimize some more if you know the minimum external_created_at or the maximum number of rows from a single source_id in the result ...