Postgresql – Query planner thinks doing a sequential scan on a very large table is faster than using an index

performancepostgresqlquery-performance

I've got a query that selects the most favorited photo for each of the last 12 editions for a particular domain (this is just a segment of a larger query, but the problem seems to stem from here):

WITH
    selected_editions AS (
        SELECT
            editions.*
        FROM
            editions JOIN
            domains USING (community_id)
        WHERE
            domains.name = 'www.example.com'
            AND editions.name <= now()
        ORDER BY
            editions.name DESC
            LIMIT 12
        )
SELECT DISTINCT ON (edition_id)
    photos.*,
    edition_id,
    community_id,
    event_id
FROM
    selected_editions
    JOIN edition_event_maybes ON edition_event_maybes.edition_id = selected_editions.id
    JOIN event_photos USING (event_id)
    JOIN photos ON photos.id = event_photos.photo_id
ORDER BY
    edition_id,
    photos.favorites DESC

Right now, all photos have 0 favorites (I don't have the trigger in place yet to update this column), but the query is just as slow if I drop the sort on that column. For some reason, the query planner has determined that doing a sequential scan on photos (which has over 1.5M records) is faster than using the indexes I have in place:

 Unique  (cost=372256.50..372935.22 rows=7365 width=334) (actual time=154536.566..154608.510 rows=10 loops=1)
   CTE selected_editions
     ->  Limit  (cost=0.00..54.22 rows=12 width=36) (actual time=4.925..53.899 rows=12 loops=1)
           ->  Nested Loop  (cost=0.00..284.65 rows=63 width=36) (actual time=4.902..53.600 rows=12 loops=1)
                 Join Filter: (public.editions.community_id = domains.community_id)
                 Rows Removed by Join Filter: 746
                 ->  Index Scan using editions_name_desc_idx on editions  (cost=0.00..212.99 rows=4226 width=36) (actual time=0.081..9.468 rows=758 loops=1)
                       Index Cond: (name <= now())
                 ->  Materialize  (cost=0.00..8.27 rows=1 width=4) (actual time=0.012..0.023 rows=1 loops=758)
                       ->  Index Scan using domains_pkey on domains  (cost=0.00..8.27 rows=1 width=4) (actual time=0.108..0.144 rows=1 loops=1)
                             Index Cond: ((name)::text = 'www.example.com'::text)
   ->  Sort  (cost=372202.28..372541.64 rows=135744 width=334) (actual time=154536.538..154572.782 rows=2894 loops=1)
         Sort Key: public.editions.id, photos.favorites
         Sort Method: external merge  Disk: 936kB
         ->  Hash Join  (cost=93971.72..318867.75 rows=135744 width=334) (actual time=128046.409..154460.268 rows=2894 loops=1)
               Hash Cond: (photos.id = event_photos.photo_id)
               ->  Seq Scan on photos  (cost=0.00..83395.70 rows=1555170 width=322) (actual time=29.002..38235.655 rows=1555170 loops=1)
               ->  Hash  (cost=91611.92..91611.92 rows=135744 width=16) (actual time=73723.507..73723.507 rows=2894 loops=1)
                     Buckets: 4096  Batches: 8  Memory Usage: 19kB
                     ->  Nested Loop  (cost=91.76..91611.92 rows=135744 width=16) (actual time=50381.167..73683.577 rows=2894 loops=1)
                           ->  Hash Join  (cost=91.76..62163.87 rows=6408 width=16) (actual time=47309.596..73452.213 rows=1300 loops=1)
                                 Hash Cond: (event_communities.community_id = public.editions.community_id)
                                 Join Filter: ((public.editions.date_start <= (max(public.event_dates.date_end))) AND (public.editions.date_end >= (max(public.event_dates.date_end))))
                                 Rows Removed by Join Filter: 65504
                                 ->  Merge Join  (cost=0.00..60081.31 rows=300177 width=20) (actual time=0.166..65913.681 rows=471703 loops=1)
                                       Merge Cond: (public.event_dates.event_id = event_communities.event_id)
                                       ->  GroupAggregate  (cost=0.00..37516.38 rows=285686 width=12) (actual time=0.107..30174.408 rows=466853 loops=1)
                                             ->  Index Scan using event_dates_event_id_idx on event_dates  (cost=0.00..31789.93 rows=573917 width=12) (actual time=0.055..11274.795 rows=573918 loops=1)
                                       ->  Index Scan using event_communities_event_id_idx on event_communities  (cost=0.00..14812.83 rows=471702 width=8) (actual time=0.023..9308.039 rows=471703 loops=1)
                                 ->  Hash  (cost=91.61..91.61 rows=12 width=28) (actual time=55.305..55.305 rows=12 loops=1)
                                       Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                       ->  Nested Loop  (cost=0.00..91.61 rows=12 width=28) (actual time=5.010..55.140 rows=12 loops=1)
                                             ->  CTE Scan on selected_editions  (cost=0.00..0.24 rows=12 width=8) (actual time=4.949..54.232 rows=12 loops=1)
                                             ->  Index Scan using editions_id_key on editions  (cost=0.00..7.60 rows=1 width=24) (actual time=0.016..0.028 rows=1 loops=12)
                                                   Index Cond: (id = selected_editions.id)
                           ->  Index Only Scan using event_photos_pkey on event_photos  (cost=0.00..4.39 rows=21 width=8) (actual time=0.020..0.096 rows=2 loops=1300)
                                 Index Cond: (event_id = public.event_dates.event_id)
                                 Heap Fetches: 0
 Total runtime: 154610.261 ms
(39 rows)

Here's the same query with sequential scans disabled:

 Unique  (cost=1074619.92..1075298.64 rows=7365 width=334) (actual time=94494.393..94577.865 rows=10 loops=1)
   CTE selected_editions
     ->  Limit  (cost=0.00..54.22 rows=12 width=36) (actual time=3.573..79.361 rows=12 loops=1)
           ->  Nested Loop  (cost=0.00..284.65 rows=63 width=36) (actual time=3.548..79.050 rows=12 loops=1)
                 Join Filter: (public.editions.community_id = domains.community_id)
                 Rows Removed by Join Filter: 746
                 ->  Index Scan using editions_name_desc_idx on editions  (cost=0.00..212.99 rows=4226 width=36) (actual time=0.093..10.826 rows=758 loops=1)
                       Index Cond: (name <= now())
                 ->  Materialize  (cost=0.00..8.27 rows=1 width=4) (actual time=0.012..0.025 rows=1 loops=758)
                       ->  Index Scan using domains_pkey on domains  (cost=0.00..8.27 rows=1 width=4) (actual time=0.020..0.032 rows=1 loops=1)
                             Index Cond: ((name)::text = 'www.example.com'::text)
   ->  Sort  (cost=1074565.70..1074905.06 rows=135744 width=334) (actual time=94494.369..94537.576 rows=2894 loops=1)
         Sort Key: public.editions.id, photos.favorites
         Sort Method: external sort  Disk: 944kB
         ->  Nested Loop  (cost=91.76..1021231.16 rows=135744 width=334) (actual time=63179.391..94438.900 rows=2894 loops=1)
               ->  Nested Loop  (cost=91.76..91611.92 rows=135744 width=16) (actual time=63179.301..93417.564 rows=2894 loops=1)
                     ->  Hash Join  (cost=91.76..62163.87 rows=6408 width=16) (actual time=59254.420..91777.586 rows=1300 loops=1)
                           Hash Cond: (event_communities.community_id = public.editions.community_id)
                           Join Filter: ((public.editions.date_start <= (max(public.event_dates.date_end))) AND (public.editions.date_end >= (max(public.event_dates.date_end))))
                           Rows Removed by Join Filter: 65504
                           ->  Merge Join  (cost=0.00..60081.31 rows=300177 width=20) (actual time=0.132..80756.825 rows=471703 loops=1)
                                 Merge Cond: (public.event_dates.event_id = event_communities.event_id)
                                 ->  GroupAggregate  (cost=0.00..37516.38 rows=285686 width=12) (actual time=0.077..44386.794 rows=466853 loops=1)
                                       ->  Index Scan using event_dates_event_id_idx on event_dates  (cost=0.00..31789.93 rows=573917 width=12) (actual time=0.027..22761.032 rows=573918 loops=1)
                                 ->  Index Scan using event_communities_event_id_idx on event_communities  (cost=0.00..14812.83 rows=471702 width=8) (actual time=0.020..9425.884 rows=471703 loops=1)
                           ->  Hash  (cost=91.61..91.61 rows=12 width=28) (actual time=112.148..112.148 rows=12 loops=1)
                                 Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                 ->  Nested Loop  (cost=0.00..91.61 rows=12 width=28) (actual time=34.863..111.959 rows=12 loops=1)
                                       ->  CTE Scan on selected_editions  (cost=0.00..0.24 rows=12 width=8) (actual time=3.600..79.708 rows=12 loops=1)
                                       ->  Index Scan using editions_id_key on editions  (cost=0.00..7.60 rows=1 width=24) (actual time=2.619..2.633 rows=1 loops=12)
                                             Index Cond: (id = selected_editions.id)
                     ->  Index Only Scan using event_photos_pkey on event_photos  (cost=0.00..4.39 rows=21 width=8) (actual time=1.088..1.126 rows=2 loops=1300)
                           Index Cond: (event_id = public.event_dates.event_id)
                           Heap Fetches: 0
               ->  Index Scan using photos_pkey on photos  (cost=0.00..6.84 rows=1 width=322) (actual time=0.188..0.202 rows=1 loops=2894)
                     Index Cond: (id = event_photos.photo_id)
 Total runtime: 94579.633 ms
(37 rows)

Here are the indexes for photos/event_photos:

create index event_photos_event_id_idx on event_photos(event_id);
create index event_photos_photo_id_idx on event_photos(photo_id);

create index photos_active_idx on photos(active);
create index photos_user_id_idx on photos(user_id);
create index photos_date_added_idx on photos(date_added);
create index photos_favorites_idx on photos(favorites);
create index photos_favorites_desc_idx on photos(favorites DESC);

edition_event_maybes is actually a view, but I don't think it is related to the problem:

CREATE OR REPLACE VIEW edition_event_maybes AS
SELECT
    event_dates.event_id,
    editions.id AS edition_id
FROM
    (SELECT event_id, max(date_end) AS date_end FROM event_dates GROUP BY event_id) AS event_dates JOIN
    event_communities USING (event_id) JOIN
    editions USING (community_id)
WHERE
    editions.date_start <= event_dates.date_end AND
    editions.date_end >= event_dates.date_end;

Is there something I've missed that would cause this bottleneck? Changing the query to use a subquery instead of a CTE still causes a sequential scan on the photos table.

 Unique  (cost=388072.43..388758.52 rows=7365 width=334) (actual time=95404.014..95469.743 rows=10 loops=1)
   ->  Sort  (cost=388072.43..388415.47 rows=137218 width=334) (actual time=95403.965..95437.141 rows=2894 loops=1)
         Sort Key: public.editions.id, photos.favorites
         Sort Method: quicksort  Memory: 1597kB
         ->  Hash Join  (cost=109237.10..334146.54 rows=137218 width=334) (actual time=49556.700..95365.375 rows=2894 loops=1)
               Hash Cond: (photos.id = event_photos.photo_id)
               ->  Seq Scan on photos  (cost=0.00..83394.73 rows=1555073 width=322) (actual time=0.048..20448.423 rows=1555170 loops=1)
               ->  Hash  (cost=106850.77..106850.77 rows=137227 width=16) (actual time=49467.987..49467.987 rows=2894 loops=1)
                     Buckets: 8192  Batches: 4  Memory Usage: 35kB
                     ->  Nested Loop  (cost=19732.99..106850.77 rows=137227 width=16) (actual time=36904.902..49432.907 rows=2894 loops=1)
                           ->  Merge Join  (cost=19732.99..75716.63 rows=6424 width=20) (actual time=35270.958..49286.796 rows=1300 loops=1)
                                 Merge Cond: (public.event_dates.event_id = events.id)
                                 ->  Merge Join  (cost=19732.99..62334.85 rows=6424 width=16) (actual time=28501.498..38919.456 rows=1300 loops=1)
                                       Merge Cond: (public.event_dates.event_id = event_communities.event_id)
                                       Join Filter: ((public.editions.date_start <= (max(public.event_dates.date_end))) AND (public.editions.date_end >= (max(public.event_dates.date_end))))
                                       Rows Removed by Join Filter: 65504
                                       ->  GroupAggregate  (cost=0.00..37500.55 rows=284288 width=12) (actual time=0.269..18824.228 rows=466846 loops=1)
                                             ->  Index Scan using event_dates_event_id_idx on event_dates  (cost=0.00..31788.08 rows=573918 width=12) (actual time=0.084..7061.177 rows=573912 loops=1)
                                       ->  Materialize  (cost=19732.99..20186.60 rows=90721 width=28) (actual time=12017.842..14224.345 rows=66804 loops=1)
                                             ->  Sort  (cost=19732.99..19959.80 rows=90721 width=28) (actual time=12017.464..12774.118 rows=66804 loops=1)
                                                   Sort Key: event_communities.event_id
                                                   Sort Method: external sort  Disk: 2480kB
                                                   ->  Hash Join  (cost=145.88..10089.01 rows=90721 width=28) (actual time=2211.610..11195.388 rows=66804 loops=1)
                                                         Hash Cond: (event_communities.community_id = public.editions.community_id)
                                                         ->  Seq Scan on event_communities  (cost=0.00..7267.03 rows=471703 width=8) (actual time=0.165..5138.541 rows=471703 loops=1)
                                                         ->  Hash  (cost=145.73..145.73 rows=12 width=28) (actual time=62.562..62.562 rows=12 loops=1)
                                                               Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                                               ->  Nested Loop  (cost=0.00..145.73 rows=12 width=28) (actual time=5.694..62.355 rows=12 loops=1)
                                                                     ->  Limit  (cost=0.00..54.24 rows=12 width=36) (actual time=5.603..61.340 rows=12 loops=1)
                                                                           ->  Nested Loop  (cost=0.00..284.74 rows=63 width=36) (actual time=5.552..61.014 rows=12 loops=1)
                                                                                 Join Filter: (public.editions.community_id = domains.community_id)
                                                                                 Rows Removed by Join Filter: 746
                                                                                 ->  Index Scan using editions_name_desc_idx on editions  (cost=0.00..213.04 rows=4229 width=36) (actual time=0.157..10.830 rows=758 loops=1)
                                                                                       Index Cond: (name <= now())
                                                                                 ->  Materialize  (cost=0.00..8.27 rows=1 width=4) (actual time=0.013..0.027 rows=1 loops=758)
                                                                                       ->  Index Scan using domains_pkey on domains  (cost=0.00..8.27 rows=1 width=4) (actual time=0.058..0.096 rows=1 loops=1)
                                                                                             Index Cond: ((name)::text = 'www.snapbram.com'::text)
                                                                     ->  Index Scan using editions_id_key on editions  (cost=0.00..7.60 rows=1 width=24) (actual time=0.016..0.029 rows=1 loops=12)
                                                                           Index Cond: (id = public.editions.id)
                                 ->  Index Only Scan using events_pkey on events  (cost=0.00..12134.45 rows=466810 width=4) (actual time=0.036..5276.910 rows=466846 loops=1)
                                       Heap Fetches: 0
                           ->  Index Only Scan using event_photos_pkey on event_photos  (cost=0.00..4.64 rows=21 width=8) (actual time=0.016..0.041 rows=2 loops=1300)
                                 Index Cond: (event_id = public.event_dates.event_id)
                                 Heap Fetches: 0
 Total runtime: 95475.120 ms
(45 rows)

Version info: PostgreSQL 9.2 on a virtual machine running OpenBSD 5.3 amd64. PostgreSQL settings are the defaults.

Best Answer

From looking at your query, I think the planner is making the right choice, you have no WHERE clause or LIMIT, so the database has to return every row anyway, so it has the choice look at the entire table and look at the indexes or look at the entire table.

Have you tried restricting the rows with a WHERE clause or used a limit? I cant think of many time critical uses cases where you would want to look at 1.5 million rows.