PostgreSQL – Aggregated Column Causes Full Table Scan Despite Index

optimizationperformancepostgresqlpostgresql-9.5query-performance

I have a query where I want to fetch the first few rows from the table datasets ordered by the date_added column. The column that is sorted by is indexed, so the basic version of this table is very fast:

SELECT datasets.id FROM datasets ORDER BY date_added LIMIT 25

"Limit  (cost=0.28..6.48 rows=25 width=12) (actual time=0.040..0.092 rows=25 loops=1)"
"  ->  Index Scan using datasets_date_added_idx2 on datasets  (cost=0.28..1244.19 rows=5016 width=12) (actual time=0.037..0.086 rows=25 loops=1)"
"Planning time: 0.484 ms"
"Execution time: 0.139 ms"

But I have an issue once I make the query a bit more complicated. I want to join another table representing a many-to-many relationship and aggregate the results in an array column. To do that I need to add a GROUP BY id clause:

SELECT datasets.id FROM datasets GROUP BY datasets.id ORDER BY date_added LIMIT 25

"Limit  (cost=551.41..551.47 rows=25 width=12) (actual time=9.926..9.931 rows=25 loops=1)"
"  ->  Sort  (cost=551.41..563.95 rows=5016 width=12) (actual time=9.924..9.926 rows=25 loops=1)"
"        Sort Key: date_added"
"        Sort Method: top-N heapsort  Memory: 26kB"
"        ->  HashAggregate  (cost=359.70..409.86 rows=5016 width=12) (actual time=7.016..8.604 rows=5016 loops=1)"
"              Group Key: datasets_id"
"              ->  Seq Scan on datasets  (cost=0.00..347.16 rows=5016 width=12) (actual time=0.009..1.574 rows=5016 loops=1)"
"Planning time: 0.502 ms"
"Execution time: 10.235 ms"

Just by adding the GROUP BY clause the query now does a full scan of the datasets table instead of using the index on the date_added column like previously.

A simplified version of the actual query I want to do is the following:

SELECT 
    datasets.id,
    array_remove(array_agg(other_table.some_column), NULL) AS other_table
FROM datasets 
LEFT JOIN other_table 
    ON other_table.id = datasets.id
GROUP BY datasets.id 
ORDER BY date_added 
LIMIT 25

Why does the GROUP BY clause cause the index to be ignored and forces a full table scan? And is there a way to rewrite this query to get it to use the index on the column it is sorted by?

I'm using Postgres 9.5.4 on Windows, the table in question has currently 5000 rows, but it could have a few hundred thousand. I ran ANALYZE manually on both tables before the EXPLAIN ANALYZE.

Table definitions:

CREATE TABLE public.datasets
(
  id integer NOT NULL DEFAULT nextval('datasets_id_seq'::regclass),
  date_added timestamp with time zone,
  ...
  CONSTRAINT datasets_pkey PRIMARY KEY (id)
)

CREATE TABLE public.other_table
(
  id integer NOT NULL,
  some_column integer NOT NULL,
  CONSTRAINT other_table_pkey PRIMARY KEY (id, some_column)
)

The output of \d datasets with irrelevant columns anonymized:

                                                   Table "public.datasets"
             Column              |           Type           |                           Modifiers
---------------------------------+--------------------------+------------------------------------------------------
 id                              | integer                  | not null default nextval('datasets_id_seq'::regclass)
 key                             | text                     |
 date_added                      | timestamp with time zone |
 date_last_modified              | timestamp with time zone |
 *****                           | integer                  |
 ********                        | boolean                  | default false
 *****                           | boolean                  | default false
 ***************                 | integer                  |
 *********************           | integer                  |
 *********                       | boolean                  | default false
 ********                        | integer                  |
 ************                    | integer                  |
 ************                    | integer                  |
 ****************                | timestamp with time zone |
 ************                    | text                     | default ''::text
 *****                           | text                     |
 *******                         | integer                  |
 *********                       | integer                  |
 **********************          | text                     | default ''::text
 *******************             | text                     |
 ****************                | integer                  |
 **********************          | text                     | default ''::text
 *******************             | text                     | default ''::text
 **********                      | integer                  |
 ***********                     | text                     |
 ***********                     | text                     |
 **********************          | integer                  |
 ******************************* | text                     | default ''::text
 ************************        | text                     | default ''::text
 ***********                     | integer                  | default 0
 *************                   | text                     |
 *******************             | integer                  |
 ****************                | integer                  | default 0
 ***************                 | text                     |
 **************                  | text                     |
Indexes:
    "datasets_pkey" PRIMARY KEY, btree (id)
    "datasets_date_added_idx" btree (date_added)
    "datasets_*_idx" btree (*)
    "datasets_*_idx" btree (*)
    "datasets_*_idx" btree (*)
    "datasets_*_idx" btree (*)
    "datasets_*_idx" btree (*)
    "datasets_*_idx1" btree (*)
    "datasets_*_idx" btree (*)

Best Answer

The problem is that your 2nd query:

SELECT datasets.id 
FROM datasets 
GROUP BY datasets.id 
ORDER BY date_added 
LIMIT 25 ;

does not mean what you expect. It does give you the first 25 rows ordered by date_added only because the id is the primary key of the table, so the GROUP BY can be removed without changing the result.

It seems however that the optimizer does not always remove the redundant GROUP BY and thus it produces a different plan. I'm not sure why - the various features of the optimizer that do these simplifications are far from covering all cases.

You might get a better plan if you change the query to have matching GROUP BY and ORDER BY clauses:

SELECT d.id 
FROM datasets AS d 
GROUP BY d.date_added, d.id 
ORDER BY d.date_added, d.id 
LIMIT 25 ;

But in any case, my advice would be "don't use redundant / complicated syntax when there is a simpler one".

Now for the 3rd query, with the join, while the GROUP BY method is working, you can rewrite it by using standard SQL window functions (ROW_NUMBER()) or Postgres DISTINCT ON or by joining to a derived table (which uses your very first query!, with minor details changed):

SELECT  
    d.id,
    array_remove(array_agg(o.some_column), NULL) AS other_table
FROM 
  ( SELECT d.id, d.date_added
    FROM datasets AS d 
    ORDER BY d.date_added 
    LIMIT 25 
  ) AS d
LEFT JOIN other_table AS o
    ON o.id = d.id
GROUP BY d.date_added, d.id
ORDER BY d.date_added
LIMIT 25 ;

We could also avoid GROUP BY completely (well, it's hidden in the inline subquery):

SELECT  
    d.id,
    ( SELECT array_remove(array_agg(o.some_column), NULL)
      FROM other_table AS o
      WHERE o.id = d.id
    ) AS other_table
FROM  datasets AS d 
ORDER BY d.date_added 
LIMIT 25 ;

Both queries are written so that the plan produced will do the (fast) limit subquery first and then the join, avoiding a full table scan of either table.

If you need aggregate from more columns, a third method combines both of the above, using a correlated (LATERAL) subquery in the FROM clause:

SELECT  
    d.id,
    o.other_table
    -- more aggregates
FROM 
    ( SELECT d.id, d.date_added
      FROM datasets AS d 
      ORDER BY d.date_added 
      LIMIT 25 
    ) AS d
  LEFT JOIN LATERAL
    ( SELECT array_remove(array_agg(o.some_column), NULL) AS other_table
             -- more aggregates
      FROM other_table AS o
      WHERE o.id = d.id
    ) AS o
    ON TRUE
ORDER BY d.date_added
LIMIT 25 ;