PostgreSQL incorrectly using primary key index for min/max query

execution-plan, index, postgresql, postgresql-performance, query-performance

I have a table with a structure similar to this:

CREATE TABLE employees (
    id bigserial NOT NULL,
    name_id uuid NOT NULL,
    department uuid NOT NULL,
    details text NULL,
    deleted bool NOT NULL DEFAULT false,
    CONSTRAINT employees_pk PRIMARY KEY (id)
);

CREATE INDEX employees_department_and_id_index ON employees USING btree (department, id);

I need to find the highest id for a given department; the query is straightforward:

select max(id) from employees
where department = 'some-uuid';

When I query for a department with a relatively small number of employees, the query is executed as expected, with an index-only scan using employees_department_and_id_index:

explain analyze select max(id) from employees
where department = '7291e1de-7870-4d68-889e-693e5731fcfb';

Result  (cost=4.58..4.59 rows=1 width=8) (actual time=0.722..0.722 rows=1 loops=1)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.56..4.58 rows=1 width=8) (actual time=0.719..0.719 rows=0 loops=1)
          ->  Index Only Scan Backward using employees_department_and_id_index on employees  (cost=0.56..26738.12 rows=6661 width=8) (actual time=0.719..0.719 rows=0 loops=1)
                Index Cond: ((department = '7291e1de-7870-4d68-889e-693e5731fcfb'::uuid) AND (id IS NOT NULL))
                Heap Fetches: 0
Planning Time: 0.111 ms
Execution Time: 0.740 ms

However, when the condition matches a heavily-populated department, the execution plan unexpectedly switches to employees_pk:

explain analyze select max(id) from employees
where department = 'deadbeef-deaf-feed-dead-beefdeadbeef';

Result  (cost=2.92..2.93 rows=1 width=8) (actual time=190780.059..190780.060 rows=1 loops=1)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.56..2.92 rows=1 width=8) (actual time=190780.053..190780.055 rows=1 loops=1)
          ->  Index Scan Backward using employees_pk on employees  (cost=0.56..2257557.69 rows=959468 width=8) (actual time=190780.052..190780.052 rows=1 loops=1)
                Index Cond: (id IS NOT NULL)
                Filter: (department = 'deadbeef-deaf-feed-dead-beefdeadbeef'::uuid)
                Rows Removed by Filter: 50000000
Planning Time: 0.102 ms
Execution Time: 190780.082 ms

Notice how long this query took to execute. Now, to force the use of the other index, I dropped the primary key and ran the query again:

ALTER TABLE employees DROP CONSTRAINT employees_pk;
explain analyze select max(id) from employees
where department = 'deadbeef-deaf-feed-dead-beefdeadbeef';

Result  (cost=3.07..3.08 rows=1 width=8) (actual time=1.029..1.030 rows=1 loops=1)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.56..3.07 rows=1 width=8) (actual time=1.026..1.027 rows=1 loops=1)
          ->  Index Only Scan Backward using employees_department_and_id_index on employees  (cost=0.56..2407872.31 rows=959468 width=8) (actual time=1.025..1.025 rows=1 loops=1)
                Index Cond: ((department = 'deadbeef-deaf-feed-dead-beefdeadbeef'::uuid) AND (id IS NOT NULL))
                Heap Fetches: 1
Planning Time: 0.094 ms
Execution Time: 1.047 ms

This time, the execution is a few orders of magnitude faster, which clearly shows that the planner previously chose the wrong index (the primary key).

What can be done to enforce the use of the correct index when both of them are present? Running ANALYZE doesn't help here, and replacing max with order by id desc limit 1 doesn't change the plan either (see the rewrite below).
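
For reference, the limit-based rewrite I tried looks roughly like this; it still ends up with the same backward scan on employees_pk:

select id from employees
where department = 'deadbeef-deaf-feed-dead-beefdeadbeef'
order by id desc
limit 1;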

This can be reproduced even on a clean database with data like this – we create some small departments, followed by a huge one, and then more, smaller departments:

create extension if not exists "uuid-ossp";

insert into employees (name_id, department)
select uuid_generate_v4(), dep.d
from 
    (select uuid_generate_v4() as d from generate_series(1, 1000)) as dep,
    (select generate_series(1, 5000)) as a;

insert into employees (name_id, department)
select uuid_generate_v4(), 'deadbeef-deaf-feed-dead-beefdeadbeef'
from generate_series(1, 1000000);

insert into employees (name_id, department)
select uuid_generate_v4(), dep.d
from 
    (select uuid_generate_v4() as d from generate_series(1, 100)) as dep,
    (select generate_series(1, 500000)) as a;

analyze employees;

I tested this on PostgreSQL 11.6, 11.8 and 12.3, on an AWS RDS db.m5.large instance with 100 GB SSD storage and the default parameter group, all giving similar results.
Thank you in advance for any hints on how to modify the query, indexes, or configuration parameters.

TL;DR: PostgreSQL doesn't use the sane index for min/max of id, but prefers to seek through half of the table data using the primary key index instead, which doesn't make sense.

Best Answer

I can reproduce this if I follow your steps exactly, creating the index before populating the table. But if I create the index after the table is populated, I can't reproduce it. That is because an index that is present during population (and is not populated in order, the way the primary key is) becomes somewhat bloated. The bloat isn't a lot, but it is enough to push the planner over the edge into picking the other plan. A REINDEX of that index should be enough to fix it.
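
For example, something along these lines (on PostgreSQL 12 you can add CONCURRENTLY to avoid blocking writes during the rebuild):

REINDEX INDEX employees_department_and_id_index;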

If that isn't stable enough for you, you can force the issue in a pretty grotty way by creating an index ON employees (department, (id+0)) and writing the query with max(id+0). PostgreSQL does not recognize +0 as an identity operation, so it doesn't think it can satisfy the query with an index containing only plain "id", but it can with the index on (id+0).
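
A sketch of that workaround (the index name here is just an example):

-- expression index on (id+0) so the planner can only match it via max(id+0)
CREATE INDEX employees_department_and_id_plus_zero_index
    ON employees (department, (id+0));

select max(id+0) from employees
where department = 'deadbeef-deaf-feed-dead-beefdeadbeef';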

The root problem is that PostgreSQL doesn't understand the strong pattern of the order of rows in the table. Since it does know that about 1/56 of the table has department = 'deadbeef-deaf-feed-dead-beefdeadbeef', it thinks it will find the first example after looking at only 56 rows, and can then stop. It also thinks all 56 of those rows will be in the same table page (because it does understand the relation between "id" and row order), so it thinks that no extra IO will be needed to look at them. However, knowing the root problem does not currently give you a way to fix it, so you are left with one work-around or another.
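
If you want to see where an estimate like that 1/56 comes from, the planner's view of the department distribution is visible in the standard pg_stats view, roughly like this:

select most_common_vals, most_common_freqs
from pg_stats
where tablename = 'employees' and attname = 'department';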

Another way to gently push it in the right direction is to VACUUM your table. Setting pages to all-visible will help the estimate of the index-only scan (the actually fast one) but not help the estimate of the plain index scan (the actually slow one).
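
For example, a plain manual vacuum is enough to set pages all-visible:

VACUUM employees;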