Postgresql – Postgres: Performance Issue: Query on enormous data fails to use index

indexperformancepostgresql

This is the schema of task_statuses table

                         Table "public.task_statuses"
     Column     |            Type             | Collation | Nullable | Default 
----------------+-----------------------------+-----------+----------+---------
 id             | uuid                        |           | not null | 
 updated_at     | timestamp without time zone |           | not null | 
 status         | task_status                 |           | not null | 
 status_details | json                        |           |          | 
Indexes:
    "task_statuses_id_updated_at_key" UNIQUE CONSTRAINT, btree (id, updated_at)
    "idx_task_statuses_id_updated_at" btree (id, updated_at)
Foreign-key constraints:
    "task_statuses_id_fkey" FOREIGN KEY (id) REFERENCES tasks(id) ON DELETE CASCADE

The table however is huge and has 150GB of data in production.
I am trying to run an extremely simple query

 SELECT 
            ts.id
        FROM task_statuses ts
        WHERE 
            ts.status IN ('succeeded', 'failed', 'cancelled') 
        ORDER BY ts.id, ts.updated_at desc LIMIT 1000

It keeps timing out in production. When I remove ORDER BY the query runs successfully. Since, I have index in id and udpated_at, I am not sure why order by is timing out.

explain analyse times out as well.

Here is the explain for the above query.

Limit  (cost=10651159.84..10651276.51 rows=1000 width=24)
  ->  Gather Merge  (cost=10651159.84..10744721.60 rows=801902 width=24)
        Workers Planned: 2
        ->  Sort  (cost=10650159.81..10651162.19 rows=400951 width=24)
              Sort Key: id, updated_at DESC
              ->  Parallel Seq Scan on task_statuses ts  (cost=0.00..10628176.10 rows=400951 width=24)
                    Filter: (status = ANY ('{succeeded,failed,cancelled}'::task_status[]))

Query plan without order by:

Helpful links:

Suggestions or help would be much appreciated.

Best Answer

Your costs are on your WHERE predicate for ts.status. You can see in the explain it's doing a Seq Scan for 400,951 rows with a cost of 10,628,176.10.

While having an index that is based on the ORDER BY fields in a query can help performance with the sorting, generally you should focus more on indexing based on your predicates (JOIN, WHERE, and HAVING clauses) because it won't have to do a Sequential Scan rather it can use the index to scan or seek even.

In this case if you had an index on the status column instead, your performance would likely be better (regardless sorting on your ORDER BY clause).

The difference in performance you're currently seeing is probably a difference in query plan between when you use and remove the ORDER BY clause that happens to be more efficient altogether. If you ran an explain for the query without the ORDER BY clause, I'm sure you'd see different operations occuring. But again, proper indexing on the status field should give you consistency in performance, either way.