This is the schema of task_statuses
table
Table "public.task_statuses"
Column | Type | Collation | Nullable | Default
----------------+-----------------------------+-----------+----------+---------
id | uuid | | not null |
updated_at | timestamp without time zone | | not null |
status | task_status | | not null |
status_details | json | | |
Indexes:
"task_statuses_id_updated_at_key" UNIQUE CONSTRAINT, btree (id, updated_at)
"idx_task_statuses_id_updated_at" btree (id, updated_at)
Foreign-key constraints:
"task_statuses_id_fkey" FOREIGN KEY (id) REFERENCES tasks(id) ON DELETE CASCADE
The table however is huge and has 150GB of data in production.
I am trying to run an extremely simple query
SELECT
ts.id
FROM task_statuses ts
WHERE
ts.status IN ('succeeded', 'failed', 'cancelled')
ORDER BY ts.id, ts.updated_at desc LIMIT 1000
It keeps timing out in production. When I remove ORDER BY
the query runs successfully. Since, I have index in id and udpated_at
, I am not sure why order by
is timing out.
explain analyse
times out as well.
Here is the explain
for the above query.
Limit (cost=10651159.84..10651276.51 rows=1000 width=24)
-> Gather Merge (cost=10651159.84..10744721.60 rows=801902 width=24)
Workers Planned: 2
-> Sort (cost=10650159.81..10651162.19 rows=400951 width=24)
Sort Key: id, updated_at DESC
-> Parallel Seq Scan on task_statuses ts (cost=0.00..10628176.10 rows=400951 width=24)
Filter: (status = ANY ('{succeeded,failed,cancelled}'::task_status[]))
Query plan without order by:
Helpful links:
Suggestions or help would be much appreciated.
Best Answer
Your costs are on your
WHERE
predicate forts.status
. You can see in the explain it's doing a Seq Scan for 400,951 rows with a cost of 10,628,176.10.While having an index that is based on the
ORDER BY
fields in a query can help performance with the sorting, generally you should focus more on indexing based on your predicates (JOIN
,WHERE
, andHAVING
clauses) because it won't have to do a Sequential Scan rather it can use the index to scan or seek even.In this case if you had an index on the
status
column instead, your performance would likely be better (regardless sorting on yourORDER BY
clause).The difference in performance you're currently seeing is probably a difference in query plan between when you use and remove the
ORDER BY
clause that happens to be more efficient altogether. If you ran an explain for the query without theORDER BY
clause, I'm sure you'd see different operations occuring. But again, proper indexing on thestatus
field should give you consistency in performance, either way.