There are two tables, trains and train_statuses. train_statuses already has a column train_status, which is an ENUM:
CREATE TYPE train_status AS ENUM('queued', 'running', 'succeeded', 'failed', 'cancelled');
As part of the migration, I will be adding a new column status to the trains table and populating it with only the final status from train_statuses. The final statuses are: 'succeeded', 'failed', 'cancelled'.
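For reference, a minimal sketch of the schema change, assuming the new column reuses the existing train_status ENUM type:

```sql
-- Adding the column with no DEFAULT keeps the ALTER a metadata-only
-- change, so the ~150GB table is not rewritten.
ALTER TABLE trains ADD COLUMN status train_status;
```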
Other relevant details about the tables:
Indexes:
Trains table:
"trains_pkey" PRIMARY KEY, btree (id)
"idx_trains_queued_at" btree (queued_at DESC)
Train statuses table:
"train_statuses_id_updated_at_key" UNIQUE CONSTRAINT, btree (id, updated_at)
"idx_train_statuses_id_updated_at" btree (id, updated_at)
Foreign-key constraints:
"train_statuses_id_fkey" FOREIGN KEY (id) REFERENCES trains(id) ON DELETE CASCADE
This is the query that I am planning to use to update status in the trains table.
UPDATE trains AS t SET status = (
    SELECT ts.status FROM train_statuses ts
    WHERE ts.id = t.id AND ts.status IN (?)
    ORDER BY ts.updated_at DESC LIMIT 1
) WHERE id IN (
    SELECT id FROM trains LIMIT 100
);
I would be calling this query in a for loop until the number of rows affected is zero.
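The batched loop could also be driven server-side in PL/pgSQL. This is only a sketch: it assumes unprocessed rows are identified by trains.status IS NULL (without such a filter, an unqualified LIMIT 100 can keep picking the same rows), and it assumes every train already has a final status in train_statuses (otherwise rows that stay NULL would be re-selected forever).

```sql
DO $$
DECLARE
    affected bigint;
BEGIN
    LOOP
        UPDATE trains AS t SET status = (
            SELECT ts.status FROM train_statuses ts
            WHERE ts.id = t.id
              AND ts.status IN ('succeeded', 'failed', 'cancelled')
            ORDER BY ts.updated_at DESC LIMIT 1
        )
        WHERE t.id IN (
            -- Assumption: NULL marks rows not yet migrated.
            SELECT id FROM trains WHERE status IS NULL LIMIT 100
        );
        GET DIAGNOSTICS affected = ROW_COUNT;
        EXIT WHEN affected = 0;
        COMMIT;  -- transaction control in DO blocks needs PostgreSQL 11+
    END LOOP;
END $$;
```

Committing after each batch keeps each transaction short, which matters on a table this size.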
Even though this is a simple migration, the table involved is extremely huge and occupies ~150GB on production. Please review this query and suggest any possible optimizations that could be done taking into consideration the size of the table.
Thanks
Best Answer
An UPDATE with a correlated subquery is typically quite slow, because the subquery is run once for each row that is updated. It's typically faster to do that work only once. To get the latest status for each train, you can use DISTINCT ON and join to the result of that:
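For example, a sketch of that approach, assuming the status column in train_statuses is named status and using the final-status list from the question:

```sql
UPDATE trains AS t
SET    status = latest.status
FROM  (
    -- DISTINCT ON (id) keeps exactly one row per train: the first one
    -- in the ORDER BY, i.e. the most recently updated final status.
    SELECT DISTINCT ON (id) id, status
    FROM   train_statuses
    WHERE  status IN ('succeeded', 'failed', 'cancelled')
    ORDER  BY id, updated_at DESC
) AS latest
WHERE  t.id = latest.id;
```

The derived table is computed once, and the existing (id, updated_at) index on train_statuses can support the DISTINCT ON ordering.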