Postgresql – Optimizing simple query that performs a large table migration

optimizationpostgresqlquery-performance

There are two tables trains and train_statuses.
train_statuses already has a column train_status which is an ENUM.

CREATE TYPE train_status AS ENUM('queued', 'running', 'succeeded', 'failed', 'cancelled');

As part of the migration, I will be adding a new column status in trains table and updating it with only the final status from train_statuses. The final statuses are: 'succeeded', 'failed', 'cancelled'.

Other relevant details about the tables:

Indexes:
Trains table:
    "trains_pkey" PRIMARY KEY, btree (id)
     "idx_trains_queued_at" btree (queued_at DESC)

Train statuses table:
    "train_statuses_id_updated_at_key" UNIQUE CONSTRAINT, btree (id, updated_at)
    "idx_train_statuses_id_updated_at" btree (id, updated_at)
Foreign-key constraints:
    "train_statuses_id_fkey" FOREIGN KEY (id) REFERENCES trains(id) ON DELETE CASCADE

This is the query that I am planning on using to update status in trains table.

UPDATE trains AS t SET status = (
    SELECT status FROM train_statuses ts 
    WHERE ts.id = t.id AND status in (?) 
    ORDER BY ts.updated_at DESC LIMIT 1
) WHERE ID in (
    SELECT id FROM trains LIMIT 100
)

I would be calling this query in a for loop until I get the number of rows affected as zero.
Even though this is a simple migration, the table involved is extremely huge and occupies ~150GB on production. Please review this query and suggest any possible optimizations that could be done taking into consideration the size of the table.

Thanks

Best Answer

An UPDATE with a co-related subquery is typically quite slow, because the sub-query is run once for each row that is updated. It's typically faster to do that only once. To get the latest status for each train, you can use DISTINCT ON and join to the result of that:

update trains t
  set status = ts.status
from (  
  SELECT distinct on (ts.id) ts.id, ts.status 
  FROM train_statuses ts 
  WHERE status in (?) 
  ORDER BY ts.id, updated_at DESC 
) 
WHERE t.id = ts.id;