PostgreSQL – Optimizing a simple query that performs a large table migration

optimization postgresql query-performance

There are two tables, trains and train_statuses.
train_statuses already has a status column whose type is the ENUM train_status.

CREATE TYPE train_status AS ENUM('queued', 'running', 'succeeded', 'failed', 'cancelled');

As part of the migration, I will be adding a new column, status, to the trains table and populating it with only the latest final status from train_statuses. The final statuses are: 'succeeded', 'failed', 'cancelled'.
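
A minimal sketch of the schema change described above. It assumes the new column reuses the existing train_status enum and is nullable with no default; the question states neither detail.

```sql
-- Assumption: the new column reuses the train_status enum defined above
-- and is nullable with no default.
-- Adding a nullable column without a default does not rewrite the table in
-- PostgreSQL, but the statement still takes a brief ACCESS EXCLUSIVE lock.
ALTER TABLE trains ADD COLUMN status train_status;
```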

Other relevant details about the tables:

Indexes:
Trains table:
    "trains_pkey" PRIMARY KEY, btree (id)
    "idx_trains_queued_at" btree (queued_at DESC)

Train statuses table:
    "train_statuses_id_updated_at_key" UNIQUE CONSTRAINT, btree (id, updated_at)
    "idx_train_statuses_id_updated_at" btree (id, updated_at)
Foreign-key constraints:
    "train_statuses_id_fkey" FOREIGN KEY (id) REFERENCES trains(id) ON DELETE CASCADE

This is the query that I am planning on using to update status in trains table.

UPDATE trains AS t SET status = (
    SELECT status FROM train_statuses ts 
    WHERE ts.id = t.id AND status in (?) 
    ORDER BY ts.updated_at DESC LIMIT 1
) WHERE ID in (
    SELECT id FROM trains LIMIT 100
)

I would call this query in a loop until the number of affected rows is zero.
Even though this is a simple migration, the table involved is very large and occupies ~150GB in production. Please review this query and suggest any optimizations, taking the size of the table into consideration.

Thanks

Best Answer

An UPDATE with a correlated subquery is typically quite slow, because the subquery is run once for each row that is updated. It's typically faster to compute the result only once. To get the latest status for each train, you can use DISTINCT ON and join to the result of that:

update trains t
  set status = ts.status
from (
  SELECT distinct on (ts.id) ts.id, ts.status
  FROM train_statuses ts
  WHERE ts.status in (?)
  ORDER BY ts.id, ts.updated_at DESC
) ts
WHERE t.id = ts.id;
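
For a table of the size described in the question, the same join can be run in batches rather than as one statement. The following is only a sketch under stated assumptions. :batch_start and :batch_end are hypothetical parameters supplied by the calling loop (written here in psql variable syntax). id is assumed to be a type that is convenient to step through in ranges, such as an integer. The list of final statuses is taken from the question.

```sql
-- Sketch only: one batch of the migration, restricted to a range of ids.
-- :batch_start and :batch_end are hypothetical values provided by the caller.
UPDATE trains AS t
SET    status = ts.status
FROM  (
    -- Latest final status per train, only for trains in the current range.
    SELECT DISTINCT ON (s.id) s.id, s.status
    FROM   train_statuses AS s
    WHERE  s.status IN ('succeeded', 'failed', 'cancelled')
    AND    s.id >= :batch_start
    AND    s.id <  :batch_end
    ORDER  BY s.id, s.updated_at DESC
) AS ts
WHERE  t.id = ts.id
AND    t.status IS DISTINCT FROM ts.status;  -- skip rows that already hold this value
```

With this shape the caller would advance the range until it passes the largest id, instead of stopping when a batch affects zero rows. A range in which no train has reached a final status legitimately updates nothing.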

Related Solutions

MySQL – Help optimizing MySQL slow query

I would like to get rid of "Using temporary; Using filesort"

One of the problems I see is that you're using different GROUP BY and ORDER BY clauses. From the manual on how MySQL uses temporary tables:

If there is an ORDER BY clause and a different GROUP BY clause, or if the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue, a temporary table is created.

As soon as you create a temporary table, it will need to be sorted according to your ORDER BY clause, indicated by 'using filesort'.

This execution plan at least uses the indexes to appropriately limit the number of rows found.

I would also look through the docs on ORDER BY optimization.

MySQL – Optimizing ORDER BY for simple MySQL query

The EXPLAIN SELECT you posted definitely seems counter-intuitive.

If your query included WHERE s.id = ... then the query plan you're seeing might make a little more sense, but I'm assuming it doesn't.

It looks like the optimizer is getting distracted by the facts that supplier is a smaller table and that the supplier_id index in the po table can be used as a covering index... and with those facts in hand, it's overlooking the seemingly-obvious fact that the tables should be read in the opposite order than the one it chooses.

Here are two alternatives.

-- use the STRAIGHT_JOIN directive to insist that the optimizer process the tables in only the listed order:

SELECT STRAIGHT_JOIN * FROM `po` 
INNER JOIN po_suppliers s ON po.supplier_id = s.id
ORDER BY po.id ASC
LIMIT 10;

-- use the FORCE KEY index hint to direct the optimizer to prefer the primary key of the po table:

SELECT * FROM `po` FORCE KEY (PRIMARY) 
INNER JOIN po_suppliers s ON po.supplier_id = s.id
ORDER BY po.id ASC
LIMIT 10;

The first option is probably the better option, since FORCE KEY, in spite of the name, is still only a "hint" that the optimizer can choose to ignore, while STRAIGHT_JOIN genuinely does force the hand of the optimizer to join the tables in the order they're listed.
