PostgreSQL – Regularly updating a table with a huge number of rows

amazon-rds performance postgresql update

I have a table containing (in addition to some metadata) a score and a boolean flag:

CREATE TABLE scores
(
  score_id        SERIAL NOT NULL
    CONSTRAINT score_pkey
    PRIMARY KEY,
  user_id         INTEGER NOT NULL,
  score           DOUBLE PRECISION NOT NULL,
  flag            BOOLEAN NOT NULL 
);

There are multiple indexes for querying the data:

CREATE UNIQUE INDEX score_pkey ON scores (score_id);
CREATE INDEX ix_scores_user_id ON scores (user_id);
CREATE INDEX ix_scores_score_desc ON scores (score desc);
CREATE INDEX ix_scores_flag ON scores (score_id) WHERE (flag = true);

The table currently contains around 120 million rows, and for most of them the score value is updated once a day. The flag column may be toggled (only from the default value false to true, never back to false) at any point of the day, independently of the score updates.

Scores are generally only updated for a subset of users (currently around one third to one half) and only where the flag is still false.

The scores are calculated on a worker machine and updated in batches (ca. 3,000 rows at a time), each batch applied in a single UPDATE query. There is no way to calculate the data on the database server. The query looks like this:

UPDATE scores AS s
SET
    score = tmp.score
FROM (
    VALUES (32373477, 0.5566874822853446), (32373478, 0.5243741268418393)
) AS tmp(id, score)
WHERE tmp.id = s.score_id;

The huge number of updates causes issues for our database: the updates require a lot of disk I/O (index updates plus full row rewrites, since PostgreSQL writes a new row version for every UPDATE), which increases the queue depth and thus slows down other queries.

My primary goals are to reduce the write I/O, speed up the update process, and reduce the load on the database.

What are my options here?

It is possible for me to move the flag column to another table and do a complete rewrite of the scores table. Might this be more performant than updating only a subset of the rows (~one third is updated each day)?

Alternatively, is writing all changed rows to a separate, index-less table and updating the primary table in-database a preferable solution?
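What I have in mind is roughly the following sketch (the staging table name and the UNLOGGED choice are my own assumptions, not something we run today):

-- Staging table holding only the changed rows. UNLOGGED skips WAL writes,
-- which is acceptable here because the worker can recalculate the data.
CREATE UNLOGGED TABLE score_updates
(
  score_id INTEGER NOT NULL,
  score    DOUBLE PRECISION NOT NULL
);

-- Bulk-load the recalculated scores from the worker, e.g. via COPY:
COPY score_updates (score_id, score) FROM STDIN;

-- Apply everything in one statement on the server:
UPDATE scores AS s
SET score = u.score
FROM score_updates AS u
WHERE u.score_id = s.score_id;

TRUNCATE score_updates;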

Does anybody have experience with, and solutions for, a similar kind of problem that they are able to share?

The database server is an m4.large (2 cores, 8 GiB memory, 250 GiB storage, 750 IOPS) PostgreSQL 9.6.6 instance.

Best Answer

It is possible for me to move the flag column to another table and do a complete rewrite of the scores table. Might this be more performant than updating only a subset of the rows (~one third is updated each day)?

This is probably the way to go, if you can get away with batching the updates up to the daily level (that is, your "business logic" does not depend on seeing the updates closer to real time in order to get correct-enough answers). Create the new table, then build the indexes, then in one transaction drop the old one and rename the new one into place. Renaming into place can be a bit tricky if there are foreign keys, triggers, views, user-defined functions, etc. which reference the table. You don't even have to move the flag column to a separate table if you make carrying the flag values over to the new table part of its creation process.
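A minimal sketch of that rebuild-and-swap, assuming the worker first loads the recalculated scores into a staging table called scores_incoming (the _incoming/_new names are illustrative, and flag toggles arriving mid-build would need to be re-applied after the swap):

-- Build the replacement table outside any long transaction, carrying the
-- current flag values over and picking up new scores where they exist.
CREATE TABLE scores_new AS
SELECT s.score_id,
       s.user_id,
       COALESCE(n.score, s.score) AS score,
       s.flag
FROM scores AS s
LEFT JOIN scores_incoming AS n ON n.score_id = s.score_id;

-- Build the indexes before the swap so the new table is ready to serve
-- queries. Index names must be unique per schema, hence the _new suffixes;
-- they can be renamed after the swap.
ALTER TABLE scores_new ADD PRIMARY KEY (score_id);
CREATE INDEX ix_scores_user_id_new ON scores_new (user_id);
CREATE INDEX ix_scores_score_desc_new ON scores_new (score DESC);
CREATE INDEX ix_scores_flag_new ON scores_new (score_id) WHERE (flag = true);

-- Swap in a single transaction; the exclusive lock is held only for the renames.
BEGIN;
ALTER TABLE scores RENAME TO scores_old;
ALTER TABLE scores_new RENAME TO scores;
COMMIT;

-- Note: CREATE TABLE AS does not copy the SERIAL default, and dropping the
-- old table drops the sequence it owns; reattach a sequence if new rows are
-- still inserted through the default.
DROP TABLE scores_old;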

Alternatively, is writing all changed rows to a separate, index-less table and updating the primary table in-database a preferable solution?

This is unlikely to solve the problem unless you can drop the indexes while that update is in progress, or you have an off-peak window in which you just don't care about performance. Current versions of PostgreSQL have no provision for efficiently bulk-updating live B-tree indexes. Your I/O queue-depth problem would be concentrated into a shorter period, and would be more severe during that period.
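If you do have such a window, the rough shape would be the following (index names taken from the question, using a staging table like the score_updates one sketched there; I haven't benchmarked this against your data):

-- Off-peak only: queries lose these indexes until they are rebuilt.
DROP INDEX ix_scores_user_id;
DROP INDEX ix_scores_score_desc;
DROP INDEX ix_scores_flag;

UPDATE scores AS s
SET score = u.score
FROM score_updates AS u
WHERE u.score_id = s.score_id;

-- CONCURRENTLY avoids blocking writes during the rebuild, at the cost of a
-- slower build, and it cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY ix_scores_user_id ON scores (user_id);
CREATE INDEX CONCURRENTLY ix_scores_score_desc ON scores (score DESC);
CREATE INDEX CONCURRENTLY ix_scores_flag ON scores (score_id) WHERE (flag = true);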